
Mathematical Foundations

This document provides a complete mathematical treatment of all algorithms and techniques used in NeuroShard's distributed LLM training system. Every equation is explained with intuition and derivation.

1. Training Objective

1.1 Language Modeling Loss

NeuroShard trains a causal language model to predict the next token. Given a sequence of tokens $x_1, x_2, \ldots, x_T$, the objective is to maximize the average log-likelihood:

$$\mathcal{L}(\theta) = \frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$

Where:

  • $\theta$ = model parameters
  • $x_t$ = token at position $t$
  • $x_{<t}$ = all tokens before position $t$
  • $P_\theta$ = probability distribution produced by the model

1.2 Cross-Entropy Loss

The model outputs logits $z \in \mathbb{R}^V$ (where $V$ is the vocabulary size), converted to probabilities via softmax:

$$P(x_t = k \mid x_{<t}) = \frac{\exp(z_k)}{\sum_{j=1}^{V}\exp(z_j)} = \mathrm{softmax}(z)_k$$

The cross-entropy loss for a single token with true label $y$ is:

$$\mathcal{L}_{\mathrm{CE}} = -\log P(y) = -z_y + \log\sum_{j=1}^{V}\exp(z_j)$$

Gradient with respect to the logits:

$$\frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_k} = P(k) - \mathbb{1}[k=y] = \mathrm{softmax}(z)_k - \mathbb{1}[k=y]$$

This elegant result shows the gradient is simply the difference between predicted probability and the one-hot target.
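For concreteness, here is a minimal NumPy sketch of this gradient (an illustrative helper, not part of NeuroShard's codebase):

```python
import numpy as np

def cross_entropy_grad(z, y):
    """Gradient of cross-entropy w.r.t. logits: softmax(z) - one_hot(y)."""
    z = z - z.max()                  # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum()  # softmax probabilities
    p[y] -= 1.0                      # subtract the one-hot target
    return p

grad = cross_entropy_grad(np.array([2.0, 0.5, -1.0]), y=0)
# grad[0] < 0 (the true logit is pushed up); all other entries are > 0
```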


2. The DiLoCo Algorithm

DiLoCo (Distributed Low-Communication training) is a two-level optimization algorithm that reduces communication by orders of magnitude.

2.1 Algorithm Overview

Inner Loop (local, no communication):

$$\theta_{t+1}^{(i)} = \theta_t^{(i)} - \eta_{\text{inner}}\, g_t^{(i)}$$

Where $g_t^{(i)} = \nabla_\theta \mathcal{L}(\theta_t^{(i)}, B_t^{(i)})$ is the gradient on node $i$ for batch $B_t^{(i)}$.

Pseudo-Gradient Computation (after $H$ inner steps):

$$\Delta\theta^{(i)} = \theta_0^{(i)} - \theta_H^{(i)} = \sum_{t=0}^{H-1} \eta_{\text{inner}}\, g_t^{(i)}$$

Aggregation (across $N$ nodes):

$$\overline{\Delta\theta} = \mathrm{Aggregate}\big(\Delta\theta^{(1)}, \Delta\theta^{(2)}, \ldots, \Delta\theta^{(N)}\big)$$

Outer Loop (Nesterov momentum update, treating $\overline{\Delta\theta}$ as a gradient and therefore stepping against it):

$$\theta_{\text{new}} = \theta_0 - \eta_{\text{outer}} \cdot \mathrm{Nesterov}(\overline{\Delta\theta})$$

2.2 Why Pseudo-Gradients Approximate True Gradients

Over $H$ inner steps, the pseudo-gradient accumulates:

$$\Delta\theta = \eta_{\text{inner}} \sum_{t=0}^{H-1} g_t$$

By the law of large numbers, as $H \to \infty$ (treating the inner iterates as approximately stationary, so each $g_t$ is a near-unbiased estimate of the same gradient):

$$\frac{1}{H}\sum_{t=0}^{H-1} g_t \to \mathbb{E}[g] = \nabla\mathcal{L}(\theta)$$

Therefore:

$$\Delta\theta \approx H\,\eta_{\text{inner}}\, \nabla\mathcal{L}(\theta)$$

The pseudo-gradient points in the same direction as the true gradient, scaled by $H\,\eta_{\text{inner}}$.

2.3 Convergence Guarantee

Under standard assumptions ($L$-smooth loss, bounded gradient variance $\sigma^2$):

$$\mathbb{E}[\mathcal{L}(\theta_T)] - \mathcal{L}(\theta^*) \le O\!\left(\frac{1}{\sqrt{TH}}\right)$$

This matches the convergence rate of synchronous SGD while requiring $H\times$ less communication.


3. The Inner Optimizer: AdamW

The inner loop uses AdamW, which combines Adam's adaptive learning rates with decoupled weight decay.

3.1 Algorithm

Given gradient $g_t = \nabla_\theta \mathcal{L}(\theta_t)$:

Moment estimates:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t \quad \text{(first moment / mean)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \quad \text{(second moment / variance)}$$

Bias correction:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

Update with decoupled weight decay:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} + \lambda\theta_t\right)$$

3.2 Hyperparameters

| Parameter | Symbol | Default | Purpose |
| --- | --- | --- | --- |
| Learning rate | $\eta$ | $10^{-4}$ | Step size |
| First moment decay | $\beta_1$ | $0.9$ | Gradient momentum |
| Second moment decay | $\beta_2$ | $0.95$ | Variance estimation |
| Epsilon | $\epsilon$ | $10^{-8}$ | Numerical stability |
| Weight decay | $\lambda$ | $0.1$ | Decoupled regularization |

3.3 Intuition

  • First moment ($m_t$): exponential moving average of gradients → provides momentum
  • Second moment ($v_t$): exponential moving average of squared gradients → adapts the learning rate per parameter
  • Bias correction: compensates for initialization at zero (important early in training)
  • Decoupled weight decay: unlike L2 regularization, applies decay directly to the weights, not through the gradients
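The update is compact enough to state in a few lines. Below is a self-contained sketch of one AdamW step using the defaults from the table above (illustrative only; `adamw_step` is a hypothetical helper, and `t` counts steps from 1):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-4, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step; weight decay is applied to theta directly, not via g."""
    m = b1 * m + (1 - b1) * g        # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * g**2     # second moment: EMA of squared gradients
    m_hat = m / (1 - b1**t)          # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```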

4. The Outer Optimizer: Nesterov Momentum

The outer loop applies Nesterov accelerated gradient descent to pseudo-gradients.

4.1 Standard Momentum

Classical momentum (with the pseudo-gradient $\Delta\theta_t$ playing the role of the gradient):

$$v_t = \mu v_{t-1} + \Delta\theta_t$$
$$\theta_{t+1} = \theta_t - \eta\, v_t$$

4.2 Nesterov Momentum (Look-Ahead)

Nesterov momentum evaluates the update at a "look-ahead" point:

$$v_t = \mu v_{t-1} + \Delta\theta_t$$
$$\theta_{t+1} = \theta_t - \eta\,(\mu v_t + \Delta\theta_t)$$

Expanded form:

$$\theta_{t+1} = \theta_t - \eta\mu\,(\mu v_{t-1} + \Delta\theta_t) - \eta\,\Delta\theta_t$$

4.3 Why Nesterov Works Better

The key insight is that Nesterov momentum makes a correction based on where momentum will take us, not where we currently are:

Standard:   evaluate gradient at θ → combine with momentum → update
Nesterov:   step to the look-ahead point first → evaluate there → corrected update

This "look-ahead" property provides:

  • Faster convergence near minima
  • Better handling of curved loss surfaces
  • Automatic slowdown when overshooting

4.4 Implementation

```python
# Nesterov momentum update on the aggregated pseudo-gradient
v = μ * v + Δθ                   # update velocity
θ = θ - η * (μ * v + Δθ)         # apply with look-ahead
```

This is equivalent to:

$$\theta_{t+1} = \theta_t - \eta\mu^2 v_{t-1} - \eta\,(1+\mu)\,\Delta\theta_t$$

5. Byzantine-Tolerant Aggregation

When aggregating gradients from potentially malicious nodes, we need robust methods.

5.1 Problem Formulation

Given $N$ gradient contributions $\{\Delta\theta^{(1)}, \ldots, \Delta\theta^{(N)}\}$, of which up to $f$ may be Byzantine (arbitrarily corrupted), find an aggregate $\overline{\Delta\theta}$ such that training still converges.

5.2 Simple Mean (Vulnerable)

$$\overline{\Delta\theta} = \frac{1}{N}\sum_{i=1}^{N}\Delta\theta^{(i)}$$

Vulnerability: a single Byzantine node can submit $\Delta\theta^{(\text{bad})} = M$ for arbitrarily large $M$, corrupting the mean.

5.3 Coordinate-Wise Median

For each parameter $j$:

$$\overline{\Delta\theta}_j = \mathrm{median}\big(\Delta\theta_j^{(1)}, \ldots, \Delta\theta_j^{(N)}\big)$$

Robustness: tolerates up to $\lfloor(N-1)/2\rfloor$ Byzantine nodes.

Limitation: higher variance than the mean; ignores correlation between coordinates.

5.4 Trimmed Mean

Remove the top and bottom $\alpha$ fraction of values in each coordinate, then average:

$$\overline{\Delta\theta}_j = \frac{1}{N-2k}\sum_{i=k+1}^{N-k}\Delta\theta_j^{\,\mathrm{sorted}(i)}$$

Where $k = \lfloor\alpha N\rfloor$.

Default: $\alpha = 0.1$ (remove the top 10% and bottom 10%)

Robustness: tolerates up to an $\alpha$ fraction of Byzantine nodes.
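In NumPy this is a few lines; the sketch below (an illustrative `trimmed_mean` helper, not NeuroShard's actual implementation) sorts each coordinate across nodes and averages the middle slice:

```python
import numpy as np

def trimmed_mean(deltas, alpha=0.1):
    """Coordinate-wise trimmed mean over a list of pseudo-gradient vectors."""
    X = np.sort(np.stack(deltas), axis=0)  # sort each coordinate across nodes
    k = int(alpha * len(deltas))           # number trimmed from each end
    return X[k:len(deltas) - k].mean(axis=0)
```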

5.5 Krum

Select the gradient closest to the majority.

Score function (for each gradient $i$):

$$S(i) = \sum_{j\in\mathcal{N}_i}\big\|\Delta\theta^{(i)} - \Delta\theta^{(j)}\big\|^2$$

Where $\mathcal{N}_i$ is the set of the $N-f-2$ nearest neighbors of $i$.

Selection:

$$i^* = \arg\min_i S(i), \qquad \overline{\Delta\theta} = \Delta\theta^{(i^*)}$$

Robustness: provably robust when $N \ge 2f + 3$.

Theorem (Blanchard et al., 2017): if at most $f$ of $N$ gradients are Byzantine, Krum selects a gradient $\Delta\theta^{(i^*)}$ such that:

$$\mathbb{E}\,\big\|\Delta\theta^{(i^*)} - \nabla\mathcal{L}\big\|^2 \le (2f+2)\,\sigma^2$$

where $\sigma^2$ is the variance of the honest gradients.
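A direct (unoptimized) sketch of Krum follows; `krum` is an illustrative helper that scores every candidate against its $N-f-2$ nearest neighbors exactly as defined above:

```python
import numpy as np

def krum(deltas, f):
    """Return the pseudo-gradient with the lowest Krum score (requires N >= 2f + 3)."""
    X = np.stack(deltas)                                  # (N, d)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    n_neighbors = len(deltas) - f - 2
    scores = [np.sort(row)[1:n_neighbors + 1].sum()      # skip the self-distance (0)
              for row in d2]
    return X[int(np.argmin(scores))]
```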

5.6 Multi-Krum

Average the top $m$ gradients by Krum score:

$$\overline{\Delta\theta} = \frac{1}{m}\sum_{i\in\mathcal{M}}\Delta\theta^{(i)}$$

Where $\mathcal{M}$ contains the $m = N - f$ indices with the lowest Krum scores.

Benefit: Lower variance than Krum while maintaining robustness.

5.7 Geometric Median

Find the point minimizing the sum of Euclidean distances:

$$\overline{\Delta\theta} = \arg\min_x \sum_{i=1}^{N}\big\|x - \Delta\theta^{(i)}\big\|_2$$

Weiszfeld Algorithm (iterative solution):

$$x^{(t+1)} = \frac{\sum_{i=1}^{N} \Delta\theta^{(i)} \,/\, \big\|x^{(t)} - \Delta\theta^{(i)}\big\|_2}{\sum_{i=1}^{N} 1 \,/\, \big\|x^{(t)} - \Delta\theta^{(i)}\big\|_2}$$

Robustness: optimal breakdown point of $\lfloor(N-1)/2\rfloor$.
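The Weiszfeld iteration is a few lines of NumPy; this sketch (illustrative `geometric_median` helper) initializes at the mean and re-weights by inverse distance:

```python
import numpy as np

def geometric_median(deltas, iters=50, eps=1e-8):
    """Weiszfeld iteration for the geometric median of pseudo-gradients."""
    X = np.stack(deltas)
    x = X.mean(axis=0)                               # initialize at the mean
    for _ in range(iters):
        dist = np.linalg.norm(X - x, axis=1) + eps   # avoid division by zero
        w = 1.0 / dist
        x = (w[:, None] * X).sum(axis=0) / w.sum()   # inverse-distance weighting
    return x
```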

5.8 Comparison Table

| Method | Byzantine Tolerance | Variance | Complexity |
| --- | --- | --- | --- |
| Mean | $0$ | Lowest | $O(N)$ |
| Median | $\lfloor(N-1)/2\rfloor$ | High | $O(N\log N)$ |
| Trimmed Mean | $\lfloor\alpha N\rfloor$ | Low | $O(N\log N)$ |
| Krum | $\lfloor(N-3)/2\rfloor$ | Very High | $O(N^2 d)$ |
| Multi-Krum | $\lfloor(N-3)/2\rfloor$ | Medium | $O(N^2 d)$ |
| Geometric Median | $\lfloor(N-1)/2\rfloor$ | Low | $O(N \cdot \text{iter})$ |

Where d is the number of parameters.


6. Gradient Validation

Before aggregation, incoming gradients are validated.

6.1 Cosine Similarity Check

Measures alignment between a submitted gradient $g_s$ and a reference gradient $g_r$:

$$\cos(g_s, g_r) = \frac{g_s \cdot g_r}{\|g_s\|_2\, \|g_r\|_2}$$

Rejection criterion: $\cos(g_s, g_r) < \tau$ (default $\tau = 0.3$)

Intuition: Honest gradients should point in similar directions (same optimization target). Anti-correlated gradients suggest malicious intent.

6.2 Magnitude Ratio Check

$$\rho = \frac{\|g_s\|_2}{\|g_r\|_2}$$

Rejection criterion: $\rho > \rho_{\max}$ or $\rho < \rho_{\min}$ (default: a $10\times$ range)

Intuition: Gradients should have similar scale. Extreme magnitudes suggest scaling attacks.

6.3 Variance Ratio Check

$$\frac{\mathrm{Var}(g_s)}{\mathrm{Var}(g_r)} > V_{\max}$$

Intuition: Abnormally high variance suggests noise injection.
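Putting the three checks together (a sketch only: `validate_gradient` is a hypothetical helper, the $10\times$ ratio bounds are assumed symmetric, and the variance threshold `v_max=10.0` is an assumed default not specified above):

```python
import numpy as np

def validate_gradient(g_s, g_r, tau=0.3, rho_max=10.0, v_max=10.0):
    """Return True if the submitted gradient g_s passes all three checks."""
    cos = g_s @ g_r / (np.linalg.norm(g_s) * np.linalg.norm(g_r))
    if cos < tau:
        return False                 # direction check (cosine similarity)
    rho = np.linalg.norm(g_s) / np.linalg.norm(g_r)
    if rho > rho_max or rho < 1.0 / rho_max:
        return False                 # magnitude-ratio check
    if np.var(g_s) / np.var(g_r) > v_max:
        return False                 # variance-ratio check
    return True
```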


7. Gradient Compression

For bandwidth efficiency, gradients are compressed before transmission.

7.1 Top-K Sparsification

Keep only the $k$ largest-magnitude elements:

$$\mathrm{TopK}(g, k) = \big\{(i, g_i) : i \in \mathrm{argtop}_k(|g|)\big\}$$

Sparsity: $k = 0.1d$ (keep 10%)

Error bound: the approximation error equals the energy of the discarded elements:

$$\big\|g - \mathrm{TopK}(g,k)\big\|_2^2 = \sum_{i\notin \mathrm{TopK}} g_i^2$$

7.2 Quantization

Map floating-point values to $b$-bit integers:

$$q(x) = \mathrm{round}\!\left(x \cdot \frac{2^{b-1}-1}{\max|x|}\right)$$

Dequantization:

$$\hat{x} = q(x)\cdot\frac{\max|x|}{2^{b-1}-1}$$

Quantization error (per element):

$$|x - \hat{x}| \le \frac{\max|x|}{2^{b}-2}$$

For 8-bit quantization with $\max|x| = 1$:

$$|x - \hat{x}| \le \frac{1}{254} \approx 0.4\%$$
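Combining both steps, a compression round trip looks roughly like this (illustrative `compress`/`decompress` helpers, with the 10% sparsity and 8-bit symmetric quantization described above):

```python
import numpy as np

def compress(g, sparsity=0.1, bits=8):
    """Top-K sparsification followed by symmetric b-bit quantization."""
    k = max(1, int(sparsity * g.size))
    idx = np.argsort(np.abs(g))[-k:]            # indices of the k largest |g_i|
    values = g[idx]
    scale = np.abs(values).max() / (2 ** (bits - 1) - 1)
    scale = max(scale, 1e-12)                   # guard against an all-zero gradient
    q = np.round(values / scale).astype(np.int8)
    return idx, q, scale                        # transmit these three

def decompress(idx, q, scale, d):
    """Rebuild a dense vector from the sparse, quantized representation."""
    g_hat = np.zeros(d)
    g_hat[idx] = q.astype(np.float64) * scale
    return g_hat
```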

7.3 Why Compression Works

Theorem (Stich et al., 2018): SGD with compressed gradients converges at rate:

$$\mathbb{E}[\mathcal{L}(\theta_T)] - \mathcal{L}(\theta^*) \le O\!\left(\frac{1}{\sqrt{T}} + \frac{\omega}{T}\right)$$

Where $\omega$ is the compression ratio. The extra $\omega/T$ term vanishes asymptotically.

Intuition:

  1. SGD gradients are already noisy (mini-batch variance)
  2. Compression error is much smaller than mini-batch noise
  3. Averaging across nodes cancels compression errors (Central Limit Theorem)

8. Model Architecture Mathematics

8.1 RMS Normalization

Root Mean Square Layer Normalization:

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \cdot \gamma$$

Where:

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}$$

Compared to LayerNorm:

$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sigma}\cdot\gamma + \beta$$

RMSNorm omits the mean subtraction and bias, making it:

  • ~10% faster to compute
  • More stable for very deep networks
  • Empirically equivalent performance
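The whole operation reduces to two NumPy lines (a sketch; `rms_norm` is an illustrative helper operating over the last dimension):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over the last dimension: x / RMS(x) * gamma (no mean, no bias)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```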

8.2 Rotary Position Embeddings (RoPE)

RoPE encodes position through rotation in 2D subspaces.

Rotation matrix for position $m$ and frequency $\theta_i$:

$$R_{\theta_i,\, m} = \begin{pmatrix}\cos(m\theta_i) & -\sin(m\theta_i)\\ \sin(m\theta_i) & \cos(m\theta_i)\end{pmatrix}$$

Frequency schedule:

$$\theta_i = 10000^{-2i/d}$$

Application to query/key vectors (treating pairs of dimensions as independent 2D subspaces):

$$\mathrm{RoPE}(x, m) = \big(R_{\theta_0,\,m} \oplus R_{\theta_1,\,m} \oplus \cdots\big)\, x$$

Key property (relative position awareness):

$$\big\langle \mathrm{RoPE}(q,m),\; \mathrm{RoPE}(k,n)\big\rangle = \big\langle q,\; R_{\theta,\,n-m}\, k\big\rangle$$

The attention score depends only on the relative position $(n-m)$, not on absolute positions.

Complex-number formulation (equivalent, more elegant):

$$\mathrm{RoPE}(x, m) = x \odot e^{im\theta}$$

Where $x$ is viewed as a vector of complex numbers and $\odot$ is element-wise multiplication.
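In real arithmetic the rotation is applied pairwise; the sketch below (illustrative `rope` helper) rotates the even/odd dimension pairs of a single vector at position `m`:

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate each 2D subspace of x by the position-dependent angle m * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # theta_i = 10000^(-2i/d)
    angles = m * theta
    x1, x2 = x[..., 0::2], x[..., 1::2]        # the two halves of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[..., 1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out
```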

8.3 Grouped Query Attention (GQA)

Standard multi-head attention has H heads for Q, K, and V. GQA uses fewer KV heads.

Projections:

$$Q = xW_Q \in \mathbb{R}^{B\times L\times H\times d_h}$$
$$K = xW_K \in \mathbb{R}^{B\times L\times G\times d_h}$$
$$V = xW_V \in \mathbb{R}^{B\times L\times G\times d_h}$$

Where $G < H$ is the number of KV groups.

Head expansion (repeat KV heads to match query heads):

$$K' = \mathrm{repeat}(K,\, H/G), \qquad V' = \mathrm{repeat}(V,\, H/G)$$

Attention computation:

$$\mathrm{Attention}(Q, K', V') = \mathrm{softmax}\!\left(\frac{Q K'^{\top}}{\sqrt{d_h}}\right)V'$$

Memory savings: the KV cache shrinks by a factor of $H/G$ (e.g., $4\times$ for $H=8$, $G=2$).
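A single-sequence sketch of causal GQA (illustrative `gqa_attention` helper; shapes omit the batch dimension for clarity):

```python
import numpy as np

def gqa_attention(Q, K, V):
    """Q: (L, H, dh); K, V: (L, G, dh) with H divisible by G."""
    H, G, dh = Q.shape[1], K.shape[1], Q.shape[-1]
    K = np.repeat(K, H // G, axis=1)                   # expand KV heads to (L, H, dh)
    V = np.repeat(V, H // G, axis=1)
    scores = np.einsum('qhd,khd->hqk', Q, K) / np.sqrt(dh)
    L = Q.shape[0]
    scores += np.triu(np.full((L, L), -np.inf), k=1)   # causal mask on future positions
    scores -= scores.max(axis=-1, keepdims=True)       # stabilize the softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', p, V)             # back to (L, H, dh)
```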

8.4 Scaled Dot-Product Attention

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

Where:

  • $Q \in \mathbb{R}^{L_q\times d_k}$ = queries
  • $K \in \mathbb{R}^{L_k\times d_k}$ = keys
  • $V \in \mathbb{R}^{L_k\times d_v}$ = values
  • $M$ = causal mask ($-\infty$ for future positions)

Why scale by $\sqrt{d_k}$?

If the components of $q$ and $k$ have unit variance, then:

$$\mathrm{Var}(q\cdot k) = d_k$$

Scaling by $\sqrt{d_k}$ restores unit variance:

$$\mathrm{Var}\!\left(\frac{q\cdot k}{\sqrt{d_k}}\right) = 1$$

This prevents softmax saturation (extreme probabilities) which would cause vanishing gradients.

8.5 SwiGLU Activation

A gated linear unit with SiLU (Swish) activation:

$$\mathrm{SwiGLU}(x) = \mathrm{SiLU}(xW_{\text{gate}}) \odot (xW_{\text{up}})$$

Where:

$$\mathrm{SiLU}(x) = x\,\sigma(x) = \frac{x}{1+e^{-x}}$$

Full FFN block:

$$\mathrm{FFN}(x) = \big(\mathrm{SiLU}(xW_{\text{gate}}) \odot (xW_{\text{up}})\big)\, W_{\text{down}}$$

Why gating helps:

  • Allows the network to selectively pass information
  • Smoother gradients than ReLU
  • Empirically better performance for LLMs
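As a concrete sketch, the full SwiGLU FFN block is three matrix multiplies and one gate (illustrative `swiglu_ffn` helper):

```python
import numpy as np

def swiglu_ffn(x, W_gate, W_up, W_down):
    """FFN(x) = (SiLU(x W_gate) * (x W_up)) W_down."""
    g = x @ W_gate
    silu = g / (1.0 + np.exp(-g))           # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ W_up)) @ W_down     # gate the up-projection, then project down
```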

Comparison of activations:

| Activation | Formula | Gradient |
| --- | --- | --- |
| ReLU | $\max(0,x)$ | $\mathbb{1}[x>0]$ |
| GELU | $x\,\Phi(x)$ | Smooth |
| SiLU/Swish | $x\,\sigma(x)$ | $\sigma(x)\big(1 + x(1-\sigma(x))\big)$ |

9. Transformer Forward Pass

9.1 Single Layer

For input $x \in \mathbb{R}^{B\times L\times d}$:

```python
# Pre-norm attention
h = x + Attention(RMSNorm(x))

# Pre-norm FFN
out = h + FFN(RMSNorm(h))
```

Mathematically:

$$h = x + \mathrm{Attention}(\mathrm{RMSNorm}(x))$$
$$\mathrm{out} = h + \mathrm{FFN}(\mathrm{RMSNorm}(h))$$

9.2 Full Forward Pass

```python
# Embedding
h = embed(tokens)

# Transformer layers
for block in blocks:
    h = block(h)

# Output
logits = lm_head(rms_norm(h))
```

9.3 Parameter Count

For a model with:

  • $d$ = hidden dimension
  • $L$ = number of layers
  • $H$ = attention heads
  • $G$ = KV heads
  • $d_h$ = head dimension $= d/H$
  • $d_{\mathrm{ff}}$ = FFN intermediate dimension
  • $V$ = vocabulary size

Per-layer parameters:

| Component | Parameters |
| --- | --- |
| Q projection | $d \times d$ |
| K projection | $d \times (G\,d_h)$ |
| V projection | $d \times (G\,d_h)$ |
| O projection | $d \times d$ |
| Gate projection | $d \times d_{\mathrm{ff}}$ |
| Up projection | $d \times d_{\mathrm{ff}}$ |
| Down projection | $d_{\mathrm{ff}} \times d$ |
| RMSNorm (×2) | $2d$ |

Total:

$$P = Vd + L\big(2d^2 + 2dG d_h + 3d\,d_{\mathrm{ff}} + 2d\big) + d + Vd$$

Simplified (assuming $G = H/4$, $d_{\mathrm{ff}} = 4d$, tied embeddings):

$$P \approx Vd + L\big(2.5d^2 + 12d^2\big) \approx Vd + 14.5\,L d^2$$
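The total is easy to compute directly; this sketch (illustrative `param_count` helper) mirrors the table and formula above:

```python
def param_count(d, L, H, G, d_ff, V, tied_embeddings=True):
    """Total parameter count; head dimension d_h = d / H."""
    d_h = d // H
    per_layer = (2 * d * d             # Q and O projections
                 + 2 * d * G * d_h     # K and V projections
                 + 3 * d * d_ff        # gate, up, and down projections
                 + 2 * d)              # two RMSNorms
    total = V * d + L * per_layer + d  # embeddings + layers + final norm
    if not tied_embeddings:
        total += V * d                 # separate LM head
    return total
```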

10. Backpropagation Through Transformers

10.1 Gradient Flow

The gradient of the loss with respect to the input of layer $l$:

$$\frac{\partial\mathcal{L}}{\partial h_l} = \frac{\partial\mathcal{L}}{\partial h_{l+1}}\left(I + \frac{\partial\,\mathrm{Block}_l}{\partial h_l}\right)$$

The residual connection (the identity term $I$) ensures gradients flow directly to earlier layers, preventing vanishing gradients.

10.2 Gradient Clipping

Before applying gradients, clip the global norm:

$$g' = \begin{cases} g & \text{if } \|g\|_2 \le c \\[4pt] c\, \dfrac{g}{\|g\|_2} & \text{otherwise} \end{cases}$$

Where c is the maximum norm (default: 1.0).

Purpose: Prevents exploding gradients from destabilizing training.
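In code, the clip is a single norm check (illustrative `clip_grad_norm` helper with the default $c = 1.0$):

```python
import numpy as np

def clip_grad_norm(g, max_norm=1.0):
    """Rescale g if its global L2 norm exceeds max_norm."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g
```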


11. Complete Training Algorithm

Putting it all together:

Algorithm: NeuroShard DiLoCo Training

Inputs:

  • Model $f_\theta$ with parameters $\theta$
  • Inner optimizer (AdamW) with learning rate $\eta_{\text{inner}}$
  • Outer optimizer (Nesterov) with learning rate $\eta_{\text{outer}}$, momentum $\mu$
  • Inner steps $H$, nodes $N$, aggregation function $\mathrm{Agg}$

For each outer step $k = 1, 2, \ldots$:

  1. Save initial weights: $\theta_0^{(i)} \leftarrow \theta$ for all nodes $i$

  2. Inner loop (on each node i independently):

    for t = 0 to H-1:
        Sample batch B_t^{(i)}
        Compute loss: L = CrossEntropy(f_θ(B_t), labels)
        Compute gradient: g_t = ∇_θ L
        Clip gradient: g_t = clip(g_t, max_norm)
        Update: θ = AdamW(θ, g_t)
  3. Compute pseudo-gradient:

    $\Delta\theta^{(i)} = \theta_0^{(i)} - \theta^{(i)}$
  4. Compress (optional):

    $\Delta\theta^{(i)}_{\text{compressed}} = \mathrm{Quantize}\big(\mathrm{TopK}(\Delta\theta^{(i)})\big)$
  5. Exchange via gossip protocol

  6. Validate each received gradient:

    for each peer gradient Δθ^{(j)}:
        if cosine_sim(Δθ^{(j)}, Δθ^{(i)}) < τ: reject
        if magnitude_ratio out of bounds: reject
  7. Aggregate:

    $\overline{\Delta\theta} = \mathrm{TrimmedMean}\big(\{\Delta\theta^{(i)}\}_{\text{valid}}\big)$
  8. Outer update (Nesterov):

    v = μ * v + Δθ_bar
    θ = θ_0 - η_outer * (μ * v + Δθ_bar)
  9. Broadcast new θ to all nodes
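A compact sketch of one outer step, wiring together the helpers sketched in earlier sections (all names are illustrative; `node.train_locally` is a hypothetical interface, and using the local pseudo-gradient as the validation reference is an assumption):

```python
def outer_step(theta, nodes, v, eta_outer, mu, H):
    """One DiLoCo outer step: local training, validation, robust aggregation, Nesterov."""
    deltas = []
    for node in nodes:
        theta_i = node.train_locally(theta, steps=H)   # inner AdamW loop (step 2)
        deltas.append(theta - theta_i)                 # pseudo-gradient (step 3)
    ref = deltas[0]                                    # assumed reference gradient
    valid = [d for d in deltas if validate_gradient(d, ref)]   # step 6
    delta_bar = trimmed_mean(valid)                    # step 7
    v = mu * v + delta_bar                             # step 8: Nesterov outer update
    theta = theta - eta_outer * (mu * v + delta_bar)
    return theta, v
```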


12. Convergence Analysis

12.1 Assumptions

  1. L-smoothness: $\|\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\phi)\| \le L\,\|\theta - \phi\|$
  2. Bounded variance: $\mathbb{E}\big[\|g - \nabla\mathcal{L}\|^2\big] \le \sigma^2$
  3. Bounded gradients: $\|\nabla\mathcal{L}(\theta)\| \le G$

12.2 Main Result

Theorem: Under the above assumptions, DiLoCo with $N$ nodes, $H$ inner steps, and appropriate learning rates achieves:

$$\frac{1}{T}\sum_{k=1}^{T}\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_k)\|^2\big] \le O\!\left(\frac{\mathcal{L}(\theta_0) - \mathcal{L}^*}{\eta\, T H} + \frac{\eta L \sigma^2}{N} + \eta^2 L^2 H \sigma^2\right)$$

Optimal learning rate: $\eta = O\!\left(\sqrt{\dfrac{N}{T H L \sigma^2}}\right)$

Resulting convergence rate:

$$O\!\left(\sqrt{\frac{L\,\big(\mathcal{L}(\theta_0) - \mathcal{L}^*\big)\,\sigma^2}{N T H}}\right)$$

This shows:

  • Linear speedup with N nodes ✓
  • Convergence improves with more inner steps H
  • Same asymptotic rate as synchronous SGD ✓

13. Summary of Key Equations

| Concept | Equation |
| --- | --- |
| Cross-Entropy Loss | $\mathcal{L} = -\log \mathrm{softmax}(z)_y$ |
| AdamW Update | $\theta \leftarrow \theta - \eta\,\big(\hat{m}/(\sqrt{\hat{v}}+\epsilon) + \lambda\theta\big)$ |
| Nesterov Momentum | $\theta \leftarrow \theta - \eta\,(\mu v + \Delta\theta)$ |
| Pseudo-Gradient | $\Delta\theta = \theta_0 - \theta_H$ |
| Trimmed Mean | $\bar{x} = \mathrm{mean}\big(x_{(k+1)}, \ldots, x_{(n-k)}\big)$ |
| Krum Score | $S(i) = \sum_{j\in\mathcal{N}_i}\lVert g_i - g_j\rVert^2$ |
| RMSNorm | $\hat{x} = x/\mathrm{RMS}(x)\cdot\gamma$ |
| RoPE | $\mathrm{RoPE}(x,m) = x \odot e^{im\theta}$ |
| Attention | $\mathrm{softmax}\big(QK^{\top}/\sqrt{d_k}\big)\,V$ |
| SwiGLU | $\mathrm{SiLU}(xW_g)\odot(xW_u)$ |

References

  1. DiLoCo: Douillard et al., "DiLoCo: Distributed Low-Communication Training of Language Models" (2023)
  2. AdamW: Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (2019)
  3. Nesterov: Nesterov, "A method for solving the convex programming problem with convergence rate O(1/k²)" (1983)
  4. Krum: Blanchard et al., "Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent" (2017)
  5. RoPE: Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
  6. GQA: Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models" (2023)
  7. SwiGLU: Shazeer, "GLU Variants Improve Transformer" (2020)
  8. Gradient Compression: Stich et al., "Sparsified SGD with Memory" (2018)

Released under the Apache License 2.0.