
How It Works

Deep dive into the NeuroShard architecture and how decentralized training actually works.

The Big Picture

NeuroShard is fundamentally different from traditional AI systems:

| Traditional AI | NeuroShard |
| --- | --- |
| Model trained in a data center | Model trained across a global network |
| Company owns the model | Network owns the model |
| Fixed architecture | Dynamic architecture that grows |
| Centralized inference | Distributed inference |
| Opaque training data | Verifiable Genesis Dataset |

The Training Pipeline

1. Genesis Data Sharding

Training data comes from the Genesis Dataset — a cryptographically verified manifest of high-quality, open-source datasets (FineWeb, RedPajama, etc.).

Why this matters:

  • Any peer can verify a Driver's work by downloading the same shard
  • If a Driver sends garbage, the hash of their output will mismatch (see the sketch below)
  • Forces Drivers to actually process real training data
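
For illustration, a hash check of this kind might look like the following sketch; the manifest layout and function names are assumptions, not NeuroShard's actual API:

```python
import hashlib

def verify_shard(shard_id: str, shard_bytes: bytes, genesis_manifest: dict) -> bool:
    """Check a downloaded shard against the Genesis manifest (illustrative only).

    genesis_manifest is assumed to map shard IDs to expected SHA-256 digests.
    If a Driver processed garbage instead of the real shard, the digest of the
    data it claims to have used will not match the manifest entry.
    """
    expected = genesis_manifest[shard_id]              # digest recorded in the manifest
    actual = hashlib.sha256(shard_bytes).hexdigest()   # digest of the data actually used
    return actual == expected                          # mismatch means garbage work
```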

2. Forward Pass (Pipeline Parallelism)

When training or running inference, data flows through the network as a pipeline: each node holds a contiguous slice of the model's layers, the first node embeds the input tokens, every node applies its own layers and forwards the resulting activations to the next node, and the node holding the LM head produces the final logits.
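
A minimal sketch of that flow, assuming each peer exposes hypothetical embed(), forward_layers(), and lm_head() calls over the layer slice it holds:

```python
def pipeline_forward(nodes, token_ids):
    """Pipeline-parallel forward pass (illustrative only).

    nodes is an ordered list of peers, each holding a contiguous slice of layers.
    """
    activations = nodes[0].embed(token_ids)                  # first node holds the embedding
    for node in nodes:
        activations = node.forward_layers(activations)       # apply this node's layer slice
    return nodes[-1].lm_head(activations)                    # last node holds the LM head
```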

3. Backward Pass (Training)

During training, gradients flow backwards through the same chain of nodes: the node holding the LM head computes the loss, and each node backpropagates through its own layers before handing the gradient of its input activations to the previous node.
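
The matching backward pass traverses the same peers in reverse; backward_layers() is again a hypothetical per-node call:

```python
def pipeline_backward(nodes, grad_output):
    """Pipeline-parallel backward pass (illustrative only)."""
    grad = grad_output                      # gradient w.r.t. the last node's output
    for node in reversed(nodes):
        # Each node backpropagates through its own layers, accumulates its
        # parameter gradients locally, and returns the gradient of its input
        # activations for the previous node in the pipeline.
        grad = node.backward_layers(grad)
    return grad
```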

4. DiLoCo: Distributed Low-Communication Training

Here's the magic that makes decentralized training practical:

Traditional Distributed Training:

  • Sync gradients after EVERY step
  • Requires low-latency, high-bandwidth connection
  • Effectively impossible on residential internet connections

DiLoCo (What NeuroShard Uses):

  • Each node trains independently for N steps (default: 500)
  • Only sync pseudo-gradients periodically
  • 500x less communication!
```python
# DiLoCo inner loop: train locally with no communication
w0 = copy(weights)               # Save initial weights
for step in range(500):          # Inner steps (default: 500)
    optimizer.zero_grad()
    loss = forward(batch)
    loss.backward()
    optimizer.step()

# Compute pseudo-gradient (treated like a gradient in the outer step)
delta = w0 - weights

# Sync with peers (rarely!)
aggregated_delta = gossip_aggregate(delta)

# Apply outer update: step from w0 against the aggregated pseudo-gradient
weights = w0 - outer_lr * aggregated_delta
```

5. Robust Aggregation

When gradients are synchronized, we can't trust all nodes. NeuroShard uses Byzantine-tolerant aggregation:

| Method | Description | Robustness |
| --- | --- | --- |
| Mean | Simple average | Vulnerable to poisoning |
| Trimmed Mean | Remove top/bottom 10%, then average | Good |
| Coordinate Median | Median of each parameter | Good |
| Krum | Select the gradient closest to the majority | Excellent |
| Multi-Krum | Weighted combination of the top-k gradients | Excellent |

Example attack scenario:

Honest Nodes: gradient = [0.1, 0.2, 0.1, 0.15]
Malicious Node: gradient = [100, -100, 50, -50]  # Poisoning attempt

Simple Mean: [25.1, -24.7, 12.8, -12.4]  # Poisoned!
Trimmed Mean: [0.125, 0.175, 0.1, 0.125]  # Safe!
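
A coordinate-wise trimmed mean is easy to sketch; the 10% cutoff mirrors the table above, but the code is illustrative rather than NeuroShard's actual aggregator:

```python
import numpy as np

def trimmed_mean(gradients, trim_fraction=0.1):
    """Coordinate-wise trimmed mean over per-node gradient vectors."""
    stacked = np.stack([np.asarray(g, dtype=float) for g in gradients])
    k = int(len(gradients) * trim_fraction)          # how many values to drop at each end
    sorted_vals = np.sort(stacked, axis=0)           # sort each coordinate across nodes
    kept = sorted_vals[k:len(gradients) - k] if k > 0 else sorted_vals
    return kept.mean(axis=0)

# Nine honest nodes plus one poisoner: the extreme values are trimmed away.
grads = [[0.1, 0.2, 0.1, 0.15]] * 9 + [[100, -100, 50, -50]]
print(trimmed_mean(grads))                           # [0.1, 0.2, 0.1, 0.15]
```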

Dynamic Architecture

No Fixed Model Sizes

Traditional LLMs have fixed sizes (7B, 13B, 70B). NeuroShard's model grows organically:

| Network Size | Architecture | Params |
| --- | --- | --- |
| 10 nodes (40GB) | 16 layers x 1024 dim | 350M |
| 50 nodes (300GB) | 24 layers x 2048 dim | 2.7B |
| 100 nodes (800GB) | 32 layers x 3072 dim | 9.2B |
| 500 nodes (4TB) | 48 layers x 5120 dim | 47B |
| 1000 nodes (8TB) | 64 layers x 7168 dim | 123B |

Scaling Laws

Architecture follows empirical scaling laws from GPT-3/Chinchilla research:

Width ∝ M^0.6, Depth ∝ M^0.4

Where M is total network memory. Width grows faster than depth (empirically more efficient).
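
As a rough sketch, the rule above could be applied like this; the exponents come from the formula, but the calibration constants and rounding choice are made up for illustration:

```python
# Hypothetical calibration constants chosen only so that small networks land
# near the first row of the table above; they are not NeuroShard's real values.
C_WIDTH = 115.0
C_DEPTH = 3.7

def propose_architecture(total_memory_gb: float) -> tuple[int, int]:
    """Return (depth, width) following width ∝ M^0.6, depth ∝ M^0.4."""
    width = int(C_WIDTH * total_memory_gb ** 0.6)
    depth = int(C_DEPTH * total_memory_gb ** 0.4)
    width = max(64, (width // 64) * 64)      # round width to a hardware-friendly multiple of 64
    return depth, width

print(propose_architecture(40))              # roughly (16, 1024) for a ~10-node network
```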

Architecture Upgrades

The network automatically recalculates the optimal architecture, as sketched below:

  1. The architecture is re-evaluated every time 50 new nodes join
  2. A new architecture is adopted only if the projected improvement is >= 30%
  3. New nodes use the new architecture immediately
  4. Existing nodes migrate gradually
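
A simplified version of that policy, reusing propose_architecture() from the previous sketch; the quality proxy and field names are illustrative assumptions, and only the 50-node and 30% thresholds come from the list above:

```python
UPGRADE_NODE_INTERVAL = 50      # re-evaluate every 50 nodes that join
MIN_IMPROVEMENT = 0.30          # adopt only if at least 30% better

def estimated_quality(arch: dict) -> float:
    # Crude stand-in: use parameter count as a capacity proxy (illustrative only).
    return arch["depth"] * arch["width"] ** 2

def maybe_upgrade(current_arch: dict, node_count: int, total_memory_gb: float) -> dict:
    if node_count % UPGRADE_NODE_INTERVAL != 0:
        return current_arch                          # not a re-evaluation point
    depth, width = propose_architecture(total_memory_gb)
    candidate = {"depth": depth, "width": width}
    improvement = estimated_quality(candidate) / estimated_quality(current_arch) - 1.0
    if improvement >= MIN_IMPROVEMENT:
        return candidate       # new nodes adopt this immediately; existing nodes migrate gradually
    return current_arch
```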

Proof of Neural Work

Every contribution is cryptographically verified:

| Field | Example Value |
| --- | --- |
| node_id | abc123... |
| timestamp | 1699999999.123456 |
| proof_type | "training" |
| tokens_processed | 50000 |
| training_batches | 120 |
| layers_held | 24 |
| has_embedding | true |
| has_lm_head | false |
| signature | [ECDSA signature] |

Verification steps (see the sketch below):

  1. Signature Check: ECDSA with secp256k1 (same as Bitcoin)
  2. Timestamp: Must be within 5 minutes
  3. Replay Prevention: Signature never seen before
  4. Rate Limiting: Max 120 proofs/hour
  5. Plausibility: Claimed work is physically possible
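
Put together, the checks might look like the sketch below. It uses the Python ecdsa package for the secp256k1 signature check; the payload field, the in-memory replay and rate-limit stores, and the plausibility bound are illustrative assumptions, not NeuroShard's actual implementation:

```python
import time
from ecdsa import VerifyingKey, SECP256k1, BadSignatureError  # pip install ecdsa

MAX_CLOCK_SKEW_S = 5 * 60           # timestamps must be within 5 minutes
MAX_PROOFS_PER_HOUR = 120           # rate limit per node
MAX_PLAUSIBLE_TOKENS = 10_000_000   # placeholder upper bound, not a real limit

seen_signatures = set()             # replay prevention
proofs_this_hour = {}               # node_id -> count (reset each hour elsewhere)

def verify_proof(proof: dict, public_key: VerifyingKey) -> bool:
    # 1. Signature check: ECDSA over secp256k1 (same curve as Bitcoin).
    try:
        public_key.verify(proof["signature"], proof["payload"])   # payload = signed bytes (assumed field)
    except BadSignatureError:
        return False
    # 2. Timestamp must be within 5 minutes of the verifier's clock.
    if abs(time.time() - proof["timestamp"]) > MAX_CLOCK_SKEW_S:
        return False
    # 3. Replay prevention: never accept the same signature twice.
    if proof["signature"] in seen_signatures:
        return False
    # 4. Rate limiting: at most 120 proofs per node per hour.
    if proofs_this_hour.get(proof["node_id"], 0) >= MAX_PROOFS_PER_HOUR:
        return False
    # 5. Plausibility: reject claims of physically impossible work.
    if proof["tokens_processed"] > MAX_PLAUSIBLE_TOKENS:
        return False

    seen_signatures.add(proof["signature"])
    proofs_this_hour[proof["node_id"]] = proofs_this_hour.get(proof["node_id"], 0) + 1
    return True
```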

Network Resilience

Swarm Routing

Unlike brittle single-path pipeline parallelism, NeuroShard uses multipath routing: more than one peer can serve each stage of the pipeline, so there is always an alternative route.

If one node in the path (say, Node D) hangs:

  • Traffic is automatically rerouted to an equivalent peer (Node E), as sketched below
  • No crash, no restart
  • 200ms failover
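
In code, the failover can be as simple as trying the next equivalent peer after a short timeout; send_activations() and the peer ordering are placeholders:

```python
FAILOVER_TIMEOUT_S = 0.2        # ~200 ms before giving up on a peer

def forward_with_failover(peers, activations):
    """Send activations to the first healthy peer serving the required layers."""
    for peer in peers:                      # equivalent peers, ordered by preference
        try:
            return peer.send_activations(activations, timeout=FAILOVER_TIMEOUT_S)
        except TimeoutError:
            continue                        # peer hung: fail over to the next one
    raise RuntimeError("no healthy peer available for this layer range")
```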

Activation Buffering

Each node maintains activation queues to maximize GPU utilization.

Network latency is hidden by always having work ready.
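
One way to picture it is a small bounded queue between a network thread and the GPU loop; receive_fn, process_fn, and send_fn stand in for the real networking and compute calls:

```python
import queue
import threading

def run_buffered_stage(receive_fn, process_fn, send_fn, depth: int = 8):
    """Keep the GPU fed: a network thread fills a bounded queue ahead of the compute loop."""
    inbound = queue.Queue(maxsize=depth)        # activations waiting for the GPU

    def receiver():
        while True:
            inbound.put(receive_fn())           # blocks when the buffer is full

    threading.Thread(target=receiver, daemon=True).start()

    while True:
        activations = inbound.get()             # work is almost always ready
        send_fn(process_fn(activations))        # forward the result downstream
```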

Speculative Checkpointing

Hot snapshots every 2 minutes:

  • Model weights
  • Optimizer state
  • DiLoCo buffers

On crash: recover in <30 seconds (vs. full restart).
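
A minimal snapshot loop along those lines, assuming a PyTorch model; the file path and the diloco_buffers object are placeholders:

```python
import time
import torch

CHECKPOINT_INTERVAL_S = 120      # hot snapshot every 2 minutes

def checkpoint_loop(model, optimizer, diloco_buffers, path="snapshot.pt"):
    while True:
        torch.save(
            {
                "weights": model.state_dict(),        # model weights
                "optimizer": optimizer.state_dict(),  # optimizer state
                "diloco": diloco_buffers,             # DiLoCo outer-loop buffers
            },
            path,
        )
        time.sleep(CHECKPOINT_INTERVAL_S)
```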

Economic Incentives

Every action has economic consequences:

| Action | Result |
| --- | --- |
| Compute gradients | Earn NEURO |
| Process inference | Earn NEURO |
| Submit invalid proof | Lose NEURO (slashed) |
| Stake NEURO | Earn bonus multiplier |
| Become Validator | Earn validation fees |

This aligns incentives: the most profitable strategy is to honestly contribute compute.
