Troubleshooting

Solutions to common issues when running a NeuroShard node.

Startup Issues

"Wallet token required!"

[ERROR] Wallet token required!

Cause: No token provided.

Solution:

bash
neuroshard --token YOUR_WALLET_TOKEN

Get your token at neuroshard.com/dashboard.
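
If you prefer not to paste the token directly on the command line (where it ends up in shell history), a plain-shell alternative is to read it into a local variable first. The variable name below is arbitrary; the CLI only ever sees the --token flag.

bash
# Read the token without echoing it, then pass it to the CLI
read -s NEUROSHARD_TOKEN
neuroshard --token "$NEUROSHARD_TOKEN"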

"Invalid mnemonic"

[WARNING] Invalid mnemonic - treating as raw token

Cause: The phrase passed to --token was not recognized as a valid 12-word mnemonic, so the node treats it as a raw token.

Solution:

  • Verify all 12 words are correct and in order
  • Ensure words are separated by single spaces
  • Use quotes around the mnemonic:
    bash
    neuroshard --token "word1 word2 word3 ..."
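
Before retrying, a quick plain-shell sanity check is to count the words in the quoted phrase; for a standard mnemonic the output should be 12.

bash
# Substitute your real phrase; a standard mnemonic has exactly 12 words
echo "word1 word2 word3 ..." | wc -w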

"No GPU detected"

[NODE] No GPU detected, using CPU

For NVIDIA GPUs:

bash
# Check if NVIDIA driver is installed
nvidia-smi

# Install CUDA-enabled PyTorch
pip uninstall torch
pip install torch --index-url https://download.pytorch.org/whl/cu121
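
After reinstalling, you can confirm that PyTorch actually sees the GPU (this mirrors the MPS check shown below for Apple Silicon):

bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"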

For Apple Silicon:

bash
# Ensure PyTorch is installed correctly
pip install torch
python -c "import torch; print(torch.backends.mps.is_available())"

Memory Issues

"Out of memory"

RuntimeError: MPS backend out of memory
RuntimeError: CUDA out of memory

Solutions:

  1. Limit memory usage:

    bash
    neuroshard --token YOUR_TOKEN --memory 4096
  2. Reduce batch size (the node does this automatically on OOM):

    bash
    # If you see "Reduced batch size to X" in the logs,
    # the node has auto-recovered and no further action is needed
  3. Close other applications using GPU memory

  4. Use CPU if GPU memory is too limited:

    bash
    CUDA_VISIBLE_DEVICES="" neuroshard --token YOUR_TOKEN
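
To see how much GPU memory is actually in use while you experiment with these options (NVIDIA only; figures depend on your card), you can poll nvidia-smi:

bash
# Report used/total GPU memory every 5 seconds
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5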

"System memory at X%, skipping training"

Cause: System RAM is critically low.

Solution:

bash
# Free up cached memory (Linux only)
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Or limit neuroshard memory
neuroshard --token YOUR_TOKEN --memory 2048
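
On Linux, free -h shows how much RAM is actually available before and after freeing caches:

bash
free -h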

Network Issues

"Connection refused"

Failed to forward to peer http://...: Connection refused

Causes & Solutions:

  1. Firewall blocking ports:

    bash
    sudo ufw allow 8000/tcp
    sudo ufw allow 9000/tcp
  2. Peer is offline: Normal; the network will route around it

  3. Port already in use:

    bash
    neuroshard --token YOUR_TOKEN --port 8001
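
If you suspect the port is already taken, you can check which process is listening on it before switching (Linux commands; lsof or ss may need to be installed):

bash
sudo lsof -i :8000
# or
ss -ltnp | grep ':8000'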

"Tracker connection failed"

Failed to connect to tracker

Solutions:

  1. Check internet connection
  2. Verify tracker URL:
    bash
    curl https://neuroshard.com/api/tracker/peers
  3. Use a different tracker if available

"No peers found"

Cause: Network is bootstrapping or you're the first node.

Solution: Wait a few minutes. The tracker will provide peers as they join.
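
To see what the tracker currently reports, you can query the same peers endpoint used above; assuming it returns a JSON array, jq can count the entries:

bash
curl -s https://neuroshard.com/api/tracker/peers | jq 'length'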

Training Issues

"Data not ready"

RuntimeError: Data not ready - shard still loading

Cause: Genesis data loader is still downloading.

Solution: Wait 30-60 seconds. The node will retry automatically.
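
If you want to confirm when training actually starts, one option is to poll the local stats endpoint (the field name follows the monitoring examples used later on this page):

bash
# Print the training round count every 10 seconds; it starts increasing
# once the shard has finished loading
while true; do
  curl -s http://localhost:8000/api/stats | jq '.total_training_rounds'
  sleep 10
done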

"Genesis loader init failed"

[GENESIS] ERROR: Failed to initialize loader

Solutions:

  1. Check disk space:

    bash
    df -h
  2. Increase storage limit:

    bash
    neuroshard --token YOUR_TOKEN --max-storage 200
  3. Check write permissions:

    bash
    ls -la ~/.neuroshard/
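
To see how much space the data directory is already using, compare du against your --max-storage setting:

bash
du -sh ~/.neuroshard/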

Training Loss Not Decreasing

Causes:

  1. Early network stage: Expected behavior when model is small
  2. Gradient poisoning: Rare, network defenses should handle it
  3. Learning rate issues: Currently fixed, no user action needed

Monitor:

bash
curl http://localhost:8000/api/stats | jq '.current_loss'
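
To track the trend rather than a single sample (requires watch and jq), re-run the query on an interval:

bash
watch -n 30 "curl -s http://localhost:8000/api/stats | jq '.current_loss'"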

Dashboard Issues

Dashboard Not Opening

[NODE] Could not open browser

Solution: Open manually at http://localhost:8000/

Dashboard Shows Stale Data

Solution: Refresh the page. The dashboard auto-refreshes every 5 seconds.

API Returns 404 or Empty Reply

curl: (52) Empty reply from server

Solution: Wait for the node to fully initialize (10-30 seconds after startup).
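
If you would rather wait programmatically than keep retrying by hand, a small shell loop works:

bash
# Retry until the API answers, then report success (Ctrl+C to stop early)
until curl -sf http://localhost:8000/api/stats > /dev/null; do
  echo "Node not ready yet, retrying in 5s..."
  sleep 5
done
echo "Node API is up"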

Checkpoint Issues

"No checkpoint found, starting fresh"

[NODE] No checkpoint found at dynamic_node_XXXX.pt, starting fresh

Cause: No previous checkpoint exists for this wallet.

Solution: This is normal on first run. Checkpoints are saved every 10 steps.

"Architecture mismatch!"

[NODE] Architecture mismatch! Checkpoint is incompatible.
[NODE]   Saved: 15L × 704H, heads=11/1
[NODE]   Current: 17L × 770H, heads=11/1
[NODE]   Starting fresh (architecture was upgraded)

Cause: The model architecture changed due to:

  • Network capacity increased (more nodes joined)
  • Memory fluctuation caused different architecture calculation

Solutions:

  1. Normal behavior - when the network genuinely upgrades, old checkpoints become incompatible

  2. If this happens frequently on restarts, update to the latest version, which includes the memory tier rounding fix:

    bash
    pip install --upgrade neuroshard

    This rounds memory to 500 MB tiers, preventing small fluctuations from changing the architecture.

  3. Clear old checkpoints and start fresh:

    bash
    rm -rf ~/.neuroshard/checkpoints/*
    rm -rf ~/.neuroshard/training_logs/*

"Checkpoint layer mismatch"

[WARNING] Checkpoint layer mismatch, starting fresh

Cause: Your assigned layers changed (network rebalanced).

Solution: Normal behavior. Common layers will be loaded if possible.

Checkpoint Not Loading After Restart

Symptoms: Training restarts from step 1 instead of resuming.

Causes & Solutions:

  1. Architecture changed - Check logs for "Architecture mismatch"
  2. Different node_id - Older versions used machine-specific IDs. Update to v0.0.20+, which uses wallet_id
  3. Checkpoint corrupted - Delete and restart:
    bash
    rm ~/.neuroshard/checkpoints/dynamic_node_*.pt

Checkpoint Corrupted

RuntimeError: Failed to load checkpoint

Solution:

bash
# Delete corrupted checkpoint
rm ~/.neuroshard/checkpoints/dynamic_node_*.pt

# Clear tracker state too (optional, but recommended)
rm ~/.neuroshard/training_logs/*.json

# Restart node
neuroshard --token YOUR_TOKEN

GlobalTrainingTracker State Preserved But Model Lost

If you see logs like:

[NODE] Restored tracker state: 120 steps, avg_loss=0.4872
[NODE] No checkpoint found, starting fresh

Cause: The tracker state (loss history) persisted, but model weights didn't (architecture changed).

Solution: This is actually fine! The tracker history helps you see long-term trends even when the model architecture changes.

Performance Issues

Low GPU Utilization

Causes & Solutions:

  1. Small batch size: Expected with limited memory
  2. Network bottleneck: Increase --diloco-steps
  3. Data loading: Genesis loader might be slow initially

High CPU Usage

Normal during training. To limit:

bash
neuroshard --token YOUR_TOKEN --cpu-threads 4

Node Seems Slow

  1. Check GPU utilization:

    bash
    nvidia-smi  # For NVIDIA
    # Or Activity Monitor on macOS
  2. Check if training is active:

    bash
    curl http://localhost:8000/api/stats | jq '.total_training_rounds'
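
For a continuous view of GPU utilization instead of a one-off snapshot (NVIDIA only):

bash
# Refresh the nvidia-smi output every 2 seconds
nvidia-smi -l 2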

Getting Help

Check Logs

bash
# Live logs (if running in foreground)
# Check console output

# Systemd logs
journalctl -u neuroshard -f

# Docker logs
docker logs -f neuroshard

Report Issues

  1. Check our Discord for community support
  2. Include:
    • NeuroShard version (neuroshard --version)
    • OS and Python version
    • Full error message
    • Steps to reproduce
