# Troubleshooting

Solutions to common issues when running a NeuroShard node.
## Startup Issues
### "Wallet token required!"
```
[ERROR] Wallet token required!
```

**Cause:** No token provided.

**Solution:**

```bash
neuroshard --token YOUR_WALLET_TOKEN
```

Get your token at neuroshard.com/dashboard.
### "Invalid mnemonic"
```
[WARNING] Invalid mnemonic - treating as raw token
```

**Cause:** The mnemonic phrase is incorrect.

**Solution:**

- Verify all 12 words are correct and in order
- Ensure words are separated by single spaces
- Use quotes around the mnemonic:

```bash
neuroshard --token "word1 word2 word3 ..."
```
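You can sanity-check the shape of a phrase before passing it to the CLI. This is an illustrative heuristic only (12 space-separated lowercase words), not NeuroShard's actual parsing logic:

```python
def looks_like_mnemonic(phrase: str) -> bool:
    """Heuristic: 12 lowercase alphabetic words separated by single spaces.

    Illustrative sketch - NeuroShard's real mnemonic parser may differ.
    """
    words = phrase.strip().split(" ")
    return len(words) == 12 and all(w.isalpha() and w.islower() for w in words)
```

A double space produces an empty "word", so `looks_like_mnemonic` also catches the spacing mistake described above.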
### "No GPU detected"
```
[NODE] No GPU detected, using CPU
```

For NVIDIA GPUs:

```bash
# Check if the NVIDIA driver is installed
nvidia-smi

# Install CUDA-enabled PyTorch
pip uninstall torch
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

For Apple Silicon:

```bash
# Ensure PyTorch is installed correctly
pip install torch
python -c "import torch; print(torch.backends.mps.is_available())"
```

## Memory Issues
### "Out of memory"
```
RuntimeError: MPS backend out of memory
RuntimeError: CUDA out of memory
```

**Solutions:**

1. Limit memory usage:

   ```bash
   neuroshard --token YOUR_TOKEN --memory 4096
   ```

2. Reduce the batch size (automatic on OOM, but you can force it lower):

   ```bash
   # If you see "Reduced batch size to X",
   # the node auto-recovers - no action needed
   ```

3. Close other applications that are using GPU memory.

4. Use the CPU if GPU memory is too limited:

   ```bash
   CUDA_VISIBLE_DEVICES="" neuroshard --token YOUR_TOKEN
   ```
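The auto-recovery in step 2 is essentially a halving loop: on an OOM error, cut the batch size and retry. A minimal sketch of that pattern (the `train_step` callable and the minimum batch size of 1 are assumptions, not NeuroShard internals):

```python
def run_with_oom_backoff(train_step, batch_size: int, min_batch: int = 1) -> int:
    """Retry train_step with a halved batch size after each OOM error.

    Returns the batch size that finally succeeded. Sketch only - mirrors
    the "Reduced batch size to X" behavior described above.
    """
    while True:
        try:
            train_step(batch_size)
            return batch_size
        except RuntimeError as exc:
            # Re-raise anything that is not an OOM, or if we cannot go lower.
            if "out of memory" not in str(exc).lower() or batch_size <= min_batch:
                raise
            batch_size = max(min_batch, batch_size // 2)
            print(f"Reduced batch size to {batch_size}")
```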
### "System memory at X%, skipping training"
**Cause:** System RAM is critically low.

**Solution:**

```bash
# Free up cached memory (Linux)
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Or limit NeuroShard's memory
neuroshard --token YOUR_TOKEN --memory 2048
```

## Network Issues
### "Connection refused"
```
Failed to forward to peer http://...: Connection refused
```

**Causes & Solutions:**

1. Firewall blocking ports:

   ```bash
   sudo ufw allow 8000/tcp
   sudo ufw allow 9000/tcp
   ```

2. Peer is offline: normal - the network will route around it.

3. Port already in use:

   ```bash
   neuroshard --token YOUR_TOKEN --port 8001
   ```
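Before picking an alternative port for `--port`, you can check whether a port is already taken with a quick stdlib probe (illustrative; a tool like `lsof -i :8000` works just as well):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind the TCP port, i.e. nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Allow rebinding ports stuck in TIME_WAIT from a previous run.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

if not port_is_free(8000):
    print("Port 8000 busy - try: neuroshard --token YOUR_TOKEN --port 8001")
```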
### "Tracker connection failed"
```
Failed to connect to tracker
```

**Solutions:**

- Check your internet connection
- Verify the tracker URL:

  ```bash
  curl https://neuroshard.com/api/tracker/peers
  ```

- Use a different tracker if available
### "No peers found"
**Cause:** The network is bootstrapping, or you are the first node.

**Solution:** Wait a few minutes. The tracker will provide peers as they join.
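While waiting, the node effectively polls the tracker until peers appear. If you want to watch discovery yourself, here is a sketch of polling with exponential backoff; the fetch function is injected, so the endpoint and response shape are assumptions:

```python
import time

def wait_for_peers(fetch_peers, attempts: int = 5, initial_delay: float = 1.0):
    """Poll fetch_peers() until it returns a non-empty list, doubling the
    delay between attempts.

    fetch_peers is any callable returning a list of peer addresses, e.g. a
    wrapper around GET https://neuroshard.com/api/tracker/peers (assumed shape).
    """
    delay = initial_delay
    for _ in range(attempts):
        peers = fetch_peers()
        if peers:
            return peers
        time.sleep(delay)
        delay *= 2
    return []
```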
## Training Issues
### "Data not ready"
```
RuntimeError: Data not ready - shard still loading
```

**Cause:** The Genesis data loader is still downloading.

**Solution:** Wait 30-60 seconds. The node will retry automatically.
### "Genesis loader init failed"
```
[GENESIS] ERROR: Failed to initialize loader
```

**Solutions:**

1. Check disk space:

   ```bash
   df -h
   ```

2. Increase the storage limit:

   ```bash
   neuroshard --token YOUR_TOKEN --max-storage 200
   ```

3. Check write permissions:

   ```bash
   ls -la ~/.neuroshard/
   ```
### Training Loss Not Decreasing
**Causes:**

- Early network stage: expected behavior while the model is small
- Gradient poisoning: rare - network defenses should handle it
- Learning rate issues: currently fixed, no user action needed

**Monitor:**

```bash
curl http://localhost:8000/api/stats | jq '.current_loss'
```

## Dashboard Issues
### Dashboard Not Opening
```
[NODE] Could not open browser
```

**Solution:** Open it manually at http://localhost:8000/
### Dashboard Shows Stale Data
**Solution:** Refresh the page. The dashboard auto-refreshes every 5 seconds.
### API Returns 404
```
curl: (52) Empty reply from server
```

**Solution:** Wait for the node to fully initialize (10-30 seconds after startup).
## Checkpoint Issues
### "No checkpoint found, starting fresh"
```
[NODE] No checkpoint found at dynamic_node_XXXX.pt, starting fresh
```

**Cause:** No previous checkpoint exists for this wallet.

**Solution:** This is normal on first run. Checkpoints are saved every 10 steps.
### "Architecture mismatch!"
```
[NODE] Architecture mismatch! Checkpoint is incompatible.
[NODE] Saved: 15L × 704H, heads=11/1
[NODE] Current: 17L × 770H, heads=11/1
[NODE] Starting fresh (architecture was upgraded)
```

**Cause:** The model architecture changed because:

- Network capacity increased (more nodes joined)
- A memory fluctuation caused a different architecture calculation
**Solutions:**

1. Normal behavior: when the network genuinely upgrades, old checkpoints are incompatible.

2. If this happens frequently on restarts, update to the latest version, which includes the memory tier rounding fix:

   ```bash
   pip install --upgrade neuroshard
   ```

   This rounds memory to 500 MB tiers, preventing small fluctuations from changing the architecture.

3. Clear old checkpoints and start fresh:

   ```bash
   rm -rf ~/.neuroshard/checkpoints/*
   rm -rf ~/.neuroshard/training_logs/*
   ```
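The tier rounding mentioned in solution 2 can be sketched as follows. The 500 MB tier size comes from the fix described above; the function name and the round-down direction are assumptions for illustration:

```python
TIER_MB = 500  # tier size from the fix above; rounding down is an assumption

def memory_tier(memory_mb: int) -> int:
    """Round available memory down to the nearest 500 MB tier, so small
    fluctuations in the reading map to the same architecture calculation."""
    return (memory_mb // TIER_MB) * TIER_MB

# Nearby readings land in the same tier, keeping the architecture stable:
print(memory_tier(4096), memory_tier(4012))  # both map to 4000
```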
### "Checkpoint layer mismatch"
```
[WARNING] Checkpoint layer mismatch, starting fresh
```

**Cause:** Your assigned layers changed (the network rebalanced).

**Solution:** Normal behavior. Common layers will be loaded if possible.
### Checkpoint Not Loading After Restart
**Symptoms:** Training restarts from step 1 instead of resuming.

**Causes & Solutions:**

- Architecture changed - check the logs for "Architecture mismatch"
- Different node_id - old versions used machine-specific IDs; update to v0.0.20+, which uses wallet_id
- Checkpoint corrupted - delete it and restart:

  ```bash
  rm ~/.neuroshard/checkpoints/dynamic_node_*.pt
  ```
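A generic way to detect corruption before loading is to keep a sidecar checksum next to each checkpoint and compare it on startup. NeuroShard's actual checkpoint format is not documented here, so this is a stand-alone sketch:

```python
import hashlib
from pathlib import Path

def write_checksum(path: Path) -> None:
    """Store the SHA-256 of a checkpoint file next to it as <name>.sha256."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    path.with_suffix(path.suffix + ".sha256").write_text(digest)

def checkpoint_intact(path: Path) -> bool:
    """Return True if the checkpoint file matches its recorded checksum."""
    sidecar = path.with_suffix(path.suffix + ".sha256")
    if not path.exists() or not sidecar.exists():
        return False
    return hashlib.sha256(path.read_bytes()).hexdigest() == sidecar.read_text()
```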
### Checkpoint Corrupted
```
RuntimeError: Failed to load checkpoint
```

**Solution:**

```bash
# Delete the corrupted checkpoint
rm ~/.neuroshard/checkpoints/dynamic_node_*.pt

# Clear tracker state too (optional, but recommended)
rm ~/.neuroshard/training_logs/*.json

# Restart the node
neuroshard --token YOUR_TOKEN
```

### GlobalTrainingTracker State Preserved But Model Lost
If you see logs like:

```
[NODE] Restored tracker state: 120 steps, avg_loss=0.4872
[NODE] No checkpoint found, starting fresh
```

**Cause:** The tracker state (loss history) persisted, but the model weights did not (the architecture changed).

**Solution:** This is fine. The tracker history lets you see long-term trends even when the model architecture changes.
## Performance Issues
### Low GPU Utilization
**Causes & Solutions:**

- Small batch size: expected with limited memory
- Network bottleneck: increase `--diloco-steps`
- Data loading: the Genesis loader might be slow initially
### High CPU Usage
Normal during training. To limit it:

```bash
neuroshard --token YOUR_TOKEN --cpu-threads 4
```

### Node Seems Slow
1. Check GPU utilization:

   ```bash
   nvidia-smi  # For NVIDIA
   # Or use Activity Monitor on macOS
   ```

2. Check whether training is active:

   ```bash
   curl http://localhost:8000/api/stats | jq '.total_training_rounds'
   ```
## Getting Help
### Check Logs
```bash
# Live logs (if running in the foreground):
# check the console output

# Systemd logs
journalctl -u neuroshard -f

# Docker logs
docker logs -f neuroshard
```

### Report Issues
- Check our Discord for community support
- Include:
  - NeuroShard version (`neuroshard --version`)
  - OS and Python version
  - The full error message
  - Steps to reproduce
### Community Support
- Website Community
- Discord Community
