Add VIBE.md debugging guide and update README.md

- Add VIBE.md as primary debugging reference for automated tools
- Rewrite README.md as human-focused operator guide
- Fix README.md inaccuracies (remove show-networks.sh references, fix typo)
- Split content: README for humans, VIBE for agents

Generated by Mistral Vibe.
Co-Authored-By: Mistral Vibe <vibe@mistral.ai>
rob
2026-06-22 08:37:38 +00:00
parent 9bf8fb51ab
commit dd8ce689e4
2 changed files with 1457 additions and 234 deletions

988
README.md

File diff suppressed because it is too large

729
VIBE.md Normal file

@@ -0,0 +1,729 @@
# VIBE.md — ethereum-rpc-docker Operations & Debugging Guide
You are an LLM agent or operator **running or debugging blockchain RPC nodes** from this
repository. This file is your **primary reference** for all operational tasks.
This repo contains Docker Compose configurations for blockchain RPC nodes plus operational
scripts for managing them. Everything you need to run, monitor, debug, and fix nodes is here.
---
## 0. WHEN A NODE IS FAULTY — Start Here
### Immediate Triage (30 seconds)
```bash
# 1. Is the container running?
./show-running.sh
# 2. Check overall status of all configured nodes
./show-status.sh
# 3. If you know the config name, check its specific status
./sync-status.sh <config-name>
# 4. Check logs for the faulty node
./logs.sh <config-name>
```
**If the container isn't running**, go to [§3. Container Lifecycle Issues](#3-container-lifecycle-issues)
**If the container is running but not synced**, go to [§4. Sync Issues](#4-sync-issues)
**If the container is running and synced but RPC fails**, go to [§5. RPC/Connectivity Issues](#5-rpcconnectivity-issues)
**If you see errors in logs but aren't sure what they mean**, go to [§6. Log Interpretation](#6-log-interpretation)
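
The routing above can be sketched as a small shell function. This is only an illustration and not part of the repo: the name `triage_section` and its three yes/no arguments are hypothetical, and the answers would come from reading the output of the triage scripts.

```bash
# Hypothetical helper: map three yes/no observations to the section to read next.
# Arguments: <running> <synced> <rpc_ok>, each "yes" or "no".
triage_section() {
  local running="$1" synced="$2" rpc_ok="$3"
  if [ "$running" != "yes" ]; then
    echo "3"   # Container Lifecycle Issues
  elif [ "$synced" != "yes" ]; then
    echo "4"   # Sync Issues
  elif [ "$rpc_ok" != "yes" ]; then
    echo "5"   # RPC/Connectivity Issues
  else
    echo "6"   # Log Interpretation
  fi
}

triage_section yes no yes   # prints 4
```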
---
## 1. Repository Overview
### What This Repo Contains
```
rpc/
├── *.yml # Docker Compose files for node configurations
├── *.sh # Operational scripts (YOUR PRIMARY TOOLS)
├── scripts/ # Additional helper scripts (CometBFT support)
├── <network>/ # Network directories (e.g., ethereum/, op/, arb/)
│ ├── *.yml # Compose files for specific chains
│ └── <chain>/ # Chain-specific assets
│ ├── genesis.json # Custom genesis files
│ ├── rollup.json # Rollup configurations (OP Stack)
│ └── *.Dockerfile # Custom build files
├── README.md # User documentation
└── VIBE.md # THIS FILE — operations guide
```
### Key Concepts
- **Config name**: The compose filename WITHOUT `.yml` (e.g., `ethereum-mainnet-geth-pruned`)
- **Service name**: Derived from config name, used in `docker compose` commands
- **Short name**: Used in URL paths, container labels. Format: `{network}-{chain}[-{client}][-{db_type}]`
- **Volume names**: Docker volumes follow the full config name pattern
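
As a quick illustration of the config-name rule, the name can be derived from a compose file path with `basename`. The function name `config_name` is hypothetical and exists only for this sketch.

```bash
# Hypothetical helper: strip the directory and the .yml suffix from a compose file path.
config_name() {
  basename "$1" .yml
}

config_name ./ethereum-mainnet-geth-pruned.yml   # prints ethereum-mainnet-geth-pruned
```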
### Supported Networks
**Layer 1**: Ethereum, Polygon, BSC, Avalanche, Gnosis, Fantom, Core, Berachain, Ronin, Viction, Fuse, Tron, ThunderCore, Goat, AlephZero, Haqq, Taiko, Rootstock, Dogecoin, Litecoin, Bitcoin, Bitcoin-Cash, Ripple, Solana
**Layer 2 (OP Stack)**: Optimism, Base, Zora, Mode, Blast, Fraxtal, Bob, Boba, Worldchain, Metal, Ink, Lisk, SNAX, Celo
**Layer 2 (Arbitrum)**: Arbitrum One, Arbitrum Nova, Everclear, Playblock, Real, Connext, OpenCampusCodex
**Other L2s**: Linea, Scroll, zkSync Era, Metis, Moonbeam, Starknet, zkEVM, Immutable zkEVM, Polygon zkEVM
---
## 2. Essential Scripts Reference
### Status & Monitoring Scripts
| Script | Usage | What It Does |
|---|---|---|
| `show-status.sh` | `[config-name]` | Lists ALL configured nodes with sync status, block height, health |
| `show-running.sh` | | Lists currently running containers |
| `sync-status.sh` | `<config-name>` | Detailed sync status for one config |
| `latest.sh` | `<config-name>` | Latest block number + hash |
| `logs.sh` | `<config-name>` | Tail logs from all containers in a config |
| `show-db-size.sh` | | Disk usage of ALL Docker volumes, sorted by size |
| `show-ram.sh` | `<config-name>` | Memory usage of containers |
| `show-cpu.sh` | | CPU usage display |
| `peer-count.sh` | | P2P peer count for all running nodes |
| `time-since-last-block.sh` | `<config-name>` | How long since last block was processed |
| `ping.sh` | `<container-name>` | Test network connectivity from container |
| `show-errors.sh` | | Show error counts/logs across containers |
| `show-size.sh` | | Show size of containers/volumes |
| `show-file-size.sh` | | Show static file sizes |
| `show-static-file-size.sh` | | Show static file sizes (alternative) |
### Lifecycle Management Scripts
| Script | Usage | What It Does |
|---|---|---|
| `start.sh` | `<config-name>` | Start all containers for a config |
| `stop.sh` | `<config-name>` | Stop all containers for a config |
| `force-recreate.sh` | `<config-name>` | Force recreate containers (keeps volumes) |
| `rm.sh` | `<config-name>` | Remove containers (keeps volumes) |
| `delete-volumes.sh` | `<config-name>` | **DESTRUCTIVE** - Remove containers AND volumes |
| `delete-node-keys.sh` | `<config-name>` | Remove node keys (for re-initialization) |
### Backup & Restore Scripts
| Script | Usage | What It Does |
|---|---|---|
| `backup-node.sh` | `<config-name> [url]` | Backup volumes locally or to WebDAV |
| `restore-volumes.sh` | `<config-name> [url]` | Restore volumes from local or HTTP |
| `clone-node.sh` | `<config-name>` | Clone a node's state |
| `clone-backup.sh` | | Clone backup files |
| `clone-peers.sh` | | Clone peer information |
| `restore-peers.sh` | | Restore peer connections |
| `list-backups.sh` | | List available backup files |
| `list-peer-backups.sh` | | List peer backup files |
| `list-restorable.sh` | | List restorable configurations |
| `cleanup-backups.sh` | | Remove old backups |
| `cleanup-volumes.sh` | | Clean up unused volumes |
### Network & Connectivity Scripts
| Script | Usage | What It Does |
|---|---|---|
| `upstreams.sh` | | Generate dshackle upstream configuration |
| `connect-peers.sh` | | Connect to peer nodes |
| `search-node.sh` | `<query>` | Search compose files for patterns |
| `search-compose.sh` | `<query>` | Search compose files |
| `network-to-config.sh` | | Map network names to config files |
| `reload_dshackle.sh` | | Reload dshackle configuration |
| `update-whitelist.sh` | | Update IP whitelist |
| `update-ip.sh` | | Update IP configuration |
### Specialized Scripts
| Script | Usage | What It Does |
|---|---|---|
| `op-wheel.sh` | | OP rollup maintenance (rewind, set forkchoice) |
| `op-wheel-finalize-latest-block.sh` | `<client_svc> [node_svc]` | Finalize latest block (nuclear option) |
| `catchup.sh` | `<config-name>` | Help node catch up to chain head |
| `success-if-almost-synced.sh` | `<config-name> <seconds>` | Exit 0 if node is almost synced |
| `groq.sh` | | Query using Groq |
| `trai.sh` | | Trace transaction |
| `multicurl.sh` | | Parallel curl requests |
| `blocknumber.sh` | | Get block number |
| `get-block.sh` | | Get block information |
| `get-local-url.sh` | | Get local RPC URL |
| `get-shortname.sh` | `<config-file>` | Get short name for a config |
| `disk-space.sh` | | Check disk space |
| `limit-bandwidth.sh` | | Limit bandwidth |
| `maintenance.sh` | | Maintenance helper |
| `random-port.sh` | | Generate random port |
| `reference-rpc-endpoint.sh` | | Reference RPC endpoint helper |
| `reset-terminal.sh` | | Reset terminal |
| `setup-bandwidth-limit-cron.sh` | | Setup cron for bandwidth limiting |
---
## 3. Container Lifecycle Issues
### Symptom: Container Won't Start
```bash
# Check why it failed
./logs.sh <config-name> 2>&1 | tail -50
# Check container exit code
docker ps -a --filter "name=<config-name>" --format "{{.Names}} | {{.State}} | {{.Status}}"
# Inspect the container
docker inspect <container-name> | jq '.[0].State'
```
**Common causes:**
- **Port conflict**: Two services trying to bind to same host port
- **Volume permission issues**: Docker can't write to volume
- **Missing environment variables**: `.env` file incomplete
- **Invalid compose syntax**: YAML parsing error
- **Image pull failure**: Network issue or private registry auth
**Fixes:**
```bash
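# Check for literal host ports published more than once across compose files
grep -hoE '^[[:space:]]*-[[:space:]]*"?[0-9]{1,5}:[0-9]{1,5}' *.yml | grep -oE '[0-9]{1,5}:' | sort | uniq -d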
# Validate compose syntax
docker compose -f <config-file>.yml config
# Pull images manually
docker compose -f <config-file>.yml pull
# Start with --build if using custom Dockerfiles
docker compose -f <config-file>.yml up -d --build
```
### Symptom: Container Exits Immediately After Starting
```bash
# View the last 100 lines of logs before exit
./logs.sh <config-name> 2>&1 | tail -100
# Check exit code
docker ps -a --filter "name=<service>" --format "{{.Status}}"
# Open a shell in a one-off container to inspect mounts and config (bypasses the image entrypoint)
docker compose -f <config-file>.yml run --rm --entrypoint sh <service-name>
```
**Common causes:**
- **Missing config files**: `/config/` mount empty or wrong path
- **Invalid flags**: Command-line arguments malformed
- **Database corruption**: Existing data incompatible with new version
- **Checkpoint/genesis mismatch**: Chain ID or genesis doesn't match
**Fixes:**
```bash
# Verify config directory exists (if using custom configs)
ls -la <network>/<chain>/
# Try with fresh volumes (DESTRUCTIVE)
./delete-volumes.sh <config-name>
./start.sh <config-name>
```
### Symptom: Container Restarts Repeatedly (Crash Loop)
```bash
# Watch logs in real-time
./logs.sh <config-name> -f
# Check restart count
docker inspect <container-name> | jq '.[0].RestartCount'
# Check last restart reason
docker inspect <container-name> | jq '.[0].State.ExitCode, .[0].State.Error'
```
**Common causes:**
- **OOM killed**: Memory limit exceeded
- **Out of disk space**: No space left on device
- **Segmentation fault**: Client bug or bad data
- **Panic**: Go client panic
**Fixes:**
```bash
# Check memory usage
./show-ram.sh <config-name>
# Check disk space
df -h /var/lib/docker
./show-db-size.sh
# Increase resources in compose file or .env
# Then force recreate
./force-recreate.sh <config-name>
```
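The crash-loop causes above often show up in the container's exit code. The sketch below maps a few common codes to likely causes. It relies on the general convention that a process killed by a signal exits with 128 plus the signal number, and that unrecovered Go panics exit with status 2. The function name `explain_exit_code` is hypothetical, and the mapping is a starting point rather than a diagnosis.

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
# Example input: docker inspect --format '{{.State.ExitCode}}' <container-name>
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit" ;;
    2)   echo "possible Go panic (unrecovered panics exit with status 2)" ;;
    137) echo "SIGKILL (128+9): often the OOM killer, confirm with .State.OOMKilled" ;;
    139) echo "SIGSEGV (128+11): segmentation fault" ;;
    143) echo "SIGTERM (128+15): stopped on request" ;;
    *)   echo "application error (exit $code): check the logs" ;;
  esac
}

explain_exit_code 137
```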
---
## 4. Sync Issues
### Symptom: Node Not Syncing (Stuck at Block 0 or Low Block)
```bash
# Check sync status
./sync-status.sh <config-name>
# Check current block
./latest.sh <config-name>
# Check logs for sync errors
./logs.sh <config-name> | grep -i -E "sync|error|fail|warn|stuck|behind"
# Check peer count
./peer-count.sh | grep <config-name>
```
**Common causes:**
- **No peers**: P2P network connection failed
- **Wrong network**: Connected to wrong chain
- **Checkpoint too old**: Checkpoint URL unavailable or outdated
- **Snapshot download failed**: Snapshot server unreachable
**Fixes:**
```bash
# Check if checkpoint/snapshot is configured
grep -E "(checkpoint|snapshot)" <config-file>.yml
# Test checkpoint URL manually
curl -I $(grep checkpoint <config-file>.yml | grep -oE 'http[^ ]+')
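# Count peer connections over JSON-RPC (geth example; requires the admin API to be enabled)
curl -s -X POST http://localhost:<port> \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' | jq '.result | length'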
```
### Symptom: Sync is Very Slow
```bash
# Check sync speed over time
./latest.sh <config-name>; sleep 60; ./latest.sh <config-name>
# Check if node is processing blocks
./time-since-last-block.sh <config-name>
# Check CPU and memory
top -d 1 -p $(docker inspect <container> | jq -r '.[0].State.Pid')
```
**Common causes:**
- **Resource constrained**: CPU throttled, memory swapped
- **Disk I/O bottleneck**: Slow storage or contention
- **Network rate limited**: P2P or RPC rate limiting
- **Too many peers**: P2P overhead
- **Wrong sync mode**: Full sync instead of snap sync
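
To turn two block-height samples into a number you can compare over time, a rough rate and catch-up estimate can be computed with integer arithmetic. This is a minimal sketch: the function names are hypothetical, the heights are assumed to be decimal block numbers, and the estimate ignores that the chain head keeps advancing, so treat it as a lower bound.

```bash
# Blocks per minute between two samples taken <seconds> apart (seconds must be > 0).
blocks_per_minute() {
  local first="$1" second="$2" seconds="$3"
  echo $(( (second - first) * 60 / seconds ))
}

# Rough minutes until the node reaches <head>, given its current height and rate.
eta_minutes() {
  local current="$1" head="$2" rate="$3"
  if [ "$rate" -le 0 ]; then
    echo "never"   # the node is not advancing at all
    return
  fi
  # Round up so a partial minute still counts.
  echo $(( (head - current + rate - 1) / rate ))
}

blocks_per_minute 1000 1600 60   # prints 600
eta_minutes 1600 10000 600       # prints 14
```

If the rate is close to the chain's own block production rate, the node is keeping pace but not catching up, which usually points to the resource or I/O causes listed above.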
### Symptom: Sync Stuck at Specific Block
```bash
# Check logs around the stuck block
./logs.sh <config-name> | grep -A 10 -B 10 "block <stuck-block-number>"
# Check if it's a known bad block
# Search online: <chain> bad block <number>
```
**Common causes:**
- **Bad block in chain**: Requires client patch or manual intervention
- **State trie inconsistency**: Database corruption
- **Fork choice issue**: Node on wrong fork
**Fixes for OP Stack:**
```bash
# Try to finalize past the block
./op-wheel-finalize-latest-block.sh <client-service>
```
### Symptom: Node on Wrong Fork / Chain
```bash
# Check chain ID
./latest.sh <config-name> | grep -i chain
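# Check what chain the node thinks it's on (result is a hex-encoded chain ID)
curl -s -X POST http://localhost:<port> \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' | jq -r '.result'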
# Compare with expected chain ID
grep chainId <config-file>.yml
```
---
## 5. RPC/Connectivity Issues
### Symptom: RPC Endpoint Not Responding
```bash
# Test from host
curl -s http://localhost:<port> | head -c 100
# Check if traefik/proxy is running
docker ps | grep -E "(traefik|proxy|nginx)"
# Check traefik logs
docker logs <traefik-container> | tail -50
```
**Common causes:**
- **Container not running**: Client crashed
- **Port not exposed**: Wrong port mapping
- **Traefik misconfiguration**: Labels wrong or missing
- **Firewall blocking**: Host firewall or cloud security group
### Symptom: RPC Returns Wrong Chain ID
```bash
# Query chain ID from RPC
curl -s -X POST http://localhost:<port> \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}'
```
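`eth_chainId` returns the chain ID as a hex string, so it has to be converted before comparing it with the decimal ID you expect. The sketch below does that without `jq`; the function names `hex_to_dec` and `chain_id_from_response` are hypothetical, and the `sed` expression assumes a single-line JSON response.

```bash
# Convert a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Read a JSON-RPC response on stdin and print its hex "result" as a decimal number.
chain_id_from_response() {
  local hex
  hex="$(sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p')"
  [ -n "$hex" ] || return 1   # no hex result found, e.g. an error response
  hex_to_dec "$hex"
}

hex_to_dec 0x1      # prints 1 (Ethereum mainnet)
hex_to_dec 0xa      # prints 10 (OP Mainnet)
echo '{"jsonrpc":"2.0","id":1,"result":"0x2105"}' | chain_id_from_response   # prints 8453 (Base)
```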
### Symptom: Cannot Connect to P2P Network
```bash
# Check peer count
./peer-count.sh | grep <config-name>
# Test P2P connectivity from container
docker exec <client-container> nc -zv <bootstrap-node> <p2p-port>
```
**Fixes:**
```bash
# Set public IP in .env
IP=$(curl -s ipinfo.io/ip)
echo "IP=$IP" >> .env
./force-recreate.sh <config-name>
```
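Appending with `>>` leaves any earlier `IP=` line in `.env`, so the file can end up with two conflicting values. A minimal sketch of an idempotent update follows; the function name `set_env_var` is hypothetical and not a script in this repo.

```bash
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing assignment.
# Usage: set_env_var IP "$(curl -s ipinfo.io/ip)"        (file defaults to .env)
set_env_var() {
  local key="$1" value="$2" file="${3:-.env}"
  local tmp
  tmp="$(mktemp)" || return 1
  if [ -f "$file" ]; then
    # Copy every line except earlier assignments of this key.
    # grep exits 1 when nothing is left to print, which is not an error here.
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```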
---
## 6. Log Interpretation
### Common Log Patterns
#### Warnings (Node may still function)
| Pattern | Meaning | Action |
|---|---|---|
| `WARN.*sync.*slow` | Sync slower than expected | Check resources |
| `WARN.*peers.*low` | Fewer peers than desired | Check P2P connectivity |
| `WARN.*rate.*limit` | API rate limiting active | Normal for public endpoints |
#### Errors (Node is degraded)
| Pattern | Meaning | Action |
|---|---|---|
| `Error.*database.*corrupt` | Database corruption | Restore from backup or resync |
| `Error.*handshake.*fail` | P2P handshake failed | Check chain ID |
| `Error.*no.*peers` | Cannot connect to P2P | Check bootstrap nodes |
| `Error.*timeout` | RPC/HTTP timeout | Check network, increase timeout |
#### Fatal (Node will not function)
| Pattern | Meaning | Action |
|---|---|---|
| `Fatal.*panic` | Client crashed | Check client version |
| `Fatal.*OOM` | Out of memory | Increase memory limit |
| `Fatal.*disk.*full` | No disk space | Free space |
| `Fatal.*permission.*denied` | Filesystem permissions | Fix volume permissions |
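
The three severity tables can be applied mechanically to get a quick count before reading logs line by line. The sketch below is an assumption-laden simplification: it keys only on the words `fatal`, `panic`, `error`, and `warn` (case-insensitive) rather than the full patterns, and the function names are hypothetical.

```bash
# Classify one log line by the most severe keyword it contains.
severity_of() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *fatal*|*panic*) echo fatal ;;
    *error*)         echo error ;;
    *warn*)          echo warn ;;
    *)               echo info ;;
  esac
}

# Read log lines on stdin and print a one-line count per severity.
# Usage: ./logs.sh <config-name> 2>&1 | tail -1000 | summarize_logs
summarize_logs() {
  local fatal=0 error=0 warn=0 line
  while IFS= read -r line; do
    case "$(severity_of "$line")" in
      fatal) fatal=$((fatal + 1)) ;;
      error) error=$((error + 1)) ;;
      warn)  warn=$((warn + 1)) ;;
    esac
  done
  echo "fatal=$fatal error=$error warn=$warn"
}
```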
---
## 7. Resource Issues
### High CPU Usage
```bash
./show-cpu.sh
docker stats <container-name> --no-stream
```
### High Memory Usage
```bash
./show-ram.sh <config-name>
docker stats <container-name> --no-stream --format "{{.Container}} | {{.MemUsage}} | {{.MemPerc}}"
```
### High Disk Usage
```bash
./show-db-size.sh
docker system df -v
```
### Disk I/O Bottleneck
```bash
iotop -o -d 1
```
---
## 8. Backup and Restore
### Creating a Backup
```bash
# Local backup (to /backup directory)
./backup-node.sh <config-name>
# Remote backup (to WebDAV)
./backup-node.sh <config-name> https://backup-server.tld/dav
```
### Restoring from Backup
```bash
# List available backups
./list-backups.sh
# Restore latest backup for config
./restore-volumes.sh <config-name>
# Restore from specific URL
./restore-volumes.sh <config-name> https://backup-server.tld/backup/
```
### Cloning a Node
```bash
# Clone a node to a new location
./clone-node.sh <config-name>
# Clone peers (for faster sync)
./clone-peers.sh <config-name>
```
### Nuclear Option: Full Reset
```bash
# WARNING: This deletes ALL data for the config
./stop.sh <config-name> && \
./rm.sh <config-name> && \
./delete-volumes.sh <config-name> && \
./delete-node-keys.sh <config-name> && \
./force-recreate.sh <config-name>
# Then check logs
./logs.sh <config-name>
```
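Because the reset chain is destructive, it can help to wrap it in a guard that requires the config name twice and supports a dry run. This is a sketch only: `full_reset` and the `DRY_RUN` variable are hypothetical, and the wrapper simply calls the same five scripts in the same order as the command above.

```bash
# Hypothetical wrapper around the reset sequence above.
# Usage: full_reset <config-name> <config-name-again>
# Set DRY_RUN=1 to print the commands instead of running them.
full_reset() {
  local config="$1" confirm="$2" step
  if [ -z "$config" ] || [ "$confirm" != "$config" ]; then
    echo "refusing: pass the config name twice to confirm" >&2
    return 1
  fi
  for step in stop rm delete-volumes delete-node-keys force-recreate; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "./${step}.sh ${config}"
    else
      # Stop at the first failing step, like the && chain does.
      "./${step}.sh" "$config" || return 1
    fi
  done
}

DRY_RUN=1 full_reset ethereum-mainnet-geth-pruned ethereum-mainnet-geth-pruned
```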
---
## 9. Common Error Messages
### Database Errors
| Error | Cause | Solution |
|---|---|---|
| `database is corrupted` | Power loss, bug | Restore from backup or resync |
| `database version mismatch` | Client version changed | Delete and resync |
### P2P Errors
| Error | Cause | Solution |
|---|---|---|
| `no configured peers` | Missing bootstrap nodes | Add bootstrap nodes |
| `handshake failed` | Chain ID mismatch | Verify genesis.json |
### RPC Errors
| Error | Cause | Solution |
|---|---|---|
| `method not found` | Wrong client | Use correct client |
| `connection refused` | Port not open | Check container running, port mapping |
---
## 10. OP Stack Specific Debugging
### OP Node Issues
```bash
# Check op-node logs
./logs.sh <config-name> | grep -i "op-node\|rollup\|sequencer"
# Check rollup configuration (if custom)
cat op/<network>/ethereum/rollup.json | jq .
# Check if rollup.json is mounted
docker exec <op-node-container> cat /config/rollup.json | jq .
```
### OP Wheel (Manual Intervention)
```bash
# Rewind to specific block (DANGEROUS - only if you know what you're doing)
./op-wheel.sh engine set-forkchoice \
--unsafe=<block-number> \
--safe=<block-number> \
--finalized=<block-number> \
--engine=http://<client-service>:8551/ \
--engine.open=http://<client-service>:8545 \
--engine.jwt-secret-path=/jwtsecret
# Nuclear option: finalize latest local block
./op-wheel-finalize-latest-block.sh <client-service> <node-service>
```
---
## 11. CometBFT Family (Cosmos, etc.) Specific
### Init Container Issues
```bash
# CometBFT chains use init.sh inside the container
# The master script is at scripts/cometbft-common.sh
# Check if init completed
./logs.sh <config-name> | grep -i "init\|setup\|complete"
# Check the init script
cat <network>/<chain>/scripts/init.sh
```
---
## 12. Quick Start Guide
### Starting a Node
```bash
# 1. Set up environment
IP=$(curl -s ipinfo.io/ip)   # keep the IP in a shell variable so DOMAIN can reuse it
echo "IP=$IP" > .env
echo "DOMAIN=${IP//./-}.traefik.me" >> .env
echo "MAIL=your-email@example.com" >> .env
# 2. Select which nodes to run
# Add compose files to COMPOSE_FILE (colon-separated)
echo "COMPOSE_FILE=base.yml:rpc.yml:ethereum-mainnet-geth-pruned.yml" >> .env
# 3. Start the node
docker compose up -d
# 4. Verify it's running
./show-status.sh
```
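The two derived values in the setup step, the traefik.me domain and the colon-separated `COMPOSE_FILE` list, can be sketched as small functions. The names `domain_for_ip` and `compose_file_value` are hypothetical. The first uses `tr` instead of the bash-only `${IP//./-}` expansion so it also runs under plain `sh`.

```bash
# Build the traefik.me test domain for an IPv4 address (dots become dashes).
domain_for_ip() {
  printf '%s.traefik.me\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Join compose file names with ':' as COMPOSE_FILE expects.
compose_file_value() {
  local IFS=:
  echo "$*"
}

domain_for_ip 203.0.113.42
# prints 203-0-113-42.traefik.me
compose_file_value base.yml rpc.yml ethereum-mainnet-geth-pruned.yml
# prints base.yml:rpc.yml:ethereum-mainnet-geth-pruned.yml
```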
### Accessing Your Node
```bash
# Once running, access via:
# HTTP: http://<your-domain>/ethereum-mainnet-geth-pruned
# HTTPS: https://<your-domain>/ethereum-mainnet-geth-pruned
# WebSocket: wss://<your-domain>/ethereum-mainnet-geth-pruned
# Or locally (if NO_SSL=true):
# HTTP: http://localhost:<port>
```
---
## 13. Configuration Reference
### Environment Variables
**Required for most setups:**
```bash
IP=203.0.113.42 # Your public IP
DOMAIN=203-0-113-42.traefik.me # Your domain (traefik.me for testing)
MAIL=your-email@example.com # For Let's Encrypt SSL
WHITELIST=0.0.0.0/0 # IP whitelist (0.0.0.0/0 = all)
```
**Optional:**
```bash
NO_SSL=true # Disable SSL (testing only)
CHAINS_SUBNET=192.168.0.0/26 # Docker network subnet
```
**Chain-specific (examples):**
```bash
ETHEREUM_MAINNET_EXECUTION_RPC=https://fallback-rpc.example.com
ARBITRUM_SEPOLIA_EXECUTION_RPC=https://arb-sepolia-rpc.example.com
OP_NODE_NETWORK=mainnet
OP_NODE_L1_RPC_URL=https://l1-rpc.example.com
```
### Compose File Structure
Each compose file defines one or more services:
- **client**: Execution layer (Geth, Erigon, Reth, etc.)
- **node**: Consensus/derivation node (op-node, lighthouse, etc.)
- **relay**: DA relay (eigenda-proxy, op-alt, etc.)
- **proxy**: HTTP/WS proxy (nginx, etc.)
- **database**: External database (Postgres, etc.)
### Volume Naming
Volumes are named after the config:
```
<config-name>_<service>_data
<config-name>_<service>_config
```
Example: `ethereum-mainnet-geth-pruned_client_data`
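
The naming pattern can be expressed as a one-line function, which is handy when scripting against volumes. The names `volume_name` and `list_config_volumes` are hypothetical; the second assumes Docker is installed and only filters `docker volume ls` by name.

```bash
# Compose the expected volume name: <config-name>_<service>_<kind>, kind defaults to "data".
volume_name() {
  echo "${1}_${2}_${3:-data}"
}

# List existing Docker volumes whose name contains the config name.
list_config_volumes() {
  docker volume ls --filter "name=$1" --format '{{.Name}}'
}

volume_name ethereum-mainnet-geth-pruned client   # prints ethereum-mainnet-geth-pruned_client_data
```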
---
## 14. Quick Debugging Checklist
Use this checklist when debugging an issue:
- [ ] **Is the container running?** → `./show-running.sh`
- [ ] **Are there errors in logs?** → `./logs.sh <config> | grep -i error`
- [ ] **Is the node synced?** → `./sync-status.sh <config>`
- [ ] **Are peers connected?** → `./peer-count.sh`
- [ ] **Are resources adequate?** → `./show-ram.sh`, `./show-db-size.sh`
- [ ] **Is P2P working?** → Check peer count
- [ ] **Is RPC responding?** → Test with curl
- [ ] **Is disk space available?** → `df -h /var/lib/docker`
- [ ] **Is the config file correct?** → `docker compose -f <file>.yml config`
- [ ] **Are environment variables set?** → Check `.env`
- [ ] **Is the genesis file correct?** → Check chain ID
---
## 15. When to Escalate
Escalate to a human operator if:
- [ ] Node stuck for > 2 hours with no progress
- [ ] Repeated `Fatal` or `panic` errors after restart
- [ ] Database corruption confirmed
- [ ] Issue affects multiple nodes across different chains
- [ ] Need to force-push to this repo
---
## 16. File Locations Quick Reference
| What You Need | Where to Find It |
|---|---|
| Compose files | Root of this repo (`*.yml`) |
| Operational scripts | Root of this repo (`*.sh`) |
| Chain assets | `<network>/<chain>/` or `<stack>/<network>/<settlement>/` |
| Genesis files | `<stack>/<network>/<settlement>/genesis.json` |
| Rollup configs | `op/<network>/<settlement>/rollup.json` |
| Custom Dockerfiles | `<path>/*.Dockerfile` |
| Init scripts | `<path>/scripts/init.sh` |
| CometBFT common | `scripts/cometbft-common.sh` |
| Compose registry | `compose_registry.json` |
| RPC endpoints | `reference-rpc-endpoint.json` |
| Environment | `.env` |
---
## 17. Resource Requirements Reference
| Node Type | Disk | RAM | CPU |
|---|---|---|---|
| Ethereum pruned | ~500GB | 8GB | 2+ cores |
| Ethereum archive | ~2TB+ | 16GB+ | 4+ cores |
| Ethereum archive-trace | ~4TB+ | 32GB+ | 8+ cores |
| L2 pruned | ~100-500GB | 4-8GB | 2+ cores |
| L2 archive | ~1-2TB | 8-16GB | 4+ cores |
**Note:** Requirements vary by chain. Check specific chain documentation.
---
*This file is your complete operations and debugging reference. For additional user documentation, see README.md.*