status sweep: cap check-health per node (timeout) so one stuck node can't wedge fleet rpc-update

A hung check-health.sh (aztec-testnet, looping on an unresponsive reference RPC) blocked show-status.sh's parallel 'wait' for 3.5h, hanging the whole fleet rpc-update and holding the deploy lock. Each curl was bounded (-m 3) and the retry loop capped (3x), but the call itself wasn't time-bounded.

- sync-status.sh: wrap each check-health.sh call in 'timeout ${HC_TIMEOUT:-30}' (-> exit 124 + 'timeout' status on overrun).
- show-status.sh: wrap the whole per-node sync-status.sh call in 'timeout ${SYNC_TIMEOUT:-60}' so the parallel wait can never block forever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
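
The pattern described in the message above can be sketched in isolation. This is a minimal illustration, not the code from this commit: check_node is a hypothetical helper, the node names are made up, and 'sleep' stands in for a per-node check that may hang. It assumes GNU coreutils 'timeout', which exits with status 124 when the limit is hit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a stand-in "check" under a time limit and
# map the exit status to a status word, as the commit message describes.
check_node() {
    local node=$1 limit=$2 work=$3 rc=0
    # 'sleep "$work"' stands in for a health check that may never return.
    timeout "$limit" sleep "$work" || rc=$?
    case $rc in
        0)   echo "$node ok" ;;
        124) echo "$node timeout" ;;   # GNU timeout: limit reached
        *)   echo "$node error($rc)" ;;
    esac
}

# Run both checks in parallel. Because every branch is capped, the
# 'wait' returns after roughly the limit even though node-b "hangs".
results=$(
    check_node node-a 1 0 &
    check_node node-b 1 5 &
    wait
)
echo "$results"
```

Without the 'timeout' wrapper, the 'wait' here would last as long as the slowest branch, which is the failure mode the message describes.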
2026-06-19 03:52:28 +00:00
parent 9ee59cf9fa
commit aefcd41a88
2 changed files with 14 additions and 7 deletions
--- a/show-status.sh
+++ b/show-status.sh
@@ -19,7 +19,9 @@ any_failure=false
 # Function to run the command and handle the result
 check_sync_status() {
    local part=$1
-    result=$("$BASEPATH/sync-status.sh" "${part%.yml}")
+    # Cap the whole per-node branch (belt-and-suspenders over check-health's own cap), so no single
+    # node can ever block the 'wait' below — that is what wedged the fleet rpc-update for hours.
+    result=$(timeout "${SYNC_TIMEOUT:-60}" "$BASEPATH/sync-status.sh" "${part%.yml}")

    code=0