The container ran as user 0:994 and accessed the docker socket via group
membership, but the host docker group GID is auto-assigned and varies per
host (e.g. uk-8 is 988, not 994), so the hardcoded gid silently breaks
telegraf's docker input wherever it differs (uk-8 was in a restart loop:
permission denied on /var/run/docker.sock). Run as root (0:0) with
entrypoint [telegraf] to skip the image's gosu privilege-drop, so telegraf
reads the socket as its owner regardless of the host docker gid. Works
uniformly fleet-wide; no regression on hosts where the gid happened to match.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
netdev (netlink) was reading the container's own veth (idle) -> node_network_*
showed ~0 on every host. Host network lets it see real host interfaces, so
bandwidth (egress vs tariff quota) works. Removed networks/expose (exclusive
with network_mode:host).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds diskIO to cadvisor --enable_metrics so per-container disk read/write
(container_fs_reads/writes_bytes_total) is exposed — lets node attribution
name which node is the noisy IO neighbor, not just flag the host.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
prometheus-scrape.metrics_path was '/' which made prometheus-docker-sd scrape
the HTML root and fail with 'INVALID is not a valid start token', leaving the
targets up=0. Fixes per-container (cadvisor) + host (node_exporter) metrics so
they can be wired into the DRPC insights MCP for per-node resource attribution.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>