Shadow: auto-fired post-run analysis (metrics + message reliability) by radiken · Pull Request #285 · vacp2p/10ksim

radiken · 2026-05-28T20:39:42Z

Makes a Shadow run analyze itself like a k8s run, mirroring the connmanager experiment (deploy + analysis + plots in one deployment.py invocation, defined in the experiment).

Depends on #276

- analysis/metrics/shadow_metrics.py: split each peer's /metrics snapshots, import into a throwaway dockerized VictoriaMetrics (synthesized timestamps + k8s-style labels), run the existing Scrapper -> same per-metric CSVs.
- shadow_gossipsub._run_analysis(): fires after the run (best-effort, never fails it) -> bandwidth CSVs (above) + message reliability via Nimlibp2pAnalyzer on the flattened logs/ (FileStack).
- scrapper.py: kube_config made optional (unused port-forward path).

Needs Docker locally for the ephemeral VM.

First end-to-end Shadow integration in 10ksim. A new `shadow-gossipsub` experiment runs N nim libp2p peers + a publisher host inside a single Shadow process, mirroring the smoke we validated by hand. Same shape as a k8s run from the user's POV: `python deployment.py shadow-gossipsub` or driven by `Multiple` for scale runs.

Layout:
- src/deployments/shadow/builders.py: pure data rendering. render_shadow_yaml builds the inline sim config (1_gbit_switch network, N peer hosts running ./main with SHADOWENV=true, one publisher host running traffic_sync.py). build_configmap packs shadow.yaml + traffic_sync.py (read from 10ksim's publisher_headless source) into a k8s ConfigMap. build_shadow_job builds the runner Job (init container fetches the dynamic-linked test-node binary into an emptyDir, main container is radiken/dst-shadow-base with the required seccomp Unconfined + SYS_PTRACE security context).
- src/deployments/shadow/runtime.py: wait_for_job_complete polls the k8s Job status (the BaseExperiment wait_for_rollout helper doesn't understand Jobs). pull_shadow_logs collects Shadow's stdout + every per-host stdout/stderr into the experiment's output folder.
- src/deployments/experiments/libp2p/shadow_gossipsub.py: the experiment class. ExpConfig with image tags overridable per run for reproducibility.

A wrinkle worth documenting in the runner command: once the Job's pod hits Phase=Succeeded, kubectl exec and kubectl cp both refuse, so we can't tar shadow.data out after the fact. Instead the runner container, as its last act, cats every per-host stdout/stderr into its own stdout with unique =BEGIN===<path>= ... =END= markers. pull_shadow_logs parses these out of kubectl logs (which works on completed pods). Avoids PVCs and sleep hacks.

Validated against a real cluster: 10 peers, 5 published messages, 100% delivery, mesh formed in ~65s simulated time, full per-host logs landed in the run output folder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addressing self-review of PR #276 before requesting reviewers.

runtime.py:
- Raise a new ShadowLogParseError when the SHADOW_DONE_EXIT marker is missing, instead of returning silently. The experiment's events.log was recording `logs_pulled` even when no per-host files were extracted.
- Move the trailing newline into the marker regexes (`===\n?`) so the match end()s past it cleanly. Avoids the +1 hack and tolerates kubectl output without a final newline.
- Don't double-strip trailing newlines from per-host log bodies: previous rstrip(b"\n") would consume legitimate blank lines that were in the source file.
- Log a warning instead of info when Shadow itself exits non-zero. Today the Job's Failed condition still surfaces this; the warning is defense for any future bash wrapper change that might mask the rc.

kube_utils.py:
- get_cleanup_resources + cleanup_resources now know about ConfigMap. Without this the Shadow experiment leaks one CM per run (silently dropped by the `try/except KeyError` in get_cleanup_resources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
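The marker-parsing fixes in this commit can be illustrated with a minimal sketch. It is not the code in runtime.py: the exact marker spelling (`===BEGIN===<path>===` / `===END===`), the function name `extract_host_logs`, and the way the done marker is checked are all assumptions made for illustration. Only the ideas come from the commit message: fail loudly when the done marker is missing, let the regex consume an optional trailing newline, and leave log bodies unstripped.

```python
import re

# Hypothetical marker spelling; the real markers in runtime.py may differ.
BEGIN_RE = re.compile(rb"===BEGIN===(?P<path>[^\n=]+)===\n?")
END_RE = re.compile(rb"===END===\n?")


class ShadowLogParseError(Exception):
    """Raised when the pod log does not contain the expected markers."""


def extract_host_logs(raw: bytes, done_marker: bytes = b"SHADOW_DONE_EXIT") -> dict[str, bytes]:
    """Split marker-wrapped per-host files out of a pod's stdout."""
    if done_marker not in raw:
        # Fail loudly instead of silently returning an empty result.
        raise ShadowLogParseError("done marker not found in pod log")

    files: dict[str, bytes] = {}
    pos = 0
    while (begin := BEGIN_RE.search(raw, pos)) is not None:
        end = END_RE.search(raw, begin.end())
        if end is None:
            raise ShadowLogParseError(f"unterminated block for {begin['path']!r}")
        # The optional newline is part of the marker match, so no "+1" offset
        # is needed, and the body is kept as-is (no rstrip) so blank lines
        # that were in the source file survive.
        files[begin["path"].decode()] = raw[begin.end():end.start()]
        pos = end.end()
    return files
```

Because the newline is optional in both patterns, the same function handles kubectl output that ends exactly at the last marker without a final newline.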

`out/*` has a middle separator so gitignore anchors it to the .gitignore location (repo root). The actual experiment output lives at `src/deployments/experiments/out/` (per BaseExperiment._setup_log_paths) so the rule was never matching the right place; every experiment run showed `?? src/deployments/experiments/out/` in git status. `out/` (trailing slash, no middle separator) is unanchored and matches the directory at any depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Subclasses Multiple to drive shadow-gossipsub at 10/30/100 nodes. Sets a 60s delay between runs (Shadow finishes in seconds wall clock, so the default 120s is overkill). Validated via --dry-run: all three iterations dispatch cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shadow wrote every per-host stdout/stderr and metrics file to the pod's stdout (wrapped in markers) so we could parse them out of kubectl logs. That floods VictoriaLogs at scale: ~200KB/peer, so ~20MB at 100 peers and ~2GB extrapolated to 10k. Mount a Longhorn PVC at /sim/run so Shadow writes shadow.data there; after the Job finishes a short-lived reader pod mounts the same claim and we kubectl cp the tree out. Pod stdout now carries only Shadow's own progress output (~1MB at 100 peers, down from 20MB). Also makes the storeMetrics cadence configurable (METRICS_INTERVAL_S, 15s default for Shadow) so the last scrape captures post-traffic counters, and teaches kube_utils cleanup about PersistentVolumeClaim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fold a flatten step into pull_shadow_logs: after copying shadow.data off the PVC, concatenate each host's stdout+stderr into logs/<host>.log (skipping Shadow's .shimlog noise). Shadow runs the same test-node binary as k8s, so the log lines are identical and the mesh_analysis FileStack / FileReader read the result with no analyzer changes. Every run now drops a flat, ready-to-analyze logs/ dir. Scoped to logs; metrics handled separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
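A minimal sketch of such a flatten step, under stated assumptions: the directory layout (one subdirectory per host, with files ending in `.stdout`, `.stderr`, or `.shimlog`) and the function name `flatten_host_logs` are illustrative and not taken from pull_shadow_logs itself.

```python
from pathlib import Path


def flatten_host_logs(hosts_dir: Path, out_dir: Path) -> list[Path]:
    """Concatenate each host's stdout and stderr files into out_dir/<host>.log."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for host_dir in sorted(p for p in hosts_dir.iterdir() if p.is_dir()):
        # Keep only stdout/stderr; anything else (e.g. *.shimlog) is skipped.
        parts = sorted(
            (f for f in host_dir.iterdir()
             if f.is_file() and f.suffix in {".stdout", ".stderr"}),
            # stdout files first, then stderr, each group ordered by name.
            key=lambda f: (f.suffix != ".stdout", f.name),
        )
        if not parts:
            continue
        target = out_dir / f"{host_dir.name}.log"
        with target.open("wb") as dst:
            for part in parts:
                dst.write(part.read_bytes())
        written.append(target)
    return written
```

The result is one flat file per host, which is the shape a file-based log reader can consume without knowing anything about the simulator's directory tree.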

Shadow can't be scraped live, so storeMetrics appends per-peer /metrics snapshots to files. To analyze bandwidth with the existing PromQL pipeline, spin up a throwaway dockerized VictoriaMetrics, split each peer file into snapshots, and import them with synthesized timestamps (snapshot order x interval) plus pod/namespace labels. The VM is then queryable exactly like the lab one, so the same rate(libp2p_network_bytes_total[...]) queries work. Verified end-to-end on a 10-peer run: VM totals match a direct sum of the last snapshot per peer, and rate() returns throughput. Logs side handled separately by the FileStack flatten. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
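The timestamp synthesis (snapshot order x interval) and label injection can be sketched as a pure function over one already-split snapshot. This is an illustration, not the code in shadow_metrics.py: the function name `stamp_snapshot`, the `start_ms` parameter, and the assumption that the dumped lines carry no timestamps of their own are hypothetical. The output lines are in Prometheus text exposition format with a millisecond timestamp appended, which a VictoriaMetrics instance could accept through its Prometheus-format import endpoint (also an assumption about how the import is done).

```python
def stamp_snapshot(
    snapshot: str,
    index: int,
    interval_s: int,
    start_ms: int,
    labels: dict[str, str],
) -> list[str]:
    """Add extra labels and a synthesized timestamp to every sample line."""
    # Snapshot N is placed N intervals after the (arbitrary) start time.
    ts_ms = start_ms + index * interval_s * 1000
    extra = ",".join(f'{k}="{v}"' for k, v in labels.items())

    out: list[str] = []
    for line in snapshot.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "<series> <value>" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        if "{" in series:
            head, _, _ = series.rpartition("}")
            sep = "" if head.endswith(("{", ",")) else ","
            series = f"{head}{sep}{extra}}}"
        else:
            series = f"{series}{{{extra}}}"
        out.append(f"{series} {value} {ts_ms}")
    return out
```

With a 15 s interval, the third snapshot (index 2) lands 30 s after the start time, so a `rate()` query over the imported series sees evenly spaced samples.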

scrape_run_metrics() imports a Shadow run into the throwaway VM, then runs the existing Scrapper with the libp2p metric set against it and dumps the same per-metric CSVs the k8s path produces under <run_dir>/metrics/. Imports tag each series with the labels a k8s scrape adds (pod, instance, job, node) so the existing PromQL + extract_field analysis reads a Shadow run unchanged: present metrics (network in/out, peers, open streams, gossipsub peers, nim gc) dump CSVs; cAdvisor metrics (container_*) have no data in the sim and skip gracefully. Scrapper's kube_config is now optional since the port-forward path is unused. Verified on a 10-peer run: produces libp2p-in/out (per-peer throughput by time), libp2p-peers, open-streams, low/high-peers, nim-gc CSVs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- shadow_metrics main(): list produced files directly; the rglob("*.csv") matched nothing since CSVs are written without a .csv suffix (k8s convention).
- Drop the unused query() helper.
- Fix stale header label list and the Notion doc name in shadow_gossipsub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#278 moved the inter-run delay to config.delay and the target experiment to config.name (set in model_post_init), dropping delay_between_exps and the experiment_name field. Update MultiShadowGossipsub to match: set config.name + config.delay in model_post_init and register Multiple.add_args. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Analyze a Shadow run's per-peer Prometheus dumps like a k8s run: import them into a throwaway dockerized VictoriaMetrics (synthesized timestamps + k8s-style pod/instance/job/node labels), then run the existing Scrapper against it for the same per-metric CSVs (libp2p-in/out, peers, open-streams, etc.). scrapper.py: kube_config made optional (the port-forward path is unused), so analysis doesn't need cluster access. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep this PR to the run path (experiment + builders + runtime). The ephemeral-VictoriaMetrics metrics analysis and the Scrapper change move to #285. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Collapse the rationale/"why" comment blocks and multi-line docstrings to short single-liners; keep only the non-obvious gotchas (SHADOWENV literal, daemon stop_time). No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Collapse the module header and multi-line docstrings to terse one-liners. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add _run_analysis() to shadow_gossipsub, called at the end of _run on success: bandwidth CSVs via the ephemeral-VM metrics path + message reliability via Nimlibp2pAnalyzer on the flattened logs. Best-effort (each wrapped) so a Docker or analysis hiccup never fails the run. Now `deployment.py shadow-gossipsub` does deploy + analysis in one shot, like the connmanager experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
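The best-effort pattern described here, with each analysis step wrapped individually, might look like the following sketch. The helper name `run_best_effort` and the step names in the usage note are hypothetical; the actual _run_analysis() may be structured differently.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_best_effort(steps: dict[str, Callable[[], None]]) -> dict[str, bool]:
    """Run every step; log failures and keep going instead of raising."""
    results: dict[str, bool] = {}
    for name, step in steps.items():
        try:
            step()
            results[name] = True
        except Exception:
            # A failing step is reported but never propagates to the caller,
            # so the run itself is still recorded as successful.
            logger.warning("analysis step %r failed; continuing", name, exc_info=True)
            results[name] = False
    return results
```

A caller could pass something like `{"metrics": scrape, "reliability": analyze}` (both placeholders) and inspect the returned mapping to see which outputs are available. Wrapping each step separately means a missing Docker daemon only costs the metrics CSVs, not the log-based reliability analysis.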

radiken and others added 15 commits May 26, 2026 22:55

Merge remote-tracking branch 'origin/master' into Alan/shadow-experiment

6066612

make format

561f80a

Merge remote-tracking branch 'origin/master' into Alan/shadow-experiment

b5244b8

Move Shadow metrics analysis to its own PR (#285)

13e21cd

radiken mentioned this pull request May 28, 2026

Shadow simulator: run GossipSub experiments #276

Open

radiken marked this pull request as ready for review May 28, 2026 20:47

radiken requested review from AlbertoSoutullo and PearsonWhite May 28, 2026 20:49

radiken self-assigned this May 28, 2026

radiken added the enhancement New feature or request label May 28, 2026

radiken added this to DST May 28, 2026

radiken linked an issue May 28, 2026 that may be closed by this pull request

Shadow #220

Open

radiken moved this to In review in DST May 28, 2026

radiken mentioned this pull request May 28, 2026

Shadow runner base image with dynamic test-node build and configurable metrics interval vacp2p/dst-libp2p-test-node#28

Open

radiken and others added 4 commits May 28, 2026 22:00

shadow: trim verbose comments in shadow_metrics

20531ac

Merge branch 'Alan/shadow-experiment' into Alan/shadow-analysis

97ae243

radiken changed the title ~~Shadow metrics analysis via ephemeral VictoriaMetrics~~ Shadow: auto-fired post-run analysis (metrics + message reliability) May 28, 2026

radiken marked this pull request as draft May 29, 2026 09:04

radiken wants to merge 19 commits into master from Alan/shadow-analysis
