Shadow: auto-fired post-run analysis (metrics + message reliability) #285

Draft
radiken wants to merge 19 commits into master from Alan/shadow-analysis

Conversation

@radiken
Contributor

@radiken radiken commented May 28, 2026

Makes a Shadow run analyze itself like a k8s run, mirroring the connmanager experiment (deploy + analysis + plots in one deployment.py invocation, defined in the experiment).

Depends on #276

  • analysis/metrics/shadow_metrics.py: split each peer's /metrics snapshots, import into a throwaway dockerized VictoriaMetrics (synthesized timestamps + k8s-style labels), run the existing Scrapper -> same per-metric CSVs.
  • shadow_gossipsub._run_analysis(): fires after the run (best-effort, never fails it) -> bandwidth CSVs (above) + message reliability via Nimlibp2pAnalyzer on the flattened logs/ (FileStack).
  • scrapper.py: kube_config made optional (unused port-forward path).

Needs Docker locally for the ephemeral VM.

radiken and others added 15 commits May 26, 2026 22:55
First end-to-end Shadow integration in 10ksim. A new `shadow-gossipsub`
experiment runs N nim libp2p peers + a publisher host inside a single Shadow
process, mirroring the smoke test we validated by hand. Same shape as a k8s run
from the user's POV: `python deployment.py shadow-gossipsub` or driven by
`Multiple` for scale runs.

Layout:
- src/deployments/shadow/builders.py: pure data rendering. render_shadow_yaml
  builds the inline sim config (1_gbit_switch network, N peer hosts running
  ./main with SHADOWENV=true, one publisher host running traffic_sync.py).
  build_configmap packs shadow.yaml + traffic_sync.py (read from 10ksim's
  publisher_headless source) into a k8s ConfigMap. build_shadow_job builds
  the runner Job (init container fetches the dynamic-linked test-node binary
  into an emptyDir, main container is radiken/dst-shadow-base with the
  required seccomp Unconfined + SYS_PTRACE security context).
- src/deployments/shadow/runtime.py: wait_for_job_complete polls the k8s Job
  status (the BaseExperiment wait_for_rollout helper doesn't understand Jobs).
  pull_shadow_logs collects Shadow's stdout + every per-host stdout/stderr
  into the experiment's output folder.
- src/deployments/experiments/libp2p/shadow_gossipsub.py: the experiment
  class. ExpConfig with image tags overridable per run for reproducibility.
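
The Job-polling idea behind wait_for_job_complete can be illustrated with a minimal sketch. This is not the PR's implementation: `poll_job_until_done`, `job_outcome`, and the injected `fetch_status` callable are hypothetical names, and the status is assumed to be a plain dict shaped like a Kubernetes Job's `status` field (a `conditions` list whose entries carry `type` and `status`).

```python
import time
from typing import Callable, Mapping, Optional


def job_outcome(status: Mapping) -> Optional[str]:
    """Return "Complete" or "Failed" once the Job reports that condition as True."""
    for cond in status.get("conditions") or []:
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"]
    return None


def poll_job_until_done(
    fetch_status: Callable[[], Mapping],  # hypothetical: returns the Job's status dict
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the Job completes; raise if it fails or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        outcome = job_outcome(fetch_status())
        if outcome == "Complete":
            return True
        if outcome == "Failed":
            raise RuntimeError("Job reported a Failed condition")
        if time.monotonic() >= deadline:
            raise TimeoutError("Job did not finish before the timeout")
        sleep(poll_s)
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a cluster; in a real run the callable would wrap whatever client call reads the Job status.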

A wrinkle worth documenting in the runner command: once the Job's pod hits
Phase=Succeeded, kubectl exec and kubectl cp both refuse, so we can't tar
shadow.data out after the fact. Instead the runner container, as its last
act, cats every per-host stdout/stderr into its own stdout with unique
=BEGIN===<path>= ... =END= markers. pull_shadow_logs parses these out of
kubectl logs (which works on completed pods). Avoids PVCs and sleep hacks.
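
As a rough illustration of the marker approach, the sketch below splits a captured log stream back into per-file bodies. The marker strings (`===BEGIN <path>===` and `===END===`) and the function name `split_marked_logs` are assumptions made for this example, not the exact format the runner emits.

```python
import re
from typing import Dict

# Hypothetical marker format; the optional trailing newline is part of the match
# so body slices start and end cleanly without manual +1 offsets.
BEGIN = re.compile(rb"===BEGIN (?P<path>[^=\n]+)===\n?")
END = re.compile(rb"===END===\n?")


def split_marked_logs(blob: bytes) -> Dict[str, bytes]:
    """Extract {path: body} for every BEGIN/END block found in the stream."""
    files: Dict[str, bytes] = {}
    pos = 0
    while True:
        begin = BEGIN.search(blob, pos)
        if begin is None:
            break
        end = END.search(blob, begin.end())
        if end is None:
            raise ValueError(f"unterminated block for {begin.group('path')!r}")
        # Keep the body byte-for-byte, including any blank lines it contains.
        files[begin.group("path").decode()] = blob[begin.end():end.start()]
        pos = end.end()
    return files
```

Text outside the markers (for example the simulator's own progress output) is simply ignored by this sketch.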

Validated against a real cluster: 10 peers, 5 published messages, 100%
delivery, mesh formed in ~65s simulated time, full per-host logs landed
in the run output folder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addressing self-review of PR #276 before requesting reviewers.

runtime.py:
- Raise a new ShadowLogParseError when the SHADOW_DONE_EXIT marker is
  missing, instead of returning silently. The experiment's events.log was
  recording `logs_pulled` even when no per-host files were extracted.
- Move the trailing newline into the marker regexes (`===\n?`) so the
  match end()s past it cleanly. Avoids the +1 hack and tolerates kubectl
  output without a final newline.
- Don't double-strip trailing newlines from per-host log bodies: previous
  rstrip(b"\n") would consume legitimate blank lines that were in the
  source file.
- Log a warning instead of info when Shadow itself exits non-zero. Today
  the Job's Failed condition still surfaces this; the warning is defense
  for any future bash wrapper change that might mask the rc.

kube_utils.py:
- get_cleanup_resources + cleanup_resources now know about ConfigMap.
  Without this the Shadow experiment leaks one CM per run (silently
  dropped by the `try/except KeyError` in get_cleanup_resources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`out/*` has a middle separator so gitignore anchors it to the .gitignore
location (repo root). The actual experiment output lives at
`src/deployments/experiments/out/` (per BaseExperiment._setup_log_paths)
so the rule was never matching the right place; every experiment run
showed `?? src/deployments/experiments/out/` in git status.

`out/` (trailing slash, no middle separator) is unanchored and matches
the directory at any depth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Subclasses Multiple to drive shadow-gossipsub at 10/30/100 nodes.
Sets a 60s delay between runs (Shadow finishes in seconds wall clock,
so the default 120s is overkill).

Validated via --dry-run: all three iterations dispatch cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The runner previously wrote every per-host stdout/stderr and metrics file to
the pod's stdout (wrapped in markers) so we could parse them out of kubectl logs.
That floods VictoriaLogs at scale: ~200KB/peer, so ~20MB at 100 peers and
~2GB extrapolated to 10k.

Mount a Longhorn PVC at /sim/run so Shadow writes shadow.data there; after
the Job finishes a short-lived reader pod mounts the same claim and we
kubectl cp the tree out. Pod stdout now carries only Shadow's own progress
output (~1MB at 100 peers, down from 20MB).

Also makes the storeMetrics cadence configurable (METRICS_INTERVAL_S, 15s
default for Shadow) so the last scrape captures post-traffic counters, and
teaches kube_utils cleanup about PersistentVolumeClaim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fold a flatten step into pull_shadow_logs: after copying shadow.data off
the PVC, concatenate each host's stdout+stderr into logs/<host>.log
(skipping Shadow's .shimlog noise). Shadow runs the same test-node binary
as k8s, so the log lines are identical and the mesh_analysis FileStack /
FileReader read the result with no analyzer changes. Every run now drops a
flat, ready-to-analyze logs/ dir. Scoped to logs; metrics handled separately.
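
A minimal sketch of such a flatten step is shown below. It assumes each host has its own directory containing files that end in `.stdout`, `.stderr`, or `.shimlog`; the function name `flatten_host_logs` and its arguments are hypothetical and not taken from the PR.

```python
from pathlib import Path
from typing import List


def flatten_host_logs(hosts_dir: Path, logs_dir: Path) -> List[Path]:
    """Concatenate each host's stdout and stderr files into logs/<host>.log."""
    logs_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for host in sorted(p for p in hosts_dir.iterdir() if p.is_dir()):
        # Only stdout/stderr are kept; .shimlog (and anything else) is skipped.
        parts = sorted(
            (f for f in host.iterdir() if f.is_file() and f.suffix in {".stdout", ".stderr"}),
            key=lambda f: (f.suffix != ".stdout", f.name),  # stdout before stderr
        )
        if not parts:
            continue
        target = logs_dir / f"{host.name}.log"
        with target.open("wb") as out:
            for part in parts:
                out.write(part.read_bytes())
        written.append(target)
    return written
```

Writing bytes rather than text avoids any re-encoding, so the flattened lines stay identical to what the binary emitted.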

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shadow can't be scraped live, so storeMetrics appends per-peer /metrics
snapshots to files. To analyze bandwidth with the existing PromQL pipeline,
spin up a throwaway dockerized VictoriaMetrics, split each peer file into
snapshots, and import them with synthesized timestamps (snapshot index ×
interval) plus pod/namespace labels. The VM is then queryable exactly like
the lab one, so the same rate(libp2p_network_bytes_total[...]) queries work.

Verified end-to-end on a 10-peer run: VM totals match a direct sum of the
last snapshot per peer, and rate() returns throughput. Logs side handled
separately by the FileStack flatten.
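
The timestamp synthesis and labelling can be sketched as follows. This is an illustration under stated assumptions, not the code in shadow_metrics.py: the snapshots are assumed to be already split into separate strings, sample lines are assumed to carry no timestamp of their own, and `relabel_sample` and `render_snapshots` are hypothetical names. The output lines use the Prometheus text exposition format with an explicit millisecond timestamp.

```python
from typing import Dict, Iterable, List, Optional


def relabel_sample(line: str, labels: Dict[str, str], ts_ms: int) -> Optional[str]:
    """Append extra labels and a timestamp to one exposition line; skip comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    series, _, value = line.rpartition(" ")
    extra = ",".join(f'{key}="{val}"' for key, val in labels.items())
    if series.endswith("}"):
        head = series[:-1]  # drop the closing brace, then re-add it after our labels
        sep = "" if head.endswith("{") else ","
        series = f"{head}{sep}{extra}}}"
    else:
        series = f"{series}{{{extra}}}"
    return f"{series} {value} {ts_ms}"


def render_snapshots(
    snapshots: Iterable[str], labels: Dict[str, str], base_ms: int, interval_s: int
) -> List[str]:
    """Give snapshot i the timestamp base_ms + i * interval_s (in milliseconds)."""
    lines: List[str] = []
    for index, snapshot in enumerate(snapshots):
        ts_ms = base_ms + index * interval_s * 1000
        for raw in snapshot.splitlines():
            sample = relabel_sample(raw, labels, ts_ms)
            if sample is not None:
                lines.append(sample)
    return lines
```

Because every snapshot of a peer gets an evenly spaced timestamp, a `rate()` over the imported counters sees a regular scrape interval, which is what lets the existing queries run unchanged.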

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scrape_run_metrics() imports a Shadow run into the throwaway VM, then runs
the existing Scrapper with the libp2p metric set against it and dumps the
same per-metric CSVs the k8s path produces under <run_dir>/metrics/.

Imports tag each series with the labels a k8s scrape adds (pod, instance,
job, node) so the existing PromQL + extract_field analysis reads a Shadow
run unchanged: present metrics (network in/out, peers, open streams,
gossipsub peers, nim gc) dump CSVs; cAdvisor metrics (container_*) have no
data in the sim and skip gracefully. Scrapper's kube_config is now optional
since the port-forward path is unused.

Verified on a 10-peer run: produces libp2p-in/out (per-peer throughput by
time), libp2p-peers, open-streams, low/high-peers, nim-gc CSVs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- shadow_metrics main(): list produced files directly; the rglob("*.csv")
  matched nothing since CSVs are written without a .csv suffix (k8s convention).
- Drop the unused query() helper.
- Fix stale header label list and the Notion doc name in shadow_gossipsub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#278 moved the inter-run delay to config.delay and the target experiment to
config.name (set in model_post_init), dropping delay_between_exps and the
experiment_name field. Update MultiShadowGossipsub to match: set
config.name + config.delay in model_post_init and register Multiple.add_args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Analyze a Shadow run's per-peer Prometheus dumps like a k8s run: import them
into a throwaway dockerized VictoriaMetrics (synthesized timestamps + k8s-style
pod/instance/job/node labels), then run the existing Scrapper against it for the
same per-metric CSVs (libp2p-in/out, peers, open-streams, etc.).

scrapper.py: kube_config made optional (the port-forward path is unused), so
analysis doesn't need cluster access.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep this PR to the run path (experiment + builders + runtime). The
ephemeral-VictoriaMetrics metrics analysis and the Scrapper change move to
#285.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@radiken radiken marked this pull request as ready for review May 28, 2026 20:47
@radiken radiken self-assigned this May 28, 2026
@radiken radiken added the enhancement New feature or request label May 28, 2026
@radiken radiken added this to DST May 28, 2026
@radiken radiken linked an issue May 28, 2026 that may be closed by this pull request
@radiken radiken moved this to In review in DST May 28, 2026
radiken and others added 4 commits May 28, 2026 22:00
Collapse the rationale/"why" comment blocks and multi-line docstrings to short
single-liners; keep only the non-obvious gotchas (SHADOWENV literal, daemon
stop_time). No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Collapse the module header and multi-line docstrings to terse one-liners.
No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add _run_analysis() to shadow_gossipsub, called at the end of _run on success:
bandwidth CSVs via the ephemeral-VM metrics path + message reliability via
Nimlibp2pAnalyzer on the flattened logs. Best-effort (each wrapped) so a Docker
or analysis hiccup never fails the run. Now `deployment.py shadow-gossipsub`
does deploy + analysis in one shot, like the connmanager experiment.
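
The "best-effort, each wrapped" pattern can be sketched like this. The helper name `run_best_effort` and the step names are hypothetical; the real _run_analysis() may structure its steps differently.

```python
import logging
from typing import Callable, Dict, Iterable, Tuple

logger = logging.getLogger("shadow_analysis_sketch")


def run_best_effort(steps: Iterable[Tuple[str, Callable[[], None]]]) -> Dict[str, bool]:
    """Run each analysis step in isolation; log failures instead of raising."""
    results: Dict[str, bool] = {}
    for name, step in steps:
        try:
            step()
            results[name] = True
        except Exception:
            # A Docker or analysis failure is recorded but never propagates to the run.
            logger.warning("analysis step %r failed; continuing", name, exc_info=True)
            results[name] = False
    return results
```

Wrapping each step separately means a failure in the metrics path does not prevent the reliability analysis from running, and neither can turn a successful run into a failed one.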

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@radiken radiken changed the title Shadow metrics analysis via ephemeral VictoriaMetrics Shadow: auto-fired post-run analysis (metrics + message reliability) May 28, 2026
@radiken radiken marked this pull request as draft May 29, 2026 09:04

Labels

enhancement New feature or request

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Shadow

1 participant