Shadow: auto-fired post-run analysis (metrics + message reliability)#285
Draft
radiken wants to merge 19 commits into
First end-to-end Shadow integration in 10ksim. A new `shadow-gossipsub` experiment runs N nim libp2p peers + a publisher host inside a single Shadow process, mirroring the smoke test we validated by hand. Same shape as a k8s run from the user's POV: `python deployment.py shadow-gossipsub`, or driven by `Multiple` for scale runs.
Layout:
- src/deployments/shadow/builders.py: pure data rendering. render_shadow_yaml builds the inline sim config (1_gbit_switch network, N peer hosts running ./main with SHADOWENV=true, one publisher host running traffic_sync.py). build_configmap packs shadow.yaml + traffic_sync.py (read from 10ksim's publisher_headless source) into a k8s ConfigMap. build_shadow_job builds the runner Job (init container fetches the dynamic-linked test-node binary into an emptyDir, main container is radiken/dst-shadow-base with the required seccomp Unconfined + SYS_PTRACE security context).
- src/deployments/shadow/runtime.py: wait_for_job_complete polls the k8s Job status (the BaseExperiment wait_for_rollout helper doesn't understand Jobs). pull_shadow_logs collects Shadow's stdout + every per-host stdout/stderr into the experiment's output folder.
- src/deployments/experiments/libp2p/shadow_gossipsub.py: the experiment class. ExpConfig with image tags overridable per run for reproducibility.
A wrinkle worth documenting in the runner command: once the Job's pod hits Phase=Succeeded, kubectl exec and kubectl cp both refuse to run, so we can't tar shadow.data out after the fact. Instead the runner container, as its last act, cats every per-host stdout/stderr into its own stdout with unique =BEGIN===<path>= ... =END= markers. pull_shadow_logs parses these out of kubectl logs (which works on completed pods). This avoids PVCs and sleep hacks.
Validated against a real cluster: 10 peers, 5 published messages, 100% delivery, mesh formed in ~65s simulated time, full per-host logs landed in the run output folder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addressing self-review of PR #276 before requesting reviewers.
runtime.py:
- Raise a new ShadowLogParseError when the SHADOW_DONE_EXIT marker is missing, instead of returning silently. The experiment's events.log was recording `logs_pulled` even when no per-host files were extracted.
- Move the trailing newline into the marker regexes (`===\n?`) so each match's end() lands past it cleanly. This avoids the +1 hack and tolerates kubectl output without a final newline.
- Don't double-strip trailing newlines from per-host log bodies: the previous rstrip(b"\n") would consume legitimate blank lines that were in the source file.
- Log a warning instead of info when Shadow itself exits non-zero. Today the Job's Failed condition still surfaces this; the warning is a defense against any future bash wrapper change that might mask the rc.
kube_utils.py:
- get_cleanup_resources + cleanup_resources now know about ConfigMap. Without this the Shadow experiment leaks one CM per run (silently dropped by the `try/except KeyError` in get_cleanup_resources).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
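The parsing rules in this commit (fail loudly when the done marker is missing, fold the optional trailing newline into the marker regex, leave blank lines in the body alone) could look roughly like the sketch below. The marker spellings `===BEGIN===<path>===`, `===END===` and `SHADOW_DONE_EXIT=<rc>`, and the function name `parse_runner_output`, are assumptions for illustration; only `ShadowLogParseError`, the `SHADOW_DONE_EXIT` name and the `===\n?` regex tail come from the commit message.

```python
import re


class ShadowLogParseError(Exception):
    """Raised when the runner output cannot be parsed reliably."""


# Assumed marker spellings; the real runner's markers may differ.
BEGIN_RE = re.compile(rb"===BEGIN===(?P<path>[^\n=]+)===\n?")
END_RE = re.compile(rb"===END===\n?")
DONE_RE = re.compile(rb"SHADOW_DONE_EXIT=(?P<rc>\d+)")


def parse_runner_output(raw: bytes) -> tuple[int, dict[str, bytes]]:
    """Return (shadow exit code, {path: body}) extracted from pod stdout."""
    done = DONE_RE.search(raw)
    if done is None:
        # Fail loudly instead of reporting an empty but "successful" pull.
        raise ShadowLogParseError("SHADOW_DONE_EXIT marker not found")

    files: dict[str, bytes] = {}
    pos = 0
    while True:
        begin = BEGIN_RE.search(raw, pos)
        if begin is None:
            break
        end = END_RE.search(raw, begin.end())
        if end is None:
            raise ShadowLogParseError("BEGIN marker without matching END")
        # begin.end() is already past the marker's newline, so no +1 is needed,
        # and the body is taken verbatim so blank lines survive.
        files[begin.group("path").decode()] = raw[begin.end():end.start()]
        pos = end.end()
    return int(done.group("rc")), files
```

Because the newline after each marker is optional in the regex, the same code handles `kubectl logs` output that ends without a final newline.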
`out/*` has a middle separator so gitignore anchors it to the .gitignore location (repo root). The actual experiment output lives at `src/deployments/experiments/out/` (per BaseExperiment._setup_log_paths) so the rule was never matching the right place; every experiment run showed `?? src/deployments/experiments/out/` in git status. `out/` (trailing slash, no middle separator) is unanchored and matches the directory at any depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subclasses Multiple to drive shadow-gossipsub at 10/30/100 nodes. Sets a 60s delay between runs (Shadow finishes in seconds wall clock, so the default 120s is overkill). Validated via --dry-run: all three iterations dispatch cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shadow wrote every per-host stdout/stderr and metrics file to the pod's stdout (wrapped in markers) so we could parse them out of kubectl logs. That floods VictoriaLogs at scale: ~200KB/peer, so ~20MB at 100 peers and ~2GB extrapolated to 10k. Mount a Longhorn PVC at /sim/run so Shadow writes shadow.data there; after the Job finishes a short-lived reader pod mounts the same claim and we kubectl cp the tree out. Pod stdout now carries only Shadow's own progress output (~1MB at 100 peers, down from 20MB). Also makes the storeMetrics cadence configurable (METRICS_INTERVAL_S, 15s default for Shadow) so the last scrape captures post-traffic counters, and teaches kube_utils cleanup about PersistentVolumeClaim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fold a flatten step into pull_shadow_logs: after copying shadow.data off the PVC, concatenate each host's stdout+stderr into logs/<host>.log (skipping Shadow's .shimlog noise). Shadow runs the same test-node binary as k8s, so the log lines are identical and the mesh_analysis FileStack / FileReader read the result with no analyzer changes. Every run now drops a flat, ready-to-analyze logs/ dir. Scoped to logs; metrics handled separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
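A minimal sketch of the flatten step, assuming Shadow's output tree looks like `shadow.data/hosts/<host>/<process>.stdout|.stderr|.shimlog`. The function name `flatten_host_logs` and the exact file naming are illustrative assumptions, not the code in pull_shadow_logs.

```python
from pathlib import Path


def flatten_host_logs(shadow_data: Path, out_dir: Path) -> list[Path]:
    """Concatenate each host's stdout+stderr files into out_dir/<host>.log."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    hosts_dir = shadow_data / "hosts"
    for host_dir in sorted(p for p in hosts_dir.iterdir() if p.is_dir()):
        # Keep only stdout/stderr; .shimlog and any other files are skipped.
        parts = [
            f for f in host_dir.iterdir()
            if f.is_file() and f.suffix in {".stdout", ".stderr"}
        ]
        if not parts:
            continue
        # Group by process, stdout before stderr within each process.
        parts.sort(key=lambda f: (f.stem, f.suffix != ".stdout"))
        target = out_dir / f"{host_dir.name}.log"
        with target.open("wb") as dst:
            for part in parts:
                dst.write(part.read_bytes())
        written.append(target)
    return written
```

Copying bytes rather than decoded text keeps the log lines identical to what the binary wrote, which is what lets a line-based reader consume them unchanged.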
Shadow can't be scraped live, so storeMetrics appends per-peer /metrics snapshots to files. To analyze bandwidth with the existing PromQL pipeline, spin up a throwaway dockerized VictoriaMetrics, split each peer file into snapshots, and import them with synthesized timestamps (snapshot order x interval) plus pod/namespace labels. The VM is then queryable exactly like the lab one, so the same rate(libp2p_network_bytes_total[...]) queries work. Verified end-to-end on a 10-peer run: VM totals match a direct sum of the last snapshot per peer, and rate() returns throughput. Logs side handled separately by the FileStack flatten. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
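The timestamp synthesis described above can be sketched as follows: each snapshot gets `start + index * interval`, and every sample line gains `pod`/`namespace` labels before import. The helper names `stamp_snapshot` and `stamp_peer` are hypothetical, and the sketch assumes the snapshots have already been split and that sample lines carry no timestamp of their own. The resulting text is in Prometheus exposition format with millisecond timestamps, which VictoriaMetrics can ingest through its Prometheus text import endpoint.

```python
def stamp_snapshot(text: str, ts_ms: int, labels: dict[str, str]) -> str:
    """Append labels and a millisecond timestamp to every sample line."""
    extra = ",".join(f'{key}="{value}"' for key, value in labels.items())
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # drop blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # not a "name value" sample line
        if series.endswith("}"):
            inner = series[:-1]
            sep = "" if inner.endswith(("{", ",")) else ","
            series = f"{inner}{sep}{extra}}}"
        else:
            series = f"{series}{{{extra}}}"
        out.append(f"{series} {value} {ts_ms}")
    return "\n".join(out) + "\n"


def stamp_peer(snapshots: list[str], peer: str, namespace: str,
               start_ms: int, interval_s: int) -> str:
    """Timestamp snapshot i at start_ms + i * interval_s (snapshot order x interval)."""
    labels = {"pod": peer, "namespace": namespace}
    return "".join(
        stamp_snapshot(snap, start_ms + i * interval_s * 1000, labels)
        for i, snap in enumerate(snapshots)
    )
```

Spacing the synthesized timestamps by the real storeMetrics interval is what makes `rate(...)` over the imported counters return a meaningful per-second throughput.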
scrape_run_metrics() imports a Shadow run into the throwaway VM, then runs the existing Scrapper with the libp2p metric set against it and dumps the same per-metric CSVs the k8s path produces under <run_dir>/metrics/. Imports tag each series with the labels a k8s scrape adds (pod, instance, job, node) so the existing PromQL + extract_field analysis reads a Shadow run unchanged: present metrics (network in/out, peers, open streams, gossipsub peers, nim gc) dump CSVs; cAdvisor metrics (container_*) have no data in the sim and skip gracefully. Scrapper's kube_config is now optional since the port-forward path is unused. Verified on a 10-peer run: produces libp2p-in/out (per-peer throughput by time), libp2p-peers, open-streams, low/high-peers, nim-gc CSVs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- shadow_metrics main(): list produced files directly; the rglob("*.csv")
matched nothing since CSVs are written without a .csv suffix (k8s convention).
- Drop the unused query() helper.
- Fix stale header label list and the Notion doc name in shadow_gossipsub.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#278 moved the inter-run delay to config.delay and the target experiment to config.name (set in model_post_init), dropping delay_between_exps and the experiment_name field. Update MultiShadowGossipsub to match: set config.name + config.delay in model_post_init and register Multiple.add_args. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Analyze a Shadow run's per-peer Prometheus dumps like a k8s run: import them into a throwaway dockerized VictoriaMetrics (synthesized timestamps + k8s-style pod/instance/job/node labels), then run the existing Scrapper against it for the same per-metric CSVs (libp2p-in/out, peers, open-streams, etc.). scrapper.py: kube_config made optional (the port-forward path is unused), so analysis doesn't need cluster access. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep this PR to the run path (experiment + builders + runtime). The ephemeral-VictoriaMetrics metrics analysis and the Scrapper change move to #285. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse the rationale/"why" comment blocks and multi-line docstrings to short single-liners; keep only the non-obvious gotchas (SHADOWENV literal, daemon stop_time). No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse the module header and multi-line docstrings to terse one-liners. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add _run_analysis() to shadow_gossipsub, called at the end of _run on success: bandwidth CSVs via the ephemeral-VM metrics path + message reliability via Nimlibp2pAnalyzer on the flattened logs. Best-effort (each wrapped) so a Docker or analysis hiccup never fails the run. Now `deployment.py shadow-gossipsub` does deploy + analysis in one shot, like the connmanager experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
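The "best-effort, each wrapped" pattern could look like the sketch below: every analysis step runs in its own try/except so that one failure is logged and the rest still run. The helper `run_best_effort` and the step names in the usage note are hypothetical, not the actual body of `_run_analysis()`.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_best_effort(steps: dict[str, Callable[[], object]]) -> dict[str, bool]:
    """Run each step independently; log failures instead of propagating them."""
    results: dict[str, bool] = {}
    for name, step in steps.items():
        try:
            step()
            results[name] = True
        except Exception:
            # A Docker or analysis hiccup must not fail the already-finished run.
            logger.exception("analysis step %r failed; continuing", name)
            results[name] = False
    return results
```

A caller might pass something like `{"metrics": scrape_metrics, "reliability": analyze_logs}` (placeholder callables) and only inspect the returned flags for reporting.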
Makes a Shadow run analyze itself like a k8s run, mirroring the connmanager experiment (deploy + analysis + plots in one `deployment.py` invocation, defined in the experiment).
Depends on #276
- `analysis/metrics/shadow_metrics.py`: split each peer's `/metrics` snapshots, import into a throwaway dockerized VictoriaMetrics (synthesized timestamps + k8s-style labels), run the existing `Scrapper` -> same per-metric CSVs.
- `shadow_gossipsub._run_analysis()`: fires after the run (best-effort, never fails it) -> bandwidth CSVs (above) + message reliability via `Nimlibp2pAnalyzer` on the flattened `logs/` (FileStack).
- `scrapper.py`: `kube_config` made optional (unused port-forward path).
Needs Docker locally for the ephemeral VM.