Reproducible multi-agent experiments,
from hypothesis to paper-ready results
Run ablations across strategies and seeds, replay executions from checkpoints,
evaluate automatically, and export publication-ready tables —
without building custom experiment infrastructure.
Research infrastructure
shouldn't be your research
Multi-agent experiments require orchestration, evaluation, reproducibility, and statistical analysis. Most researchers build this from scratch for every paper — then throw it away.
of research time spent on infrastructure, not science
(Industry surveys, 2024–2025 ML practitioner reports)

major agent frameworks with built-in experiment grids, checkpoint replay, and publication export
(Survey of LangGraph, AutoGen, CrewAI, March 2026)

token overhead in multi-agent experiments without proper orchestration
(JamJet benchmark suite, local Ollama runs, March 2026)

Everything between
hypothesis and publication
Six Reasoning Strategies
ReAct, plan-and-execute, critic, reflection, consensus, debate — swap with a single parameter. Same agent, different reasoning. Perfect for ablation studies.
```python
agent = Agent(
    strategy="debate",  # swap to compare
    max_iterations=6,
)
```
ExperimentGrid
Run every combination of conditions and seeds in a single call. Cartesian product, parallel execution, automatic result collection.
```python
grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "debate"],
    },
    seeds=[42, 123, 456],
)
results = await grid.run()
```
Publication Export
Export results as LaTeX booktabs tables, CSV for R/pandas, or structured JSON. Mean ± std computed automatically.
```python
results.to_latex("table1.tex")
results.to_csv("results.csv")
results.compare("debate", "react")  # p-value
```
Durable Replay
Every execution is checkpointed. Replay any experiment exactly. Fork from any checkpoint with modified parameters for ablation studies.
```shell
$ jamjet replay exec_abc
$ jamjet fork exec_abc \
    --override-input '{"model":"gemini"}'
```
Built-in Evaluation
LLM-as-judge, assertion, latency, and cost scorers. Eval nodes run inside workflows for self-improving agents. CI exit codes on regression.
```yaml
# workflow.yaml
check:
  type: eval
  on_fail: retry_with_feedback
  max_retries: 2
```
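The `retry_with_feedback` behavior in the config above follows a common eval-in-the-loop pattern. A minimal pure-Python sketch of that pattern (the `run_with_eval` helper and toy agent here are illustrative, not the JamJet API):

```python
def run_with_eval(agent_fn, score_fn, task, threshold=0.8, max_retries=2):
    """Run agent_fn, score the output, and retry with feedback on failure.

    agent_fn(task, feedback) -> output; score_fn(output) -> float in [0, 1].
    Both are caller-supplied; this sketch only wires up the retry loop.
    """
    feedback = None
    for attempt in range(max_retries + 1):
        output = agent_fn(task, feedback)
        score = score_fn(output)
        if score >= threshold:
            return output, score, attempt
        # Below threshold: feed the score back so the next attempt can improve.
        feedback = f"Previous attempt scored {score:.2f}; revise and retry."
    return output, score, max_retries


# Toy demo: an "agent" that improves once it receives feedback.
def toy_agent(task, feedback):
    return "good answer" if feedback else "bad answer"

out, score, attempts = run_with_eval(
    toy_agent, lambda o: 1.0 if o == "good answer" else 0.0, "summarize"
)
# Succeeds on the second attempt, after one round of feedback.
```

The same loop generalizes to any scorer (LLM-judge, assertion, latency) as long as it returns a comparable number.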
Research Template
One command to scaffold a complete experiment: agents, baselines, evaluation datasets, experiment runner, and results directory.
```shell
$ jamjet init my-study \
    --template research
# agents/ baselines/ experiments/
# evals/ results/ workflow.yaml
```
Most agent frameworks prioritize apps
over experimental reproducibility
| Capability | JamJet | LangGraph | AutoGen | Custom scripts |
|---|---|---|---|---|
| Multi-agent orchestration | Native | Native | Native | Possible with custom setup |
| Durable replay | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Strategy comparison | 6 native strategies | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Experiment grid | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| LaTeX / CSV export | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Checkpoint fork | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Built-in eval harness | Native | External tooling required | External tooling required | Possible with custom setup |
| Per-node cost tracking | Native | Partial | Partial | Possible with custom setup |
| Statistical comparison | Native (Welch's t-test) | Possible with custom setup | Possible with custom setup | Possible with custom setup |
From hypothesis to Methods section
1. Scaffold: `jamjet init --template research`
2. Define agents: tools, strategies, instructions (15 min)
3. Run experiments: `ExperimentGrid` across conditions
4. Export results: LaTeX tables, CSV, statistical tests (1 command)
5. Reproduce: `jamjet replay` from checkpoint

One research afternoon, end to end
Compare 6 strategies on your dataset
```python
grid = ExperimentGrid(
    conditions={
        "strategy": [
            "react", "plan_and_execute", "critic",
            "reflection", "consensus", "debate",
        ],
    },
    seeds=[42, 123, 456],
)
results = await grid.run()
```
Export a LaTeX table for your paper
```python
results.to_latex("table1.tex", caption="Strategy comparison")
# Outputs booktabs table with mean ± std per condition
```
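For reference, a booktabs table of that shape would look roughly like the following (an illustrative sketch using two of the strategies, not verbatim JamJet output):

```latex
\begin{table}[t]
  \centering
  \caption{Strategy comparison}
  \begin{tabular}{lr}
    \toprule
    Strategy & Score (mean $\pm$ std) \\
    \midrule
    react  & $0.71 \pm 0.04$ \\
    debate & $0.89 \pm 0.02$ \\
    \bottomrule
  \end{tabular}
\end{table}
```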
Replay a failed condition — no re-running prior steps
```shell
$ jamjet replay exec_debate_seed42
# Restores from checkpoint. Saves tokens + cost.
```
Compute significance between conditions
```python
results.compare("debate", "react")
# => {p_value: 0.023, effect_size: 0.41, significant: true}
```
Fork for an ablation study
```shell
$ jamjet fork exec_debate_seed42 \
    --override-input '{"model":"gpt-4o"}'
# Same execution, different model. Instant ablation.
```
Start as a simple Python agent, scale into reproducible experiment runs — without rewriting your stack. See the quickstart →
What a result looks like
Task: summarize a 2,000-word policy document. 6 strategies, 3 seeds each. Scored by LLM-judge (0–1). Local Ollama, Llama 3.
| Strategy | Score (mean ± std) | Tokens | Latency | Cost |
|---|---|---|---|---|
| react | 0.71 ± 0.04 | 1,240 | 2.1s | $0.002 |
| plan_and_execute | 0.78 ± 0.03 | 1,890 | 3.4s | $0.003 |
| critic | 0.82 ± 0.05 | 2,410 | 4.2s | $0.004 |
| reflection | 0.84 ± 0.02 | 3,100 | 5.8s | $0.005 |
| consensus | 0.86 ± 0.03 | 4,520 | 7.1s | $0.007 |
| debate | 0.89 ± 0.02 | 5,880 | 9.3s | $0.009 |
debate vs. react: p = 0.012 (Welch's t-test, n = 3 seeds). This table was generated by `results.to_latex("table1.tex")` — zero manual formatting.
Illustrative results from internal testing. Your numbers will vary by model, task, and hardware.
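Welch's t-test is easy to sanity-check by hand. A stdlib-only sketch of the t statistic and Welch–Satterthwaite degrees of freedom (the per-seed scores below are made-up stand-ins; the final p-value step is omitted because it needs a t-distribution CDF, e.g. from `scipy.stats`):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples.

    Uses sample variances and the Welch-Satterthwaite approximation;
    pass (t, df) to a t-distribution CDF to obtain a p-value.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

# Made-up per-seed judge scores for two conditions (n = 3 seeds each).
debate = [0.89, 0.87, 0.91]
react = [0.71, 0.67, 0.75]
t, df = welch_t(debate, react)
# t ≈ 6.97, df ≈ 2.94 for these inputs
```

Unlike Student's t-test, Welch's variant does not assume equal variances across conditions, which is why it suits noisy LLM runs.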
Why not just scripts?
Custom scripts work for one-off experiments. They break down when you need to reproduce, compare, or build on prior work.
Custom scripts
- Reproducibility depends on discipline, not tooling
- No checkpoint — a crash reruns everything from scratch
- Manual experiment matrix loops with ad-hoc seed handling
- Result formatting is copy-paste or custom code
- No built-in cost tracking — discovered after the bill
- Comparing strategies requires rewriting orchestration code
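For comparison, the hand-rolled version of a conditions × seeds matrix typically looks like this (a sketch, with a placeholder `run_experiment` standing in for a real agent run):

```python
import itertools
import random

def run_experiment(strategy, seed):
    # Placeholder for a real agent run; seeding is easy to forget here.
    random.seed(seed)
    return {"strategy": strategy, "seed": seed, "score": random.random()}

strategies = ["react", "debate"]
seeds = [42, 123, 456]

# Hand-rolled Cartesian product: every (strategy, seed) pair, run serially.
results = [
    run_experiment(strategy, seed)
    for strategy, seed in itertools.product(strategies, seeds)
]
# 2 strategies x 3 seeds = 6 runs; parallelism, retries, checkpointing,
# and result aggregation are all still on you.
```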
JamJet
- Every execution event-sourced — replay from any checkpoint
- Crash recovery built in — resume exactly where it stopped
- `ExperimentGrid` handles conditions × seeds automatically
- One call to `to_latex()`, `to_csv()`, or `to_json()`
- Per-node token and cost tracking, visible in real time
- Change `strategy="debate"` to `strategy="react"` — same agent, different reasoning
Patterns from published research
LLM Delegate Protocol
Identity-aware agent routing with quality scores, governed sessions, and provenance tracking. JamJet integration via the `ProtocolAdapter` trait.
Deliberative Collective Intelligence
Structured multi-agent deliberation with four reasoning archetypes and typed epistemic acts. Patterns now available as JamJet strategies and examples.
Built for how you work
- Multi-agent systems (AAMAS, NeurIPS workshops): orchestration + evaluation + reproducibility
- LLM reasoning (CoT, ToT, debate, reflection): strategy parameter makes A/B testing trivial
- Tool-augmented LLMs (ReAct, Toolformer): MCP-native tool integration
- AI safety & alignment (HITL, guardrails): human-in-the-loop + policy engine
- Evaluation & benchmarks (AgentBench, GAIA): eval harness + batch runner + CI gates
- Agent communication (negotiation, persuasion): native A2A + LDP protocol support

Start your experiment
From pip install to running multi-agent experiments in under 5 minutes.