
Simulations

Simulations let you test optimization changes against historical data before deploying them to production, ensuring quality is maintained while reducing costs.

How Simulations Work

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Select 2. Configure 3. Execute │
│ Historical New Model/ Replay │
│ Data Settings Requests │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ 10,000 │ ────► │ GPT-3.5│ ────► │ Run on │ │
│ │requests│ │ turbo │ │ subset │ │
│ └────────┘ └────────┘ └────────┘ │
│ │ │
│ ▼ │
│ 6. Decision 5. Quality 4. Compare │
│ Gate Scoring Outputs │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Apply? │ ◄──── │ 97.2% │ ◄──── │Original│ │
│ │ │ │ match │ │vs New │ │
│ └────────┘ └────────┘ └────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
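
The same pipeline can be driven programmatically. Below is a minimal sketch using the REST endpoint documented under API Access; steps 3-5 run asynchronously on the server, and anything about the response shape beyond "a simulation record" is an assumption, not a documented contract.

// Sketch: create a simulation over historical data via the REST API
// (see "API Access" below), then poll for results to reach the decision gate.
const res = await fetch('https://ladger.pages.dev/api/v1/simulations', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ladger_sk_live_...',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    flowName: 'customer-support',       // 1. select historical data
    spanName: 'classify-intent',
    sampleSize: 500,
    change: { model: 'gpt-3.5-turbo' }, // 2. configure the change to test
    qualityThreshold: 0.95,             // 6. decision gate: pass at ≥ 95% similarity
  }),
});
const simulation = await res.json();    // assumed: response describes the created simulation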

Creating a Simulation

Step 1: Select Scope

Choose what to simulate:

┌─────────────────────────────────────────────────────────────────┐
│ CREATE SIMULATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Scope │
│ ○ Flow: [customer-support ▼] │
│ ○ Span: [classify-intent ▼] │
│ │
│ Time Range │
│ ○ Last 7 days ● Last 30 days ○ Custom │
│ │
│ Sample Size │
│ [500 requests ▼] (5% of 10,000 total) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 2: Configure Changes

Specify the optimization to test:

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Change Type │
│ ● Model Switch ○ Prompt Change ○ Parameter Tuning │
│ │
│ Current Model: GPT-4o │
│ New Model: [GPT-3.5-turbo ▼] │
│ │
│ Additional Settings │
│ Temperature: [0.7 ▼] (current: 0.7) │
│ Max Tokens: [150 ▼] (current: 150) │
│ │
└─────────────────────────────────────────────────────────────────┘
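
In the API, the selected change is expressed as a change object. Only the model-switch form appears in the documented examples; the prompt and parameter forms below use assumed field names for illustration.

// Possible change payloads for the three change types. "model" matches the
// documented API example; "prompt" and "parameters" are assumed field names.
const modelSwitch = { model: 'gpt-3.5-turbo' };
const promptChange = { prompt: 'Classify the user intent into one label.' };       // assumed field
const parameterTuning = { parameters: { temperature: 0.7, maxTokens: 150 } };      // assumed fields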

Step 3: Set Quality Threshold

Define what “acceptable quality” means:

┌─────────────────────────────────────────────────────────────────┐
│ QUALITY SETTINGS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Minimum Quality Score │
│ [95% ▼] Pass if similarity ≥ this threshold │
│ │
│ Comparison Method │
│ ● Semantic Similarity (embedding distance) │
│ ○ Exact Match (for structured outputs) │
│ ○ Custom Evaluator (provide function) │
│ │
│ Failure Handling │
│ ● Stop on first failure batch │
│ ○ Continue and report failures │
│ │
└─────────────────────────────────────────────────────────────────┘
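
If neither semantic similarity nor exact match fits your output format, the Custom Evaluator option lets you supply your own scoring function. A minimal sketch, assuming the evaluator receives the original and simulated outputs and returns a score between 0 and 1 (the exact signature is not documented here):

// Hypothetical custom evaluator: exact-match scoring for a structured
// classifier output. The signature and field names are assumptions.
function customEvaluator({ original, simulated }) {
  return original.trim() === simulated.trim() ? 1 : 0;
}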

Running Simulations

Simulation Progress

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION: classify-intent Model Switch │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Status: Running ⏳ │
│ Progress: ████████████████████████░░░░░░░░░░ 72% │
│ │
│ Requests Processed: 360 / 500 │
│ Time Elapsed: 2m 15s │
│ Est. Remaining: 52s │
│ │
│ Live Metrics: │
│ • Quality Score: 97.8% ✓ │
│ • Avg Latency: 180ms (was 320ms) │
│ • Cost per Request: $0.0018 (was $0.025) │
│ • Errors: 0 │
│ │
│ [Pause] [Cancel] │
│ │
└─────────────────────────────────────────────────────────────────┘

Results Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION RESULTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Status: PASSED ✓ │
│ │
│ ┌──────────────────┬────────────────┬────────────────┐ │
│ │ Metric │ Original │ Simulation │ │
│ ├──────────────────┼────────────────┼────────────────┤ │
│ │ Model │ GPT-4o │ GPT-3.5-turbo │ │
│ │ Avg Latency │ 320ms │ 180ms (-44%) │ │
│ │ Cost/Request │ $0.025 │ $0.0018 (-93%) │ │
│ │ Quality Score │ 100% (base) │ 97.2% ✓ │ │
│ │ Requests Tested │ - │ 500 │ │
│ │ Failures │ - │ 14 (2.8%) │ │
│ └──────────────────┴────────────────┴────────────────┘ │
│ │
│ Projected Monthly Savings: $2,340 │
│ │
│ [Apply Changes] [View Failures] [Run Again] [Export] │
│ │
└─────────────────────────────────────────────────────────────────┘
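
The projected savings figure extrapolates the per-request cost delta onto monthly traffic. A back-of-the-envelope sketch; monthlyRequests is a placeholder for your own volume, not a value taken from this dashboard.

// Savings projection: per-request cost delta times monthly request volume.
const costPerRequestOld = 0.025;   // GPT-4o, from the results table
const costPerRequestNew = 0.0018;  // GPT-3.5-turbo, from the results table
const monthlyRequests = 100_000;   // placeholder volume
const projectedMonthlySavings =
  (costPerRequestOld - costPerRequestNew) * monthlyRequests;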

Analyzing Results

Quality Breakdown

View per-request quality scores:

┌─────────────────────────────────────────────────────────────────┐
│ QUALITY BREAKDOWN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Score Distribution: │
│ 100% ██████████████████████████████████░░ 89% │
│ 95-99% ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 8% │
│ 90-94% █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2% (failures) │
│ Under 90% ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1% (failures) │
│ │
└─────────────────────────────────────────────────────────────────┘

Failure Analysis

Review cases where quality dropped:

┌─────────────────────────────────────────────────────────────────┐
│ FAILURE ANALYSIS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 14 requests below threshold (2.8%) │
│ │
│ Common Patterns: │
│ • 8 failures: Complex multi-step queries │
│ • 4 failures: Ambiguous intent │
│ • 2 failures: Edge case inputs │
│ │
│ Sample Failure: │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Input: "Can you help me with my issue from yesterday │ │
│ │ about the billing thing?" │ │
│ │ │ │
│ │ Original (GPT-4o): "billing_inquiry" │ │
│ │ Simulation (GPT-3.5): "general_question" │ │
│ │ Similarity: 78% │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Recommendation: Consider hybrid approach - use GPT-3.5 for │
│ simple queries, GPT-4o for complex/ambiguous ones. │
│ │
└─────────────────────────────────────────────────────────────────┘
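
The recommended hybrid approach can be prototyped with a simple router in your own application code. A sketch, using an illustrative complexity heuristic that is not part of the product:

// Route simple queries to the cheaper model and complex or ambiguous ones to
// GPT-4o. The heuristic below is purely illustrative.
function pickModel(query) {
  const looksComplex =
    query.split(/[.!?]/).filter(Boolean).length > 1 ||              // multi-step phrasing
    /\b(yesterday|that thing|the issue|something)\b/i.test(query);  // vague references
  return looksComplex ? 'gpt-4o' : 'gpt-3.5-turbo';
}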

A/B Testing

Instead of replaying historical data, you can run a live A/B test on a slice of production traffic:

┌─────────────────────────────────────────────────────────────────┐
│ A/B TEST CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Control Group (A) │
│ Model: GPT-4o │
│ Traffic: 90% │
│ │
│ Test Group (B) │
│ Model: GPT-3.5-turbo │
│ Traffic: 10% │
│ │
│ Duration: [7 days ▼] │
│ Success Metric: [Quality Score ▼] ≥ 95% │
│ Sample Size: Auto (stop at statistical significance) │
│ │
│ [Start A/B Test] │
│ │
└─────────────────────────────────────────────────────────────────┘
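
The same configuration could be expressed in code. The sketch below assumes a hypothetical ladger.abTests.create method that mirrors the documented ladger.simulations.schedule call; it is not a documented API.

// Hypothetical SDK call mirroring the A/B test form above; the method name
// and field names are assumptions, not a documented API.
await ladger.abTests.create({
  scope: { flowName: 'customer-support', spanName: 'classify-intent' },
  control: { model: 'gpt-4o', traffic: 0.9 },
  variant: { model: 'gpt-3.5-turbo', traffic: 0.1 },
  durationDays: 7,
  successMetric: { metric: 'qualityScore', threshold: 0.95 },
});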

Scheduling Simulations

Automate simulation runs:

// API example: schedule a weekly simulation
await ladger.simulations.schedule({
  name: 'Weekly classifier check',
  scope: { flowName: 'customer-support', spanName: 'classify-intent' },
  change: { model: 'gpt-3.5-turbo' },
  schedule: '0 9 * * MON', // Every Monday at 9am
  qualityThreshold: 0.95,
  notify: ['team@company.com'],
});

API Access

Create Simulation

curl -X POST "https://ladger.pages.dev/api/v1/simulations" \
  -H "Authorization: Bearer ladger_sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "flowName": "customer-support",
    "spanName": "classify-intent",
    "dateRange": {
      "start": "2024-01-01",
      "end": "2024-01-31"
    },
    "sampleSize": 500,
    "change": {
      "model": "gpt-3.5-turbo"
    },
    "qualityThreshold": 0.95
  }'
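
The {id} path parameter in the results endpoint below refers to the simulation created by this request.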

Get Results

curl -X GET "https://ladger.pages.dev/api/v1/simulations/{id}/results" \
  -H "Authorization: Bearer ladger_sk_live_..."

Best Practices

  1. Start small: Test with a 5% sample before running full simulations
  2. Review failures: Always analyze why requests fell below the threshold
  3. Consider edge cases: Ensure your sample includes diverse inputs
  4. Set realistic thresholds: A 100% quality match is usually unnecessary
  5. Track over time: Re-run simulations periodically as your data changes

Next Steps