
Simulations

Simulations let you test optimization changes against historical data before deploying them to production, ensuring quality is maintained while reducing costs.

How Simulations Work

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Select 2. Configure 3. Execute │
│ Historical New Model/ Replay │
│ Data Settings Requests │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ 10,000 │ ────► │ GPT-3.5│ ────► │ Run on │ │
│ │requests│ │ turbo │ │ subset │ │
│ └────────┘ └────────┘ └────────┘ │
│ │ │
│ ▼ │
│ 6. Decision 5. Quality 4. Compare │
│ Gate Scoring Outputs │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Apply? │ ◄──── │ 97.2% │ ◄──── │Original│ │
│ │ │ │ match │ │vs New │ │
│ └────────┘ └────────┘ └────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
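
The same pipeline can be driven programmatically. Below is a minimal sketch using the REST endpoint documented under API Access; steps 3-5 run asynchronously on the server, and anything about the response shape beyond "a simulation record" is an assumption, not a documented contract.

// Sketch: create a simulation over historical data via the REST API
// (see "API Access" below), then poll for results to reach the decision gate.
const res = await fetch('https://ladger.pages.dev/api/v1/simulations', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ladger_sk_live_...',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    flowName: 'customer-support',       // 1. select historical data
    spanName: 'classify-intent',
    sampleSize: 500,
    change: { model: 'gpt-3.5-turbo' }, // 2. configure the change to test
    qualityThreshold: 0.95,             // 6. decision gate: pass at ≥ 95% similarity
  }),
});
const simulation = await res.json();    // assumed: response describes the created simulation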

Creating a Simulation

Step 1: Select Scope

Choose what to simulate:

┌─────────────────────────────────────────────────────────────────┐
│ CREATE SIMULATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Scope │
│ ○ Flow: [customer-support ▼] │
│ ○ Span: [classify-intent ▼] │
│ │
│ Time Range │
│ ○ Last 7 days ● Last 30 days ○ Custom │
│ │
│ Sample Size │
│ [500 requests ▼] (5% of 10,000 total) │
│ │
└─────────────────────────────────────────────────────────────────┘

Step 2: Configure Changes

Specify the optimization to test:

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Change Type │
│ ● Model Switch ○ Prompt Change ○ Parameter Tuning │
│ │
│ Current Model: GPT-4o │
│ New Model: [GPT-3.5-turbo ▼] │
│ │
│ Additional Settings │
│ Temperature: [0.7 ▼] (current: 0.7) │
│ Max Tokens: [150 ▼] (current: 150) │
│ │
└─────────────────────────────────────────────────────────────────┘
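
In the API, the selected change is expressed as a change object. Only the model-switch form appears in the documented examples; the prompt and parameter forms below use assumed field names for illustration.

// Possible change payloads for the three change types. "model" matches the
// documented API example; "prompt" and "parameters" are assumed field names.
const modelSwitch = { model: 'gpt-3.5-turbo' };
const promptChange = { prompt: 'Classify the user intent into one label.' };       // assumed field
const parameterTuning = { parameters: { temperature: 0.7, maxTokens: 150 } };      // assumed fields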

Step 3: Set Quality Threshold

Define what “acceptable quality” means:

┌─────────────────────────────────────────────────────────────────┐
│ QUALITY SETTINGS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Minimum Quality Score │
│ [95% ▼] Pass if similarity ≥ this threshold │
│ │
│ Comparison Method │
│ ● Semantic Similarity (embedding distance) │
│ ○ Exact Match (for structured outputs) │
│ ○ Custom Evaluator (provide function) │
│ │
│ Failure Handling │
│ ● Stop on first failure batch │
│ ○ Continue and report failures │
│ │
└─────────────────────────────────────────────────────────────────┘
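
If neither semantic similarity nor exact match fits your output format, the Custom Evaluator option lets you supply your own scoring function. A minimal sketch, assuming the evaluator receives the original and simulated outputs and returns a score between 0 and 1 (the exact signature is not documented here):

// Hypothetical custom evaluator: exact-match scoring for a structured
// classifier output. The signature and field names are assumptions.
function customEvaluator({ original, simulated }) {
  return original.trim() === simulated.trim() ? 1 : 0;
}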

Running Simulations

Simulation Progress

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION: classify-intent Model Switch │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Status: Running ⏳ │
│ Progress: ████████████████████████░░░░░░░░░░ 72% │
│ │
│ Requests Processed: 360 / 500 │
│ Time Elapsed: 2m 15s │
│ Est. Remaining: 52s │
│ │
│ Live Metrics: │
│ • Quality Score: 97.8% ✓ │
│ • Avg Latency: 180ms (was 320ms) │
│ • Cost per Request: $0.0018 (was $0.025) │
│ • Errors: 0 │
│ │
│ [Pause] [Cancel] │
│ │
└─────────────────────────────────────────────────────────────────┘

Results Dashboard

┌─────────────────────────────────────────────────────────────────┐
│ SIMULATION RESULTS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Status: PASSED ✓ │
│ │
│ ┌──────────────────┬────────────────┬────────────────┐ │
│ │ Metric │ Original │ Simulation │ │
│ ├──────────────────┼────────────────┼────────────────┤ │
│ │ Model │ GPT-4o │ GPT-3.5-turbo │ │
│ │ Avg Latency │ 320ms │ 180ms (-44%) │ │
│ │ Cost/Request │ $0.025 │ $0.0018 (-93%) │ │
│ │ Quality Score │ 100% (base) │ 97.2% ✓ │ │
│ │ Requests Tested │ - │ 500 │ │
│ │ Failures │ - │ 14 (2.8%) │ │
│ └──────────────────┴────────────────┴────────────────┘ │
│ │
│ Projected Monthly Savings: $2,340 │
│ │
│ [Apply Changes] [View Failures] [Run Again] [Export] │
│ │
└─────────────────────────────────────────────────────────────────┘
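
The projected savings figure extrapolates the per-request cost delta onto monthly traffic. A back-of-the-envelope sketch; monthlyRequests is a placeholder for your own volume, not a value taken from this dashboard.

// Savings projection: per-request cost delta times monthly request volume.
const costPerRequestOld = 0.025;   // GPT-4o, from the results table
const costPerRequestNew = 0.0018;  // GPT-3.5-turbo, from the results table
const monthlyRequests = 100_000;   // placeholder volume
const projectedMonthlySavings =
  (costPerRequestOld - costPerRequestNew) * monthlyRequests;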

Analyzing Results

Quality Breakdown

View per-request quality scores:

┌─────────────────────────────────────────────────────────────────┐
│ QUALITY BREAKDOWN │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Score Distribution: │
│ 100% ██████████████████████████████████░░ 89% │
│ 95-99% ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 8% │
│ 90-94% █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2% (failures) │
│ Under 90% ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1% (failures) │
│ │
└─────────────────────────────────────────────────────────────────┘

Failure Analysis

Review cases where quality dropped:

┌─────────────────────────────────────────────────────────────────┐
│ FAILURE ANALYSIS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 14 requests below threshold (2.8%) │
│ │
│ Common Patterns: │
│ • 8 failures: Complex multi-step queries │
│ • 4 failures: Ambiguous intent │
│ • 2 failures: Edge case inputs │
│ │
│ Sample Failure: │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Input: "Can you help me with my issue from yesterday │ │
│ │ about the billing thing?" │ │
│ │ │ │
│ │ Original (GPT-4o): "billing_inquiry" │ │
│ │ Simulation (GPT-3.5): "general_question" │ │
│ │ Similarity: 78% │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ Recommendation: Consider hybrid approach - use GPT-3.5 for │
│ simple queries, GPT-4o for complex/ambiguous ones. │
│ │
└─────────────────────────────────────────────────────────────────┘
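
The recommended hybrid approach can be prototyped with a simple router in your own application code. A sketch, using an illustrative complexity heuristic that is not part of the product:

// Route simple queries to the cheaper model and complex or ambiguous ones to
// GPT-4o. The heuristic below is purely illustrative.
function pickModel(query) {
  const looksComplex =
    query.split(/[.!?]/).filter(Boolean).length > 1 ||              // multi-step phrasing
    /\b(yesterday|that thing|the issue|something)\b/i.test(query);  // vague references
  return looksComplex ? 'gpt-4o' : 'gpt-3.5-turbo';
}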

A/B Testing

Instead of replaying historical data, you can run a live A/B test on a slice of production traffic:

┌─────────────────────────────────────────────────────────────────┐
│ A/B TEST CONFIGURATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Control Group (A) │
│ Model: GPT-4o │
│ Traffic: 90% │
│ │
│ Test Group (B) │
│ Model: GPT-3.5-turbo │
│ Traffic: 10% │
│ │
│ Duration: [7 days ▼] │
│ Success Metric: [Quality Score ▼] ≥ 95% │
│ Sample Size: Auto (stop at statistical significance) │
│ │
│ [Start A/B Test] │
│ │
└─────────────────────────────────────────────────────────────────┘
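
The same configuration could be expressed in code. The sketch below assumes a hypothetical ladger.abTests.create method that mirrors the documented ladger.simulations.schedule call; it is not a documented API.

// Hypothetical SDK call mirroring the A/B test form above; the method name
// and field names are assumptions, not a documented API.
await ladger.abTests.create({
  scope: { flowName: 'customer-support', spanName: 'classify-intent' },
  control: { model: 'gpt-4o', traffic: 0.9 },
  variant: { model: 'gpt-3.5-turbo', traffic: 0.1 },
  durationDays: 7,
  successMetric: { metric: 'qualityScore', threshold: 0.95 },
});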

Scheduling Simulations

Automate simulation runs:

// API example: schedule a weekly simulation
await ladger.simulations.schedule({
  name: 'Weekly classifier check',
  scope: { flowName: 'customer-support', spanName: 'classify-intent' },
  change: { model: 'gpt-3.5-turbo' },
  schedule: '0 9 * * MON', // Every Monday at 9am
  qualityThreshold: 0.95,
  notify: ['team@company.com'],
});

API Access

Create Simulation

curl -X POST "https://ladger.pages.dev/api/v1/simulations" \
  -H "Authorization: Bearer ladger_sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "flowName": "customer-support",
    "spanName": "classify-intent",
    "dateRange": {
      "start": "2024-01-01",
      "end": "2024-01-31"
    },
    "sampleSize": 500,
    "change": {
      "model": "gpt-3.5-turbo"
    },
    "qualityThreshold": 0.95
  }'
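
The {id} path parameter in the results endpoint below refers to the simulation created by this request.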

Get Results

curl -X GET "https://ladger.pages.dev/api/v1/simulations/{id}/results" \
  -H "Authorization: Bearer ladger_sk_live_..."

Best Practices

  1. Start small: Test with a 5% sample before running full simulations
  2. Review failures: Always analyze why requests fell below the threshold
  3. Consider edge cases: Ensure your sample includes diverse inputs
  4. Set realistic thresholds: A 100% quality match is usually unnecessary
  5. Track over time: Re-run simulations periodically as your data changes

Next Steps