Simulations
Simulations let you test optimization changes against historical data before deploying them to production, so you can confirm that quality holds while costs come down.
How Simulations Work
┌─────────────────────────────────────────────────────────────────┐
│                       SIMULATION PIPELINE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Select        2. Configure     3. Execute                  │
│      Historical       New Model/       Replay                   │
│      Data             Settings         Requests                 │
│   ┌────────┐       ┌────────┐       ┌────────┐                  │
│   │ 10,000 │ ────► │ GPT-3.5│ ────► │ Run on │                  │
│   │requests│       │ turbo  │       │ subset │                  │
│   └────────┘       └────────┘       └────────┘                  │
│                                         │                       │
│                                         ▼                       │
│   6. Decision      5. Quality       4. Compare                  │
│      Gate             Scoring          Outputs                  │
│   ┌────────┐       ┌────────┐       ┌────────┐                  │
│   │ Apply? │ ◄──── │ 97.2%  │ ◄──── │Original│                  │
│   │        │       │ match  │       │vs New  │                  │
│   └────────┘       └────────┘       └────────┘                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
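If you prefer to drive this pipeline from code, the sketch below walks the same six steps using a hypothetical ladger.simulations client modeled on the REST endpoints under API Access; the method names and result fields are assumptions, not a documented contract.

// Sketch only: a hypothetical ladger.simulations client modeled on
// the REST API under "API Access"; method names are assumptions.
const simulation = await ladger.simulations.create({
  flowName: 'customer-support',         // 1. select historical data
  spanName: 'classify-intent',
  dateRange: { start: '2024-01-01', end: '2024-01-31' },
  sampleSize: 500,
  change: { model: 'gpt-3.5-turbo' },   // 2. configure the new model
  qualityThreshold: 0.95,
});

// 3-5. Ladger replays the sample, compares outputs, and scores quality;
// see API Access for polling the run to completion.
const results = await ladger.simulations.getResults(simulation.id);

// 6. Decision gate: apply only if quality clears the threshold.
console.log(results.qualityScore >= 0.95 ? 'apply' : 'investigate failures');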
Creating a Simulation

Step 1: Select Scope
Choose what to simulate:
┌─────────────────────────────────────────────────────────────────┐
│                        CREATE SIMULATION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Scope                                                          │
│  ○ Flow: [customer-support ▼]                                   │
│  ○ Span: [classify-intent ▼]                                    │
│                                                                 │
│  Time Range                                                     │
│  ○ Last 7 days    ● Last 30 days    ○ Custom                    │
│                                                                 │
│  Sample Size                                                    │
│  [500 requests ▼]  (5% of 10,000 total)                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Step 2: Configure Changes

Specify the optimization to test:
┌─────────────────────────────────────────────────────────────────┐
│                    SIMULATION CONFIGURATION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Change Type                                                    │
│  ● Model Switch    ○ Prompt Change    ○ Parameter Tuning        │
│                                                                 │
│  Current Model: GPT-4o                                          │
│  New Model:     [GPT-3.5-turbo ▼]                               │
│                                                                 │
│  Additional Settings                                            │
│  Temperature: [0.7 ▼]   (current: 0.7)                          │
│  Max Tokens:  [150 ▼]   (current: 150)                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
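When you create a simulation through the API instead, these settings presumably travel in the change object of the create request (see API Access). Only the model-switch shape appears in the documented example; the prompt and parameter fields below are assumptions about the other two change types.

// The model-switch shape matches the documented create request;
// the other two shapes are assumptions, for illustration only.
const modelSwitch = { model: 'gpt-3.5-turbo' };

const promptChange = {
  prompt: 'You are a support intent classifier...',  // assumed field
};

const parameterTuning = {
  temperature: 0.3,  // assumed field (current: 0.7)
  maxTokens: 150,    // assumed field (current: 150)
};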
Step 3: Set Quality Threshold

Define what “acceptable quality” means:
┌─────────────────────────────────────────────────────────────────┐
│                        QUALITY SETTINGS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Minimum Quality Score                                          │
│  [95% ▼]  Pass if similarity ≥ this threshold                   │
│                                                                 │
│  Comparison Method                                              │
│  ● Semantic Similarity (embedding distance)                     │
│  ○ Exact Match (for structured outputs)                         │
│  ○ Custom Evaluator (provide function)                          │
│                                                                 │
│  Failure Handling                                               │
│  ● Stop on first failure batch                                  │
│  ○ Continue and report failures                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
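The Custom Evaluator option lets you supply your own scoring function when semantic similarity or exact match doesn't fit. A minimal sketch, assuming the evaluator receives the original and simulated outputs and returns a similarity score between 0 and 1 (the exact signature isn't documented here):

// Assumed signature: (original, simulated) => score in [0, 1].
function evaluateIntent(original, simulated) {
  // Structured intent labels either match or they don't...
  if (original.trim() === simulated.trim()) return 1.0;

  // ...but closely related labels can earn partial credit instead
  // of failing outright. This mapping is illustrative.
  const related = { billing_inquiry: ['billing_dispute', 'payment_question'] };
  if ((related[original.trim()] ?? []).includes(simulated.trim())) return 0.8;

  return 0.0;
}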
Running Simulations

Simulation Progress
┌─────────────────────────────────────────────────────────────────┐
│            SIMULATION: classify-intent Model Switch             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Status: Running ⏳                                             │
│  Progress: ████████████████████████░░░░░░░░░░  72%              │
│                                                                 │
│  Requests Processed: 360 / 500                                  │
│  Time Elapsed:       2m 15s                                     │
│  Est. Remaining:     52s                                        │
│                                                                 │
│  Live Metrics:                                                  │
│  • Quality Score: 97.8% ✓                                       │
│  • Avg Latency: 180ms (was 320ms)                               │
│  • Cost per Request: $0.0018 (was $0.025)                       │
│  • Errors: 0                                                    │
│                                                                 │
│  [Pause]  [Cancel]                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Results Dashboard

┌─────────────────────────────────────────────────────────────────┐
│                       SIMULATION RESULTS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Status: PASSED ✓                                               │
│                                                                 │
│  ┌──────────────────┬────────────────┬────────────────┐         │
│  │ Metric           │ Original       │ Simulation     │         │
│  ├──────────────────┼────────────────┼────────────────┤         │
│  │ Model            │ GPT-4o         │ GPT-3.5-turbo  │         │
│  │ Avg Latency      │ 320ms          │ 180ms (-44%)   │         │
│  │ Cost/Request     │ $0.025         │ $0.0018 (-93%) │         │
│  │ Quality Score    │ 100% (base)    │ 97.2% ✓        │         │
│  │ Requests Tested  │ -              │ 500            │         │
│  │ Failures         │ -              │ 14 (2.8%)      │         │
│  └──────────────────┴────────────────┴────────────────┘         │
│                                                                 │
│  Projected Monthly Savings: $2,340                              │
│                                                                 │
│  [Apply Changes]  [View Failures]  [Run Again]  [Export]        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Analyzing Results

Quality Breakdown
View per-request quality scores:
┌─────────────────────────────────────────────────────────────────┐
│                        QUALITY BREAKDOWN                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Score Distribution:                                            │
│  100%       █████████████████████████░░░  89%                   │
│  95-99%     ██░░░░░░░░░░░░░░░░░░░░░░░░░░   8%                   │
│  90-94%     █░░░░░░░░░░░░░░░░░░░░░░░░░░░   2%  (below threshold)│
│  Under 90%  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░   1%  (below threshold)│
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
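If you export per-request scores (see API Access), the distribution above is easy to recompute yourself. A small sketch, assuming scores is an array of 0-1 similarity values, one per replayed request:

// Bucket per-request similarity scores into the bands shown above.
function scoreDistribution(scores) {
  const buckets = { '100%': 0, '95-99%': 0, '90-94%': 0, 'under 90%': 0 };
  for (const s of scores) {
    if (s >= 1.0) buckets['100%'] += 1;
    else if (s >= 0.95) buckets['95-99%'] += 1;
    else if (s >= 0.9) buckets['90-94%'] += 1;
    else buckets['under 90%'] += 1;
  }
  // Convert counts to percentages of the total sample.
  return Object.fromEntries(
    Object.entries(buckets).map(([band, count]) =>
      [band, ((count / scores.length) * 100).toFixed(1) + '%'])
  );
}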
Failure Analysis

Review cases where quality dropped:
┌─────────────────────────────────────────────────────────────────┐
│                        FAILURE ANALYSIS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  14 requests below threshold (2.8%)                             │
│                                                                 │
│  Common Patterns:                                               │
│  • 8 failures: Complex multi-step queries                       │
│  • 4 failures: Ambiguous intent                                 │
│  • 2 failures: Edge case inputs                                 │
│                                                                 │
│  Sample Failure:                                                │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ Input: "Can you help me with my issue from yesterday     │   │
│  │        about the billing thing?"                         │   │
│  │                                                          │   │
│  │ Original (GPT-4o):    "billing_inquiry"                  │   │
│  │ Simulation (GPT-3.5): "general_question"                 │   │
│  │ Similarity: 78%                                          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Recommendation: Consider a hybrid approach, using GPT-3.5      │
│  for simple queries and GPT-4o for complex or ambiguous ones.   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
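The hybrid approach in that recommendation can be prototyped with a simple routing heuristic before you commit to it. Everything below (the complexity checks and the cutoffs) is an illustrative sketch, not a Ladger feature.

// Illustrative router, not a Ladger feature: send obviously simple
// queries to the cheap model and anything complex to GPT-4o.
function pickModel(query) {
  const looksComplex =
    query.length > 200 ||                          // long, multi-part requests
    /yesterday|earlier|last time/i.test(query) ||  // references prior context
    (query.match(/\?/g) ?? []).length > 1;         // multiple questions
  return looksComplex ? 'gpt-4o' : 'gpt-3.5-turbo';
}

// The sample failure above references prior context, so it routes
// to the larger model:
pickModel('Can you help me with my issue from yesterday about the billing thing?');
// => 'gpt-4o'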
A/B Testing

Run live A/B tests instead of historical replay:
┌─────────────────────────────────────────────────────────────────┐
│                      A/B TEST CONFIGURATION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Control Group (A)                                              │
│  Model:   GPT-4o                                                │
│  Traffic: 90%                                                   │
│                                                                 │
│  Test Group (B)                                                 │
│  Model:   GPT-3.5-turbo                                         │
│  Traffic: 10%                                                   │
│                                                                 │
│  Duration: [7 days ▼]                                           │
│  Success Metric: [Quality Score ▼] ≥ 95%                        │
│  Sample Size: Auto (stop at statistical significance)           │
│                                                                 │
│  [Start A/B Test]                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
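No A/B endpoint is documented in this section, but if the API follows the same pattern as ladger.simulations.schedule, configuring the same test from code might look like this sketch; the abTests namespace and every field name are assumptions.

// Hypothetical call patterned after ladger.simulations.schedule;
// the abTests namespace and its fields are assumptions.
await ladger.abTests.create({
  name: 'classify-intent model switch',
  control: { model: 'gpt-4o', trafficPercent: 90 },
  variant: { model: 'gpt-3.5-turbo', trafficPercent: 10 },
  durationDays: 7,
  successMetric: { name: 'qualityScore', threshold: 0.95 },
  sampleSize: 'auto', // stop at statistical significance
});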
Scheduling Simulations

Automate simulation runs:
// API example: schedule a weekly simulation
await ladger.simulations.schedule({
  name: 'Weekly classifier check',
  scope: { flowName: 'customer-support', spanName: 'classify-intent' },
  change: { model: 'gpt-3.5-turbo' },
  schedule: '0 9 * * MON', // every Monday at 9am
  qualityThreshold: 0.95,
  notify: ['team@company.com'],
});
API Access

Create Simulation
curl -X POST "https://ladger.pages.dev/api/v1/simulations" \
  -H "Authorization: Bearer ladger_sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "flowName": "customer-support",
    "spanName": "classify-intent",
    "dateRange": { "start": "2024-01-01", "end": "2024-01-31" },
    "sampleSize": 500,
    "change": { "model": "gpt-3.5-turbo" },
    "qualityThreshold": 0.95
  }'
Get Results

curl -X GET "https://ladger.pages.dev/api/v1/simulations/{id}/results" \
  -H "Authorization: Bearer ladger_sk_live_..."
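From code, you can poll that endpoint until the run settles. A minimal sketch, assuming the response carries a status field (the response shape isn't documented here):

// Poll the documented results endpoint until the simulation finishes.
// The `status` field and its values are assumptions about the response.
async function waitForResults(simulationId, apiKey) {
  const url = `https://ladger.pages.dev/api/v1/simulations/${simulationId}/results`;
  while (true) {
    const res = await fetch(url, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const body = await res.json();
    if (body.status !== 'running') return body;
    await new Promise((resolve) => setTimeout(resolve, 5000)); // retry in 5s
  }
}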
Best Practices

- Start small: Test with a 5% sample before running full simulations
- Review failures: Always analyze why requests failed
- Consider edge cases: Ensure your sample includes diverse inputs
- Set realistic thresholds: 100% quality match is usually unnecessary
- Track over time: Re-run simulations periodically as data changes
Next Steps
- Apply validated changes with Optimization rollout
- Monitor impact with Cost Analysis