Accuracy Report · v2.3 · May 2026

We publish our failures too

These are our internal accuracy numbers across common production incident patterns. Not a formal audit — but an honest one. We document every failure case, what caused it, and what we've done about it.

✓ This report includes cases where OperatorMesh was wrong
~87%
Diagnosis Accuracy
Internal testing · common incident patterns
5+
Incident Categories
DB · Memory · K8s · API · Cascade
~35min
Avg Manual Triage
Industry benchmark, vs ~19s with OperatorMesh
~91%
Action Usefulness
Suggested actions rated helpful
~13%
Failure Rate
Cases where diagnosis was wrong
~78%
Remediation Accuracy
Suggested fix resolved the incident
Accuracy by incident type
Incident Type · Coverage · Accuracy
Database / Connection Pool · Well covered · ~94%
Memory / OOM Kills · Well covered · ~92%
Kubernetes / Container · Well covered · ~88%
API / Gateway Errors · Well covered · ~85%
Latency / Timeout Spikes · Moderate coverage · ~79%
Multi-Service Cascades · Limited coverage · ~62%
* All figures approximate — internal testing · independent verification in progress · contribute your incidents →
Confidence calibration

When OperatorMesh says 90% confidence, it should be right ~90% of the time. Here's how our stated confidence correlates with actual accuracy in internal testing — a well-calibrated model means the confidence number actually means something.

90–99% confidence
Actual accuracy: 93%
✓ Well calibrated — slightly conservative
75–89% confidence
Actual accuracy: 81%
✓ Well calibrated — matches stated range
60–74% confidence
Actual accuracy: 68%
⚠ Slightly overconfident — improving
50–59% confidence
Actual accuracy: 51%
→ Auto-escalation triggered at <60%
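For readers who want to run the same check on their own incidents, here is a minimal sketch of how a calibration table like the one above could be computed. It is not OperatorMesh's internal code: the function names, the band boundaries as Python tuples, and the sample rows are all assumptions made for illustration. Only the band ranges and the <60% escalation threshold come from this report.

```python
from collections import defaultdict

# Confidence bands copied from the table above (percent, inclusive).
BANDS = [(90, 99), (75, 89), (60, 74), (50, 59)]
# The report states auto-escalation triggers below 60% confidence.
ESCALATION_THRESHOLD = 60


def band_for(confidence):
    """Return the (low, high) band containing a confidence value, or None."""
    for low, high in BANDS:
        if low <= confidence <= high:
            return (low, high)
    return None


def calibration_table(results):
    """Group (stated_confidence_percent, was_correct) pairs by band.

    Returns {band: (observed_accuracy_percent, sample_count)}.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [correct, total]
    for confidence, was_correct in results:
        band = band_for(confidence)
        if band is None:
            continue  # outside the reported bands
        counts[band][0] += int(was_correct)
        counts[band][1] += 1
    return {band: (100.0 * c / t, t) for band, (c, t) in counts.items()}


def should_escalate(confidence):
    """Hypothetical helper: True when confidence is below the threshold."""
    return confidence < ESCALATION_THRESHOLD


if __name__ == "__main__":
    # Made-up rows for illustration only, not OperatorMesh data.
    sample = [(95, True), (92, True), (91, False), (80, True),
              (78, True), (65, False), (62, True), (55, False)]
    for band, (accuracy, n) in sorted(calibration_table(sample).items(), reverse=True):
        print(f"{band[0]}-{band[1]}% stated: {accuracy:.0f}% actual (n={n})")
```

A band is well calibrated when its observed accuracy falls inside the stated range. With only a handful of incidents per band the observed figure will swing widely, so treat small samples with caution.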
Where we fail — documented honestly
🔴 Multi-service cascade failures 38% of failures
When 3+ services fail simultaneously with no clear trigger, OperatorMesh sometimes identifies a symptom as the root cause rather than the underlying cascade source. Complex distributed failures with circular dependencies are our hardest case.
→ Mitigation: Blast Radius mode now maps cascades before triage runs
🟡 Insufficient log context 31% of failures
When input contains only a single error line with no stack trace, service name, or recent changes, accuracy drops significantly. Garbage in, garbage out — the AI can only work with what you give it.
→ Mitigation: Missing Signals feature tells you exactly what to add to improve diagnosis
🟠 Novel / unusual error patterns 19% of failures
These are highly specific internal service errors with proprietary error codes that don't appear in public engineering literature. OperatorMesh correctly flags low confidence and triggers escalation in most of these cases.
→ Mitigation: Auto-escalation + Rejected Hypotheses help engineers know when to dig deeper
🔵 Infrastructure-level issues 12% of failures
These are cloud provider outages, hardware failures, or network-level issues that produce application-layer symptoms. OperatorMesh diagnoses the application symptom correctly but misses the underlying infrastructure cause.
→ Mitigation: Status page cross-reference coming in v2.4
How we measure this
⚠ Transparency notice

These numbers are based on internal testing across common production incident patterns — real incident types that SREs and DevOps engineers encounter regularly. This is not yet a formal third-party audit or academic benchmark.

We tested against well-documented incident categories (PostgreSQL connection exhaustion, Kubernetes OOMKill, API gateway timeouts, memory leaks, cascade failures) and compared AI output against known correct diagnoses for each pattern.

We are actively seeking SRE and DevOps teams to participate in independent verification with real production incidents. If you'd like to contribute, email founder@operatormesh.com — early participants get 6 months Pro free.

📋
Real incident patterns
Tested against real-world incident categories that appear regularly in production. Not synthetic edge cases or toy examples designed to make the AI look good.
📊
Binary correctness
A diagnosis counts as correct only if the root cause matches the known correct answer for that incident pattern. No partial credit is given. We count our failures.
🔄
Updated every version
This report updates with every major release. We never remove results when accuracy improves — all historical data is preserved and visible.
🔍
Independent verification wanted
We are actively seeking external engineers to verify these results with real production incidents. Contact us to participate — early contributors get 6 months Pro free.
Manual triage vs OperatorMesh

Based on internal testing across common production incident patterns. Manual triage time of 35 minutes is consistent with industry research and widely reported SRE benchmarks.

❌ Manual Triage
35 min
average per incident
log diving · guessing · team debate
✅ OperatorMesh
~19 sec
average analysis time
root cause · confidence · ranked actions
⚡ Time Saved
~99%
reduction in triage time
at $150/hr = $87 saved per incident
💰 ROI at $19/mo
46x
return on investment
10 incidents/month · 2 engineers
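The arithmetic behind these cards can be reproduced with a short sketch. The inputs (35 minutes, 19 seconds, $150/hr, $19/mo, 10 incidents per month) are taken from this page; the function and constant names are hypothetical. How the "2 engineers" figure enters the calculation is not stated, so this sketch assumes one engineer's time per incident.

```python
# Inputs quoted on this page.
MANUAL_TRIAGE_MIN = 35       # average manual triage time per incident
TOOL_ANALYSIS_SEC = 19       # average OperatorMesh analysis time
HOURLY_RATE_USD = 150        # engineer cost per hour
PLAN_PRICE_USD = 19          # monthly plan price
INCIDENTS_PER_MONTH = 10


def triage_savings(manual_min, tool_sec, hourly_rate):
    """Return (percent reduction in triage time, dollars saved per incident)."""
    manual_sec = manual_min * 60
    saved_sec = manual_sec - tool_sec
    reduction_pct = 100.0 * saved_sec / manual_sec
    saved_usd = hourly_rate * saved_sec / 3600
    return reduction_pct, saved_usd


def monthly_roi(saved_usd_per_incident, incidents, plan_price):
    """Return monthly savings divided by the plan price."""
    return saved_usd_per_incident * incidents / plan_price


if __name__ == "__main__":
    reduction, saved = triage_savings(
        MANUAL_TRIAGE_MIN, TOOL_ANALYSIS_SEC, HOURLY_RATE_USD
    )
    roi = monthly_roi(saved, INCIDENTS_PER_MONTH, PLAN_PRICE_USD)
    print(f"Reduction in triage time: {reduction:.1f}%")
    print(f"Saved per incident: ${saved:.0f}")
    print(f"ROI: {roi:.0f}x")
```

With these inputs the reduction comes to about 99.1%, the saving to about $87 per incident, and the return to about 46x. Swap in your own triage time, rate, and incident volume to see how the numbers move for your team.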
How we compare to alternatives

We're not replacing your monitoring stack. We're adding the explanation layer that none of them provide. Here's an honest feature comparison.

Feature Datadog PagerDuty Grafana OperatorMesh
Detects anomalies
Explains root cause Partial ✓ 87% accuracy
Confidence scoring ✓ Dual scores
Ranked fix actions ✓ 3 ranked
Pre-mortem scanning ✓ World first
Blast radius prediction ✓ World first
Post-mortem auto-draft Add-on Template only ✓ AI-generated
On-call handoff briefing ✓ World first
Rejected hypotheses ✓ World first
Webhook log retention Stored Stored Stored Zero retained
Free to start
Starting price $15/host/mo $21/user/mo $0 OSS $0 free · $19/mo
* Competitor features based on public documentation as of May 2026. OperatorMesh is not affiliated with any listed tool.
⚠ What we don't claim
We don't claim OperatorMesh replaces experienced SREs. It accelerates them.
We don't claim 87% accuracy on your specific stack. Results vary by log quality and incident type.
We don't claim our benchmark sample of 50 is statistically perfect. It's honest and growing.
We don't claim remediation confidence equals a guaranteed fix. It's a signal, not a guarantee.
We don't claim the AI is always right when it's confident. See calibration table above.
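To put the sample-size caveat in concrete terms, here is a minimal sketch that computes a Wilson score interval for an accuracy figure measured on 50 cases. The exact number of correct diagnoses is not published here, so the 44 out of 50 used below is an assumption chosen to sit close to ~87%. The function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    Returns (low, high) as fractions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumed count: 44 correct out of the 50-case sample (88%).
    low, high = wilson_interval(44, 50)
    print(f"95% interval: {low:.1%} to {high:.1%}")
```

Under that assumption the interval runs from roughly 76% to 94%, which is why the headline figure is best read as approximate until the sample grows.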

Judge it yourself

The best benchmark is your own incidents. Paste a real log and see if the diagnosis matches what your team found.
