Accuracy Report · v2.3 · May 2026
We publish our failures too
These are our internal accuracy numbers across common production incident patterns. Not a formal audit — but an honest one. We document every failure case, what caused it, and what we've done about it.
✓ This report includes cases where OperatorMesh was wrong
~87%
Diagnosis Accuracy
Internal testing · common incident patterns
5+
Incident Categories
DB · Memory · K8s · API · Cascade
~35min
Avg Manual Triage
Industry benchmark vs ~19s with OM
~91%
Action Usefulness
Suggested actions rated helpful
~13%
Failure Rate
Cases where diagnosis was wrong
~78%
Remediation Accuracy
Suggested fix resolved the incident
Accuracy by incident type
| Incident Type | Coverage | Accuracy |
| --- | --- | --- |
| Database / Connection Pool | Well covered | ~94% |
| Memory / OOM Kills | Well covered | ~92% |
| Kubernetes / Container | Well covered | ~88% |
| API / Gateway Errors | Well covered | ~85% |
| Latency / Timeout Spikes | Moderate coverage | ~79% |
| Multi-Service Cascades | Limited coverage | ~62% |
Confidence calibration
When OperatorMesh says 90% confidence, it should be right ~90% of the time. Here's how our stated confidence correlates with actual accuracy in internal testing — a well-calibrated model means the confidence number actually means something.
90–99% confidence
Actual accuracy: 93%
✓ Well calibrated — slightly conservative
75–89% confidence
Actual accuracy: 81%
✓ Well calibrated — matches stated range
60–74% confidence
Actual accuracy: 68%
⚠ Slightly overconfident — improving
50–59% confidence
Actual accuracy: 51%
→ Auto-escalation triggered at <60%
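As a rough illustration of how a calibration table like the one above can be computed, the sketch below groups diagnoses by stated confidence and reports the share that turned out correct in each bucket. It is not OperatorMesh's actual implementation: the function names, the `(stated_confidence, was_correct)` input shape, and the sample data are all assumptions made for the example. Only the bucket boundaries and the <60% escalation threshold come from this report.

```python
from collections import defaultdict

# Confidence buckets from the report, as (low, high_exclusive) in percent.
BUCKETS = [(90, 100), (75, 90), (60, 75), (50, 60)]

# The report states that auto-escalation triggers below 60% confidence.
ESCALATION_THRESHOLD = 60


def bucket_for(confidence):
    """Return the bucket containing a stated confidence, or None if outside all buckets."""
    for low, high in BUCKETS:
        if low <= confidence < high:
            return (low, high)
    return None


def calibration_table(results):
    """Map each bucket to (actual_accuracy_percent, sample_count).

    results: iterable of (stated_confidence_percent, was_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for confidence, was_correct in results:
        bucket = bucket_for(confidence)
        if bucket is None:
            continue  # skip confidences outside the reported buckets
        totals[bucket][0] += int(was_correct)
        totals[bucket][1] += 1
    return {b: (100.0 * c / n, n) for b, (c, n) in totals.items()}


def should_escalate(confidence):
    """True when a stated confidence falls below the escalation threshold."""
    return confidence < ESCALATION_THRESHOLD


# Hypothetical sample data, for illustration only (not our benchmark results).
sample = [(95, True), (92, True), (91, False),
          (80, True), (78, False),
          (55, True), (52, False)]
table = calibration_table(sample)

for low, high in BUCKETS:
    if (low, high) in table:
        accuracy, count = table[(low, high)]
        print(f"{low}-{high - 1}% confidence: actual accuracy {accuracy:.0f}% (n={count})")
```

A bucket is well calibrated when its actual accuracy falls inside the stated range. With small samples the per-bucket numbers are noisy, so the sample count is reported alongside each accuracy.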
Where we fail — documented honestly
When 3+ services fail simultaneously with no clear trigger, OperatorMesh sometimes identifies a symptom as the root cause rather than the underlying cascade source. Complex distributed failures with circular dependencies are our hardest case.
→ Mitigation: Blast Radius mode now maps cascades before triage runs
When input contains only a single error line with no stack trace, service name, or recent changes, accuracy drops significantly. Garbage in, garbage out — the AI can only work with what you give it.
→ Mitigation: Missing Signals feature tells you exactly what to add to improve diagnosis
Accuracy is lower for highly specific internal service errors with proprietary error codes that don't appear in public engineering literature. OperatorMesh correctly flags low confidence and triggers escalation in most of these cases.
→ Mitigation: Auto-escalation + Rejected Hypotheses help engineers know when to dig deeper
Cloud provider outages, hardware failures, and network-level issues can produce application-layer symptoms. In these cases OperatorMesh diagnoses the application symptom correctly but misses the underlying infrastructure cause.
→ Mitigation: Status page cross-reference coming in v2.4
How we measure this
⚠ Transparency notice
These numbers are based on internal testing across common production incident patterns — real incident types that SREs and DevOps engineers encounter regularly. This is not yet a formal third-party audit or academic benchmark.
We tested against well-documented incident categories (PostgreSQL connection exhaustion, Kubernetes OOMKill, API gateway timeouts, memory leaks, cascade failures) and compared AI output against known correct diagnoses for each pattern.
We are actively seeking SRE and DevOps teams to participate in independent verification with real production incidents. If you'd like to contribute, email founder@operatormesh.com — early participants get 6 months Pro free.
📋
Real incident patterns
Tested against real-world incident categories that appear regularly in production. Not synthetic edge cases or toy examples designed to make the AI look good.
📊
Binary correctness
A diagnosis counts as correct only if the root cause matches the known correct answer for that incident pattern. No partial credit is given. We count our failures.
🔄
Updated every version
This report updates with every major release. We never remove results when accuracy improves — all historical data is preserved and visible.
🔍
Independent verification wanted
We are actively seeking external engineers to verify these results with real production incidents. Contact us to participate — early contributors get 6 months Pro free.
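The binary correctness rule described above can be sketched as a simple exact-match scorer. This is a minimal illustration, not our actual test harness: it assumes root causes are recorded as normalized string labels, and the function name and sample labels are hypothetical.

```python
def score_diagnoses(cases):
    """Score diagnoses with binary correctness (no partial credit).

    cases: list of (predicted_root_cause, known_root_cause) label pairs.
    Returns (accuracy, failure_rate) as fractions between 0 and 1.
    """
    if not cases:
        raise ValueError("no cases to score")
    # A case counts as correct only on an exact root-cause match.
    correct = sum(1 for predicted, known in cases if predicted == known)
    accuracy = correct / len(cases)
    return accuracy, 1.0 - accuracy


# Hypothetical labels, for illustration only.
cases = [
    ("pg_connection_exhaustion", "pg_connection_exhaustion"),
    ("k8s_oomkill", "k8s_oomkill"),
    ("api_gateway_timeout", "api_gateway_timeout"),
    ("api_gateway_timeout", "upstream_cascade"),  # symptom named instead of root cause
]
accuracy, failure_rate = score_diagnoses(cases)
print(f"accuracy={accuracy:.0%} failure_rate={failure_rate:.0%}")
```

Under this rule the failure rate is always the complement of the accuracy, which is why the headline figures of ~87% and ~13% sum to 100%.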
Manual triage vs OperatorMesh
Based on internal testing across common production incident patterns. Manual triage time of 35 minutes is consistent with industry research and widely reported SRE benchmarks.
❌ Manual Triage
35 min
average per incident
log diving · guessing · team debate
✅ OperatorMesh
~19 sec
average analysis time
root cause · confidence · ranked actions
⚡ Time Saved
~99%
reduction in triage time
at $150/hr = $87 saved per incident
💰 ROI at $19/mo
46x
return on investment
10 incidents/month · 2 engineers
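The savings arithmetic can be reproduced from the inputs stated in this section. The sketch below is only a back-of-the-envelope check: the variable names are ours, and it assumes the hourly rate is applied once per incident (not multiplied by the number of engineers), which is the reading that reproduces the ~$87 and ~46x figures.

```python
# Inputs stated in this report.
MANUAL_TRIAGE_MIN = 35      # average manual triage time per incident
OM_ANALYSIS_SEC = 19        # average OperatorMesh analysis time
HOURLY_RATE_USD = 150       # engineer cost per hour
INCIDENTS_PER_MONTH = 10
PLAN_PRICE_USD = 19         # monthly plan price

manual_sec = MANUAL_TRIAGE_MIN * 60
saved_sec = manual_sec - OM_ANALYSIS_SEC

# Fraction of triage time removed per incident.
reduction = saved_sec / manual_sec

# Dollar value of the time saved, assuming the rate applies once per incident.
saved_per_incident = saved_sec / 3600 * HOURLY_RATE_USD
monthly_savings = saved_per_incident * INCIDENTS_PER_MONTH
roi_multiple = monthly_savings / PLAN_PRICE_USD

print(f"time reduction:     {reduction:.1%}")
print(f"saved per incident: ${saved_per_incident:.0f}")
print(f"monthly savings:    ${monthly_savings:.0f}")
print(f"ROI multiple:       {roi_multiple:.0f}x")
```

Changing the hourly rate, incident volume, or triage baseline to match your own team gives a more realistic estimate than the defaults used here.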
How we compare to alternatives
We're not replacing your monitoring stack. We're adding the explanation layer that none of them provide. Here's an honest feature comparison.
| Feature | Datadog | PagerDuty | Grafana | OperatorMesh |
| --- | --- | --- | --- | --- |
| Detects anomalies | ✓ | ✕ | ✓ | ✓ |
| Explains root cause | ✕ | ✕ | Partial | ✓ 87% accuracy |
| Confidence scoring | ✕ | ✕ | ✕ | ✓ Dual scores |
| Ranked fix actions | ✕ | ✕ | ✕ | ✓ 3 ranked |
| Pre-mortem scanning | ✕ | ✕ | ✕ | ✓ World first |
| Blast radius prediction | ✕ | ✕ | ✕ | ✓ World first |
| Post-mortem auto-draft | Add-on | Template only | ✕ | ✓ AI-generated |
| On-call handoff briefing | ✕ | ✕ | ✕ | ✓ World first |
| Rejected hypotheses | ✕ | ✕ | ✕ | ✓ World first |
| Webhook log retention | Stored | Stored | Stored | Zero retained |
| Free to start | ✕ | ✕ | ✓ | ✓ |
| Starting price | $15/host/mo | $21/user/mo | $0 OSS | $0 free · $19/mo |
* Competitor features based on public documentation as of May 2026. OperatorMesh is not affiliated with any listed tool.
⚠ What we don't claim
✕ We don't claim OperatorMesh replaces experienced SREs. It accelerates them.
✕ We don't claim 87% accuracy on your specific stack. Results vary by log quality and incident type.
✕ We don't claim our benchmark sample of 50 is statistically perfect. It's honest and growing.
✕ We don't claim remediation confidence equals a guaranteed fix. It's a signal, not a guarantee.
✕ We don't claim the AI is always right when it's confident. See the calibration table above.