Accuracy Report · v2.3 · May 2026

We publish our failures too

These are our internal accuracy numbers across common production incident patterns. Not a formal audit — but an honest one. We document every failure case, what caused it, and what we've done about it.

✓ This report includes cases where OperatorMesh was wrong

~87%

Diagnosis Accuracy

Internal testing · common incident patterns

Incident Categories

DB · Memory · K8s · API · Cascade

~35 min

Avg Manual Triage

Industry benchmark, vs ~19 sec with OperatorMesh

~91%

Action Usefulness

Suggested actions rated helpful

~13%

Failure Rate

Cases where diagnosis was wrong

~78%

Remediation Accuracy

Suggested fix resolved the incident

Accuracy by incident type

Incident Type	Coverage	Accuracy
Database / Connection Pool	Well covered	~94%
Memory / OOM Kills	Well covered	~92%
Kubernetes / Container	Well covered	~88%
API / Gateway Errors	Well covered	~85%
Latency / Timeout Spikes	Moderate coverage	~79%
Multi-Service Cascades	Limited coverage	~62%

* All figures approximate — internal testing · independent verification in progress · contribute your incidents →

Confidence calibration

When OperatorMesh says 90% confidence, it should be right ~90% of the time. Here's how our stated confidence correlates with actual accuracy in internal testing — a well-calibrated model means the confidence number actually means something.

90–99% confidence

Actual accuracy: 93%

✓ Well calibrated — slightly conservative

75–89% confidence

Actual accuracy: 81%

✓ Well calibrated — matches stated range

60–74% confidence

Actual accuracy: 68%

⚠ Slightly overconfident — improving

50–59% confidence

Actual accuracy: 51%

→ Auto-escalation triggered at <60%
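A minimal sketch of the check behind these buckets, using only the ranges and observed accuracies listed above. The names `Bucket`, `in_stated_range`, and `should_escalate` are illustrative and are not part of OperatorMesh. The report does not give the mean stated confidence inside each bucket, so this only tests whether observed accuracy lands inside the stated range and applies the <60% escalation rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    low: int     # lower bound of stated confidence, in percent
    high: int    # upper bound of stated confidence, in percent
    actual: int  # observed accuracy in internal testing, in percent


# Figures copied from the calibration buckets in this report.
BUCKETS = [
    Bucket(90, 99, 93),
    Bucket(75, 89, 81),
    Bucket(60, 74, 68),
    Bucket(50, 59, 51),
]

# Threshold taken from "Auto-escalation triggered at <60%".
ESCALATION_THRESHOLD = 60


def in_stated_range(bucket: Bucket) -> bool:
    """True when observed accuracy falls inside the stated confidence range."""
    return bucket.low <= bucket.actual <= bucket.high


def should_escalate(confidence_pct: float) -> bool:
    """True when a diagnosis is below the escalation threshold."""
    return confidence_pct < ESCALATION_THRESHOLD


for b in BUCKETS:
    status = "in range" if in_stated_range(b) else "out of range"
    print(f"{b.low}-{b.high}% stated, {b.actual}% actual: {status}")
```

A fuller calibration check would compare the average stated confidence in each bucket against observed accuracy, which needs per-diagnosis data not shown here.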

Where we fail — documented honestly

🔴 Multi-service cascade failures · 38% of failures

When 3+ services fail simultaneously with no clear trigger, OperatorMesh sometimes identifies a symptom as the root cause rather than the underlying cascade source. Complex distributed failures with circular dependencies are our hardest case.

→ Mitigation: Blast Radius mode now maps cascades before triage runs

🟡 Insufficient log context · 31% of failures

When input contains only a single error line with no stack trace, service name, or recent changes, accuracy drops significantly. Garbage in, garbage out — the AI can only work with what you give it.

→ Mitigation: Missing Signals feature tells you exactly what to add to improve diagnosis

🟠 Novel / unusual error patterns · 19% of failures

These are highly specific internal service errors with proprietary error codes that don't appear in public engineering literature. OperatorMesh correctly flags low confidence and triggers escalation in most of these cases.

→ Mitigation: Auto-escalation + Rejected Hypotheses help engineers know when to dig deeper

🔵 Infrastructure-level issues · 12% of failures

Cloud provider outages, hardware failures, or network-level issues that produce application-layer symptoms. OperatorMesh diagnoses the application symptom correctly but misses the underlying infrastructure cause.

→ Mitigation: Status page cross-reference coming in v2.4

How we measure this

⚠ Transparency notice

These numbers are based on internal testing across common production incident patterns — real incident types that SREs and DevOps engineers encounter regularly. This is not yet a formal third-party audit or academic benchmark.

We tested against well-documented incident categories (PostgreSQL connection exhaustion, Kubernetes OOMKill, API gateway timeouts, memory leaks, cascade failures) and compared AI output against known correct diagnoses for each pattern.

We are actively seeking SRE and DevOps teams to participate in independent verification with real production incidents. If you'd like to contribute, email founder@operatormesh.com — early participants get 6 months Pro free.

📋

Real incident patterns

Tested against real-world incident categories that appear regularly in production. Not synthetic edge cases or toy examples designed to make the AI look good.

📊

Binary correctness

A diagnosis counts as correct only if the root cause matches the known correct answer for that incident pattern. No partial credit is given. We count our failures.
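The binary scoring rule can be sketched as follows. This is an illustration, not the actual evaluation harness: the function names, the case-and-whitespace normalization, and the four sample cases are all hypothetical.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(label.lower().split())


def diagnosis_accuracy(cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the predicted root cause equals the known one.

    Each case is (predicted_root_cause, known_root_cause). A case scores
    1 or 0; there is no partial credit for a near miss.
    """
    if not cases:
        raise ValueError("at least one case is required")
    correct = sum(
        1 for predicted, known in cases if normalize(predicted) == normalize(known)
    )
    return correct / len(cases)


# Hypothetical sample, for illustration only.
SAMPLE_CASES = [
    ("postgres connection pool exhaustion", "Postgres connection pool exhaustion"),
    ("container OOMKill", "container OOMKill"),
    ("api gateway timeout", "upstream dependency cascade"),  # symptom, not cause
    ("memory leak", "memory leak"),
]

accuracy = diagnosis_accuracy(SAMPLE_CASES)
print(f"{accuracy:.0%}")  # 3 of 4 correct -> 75%
```

The third case shows why the rule is strict: naming a real symptom still scores zero when the known root cause is something else.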

🔄

Updated every version

This report updates with every major release. We never remove results when accuracy improves — all historical data is preserved and visible.

🔍

Independent verification wanted

We are actively seeking external engineers to verify these results with real production incidents. Contact us to participate — early contributors get 6 months Pro free.

Manual triage vs OperatorMesh

Based on internal testing across common production incident patterns. Manual triage time of 35 minutes is consistent with industry research and widely reported SRE benchmarks.

❌ Manual Triage

35 min

average per incident

log diving · guessing · team debate

✅ OperatorMesh

~19 sec

average analysis time

root cause · confidence · ranked actions

⚡ Time Saved

~99%

reduction in triage time

at $150/hr ≈ $87 saved per incident

💰 ROI at $19/mo

46x

return on investment

10 incidents/month · 2 engineers
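The arithmetic behind these cards can be reproduced from the stated inputs: 35 minutes of manual triage, ~19 seconds of analysis, $150/hr, $19/mo, and 10 incidents per month. The function names below are illustrative. How the 2 engineers enter the calculation is not stated, so this sketch assumes plan cost and savings both scale per engineer, which leaves the ratio unchanged.

```python
MANUAL_TRIAGE_SEC = 35 * 60   # 35 minutes, the industry benchmark quoted above
OM_ANALYSIS_SEC = 19          # ~19 seconds average analysis time
HOURLY_RATE_USD = 150.0
PLAN_USD_PER_MONTH = 19.0
INCIDENTS_PER_MONTH = 10


def time_reduction(manual_sec: float, tool_sec: float) -> float:
    """Fraction of triage time removed."""
    return (manual_sec - tool_sec) / manual_sec


def savings_per_incident(manual_sec: float, tool_sec: float, rate: float) -> float:
    """Dollar value of the time saved on one incident."""
    return (manual_sec - tool_sec) / 3600 * rate


def monthly_roi(saved_per_incident: float, incidents: int, plan_cost: float) -> float:
    """Monthly savings divided by monthly plan cost."""
    return saved_per_incident * incidents / plan_cost


reduction = time_reduction(MANUAL_TRIAGE_SEC, OM_ANALYSIS_SEC)
saved = savings_per_incident(MANUAL_TRIAGE_SEC, OM_ANALYSIS_SEC, HOURLY_RATE_USD)
roi = monthly_roi(saved, INCIDENTS_PER_MONTH, PLAN_USD_PER_MONTH)

print(f"reduction: {reduction:.1%}")    # 99.1%
print(f"saved per incident: ${saved:.2f}")  # $86.71
print(f"ROI: {roi:.0f}x")               # 46x
```

These figures count only the analysis step; time an engineer spends reading and acting on the diagnosis is not included.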

How we compare to alternatives

We're not replacing your monitoring stack. We're adding the explanation layer that none of them provide. Here's an honest feature comparison.

Feature	Datadog	PagerDuty	Grafana	OperatorMesh
Detects anomalies	✓	✕	✓	✓
Explains root cause	✕	✕	Partial	✓ ~87% accuracy
Confidence scoring	✕	✕	✕	✓ Dual scores
Ranked fix actions	✕	✕	✕	✓ 3 ranked
Pre-mortem scanning	✕	✕	✕	✓ World first
Blast radius prediction	✕	✕	✕	✓ World first
Post-mortem auto-draft	Add-on	Template only	✕	✓ AI-generated
On-call handoff briefing	✕	✕	✕	✓ World first
Rejected hypotheses	✕	✕	✕	✓ World first
Webhook log retention	Stored	Stored	Stored	Zero retained
Free to start	✕	✕	✓	✓
Starting price	$15/host/mo	$21/user/mo	$0 OSS	$0 free · $19/mo

* Competitor features based on public documentation as of May 2026. OperatorMesh is not affiliated with any listed tool.

⚠ What we don't claim

✕ We don't claim OperatorMesh replaces experienced SREs. It accelerates them.

✕ We don't claim 87% accuracy on your specific stack. Results vary by log quality and incident type.

✕ We don't claim our benchmark sample of 50 is statistically perfect. It's honest and growing.

✕ We don't claim remediation confidence equals a guaranteed fix. It's a signal, not a guarantee.

✕ We don't claim the AI is always right when it's confident. See calibration table above.
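To show how much uncertainty a sample of 50 carries, here is a rough sketch of a 95% Wilson score interval. It assumes the headline accuracy was measured on that sample of 50, which is not stated explicitly, and it uses 44 correct out of 50 (88%) as a stand-in, since ~87% of 50 is not a whole number. Under those assumptions the interval runs from roughly 76% to 94%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts: 44 correct diagnoses out of a benchmark sample of 50.
low, high = wilson_interval(44, 50)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

The interval narrows as the sample grows, which is the practical reason a larger benchmark makes the headline number more trustworthy.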

Judge it yourself

The best benchmark is your own incidents. Paste a real log and see if the diagnosis matches what your team found.

⚡ Try free, no card needed · View changelog →