You paste raw logs, an alert payload, or a stack trace into the demo — or your monitoring tool fires a webhook. OperatorMesh accepts any plaintext incident data.

2. AI triage engine runs

Your input is sent to an advanced large language model with a specialized SRE system prompt. The model identifies error patterns, correlates signals, and generates a structured JSON diagnosis.

💡

Zero retention: Input is processed in-memory and never written to disk, database, or logs. Your production data is invisible to us.

3. Structured output returned

Within seconds you receive the root cause, two confidence scores (diagnosis and remediation), matched signals, 3 ranked actions, an escalation flag if the fix is uncertain, and what to investigate first.

4. Delivered to Slack (optional)

On Starter and Pro plans, the triage report is automatically posted to your configured Slack channel — formatted, readable, and actionable.

Integration

Slack setup

Connect OperatorMesh to Slack so every production alert automatically triggers an AI triage report in your incident channel.

✅

Available on: Starter ($19/mo) and Pro ($49/mo) plans. Praveen configures this personally during onboarding.

1. Email founder@operatormesh.com

Tell Praveen your Slack workspace and which channel you want reports posted to (e.g. #incidents, #alerts). You'll receive a webhook URL within 2 hours.

2. Configure your monitoring tool

Point your Datadog, PagerDuty, or Grafana alert webhook to the OperatorMesh endpoint. Use POST with a JSON body.

JSON — Webhook payload
{
  "service": "api-gateway",
  "error": "503 upstream timeout after deploy v2.4.1",
  "logs": "upstream connect error or disconnect/reset...",
  "recent_changes": "deployed v2.4.1 at 14:32 UTC",
  "time": "2026-05-01T14:35:00Z",
  "source": "datadog"
}

3. Triage report lands in Slack

Within seconds of your alert firing, your Slack channel receives a formatted incident report.

#incidents — Slack message

🚨 OperatorMesh Incident Report
Service: api-gateway · Source: Datadog

ROOT CAUSE
DB connection pool exhausted — pool_size reduced 10→5 in v2.4.1

CONFIDENCE
87% · deploy-correlated timeout pattern matched

ACTIONS
1. Diff database.yml v2.4.0 vs v2.4.1
2. SHOW STATUS LIKE 'Threads_connected'
3. Raise connection_pool_size to 20

Advisory only · No actions taken · operatormesh.com

Integrations

All integrations

OperatorMesh connects to your existing monitoring stack — it does not replace it. Think of it as the explanation layer on top of every tool you already use.

🐕

Datadog

Coming soon

Native Datadog webhook parser. Fires on monitor alerts, APM traces, and log-based monitors. Auto-extracts service, error, and deployment context.

📟

PagerDuty

Coming soon

Trigger triage on incident creation. Triage report auto-posted as a PagerDuty note and to your Slack channel simultaneously.

📊

Grafana

Coming soon

Works with Grafana Alerting webhooks. Parses alert labels, annotations, and panel data to provide context-aware triage.

💬

Slack

Live

Automatic triage delivery to any Slack channel. Formatted report with root cause, confidence, and actions. Available on Starter and Pro.

🔔

New Relic

Coming soon

Connect New Relic alert policies. Triage fires on NRQL alert conditions and deployment markers.

🐛

Sentry

Coming soon

Issue-level triage from Sentry webhooks. Stack traces, breadcrumbs, and release info all parsed automatically.

⚡

Need an integration now? Email founder@operatormesh.com — Praveen can manually configure any monitoring tool that supports webhooks, usually within 24 hours.

Reference

Webhook API

Send any incident data to OperatorMesh via HTTP POST. Works with any monitoring tool that supports outbound webhooks.

Endpoint
POST https://operatormesh.com/.netlify/functions/analyze
# Required header: x-om-key: mesh_v1
# Body fields: incident (string), mode ("triage"|"premortem"), user_id (optional)

cURL — Triage incident
curl -X POST https://operatormesh.com/.netlify/functions/analyze \
  -H "Content-Type: application/json" \
  -H "x-om-key: mesh_v1" \
  -d '{
    "incident": "service: payment-service\nerror: Connection timeout to postgres:5432\nlogs: FATAL connection refused\nrecent_changes: scaled down DB replicas at 09:15 UTC",
    "mode": "triage",
    "user_id": "optional-your-user-id"
  }'
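
If you would rather call the endpoint from Python than from cURL, the same request can be sketched with the standard library. The endpoint, header, and body fields below are copied from the cURL example above; the function names `build_analyze_request` and `send` and the 30-second timeout are illustrative assumptions, not part of an official client.

```python
import json
import urllib.request
from typing import Optional

# Endpoint and key header are copied from the cURL example above.
ENDPOINT = "https://operatormesh.com/.netlify/functions/analyze"
API_KEY = "mesh_v1"


def build_analyze_request(incident: str, mode: str = "triage",
                          user_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request for the analyze endpoint."""
    if mode not in ("triage", "premortem"):
        raise ValueError(f"unsupported mode: {mode!r}")
    body = {"incident": incident, "mode": mode}
    if user_id is not None:  # user_id is optional, so leave it out when unset
        body["user_id"] = user_id
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-om-key": API_KEY},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (timeout is an assumed value)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_analyze_request(text))` performs the network request; building the request separately makes it easy to inspect or log the payload before anything leaves your machine.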

Input format

The more context you provide, the higher the confidence score. At minimum, include the error message. Ideal input includes the first 4 fields: service, error, logs, and recent_changes.

Field	Type	Required	Description
service	string	optional	Service or component name. e.g. "api-gateway", "auth-service"
error	string	required	The error message, alert text, or exception. This is the primary signal.
logs	string	optional	Raw log lines, stack traces, or additional context. Improves accuracy significantly.
recent_changes	string	optional	Recent deploys, config changes, or infra modifications. Critical for deploy-correlated issues.
time	ISO 8601	optional	Incident timestamp. Used for context, not stored.
source	string	optional	Origin tool. e.g. "datadog", "pagerduty", "manual"

💡

Pro tip: Including recent_changes typically increases confidence score by 15–25%. Deploy-correlated incidents are the most common production failure pattern.
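
The cURL example in the Webhook API section sends these fields as newline-separated "key: value" lines inside a single incident string. A minimal sketch of that flattening step follows, assuming this layout is how structured fields should be mapped; `format_incident` is a hypothetical helper name, not an OperatorMesh API.

```python
# Field order follows the input table above; "error" is the only required field.
FIELD_ORDER = ("service", "error", "logs", "recent_changes", "time", "source")


def format_incident(fields: dict) -> str:
    """Flatten structured fields into newline-separated "key: value" text."""
    error = str(fields.get("error") or "").strip()
    if not error:
        raise ValueError('"error" is required')
    lines = []
    for key in FIELD_ORDER:
        value = fields.get(key)
        if value:  # skip optional fields that are missing or empty
            lines.append(f"{key}: {str(value).strip()}")
    return "\n".join(lines)
```

Unknown keys are ignored rather than forwarded, so a typo in a field name drops that field silently; add a check for unexpected keys if that matters for your pipeline.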

Output format

Every request returns a structured JSON object. All fields for the requested mode are always present.

JSON Response — Triage mode
{
  "mode": "triage",
  "root_cause": "DB connection pool exhausted after pool_size reduced in v2.4.1",
  "diagnosis_confidence": 91,
  "remediation_confidence": 84,
  "confidence_reason": "Deploy-correlated timeout with pool exhaustion pattern matched",
  "remediation_reason": "Fix is straightforward — pool config is the only variable",
  "escalate": false,
  "signals": [
    { "match": "upstream timeout", "desc": "Gateway cannot reach upstream service" },
    { "match": "deploy correlation", "desc": "Symptoms appeared immediately after v2.4.1 deploy" },
    { "match": "connection pool", "desc": "Pool size config change detected in logs" }
  ],
  "rejected_hypotheses": [
    "DNS outage — DNS resolves correctly across all nodes",
    "Backend crash — upstream service returning 200s",
    "Network partition — internal services reachable"
  ],
  "missing_signals": [
    "Deploy diff for database.yml between v2.4.0 and v2.4.1",
    "DB connection pool metrics from last 30 minutes"
  ],
  "actions": [
    "Diff database.yml between v2.4.0 and v2.4.1",
    "Run SHOW STATUS LIKE 'Threads_connected' on primary DB",
    "Raise connection_pool_size to 20 and restart app servers"
  ],
  "investigate": "Check database.yml in v2.4.1 diff — pool_size is the primary suspect",
  "time_saved": "30-40 minutes",
  "plan": "free"
}

JSON Response — Pre-Mortem mode
{
  "mode": "premortem",
  "deploy_safety_score": 42,
  "safety_label": "High Risk",
  "verdict": "Column rename without backward compatibility will break dependent services",
  "predicted_failures": [
    { "failure": "auth-service crashes on login", "likelihood": "High", "impact": "All logins fail", "trigger": "email field missing" }
  ],
  "at_risk_services": ["auth-service", "notification-service"],
  "post_deploy_monitors": [
    "Watch auth-service error rate for 5xx spikes",
    "Monitor notification delivery success rate",
    "Check DB query errors for column not found"
  ],
  "rollback_trigger": "Any 5xx rate above 1% within 10 minutes of deploy"
}

Field	Mode	Type	Description
mode	both	string	"triage" or "premortem"
root_cause	triage	string	One clear sentence explaining the most likely cause
diagnosis_confidence	triage	integer	50–99. How certain the root cause identification is
remediation_confidence	triage	integer	50–99. How certain it is that the suggested fix will work
confidence_reason	triage	string	Why this diagnosis confidence level was assigned
remediation_reason	triage	string	Why the fix is certain or uncertain
escalate	triage	boolean	true if remediation_confidence < 60 — human validation required
signals	triage	array[3]	Exactly 3 matched signals, each with a name ("match") and an explanation ("desc")
rejected_hypotheses	triage	array[3]	3 alternative causes that were considered and eliminated — shows reasoning transparency
missing_signals	triage	array[1-3]	Specific logs or metrics that would increase confidence if available
actions	triage	array[3]	Exactly 3 ranked next actions — specific and executable
investigate	triage	string	The single highest-priority thing to check first
time_saved	triage	string	Estimated manual triage time this replaces
deploy_safety_score	premortem	integer	0–100. Overall deploy safety score; higher is safer. 100 = very safe.
safety_label	premortem	string	"Safe" | "Caution" | "High Risk" | "Do Not Deploy"
verdict	premortem	string	One-sentence plain-language risk assessment
predicted_failures	premortem	array[3]	Predicted failure modes with likelihood, impact, and trigger
at_risk_services	premortem	array	Services likely to be affected by this deploy
post_deploy_monitors	premortem	array[3]	Specific metrics/logs to watch immediately after deploy
rollback_trigger	premortem	string	Exact condition that should trigger immediate rollback
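
If you consume triage responses in your own tooling, it can help to check them against the table above before acting on them. The sketch below covers only the triage-mode rows and takes its types, ranges, and array sizes from that table; `check_triage_response` is a hypothetical helper, not something OperatorMesh ships.

```python
# Expected Python types for the triage-mode fields listed in the table above.
TRIAGE_FIELDS = {
    "root_cause": str,
    "diagnosis_confidence": int,
    "remediation_confidence": int,
    "confidence_reason": str,
    "remediation_reason": str,
    "escalate": bool,
    "signals": list,
    "rejected_hypotheses": list,
    "missing_signals": list,
    "actions": list,
    "investigate": str,
    "time_saved": str,
}

# (field, minimum items, maximum items) taken from the array sizes in the table.
ARRAY_SIZES = (("signals", 3, 3), ("rejected_hypotheses", 3, 3),
               ("missing_signals", 1, 3), ("actions", 3, 3))


def check_triage_response(resp: dict) -> list:
    """Return a list of problems; an empty list means the response matches the table."""
    problems = []
    if resp.get("mode") != "triage":
        problems.append("mode is not 'triage'")
    for name, expected in TRIAGE_FIELDS.items():
        value = resp.get(name)
        # bool is a subclass of int in Python, so reject it for integer fields
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            problems.append(f"{name}: expected {expected.__name__}")
    for name in ("diagnosis_confidence", "remediation_confidence"):
        value = resp.get(name)
        if isinstance(value, int) and not isinstance(value, bool) and not 50 <= value <= 99:
            problems.append(f"{name}: {value} is outside 50-99")
    for name, low, high in ARRAY_SIZES:
        value = resp.get(name)
        if isinstance(value, list) and not low <= len(value) <= high:
            problems.append(f"{name}: expected {low}-{high} items, got {len(value)}")
    return problems
```

Returning a list of problems instead of raising on the first one lets you log every mismatch at once, which is useful when a response is only partly usable.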

Confidence scores

OperatorMesh returns two separate confidence scores. These are different — a clear root cause may still have an uncertain fix. Neither is ever 100% — OperatorMesh is advisory only.

Diagnosis Confidence

How certain the AI is about the root cause identification. Driven by how many signals matched and how strongly they correlate.

Remediation Confidence

How certain the AI is that the suggested fix will resolve the issue. Can be lower even when diagnosis is high — for example, when multiple fix paths exist or the change required is risky.

Score range	Meaning	Recommended action
90–99%	Strong match — multiple correlated signals	High confidence. Verify and act.
75–89%	Good match — clear signals with some uncertainty	Likely correct. Cross-reference one signal first.
60–74%	Partial match — limited context or ambiguous signals	Use as hypothesis. Add more context if possible.
<60%	Weak signal; "escalate": true returned	Human validation required before applying fix.

⚠️

Escalation flag: When remediation_confidence drops below 60, the response includes "escalate": true. The UI shows a red warning and Slack messages include a human escalation notice. Never apply a fix from a response with "escalate": true without senior review.

🔒

Advisory only: OperatorMesh never takes autonomous actions. All outputs are recommendations. You retain full control and responsibility for any infrastructure changes made.
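
The score bands and the escalation rule above can be expressed as a small lookup for use in your own runbooks or dashboards. This is a sketch built only from the table and the escalation callout; the function names `confidence_band` and `needs_escalation` are illustrative.

```python
def confidence_band(score: int) -> tuple:
    """Map a 50-99 confidence score to (meaning, recommended action)."""
    if not 50 <= score <= 99:
        raise ValueError(f"confidence score {score} is outside 50-99")
    if score >= 90:
        return ("Strong match", "High confidence. Verify and act.")
    if score >= 75:
        return ("Good match", "Likely correct. Cross-reference one signal first.")
    if score >= 60:
        return ("Partial match", "Use as hypothesis. Add more context if possible.")
    return ("Weak signal", "Human validation required before applying fix.")


def needs_escalation(remediation_confidence: int) -> bool:
    """Mirror the documented rule: escalate when remediation confidence is below 60."""
    return remediation_confidence < 60
```

Prefer the "escalate" field from the response when it is available; `needs_escalation` is only a local mirror of the documented threshold.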

New in v2.1

Pre-Mortem Scanner

Predict failure modes before you deploy. Paste a git diff, describe a change, or explain what you're shipping — OperatorMesh returns predicted failures, at-risk services, post-deploy monitors, and a rollback trigger. Use "mode": "premortem" in the request body.

What to paste
# Describe the change in plain language — or paste a git diff
deploying: user-service v3.2.0
change: ALTER TABLE users RENAME COLUMN email TO email_address
affected services: auth-service, notification-service
database: PostgreSQL 14, ~4M rows
deploy window: rolling restart

💡

Best results: Include which services consume the changed resource, database size, and whether it's a rolling or full restart. The more deployment context you provide, the more specific the failure predictions.
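
The sample input above can also be assembled programmatically, for example from a CI job. This is a minimal sketch, assuming the change description travels in the same "incident" body field used for triage (the Webhook API section lists no other text field); `describe_change` and `build_premortem_body` are hypothetical helper names.

```python
import json
from typing import Iterable


def describe_change(deploying: str, change: str, affected_services: Iterable[str],
                    database: str = "", deploy_window: str = "") -> str:
    """Lay out deployment context in the "key: value" style of the sample above."""
    lines = [
        f"deploying: {deploying}",
        f"change: {change}",
        f"affected services: {', '.join(affected_services)}",
    ]
    if database:  # optional context, included only when provided
        lines.append(f"database: {database}")
    if deploy_window:
        lines.append(f"deploy window: {deploy_window}")
    return "\n".join(lines)


def build_premortem_body(description: str) -> str:
    """Serialize a request body with "mode" set to "premortem"."""
    if not description.strip():
        raise ValueError("change description must not be empty")
    # Assumption: the description is sent in the "incident" field.
    return json.dumps({"incident": description, "mode": "premortem"})
```

The returned string is the JSON body to POST to the analyze endpoint with the same headers as a triage request.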

Safety score	Label	Recommended action
75–100	Safe	Deploy. Monitor the returned signals post-deploy.
50–74	Caution	Review predicted failures. Consider off-peak deploy window.
25–49	High Risk	Address predicted failures before deploying.
0–24	Do Not Deploy	Stop. At least one High likelihood failure predicted.
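
The score-to-label mapping in the table above is easy to reproduce locally, for instance to gate a deploy pipeline. The sketch below uses only the thresholds from that table; `safety_label_for` and `label_matches_score` are illustrative names.

```python
def safety_label_for(score: int) -> str:
    """Return the label for a 0-100 deploy safety score, per the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"deploy_safety_score {score} is outside 0-100")
    if score >= 75:
        return "Safe"
    if score >= 50:
        return "Caution"
    if score >= 25:
        return "High Risk"
    return "Do Not Deploy"


def label_matches_score(response: dict) -> bool:
    """Check that a pre-mortem response's safety_label agrees with its score."""
    return response.get("safety_label") == safety_label_for(response["deploy_safety_score"])
```

The sample pre-mortem response in the Output format section (score 42, label "High Risk") passes this consistency check.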

Rate limits

Plan	Weekly limit	Rate limit	Max input
Anonymous	3 analyses	15 req/min/IP	8,000 chars
Free (logged in)	10 analyses	15 req/min/IP	8,000 chars
Starter $19/mo	Unlimited	15 req/min/IP	8,000 chars
Pro $49/mo	Unlimited	15 req/min/IP	8,000 chars
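
To stay under these limits from an automated sender, you can throttle and size-check on the client before each request. The sketch below uses a sliding one-minute window, which is an assumption: the algorithm the service uses to enforce 15 req/min is not documented here, and the way it counts characters may differ from Python's `len`. The names `SlidingWindowLimiter` and `check_input_length` are illustrative.

```python
import time
from collections import deque

MAX_INPUT_CHARS = 8000        # "Max input" column above
MAX_REQUESTS_PER_MINUTE = 15  # "Rate limit" column above


def check_input_length(incident: str) -> None:
    """Raise if the input exceeds the documented 8,000-character limit."""
    if len(incident) > MAX_INPUT_CHARS:
        raise ValueError(f"input is {len(incident)} chars; limit is {MAX_INPUT_CHARS}")


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window_seconds` span (client-side only)."""

    def __init__(self, limit: int = MAX_REQUESTS_PER_MINUTE,
                 window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.sent = deque()

    def try_acquire(self) -> bool:
        """Record a request and return True, or return False if the window is full."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            return False
        self.sent.append(now)
        return True
```

Call `try_acquire()` before each request and back off when it returns False. Because the limit is per IP, senders that share an egress address share one budget, so a single process-local limiter may not be enough.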

Support

Common questions

Is my data stored anywhere?

Your raw logs and incident payloads are processed in memory and discarded immediately after analysis. Nothing is written to disk. If you sign in, only structured results (root cause, confidence scores, recommendations) are saved to your private dashboard, and you can delete any analysis at any time.

What AI model powers the triage?

OperatorMesh uses an enterprise-grade large language model, accessed via an official API with a data processing agreement in place. The model is called with a specialized SRE system prompt engineered for incident triage accuracy. For details on data handling, see our Security page.

Can I use it during an active incident?

Yes — that's the primary use case. Paste your logs while the incident is happening. You get a starting hypothesis in seconds, which you verify before acting. It eliminates the first 20–30 minutes of guesswork.

How do I get support?

Email founder@operatormesh.com. Praveen responds personally within 24 hours (Starter) or 4 hours (Pro). For urgent issues, Pro customers get WhatsApp access.