Getting Started

Quick start guide

Get your first AI-powered incident triage running in under 5 minutes. No setup. No agents. No code required for the web demo.


Architecture

How it works

OperatorMesh is stateless by design. Your logs are never stored, never logged, never used for training. Every request is processed and discarded.

1. Input received
You paste raw logs, an alert payload, or a stack trace into the demo, or your monitoring tool fires a webhook. OperatorMesh accepts any plaintext incident data.
2. AI triage engine runs
Your input is sent to Claude Sonnet 4 with a specialized SRE system prompt. The model identifies error patterns, correlates signals, and generates a structured JSON diagnosis.
Zero retention: Input is processed in-memory and never written to disk, database, or logs. Your production data is invisible to us.
3. Structured output returned
Within 3 seconds you receive: root cause, dual confidence scores (diagnosis + remediation), matched signals, 3 ranked actions, escalation flag if fix is uncertain, and what to investigate first.
4. Delivered to Slack (optional)
On Starter and Pro plans, the triage report is automatically posted to your configured Slack channel: formatted, readable, and actionable.

Integration

Slack setup

Connect OperatorMesh to Slack so every production alert automatically triggers an AI triage report in your incident channel.

Available on: Starter ($19/mo) and Pro ($49/mo) plans. Praveen configures this personally during onboarding.
1. Email founder@operatormesh.com
Tell Praveen your Slack workspace and which channel you want reports posted to (e.g. #incidents, #alerts). You'll receive a webhook URL within 2 hours.
2. Configure your monitoring tool
Point your Datadog, PagerDuty, or Grafana alert webhook to the OperatorMesh endpoint. Use POST with a JSON body.
JSON: Webhook payload
{
  "service": "api-gateway",
  "error": "503 upstream timeout after deploy v2.4.1",
  "logs": "upstream connect error or disconnect/reset...",
  "recent_changes": "deployed v2.4.1 at 14:32 UTC",
  "time": "2026-05-01T14:35:00Z",
  "source": "datadog"
}
3. Triage report lands in Slack
Within 3 seconds of your alert firing, your Slack channel receives a formatted incident report.
#incidents: Slack message
🚨 OperatorMesh Incident Report
Service: api-gateway · Source: Datadog

ROOT CAUSE
DB connection pool exhausted: pool_size reduced 10→5 in v2.4.1

CONFIDENCE
87% · deploy-correlated timeout pattern matched

ACTIONS
1. Diff database.yml v2.4.0 vs v2.4.1
2. SHOW STATUS LIKE 'Threads_connected'
3. Raise connection_pool_size to 20

Advisory only · No actions taken · operatormesh.com
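
To make the layout of the sample message concrete, here is a minimal Python sketch that renders a triage result into similar plain text. It is an illustration only: `format_slack_report` is a hypothetical helper, not OperatorMesh's actual formatter, and it assumes the result dictionary uses the field names shown in the Output format section.

```python
def format_slack_report(result: dict, service: str, source: str) -> str:
    """Render a triage result as text shaped like the sample Slack message.

    Hypothetical helper: the real report is produced by OperatorMesh itself.
    """
    # Number the ranked actions 1., 2., 3. in the order they were returned.
    actions = "\n".join(
        f"{rank}. {action}"
        for rank, action in enumerate(result["actions"], start=1)
    )
    return (
        "OperatorMesh Incident Report\n"
        f"Service: {service} · Source: {source}\n\n"
        f"ROOT CAUSE\n{result['root_cause']}\n\n"
        f"CONFIDENCE\n{result['diagnosis_confidence']}% · {result['confidence_reason']}\n\n"
        f"ACTIONS\n{actions}\n\n"
        "Advisory only · No actions taken · operatormesh.com"
    )
```

A formatter like this is mainly useful for previewing reports locally or forwarding results to a channel other than Slack.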

Integrations

All integrations

OperatorMesh connects to your existing monitoring stack; it does not replace it. Think of it as the explanation layer on top of every tool you already use.

๐Ÿ•
Datadog
Coming soon
Native Datadog webhook parser. Fires on monitor alerts, APM traces, and log-based monitors. Auto-extracts service, error, and deployment context.
๐Ÿ“Ÿ
PagerDuty
Coming soon
Trigger triage on incident creation. Triage report auto-posted as a PagerDuty note and to your Slack channel simultaneously.
๐Ÿ“Š
Grafana
Coming soon
Works with Grafana Alerting webhooks. Parses alert labels, annotations, and panel data to provide context-aware triage.
๐Ÿ’ฌ
Slack
Live
Automatic triage delivery to any Slack channel. Formatted report with root cause, confidence, and actions. Available on Starter and Pro.
๐Ÿ””
New Relic
Coming soon
Connect New Relic alert policies. Triage fires on NRQL alert conditions and deployment markers.
๐Ÿ›
Sentry
Coming soon
Issue-level triage from Sentry webhooks. Stack traces, breadcrumbs, and release info all parsed automatically.
โšก
Need an integration now? Email founder@operatormesh.com โ€” Praveen can manually configure any monitoring tool that supports webhooks, usually within 24 hours.

Reference

Webhook API

Send any incident data to OperatorMesh via HTTP POST. Works with any monitoring tool that supports outbound webhooks.

Endpoint
POST https://operatormesh.com/.netlify/functions/analyze
# Required header: x-om-key: mesh_v1
# Body fields: incident (string), mode ("triage"|"premortem"), user_id (optional)
cURL: Triage incident
curl -X POST https://operatormesh.com/.netlify/functions/analyze \
  -H "Content-Type: application/json" \
  -H "x-om-key: mesh_v1" \
  -d '{
    "incident": "service: payment-service\nerror: Connection timeout to postgres:5432\nlogs: FATAL connection refused\nrecent_changes: scaled down DB replicas at 09:15 UTC",
    "mode": "triage",
    "user_id": "optional-your-user-id"
  }'
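
The same call can be sketched in Python using only the standard library. This mirrors the cURL example above (endpoint, `x-om-key` header, and the `incident`, `mode`, and `user_id` body fields); the helper names `build_request` and `analyze` and the 10-second timeout are assumptions made for illustration, not part of an official client.

```python
import json
import urllib.request

ENDPOINT = "https://operatormesh.com/.netlify/functions/analyze"


def build_request(incident: str, mode: str = "triage", user_id=None) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    if mode not in ("triage", "premortem"):
        raise ValueError('mode must be "triage" or "premortem"')
    body = {"incident": incident, "mode": mode}
    if user_id is not None:  # user_id is optional per the endpoint notes
        body["user_id"] = user_id
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-om-key": "mesh_v1"},
        method="POST",
    )


def analyze(incident: str, mode: str = "triage", timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(incident, mode), timeout=timeout) as resp:
        return json.load(resp)
```

Separating `build_request` from `analyze` keeps the payload easy to inspect or log before anything is sent over the network.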

Reference

Input format

The more context you provide, the higher the confidence score. At minimum, include the error message. Ideal input includes all four core fields: service, error, logs, and recent_changes.

| Field | Type | Required | Description |
|---|---|---|---|
| service | string | optional | Service or component name, e.g. "api-gateway", "auth-service" |
| error | string | required | The error message, alert text, or exception. This is the primary signal. |
| logs | string | optional | Raw log lines, stack traces, or additional context. Improves accuracy significantly. |
| recent_changes | string | optional | Recent deploys, config changes, or infra modifications. Critical for deploy-correlated issues. |
| time | ISO 8601 | optional | Incident timestamp. Used for context, not stored. |
| source | string | optional | Origin tool, e.g. "datadog", "pagerduty", "manual" |

Pro tip: Including recent_changes typically increases confidence score by 15–25%. Deploy-correlated incidents are the most common production failure pattern.
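
The cURL example in the Webhook API section passes these fields as `name: value` lines inside a single `incident` string. A minimal sketch of that flattening step follows. The helper `build_incident` is hypothetical, and the line layout is an assumption copied from that example; the documentation does not state that the service requires this exact layout.

```python
# Field names from the table above, in the order they are listed.
FIELD_ORDER = ("service", "error", "logs", "recent_changes", "time", "source")


def build_incident(fields: dict) -> str:
    """Flatten structured fields into one plaintext incident string."""
    if not fields.get("error"):
        # "error" is the only field the table marks as required.
        raise ValueError("the 'error' field is required")
    return "\n".join(
        f"{name}: {fields[name]}" for name in FIELD_ORDER if fields.get(name)
    )
```

Empty or missing optional fields are skipped, so the resulting string contains only the context you actually have.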

Reference

Output format

Every analysis returns a structured JSON object. All fields listed for the requested mode are always present.

JSON response: Triage mode
{
  "mode": "triage",
  "root_cause": "DB connection pool exhausted after pool_size reduced in v2.4.1",
  "diagnosis_confidence": 91,
  "remediation_confidence": 84,
  "confidence_reason": "Deploy-correlated timeout with pool exhaustion pattern matched",
  "remediation_reason": "Fix is straightforward: pool config is the only variable",
  "escalate": false,
  "signals": [
    { "match": "upstream timeout", "desc": "Gateway cannot reach upstream service" },
    { "match": "deploy correlation", "desc": "Symptoms appeared immediately after v2.4.1 deploy" },
    { "match": "connection pool", "desc": "Pool size config change detected in logs" }
  ],
  "rejected_hypotheses": [
    "DNS outage: DNS resolves correctly across all nodes",
    "Backend crash: upstream service returning 200s",
    "Network partition: internal services reachable"
  ],
  "missing_signals": [
    "Deploy diff for database.yml between v2.4.0 and v2.4.1",
    "DB connection pool metrics from last 30 minutes"
  ],
  "actions": [
    "Diff database.yml between v2.4.0 and v2.4.1",
    "Run SHOW STATUS LIKE 'Threads_connected' on primary DB",
    "Raise connection_pool_size to 20 and restart app servers"
  ],
  "investigate": "Check database.yml in v2.4.1 diff: pool_size is the primary suspect",
  "time_saved": "30-40 minutes",
  "plan": "free"
}
JSON response: Pre-Mortem mode
{
  "mode": "premortem",
  "deploy_safety_score": 42,
  "safety_label": "High Risk",
  "verdict": "Column rename without backward compatibility will break dependent services",
  "predicted_failures": [
    { "failure": "auth-service crashes on login", "likelihood": "High", "impact": "All logins fail", "trigger": "email field missing" }
  ],
  "at_risk_services": ["auth-service", "notification-service"],
  "post_deploy_monitors": [
    "Watch auth-service error rate for 5xx spikes",
    "Monitor notification delivery success rate",
    "Check DB query errors for column not found"
  ],
  "rollback_trigger": "Any 5xx rate above 1% within 10 minutes of deploy"
}
| Field | Mode | Type | Description |
|---|---|---|---|
| mode | both | string | "triage" or "premortem" |
| root_cause | triage | string | One clear sentence explaining the most likely cause |
| diagnosis_confidence | triage | integer | 50–99. How certain the root cause identification is |
| remediation_confidence | triage | integer | 50–99. How certain it is that the suggested fix will work |
| confidence_reason | triage | string | Why this diagnosis confidence level was assigned |
| remediation_reason | triage | string | Why the fix is certain or uncertain |
| escalate | triage | boolean | true if remediation_confidence < 60; human validation required |
| signals | triage | array[3] | Exactly 3 matched signals, each with a name (match) and an explanation (desc) |
| rejected_hypotheses | triage | array[3] | 3 alternative causes that were considered and eliminated; shows reasoning transparency |
| missing_signals | triage | array[1-3] | Specific logs or metrics that would increase confidence if available |
| actions | triage | array[3] | Exactly 3 ranked next actions, specific and executable |
| investigate | triage | string | The single highest-priority thing to check first |
| time_saved | triage | string | Estimated manual triage time this replaces |
| deploy_safety_score | premortem | integer | 0–100. Overall deploy risk score. 100 = very safe. |
| safety_label | premortem | string | One of "Safe", "Caution", "High Risk", "Do Not Deploy" |
| verdict | premortem | string | One-sentence plain-language risk assessment |
| predicted_failures | premortem | array[3] | Predicted failure modes with likelihood, impact, and trigger |
| at_risk_services | premortem | array | Services likely to be affected by this deploy |
| post_deploy_monitors | premortem | array[3] | Specific metrics/logs to watch immediately after deploy |
| rollback_trigger | premortem | string | Exact condition that should trigger immediate rollback |
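
If you consume the response programmatically, a client-side check against the table above can catch unexpected shapes early. The following is a sketch, not an official schema: `validate_triage` and `TRIAGE_FIELDS` are hypothetical names, and the rules are simply the types, ranges, and array sizes listed for triage mode.

```python
# Expected Python type for each triage-mode field in the table above.
TRIAGE_FIELDS = {
    "mode": str,
    "root_cause": str,
    "diagnosis_confidence": int,
    "remediation_confidence": int,
    "confidence_reason": str,
    "remediation_reason": str,
    "escalate": bool,
    "signals": list,
    "rejected_hypotheses": list,
    "missing_signals": list,
    "actions": list,
    "investigate": str,
    "time_saved": str,
}


def validate_triage(resp: dict) -> list:
    """Return a list of problems; an empty list means the response looks valid."""
    problems = []
    for name, expected in TRIAGE_FIELDS.items():
        if name not in resp:
            problems.append(f"missing field: {name}")
        elif not isinstance(resp[name], expected):
            problems.append(f"wrong type for {name}")
    if problems:
        return problems  # skip range checks when the shape is already wrong

    for name in ("diagnosis_confidence", "remediation_confidence"):
        if not 50 <= resp[name] <= 99:
            problems.append(f"{name} outside 50-99")
    for name in ("signals", "rejected_hypotheses", "actions"):
        if len(resp[name]) != 3:
            problems.append(f"{name} must have exactly 3 items")
    if not 1 <= len(resp["missing_signals"]) <= 3:
        problems.append("missing_signals must have 1 to 3 items")
    # The table defines escalate as true exactly when remediation_confidence < 60.
    if resp["escalate"] != (resp["remediation_confidence"] < 60):
        problems.append("escalate does not match remediation_confidence")
    return problems
```

Fields that are not in the table, such as `plan` in the sample response, are ignored rather than rejected.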

Reference

Confidence scores

OperatorMesh returns two separate confidence scores. These are different: a clear root cause may still have an uncertain fix. Neither is ever 100%; OperatorMesh is advisory only.

Diagnosis Confidence

How certain the AI is about the root cause identification. Driven by how many signals matched and how strongly they correlate.

Remediation Confidence

How certain the AI is that the suggested fix will resolve the issue. Can be lower even when diagnosis is high, for example when multiple fix paths exist or the change required is risky.

| Score range | Meaning | Recommended action |
|---|---|---|
| 90–99% | Strong match: multiple correlated signals | High confidence. Verify and act. |
| 75–89% | Good match: clear signals with some uncertainty | Likely correct. Cross-reference one signal first. |
| 60–74% | Partial match: limited context or ambiguous signals | Use as hypothesis. Add more context if possible. |
| <60% | Weak signal: "escalate": true returned | Human validation required before applying fix. |

Escalation flag: When remediation_confidence drops below 60, the response includes "escalate": true. The UI shows a red warning and Slack messages include a human escalation notice. Never apply a fix with "escalate": true without senior review.

Advisory only: OperatorMesh never takes autonomous actions. All outputs are recommendations. You retain full control and responsibility for any infrastructure changes made.
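
The score bands and the escalation rule can be expressed as a small lookup, which is handy if you route reports differently by confidence. This is a sketch of the table above; `confidence_band` and `needs_senior_review` are hypothetical helper names, not part of the API.

```python
def confidence_band(score: int) -> tuple:
    """Map a 50-99 confidence score to (meaning, recommended action)."""
    if not 50 <= score <= 99:
        raise ValueError("confidence scores are documented as 50-99")
    if score >= 90:
        return ("Strong match", "High confidence. Verify and act.")
    if score >= 75:
        return ("Good match", "Likely correct. Cross-reference one signal first.")
    if score >= 60:
        return ("Partial match", "Use as hypothesis. Add more context if possible.")
    return ("Weak signal", "Human validation required before applying fix.")


def needs_senior_review(resp: dict) -> bool:
    """True when the response is flagged, or the remediation score is below 60."""
    return bool(resp.get("escalate")) or resp["remediation_confidence"] < 60
```

Checking both the flag and the raw score is deliberately redundant, so a review is still requested if either one is present without the other.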

New in v2.1

Pre-Mortem Scanner

Predict failure modes before you deploy. Paste a git diff, describe a change, or explain what you're shipping. OperatorMesh returns predicted failures, at-risk services, post-deploy monitors, and a rollback trigger. Use "mode": "premortem" in the request body.

What to paste
# Describe the change in plain language, or paste a git diff
deploying: user-service v3.2.0
change: ALTER TABLE users RENAME COLUMN email TO email_address
affected services: auth-service, notification-service
database: PostgreSQL 14, ~4M rows
deploy window: rolling restart
Best results: Include which services consume the changed resource, database size, and whether it's a rolling or full restart. The more deployment context you provide, the more specific the failure predictions.
| Safety score | Label | Recommended action |
|---|---|---|
| 75–100 | Safe | Deploy. Monitor the returned signals post-deploy. |
| 50–74 | Caution | Review predicted failures. Consider off-peak deploy window. |
| 25–49 | High Risk | Address predicted failures before deploying. |
| 0–24 | Do Not Deploy | Stop. At least one High likelihood failure predicted. |
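
As a sketch of how the safety score bands might be used in a deploy pipeline, the snippet below maps a score to its label and applies a simple gate. The `safety_label` thresholds come from the table above; `may_deploy` is a hypothetical policy (blocking anything below 50) that you would adjust to your own risk tolerance.

```python
def safety_label(score: int) -> str:
    """Map a 0-100 deploy_safety_score to its documented label."""
    if not 0 <= score <= 100:
        raise ValueError("deploy_safety_score is documented as 0-100")
    if score >= 75:
        return "Safe"
    if score >= 50:
        return "Caution"
    if score >= 25:
        return "High Risk"
    return "Do Not Deploy"


def may_deploy(resp: dict, minimum_score: int = 50) -> bool:
    """Hypothetical gate: allow only "Safe" and "Caution" results by default."""
    return resp["deploy_safety_score"] >= minimum_score
```

With the sample Pre-Mortem response (score 42), `safety_label` gives "High Risk" and the default gate blocks the deploy.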

Reference

Rate limits

| Plan | Weekly limit | Rate limit | Max input |
|---|---|---|---|
| Anonymous | 3 analyses | 15 req/min/IP | 8,000 chars |
| Free (logged in) | 10 analyses | 15 req/min/IP | 8,000 chars |
| Starter $19/mo | Unlimited | 15 req/min/IP | 8,000 chars |
| Pro $49/mo | Unlimited | 15 req/min/IP | 8,000 chars |
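
A client can stay inside these limits by checking input length and pacing its own requests. The sketch below assumes a sliding 60-second window, which is a guess: the table does not say how the server counts the 15 requests per minute. `MinuteThrottle` and `check_input` are hypothetical names.

```python
import time
from collections import deque

MAX_INPUT_CHARS = 8000
MAX_REQUESTS_PER_MINUTE = 15


def check_input(incident: str) -> str:
    """Reject input that exceeds the documented 8,000-character maximum."""
    if len(incident) > MAX_INPUT_CHARS:
        raise ValueError(f"incident is {len(incident)} chars; max is {MAX_INPUT_CHARS}")
    return incident


class MinuteThrottle:
    """Client-side sliding-window limiter (assumed window semantics)."""

    def __init__(self, limit=MAX_REQUESTS_PER_MINUTE, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sent = deque()  # timestamps of requests still inside the window

    def wait_time(self) -> float:
        """Seconds to wait before another request fits in the window."""
        now = self.clock()
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()  # drop timestamps that have aged out
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self.sent.append(self.clock())
```

Typical use is `time.sleep(throttle.wait_time())` before each call, followed by `throttle.record()`. Note that the weekly analysis limit on Anonymous and Free plans is separate and is not tracked here.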

Support

Common questions

Is my data stored anywhere?
Your raw logs and incident payloads are never stored after analysis; they are processed in memory and discarded immediately. Nothing is written to disk. If you sign in, only structured results (root cause, confidence score, recommendations) are saved to your private dashboard. You have full control to delete any analysis at any time. Raw log data never persists on our infrastructure.
What AI model powers the triage?
Claude Sonnet 4 by Anthropic, accessed via the official API. The model is called with a specialized SRE system prompt engineered for incident triage accuracy. Anthropic's data processing agreement applies.
Can I use it during an active incident?
Yes, that's the primary use case. Paste your logs while the incident is happening. You get a starting hypothesis in 3 seconds, which you verify before acting. It eliminates the first 20–30 minutes of guesswork.
How do I get support?
Email founder@operatormesh.com. Praveen responds personally within 24 hours (Starter) or 4 hours (Pro). For urgent issues, Pro customers get WhatsApp access.