Quick start guide
Get your first AI-powered incident triage running in under 5 minutes. No setup. No agents. No code required for the web demo.
How it works
OperatorMesh is stateless by design. Your logs are never stored, never logged, never used for training. Every request is processed and discarded.
Slack setup
Connect OperatorMesh to Slack so every production alert automatically triggers an AI triage report in your incident channel.
{
"service": "api-gateway",
"error": "503 upstream timeout after deploy v2.4.1",
"logs": "upstream connect error or disconnect/reset...",
"recent_changes": "deployed v2.4.1 at 14:32 UTC",
"time": "2026-05-01T14:35:00Z",
"source": "datadog"
}
Service: api-gateway · Source: Datadog
ROOT CAUSE
DB connection pool exhausted: pool_size reduced 10→5 in v2.4.1
CONFIDENCE
87% · deploy-correlated timeout pattern matched
ACTIONS
1. Diff database.yml v2.4.0 vs v2.4.1
2. SHOW STATUS LIKE 'Threads_connected'
3. Raise connection_pool_size to 20
Advisory only · No actions taken · operatormesh.com
All integrations
OperatorMesh connects to your existing monitoring stack; it does not replace it. Think of it as the explanation layer on top of every tool you already use.
Webhook API
Send any incident data to OperatorMesh via HTTP POST. Works with any monitoring tool that supports outbound webhooks.
POST https://operatormesh.com/.netlify/functions/analyze
# Required header: x-om-key: mesh_v1
# Body fields: incident (string), mode ("triage"|"premortem"), user_id (optional)
curl -X POST https://operatormesh.com/.netlify/functions/analyze \
  -H "Content-Type: application/json" \
  -H "x-om-key: mesh_v1" \
  -d '{
    "incident": "service: payment-service\nerror: Connection timeout to postgres:5432\nlogs: FATAL connection refused\nrecent_changes: scaled down DB replicas at 09:15 UTC",
    "mode": "triage",
    "user_id": "optional-your-user-id"
  }'
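The same request can be sketched in Python with only the standard library. The endpoint, the x-om-key header, and the body fields come from the curl example above; the helper names build_request and analyze and the 30-second timeout are illustrative assumptions, not part of the OperatorMesh API.

```python
import json
import urllib.request

# Endpoint and header value as given in the curl example.
ANALYZE_URL = "https://operatormesh.com/.netlify/functions/analyze"


def build_request(incident, mode="triage", user_id=None):
    """Build (but do not send) a POST request for the analyze endpoint."""
    if mode not in ("triage", "premortem"):
        raise ValueError("mode must be 'triage' or 'premortem'")
    body = {"incident": incident, "mode": mode}
    if user_id is not None:
        body["user_id"] = user_id  # optional field
    return urllib.request.Request(
        ANALYZE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-om-key": "mesh_v1"},
        method="POST",
    )


def analyze(incident, mode="triage", user_id=None, timeout=30):
    """Send the request and return the decoded JSON report (timeout is an assumption)."""
    req = build_request(incident, mode, user_id)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling analyze("error: Connection timeout to postgres:5432") would then return the report as a Python dict. Building and sending are kept separate so the payload can be inspected without touching the network.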
Input format
The more context you provide, the higher the confidence score. At minimum, include the error message. Ideal input includes the four core fields: service, error, logs, and recent_changes.
| Field | Type | Required | Description |
|---|---|---|---|
| service | string | optional | Service or component name. e.g. "api-gateway", "auth-service" |
| error | string | required | The error message, alert text, or exception. This is the primary signal. |
| logs | string | optional | Raw log lines, stack traces, or additional context. Improves accuracy significantly. |
| recent_changes | string | optional | Recent deploys, config changes, or infra modifications. Critical for deploy-correlated issues. |
| time | ISO 8601 | optional | Incident timestamp. Used for context, not stored. |
| source | string | optional | Origin tool. e.g. "datadog", "pagerduty", "manual" |
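The webhook body carries a single incident string, and the curl example packs these fields into it as "name: value" lines. A minimal sketch of that packing follows, assuming the line-per-field layout from the curl example is what the API expects. The helper name format_incident is hypothetical.

```python
# Field names from the input format table, in the order they are listed there.
FIELD_ORDER = ("service", "error", "logs", "recent_changes", "time", "source")


def format_incident(**fields):
    """Join the supplied fields into 'name: value' lines; 'error' is required."""
    if not fields.get("error"):
        raise ValueError("'error' is required")
    unknown = set(fields) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # Skip fields that were not supplied or are empty.
    lines = [f"{name}: {fields[name]}" for name in FIELD_ORDER if fields.get(name)]
    return "\n".join(lines)


incident = format_incident(
    service="payment-service",
    error="Connection timeout to postgres:5432",
    logs="FATAL connection refused",
    recent_changes="scaled down DB replicas at 09:15 UTC",
)
print(incident)
```

Rejecting unknown field names is a design choice of this sketch, meant to catch typos such as recent_change before the request is sent.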
recent_changes typically increases the confidence score by 15–25%. Deploy-correlated incidents are the most common production failure pattern.
Output format
Every triage returns a structured JSON object. All fields are always present.
{
"mode": "triage",
"root_cause": "DB connection pool exhausted after pool_size reduced in v2.4.1",
"diagnosis_confidence": 91,
"remediation_confidence": 84,
"confidence_reason": "Deploy-correlated timeout with pool exhaustion pattern matched",
"remediation_reason": "Fix is straightforward: pool config is the only variable",
"escalate": false,
"signals": [
{ "match": "upstream timeout", "desc": "Gateway cannot reach upstream service" },
{ "match": "deploy correlation", "desc": "Symptoms appeared immediately after v2.4.1 deploy" },
{ "match": "connection pool", "desc": "Pool size config change detected in logs" }
],
"rejected_hypotheses": [
"DNS outage: DNS resolves correctly across all nodes",
"Backend crash: upstream service returning 200s",
"Network partition: internal services reachable"
],
"missing_signals": [
"Deploy diff for database.yml between v2.4.0 and v2.4.1",
"DB connection pool metrics from last 30 minutes"
],
"actions": [
"Diff database.yml between v2.4.0 and v2.4.1",
"Run SHOW STATUS LIKE 'Threads_connected' on primary DB",
"Raise connection_pool_size to 20 and restart app servers"
],
"investigate": "Check database.yml in v2.4.1 diff: pool_size is the primary suspect",
"time_saved": "30-40 minutes",
"plan": "free"
}
{
"mode": "premortem",
"deploy_safety_score": 42,
"safety_label": "High Risk",
"verdict": "Column rename without backward compatibility will break dependent services",
"predicted_failures": [
{ "failure": "auth-service crashes on login", "likelihood": "High", "impact": "All logins fail", "trigger": "email field missing" }
],
"at_risk_services": ["auth-service", "notification-service"],
"post_deploy_monitors": [
"Watch auth-service error rate for 5xx spikes",
"Monitor notification delivery success rate",
"Check DB query errors for column not found"
],
"rollback_trigger": "Any 5xx rate above 1% within 10 minutes of deploy"
}
| Field | Mode | Type | Description |
|---|---|---|---|
| mode | both | string | "triage" or "premortem" |
| root_cause | triage | string | One clear sentence explaining the most likely cause |
| diagnosis_confidence | triage | integer | 50–99. How certain the root cause identification is |
| remediation_confidence | triage | integer | 50–99. How certain it is that the suggested fix will work |
| confidence_reason | triage | string | Why this diagnosis confidence level was assigned |
| remediation_reason | triage | string | Why the fix is certain or uncertain |
| escalate | triage | boolean | true if remediation_confidence < 60; human validation required |
| signals | triage | array[3] | Exactly 3 matched signals with name and explanation |
| rejected_hypotheses | triage | array[3] | 3 alternative causes that were considered and eliminated, listed for reasoning transparency |
| missing_signals | triage | array[1-3] | Specific logs or metrics that would increase confidence if available |
| actions | triage | array[3] | Exactly 3 ranked next actions โ specific and executable |
| investigate | triage | string | The single highest-priority thing to check first |
| time_saved | triage | string | Estimated manual triage time this replaces |
| deploy_safety_score | premortem | integer | 0–100. Overall deploy risk score. 100 = very safe. |
| safety_label | premortem | string | "Safe", "Caution", "High Risk", or "Do Not Deploy" |
| verdict | premortem | string | One-sentence plain-language risk assessment |
| predicted_failures | premortem | array[3] | Predicted failure modes with likelihood, impact, and trigger |
| at_risk_services | premortem | array | Services likely to be affected by this deploy |
| post_deploy_monitors | premortem | array[3] | Specific metrics/logs to watch immediately after deploy |
| rollback_trigger | premortem | string | Exact condition that should trigger immediate rollback |
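The constraints in the table (score ranges, fixed array lengths, the escalate rule) can be checked on the client before a report is trusted. The following is a minimal sketch for triage reports only. The function name check_triage_report is hypothetical, and the checks mirror the table rather than any published schema.

```python
def check_triage_report(report):
    """Return a list of problems; an empty list means the report matches the table."""
    problems = []
    if report.get("mode") != "triage":
        return ["mode is not 'triage'"]
    # Both confidence scores are documented as integers in the range 50-99.
    for key in ("diagnosis_confidence", "remediation_confidence"):
        value = report.get(key)
        if not isinstance(value, int) or not 50 <= value <= 99:
            problems.append(f"{key} outside 50-99: {value!r}")
    # These arrays are documented as holding exactly 3 entries.
    for key in ("signals", "rejected_hypotheses", "actions"):
        if len(report.get(key, [])) != 3:
            problems.append(f"{key} should have exactly 3 entries")
    if not 1 <= len(report.get("missing_signals", [])) <= 3:
        problems.append("missing_signals should have 1 to 3 entries")
    # escalate is documented as true exactly when remediation_confidence < 60.
    rc = report.get("remediation_confidence")
    if isinstance(rc, int) and report.get("escalate") != (rc < 60):
        problems.append("escalate does not match remediation_confidence < 60")
    return problems


# Abbreviated version of the triage example above.
sample = {
    "mode": "triage",
    "diagnosis_confidence": 91,
    "remediation_confidence": 84,
    "escalate": False,
    "signals": [{"match": "upstream timeout"}, {"match": "deploy correlation"},
                {"match": "connection pool"}],
    "rejected_hypotheses": ["DNS outage", "Backend crash", "Network partition"],
    "missing_signals": ["Deploy diff for database.yml", "DB connection pool metrics"],
    "actions": ["Diff database.yml", "Run SHOW STATUS", "Raise connection_pool_size"],
}
print(check_triage_report(sample))  # prints []
```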
Confidence scores
OperatorMesh returns two separate confidence scores. They measure different things: a clear root cause may still have an uncertain fix. Neither is ever 100%, because OperatorMesh is advisory only.
Diagnosis Confidence
How certain the AI is about the root cause identification. Driven by how many signals matched and how strongly they correlate.
Remediation Confidence
How certain the AI is that the suggested fix will resolve the issue. This can be lower even when diagnosis confidence is high, for example when multiple fix paths exist or the required change is risky.
| Score range | Meaning | Recommended action |
|---|---|---|
| 90–99% | Strong match: multiple correlated signals | High confidence. Verify and act. |
| 75–89% | Good match: clear signals with some uncertainty | Likely correct. Cross-reference one signal first. |
| 60–74% | Partial match: limited context or ambiguous signals | Use as hypothesis. Add more context if possible. |
| <60% | Weak signal; "escalate": true is returned | Human validation required before applying fix. |
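The bands above map directly onto a small lookup, which can be useful when routing reports in your own tooling. This is a sketch only: the function names confidence_band and needs_human_review are hypothetical, and the thresholds are copied from the table.

```python
def confidence_band(score):
    """Map a confidence score to the (meaning, recommended action) pair from the table."""
    if not 0 <= score <= 99:
        raise ValueError("confidence scores never reach 100")
    if score >= 90:
        return ("Strong match", "High confidence. Verify and act.")
    if score >= 75:
        return ("Good match", "Likely correct. Cross-reference one signal first.")
    if score >= 60:
        return ("Partial match", "Use as hypothesis. Add more context if possible.")
    return ("Weak signal", "Human validation required before applying fix.")


def needs_human_review(report):
    """True when the report is flagged for escalation or remediation confidence is below 60."""
    return bool(report.get("escalate")) or report["remediation_confidence"] < 60


meaning, action = confidence_band(87)
print(meaning, "->", action)
```

Checking both the escalate flag and the raw score in needs_human_review is deliberately redundant, so a report is still held back if either signal is present.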
When remediation_confidence drops below 60, the response includes "escalate": true. The UI shows a red warning and Slack messages include a human escalation notice. Never apply a fix with "escalate": true without senior review.
Pre-Mortem Scanner
Predict failure modes before you deploy. Paste a git diff, describe a change, or explain what you're shipping. OperatorMesh returns predicted failures, at-risk services, post-deploy monitors, and a rollback trigger. Use "mode": "premortem" in the request body.
# Describe the change in plain language, or paste a git diff
deploying: user-service v3.2.0
change: ALTER TABLE users RENAME COLUMN email TO email_address
affected services: auth-service, notification-service
database: PostgreSQL 14, ~4M rows
deploy window: rolling restart
| Safety score | Label | Recommended action |
|---|---|---|
| 75–100 | Safe | Deploy. Monitor the returned signals post-deploy. |
| 50–74 | Caution | Review predicted failures. Consider off-peak deploy window. |
| 25–49 | High Risk | Address predicted failures before deploying. |
| 0–24 | Do Not Deploy | Stop. At least one High likelihood failure predicted. |
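One way to use these bands is as a gate in a deploy pipeline. The sketch below reproduces the score-to-label mapping from the table and adds a hypothetical should_block_deploy helper. The default cut-off of 50 is an assumption chosen to match the "address predicted failures before deploying" guidance, not an OperatorMesh setting.

```python
def safety_label(score):
    """Map a deploy_safety_score (0-100) to the label from the table above."""
    if not 0 <= score <= 100:
        raise ValueError("deploy_safety_score must be between 0 and 100")
    if score >= 75:
        return "Safe"
    if score >= 50:
        return "Caution"
    if score >= 25:
        return "High Risk"
    return "Do Not Deploy"


def should_block_deploy(report, min_score=50):
    """Hypothetical CI gate: block when the score falls below min_score (assumed default)."""
    return report["deploy_safety_score"] < min_score


report = {"mode": "premortem", "deploy_safety_score": 42}
print(safety_label(report["deploy_safety_score"]), should_block_deploy(report))
```

For the premortem example earlier in this guide (score 42), this yields "High Risk" and a blocked deploy.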
Rate limits
| Plan | Weekly limit | Rate limit | Max input |
|---|---|---|---|
| Anonymous | 3 analyses | 15 req/min/IP | 8,000 chars |
| Free (logged in) | 10 analyses | 15 req/min/IP | 8,000 chars |
| Starter $19/mo | Unlimited | 15 req/min/IP | 8,000 chars |
| Pro $49/mo | Unlimited | 15 req/min/IP | 8,000 chars |
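A client can stay under these limits by throttling itself before sending. The sketch below is a client-side sliding-window limiter for 15 requests per minute, plus a length check against the 8,000-character maximum. How the server actually counts requests and characters is not documented here, so both the sliding window and the use of len() are assumptions. The class and function names are hypothetical.

```python
import time
from collections import deque

MAX_INPUT_CHARS = 8000  # "Max input" column in the table above


class MinuteWindowLimiter:
    """Allow at most `limit` calls in any `window`-second span (client side only)."""

    def __init__(self, limit=15, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.sent = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self):
        """Note that a request was just sent."""
        self.sent.append(self.clock())


def fits_input_limit(incident):
    """True if the incident text is within the documented character limit."""
    return len(incident) <= MAX_INPUT_CHARS
```

A typical loop would call time.sleep(limiter.wait_time()) before each request and limiter.record() after sending it.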