Real-time alert delivery to external systems. NO competitor has webhook integration.
| Platform | Endpoint | Triggers | Status | Last Fired |
|---|---|---|---|---|
| Slack | #ops-alerts | Critical, Warning | ✓ Active | 2 min ago |
| Discord | #mission-control | Critical only | ✓ Active | 47 min ago |
| PagerDuty | SRE Escalation | Critical + Escalated | ✓ Active | 3 hours ago |
| Email | ops@gilchrist.research | All alerts | ✓ Active | 12 min ago |
Example webhook payload:

```json
{
  "alert_id": "alert-cpu-spike-1234",
  "timestamp": "2026-02-20T13:42:00Z",
  "severity": "critical",
  "metric": "cpu_usage",
  "value": 94.7,
  "threshold": 85.0,
  "source": "armada.gilchrist.research",
  "message": "CPU usage exceeded threshold",
  "escalated": false,
  "escalation_count": 0,
  "muted": false,
  "conditions_met": ["cpu > 85", "duration > 5min"]
}
```
See also: Alert System v1, Event Log v2, Operational SLAs
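Delivery to these endpoints is where the retry logic and delivery tracking mentioned at the bottom of this page live. A minimal retry sketch, assuming an injected `send` callable standing in for the real HTTP POST; `deliver_with_retry` and `flaky_send` are illustrative names, not the shipped API:

```python
import json
import time

def deliver_with_retry(send, payload, max_attempts=3, backoff_s=1.0):
    """Try to deliver an alert payload, retrying with exponential backoff.

    `send` is any callable that takes the JSON body and returns an HTTP
    status code; real delivery would wrap an HTTP POST to the webhook URL.
    """
    body = json.dumps(payload)
    for attempt in range(max_attempts):
        status = send(body)
        if 200 <= status < 300:
            return True  # delivered
        if attempt < max_attempts - 1:
            time.sleep(backoff_s * (2 ** attempt))  # backoff: 1x, 2x, 4x, ...
    return False  # exhausted retries; caller should record for delivery tracking

# Usage: a fake sender that fails once (503), then succeeds (200).
attempts = []
def flaky_send(body):
    attempts.append(body)
    return 503 if len(attempts) == 1 else 200

payload = {"alert_id": "alert-cpu-spike-1234", "severity": "critical"}
ok = deliver_with_retry(flaky_send, payload, backoff_s=0.01)
```

Injecting the sender keeps the retry policy testable without a live endpoint.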
Automatic severity escalation when conditions persist or worsen.
| Rule | Trigger | Action | Escalations (7d) |
|---|---|---|---|
| Warning → Critical | Persists > 15 min | Escalate + PagerDuty | 23 escalations |
| Critical → P1 Incident | Persists > 30 min | Page on-call + Create incident | 4 incidents |
| Repeated Warnings | 3+ in 1 hour | Escalate to critical | 12 escalations |
| Multi-Service Impact | 2+ services down | Auto-escalate + All hands | 1 escalation |
7-day escalation timeline (warning→critical in orange, critical→P1 in red)
See also: Alert Templates, Alert Analytics
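The first three escalation policies above can be sketched as a pure function over alert state. A minimal sketch with one shared persistence clock (a simplification; `AlertState`, `escalate`, and the thresholds mirror the table but are not the shipped implementation, and the multi-service rule is omitted):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AlertState:
    severity: str            # "warning" | "critical" | "p1"
    persisted_min: float     # minutes the triggering condition has held
    warnings_last_hour: int  # repeated-warning counter

def escalate(state: AlertState) -> AlertState:
    """Apply the escalation rules in order."""
    sev = state.severity
    # Warning -> Critical: persists > 15 min, or 3+ warnings in one hour
    if sev == "warning" and (state.persisted_min > 15 or state.warnings_last_hour >= 3):
        sev = "critical"
    # Critical -> P1 incident: persists > 30 min (would also page on-call)
    if sev == "critical" and state.persisted_min > 30:
        sev = "p1"
    return replace(state, severity=sev)
```

Keeping the rules in a pure function makes the 7-day escalation counts above cheap to recompute from the event log.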
Complex alert logic with AND/OR conditions. Precision alerting that sharply cuts false positives.
| Alert Name | Conditions | Logic | Fired (7d) |
|---|---|---|---|
| Database Overload | CPU > 85% AND Query Time > 500ms | AND | 12 times |
| Service Degradation | Latency > 1s OR Error Rate > 5% | OR | 34 times |
| Memory Leak Detection | Memory > 80% AND Growth Rate > 2%/min AND Duration > 10min | AND (3 conditions) | 3 times |
| Critical Resource Exhaustion | (CPU > 90% OR Memory > 90%) AND Swap > 50% | Complex (nested) | 7 times |
Example multi-condition expressions:

- `cpu > 85 AND duration > 5min` — CPU spike that persists
- `error_rate > 5% OR latency > 1s` — Service degradation (any cause)
- `(cpu > 90 OR memory > 90) AND disk < 10%` — Resource crisis
- `deployment.status == "failed" AND rollback.available == false` — Deployment disaster

See also: Alert Templates, Metrics Catalog
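Nested AND/OR logic like the expressions above evaluates naturally as a recursive tree walk. A minimal sketch; the tuple-based tree shape and `evaluate` are assumptions for illustration, not the actual stored rule schema:

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "==": operator.eq}

def evaluate(cond, metrics):
    """Recursively evaluate a condition tree against a metrics snapshot.

    Inner nodes look like ("AND"|"OR", child, child, ...);
    leaves are (metric_name, op, threshold) triples.
    """
    if cond[0] == "AND":
        return all(evaluate(c, metrics) for c in cond[1:])
    if cond[0] == "OR":
        return any(evaluate(c, metrics) for c in cond[1:])
    metric, op, threshold = cond
    return OPS[op](metrics[metric], threshold)

# "Critical Resource Exhaustion": (CPU > 90 OR Memory > 90) AND Swap > 50
rule = ("AND", ("OR", ("cpu", ">", 90), ("memory", ">", 90)), ("swap", ">", 50))
fired = evaluate(rule, {"cpu": 94.7, "memory": 62.0, "swap": 55.0})  # True
```

Because the tree is plain data, arbitrary nesting (the "Complex (nested)" row above) falls out of the recursion for free.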
Temporary alert suppression with reason tracking. Prevent alert fatigue during maintenance.
| Alert | Muted Until | Reason | Muted By |
|---|---|---|---|
| CPU High on Workhorse | 2026-02-20 15:00 (1h 20m) | Planned model training run | brandon |
| Disk Usage Warning | 2026-02-20 18:00 (4h 20m) | Archive job in progress, will clean up after | operator |
| Network Latency Spike | 2026-02-20 14:30 (50m) | ISP maintenance window (scheduled) | brandon |
Daily mute count (blue = planned maintenance, orange = incident response, red = alert fatigue)
See also: Alert System v1, Maintenance Windows
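Mute bookkeeping reduces to an expiry map keyed by alert name, with the reason and author kept for the audit trail. A minimal sketch; `MuteRegistry` and its methods are illustrative, not the shipped API (the clock is injected so the example is deterministic):

```python
from datetime import datetime, timedelta

class MuteRegistry:
    """Track temporary mutes with reason and author; expire them lazily."""

    def __init__(self):
        self._mutes = {}  # alert name -> (muted_until, reason, muted_by)

    def mute(self, alert, minutes, reason, by, now=None):
        now = now or datetime.utcnow()
        self._mutes[alert] = (now + timedelta(minutes=minutes), reason, by)

    def is_muted(self, alert, now=None):
        now = now or datetime.utcnow()
        entry = self._mutes.get(alert)
        if entry and now < entry[0]:
            return True
        self._mutes.pop(alert, None)  # expired (or never muted): drop it
        return False

# Usage, mirroring the first row of the table above.
reg = MuteRegistry()
t0 = datetime(2026, 2, 20, 13, 40)
reg.mute("CPU High on Workhorse", 80, "Planned model training run", "brandon", now=t0)
muted_now = reg.is_muted("CPU High on Workhorse", now=t0 + timedelta(minutes=10))
muted_later = reg.is_muted("CPU High on Workhorse", now=t0 + timedelta(minutes=90))
```

Lazy expiry means no background timer is needed: the mute simply stops matching once its deadline passes.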
Performance metrics for alert system health. MTTR, false positive rate, escalation efficiency.
| Alert Type | Fired (7d) | False Positives | MTTR | Auto-Resolved |
|---|---|---|---|---|
| CPU Threshold | 147 | 6 (4.1%) | 4.2 min | 89% |
| Memory Threshold | 89 | 2 (2.2%) | 6.8 min | 76% |
| Disk Space | 34 | 0 (0%) | 18.4 min | 12% |
| Network Latency | 127 | 12 (9.4%) | 3.1 min | 94% |
| Service Health | 67 | 1 (1.5%) | 12.2 min | 43% |
Mean time to respond trending down (alert tuning + multi-condition logic paying off)
False positive rate decreased from 8.2% → 3.8% (multi-condition alerts + threshold tuning)
See also: Operational SLAs, Event Analytics, Alert Templates
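A blended false-positive rate follows directly from the per-type counts in the table above; a quick sketch (note this blended 7-day figure need not match the longer-run 3.8% trend quoted above, which covers a different window):

```python
rows = {
    # alert type: (fired_7d, false_positives)
    "CPU Threshold":    (147, 6),
    "Memory Threshold": (89, 2),
    "Disk Space":       (34, 0),
    "Network Latency":  (127, 12),
    "Service Health":   (67, 1),
}

total_fired = sum(fired for fired, _ in rows.values())      # 464
total_fp = sum(fp for _, fp in rows.values())               # 21
fp_rate = round(100 * total_fp / total_fired, 1)            # 4.5 (%)
```

Weighting by fired count keeps noisy high-volume types (Network Latency at 9.4%) from being hidden by quiet accurate ones.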
| Feature | v1 | v2 |
|---|---|---|
| Webhook Integration | ❌ | ✅ Slack, Discord, PagerDuty, Email |
| Auto-Escalation | ❌ | ✅ 4 escalation policies |
| Multi-Condition Alerts | ❌ Single condition only | ✅ AND/OR logic, nested conditions |
| Alert Muting | ❌ | ✅ Snooze with reason tracking |
| Alert Analytics | ❌ | ✅ MTTR, false positive rate, escalation stats |
| Basic Alerting | ✅ | ✅ Enhanced |
NO competitor has webhook integration. Firm A (Terminal), Firm C (Quant Lab), Firm D (Spatial) all lack external system integration. Webhook integration is a 3-4 week engineering project (OAuth setup, rate limiting, retry logic, delivery tracking, payload formatting).
Time to replicate Alert System v2: 3-4 weeks minimum.