Anya Core Alert Reference¶
[AIR-3][AIS-3][BPC-3][RES-3]
Overview¶
This document is a reference for all alerts configured in the Anya Core monitoring stack. Alerts are grouped by component, and each entry lists its severity, trigger condition, and suggested resolution.
Alert Severity Levels¶
Level | Description | Response Time | Notification Channel |
---|---|---|---|
Critical | Immediate attention required, service impact | < 15 minutes | Email, SMS, PagerDuty |
Warning | Attention needed soon, potential issues | < 1 hour | Email, Slack |
Info | Informational messages, no immediate action | N/A | Email (digest) |
Core Alerts¶
Node Health¶
Alert Name | Severity | Condition | Description | Resolution |
---|---|---|---|---|
NodeDown | Critical | up == 0 | Node is not responding to metrics collection | Check node status, restart if needed |
NodeHighCPU | Warning | rate(node_cpu_seconds_total{mode!="idle"}[5m]) > 0.9 | CPU usage is very high | Investigate high CPU processes |
NodeHighMemory | Warning | (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9 | Memory usage is very high | Check for memory leaks, add more RAM |
Disk & Storage¶
Alert Name | Severity | Condition | Description | Resolution |
---|---|---|---|---|
LowDiskSpace | Warning | node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.2 | Disk space is running low | Clean up disk space or expand storage |
HighDiskIO | Warning | rate(node_disk_io_time_seconds_total[5m]) > 0.9 | High disk I/O utilization | Check for disk bottlenecks |
Network¶
Alert Name | Severity | Condition | Description | Resolution |
---|---|---|---|---|
HighNetworkTraffic | Warning | rate(node_network_receive_bytes_total[5m]) > 100000000 | High network receive rate | Investigate traffic source |
NetworkErrors | Warning | rate(node_network_receive_errs_total[5m]) > 0 | Network interface errors detected | Check network hardware and connections |
Bitcoin-Specific Alerts¶
Blockchain¶
Alert Name | Severity | Condition | Description | Resolution |
---|---|---|---|---|
BitcoinNodeDown | Critical | (time() - bitcoin_latest_block_time) / 600 > 3 | Bitcoin node is not syncing (no new block for more than 3 expected 10-minute block intervals) | Check bitcoind status |
BitcoinIBD | Warning | bitcoin_ibd == 1 | Node is in Initial Block Download | Monitor progress |
BitcoinMempoolFull | Warning | bitcoin_mempool_size > 100000 | Mempool size is very large | Check for network congestion |
P2P Network¶
Alert Name | Severity | Condition | Description | Resolution |
---|---|---|---|---|
LowPeerCount | Warning | bitcoin_peers < 8 | Low number of peer connections | Check network connectivity |
HighPingTime | Warning | bitcoin_ping_time > 5 | High ping time to peers | Check network latency |
Custom Alert Rules¶
Adding New Alerts¶
- Edit the appropriate rule file in monitoring/prometheus/rules/
- Follow the format:
yaml
- alert: AlertName
  expr: alert_condition
  for: 5m
  labels:
    severity: warning  # or critical
  annotations:
    description: "Detailed description"
    summary: "Short alert summary"
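As a concrete instance of this format, the LowDiskSpace alert from the Disk & Storage table could be written roughly as follows. This is a sketch rather than the project's actual rule file: the group name `disk-storage`, the `for: 10m` duration, and the annotation text are assumptions. Only the alert name, severity, and expression come from the table.

```yaml
groups:
  - name: disk-storage          # hypothetical group name
    rules:
      - alert: LowDiskSpace
        # Fires when the free fraction of the root filesystem drops below 20%.
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.2
        for: 10m                # assumed; avoids firing on brief dips
        labels:
          severity: warning
        annotations:
          summary: "Disk space is running low on {{ $labels.instance }}"
          description: "Less than 20% of / is available on {{ $labels.instance }}."
```

Running `promtool check rules <file>` on the edited rule file before reloading Prometheus catches syntax errors early.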
Alert Routing¶
Alerts are routed based on severity and component:
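routes:
  # Routes are evaluated in order and the first match wins, so the
  # more specific NodeDown route must come before the severity routes.
  - match:
      alertname: 'NodeDown'
    receiver: 'pagerduty'
  - match:
      severity: 'critical'
    receiver: 'critical-alerts'
  - match:
      severity: 'warning'
    receiver: 'warning-alerts'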
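For context, the `routes` list sits under a top-level `route` block, and every receiver name it mentions must be defined under `receivers`. The sketch below shows one way the pieces might fit together. The `default` receiver, the grouping settings, and the empty receiver definitions are assumptions, not the project's actual configuration.

```yaml
route:
  receiver: 'default'            # hypothetical fallback receiver
  group_by: ['alertname', 'instance']
  group_wait: 30s                # assumed timings
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Child routes are checked in order; the first match wins unless
    # `continue: true` is set, so the specific route is listed first.
    - match:
        alertname: 'NodeDown'
      receiver: 'pagerduty'
    - match:
        severity: 'critical'
      receiver: 'critical-alerts'
    - match:
        severity: 'warning'
      receiver: 'warning-alerts'

receivers:
  # Notification settings (email_configs, pagerduty_configs, ...) are omitted.
  - name: 'default'
  - name: 'pagerduty'
  - name: 'critical-alerts'
  - name: 'warning-alerts'
```

`amtool check-config alertmanager.yml` can be used to validate a file like this before it is deployed.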
Notification Templates¶
Email Template¶
{{ define "email.default.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts.Firing }}
[FIRING] {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts.Resolved }}
[RESOLVED] {{ .Labels.alertname }}
Resolved at: {{ .EndsAt }}
{{ end }}
{{- end }}
{{- end }}
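For the template to take effect, Alertmanager has to load the file that defines it, and a receiver has to reference it by name. A minimal sketch, assuming the template file lives under `/etc/alertmanager/templates/` and using a placeholder recipient address (both hypothetical), with global SMTP settings configured elsewhere:

```yaml
templates:
  - '/etc/alertmanager/templates/*.tmpl'   # assumed location of the template file

receivers:
  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'           # placeholder address
        # Render the HTML body with the "email.default.html" template.
        html: '{{ template "email.default.html" . }}'
```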
Testing Alerts¶
Manual Testing¶
- Use the Alertmanager UI to silence an alert
- Use amtool to fire a test alert:
bash
amtool alert add alertname=NodeDown severity=critical --alertmanager.url=http://localhost:9093
Integration Testing¶
- Deploy to staging environment
- Trigger test alerts using the Alertmanager API:
bash
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "summary": "Test alert",
        "description": "This is a test alert"
      }
    }
  ]'
Alert Suppression¶
During Maintenance¶
- Create a maintenance window in Alertmanager:
bash
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": ".+", "isRegex": true}
    ],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T02:00:00Z",
    "createdBy": "maintenance",
    "comment": "Planned maintenance window"
  }'
Best Practices¶
- Alert Fatigue Prevention
  - Set appropriate thresholds
  - Use alert grouping
  - Implement alert inhibition rules
- Alert Documentation
  - Document all alerts
  - Include runbooks
  - Define escalation policies
- Alert Tuning
  - Regularly review alert thresholds
  - Remove unused alerts
  - Adjust for seasonality
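The inhibition item in the list above can be illustrated with a short Alertmanager fragment. In this sketch, a firing NodeDown alert suppresses warning-level alerts from the same instance, on the assumption that they are symptoms of the node being down. It assumes Alertmanager 0.22 or later (for the `source_matchers`/`target_matchers` syntax) and that alerts carry an `instance` label.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - severity = warning
    # Only inhibit when both alerts refer to the same instance.
    equal: ['instance']
```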
Support¶
For alert-related issues:
- Email: botshelomokoka+alerts@gmail.com
- GitHub Issues: https://github.com/your-org/anya-core/issues
- Documentation: Monitoring Guide
AI Labeling¶
- [AIR-3] - Automated alert management
- [AIS-3] - Secure alert handling
- [BPC-3] - Bitcoin monitoring best practices
- [RES-3] - Comprehensive alert coverage