Alerting and monitoring system

Work in progress

Existing systems

Within NYC Mesh only

UISP: https://10.70.76.21/nms/login
Grafana: http://10.70.90.82:3000
Prometheus General: http://10.70.90.82:9090
Prometheus Omni only: http://10.70.90.142:9090
- Omni port5 at 100Mbps
- Omni memory used 75% or above
snmp-exporter: http://10.70.90.82:9116 (general)
snmp-exporter: http://10.70.90.142:9090 (omni only)
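
The two Omni conditions listed above could be checked against Prometheus with a small script. The following is only a minimal sketch: it uses the standard Prometheus HTTP API endpoint /api/v1/query, but the metric name in the usage note and the `instance` label are assumptions, not names taken from the running servers.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://10.70.90.142:9090"  # Omni-only Prometheus from the list above
MEMORY_THRESHOLD_PERCENT = 75.0  # "memory used 75% or above"


def parse_instant_vector(payload):
    """Return {instance label: float value} from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    samples = {}
    for series in payload["data"]["result"]:
        # Assumption: each series carries an "instance" label.
        instance = series["metric"].get("instance", "unknown")
        _timestamp, value = series["value"]  # the value arrives as a string
        samples[instance] = float(value)
    return samples


def over_threshold(samples, threshold=MEMORY_THRESHOLD_PERCENT):
    """List instances whose value is at or above the threshold, sorted by name."""
    return sorted(name for name, value in samples.items() if value >= threshold)


def query_prometheus(expr, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query (only reachable from inside the mesh network)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Usage would look like `over_threshold(parse_instant_vector(query_prometheus("omni_memory_used_percent")))`, where `omni_memory_used_percent` is a hypothetical metric name that has to be replaced with whatever the snmp-exporter actually exposes.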

UNIFI: https://10.70.90.158:8443/ (old: https://10.70.95.63:8443)

support report generator

Zabbix

IP: http://10.70.73.58/
Details: Runs on Quincy's server, connected to Beta Slack

Requirements

Must:
- alert the Slack team within 5 minutes when key infrastructure goes offline
Should:
- be easy to update for new equipment
- be easy to configure to notify new volunteers
- be easy to deploy
- be reliable
- be configurable through a version-controlled config to enable easy updates
- be editable by multiple volunteers
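
As a rough illustration of how these requirements could fit together, the sketch below reads a device list from a config that could live in version control, flags devices that have not been seen recently, and builds a Slack incoming-webhook message. The device names, config keys and timing values are all hypothetical, and this does not describe how Zabbix or any of the proposed tools work. The timing assumes a check every 60 seconds: with a 240-second threshold, an alert would then go out at most 300 seconds after a device was last seen.

```python
import json
import urllib.request

# Hypothetical config; in practice this would be a file kept in version control.
CONFIG_JSON = """
{
  "offline_after_seconds": 240,
  "devices": [
    {"name": "example-hub-a"},
    {"name": "example-hub-b"},
    {"name": "example-hub-c"}
  ]
}
"""


def find_offline(config, last_seen, now):
    """Return configured device names not seen within the allowed window.

    last_seen maps device name to the Unix time of its last successful check.
    Devices that were never seen count as offline.
    """
    limit = config["offline_after_seconds"]
    offline = []
    for device in config["devices"]:
        seen = last_seen.get(device["name"])
        if seen is None or now - seen > limit:
            offline.append(device["name"])
    return offline


def slack_payload(offline_names):
    """Build the JSON body for a Slack incoming webhook (a plain "text" message)."""
    return json.dumps({"text": "Offline: " + ", ".join(offline_names)})


def post_to_slack(webhook_url, body, timeout=10):
    """POST the message; webhook_url is supplied by whoever deploys this."""
    request = urllib.request.Request(
        webhook_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Adding new equipment would then be a one-line change to the device list, which any volunteer with access to the repository could submit for review.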

Questions

Major
- What key metrics should we alert based on?
Minor
- Data collection frequency? ~1 point/hour

Proposed software

Zabbix
Nagios
Grafana
[add your suggestion here]

Next Steps

Log

prompted by this Slack discussion on Grand St. outage
added Zabbix server during Hack night
