Alerting and monitoring system

Out of date

Existing systems

Within NYC Mesh only

~~#monitoring-unms/UISP~~

~~Grafana/Prometheus~~
- ~~public, setup 4 years ago: https://stats.nycmesh.net~~ UISP: https://10.70.76.21/nms/login
- ~~Mesh only, Omni's etc: http://10.70.90.82:3000/dashboards~~ Grafana: http://10.70.90.82:3000
- Prometheus General: http://10.70.90.82:9090
- Prometheus Omni only: http://10.70.90.142:9090
  - Omni port5 at 100Mbps
  - Omni memory used 75% or above
- snmp-exporter: http://10.70.90.82:9116
- support report generator
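
If the two items under "Prometheus Omni only" are read as alert conditions (port5 at 100Mbps, memory used 75% or above), they could be expressed as a small check like the one below. This is only a sketch: the `OmniSample` type, its field names, and the sample values are hypothetical and are not the metric names the Prometheus instances actually export.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OmniSample:
    """One reading for a single Omni. All field names are hypothetical."""
    name: str
    port5_speed_mbps: int
    memory_used_pct: float


def omni_alerts(sample: OmniSample) -> List[str]:
    """Return one message per condition that the sample trips."""
    alerts = []
    # Condition 1 from the list: port5 at 100Mbps.
    if sample.port5_speed_mbps == 100:
        alerts.append(f"{sample.name}: port5 at 100Mbps")
    # Condition 2 from the list: memory used 75% or above.
    if sample.memory_used_pct >= 75:
        alerts.append(f"{sample.name}: memory used {sample.memory_used_pct:.0f}%")
    return alerts


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    samples = [
        OmniSample("omni-a", 1000, 40.0),
        OmniSample("omni-b", 100, 80.0),
    ]
    for s in samples:
        for line in omni_alerts(s):
            print(line)
```

Keeping each condition as a plain comparison makes it straightforward to move the same thresholds into whichever tool ends up doing the alerting.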
Zabbix
- IP: http://10.70.73.58/
- Details: Runs on Quincy's server, connected to Beta Slack
Requirements
- Must:
  - alert Slack team when key infrastructure goes offline within 5 minutes
- Should:
  - be easy to update for new equipment
  - be easy to configure to notify new volunteers
  - be easy to deploy
  - be reliable
  - be configurable through a version-controlled config to enable easy updates
  - be editable by multiple volunteers
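
To make the requirements above concrete, here is a minimal sketch of how a version-controlled config, an offline check, and a Slack notification could fit together. It is not tied to any of the proposed tools. The config layout, device names, webhook URL, and the 180-second silence threshold are all assumptions made for illustration; the only external behaviour relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import time
import urllib.request
from typing import Dict, List, Optional

# Hypothetical config. In practice this could live as a JSON file in a
# git repository so that several volunteers can edit and review it.
EXAMPLE_CONFIG = {
    "slack_webhook_url": "https://hooks.slack.com/services/EXAMPLE",  # placeholder
    # Assumed value: silence threshold kept below the 5-minute requirement
    # so there is time left for the check interval and delivery.
    "offline_after_seconds": 180,
    "devices": [
        {"name": "example-hub-router", "key": True},
        {"name": "example-rooftop-omni", "key": False},
    ],
}


def find_offline(config: dict, last_seen: Dict[str, float],
                 now: Optional[float] = None) -> List[str]:
    """Return names of key devices not seen within the configured threshold."""
    if now is None:
        now = time.time()
    offline = []
    for device in config["devices"]:
        if not device.get("key"):
            continue  # only key infrastructure triggers an alert
        seen = last_seen.get(device["name"])
        if seen is None or now - seen >= config["offline_after_seconds"]:
            offline.append(device["name"])
    return offline


def slack_payload(offline: List[str]) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": "Key infrastructure offline: " + ", ".join(offline)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Adding new equipment or changing which devices count as "key" would then be an edit to the config rather than to the code, which is the property the "Should" items ask for.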
Questions
- Major
  - What key metrics should we alert based on?
- Minor
  - frequency? ~1 point/hour
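
If "frequency" here means how often each device is sampled for alerting (an assumption; the note does not say), the figure interacts with the 5-minute requirement. The arithmetic below is a rough sketch: the function names are hypothetical, and it assumes an alert fires after a fixed number of consecutive missed samples.

```python
def worst_case_detection_minutes(sample_interval_min: float,
                                 missed_samples: int = 1) -> float:
    """Worst-case delay between a failure and its detection.

    Worst case: the device fails right after a successful sample, so the
    first miss is a full interval later, and the alert fires only after
    `missed_samples` consecutive misses.
    """
    return sample_interval_min * missed_samples


def max_interval_for_deadline(deadline_min: float,
                              missed_samples: int = 1) -> float:
    """Largest sample interval that still detects within the deadline."""
    return deadline_min / missed_samples


if __name__ == "__main__":
    # One point per hour, alerting on the first missed sample.
    print(worst_case_detection_minutes(60))
    # Interval needed to meet a 5-minute deadline with 2 missed samples.
    print(max_interval_for_deadline(5, missed_samples=2))
```

Under these assumptions, one point per hour suits long-term graphs, while the availability check for key infrastructure would need a much shorter interval.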
Proposed software
- Zabbix
- Nagios
- Grafana
- [add your suggestion here]
Next Steps

Log
- prompted by this Slack discussion on Grand St. outage
- added Zabbix server during Hack night
