Alerting and monitoring system

Out of date

Existing systems

Zabbix

Requirements

  • Must:
    • alert the Slack team within 5 minutes of key infrastructure going offline
  • Should:
    • be easy to update for new equipment
    • be easy to configure to notify new volunteers
    • be easy to deploy
    • be reliable
    • be configurable through a version-controlled config to enable easy updates
    • be editable by multiple volunteers
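
As a rough illustration of the requirements above (not a description of the existing Zabbix setup), the sketch below keeps the monitored hosts and the volunteers to notify in one config that could be stored in version control, and flags hosts that have been silent for too long. Everything in it is an assumption made for the example: the host names, addresses, Slack handles, config layout, and the 240-second threshold, which is chosen only to leave headroom under the 5-minute limit. It also assumes alerts are delivered through a Slack incoming webhook, which accepts a JSON body with a `text` field.

```python
import json
import urllib.request

# Hypothetical config; in practice this would be a file kept in version
# control so that several volunteers can review and edit it.
CONFIG_JSON = """
{
  "alert_after_seconds": 240,
  "hosts": [
    {"name": "example-hub-1", "address": "10.0.0.1"},
    {"name": "example-hub-2", "address": "10.0.0.2"}
  ],
  "notify": ["@volunteer-a", "@volunteer-b"]
}
"""


def load_config(text):
    """Parse the JSON config text into a dict."""
    return json.loads(text)


def hosts_to_alert(config, last_seen, now):
    """Return names of hosts not seen within alert_after_seconds.

    last_seen maps host name -> timestamp (seconds) of the last successful
    check; hosts that have never been seen are treated as offline.
    """
    limit = config["alert_after_seconds"]
    overdue = []
    for host in config["hosts"]:
        seen = last_seen.get(host["name"])
        if seen is None or now - seen >= limit:
            overdue.append(host["name"])
    return overdue


def build_slack_payload(config, overdue):
    """Build the JSON body for a Slack incoming webhook message."""
    mentions = " ".join(config["notify"])
    return {"text": f"{mentions} offline: {', '.join(overdue)}"}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook URL (supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Adding new equipment or a new volunteer would then be a one-line change to the config. For the 5-minute requirement to hold, the checks that feed `last_seen` would need to run often enough that the check interval plus the threshold stays under 300 seconds.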

Questions

  • Major
    • What key metrics should we alert based on?
  • Minor
    • Sampling frequency? ~1 data point/hour

Proposed software

Next Steps

Log

  • prompted by this Slack discussion on the Grand St. outage
  • added Zabbix server during Hack night