Add metrics and alerts for unexpected errors
What we're after
If the billing service encounters an unexpected error, we need to alert the operations team.
Further context for those unfamiliar with what we're doing
Critical failures include errors reading usage and errors encountered while transacting credits.
I propose exposing one or more custom Prometheus metrics, scraping them with our Prometheus instance, and defining custom alerting rules whose notifications reach the operations team through Alertmanager.
Security considerations
The Prometheus metrics endpoint should only be exposed internally, similar to the admin endpoints.
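One way to keep the metrics endpoint internal is to serve it from its own listener rather than from the public router. The sketch below is only an illustration under assumptions: the metric name `billing_unexpected_errors_total`, the address `127.0.0.1:9102`, and every function name are hypothetical. It also hand-writes the Prometheus text exposition format using the standard library so that it stays dependency-free. A real implementation would more likely use the Prometheus Go package's own HTTP handler.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// metricName is a hypothetical name chosen for this sketch.
const metricName = "billing_unexpected_errors_total"

// unexpectedErrors is a stdlib stand-in for a Prometheus counter.
var unexpectedErrors atomic.Int64

// renderMetrics returns the counter in the Prometheus text exposition format.
func renderMetrics() string {
	return fmt.Sprintf(
		"# HELP %[1]s Unexpected errors seen by the billing service.\n"+
			"# TYPE %[1]s counter\n"+
			"%[1]s %[2]d\n",
		metricName, unexpectedErrors.Load())
}

// metricsHandler serves the rendered metrics to a scraper.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, renderMetrics())
}

// newInternalMetricsServer builds a server that only serves /metrics and is
// meant to be bound to an internal-only address (assumed here; the right
// address depends on how the platform reaches the app).
func newInternalMetricsServer(addr string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", metricsHandler)
	return &http.Server{Addr: addr, Handler: mux}
}
```

Calling `newInternalMetricsServer("127.0.0.1:9102").ListenAndServe()` in a goroutine would keep `/metrics` off the public listener. Whether a loopback or private-network address is appropriate depends on where the scraper runs.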
Notes for implementers
- Metrics: Probably use the Prometheus Go package.
- See if we can hook into the sloghandler and automatically increment a counter whenever a record at ERROR level or higher is logged.
- Alerts: See if CF automatically scrapes metrics, or if we need to register the app as a specific scrape target.