
Add metrics and alerts for unexpected errors

What we're after

If the billing service encounters an unexpected error, we need to alert the operations team.

Further context for those unfamiliar with what we're doing

Critical failures include errors reading usage and errors encountered while transacting credits.

I propose exposing one or more custom Prometheus metrics from the billing service, scraping them with our Prometheus instance, and defining custom alerting rules that notify the operations team through Alertmanager.

Security considerations

The Prometheus metrics endpoint should only be exposed internally, in the same way as the admin endpoints.

Notes for implementers

  • Metrics: Probably use the Prometheus Go client library (github.com/prometheus/client_golang).
  • See if we can hook into the slog handler and automatically increment a counter whenever a record is logged at ERROR level or higher.
  • Alerts: Check whether CF automatically scrapes /metrics or whether we need to register the app as a specific scrape target.
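
The slog idea above could look roughly like the following: a handler that wraps the existing one and invokes a callback for every record at `slog.LevelError` or higher. This is a minimal sketch under assumptions, not a settled design; `countingHandler` and `demoCount` are hypothetical names, and a plain atomic counter stands in for the Prometheus counter so that the example needs only the standard library.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log/slog"
	"sync/atomic"
)

// countingHandler wraps another slog.Handler and calls onError once for
// every record logged at slog.LevelError or above.
type countingHandler struct {
	next    slog.Handler
	onError func()
}

// Enabled defers to the wrapped handler so level filtering is unchanged.
func (h countingHandler) Enabled(ctx context.Context, level slog.Level) bool {
	return h.next.Enabled(ctx, level)
}

// Handle counts error-level records, then passes every record through.
func (h countingHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= slog.LevelError {
		h.onError()
	}
	return h.next.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler so that loggers
// created via logger.With(...) or logger.WithGroup(...) are still counted.
func (h countingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return countingHandler{next: h.next.WithAttrs(attrs), onError: h.onError}
}

func (h countingHandler) WithGroup(name string) slog.Handler {
	return countingHandler{next: h.next.WithGroup(name), onError: h.onError}
}

// demoCount logs one INFO record and two ERROR records through the
// wrapper and returns how many times the callback fired.
func demoCount() int64 {
	var n atomic.Int64
	base := slog.NewTextHandler(io.Discard, nil) // output discarded for the demo
	logger := slog.New(countingHandler{next: base, onError: func() { n.Add(1) }})

	logger.Info("usage read")
	logger.Error("failed to read usage")
	logger.With("account", "example").Error("failed to transact credits")
	return n.Load()
}

func main() {
	fmt.Println("errors counted:", demoCount())
}
```

With the Prometheus Go client, the `Inc` method of a registered counter could be passed as `onError`. Note that this counts every ERROR log, expected or not, so the alert threshold would need to account for errors the service already logs routinely.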

Related issues/sub-projects