Technical Mid Level

How do you set up monitoring and observability for a production system? What do you monitor and what alerts do you set?

Quick Tip

Alert on user-facing symptoms, not internal metrics. "Error rate > 1%" is actionable; "CPU > 80%" usually is not.
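As a rough illustration of the symptom-based rule in the tip, the sketch below decides whether to page based on the error rate over a recent window. The `WindowStats` type, the `min_requests` guard, and the function names are hypothetical, introduced only for this example; the 1% threshold comes from the tip itself.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed over one evaluation window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def should_page(stats: WindowStats, threshold: float = 0.01,
                min_requests: int = 100) -> bool:
    """Page only when the user-facing error rate exceeds the threshold.

    min_requests is an assumed guard so that a handful of failures during
    a quiet period does not wake anyone up.
    """
    if stats.total_requests < min_requests:
        return False
    return error_rate(stats) > threshold
```

Under these assumptions, `should_page(WindowStats(1000, 20))` fires because 2% of requests failed, while a host sitting at 85% CPU with no failed requests would not page anyone.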

What good answers include

Strong answers cover the three pillars of observability: metrics (system and application), logs (structured and centralised), and traces (distributed tracing across microservices). They also state an alerting philosophy: alert on symptoms rather than causes, keep alert volume low enough to avoid alert fatigue, and define alerts in terms of SLOs and SLIs. The best candidates also mention dashboards tailored to different audiences and runbooks for common alerts.
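One way a candidate might make "use SLOs/SLIs" concrete is an error-budget burn-rate check, sketched below. The function names, the 99.9% target, and the two-window design are assumptions for illustration. The 14.4 threshold is the figure commonly cited for a one-hour window against a 30-day SLO and should be treated as an assumption here, not a recommendation.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A value of 1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will run out before the SLO period ends.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page_on_burn(long_window_error_ratio: float,
                        short_window_error_ratio: float,
                        slo_target: float = 0.999,
                        threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window are burning fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)
```

With a 99.9% target, a sustained 2% error ratio in both windows gives a burn rate of about 20 and pages, whereas a spike that has already subsided in the short window does not.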

What interviewers are looking for

This question tests operational maturity. Weak candidates monitor only CPU and memory. Strong candidates think about user experience, SLOs, and actionable alerts. A useful follow-up question: "How do you handle alert fatigue?"
