Alert on user-facing symptoms, not internal metrics. "Error rate > 1%" is actionable; "CPU > 80%" usually is not.
Strong answers cover the three pillars: metrics (system and application), logs (structured, centralised), and traces (distributed tracing for microservices). Alert philosophy: alert on symptoms not causes, avoid alert fatigue, use SLOs/SLIs. Best candidates mention dashboards for different audiences and runbooks for common alerts.
Tests operational maturity. Weak candidates only monitor CPU and memory. Strong candidates think about user experience, SLOs, and actionable alerts. Ask: "How do you handle alert fatigue?"