A service in production is intermittently timing out when calling a downstream dependency. Walk me through how you would diagnose this.

Question

Accepted Answer

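Strong answers describe a systematic approach rather than a list of guesses: check metrics (latency percentiles and error rates, not just averages), trace failing requests through logs using correlation IDs, verify DNS resolution, check network connectivity and load balancer health, examine the downstream service's own health, review recent changes on both sides, and consider resource exhaustion (connection pools, file descriptors). Because the failure is intermittent, it helps to ask what the timeouts have in common, such as a particular host, time window, or request type. The best candidates think in layers (application, platform, network) and narrow down which layer the time is being lost in.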
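
As a rough illustration of thinking in layers, the sketch below times DNS resolution and TCP connect separately and summarizes latency samples as percentiles. It is a minimal example using only the Python standard library, not a prescribed tool; the function names `probe_once` and `summarize` and the default timeout are assumptions made for illustration.

```python
import socket
import statistics
import time


def probe_once(host, port, timeout=2.0):
    """Time DNS resolution and TCP connect separately for one attempt.

    Hypothetical helper: a failure is tagged with the layer it occurred in.
    """
    result = {"dns_ms": None, "connect_ms": None, "error": None}

    # Name resolution layer
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"dns: {exc}"
        return result
    result["dns_ms"] = (time.perf_counter() - start) * 1000

    # Network layer: TCP connect to the first resolved address
    family, socktype, proto, _, sockaddr = infos[0]
    start = time.perf_counter()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:  # includes timeouts and refused connections
        result["error"] = f"connect: {exc}"
        return result
    result["connect_ms"] = (time.perf_counter() - start) * 1000
    return result


def summarize(samples_ms):
    """Return p50/p95/p99 for a list of at least two latency samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Running `probe_once` repeatedly against the dependency and feeding the `connect_ms` values into `summarize` gives a quick view of tail latency. A healthy p50 with a poor p99 is consistent with an intermittent problem, and an error prefixed with `dns:` or `connect:` points at the layer to investigate first.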
