Behavioural Mid Level

A service in production is intermittently timing out when calling a downstream dependency. Walk me through how you would diagnose this.

Quick Tip

Think in layers: is it the application (slow queries, resource leaks), the platform (pod limits, connection pools), or the network (DNS, load balancer, firewall)? Narrow down the layer before diving deep.

What good answers include

Strong answers describe a systematic approach: checking metrics (latency percentiles, error rates), examining logs with correlation IDs, verifying DNS resolution, checking network connectivity and load balancer health, examining the downstream service's health, reviewing recent changes, and considering resource exhaustion (connection pool, file descriptors). Best candidates think in layers: application, platform, network.

What interviewers are looking for

Tests troubleshooting methodology. Real-world incidents require systematic diagnosis, not guessing. Candidates who can articulate a structured debugging approach are effective on-call engineers.

← All DevOps / SRE questions