Questions on CI/CD, infrastructure as code, monitoring, incident response, and reliability engineering.
Walk through each stage and explain what it catches. A pipeline is only as valuable as the problems it prevents from reaching production.
Strong answers describe concrete pipeline stages: code checkout, dependency installation, linting, unit tests, integration tests, security scanning, build, deploy to staging, smoke tests, approval gate, deploy to production, and post-deploy verification. Look for understanding of why each stage exists, not just the tools used.
Baseline DevOps question. The depth of detail reveals real experience vs theoretical knowledge. Ask: "What happens when the pipeline fails at 3am?" to test on-call awareness.
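The stage ordering above can be sketched as a fail-fast runner: cheap checks (lint, unit tests) run first so most failures never reach the expensive stages. The stage names and the `run_pipeline` helper are illustrative, not any particular CI system's API.

```python
# Minimal fail-fast pipeline sketch. Each stage runs only if every
# earlier stage passed, so fast, cheap checks catch problems first.
STAGES = [
    "checkout", "install-deps", "lint", "unit-tests",
    "integration-tests", "security-scan", "build",
    "deploy-staging", "smoke-tests", "deploy-production",
]

def run_pipeline(stages, runner):
    """Run stages in order; stop at the first failure.

    `runner` maps a stage name to True (pass) or False (fail).
    Returns (passed_stages, failed_stage_or_None).
    """
    passed = []
    for stage in stages:
        if not runner(stage):
            return passed, stage  # fail fast: later stages never run
        passed.append(stage)
    return passed, None
```

A lint failure here costs seconds; the same bug caught at the smoke-test stage would have cost a full build and staging deploy.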
Discuss how you handle state - remote backends, locking, and drift detection. This is where IaC gets real.
Look for: understanding of declarative vs imperative approaches, experience with Terraform, Pulumi, CloudFormation, or Ansible, state management strategies (remote state, locking), module reuse and DRY principles, code review for infra changes, and drift detection. Best candidates discuss testing infrastructure code.
Fundamental DevOps skill. Candidates who only use click-ops or scripts are not truly practicing IaC. Ask about handling secrets in IaC code and managing environment differences.
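Drift detection, mentioned above, reduces to comparing the declared state against what actually exists. A minimal sketch, assuming resources are represented as name-to-attributes dicts (real tools like `terraform plan` do this against provider APIs):

```python
def detect_drift(declared: dict, observed: dict) -> dict:
    """Compare IaC-declared resources to what actually exists.

    Returns resources that are missing (declared but absent),
    unmanaged (exist but never declared), or changed out of band.
    """
    missing = {k: v for k, v in declared.items() if k not in observed}
    unmanaged = {k: v for k, v in observed.items() if k not in declared}
    changed = {
        k: {"declared": declared[k], "observed": observed[k]}
        for k in declared.keys() & observed.keys()
        if declared[k] != observed[k]
    }
    return {"missing": missing, "unmanaged": unmanaged, "changed": changed}
```

A non-empty "changed" section usually means someone click-opsed a fix in the console; the follow-up question is how the candidate gets that change back into code.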
Alert on user-facing symptoms, not internal metrics. "Error rate > 1%" is actionable; "CPU > 80%" usually is not.
Strong answers cover the three pillars: metrics (system and application), logs (structured, centralised), and traces (distributed tracing for microservices). Alert philosophy: alert on symptoms not causes, avoid alert fatigue, use SLOs/SLIs. Best candidates mention dashboards for different audiences and runbooks for common alerts.
Tests operational maturity. Weak candidates only monitor CPU and memory. Strong candidates think about user experience, SLOs, and actionable alerts. Ask: "How do you handle alert fatigue?"
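The "error rate > 1%" symptom alert above can be sketched in a few lines; the threshold and function names are illustrative:

```python
def error_rate(total_requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return 0.0 if total_requests == 0 else errors / total_requests

def should_page(total_requests: int, errors: int,
                threshold: float = 0.01) -> bool:
    """Page on a user-facing symptom (error rate above 1%),
    not on an internal cause like CPU utilisation."""
    return error_rate(total_requests, errors) > threshold
```

The point of the sketch is the input: it is measured at the user-facing edge, so it only fires when users are actually affected.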
Show your structured approach: Detect, Communicate, Mitigate, Resolve, Review. Emphasise communication cadence during the incident.
Look for a structured process: detect (alerts, user reports), triage (severity assessment, incident commander), communicate (status page, stakeholder updates), mitigate (rollback, feature flag, hotfix), resolve (root cause fix), and review (blameless post-mortem, action items). Best candidates mention communication templates and escalation paths.
Critical SRE skill. Tests composure under pressure and systematic thinking. Red flag: heroes who fix everything alone. Good sign: structured process with clear roles and communication.
Do not just list features. Describe how you configured health checks to prevent bad deploys and how auto-scaling handled traffic spikes in a real scenario.
Strong answers cover: Kubernetes concepts (pods, services, deployments, namespaces), deployment strategies (rolling update, blue-green, canary), horizontal pod autoscaling, liveness and readiness probes, resource limits, and secrets management. Best candidates discuss the operational complexity and when containers are not the right answer.
Senior DevOps/SRE question. Look for practical experience, not just certification knowledge. Ask: "When would you NOT use Kubernetes?" to test judgment.
Focus on the systemic changes that came out of the post-mortem, not just what went wrong. Did you automate the manual step? Close a monitoring gap?
Best answers: describe the incident clearly, demonstrate blameless analysis, identify systemic causes (not just the immediate trigger), propose preventive measures (automation, monitoring, process changes), and show follow-through on action items. Strong candidates distinguish between human error and system design problems.
Reveals operational culture and learning mindset. Red flag: blaming individuals. Good sign: identifying systemic improvements and following through on them.
Frame it as enabling velocity safely, not restricting it. Good guardrails make teams faster by giving them confidence to ship.
Strong answers mention: SLO-based approach (error budgets), automated testing and deployment guardrails, feature flags for progressive rollouts, investing in developer experience to make the safe path the easy path, and treating reliability as a feature. Best candidates frame this as a collaboration, not a conflict.
Senior SRE question. Tests philosophical alignment. The best SREs enable speed by reducing the cost of failure (canary deploys, feature flags, fast rollbacks), not by adding gates.
Show that security checks are automated in the pipeline, not manual gates. Developers should get fast feedback on security issues, not a report two weeks later.
Look for: dependency scanning, SAST/DAST tools in CI, secrets management (not in code), container image scanning, least-privilege access controls, network segmentation, and security review processes. Best candidates discuss shifting security left and making it a developer responsibility, not just a gate.
DevSecOps is increasingly expected. Candidates who treat security as someone else's problem are a risk. Look for practical integration, not just policy awareness.
Show that you measure and improve on-call health: pages per shift, time to resolution, and engineer satisfaction should all be tracked and reviewed.
Strong answers include: reasonable rotation schedules, compensating on-call fairly, investing in reducing alert noise, runbooks for common issues, blameless culture, protected recovery time after incidents, and tracking on-call load as a team health metric. Best candidates mention that on-call should be a learning opportunity, not just a burden.
Reveals values and leadership. Candidates who normalise 24/7 hero culture are concerning. Look for sustainable practices that respect people's time and well-being.
Start with visibility: you cannot optimise what you cannot see. Tag everything, set up cost dashboards, and review monthly with the team.
Look for: regular cost reviews, rightsizing instances, reserved/spot instance strategies, auto-scaling based on actual demand, eliminating unused resources, tagging for cost allocation, and setting budgets/alerts. Best candidates discuss the trade-off between cost and reliability, and when spending more is the right answer.
Increasingly important skill. Tests financial awareness and operational efficiency. Red flag: candidates who only focus on cutting costs without considering reliability impact. Good sign: systematic approach with clear trade-off thinking.
A disaster recovery plan that has never been tested is not a plan - it is a hope. Schedule regular DR drills and fix the gaps you discover.
Strong answers cover: defining RPO (recovery point objective) and RTO (recovery time objective) based on business requirements, backup strategies (incremental, differential, cross-region), automated failover, regular DR testing (not just documentation), and chaos engineering practices. Best candidates have actually tested their DR plans and found gaps.
Critical SRE skill. Many teams have DR plans on paper but never test them. Candidates who have run DR drills and found (and fixed) issues demonstrate operational maturity.
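The RPO half of the rubric above is directly checkable: if the newest backup is older than the RPO, a failover right now would lose more data than the business agreed to tolerate. A small sketch, with illustrative names:

```python
from datetime import datetime, timedelta

def rpo_violated(last_backup: datetime, now: datetime,
                 rpo: timedelta) -> bool:
    """True when the newest backup is older than the recovery
    point objective, i.e. failing over now loses too much data."""
    return (now - last_backup) > rpo
```

Running this as a scheduled check turns "we have backups" into "we know our backups are fresh enough", which is one of the gaps DR drills commonly expose.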
Focus on the practical benefits: every change is auditable, rollbacks are a git revert, and the repo is the single source of truth for what is deployed.
Look for: understanding of GitOps principles (declarative desired state, versioned in Git, automatically applied, self-healing), tools used (ArgoCD, Flux), benefits (audit trail, rollback, consistency), and challenges (secrets management, handling drift, learning curve). Best candidates discuss when GitOps is appropriate and when simpler approaches suffice.
Modern DevOps practice. Not all environments need GitOps, but understanding the approach shows current knowledge. Ask: "What challenges did you face adopting GitOps?"
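The self-healing property in the rubric comes from a reconcile loop: tools like ArgoCD and Flux repeatedly converge the cluster toward the state declared in Git. A simplified single pass, with `apply` standing in for whatever actuates the change:

```python
def reconcile(desired: dict, actual: dict, apply):
    """One pass of a GitOps-style reconcile loop.

    `desired` is the state declared in Git; `actual` is what is
    running. Anything missing or drifted is re-applied, which is
    what makes the system self-healing: manual changes get
    overwritten on the next pass.
    """
    corrections = {}
    for name, spec in desired.items():
        if actual.get(name) != spec:
            apply(name, spec)        # converge toward Git
            corrections[name] = spec
    return corrections
```

This also illustrates the audit-trail benefit: since `desired` only changes via Git commits, every production change has an author, a diff, and a revert path.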
Always make migrations backwards-compatible. Deploy the code that works with both old and new schema first, then run the migration, then clean up the old code path.
Strong answers cover: replication strategies, automated backups with tested restores, connection pooling, migration strategies (expand-contract pattern, backwards-compatible changes), monitoring query performance, and failover procedures. Best candidates discuss the specific challenges of schema changes on large, active tables.
Advanced operational skill. Database outages are some of the most impactful incidents. Candidates who have managed production database migrations safely demonstrate mature operational thinking.
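The expand-contract pattern's "code that works with both schemas" step can be made concrete. A hedged sketch, assuming a rename from a hypothetical `email` column to `contact_email`: during the expand phase reads prefer the new column and fall back to the old one, so old and new code can run side by side.

```python
def read_email(row):
    """Schema-tolerant read during an expand-contract migration.

    Prefer the new column (added in the expand phase); fall back to
    the legacy column so rows not yet backfilled still work. The
    fallback is deleted only in the contract phase, after the
    backfill completes.
    """
    if row.get("contact_email") is not None:  # new column
        return row["contact_email"]
    return row.get("email")                   # legacy column
```

The same idea applies to writes (write both columns during the transition), which is what makes rollback safe at every step.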
Be honest about the complexity trade-off. A service mesh solves real problems but introduces operational overhead. Show you understand when it is worth it and when it is not.
Look for: understanding of what service meshes provide (mTLS, traffic management, observability), experience with specific tools (Istio, Linkerd), awareness of the complexity cost (latency overhead, operational burden, debugging difficulty), and judgment about when simpler solutions suffice. Best candidates can articulate clear criteria for adoption.
Tests depth of infrastructure knowledge and judgment. Candidates who advocate for service meshes in every environment lack practical experience with the operational burden. Those who have never considered them may not be current.
Start with identity. If you can strongly verify who is making a request, you can make better access decisions. mTLS between services is a practical first step.
Strong answers cover: principle of least privilege, network microsegmentation, identity-based access (not IP-based), mutual TLS, continuous verification (not just at authentication), device trust, and monitoring for anomalies. Best candidates discuss the practical challenges of implementing zero trust in existing environments and the migration path.
Senior security-minded SRE question. Zero trust is increasingly expected. Candidates who rely solely on network perimeter security are operating in an outdated model.
Describe a specific moment when the error budget influenced a decision: "We had consumed 80% of our monthly error budget, so we paused feature releases and focused on stability."
Look for: understanding the SLI/SLO/SLA hierarchy, choosing meaningful SLIs (latency, availability, correctness), setting realistic SLOs based on user expectations, using error budgets to balance reliability and velocity, and what happens when the budget is exhausted. Best candidates describe a real example where error budget data changed a team decision.
Core SRE concept. Candidates who cannot explain error budgets are not practicing modern SRE. Ask: "What do you do when the error budget is exhausted but the team wants to ship?"
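The error budget arithmetic itself is worth having to hand: a 99.9% SLO over a 30-day window (43,200 minutes) allows about 43.2 minutes of unavailability. A minimal sketch:

```python
def error_budget(slo: float, window_minutes: int) -> float:
    """Allowed bad minutes in the window for a given SLO.

    E.g. a 99.9% SLO over 30 days (43,200 min) allows ~43.2 min.
    """
    return window_minutes * (1 - slo)

def budget_remaining(slo: float, window_minutes: int,
                     bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (floored at 0)."""
    budget = error_budget(slo, window_minutes)
    return max(0.0, 1 - bad_minutes / budget)
```

In the scenario quoted earlier, 34.56 bad minutes against a 43.2-minute budget leaves 20% remaining, which is the kind of number that justifies pausing feature releases.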
Track the time you spend on toil weekly. If it exceeds 50% of your working time, you are a manual operator, not an SRE. Automate the highest-frequency tasks first.
Strong answers define toil clearly (manual, repetitive, automatable, without enduring value), describe measurement approaches (time tracking, ticket analysis), and give specific automation examples with impact metrics. Best candidates discuss the strategic importance of keeping toil below a threshold to preserve time for engineering work.
Fundamental SRE principle. Candidates who accept manual work as inevitable are not SRE-minded. Those who systematically reduce toil free themselves for higher-value reliability engineering.
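The measurement side is simple once time is categorised. A sketch assuming a weekly time log keyed by work category (the category names are illustrative):

```python
def toil_fraction(time_log: dict) -> float:
    """Share of logged hours tagged as toil: manual, repetitive,
    automatable work without enduring value."""
    total = sum(time_log.values())
    return 0.0 if total == 0 else time_log.get("toil", 0) / total

def over_toil_cap(time_log: dict, cap: float = 0.5) -> bool:
    """True when toil exceeds the team's budget (default 50%)."""
    return toil_fraction(time_log) > cap
```

Reviewing this number weekly also tells you which automation paid off: the fraction should fall after each high-frequency task is eliminated.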
The best platform teams treat developers as customers and measure success by adoption, not by mandate. If developers route around your platform, it is not meeting their needs.
Best answers show: treating internal developers as customers, building self-service capabilities, measuring adoption and developer satisfaction, providing good documentation and support, and balancing standardisation with flexibility. Best candidates discuss the paved road approach: make the right way the easy way.
Modern DevOps evolution. Platform engineering is a growing discipline. Candidates who understand product thinking applied to infrastructure show forward-looking mindset.
Frame infrastructure debt in business terms: "Our deployment takes 45 minutes, costing us 20 engineer-hours per week. Investing 2 weeks in modernisation saves 1,000 hours per year."
Look for: tracking infrastructure debt visibly, quantifying its impact (incident frequency, deployment time, developer productivity), making a business case for modernisation, incremental migration strategies, and balancing reliability investment with business needs. Best candidates make infrastructure health visible to leadership with clear metrics.
Tests strategic communication skills. Many DevOps engineers struggle to justify infrastructure investment to non-technical leaders. Those who can make the business case clearly are far more effective.
Know your current limits through load testing, not guessing. Then set capacity targets based on growth projections plus a safety margin for unexpected spikes.
Strong answers cover: analysing growth trends from historical data, load testing to find current limits, defining headroom targets (e.g. 30% above peak), auto-scaling strategies with appropriate thresholds, cost modelling for different capacity scenarios, and planning for organic growth vs sudden spikes (viral events, launches). Best candidates discuss the relationship between capacity planning and SLOs.
Practical SRE skill. Under-provisioning causes outages; over-provisioning wastes money. Candidates who have managed capacity during rapid growth demonstrate operational maturity. Ask: "How far ahead do you plan?"
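The headroom arithmetic in the rubric can be sketched directly; the growth model (compound monthly growth plus a 30% margin over projected peak) is one reasonable assumption, not the only one:

```python
def required_capacity(peak_load: float, monthly_growth: float,
                      months: int, headroom: float = 0.3) -> float:
    """Project peak load forward with compound monthly growth,
    then add a headroom margin (default 30% above projected peak)
    to absorb unexpected spikes."""
    projected_peak = peak_load * (1 + monthly_growth) ** months
    return projected_peak * (1 + headroom)
```

The inputs are the honest part: `peak_load` should come from load testing and `monthly_growth` from historical data, not from guesses.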
Start small and in staging. Define what "normal" looks like (steady state), then introduce one failure and observe. Only move to production chaos when you have confidence in your observability and rollback capabilities.
Strong answers cover: starting with game days in non-production environments, defining steady-state hypotheses, using tools like Chaos Monkey or Litmus, running experiments with clear blast radius limits, having abort conditions, and documenting findings. Best candidates discuss the cultural prerequisites for chaos engineering and how to build organisational buy-in.
Advanced SRE practice. Candidates with chaos engineering experience have mature reliability practices. Ask: "What did your first chaos experiment reveal that you did not expect?" to test genuine experience.
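The experiment structure above (steady state, one fault, abort condition) can be sketched as a skeleton; the callables and threshold are illustrative stand-ins for real fault injection and measurement:

```python
def run_experiment(inject_fault, measure_error_rate,
                   abort_threshold: float = 0.05):
    """Chaos experiment skeleton.

    Record the steady state, inject exactly one fault, re-measure,
    and flag an abort if the blast radius exceeds the limit (the
    caller is expected to roll the fault back on abort).
    Returns (steady_state, during_fault, aborted).
    """
    steady_state = measure_error_rate()   # hypothesis: "normal"
    inject_fault()                        # one failure at a time
    during = measure_error_rate()
    aborted = during > abort_threshold
    return steady_state, during, aborted
```

The discipline this encodes is what separates chaos engineering from breaking things: a stated hypothesis, a single variable, and a pre-agreed abort condition.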
Profile each stage first. Often one or two steps account for 80% of the time. Quick wins include dependency caching, test parallelisation, and splitting the pipeline into fast feedback and thorough verification stages.
Strong answers describe a systematic approach: profiling each stage, identifying the slowest steps, parallelising independent stages, caching dependencies, using incremental builds, optimising test suites (running fast tests first, slow tests asynchronously), and right-sizing build agents. Best candidates discuss the trade-off between pipeline speed and thoroughness.
Practical DevOps skill. Slow pipelines reduce developer productivity and discourage small, frequent commits. Candidates who have systematically optimised a pipeline demonstrate operational awareness.
The test is simple: if a developer clones your repo, can they see any secrets? If yes, your secrets management needs work. Use a dedicated secrets manager and inject at runtime, never at build time.
Strong answers cover: never storing secrets in code or version control, using dedicated secret management tools (Vault, AWS Secrets Manager, SOPS), rotating secrets regularly, auditing access, handling secrets in CI/CD pipelines, and developer workflows that are both secure and convenient. Best candidates discuss the tension between security and developer experience.
Non-negotiable security skill. Secret leaks are one of the most common security incidents. Candidates who have implemented proper secrets management demonstrate security awareness. Ask: "What happens when a secret is accidentally committed to git?"
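Runtime injection in its simplest form reads the secret from the process environment, where a secrets manager or orchestrator placed it at startup. A sketch (the variable name is illustrative); failing loudly beats falling back to a default credential:

```python
import os

def load_secret(name: str) -> str:
    """Read a secret from the environment at runtime.

    The value is injected at deploy time by a secrets manager or
    orchestrator; it never appears in code, images, or Git. A
    missing secret is a hard error, not a silent default.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name} is not set")
    return value
```

Note this passes the clone test from above: the repo contains only the variable name, never the value.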
Be honest about the trade-offs. Multi-cloud sounds good in theory but doubles your operational expertise requirement. Cloud-agnostic abstractions add latency and limit access to provider-specific features.
Strong answers show nuanced thinking: multi-cloud adds complexity (different APIs, networking, IAM), but can reduce vendor lock-in, improve resilience, and leverage best-of-breed services. Best candidates discuss when multi-cloud is justified (regulatory requirements, specific service needs) versus when it is unnecessary complexity.
Tests strategic infrastructure thinking. Candidates who advocate multi-cloud without acknowledging the complexity cost may lack operational experience. Those who categorically oppose it may lack strategic awareness.
Always include a correlation ID that threads through every service in a request chain. Without it, debugging a distributed system is like searching for a needle in a hayfield, not just a haystack.
Strong answers cover: structured logging (JSON), correlation IDs across services, log levels used consistently, centralised aggregation, retention policies, and the balance between logging enough to debug and not so much that costs and noise become problems. Best candidates discuss how they make logs searchable and actionable during incidents.
Practical observability skill. Logs are the first tool engineers reach for during incidents. Candidates who have designed logging strategies for distributed systems understand observability at a deep level.
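Correlation IDs plus structured JSON output can be sketched with the standard library; `contextvars` keeps the ID with the request context even across async code. The field names are illustrative:

```python
import contextvars
import json
import uuid

# The correlation ID travels with the request context, so every log
# line in the request chain can be joined on a single field.
correlation_id = contextvars.ContextVar("correlation_id", default=None)

def start_request():
    """Mint a fresh correlation ID at the edge of the system."""
    correlation_id.set(str(uuid.uuid4()))

def log(level: str, message: str, **fields) -> str:
    """Emit one structured (JSON) log line as a string."""
    record = {"level": level, "message": message,
              "correlation_id": correlation_id.get(), **fields}
    return json.dumps(record)
```

In practice only the first service mints the ID; downstream services read it from a request header and set it into their own context, which is what threads it through the chain.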
Match strategy to risk: canary for critical user-facing services, rolling for internal services, blue-green when you need instant rollback. The choice depends on your risk tolerance and infrastructure budget.
Strong answers explain each strategy and when to use it: blue-green for instant rollback but double infrastructure cost, canary for gradual traffic shifting with early feedback, and rolling for resource-efficient updates. Best candidates discuss the prerequisites (observability, traffic routing, health checks) and how deployment strategy relates to service criticality and risk tolerance.
Fundamental DevOps knowledge. Candidates who deploy everything the same way are not thinking about risk. Those who choose deployment strategies based on service characteristics demonstrate operational maturity.
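The canary's "early feedback" is ultimately a comparison against the baseline. A hedged sketch of the promotion decision; the tolerance value is an illustrative choice, and real systems would also compare latency percentiles:

```python
def promote_canary(baseline_error_rate: float,
                   canary_error_rate: float,
                   tolerance: float = 0.005) -> bool:
    """Promote the canary only if its error rate is no worse than
    the baseline plus a small tolerance; otherwise roll it back
    before it takes more traffic."""
    return canary_error_rate <= baseline_error_rate + tolerance
```

This also makes the prerequisites from the rubric concrete: without per-version metrics and traffic routing, there is nothing to compare and nothing to shift.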
Store configuration in the environment, not in the code. Validate configs in CI before deployment. If staging and production drift apart, you are testing assumptions, not reality.
Strong answers cover: externalised configuration (not baked into images), environment-specific config files or services, config validation in CI, drift detection, and maintaining environment parity to reduce "works on staging but not production" issues. Best candidates discuss twelve-factor app principles and immutable infrastructure approaches.
Core operational skill. Configuration issues cause a disproportionate number of production incidents. Candidates who have systematic configuration management prevent an entire category of outages.
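"Validate configs in CI" can be as simple as a function that returns a list of problems and fails the build when the list is non-empty. The required keys and rules below are illustrative:

```python
REQUIRED_KEYS = {"database_url", "log_level", "port"}
VALID_LOG_LEVELS = {"debug", "info", "warning", "error"}

def validate_config(config: dict) -> list:
    """Return a list of problems; empty means safe to deploy.

    Run in CI so a bad config fails the build, not the rollout.
    """
    problems = []
    for key in sorted(REQUIRED_KEYS - config.keys()):
        problems.append(f"missing key: {key}")
    if config.get("log_level") not in VALID_LOG_LEVELS:
        problems.append("invalid log_level")
    if not isinstance(config.get("port"), int):
        problems.append("port must be an integer")
    return problems
```

Running the same validator against every environment's config also surfaces drift between staging and production before it surfaces as an incident.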
Think in layers: is it the application (slow queries, resource leaks), the platform (pod limits, connection pools), or the network (DNS, load balancer, firewall)? Narrow down the layer before diving deep.
Strong answers describe a systematic approach: checking metrics (latency percentiles, error rates), examining logs with correlation IDs, verifying DNS resolution, checking network connectivity and load balancer health, examining the downstream service's health, reviewing recent changes, and considering resource exhaustion (connection pool, file descriptors). Best candidates think in layers: application, platform, network.
Tests troubleshooting methodology. Real-world incidents require systematic diagnosis, not guessing. Candidates who can articulate a structured debugging approach are effective on-call engineers.
The key benefit is confidence: if staging and production use the same image, you know production will behave like staging. The key challenge is handling anything stateful. Separate state from compute.
Strong answers explain: never modifying running infrastructure (replace instead of patch), benefits (reproducibility, consistency, simplified rollback), challenges (longer deploy cycles, handling state, cultural shift from SSH-and-fix), and prerequisites (automated image building, good CI/CD, external state management). Best candidates discuss when immutable infrastructure is appropriate and when it is not.
Tests modern infrastructure philosophy. Candidates who still SSH into production to make changes are operating in an outdated model. Those who understand immutable infrastructure are thinking about reliability at scale.
Make the right thing easy: auto-generate service dashboards, provide libraries that instrument code automatically, and include observability in your definition of done for new features.
Strong answers cover: making observability tools developer-friendly, embedding dashboards and alerts into the development workflow, "you build it, you run it" ownership, providing training and runbooks, celebrating good observability practices, and gradually increasing developer on-call responsibility. Best candidates discuss overcoming resistance and making observability a natural part of the development process.
Tests cultural leadership alongside technical skills. The best observability tools are useless if developers do not use them. Candidates who can bridge the gap between SRE practices and development teams are invaluable.