Infrastructure
1. Walk me through how a request flows from a user to your backend.
The typical path is DNS → CDN edge → load balancer → ingress / service mesh → application pod → database. Mention where TLS terminates, where caching happens, and where you would add observability. Strong answers also flag the failure points at each hop.
CI/CD
2. How do you safely deploy a change to production?
Pipeline gates (tests, security scan, manual approval if needed). Deploy strategy: blue-green, canary, or rolling. Health checks. Auto-rollback on metric regression. Feature flags decouple deploy from release. Discuss what you actually use, not theory.
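The "auto-rollback on metric regression" step can be sketched as a small decision function. This is a minimal illustration, not any particular tool's behavior: the `Metrics` type, the `canary_decision` name, and both thresholds are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Aggregated metrics for one deployment over the observation window."""
    error_rate: float      # fraction of requests that failed, 0.0 to 1.0
    p99_latency_ms: float  # 99th-percentile latency in milliseconds


def canary_decision(baseline: Metrics, canary: Metrics,
                    max_error_delta: float = 0.01,
                    max_latency_ratio: float = 1.2) -> str:
    """Return "promote" or "rollback" by comparing the canary to the baseline.

    Both thresholds are illustrative assumptions, not recommendations.
    """
    # Roll back if the canary fails noticeably more requests than the baseline.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    # Roll back if tail latency regressed beyond the allowed ratio.
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = Metrics(error_rate=0.002, p99_latency_ms=180.0)
print(canary_decision(baseline, Metrics(0.003, 190.0)))  # promote
print(canary_decision(baseline, Metrics(0.030, 190.0)))  # rollback
```

Comparing against a live baseline rather than a fixed number keeps the gate meaningful when traffic patterns shift during the rollout.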
Observability
3. A service is slow but not erroring. How do you debug?
Start with the four golden signals: latency, traffic, errors, saturation. Look at p95/p99, not the mean, because an average hides a slow tail. Trace one slow request end-to-end. Check downstream dependencies (DB, cache, external API). Common culprits are lock contention, a slow query, a GC pause, or a noisy neighbor.
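A small worked example of why tail percentiles matter more than the mean. The latency values are hypothetical, and `percentile` is a simple nearest-rank helper written for this sketch rather than a library function.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 very slow ones.
latencies_ms = [20] * 97 + [2000] * 3

print(statistics.mean(latencies_ms))  # 79.4
print(percentile(latencies_ms, 50))   # 20
print(percentile(latencies_ms, 95))   # 20
print(percentile(latencies_ms, 99))   # 2000
```

The mean of 79.4 ms describes no real request: most take 20 ms and a few take 2 seconds. Here even p95 looks healthy, and only p99 exposes the slow tail, which is why it helps to look at both.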
Security
4. How do you manage secrets in production?
A secrets manager (AWS Secrets Manager, Vault, GCP Secret Manager). Short-lived credentials via IAM roles or workload identity. No secrets in env files committed to git. Rotation policy. Audit access. Strong answers cite a real near-miss.
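One way to picture short-lived credentials on the application side is a cache that refreshes a token shortly before it expires. This is a sketch under stated assumptions: `Credential`, `CredentialCache`, and the `fetch` callable are hypothetical stand-ins for whatever the secrets manager's client actually provides, and the demo uses a fake backend and a fake clock.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Credential:
    token: str
    expires_at: float  # Unix timestamp in seconds


class CredentialCache:
    """Hands out a cached credential and refreshes it shortly before expiry."""

    def __init__(self, fetch: Callable[[], Credential],
                 refresh_margin_s: float = 60.0,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch              # hypothetical call into the secrets backend
        self._margin = refresh_margin_s  # refresh this many seconds before expiry
        self._clock = clock              # injectable so the logic can be tested
        self._current: Optional[Credential] = None

    def get(self) -> str:
        now = self._clock()
        if self._current is None or now >= self._current.expires_at - self._margin:
            self._current = self._fetch()
        return self._current.token


# Demo with a fake backend that issues 15-minute tokens.
fake_now = [1000.0]
issued = []


def fake_fetch() -> Credential:
    issued.append(len(issued) + 1)
    return Credential(token=f"token-{len(issued)}", expires_at=fake_now[0] + 900)


cache = CredentialCache(fake_fetch, clock=lambda: fake_now[0])
print(cache.get())  # token-1
fake_now[0] += 600
print(cache.get())  # token-1 (still valid)
fake_now[0] += 250  # now inside the 60-second refresh margin
print(cache.get())  # token-2
```

Because the token lives only in memory and expires on its own, a leaked copy has a bounded useful lifetime, which is the point of preferring short-lived credentials over static ones.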
Incident
5. Tell me about an incident you led and what changed afterwards.
Cover a specific incident, your role, the timeline, the customer impact, and, most importantly, the systemic change that followed. Tooling change > process change > "we communicated more". Blameless framing is expected at this point.
Cost
6. How would you reduce cloud spend without sacrificing reliability?
Use reserved capacity for steady workloads and spot/preemptible instances for fault-tolerant batch jobs, right-size instances based on actual utilization, apply storage lifecycle policies (for example on S3), and retire orphaned resources. Reliability comes from chaos testing the cheaper config, not from over-provisioning.
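The "reserved for steady, on-demand or spot for bursty" rule can be checked with simple arithmetic. The sketch below uses hypothetical hourly prices, not any provider's price list, and assumes the simplest model: a reservation bills for every hour of the month, while on-demand bills only for hours used. Upfront payments and partial commitments are ignored.

```python
def monthly_cost(hours_used: float, on_demand_rate: float,
                 reserved_rate: float, hours_in_month: float = 730.0):
    """Return (on_demand_cost, reserved_cost) for one instance over a month."""
    return hours_used * on_demand_rate, hours_in_month * reserved_rate


def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the month an instance must run for a reservation to pay off."""
    return reserved_rate / on_demand_rate


# Hypothetical hourly prices chosen only for illustration.
ON_DEMAND = 0.10
RESERVED = 0.06

print(f"break-even utilization: {break_even_utilization(ON_DEMAND, RESERVED):.0%}")

for label, hours in [("always-on service", 730), ("nightly batch job", 200)]:
    on_demand, reserved = monthly_cost(hours, ON_DEMAND, RESERVED)
    cheaper = "reserved" if reserved < on_demand else "on-demand"
    print(f"{label}: on-demand ${on_demand:.2f}, reserved ${reserved:.2f} -> {cheaper}")
```

With these made-up prices, a reservation only pays off for an instance that runs more than about 60% of the month, so the always-on service is cheaper reserved and the 200-hour batch job is cheaper on demand.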