Kubernetes in Production: 10 Best Practices We Enforce on Every Engagement
Ten battle-tested Kubernetes best practices for production workloads covering RBAC, networking, observability, and GitOps.
Running Kubernetes in a lab is easy. Running it in production — reliably, securely, and cost-effectively — is a different discipline entirely. Over the past several years, we have hardened Kubernetes clusters across Azure (AKS), AWS (EKS), and bare-metal environments. These are the ten non-negotiable practices we enforce on every engagement.
1. Namespace Strategy with Clear Boundaries
Namespaces are not just organisational labels — they are security and resource boundaries. We enforce a structured namespace strategy from day one.
What we do:
- One namespace per application or microservice domain (never a shared "apps" namespace)
- Dedicated namespaces for platform concerns: monitoring, ingress, cert-manager, external-secrets
- ResourceQuotas on every namespace to prevent a single team from consuming cluster resources
- LimitRanges to set default CPU/memory requests and limits for pods that forget to specify them
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```

2. RBAC That Follows Least Privilege
Kubernetes RBAC as commonly configured is far too permissive. We build RBAC from zero trust up.
- No cluster-admin bindings except for platform operators — and those via PIM or time-boxed access
- Namespace-scoped RoleBindings tied to AD groups (via Azure AD integration on AKS)
- Developers get a custom role that allows get/list/watch on most resources, exec into pods in non-production, and create/delete on pods only in their namespace
- Audit logging enabled to track who did what (AKS Diagnostic Settings > kube-audit-admin)
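As an illustration, a namespace-scoped developer role along these lines might look like the sketch below. The role name, namespace, and AD group object ID are placeholders; adjust the resource list to your cluster's needs:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-alpha
rules:
  # Read-only access to common workload resources
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "jobs", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  # Create/delete restricted to pods in this namespace only
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: team-alpha
subjects:
  - kind: Group
    name: "00000000-0000-0000-0000-000000000000"  # Azure AD group object ID (placeholder)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```

On AKS with Azure AD integration, the group subject is the AD group's object ID rather than its display name.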
3. Resource Requests and Limits on Every Pod
Pods without resource requests are effectively unschedulable in any meaningful sense: the scheduler cannot make informed placement decisions, which leads to noisy-neighbour problems and node over-commitment.
Our rule:
- Every pod must declare requests and limits for both CPU and memory
- Enforce via an OPA Gatekeeper or Kyverno policy that rejects pods missing these fields
- Right-size using Vertical Pod Autoscaler (VPA) in recommend-only mode, then apply findings
Common mistake: Setting CPU limits too aggressively causes throttling that is invisible to application teams. We recommend setting CPU limits at 2-4x the request value and monitoring for throttling via container_cpu_cfs_throttled_periods_total.
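A Kyverno policy enforcing this rule could be sketched as follows (policy and rule names are illustrative; Kyverno's "?*" pattern requires the field to be present and non-empty):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce  # reject non-compliant pods at admission
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

Start with validationFailureAction set to Audit in existing clusters, review the policy reports, then switch to Enforce.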
4. Network Policies as Default-Deny
By default, all pods in a Kubernetes cluster can talk to all other pods. In production, this is unacceptable.
- Deploy a default-deny NetworkPolicy in every namespace
- Explicitly allow only required ingress and egress flows
- Use Cilium or Calico as the CNI — Azure CNI with network policy support works for AKS
- Test network policies in staging with traffic-mirroring before enforcing in production
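The default-deny policy itself is small. An empty podSelector matches every pod in the namespace, and listing both policy types with no allow rules blocks all ingress and egress until explicit policies are added:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha   # deploy one per application namespace
spec:
  podSelector: {}          # selects all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Remember to add an explicit egress allow for DNS (typically UDP/TCP 53 to kube-dns), or most workloads will break immediately.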
5. Pod Security Standards (Not Pod Security Policies)
Pod Security Policies were deprecated in Kubernetes 1.21 and removed in 1.25. Use Pod Security Standards (via Pod Security Admission) or OPA Gatekeeper.
At minimum, enforce the restricted profile in production namespaces:
- No privileged containers
- No host namespace sharing
- Read-only root filesystem
- Non-root user
- No privilege escalation
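With Pod Security Admission, the restricted profile is enforced per namespace via labels. A minimal example (the warn label surfaces violations without blocking, which is useful during rollout):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-alpha
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
```

Apply warn and audit labels first across all namespaces, fix the violations they surface, then tighten enforce.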
6. Secrets Management Done Right
Kubernetes Secrets are base64-encoded, not encrypted at rest by default. This is not secrets management.
Our approach:
- Enable etcd encryption at rest (default on AKS, must be configured on self-managed)
- Use External Secrets Operator to sync secrets from Azure Key Vault, HashiCorp Vault, or AWS Secrets Manager
- Never store secrets in Git — not even encrypted (use sealed-secrets only as a last resort)
- Rotate secrets automatically with a maximum lifetime of 90 days
- Mount secrets as files, not environment variables (environment variables appear in logs and process listings)
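With External Secrets Operator, a synced secret might be declared roughly like this. The store name and remote key are placeholders, and the ClusterSecretStore pointing at Azure Key Vault is assumed to exist already:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: team-alpha
spec:
  refreshInterval: 1h          # re-sync from the vault hourly
  secretStoreRef:
    name: azure-key-vault      # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: db-credentials       # the Kubernetes Secret to create/update
  data:
    - secretKey: password
      remoteRef:
        key: db-password       # key in Azure Key Vault (placeholder)
```

The operator keeps the Kubernetes Secret in sync, so rotating the value in Key Vault propagates to the cluster without redeploying.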
7. GitOps for All Deployments
Manual kubectl apply in production is an incident waiting to happen. Every deployment must go through a Git-based pipeline.
- ArgoCD or Flux as the GitOps operator
- All manifests stored in a dedicated Git repository (not the application repo)
- Pull-based deployment: the cluster pulls desired state from Git, not the other way around
- Drift detection alerts if someone manually changes a resource
- Promotion across environments via branch strategy or directory structure, never by modifying manifests in place
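An ArgoCD Application wiring a manifests repo to a namespace could look like the sketch below (repo URL, paths, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-alpha-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git  # dedicated manifests repo
    targetRevision: main
    path: apps/team-alpha/production   # directory-per-environment promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: team-alpha
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

With selfHeal enabled, manual kubectl changes are reverted automatically, which is the drift protection described above.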
8. Observability: Metrics, Logs, Traces
You cannot operate what you cannot see. We deploy a standardised observability stack on every cluster.
- Metrics: Prometheus (or Azure Monitor managed Prometheus) + Grafana with pre-built dashboards for cluster, node, namespace, and pod-level metrics
- Logs: Fluent Bit shipping to a centralised backend (Azure Monitor, Elasticsearch, or Loki)
- Traces: OpenTelemetry Collector exporting to Jaeger or Azure Application Insights
- Alerting rules for: node NotReady, pod CrashLoopBackOff, PVC near capacity, certificate expiry, HPA at max replicas
Non-negotiable: Every cluster must have a dashboard answering "Is the cluster healthy?" and "Is my application healthy?" within 30 seconds.
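As one example of the alerting rules above, a crash-loop alert with the Prometheus Operator might be sketched like this, using the kube-state-metrics restart counter (names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # More than 3 container restarts in 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```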
9. Ingress and TLS Termination
- Use a single ingress controller per cluster (NGINX Ingress Controller or Azure Application Gateway Ingress Controller)
- Terminate TLS at the ingress with certificates from cert-manager + Let's Encrypt (or internal CA)
- Enforce HTTPS-only — redirect HTTP to HTTPS at the ingress level
- Set rate limiting, request size limits, and timeouts on the ingress controller
- For multi-tenant clusters, consider separate ingress controllers per tenant namespace
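Put together, an Ingress with cert-manager issued TLS and HTTPS redirection might look like this (hostname, issuer, and service names are placeholders; the ssl-redirect annotation is NGINX Ingress Controller specific):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: team-alpha
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod      # assumed ClusterIssuer
    nginx.ingress.kubernetes.io/ssl-redirect: "true"      # force HTTP -> HTTPS
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls   # cert-manager stores the certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```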
10. Cluster Upgrades and Patch Management
Kubernetes releases three minor versions per year, each with a 14-month support window. Falling behind on upgrades is a security and supportability risk.
Our cadence:
- Upgrade to the latest stable minor version within 60 days of release
- Test upgrades in a staging cluster that mirrors production node pools and workloads
- Use node surge upgrades (AKS) or rolling node replacements to avoid downtime
- Review deprecation notes for each release — API removals break workloads silently
- Automate node OS patching (e.g., AKS auto-upgrade channel set to node-image)
Bringing It All Together
These practices are not aspirational — they are the baseline. We codify them as Gatekeeper/Kyverno policies, Terraform modules, and ArgoCD ApplicationSets so they are enforced automatically, not by convention.
The goal is a cluster where a new development team can deploy on day one, knowing that security, observability, and cost guardrails are already in place.
How We Can Help
CC Conceptualise offers a Kubernetes Production Readiness Assessment — a one-week review of your existing cluster(s) against these ten practices, producing a prioritised remediation backlog. For greenfield deployments, we deliver production-hardened clusters as part of our platform engineering engagements. Reach out to discuss your Kubernetes challenges.
Related Resources
- GitOps for Kubernetes with Flux — Declarative delivery for your Kubernetes workloads.
- DevSecOps Pipeline Design — Integrate security into your container build and deploy pipeline.
- Platform Engineering: Building an Internal Developer Platform — The platform layer that abstracts Kubernetes complexity for your teams.