Cloud Architecture · 6 min read

Kubernetes in Production: 10 Best Practices We Enforce on Every Engagement

Ten battle-tested Kubernetes best practices for production workloads covering RBAC, networking, observability, and GitOps.

Running Kubernetes in a lab is easy. Running it in production — reliably, securely, and cost-effectively — is a different discipline entirely. Over the past several years, we have hardened Kubernetes clusters across Azure (AKS), AWS (EKS), and bare-metal environments. These are the ten non-negotiable practices we enforce on every engagement.

1. Namespace Strategy with Clear Boundaries

Namespaces are not just organisational labels — they are security and resource boundaries. We enforce a structured namespace strategy from day one.

What we do:

  • One namespace per application or microservice domain (never a shared "apps" namespace)
  • Dedicated namespaces for platform concerns: monitoring, ingress, cert-manager, external-secrets
  • ResourceQuotas on every namespace to prevent a single team from consuming cluster resources
  • LimitRanges to set default CPU/memory requests and limits for pods that forget to specify them
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```
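A LimitRange pairs naturally with the quota: it supplies defaults for containers that omit requests and limits, so they still schedule sensibly and count against the quota. A minimal sketch (the values here are illustrative, not a recommendation):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
    - type: Container
      defaultRequest:      # applied as requests when a container omits them
        cpu: 100m
        memory: 128Mi
      default:             # applied as limits when a container omits them
        cpu: 500m
        memory: 512Mi
```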

2. RBAC That Follows Least Privilege

Default Kubernetes RBAC is too permissive. We build RBAC from zero trust up.

  • No cluster-admin bindings except for platform operators — and those via PIM or time-boxed access
  • Namespace-scoped RoleBindings tied to AD groups (via Azure AD integration on AKS)
  • Developers get a custom role that allows get/list/watch on most resources, exec into pods in non-production, and create/delete on pods only in their namespace
  • Audit logging enabled to track who did what (AKS Diagnostic Settings > kube-audit-admin)
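The developer role described above might look like the following sketch — the namespace and the Azure AD group object ID are placeholders for your own values:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-alpha
rules:
  - apiGroups: ["", "apps", "batch"]     # read-only across common resources
    resources: ["*"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]                      # create/delete limited to pods
    resources: ["pods"]
    verbs: ["create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers
  namespace: team-alpha
subjects:
  - kind: Group
    name: "<azure-ad-group-object-id>"   # AD group via AKS Azure AD integration
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```

In non-production clusters, an additional rule granting `create` on `pods/exec` implements the exec allowance mentioned above.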

3. Resource Requests and Limits on Every Pod

Pods without resource requests give the scheduler nothing to reason about — it cannot make informed placement decisions, which leads to noisy-neighbour problems and node over-commitment.

Our rule:

  • Every pod must declare requests and limits for both CPU and memory
  • Enforce via OPA Gatekeeper or Kyverno policy that rejects pods missing these fields
  • Right-size using Vertical Pod Autoscaler (VPA) in recommend-only mode, then apply findings

Common mistake: Setting CPU limits too aggressively causes throttling that is invisible to application teams. We recommend setting CPU limits at 2-4x the request value and monitoring for throttling via container_cpu_cfs_throttled_periods_total.
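The enforcement policy can be expressed as a Kyverno validation rule along these lines (a sketch based on Kyverno's standard require-requests-limits pattern; adjust the policy name and failure action to your conventions):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce     # reject non-compliant pods outright
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"          # any non-empty value
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
```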

4. Network Policies as Default-Deny

By default, all pods in a Kubernetes cluster can talk to all other pods. In production, this is unacceptable.

  • Deploy a default-deny NetworkPolicy in every namespace
  • Explicitly allow only required ingress and egress flows
  • Use Cilium or Calico as the CNI — Azure CNI with network policy support works for AKS
  • Test network policies in staging with traffic-mirroring before enforcing in production
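The default-deny policy itself is short — an empty podSelector matches every pod in the namespace, and listing both policy types with no allow rules blocks all traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha     # repeat per namespace
spec:
  podSelector: {}           # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Remember that this also blocks DNS: an explicit egress allowance to kube-dns (UDP/TCP 53) is almost always the first follow-up policy.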

5. Pod Security Standards (Not Pod Security Policies)

Pod Security Policies were deprecated in Kubernetes 1.21 and removed in 1.25. Use Pod Security Standards (via Pod Security Admission) or OPA Gatekeeper.

At minimum, enforce the restricted profile in production namespaces:

  • No privileged containers
  • No host namespace sharing
  • Read-only root filesystem
  • Non-root user
  • No privilege escalation
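With Pod Security Admission, the restricted profile is enforced by labelling the namespace — no extra controller required:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-alpha
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted   # surface violations in kubectl output
```

Note that read-only root filesystems are not part of the restricted profile itself; enforce that separately via a Gatekeeper or Kyverno policy.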

6. Secrets Management Done Right

Kubernetes Secrets are base64-encoded, not encrypted at rest by default. This is not secrets management.

Our approach:

  • Enable etcd encryption at rest (default on AKS, must be configured on self-managed)
  • Use External Secrets Operator to sync secrets from Azure Key Vault, HashiCorp Vault, or AWS Secrets Manager
  • Never store secrets in Git — not even encrypted (use sealed-secrets only as a last resort)
  • Rotate secrets automatically with a maximum lifetime of 90 days
  • Mount secrets as files, not environment variables (environment variables appear in logs and process listings)
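With External Secrets Operator, the sync is declared as an ExternalSecret referencing a pre-configured store. A sketch, assuming a ClusterSecretStore named azure-key-vault already points at your Key Vault:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: team-alpha
spec:
  refreshInterval: 1h            # re-sync cadence from the backing store
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-key-vault        # assumed store name
  target:
    name: db-credentials         # Kubernetes Secret the operator creates and keeps in sync
  data:
    - secretKey: password
      remoteRef:
        key: db-password         # secret name in Key Vault
```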

7. GitOps for All Deployments

Manual kubectl apply in production is an incident waiting to happen. Every deployment must go through a Git-based pipeline.

  • ArgoCD or Flux as the GitOps operator
  • All manifests stored in a dedicated Git repository (not the application repo)
  • Pull-based deployment: the cluster pulls desired state from Git, not the other way around
  • Drift detection alerts if someone manually changes a resource
  • Promotion across environments via branch strategy or directory structure, never by modifying manifests in place
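In ArgoCD, the pull-based model with drift correction reduces to a single Application manifest — repoURL and paths below are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-alpha-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git   # dedicated manifest repo
    targetRevision: main
    path: apps/team-alpha/production
  destination:
    server: https://kubernetes.default.svc
    namespace: team-alpha
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes, enforcing Git as the source of truth
```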

8. Observability: Metrics, Logs, Traces

You cannot operate what you cannot see. We deploy a standardised observability stack on every cluster.

  • Metrics: Prometheus (or Azure Monitor managed Prometheus) + Grafana with pre-built dashboards for cluster, node, namespace, and pod-level metrics
  • Logs: Fluent Bit shipping to a centralised backend (Azure Monitor, Elasticsearch, or Loki)
  • Traces: OpenTelemetry Collector exporting to Jaeger or Azure Application Insights
  • Alerting rules for: node NotReady, pod CrashLoopBackOff, PVC near capacity, certificate expiry, HPA at max replicas

Non-negotiable: Every cluster must have a dashboard answering "Is the cluster healthy?" and "Is my application healthy?" within 30 seconds.
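As one concrete example, the CrashLoopBackOff alert can be defined as a PrometheusRule (assuming the Prometheus Operator and kube-state-metrics are deployed, as they are in most Prometheus stacks):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts
  namespace: monitoring
spec:
  groups:
    - name: pods
      rules:
        - alert: PodCrashLooping
          expr: |
            max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
          for: 10m                 # sustained, not a one-off restart
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
```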

9. Ingress and TLS Termination

  • Use a single ingress controller per cluster (NGINX Ingress Controller or Azure Application Gateway Ingress Controller)
  • Terminate TLS at the ingress with certificates from cert-manager + Let's Encrypt (or internal CA)
  • Enforce HTTPS-only — redirect HTTP to HTTPS at the ingress level
  • Set rate limiting, request size limits, and timeouts on the ingress controller
  • For multi-tenant clusters, consider separate ingress controllers per tenant namespace
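Tying these together for the NGINX Ingress Controller — cert-manager issuance, HTTPS redirect, and a request size cap in one resource (hostname and ClusterIssuer name are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: team-alpha-app
  namespace: team-alpha
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod      # assumed ClusterIssuer name
    nginx.ingress.kubernetes.io/ssl-redirect: "true"      # force HTTP -> HTTPS
    nginx.ingress.kubernetes.io/proxy-body-size: "8m"     # cap request size
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls   # cert-manager stores the issued cert here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: team-alpha-app
                port:
                  number: 80
```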

10. Cluster Upgrades and Patch Management

Kubernetes releases three minor versions per year, each with a 14-month support window. Falling behind on upgrades is a security and supportability risk.

Our cadence:

  • Upgrade to the latest stable minor version within 60 days of release
  • Test upgrades in a staging cluster that mirrors production node pools and workloads
  • Use node surge upgrades (AKS) or rolling node replacements to avoid downtime
  • Review deprecation notes for each release — API removals break workloads silently
  • Automate node OS patching (e.g., AKS auto-upgrade channel set to node-image)

Bringing It All Together

These practices are not aspirational — they are the baseline. We codify them as Gatekeeper/Kyverno policies, Terraform modules, and ArgoCD ApplicationSets so they are enforced automatically, not by convention.

The goal is a cluster where a new development team can deploy on day one, knowing that security, observability, and cost guardrails are already in place.

How We Can Help


CC Conceptualise offers a Kubernetes Production Readiness Assessment — a one-week review of your existing cluster(s) against these ten practices, producing a prioritised remediation backlog. For greenfield deployments, we deliver production-hardened clusters as part of our platform engineering engagements. Reach out to discuss your Kubernetes challenges.

Kubernetes best practices · Kubernetes production · AKS security · GitOps · container orchestration
