Kubernetes in Production: 10 Best Practices We Enforce on Every Engagement
Ten battle-tested Kubernetes best practices for production workloads covering RBAC, networking, observability, and GitOps.
Running Kubernetes in a lab is easy. Running it in production — reliably, securely, and cost-effectively — is a different discipline entirely. Over the past several years, we have hardened Kubernetes clusters across Azure (AKS), AWS (EKS), and bare-metal environments. These are the ten non-negotiable practices we enforce on every engagement.
1. Namespace Strategy with Clear Boundaries
Namespaces are not just organisational labels — they are security and resource boundaries. We enforce a structured namespace strategy from day one.
What we do:
- One namespace per application or microservice domain (never a shared "apps" namespace)
- Dedicated namespaces for platform concerns: monitoring, ingress, cert-manager, external-secrets
- ResourceQuotas on every namespace to prevent a single team from consuming cluster resources
- LimitRanges to set default CPU/memory requests and limits for pods that forget to specify them
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```

2. RBAC That Follows Least Privilege
Kubernetes RBAC as commonly configured is far too permissive. We build RBAC from zero trust up.
- No cluster-admin bindings except for platform operators — and those via PIM or time-boxed access
- Namespace-scoped RoleBindings tied to AD groups (via Azure AD integration on AKS)
- Developers get a custom role that allows get/list/watch on most resources, exec into pods in non-production, and create/delete on pods only in their namespace
- Audit logging enabled to track who did what (AKS Diagnostic Settings > kube-audit-admin)
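As an illustration, a namespace-scoped developer role along these lines might look like the sketch below. The role name, namespace, and AD group object ID are placeholders; adjust the resource list to your cluster's needs:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-alpha
rules:
  # Read-only access to common workload resources
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "jobs", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  # Create/delete restricted to pods in this namespace only
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: team-alpha
subjects:
  - kind: Group
    name: "00000000-0000-0000-0000-000000000000"  # Azure AD group object ID (placeholder)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```

On AKS with Azure AD integration, the group subject is the AD group's object ID rather than its display name.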
3. Resource Requests and Limits on Every Pod
Pods without resource requests are effectively unschedulable in any meaningful sense: the scheduler cannot make informed placement decisions, which leads to noisy-neighbour problems and node over-commitment.
Our rule:
- Every pod must declare requests and limits for both CPU and memory
- Enforce via an OPA Gatekeeper or Kyverno policy that rejects pods missing these fields
- Right-size using Vertical Pod Autoscaler (VPA) in recommend-only mode, then apply findings
Common mistake: Setting CPU limits too aggressively causes throttling that is invisible to application teams. We recommend setting CPU limits at 2-4x the request value and monitoring for throttling via container_cpu_cfs_throttled_periods_total.
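A Kyverno policy enforcing this rule could be sketched as follows (policy and rule names are illustrative; Kyverno's "?*" pattern requires the field to be present and non-empty):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce  # reject non-compliant pods at admission
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

Start with validationFailureAction set to Audit in existing clusters, review the policy reports, then switch to Enforce.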
4. Network Policies as Default-Deny
By default, all pods in a Kubernetes cluster can talk to all other pods. In production, this is unacceptable.
- Deploy a default-deny NetworkPolicy in every namespace
- Explicitly allow only required ingress and egress flows
- Use Cilium or Calico as the CNI — Azure CNI with network policy support works for AKS
- Test network policies in staging with traffic-mirroring before enforcing in production
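The default-deny policy itself is small. An empty podSelector matches every pod in the namespace, and listing both policy types with no allow rules blocks all ingress and egress until explicit policies are added:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha   # deploy one per application namespace
spec:
  podSelector: {}          # selects all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Remember to add an explicit egress allow for DNS (typically UDP/TCP 53 to kube-dns), or most workloads will break immediately.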
5. Pod Security Standards (Not Pod Security Policies)
Pod Security Policies were deprecated in Kubernetes 1.21 and removed in 1.25. Use Pod Security Standards (via Pod Security Admission) or OPA Gatekeeper.
At minimum, enforce the restricted profile in production namespaces:
- No privileged containers
- No host namespace sharing
- Read-only root filesystem
- Non-root user
- No privilege escalation
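With Pod Security Admission, the restricted profile is enforced per namespace via labels. A minimal example (the warn label surfaces violations without blocking, which is useful during rollout):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-alpha
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
```

Apply warn and audit labels first across all namespaces, fix the violations they surface, then tighten enforce.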
6. Secrets Management Done Right
Kubernetes Secrets are base64-encoded, not encrypted at rest by default. This is not secrets management.
Our approach:
- Enable etcd encryption at rest (default on AKS, must be configured on self-managed)
- Use External Secrets Operator to sync secrets from Azure Key Vault, HashiCorp Vault, or AWS Secrets Manager
- Never store secrets in Git — not even encrypted (use sealed-secrets only as a last resort)
- Rotate secrets automatically with a maximum lifetime of 90 days
- Mount secrets as files, not environment variables (environment variables appear in logs and process listings)
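With External Secrets Operator, a synced secret might be declared roughly like this. The store name and remote key are placeholders, and the ClusterSecretStore pointing at Azure Key Vault is assumed to exist already:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: team-alpha
spec:
  refreshInterval: 1h          # re-sync from the vault hourly
  secretStoreRef:
    name: azure-key-vault      # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: db-credentials       # the Kubernetes Secret to create/update
  data:
    - secretKey: password
      remoteRef:
        key: db-password       # key in Azure Key Vault (placeholder)
```

The operator keeps the Kubernetes Secret in sync, so rotating the value in Key Vault propagates to the cluster without redeploying.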
7. GitOps for All Deployments
Manual kubectl apply in production is an incident waiting to happen. Every deployment must go through a Git-based pipeline.
- ArgoCD or Flux as the GitOps operator
- All manifests stored in a dedicated Git repository (not the application repo)
- Pull-based deployment: the cluster pulls desired state from Git, not the other way around
- Drift detection alerts if someone manually changes a resource
- Promotion across environments via branch strategy or directory structure, never by modifying manifests in place
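An ArgoCD Application wiring a manifests repo to a namespace could look like the sketch below (repo URL, paths, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-alpha-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-manifests.git  # dedicated manifests repo
    targetRevision: main
    path: apps/team-alpha/production   # directory-per-environment promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: team-alpha
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

With selfHeal enabled, manual kubectl changes are reverted automatically, which is the drift protection described above.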
8. Observability: Metrics, Logs, Traces
You cannot operate what you cannot see. We deploy a standardised observability stack on every cluster.
- Metrics: Prometheus (or Azure Monitor managed Prometheus) + Grafana with pre-built dashboards for cluster, node, namespace, and pod-level metrics
- Logs: Fluent Bit shipping to a centralised backend (Azure Monitor, Elasticsearch, or Loki)
- Traces: OpenTelemetry Collector exporting to Jaeger or Azure Application Insights
- Alerting rules for: node NotReady, pod CrashLoopBackOff, PVC near capacity, certificate expiry, HPA at max replicas
Non-negotiable: Every cluster must have a dashboard answering "Is the cluster healthy?" and "Is my application healthy?" within 30 seconds.
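As one example of the alerting rules above, a crash-loop alert with the Prometheus Operator might be sketched like this, using the kube-state-metrics restart counter (names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # More than 3 container restarts in 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```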
9. Ingress and TLS Termination
- Use a single ingress controller per cluster (NGINX Ingress Controller or Azure Application Gateway Ingress Controller)
- Terminate TLS at the ingress with certificates from cert-manager + Let's Encrypt (or internal CA)
- Enforce HTTPS-only — redirect HTTP to HTTPS at the ingress level
- Set rate limiting, request size limits, and timeouts on the ingress controller
- For multi-tenant clusters, consider separate ingress controllers per tenant namespace
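Put together, an Ingress with cert-manager issued TLS and HTTPS redirection might look like this (hostname, issuer, and service names are placeholders; the ssl-redirect annotation is NGINX Ingress Controller specific):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: team-alpha
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod      # assumed ClusterIssuer
    nginx.ingress.kubernetes.io/ssl-redirect: "true"      # force HTTP -> HTTPS
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls   # cert-manager stores the certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```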
10. Cluster Upgrades and Patch Management
Kubernetes releases three minor versions per year, each with a 14-month support window. Falling behind on upgrades is a security and supportability risk.
Our cadence:
- Upgrade to the latest stable minor version within 60 days of release
- Test upgrades in a staging cluster that mirrors production node pools and workloads
- Use node surge upgrades (AKS) or rolling node replacements to avoid downtime
- Review deprecation notes for each release — API removals break workloads silently
- Automate node OS patching (e.g., AKS auto-upgrade channel set to node-image)
Bringing It All Together
These practices are not aspirational — they are the baseline. We codify them as Gatekeeper/Kyverno policies, Terraform modules, and ArgoCD ApplicationSets so they are enforced automatically, not by convention.
The goal is a cluster where a new development team can deploy on day one, knowing that security, observability, and cost guardrails are already in place.
How We Can Help
CC Conceptualise offers a Kubernetes Production Readiness Assessment — a one-week review of your existing cluster(s) against these ten practices, producing a prioritised remediation backlog. For greenfield deployments, we deliver production-hardened clusters as part of our platform engineering engagements. Reach out to discuss your Kubernetes challenges.
Related Resources
- GitOps for Kubernetes with Flux — Declarative delivery for your Kubernetes workloads.
- DevSecOps Pipeline Design — Integrate security into your container build and deploy pipeline.
- Platform Engineering: Building an Internal Developer Platform — The platform layer that abstracts Kubernetes complexity for your teams.