Become a DevOps Engineer: Kubernetes Full Course with Real Projects and Deployment Scenarios

Complete DevOps Training Program

Kubernetes: Zero to Production

A complete, battle-tested course built from 10+ years of real production experience. From first cluster to enterprise-scale deployment.

15 In-Depth Chapters

40+ Real Projects

Production-Grade YAML

AWS EKS + CI/CD

Course Contents

1. Introduction to Kubernetes
2. Kubernetes Architecture
3. Installation and Setup
4. Pods and Containers
5. Deployments and ReplicaSets
6. Services and Networking
7. ConfigMaps and Secrets
8. Volumes and Storage
9. Ingress and Load Balancing
10. Helm Package Manager
11. Kubernetes Security
12. Monitoring and Logging
13. CI/CD with Kubernetes
14. Kubernetes on AWS EKS
15. Production Deployment Strategies
* Master Enterprise Projects

Chapter One

Introduction to Kubernetes

What Is Kubernetes and Why Does It Exist?

Before Kubernetes existed, running applications at scale was a serious operational nightmare. Imagine you have a web application that gets 10 million requests per day. Some of those requests spike unpredictably — a marketing campaign drops, traffic doubles in 10 minutes. Your operations team scrambles to spin up new virtual machines, install dependencies, deploy code, and configure load balancers — all manually. Then when traffic drops, those machines sit idle, burning money.

Kubernetes was created by Google in 2014, born from their internal system called Borg — the same system that ran Gmail, YouTube, and Google Search for over a decade. Google ran billions of containers per week internally using Borg. They open-sourced the lessons learned as Kubernetes and donated it to the Cloud Native Computing Foundation (CNCF) in 2016.

Key Insight

Kubernetes is not just a tool — it is a platform for building platforms. It provides the building blocks (networking, storage, scheduling, self-healing) so your team can focus on writing business logic, not infrastructure plumbing.

The Problem Kubernetes Solves

Let us walk through the evolution of application deployment to fully understand why Kubernetes matters:

Era 1 — Physical Servers (2000s): One application per physical server. If Application A crashed, it might affect Application B on the same machine. Scaling meant purchasing new hardware. Provisioning took weeks. Utilization was typically below 15%.

Era 2 — Virtual Machines (2010s): Virtualization allowed multiple applications to share hardware. VMware, Hyper-V, KVM. Better utilization (~50%), but each VM carried a full OS kernel — heavy, slow to boot, expensive in memory. Hundreds of VMs still meant complex orchestration.

Era 3 — Containers (2013+): Docker changed everything. Containers share the host OS kernel, start in milliseconds, and pack dozens onto a single machine. But running 500 containers across 30 servers raised new questions: which server runs which container? What happens when a container crashes? How do containers talk to each other? How do you roll out updates without downtime?

Era 4 — Container Orchestration (2016+): Kubernetes answers every one of those questions. It is the operating system for your data center.

What Kubernetes Actually Does

Kubernetes handles the following automatically, things that would require entire operations teams to do manually:

Bin Packing

Kubernetes schedules containers to nodes based on available CPU and memory, maximizing hardware utilization automatically.

Self-Healing

If a container crashes, Kubernetes restarts it. If a node goes offline, its pods are evicted after a timeout (about five minutes by default) and replacements are scheduled onto healthy nodes.

Horizontal Scaling

Scale from 3 replicas to 50 with a single command or automatically via the Horizontal Pod Autoscaler based on CPU or custom metrics.

Rolling Updates

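Deploy new versions without downtime. Kubernetes replaces pods gradually and waits for each new pod to pass its readiness check. If the new pods never become ready, the rollout stalls while the old pods keep serving traffic; you revert with kubectl rollout undo.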

Service Discovery

Every service gets a stable DNS name. Your frontend just calls http://backend-service — Kubernetes handles the routing transparently.

Secret Management

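Store passwords, tokens, and certificates as Secret objects and mount them into pods without hard-coding sensitive data into container images. Secrets are only base64-encoded by default; enable encryption at rest and restrict access with RBAC to actually protect them.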
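
To see why Secret values need more than the default protection, the sketch below round-trips a value through base64, the encoding used under the data field of a Secret manifest. The helper names and the password are made up for illustration; this shows the encoding only and is not a recommended way to handle credentials.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: show that base64 is a reversible encoding, not encryption.

# Encode a value the way it appears under "data:" in a Secret manifest.
encode_secret_value() {
  printf '%s' "$1" | base64
}

# Decode it again; no key is needed.
decode_secret_value() {
  printf '%s' "$1" | base64 --decode
}

encoded=$(encode_secret_value 'S3cr3tPassw0rd')   # made-up example value
echo "stored in manifest:  ${encoded}"
echo "recovered by reader: $(decode_secret_value "${encoded}")"
```

Anyone who can read the Secret object can run the second step, which is why RBAC and encryption at rest matter.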

Real-World Examples: Where Kubernetes Runs Today

Example 1 — Spotify:

Spotify migrated from a hand-rolled orchestration system (Helios) to Kubernetes. They manage over 1,200 microservices across Kubernetes clusters. Their recommendation engine, playlist service, and search all run in isolated pods. Rolling deployments happen dozens of times per day without user impact.

Example 2 — Airbnb:

Airbnb runs 1,000+ microservices on Kubernetes. During peak booking periods (holidays), their Horizontal Pod Autoscaler scales the search service from 20 pods to 200 pods within minutes. When traffic subsides, it scales back down — saving significant cloud costs.

Example 3 — The New York Times:

NYT moved their entire infrastructure to Kubernetes on GKE. Breaking news events (elections, major events) cause traffic spikes of 10x in minutes. Kubernetes auto-scaling keeps their site up. Their publishing workflow, image processing, and reader API all run as separate microservices in the same cluster.

Example 4 — Pokemon GO:

When Pokemon GO launched, it received 50x the expected traffic within days. Because Niantic ran it on Kubernetes on Google Cloud, they were able to scale their clusters rapidly to handle the load. Without an orchestrator that could add capacity that quickly, the risk of a prolonged outage would have been far higher.

Example 5 — Goldman Sachs:

Goldman Sachs uses Kubernetes for their trading platform, running thousands of pods for risk calculation, market data ingestion, and order routing. Multi-tenancy features allow different trading desks to share infrastructure while staying isolated from each other via namespaces and network policies.

Chapter 1 — Project 1

Your First Kubernetes Application: Static Website Deployment

Scenario: A startup called TechBlog wants to deploy their static marketing website on Kubernetes to learn the platform. You are the platform engineer. Your goal: containerize a simple Nginx web server, push it to Docker Hub, and deploy it to a local Kubernetes cluster using Minikube.

Step 1: Create Your Application

# Create project directory
mkdir techblog-k8s
cd techblog-k8s

# Create a simple HTML page
cat > index.html <<EOF
<!DOCTYPE html>
<html>
<head><title>TechBlog - Running on Kubernetes!</title></head>
<body style="font-family:sans-serif;text-align:center;padding:50px;">
  <h1>Welcome to TechBlog</h1>
  <p>This page is served from a Kubernetes Pod!</p>
</body>
</html>
EOF

Step 2: Create the Dockerfile

# Dockerfile
# We use the official Nginx image as our base
FROM nginx:1.25-alpine

# Copy our HTML file into the Nginx default serving directory
COPY index.html /usr/share/nginx/html/index.html

# Expose port 80 (this is documentation only - does not publish the port)
EXPOSE 80

# The default CMD from the Nginx image starts the server
# No need to override it

Step 3: Build and Push the Docker Image

# Build the image (replace 'yourdockerhubusername' with your actual username)
docker build -t yourdockerhubusername/techblog:v1.0 .
# -t = tag, the image name and version we give to this build
# . = build context is the current directory (where Dockerfile lives)

# Test locally before pushing
docker run -d -p 8080:80 yourdockerhubusername/techblog:v1.0
# -d = detached mode (runs in background)
# -p 8080:80 = map host port 8080 to container port 80

# Verify it works
curl http://localhost:8080
# You should see the HTML of your welcome page

# Stop the local container
docker stop $(docker ps -q --filter ancestor=yourdockerhubusername/techblog:v1.0)

# Login to Docker Hub
docker login

# Push the image
docker push yourdockerhubusername/techblog:v1.0
# This uploads the image layers to Docker Hub registry
# Kubernetes nodes will pull from here during deployment

Step 4: Create the Kubernetes Deployment YAML

# deployment.yaml
apiVersion: apps/v1         # Which Kubernetes API group handles this resource
kind: Deployment            # The type of resource we are creating
metadata:
  name: techblog            # Internal name for this deployment
  labels:
    app: techblog           # Labels help us query/manage related resources
spec:
  replicas: 3               # Run 3 identical pods for high availability
  selector:
    matchLabels:
      app: techblog         # The Deployment manages Pods with this label
  template:                 # Template defines what each Pod looks like
    metadata:
      labels:
        app: techblog       # Every Pod created gets this label
    spec:
      containers:
      - name: techblog      # Name of the container inside the pod
        image: yourdockerhubusername/techblog:v1.0  # Docker image to use
        ports:
        - containerPort: 80   # Port the application listens on inside the container
        resources:
          requests:           # Minimum resources needed to schedule the pod
            memory: "64Mi"
            cpu: "100m"       # 100 millicores = 0.1 CPU core
          limits:             # Maximum resources the container can consume
            memory: "128Mi"   # Exceeding this gets the container OOM-killed
            cpu: "200m"       # CPU use above this is throttled, not killed
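
The CPU values in the manifest above use Kubernetes quantity notation, where 100m means 100 millicores and a plain number means whole cores. As a rough sketch (the helper name is hypothetical), the conversion and the total request the scheduler has to reserve for the three replicas look like this:

```shell
#!/usr/bin/env bash
# Convert a Kubernetes CPU quantity ("100m", "0.5", "2") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;   # already expressed in millicores
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;   # cores to millicores
  esac
}

# Total CPU the scheduler must reserve for the Deployment above:
# 3 replicas x 100m request each.
replicas=3
request=$(cpu_to_millicores "100m")
echo "total CPU request: $(( replicas * request ))m"
```

The script prints a total of 300m, which is less than a third of one core for the whole Deployment.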

Step 5: Create the Service YAML

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: techblog-service
spec:
  selector:
    app: techblog           # Routes traffic to pods with this label
  ports:
  - protocol: TCP
    port: 80                # Port exposed by the service (cluster internal)
    targetPort: 80          # Port on the pod to forward to
  type: NodePort            # Exposes the service on each node's IP at a static port
                            # For local Minikube testing; use LoadBalancer on cloud

Step 6: Deploy to Kubernetes

# Start Minikube (local Kubernetes cluster)
minikube start --driver=docker

# Apply the deployment - Kubernetes reads the YAML and creates resources
kubectl apply -f deployment.yaml
# Output: deployment.apps/techblog created

# Apply the service
kubectl apply -f service.yaml
# Output: service/techblog-service created

# Watch pods come up (press Ctrl+C to stop watching)
kubectl get pods -w
# NAME                        READY   STATUS    RESTARTS   AGE
# techblog-5d9f7d8c4f-2xhkp   1/1     Running   0          30s
# techblog-5d9f7d8c4f-7klmn   1/1     Running   0          30s
# techblog-5d9f7d8c4f-9pvwr   1/1     Running   0          30s

# Get the URL to access your app (Minikube specific)
minikube service techblog-service --url
# Outputs something like: http://192.168.49.2:32451

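# Open it in your browser or curl it
# (on macOS/Windows with the docker driver, the --url command keeps a tunnel
#  open in the foreground, so run it in one terminal and curl from another)
curl $(minikube service techblog-service --url)
# You see your TechBlog HTML page - served from a Kubernetes pod!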

Expected Output and What You Learned

Your website is now running across 3 pods. If you delete one pod manually (kubectl delete pod techblog-5d9f7d8c4f-2xhkp), Kubernetes will automatically create a new one within seconds. This is self-healing in action. You have just deployed your first production-style application on Kubernetes.

Interview Questions — Chapter 1

What problem does Kubernetes solve that Docker alone cannot? (Answer: Docker runs and packages containers on a single host. Kubernetes orchestrates containers across multiple hosts, handling scheduling, scaling, self-healing, service discovery, and rolling updates automatically.)
What is the difference between a container and a pod in Kubernetes?
Why would you choose Kubernetes over Docker Compose for a production application?
What is the CNCF and why does it matter for Kubernetes adoption?
Explain the concept of desired state vs actual state in Kubernetes.
A pod crashes repeatedly. What does Kubernetes do automatically and what would you do to investigate?
What is bin packing in the context of Kubernetes scheduling?
Name three production companies running Kubernetes and describe what they use it for.

Chapter Two

Kubernetes Architecture

Understanding Kubernetes architecture is essential. Every problem you debug, every performance issue you tune, every security decision you make — it all starts with knowing what each component does and how they communicate. Let us go deep.

The Big Picture: Control Plane vs Worker Nodes

A Kubernetes cluster is divided into two sets of machines: the Control Plane (previously called Master) and Worker Nodes. The Control Plane is the brain — it makes decisions about the cluster. Worker Nodes are the muscle — they actually run your workloads.

KUBERNETES CLUSTER ARCHITECTURE


  CONTROL PLANE (Master Node)
  +----------------------------------------------------------+
  |                                                          |
  |  +------------------+    +-------------------------+    |
  |  |   kube-apiserver |    |   etcd (Key-Value Store)|    |
  |  |  (Front door to  |    |   (Cluster state/truth) |    |
  |  |   the cluster)   |    +-------------------------+    |
  |  +------------------+                                   |
  |                                                          |
  |  +-------------------+   +------------------------+    |
  |  |  kube-scheduler   |   |  kube-controller-mgr   |    |
  |  | (Assigns pods to  |   | (Reconciliation loops) |    |
  |  |  nodes)           |   +------------------------+    |
  |  +-------------------+                                   |
  +----------------------------------------------------------+
              |               |               |
  WORKER NODE 1        WORKER NODE 2        WORKER NODE 3
  +----------+          +----------+         +----------+
  | kubelet  |          | kubelet  |         | kubelet  |
  | kube-    |          | kube-    |         | kube-    |
  |  proxy   |          |  proxy   |         |  proxy   |
  | Pod A    |          | Pod C    |         | Pod E    |
  | Pod B    |          | Pod D    |         | Pod F    |
  +----------+          +----------+         +----------+

Control Plane Components (Deep Dive)

kube-apiserver — The Gateway

The API server is the hub of the cluster: the scheduler, the controllers, every kubelet, and kubectl all talk to it rather than to each other, and it is the only component that reads and writes etcd. When you run kubectl apply -f deployment.yaml, your request goes to the API server. When the scheduler places a pod, it tells the API server. When a kubelet reports pod status, it reports to the API server.

The API server validates requests (authentication, authorization, admission control), persists state to etcd, and serves as the hub through which all cluster communication flows. It is stateless — you can run multiple instances for high availability. All state lives in etcd, not in the API server’s memory.

etcd — The Brain’s Memory

etcd is a distributed key-value store that serves as Kubernetes’ backing store for all cluster data. Every object you create — pods, services, deployments, secrets — is stored as a key-value entry in etcd. The key is the object’s path (e.g., /registry/pods/default/my-pod) and the value is the serialized object.

etcd uses the Raft consensus algorithm to maintain consistency across multiple etcd nodes. In a production cluster, you run 3 or 5 etcd nodes. Use an odd number: quorum is a majority (floor(n/2) + 1), so a fourth or sixth member adds no extra fault tolerance. Losing etcd without a backup is catastrophic: you lose all cluster state. Always back up etcd in production with etcdctl snapshot save.
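
The odd-number advice follows from majority arithmetic. The sketch below (function names are illustrative) prints the quorum size and the number of member failures a cluster can survive for several cluster sizes:

```shell
#!/usr/bin/env bash
# Quorum arithmetic for an etcd cluster of N members (Raft majority rule).

# Members that must agree before a write is committed: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the cluster keeps accepting writes.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for members in 1 2 3 4 5 6; do
  echo "members=${members} quorum=$(etcd_quorum "${members}") tolerates=$(etcd_fault_tolerance "${members}")"
done
```

Four members tolerate one failure, the same as three, and six tolerate two, the same as five, so the extra even member only adds replication cost.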

kube-scheduler — The Placement Engine

When a pod is created without a node assignment (the normal case for pods created by Deployments and other controllers), the scheduler picks which node to place it on. This is not random. The scheduler runs two phases:

Filtering: Eliminate nodes that do not meet the pod’s requirements. Does the node have enough unreserved CPU and memory for the pod’s requests? Does it carry the labels the pod’s node selector asks for? Does the pod tolerate the node’s taints? After filtering, only eligible nodes remain.

Scoring: Among eligible nodes, score each one. Prefer nodes with the most remaining capacity. Spread pods across availability zones. Prefer nodes that already have the required container image cached. The highest-scoring node wins, and the scheduler writes that assignment back to etcd via the API server.
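
The two phases can be sketched as a toy script. This is only an illustration under simplifying assumptions: the node list is made up, filtering looks only at free CPU and memory, and scoring uses a single rule (most CPU left after placement) instead of the real scheduler's weighted plugins.

```shell
#!/usr/bin/env bash
# Toy two-phase scheduler.
# Input lines on stdin: "<node> <free_cpu_millicores> <free_memory_mi>".
# Prints the chosen node name; exits with status 1 when no node fits.
pick_node() {
  local req_cpu="$1" req_mem="$2"
  awk -v cpu="$req_cpu" -v mem="$req_mem" '
    # Filtering: skip nodes that cannot satisfy the pod requests.
    $2 < cpu || $3 < mem { next }
    # Scoring: favour the node with the most CPU left after placement.
    {
      score = $2 - cpu
      if (best == "" || score > best_score) { best = $1; best_score = score }
    }
    END { if (best == "") exit 1; print best }
  '
}

# Hypothetical cluster state: node name, free CPU (m), free memory (Mi).
# The pod requests 500m CPU and 256Mi memory.
pick_node 500 256 <<'EOF'
node-a 300 4096
node-b 1500 2048
node-c 900 128
EOF
```

With this sample input, node-a fails the CPU filter and node-c fails the memory filter, so the function prints node-b.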

kube-controller-manager — The Reconciliation Engine

The controller manager runs multiple control loops simultaneously, each one responsible for a specific resource type. The core idea behind every controller is reconciliation: continuously compare the desired state (what you declared) with the actual state (what exists), and take action to close the gap.

ReplicaSet Controller: You declared 3 replicas. Only 2 pods are running (one crashed). This controller sees the difference and creates a new pod to restore the count to 3.

Deployment Controller: You updated a deployment image. This controller creates a new ReplicaSet with the new image and scales it up while scaling down the old one.

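Node Controller: A node stops sending heartbeats. After a short grace period this controller marks the node as NotReady and taints it. Pods on that node are evicted once their toleration for the taint expires (300 seconds by default), and their owning controllers create replacements on healthy nodes.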
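
The reconciliation idea shared by these controllers can be reduced to a few lines. The sketch below is a toy model that works on plain counters; real controllers react to watch events from the API server and create or delete actual pod objects.

```shell
#!/usr/bin/env bash
# Toy reconciliation loop: drive "actual" toward "desired" one pod at a time.
reconcile() {
  local desired="$1" actual="$2"
  while [ "$actual" -ne "$desired" ]; do
    if [ "$actual" -lt "$desired" ]; then
      actual=$(( actual + 1 ))
      echo "create pod (actual=${actual}, desired=${desired})"
    else
      actual=$(( actual - 1 ))
      echo "delete pod (actual=${actual}, desired=${desired})"
    fi
  done
  echo "in sync (actual=${actual}, desired=${desired})"
}

# One of three replicas crashed: the loop creates a replacement.
reconcile 3 2
```

The same loop handles scale-down: reconcile 2 4 deletes two pods before reporting that the states match.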

Worker Node Components (Deep Dive)

kubelet — The Node Agent

The kubelet is a process that runs on every worker node. It is the interface between the API server and the container runtime (containerd, CRI-O, or Docker Engine through the cri-dockerd adapter). Its job is to ensure that the containers described in PodSpecs are running and healthy.

The kubelet watches for pods assigned to its node, pulls container images, starts containers using the container runtime, runs liveness and readiness probes, reports container status back to the API server, and manages container logs. If a container fails a liveness probe, the kubelet restarts it. If a container fails a readiness probe (for example, while it is still starting up), the kubelet reports the pod as not ready, and the pod is removed from Service endpoints until the probe passes.

kube-proxy — The Network Magician

kube-proxy runs on every node and implements the Kubernetes Service concept. When you create a Service, it gets a stable virtual IP address (ClusterIP). kube-proxy watches the API server for Service and Endpoint changes, then programs iptables (or IPVS) rules on the node to forward traffic destined for that ClusterIP to one of the backend pods.

Modern clusters increasingly use IPVS mode (IP Virtual Server) instead of iptables because IPVS provides better performance at scale (thousands of services) using hash tables instead of linear iptables rule chains.

Container Runtime — The Execution Layer

The container runtime is what actually creates and runs containers. Kubernetes communicates with it through the Container Runtime Interface (CRI). The dominant runtime today is containerd. Kubernetes deprecated dockershim, its built-in adapter for Docker Engine, in 1.20 and removed it in 1.24; containerd is the lower-level runtime Docker itself uses, so images built with Docker still run unchanged. CRI-O is another option, popular with OpenShift.

Cluster DNS — CoreDNS

Nearly every Kubernetes cluster runs CoreDNS as its internal DNS server. When you create a Service named backend in namespace production, CoreDNS serves a DNS entry for it: backend.production.svc.cluster.local. Any pod in the cluster can resolve this name to the service’s ClusterIP, enabling service discovery without hard-coding IP addresses.
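
The naming scheme is mechanical, as this small sketch shows. The helper name is hypothetical, and cluster.local is assumed as the cluster domain because it is the usual default; clusters can be configured with a different one.

```shell
#!/usr/bin/env bash
# Build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default: default), cluster domain (default: cluster.local).
service_fqdn() {
  local service="$1" namespace="${2:-default}" domain="${3:-cluster.local}"
  echo "${service}.${namespace}.svc.${domain}"
}

service_fqdn backend production   # prints backend.production.svc.cluster.local
```

Pods in the same namespace can usually use the short name (backend) because the DNS search path in the pod's resolv.conf fills in the rest.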

What Happens When You Run kubectl apply?

Let us trace the complete lifecycle of a pod creation, because this sequence will help you debug any scheduling or startup issue:

1. You run kubectl apply -f pod.yaml. kubectl reads the file and sends an HTTP request (a POST for a new object) to the kube-apiserver.
2. The API server authenticates your request (via certificate, token, or service account) and checks that you are authorized to create pods. It then runs admission controllers (e.g., LimitRanger injects default resource limits).
3. The API server validates the object schema and writes the pod object to etcd with status Pending.
4. The API server notifies watchers. The kube-scheduler is watching for unscheduled pods. It picks up the new pod.
5. The scheduler runs its filter/score algorithm and picks a node. It sends the assignment to the API server, which records the node name on the pod object in etcd.
6. The kubelet on the chosen node is watching for pods assigned to it. It sees the new pod.
7. The kubelet calls the container runtime (containerd) via CRI to pull the image and create the container.
8. The container starts. The kubelet updates the pod status to Running via the API server.
9. Once the pod is Ready, it is added to the endpoints of any Service whose selector matches it, and kube-proxy on each node updates its iptables rules.
10. CoreDNS serves records for Services rather than for individual pods, so DNS changes only if the pod backs a headless Service. Your pod is now live and reachable.
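
To watch this sequence on a real cluster, a helper along the following lines can be useful. It is a sketch rather than a standard tool: trace_pod is a made-up name, and it assumes kubectl is already configured for the cluster and that the pod name and namespace are supplied by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show where a pod landed and the events recorded for it.
trace_pod() {
  local pod="$1" namespace="${2:-default}"

  # Node chosen by the scheduler (empty while the pod is still unscheduled).
  echo "node: $(kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.nodeName}')"

  # Current phase as reported through the API server (Pending, Running, ...).
  echo "phase: $(kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.status.phase}')"

  # Events for this pod (Scheduled, Pulling, Pulled, Created, Started), oldest first.
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=${pod}" \
    --sort-by=.metadata.creationTimestamp
}

# Usage (against a running cluster): trace_pod my-pod default
```

The order of the events mirrors the lifecycle: the scheduler's Scheduled event comes first, followed by the kubelet's image pull and container start events.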

Chapter 2 — Project 1

Cluster Inspection: Understanding Your Infrastructure

Scenario: You just joined FinTech Corp as a platform engineer. On your first day, you need to audit the existing Kubernetes cluster, document all components, and verify everything is healthy. This is a real task every DevOps engineer does when joining a team.

# 1. View the cluster information and identify control plane endpoint
kubectl cluster-info
# Kubernetes control plane is running at https://127.0.0.1:8443
# CoreDNS is running at https://127.0.0.1:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

# 2. List all nodes and their roles
kubectl get nodes -o wide
# NAME       STATUS   ROLES           AGE   VERSION   INTERNAL-IP   OS-IMAGE
# minikube   Ready    control-plane   10d   v1.28.0   192.168.49.2  Ubuntu 22.04

# 3. Inspect a specific node in full detail
kubectl describe node minikube
# Shows: CPU, memory capacity/allocatable, conditions, system info
# Look for: MemoryPressure=False, DiskPressure=False, PIDPressure=False, Ready=True

# 4. View the control plane components (static pods) and cluster add-ons (CoreDNS, kube-proxy)
kubectl get pods -n kube-system
# kube-apiserver-minikube
# etcd-minikube
# kube-controller-manager-minikube
# kube-scheduler-minikube
# coredns-xxx
# kube-proxy-xxx

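# 5. Check the health of the control plane components
# (componentstatuses is deprecated since v1.19 and may report stale data;
#  kubectl get --raw='/readyz?verbose' queries the API server health checks directly)
kubectl get componentstatuses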
# NAME                 STATUS    MESSAGE   ERROR
# scheduler            Healthy   ok
# controller-manager   Healthy   ok
# etcd-0               Healthy   {"health":"true","reason":""}

# 6. Back up etcd (CRITICAL in production)
# etcdctl runs inside the etcd pod and uses the certificates mounted there
kubectl exec -n kube-system etcd-minikube -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/minikube/certs/etcd/ca.crt \
  --cert=/var/lib/minikube/certs/etcd/server.crt \
  --key=/var/lib/minikube/certs/etcd/server.key \
  snapshot save /tmp/etcd-backup.db
# Snapshot saved at /tmp/etcd-backup.db
# Note: /tmp here is inside the etcd container; copy the file out or save to
# a host-mounted path before counting it as a backup

# 7. Check resource allocation across the node
kubectl describe node minikube | grep -A 6 "Allocated resources"
# Shows how much CPU/memory is requested by running pods
# This tells you how much headroom you have before the node is "full"

This audit tells you the cluster health, available capacity, Kubernetes version (important for CVE patching), and whether etcd is backed up. In production, run this audit weekly and store etcd snapshots to S3.

Errors and Troubleshooting — Chapter 2

Problem: Node shows NotReady status

Root Cause: The kubelet on that node has stopped communicating with the API server. Common reasons: kubelet service crashed, out of disk space, network partition, or the node itself is down.

Fix: SSH into the node and check: systemctl status kubelet. If kubelet is stopped: systemctl restart kubelet. Check disk: df -h. Check kubelet logs: journalctl -u kubelet -f.

Problem: etcd leader election keeps cycling, cluster is unstable

Root Cause: Usually caused by slow disk I/O (etcd requires fast disk for its WAL writes), network latency between etcd nodes, or a misconfigured heartbeat timeout.

Fix: Move etcd to SSD storage. Tune --heartbeat-interval and --election-timeout in the etcd configuration. Ensure all etcd nodes are in the same availability zone or have low latency between them.
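
As a starting point for the tuning step, the sketch below derives the two flag values from a measured round-trip time between etcd members. The rule it encodes is an assumption stated for illustration: heartbeat roughly equal to the peer round-trip time but never below etcd's 100 ms default, and an election timeout ten times the heartbeat. Treat the output as a first guess to validate under load, not a definitive setting.

```shell
#!/usr/bin/env bash
# Rule-of-thumb etcd timing flags from a measured peer round-trip time (ms).
# Assumption: heartbeat ~ RTT (never below the 100 ms default),
# election timeout = 10 x heartbeat.
suggest_etcd_timeouts() {
  local rtt_ms="$1"
  local heartbeat=$(( rtt_ms > 100 ? rtt_ms : 100 ))
  local election=$(( heartbeat * 10 ))
  echo "--heartbeat-interval=${heartbeat} --election-timeout=${election}"
}

suggest_etcd_timeouts 20    # same-zone peers: keeps the defaults
suggest_etcd_timeouts 150   # distant peers: scales both values up
```

Both flags take milliseconds, and every member of the cluster should be started with the same values.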

Chapter Three

Installation and Setup

There is no single way to run Kubernetes. The right installation method depends on your environment: laptop learning, dev/test clusters, or production. Let us cover all of them.

Option 1: Minikube (Local Development)

Minikube runs a complete single-node Kubernetes cluster in a VM or container on your local machine. It is the fastest way to learn Kubernetes without any cloud costs. Perfect for development and testing.

# Install Minikube on Linux (x86-64); on macOS use the minikube-darwin-* binary or Homebrew instead
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Install kubectl (the Kubernetes CLI)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Verify installations
kubectl version --client    # Prints the client version, e.g. v1.28.x
minikube version            # Prints the Minikube version, e.g. v1.31.x

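# Start a cluster with specific resource allocation
#   --cpus=4         allocate 4 CPUs to the cluster
#   --memory=8192    8GB RAM (value in MB)
#   --disk-size=30g  30GB disk
#   --driver=docker  run the node as a Docker container (most compatible)
# (a comment after a trailing backslash breaks the line continuation)
minikube start \
  --cpus=4 \
  --memory=8192 \
  --disk-size=30g \
  --driver=docker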

# Verify the cluster is up
kubectl get nodes
# NAME       STATUS   ROLES           AGE   VERSION
# minikube   Ready    control-plane   60s   v1.28.0

# Enable useful addons
minikube addons enable metrics-server    # CPU/memory metrics
minikube addons enable ingress           # Nginx Ingress Controller
minikube addons enable dashboard         # Web UI

# Open the Kubernetes Dashboard
minikube dashboard

Option 2: kubeadm (On-Premises Production Cluster)

kubeadm is the official tool for bootstrapping production Kubernetes clusters on bare metal or VMs. This is what you use in enterprise data centers, on-premises environments, or when you need full control over the cluster configuration.

Prerequisites: 2 VMs minimum (1 control plane, 1 worker). Ubuntu 22.04 LTS. 2 CPUs, 2GB RAM per node. Unique hostname, MAC address, and product_uuid for each node. Swap disabled (required by Kubernetes). Ports on the control plane: 6443 (API server), 2379-2380 (etcd), 10250 (kubelet), 10259 (kube-scheduler), 10257 (kube-controller-manager). Ports on workers: 10250 (kubelet) and 30000-32767 (NodePort Services).

# ============================================
# RUN THESE COMMANDS ON ALL NODES (control plane and workers)
# ============================================

# 1. Disable swap (Kubernetes requires swap to be off)
sudo swapoff -a
# Make it permanent - comment out the swap line in /etc/fstab
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# 2. Load required kernel modules
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# 3. Set required sysctl parameters
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system

# 4. Install containerd (container runtime)
sudo apt-get update
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
# Enable SystemdCgroup (required for Kubernetes)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
sudo systemctl enable containerd

# 5. Install kubeadm, kubelet, kubectl
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | \
  sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | \
  sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet=1.28.0-1.1 kubeadm=1.28.0-1.1 kubectl=1.28.0-1.1
# Pin the version to prevent accidental upgrades
sudo apt-mark hold kubelet kubeadm kubectl

# ============================================
# RUN ONLY ON THE CONTROL PLANE NODE
# ============================================

# 6. Initialize the cluster
#   --pod-network-cidr             CIDR for pod IPs (10.244.0.0/16 is what Flannel expects)
#   --apiserver-advertise-address  the control plane node's IP
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --apiserver-advertise-address=10.0.0.10 \
  --kubernetes-version=1.28.0

# 7. Configure kubectl for your user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# 8. Install the CNI plugin (Flannel for networking between pods)
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

# 9. Get the join command for worker nodes
kubeadm token create --print-join-command
# Output: kubeadm join 10.0.0.10:6443 --token abc123... --discovery-token-ca-cert-hash sha256:...

# ============================================
# RUN ONLY ON WORKER NODES
# ============================================

# 10. Join the cluster (use the command from step 9)
sudo kubeadm join 10.0.0.10:6443 \
  --token abc123.xyz789 \
  --discovery-token-ca-cert-hash sha256:aabbcc...

# ============================================
# BACK ON CONTROL PLANE - VERIFY
# ============================================
kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# control-plane  Ready    control-plane   5m    v1.28.0
# worker-1       Ready    <none>          2m    v1.28.0
# worker-2       Ready    <none>          90s   v1.28.0

kubectl Essentials: Your Daily Commands

# Context management (switching between clusters)
kubectl config get-contexts                    # List all clusters you have access to
kubectl config use-context production-cluster  # Switch to production cluster
kubectl config current-context                 # Show which cluster you are currently using

# Namespace management (logical separation within a cluster)
kubectl create namespace staging               # Create a namespace
kubectl get namespaces                         # List all namespaces
kubectl config set-context --current --namespace=staging  # Set default namespace

# Resource inspection
kubectl get all -n production                  # All resources in a namespace
kubectl get pods --all-namespaces              # Pods across every namespace
kubectl get pod my-pod -o yaml                 # Full YAML definition of a pod
kubectl get pod my-pod -o json | jq .status   # Get just the status as JSON

# Detailed inspection
kubectl describe pod my-pod                    # Human-readable detail including events
kubectl describe node worker-1                 # Node capacity, conditions, pods running

# Logs
kubectl logs my-pod                            # Current logs
kubectl logs my-pod --previous                 # Logs from the previous (crashed) container
kubectl logs my-pod -f                         # Follow logs in real time (like tail -f)
kubectl logs my-pod -c sidecar-container       # Logs from a specific container in a multi-container pod

# Exec into a pod (for debugging)
kubectl exec -it my-pod -- /bin/bash           # Open interactive shell
kubectl exec my-pod -- env                     # Run a command without interactive shell
kubectl exec -it my-pod -c main-container -- sh  # Exec into specific container

# Apply/delete resources
kubectl apply -f deployment.yaml               # Create or update (declarative)
kubectl delete -f deployment.yaml              # Delete by YAML
kubectl delete pod my-pod --grace-period=0 --force  # Force delete a stuck pod

# Scaling
kubectl scale deployment my-app --replicas=10  # Scale to 10 pods

# Rolling restart (force new pods without changing the spec)
kubectl rollout restart deployment my-app

Chapter 3 — Project 1

Multi-Node Cluster Setup with Namespaces, Quotas, and RBAC

Scenario: You are setting up a development cluster for a 20-person engineering team at RetailMax Inc. They need a cluster with 1 control plane and 3 worker nodes, proper namespaces for each team (backend, frontend, data), resource quotas per namespace, and RBAC so developers can only access their own namespace.

# After the cluster is set up with kubeadm (3 workers joined)...

# 1. Create namespaces for each team
kubectl create namespace backend-team
kubectl create namespace frontend-team
kubectl create namespace data-team

# 2. Apply resource quotas to prevent one team from consuming all resources
cat > backend-quota.yaml <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: backend-team
spec:
  hard:
    requests.cpu: "4"         # Max 4 CPU cores requested by all pods
    requests.memory: 8Gi      # Max 8GB RAM requested
    limits.cpu: "8"           # Max 8 CPU cores limited
    limits.memory: 16Gi       # Max 16GB RAM limited
    pods: "20"                # Max 20 pods in this namespace
    persistentvolumeclaims: "10"  # Max 10 PVCs
EOF
kubectl apply -f backend-quota.yaml

# 3. Create a LimitRange (default resource limits for pods that don't specify them)
cat > backend-limitrange.yaml <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: backend-limitrange
  namespace: backend-team
spec:
  limits:
  - type: Container
    defaultRequest:           # Applied when pod doesn't specify requests
      cpu: "100m"
      memory: "128Mi"
    default:                  # Applied when pod doesn't specify limits
      cpu: "500m"
      memory: "512Mi"
    max:                      # No single container can exceed these
      cpu: "2"
      memory: "4Gi"
EOF
kubectl apply -f backend-limitrange.yaml

# 4. Create a ServiceAccount for the backend team
kubectl create serviceaccount backend-developer -n backend-team

# 5. Create a Role (namespace-scoped permissions)
cat > backend-role.yaml <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: backend-team
  name: developer-role
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "services", "deployments", "jobs", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
  # No "delete" verb and no "secrets" resource: developers cannot delete objects or access secrets in this namespace
EOF
kubectl apply -f backend-role.yaml

# 6. Bind the role to the service account
kubectl create rolebinding backend-developer-binding \
  --role=developer-role \
  --serviceaccount=backend-team:backend-developer \
  -n backend-team

# 7. Verify the setup
kubectl auth can-i create pods --as=system:serviceaccount:backend-team:backend-developer -n backend-team
# yes
kubectl auth can-i delete secrets --as=system:serviceaccount:backend-team:backend-developer -n backend-team
# no
kubectl get resourcequota -n backend-team
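
To see how the ResourceQuota and the LimitRange defaults interact, the arithmetic can be sketched in shell. The helper names `to_millicores` and `max_pods_by_cpu` are hypothetical and exist only for this illustration; the numbers come from the manifests above.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity ("500m" or "2") to millicores.
# Assumption: only whole cores and the "m" suffix are handled (no decimals such as "0.5").
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# How many containers fit under a CPU request quota
# if each one falls back to the LimitRange default request?
max_pods_by_cpu() {
  quota=$(to_millicores "$1")
  per_pod=$(to_millicores "$2")
  echo $(( quota / per_pod ))
}

max_pods_by_cpu 4 100m   # prints 40
```

With these values the CPU request quota would allow 40 single-container pods at the default request, so the `pods: "20"` limit is likely to be reached first.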

Common Installation Errors

Error: [ERROR Swap]: running with swap on is not supported

Fix: sudo swapoff -a and comment out the swap entry in /etc/fstab for persistence.

Error: Nodes remain in NotReady after joining — coredns pods stuck Pending

Fix: You forgot to install a CNI plugin. The cluster has no network layer. Run: kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

Error: The connection to the server localhost:8080 was refused

Fix: kubectl cannot find your cluster config. Run: mkdir -p $HOME/.kube && sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config && sudo chown $(id -u):$(id -g) $HOME/.kube/config

Chapter Four

Pods and Containers

The Pod is the fundamental building block of Kubernetes. Not the container — the pod. Understanding pods deeply, including multi-container patterns, lifecycle hooks, probes, and init containers, is the difference between a beginner and a production engineer.

What Is a Pod?

A Pod is one or more containers that share a network namespace (one IP address and one port space), the UTS and IPC namespaces, and any storage volumes declared in the pod. The containers always run on the same node and can reach each other via localhost.

Why group containers in a pod instead of running them independently? Because some workloads are tightly coupled. An application container and its log shipper sidecar need to share the application’s log files. An application and its service mesh proxy (Envoy/Istio) need to share the network stack. A pod provides the colocation and shared context for these patterns.

Critical Rule: You should almost never create a pod directly in production. A bare pod is not recreated if it is deleted or evicted, or if its node fails; the kubelet only restarts its containers in place on the same node. Always use a higher-level controller: Deployment (stateless apps), StatefulSet (databases), DaemonSet (per-node agents), Job (batch tasks). These controllers create and manage pods on your behalf.

Complete Pod Specification

# pod-full.yaml — A production-grade pod spec with all key fields explained
apiVersion: v1
kind: Pod
metadata:
  name: api-server-pod
  namespace: production
  labels:
    app: api-server
    version: "2.1"
    environment: production
  annotations:
    deployment-date: "2024-01-15"
    team: backend
spec:
  # Init containers run sequentially BEFORE the main containers start
  # If any init container fails, Kubernetes retries until it succeeds or the pod fails
  initContainers:
  - name: wait-for-db
    image: busybox:1.35
    command: ['sh', '-c', 'until nc -z postgres-service 5432; do echo waiting for database; sleep 2; done']
    # This init container loops until the database port is reachable
    # Only after this succeeds will the main api-server container start

  containers:
  - name: api-server
    image: mycompany/api:v2.1.0
    
    # Environment variables passed into the container
    env:
    - name: NODE_ENV
      value: "production"
    - name: DB_HOST
      value: "postgres-service"    # Uses Kubernetes service DNS
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials     # References a Kubernetes Secret
          key: password            # The key within the secret
    - name: APP_CONFIG_FILE
      valueFrom:
        configMapKeyRef:
          name: api-config         # References a ConfigMap
          key: config.json

    ports:
    - name: http
      containerPort: 3000
      protocol: TCP
    
    # Resource management: ALWAYS set these in production
    resources:
      requests:
        memory: "256Mi"     # Pod won't be scheduled on a node with less than 256Mi available
        cpu: "250m"         # Requests 0.25 CPU core
      limits:
        memory: "512Mi"     # OOM killer activates if this is exceeded
        cpu: "500m"         # Throttled (not killed) if CPU exceeds this
    
    # Liveness probe: "Is the container alive? Should Kubernetes restart it?"
    livenessProbe:
      httpGet:
        path: /health       # Must return HTTP 200
        port: 3000
      initialDelaySeconds: 30   # Wait 30s before first check (let the app start)
      periodSeconds: 10         # Check every 10 seconds
      failureThreshold: 3       # Restart after 3 consecutive failures
    
    # Readiness probe: "Is the container ready to receive traffic?"
    # A pod that fails readiness is removed from Service endpoints and receives no traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 2
    
    # Startup probe: special probe for apps with slow startup
    # Liveness/readiness probes are disabled until startup probe succeeds
    startupProbe:
      httpGet:
        path: /health
        port: 3000
      failureThreshold: 30    # Allow up to 30*10s = 5 minutes to start
      periodSeconds: 10
    
    # Lifecycle hooks
    lifecycle:
      postStart:               # Runs right after the container is created; may run concurrently with the entrypoint
        exec:
          command: ["/bin/sh", "-c", "echo 'Container started' >> /var/log/startup.log"]
      preStop:                 # Runs before container is terminated (graceful shutdown)
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]
          # Delay shutdown by 5 seconds so in-flight requests can finish;
          # Kubernetes sends SIGTERM to the container after this hook completes
    
    # Volume mounts
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
      readOnly: true
    - name: log-storage
      mountPath: /var/log/app

  # Sidecar container: runs alongside the main container
  - name: log-shipper
    image: fluentd:v1.16
    volumeMounts:
    - name: log-storage
      mountPath: /var/log/app   # Shares the same volume as the main container
    
  volumes:
  - name: config-volume
    configMap:
      name: api-config
  - name: log-storage
    emptyDir: {}                # Ephemeral storage shared between containers in the pod
  
  # Scheduling constraints
  nodeSelector:
    kubernetes.io/os: linux
    node-type: application      # Only schedule on nodes with this label
  
  # Graceful termination: Kubernetes waits this long before force-killing
  terminationGracePeriodSeconds: 60
  
  # Restart policy
  restartPolicy: Always         # Always (default), OnFailure (jobs), Never
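
The probe settings in the spec above translate into time windows that are easy to misjudge. The sketch below computes them in shell; `probe_window_seconds` is a hypothetical helper, and the result is an approximation that ignores `initialDelaySeconds` and probe timeouts.

```shell
#!/bin/sh
# probe_window_seconds FAILURE_THRESHOLD PERIOD_SECONDS
# Approximate time before Kubernetes acts on a failing probe:
# failureThreshold consecutive failures, one every periodSeconds.
probe_window_seconds() {
  echo $(( $1 * $2 ))
}

probe_window_seconds 30 10   # startup probe: prints 300 (5 minutes to start)
probe_window_seconds 3 10    # liveness probe: prints 30 (restart after about 30s of failures)
probe_window_seconds 2 5     # readiness probe: prints 10 (removed from traffic after about 10s)
```

If the application can take longer than the startup window to boot, raising `failureThreshold` on the startup probe is usually safer than inflating the liveness delay.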

Multi-Container Pod Patterns

Sidecar Pattern

A helper container that extends and enhances the main container. Examples: Fluentd log shipper reading application logs, Envoy proxy handling all inbound/outbound traffic for the main app (Istio service mesh), a git-sync container pulling config updates from a repo.

Ambassador Pattern

A proxy container that abstracts network connectivity for the main container. The main app connects to localhost, and the ambassador handles connecting to the real external service. Used for database connection pooling (PgBouncer as ambassador to PostgreSQL), or abstracting legacy service endpoints.

Adapter Pattern

Transforms the main container’s output to match external expectations. Classic example: your legacy app exposes metrics in a proprietary format, but Prometheus expects the OpenMetrics format. An adapter sidecar translates between them.

Init Container Pattern

Specialized containers that run and complete before the main containers start. Use cases: wait for a database to be ready, migrate the database schema, clone a git repo with configuration, register the service with a service directory. Each init container must exit 0 for the next to run.
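
The "wait for a dependency" use case is often written as an endless loop. A bounded variant, sketched below, lets the init container exit non-zero after a fixed number of attempts so the failure shows up in the pod status. The function name `retry_until` is hypothetical, and the attempt count and delay are illustrative.

```shell
#!/bin/sh
# retry_until MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max_attempts failed: $*" >&2
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example for an init container (service name and port as used earlier in this chapter):
# retry_until 60 2 nc -z postgres-service 5432
```

Because a failed init container is retried by Kubernetes anyway, the bound mainly serves to make the failure visible in events and restart counts.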

Chapter 4 — Project 1

NodeJS API with Fluentd Sidecar and PostgreSQL Init Container

Scenario: LogiTech Shipping needs a production-ready API pod that: (1) waits for PostgreSQL to be available before starting, (2) ships logs to Elasticsearch via Fluentd sidecar, (3) has proper health checks so rolling deployments never send traffic to unready pods.

# logitech-api-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: shipping-api
  namespace: production
  labels:
    app: shipping-api
    tier: backend
spec:
  initContainers:
  - name: wait-for-postgres
    image: postgres:15-alpine
    command:
    - /bin/sh
    - -c
    - |
      until pg_isready -h postgres-service -U appuser -d shippingdb; do
        echo "$(date) - waiting for PostgreSQL to be ready..."
        sleep 3
      done
      echo "PostgreSQL is ready - starting API server"
    env:
    - name: PGPASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: password

  containers:
  - name: shipping-api
    image: logitech/shipping-api:v3.2.1
    ports:
    - containerPort: 8080
    env:
    - name: DB_HOST
      value: postgres-service
    - name: DB_PORT
      value: "5432"
    - name: DB_NAME
      value: shippingdb
    - name: DB_USER
      value: appuser
    - name: DB_PASS
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: password
    resources:
      requests: { memory: "256Mi", cpu: "200m" }
      limits:   { memory: "512Mi", cpu: "500m" }
    livenessProbe:
      httpGet: { path: /health, port: 8080 }
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }
      initialDelaySeconds: 15
      periodSeconds: 5
    volumeMounts:
    - name: app-logs
      mountPath: /app/logs
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]  # Wait for load balancer to drain

  - name: fluentd-sidecar
    image: fluent/fluentd:v1.16-1
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
    - name: fluentd-config
      mountPath: /fluentd/etc
    resources:
      requests: { memory: "64Mi", cpu: "50m" }
      limits:   { memory: "128Mi", cpu: "100m" }

  volumes:
  - name: app-logs
    emptyDir: {}
  - name: fluentd-config
    configMap:
      name: fluentd-config
  terminationGracePeriodSeconds: 30

# Deploy and validate
kubectl apply -f logitech-api-pod.yaml

# Watch init container complete, then main containers start
kubectl get pod shipping-api -w
# NAME           READY   STATUS            RESTARTS   AGE
# shipping-api   0/2     Init:0/1          0          5s
# shipping-api   0/2     PodInitializing   0          20s
# shipping-api   2/2     Running           0          35s

# Check logs from each container separately
kubectl logs shipping-api -c wait-for-postgres   # Init container logs
kubectl logs shipping-api -c shipping-api        # Main app logs
kubectl logs shipping-api -c fluentd-sidecar     # Sidecar logs

Pod Troubleshooting Guide

CrashLoopBackOff

Container crashes immediately after starting. Check: kubectl logs pod-name --previous to see what happened before the crash. Common causes: missing environment variable, wrong config, application bug, missing dependency.

OOMKilled (Exit Code 137)

Container exceeded its memory limit. Check actual memory usage with kubectl top pod pod-name. Either increase the memory limit or fix a memory leak in the application.

ImagePullBackOff

Kubernetes cannot pull the container image. Check: typo in image name, wrong tag, private registry without credentials. Fix: create an imagePullSecret and reference it in the pod spec, or verify the image name and tag exist.

Pod stays in Pending status

Scheduler cannot find a node. Check kubectl describe pod pod-name and look at the Events section. Common causes: insufficient CPU/memory on nodes, node selectors that don’t match any nodes, PersistentVolumeClaim not bound.
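
Exit codes such as 137 follow a shell convention: values above 128 mean the process was terminated by signal (code minus 128). The sketch below decodes them; `describe_exit_code` is a hypothetical helper, not a kubectl feature.

```shell
#!/bin/sh
# describe_exit_code CODE
# Codes above 128 conventionally mean "terminated by signal (CODE - 128)",
# for example 137 = 128 + 9 (SIGKILL) and 143 = 128 + 15 (SIGTERM).
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "application error (exit $code)"
  fi
}

describe_exit_code 137   # prints: killed by signal 9
describe_exit_code 143   # prints: killed by signal 15

# The last exit code of a container can be read with, for example:
# kubectl get pod pod-name -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```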

Interview Questions — Chapter 4

What is the difference between a liveness probe and a readiness probe? When would a pod fail one but pass the other?
Why do we use init containers instead of just putting startup logic in the main container’s entrypoint?
A pod is in CrashLoopBackOff. Walk me through your debugging process step by step.
What happens to a pod’s data when it restarts? How is this different from when the pod is rescheduled to a new node?
Explain the sidecar pattern with a real production use case.
What is a PodDisruptionBudget and when would you use it?
How do resource requests differ from resource limits? What happens when a container exceeds its CPU limit vs its memory limit?
What does the terminationGracePeriodSeconds field do and why is it important for zero-downtime deployments?

Chapter Five

Deployments and ReplicaSets

Deployments are the most used resource in Kubernetes. Every stateless application you run in production — APIs, web servers, workers — should be managed by a Deployment. Understanding deployments deeply, including rolling update strategies, rollback mechanics, and pod disruption budgets, is core to your job as a Kubernetes engineer.

ReplicaSet: Keeping the Right Number of Pods

A ReplicaSet ensures that a specified number of identical pods are running at all times. If you specify 5 replicas and one pod dies, the ReplicaSet controller sees the deficit (desired: 5, actual: 4) and creates a replacement. If a node dies with 2 pods on it, the controller creates 2 new pods on surviving nodes.

You rarely create ReplicaSets directly. You create a Deployment, which manages ReplicaSets on your behalf — creating new ones during updates and cleaning up old ones after successful rollouts.
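
The reconciliation described above (compare desired with actual, then act on the difference) can be sketched as a small shell function. This is a simplified illustration of the idea, not the controller's actual implementation; `reconcile` is a hypothetical name.

```shell
#!/bin/sh
# reconcile DESIRED ACTUAL
# Prints the action a ReplicaSet-style controller would take.
reconcile() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -gt 0 ]; then
    echo "create $diff"
  elif [ "$diff" -lt 0 ]; then
    echo "delete $(( 0 - diff ))"
  else
    echo "noop"
  fi
}

reconcile 5 4   # prints: create 1  (one pod died)
reconcile 5 3   # prints: create 2  (a node with 2 pods failed)
reconcile 5 5   # prints: noop
```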

Production Deployment Configuration

# production-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
    version: "4.0"
spec:
  replicas: 5                       # Run 5 pods for high availability
  
  selector:
    matchLabels:
      app: payment-service          # This deployment manages pods with this label
  
  # Deployment update strategy
  strategy:
    type: RollingUpdate             # Default. Replaces pods gradually.
    rollingUpdate:
      maxSurge: 2                   # Allow up to 7 pods (5+2) during update
      maxUnavailable: 1             # At most 1 pod can be down during update
      # This means: at least 4 pods are always serving traffic during the rollout
  
  # Minimum time a new pod must stay ready, without any container crashing, before it counts as available
  minReadySeconds: 10
  
  # How long to wait before marking a deployment as failed
  progressDeadlineSeconds: 600     # 10 minutes
  
  # Keep history of old ReplicaSets for rollback capability
  revisionHistoryLimit: 10
  
  template:
    metadata:
      labels:
        app: payment-service
        version: "4.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      # Spread pods across different availability zones
      topologySpreadConstraints:
      - maxSkew: 1                          # Max difference in pod count between zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule    # Fail to schedule if constraint cannot be met
        labelSelector:
          matchLabels:
            app: payment-service
      
      # Anti-affinity: prefer not to place two pods on the same node (soft rule; the scheduler may still co-locate them)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["payment-service"]
              topologyKey: kubernetes.io/hostname
      
      containers:
      - name: payment-service
        image: mycompany/payment:v4.0.1
        ports:
        - containerPort: 8080
        resources:
          requests: { memory: "512Mi", cpu: "500m" }
          limits:   { memory: "1Gi",   cpu: "1000m" }
        livenessProbe:
          httpGet: { path: /health, port: 8080 }
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet: { path: /ready, port: 8080 }
          initialDelaySeconds: 15
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]  # Drain connections before termination
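
The pod-count bounds implied by `maxSurge` and `maxUnavailable` can be checked with a little arithmetic. The sketch below assumes the documented rounding for percentage values (maxSurge rounds up, maxUnavailable rounds down); the function names are hypothetical.

```shell
#!/bin/sh
# resolve VALUE REPLICAS MODE
# VALUE is an absolute number ("2") or a percentage ("25%").
# MODE "up" rounds percentages up (maxSurge); "down" rounds down (maxUnavailable).
resolve() {
  value=$1
  replicas=$2
  mode=$3
  case "$value" in
    *%)
      pct=${value%"%"}
      if [ "$mode" = "up" ]; then
        echo $(( (replicas * pct + 99) / 100 ))
      else
        echo $(( replicas * pct / 100 ))
      fi
      ;;
    *) echo "$value" ;;
  esac
}

# rollout_bounds REPLICAS MAX_SURGE MAX_UNAVAILABLE
# Prints "<max total pods> <min available pods>" during a rolling update.
rollout_bounds() {
  surge=$(resolve "$2" "$1" up)
  unavailable=$(resolve "$3" "$1" down)
  echo "$(( $1 + surge )) $(( $1 - unavailable ))"
}

rollout_bounds 5 2 1        # prints: 7 4   (the deployment above)
rollout_bounds 10 25% 25%   # prints: 13 8  (Kubernetes defaults on 10 replicas)
```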

Rolling Updates and Rollbacks: Complete Workflow

# Deploy the initial version
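kubectl apply -f production-deployment.yaml --record
# --record stores the command in the kubernetes.io/change-cause annotation, which rollout history displays.
# The flag is deprecated in newer kubectl releases; the annotation can also be set directly:
#   kubectl annotate deployment/payment-service -n production kubernetes.io/change-cause="initial deploy v4.0.1"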

# Verify the deployment
kubectl rollout status deployment/payment-service -n production
# Waiting for deployment "payment-service" rollout to finish: 0 of 5 updated replicas are available...
# deployment "payment-service" successfully rolled out

# Update the image to a new version (triggers rolling update)
kubectl set image deployment/payment-service \
  payment-service=mycompany/payment:v4.1.0 \
  -n production
# deployment.apps/payment-service image updated

# Watch the rolling update happen in real time
kubectl rollout status deployment/payment-service -n production -w
# Waiting for deployment "payment-service" rollout to finish: 1 out of 5 new replicas have been updated...
# Waiting for deployment "payment-service" rollout to finish: 2 out of 5 new replicas have been updated...
# ...
# deployment "payment-service" successfully rolled out

# View rollout history
kubectl rollout history deployment/payment-service -n production
# REVISION  CHANGE-CAUSE
# 1         kubectl apply --filename=production-deployment.yaml --record=true
# 2         kubectl set image deployment/payment-service payment-service=mycompany/payment:v4.1.0

# View details of a specific revision
kubectl rollout history deployment/payment-service --revision=2 -n production

# If the new version has a bug, ROLLBACK immediately
kubectl rollout undo deployment/payment-service -n production
# deployment.apps/payment-service rolled back

# Roll back to a specific revision
kubectl rollout undo deployment/payment-service --to-revision=1 -n production

# Pause a rollout (e.g., to check that the first few pods are healthy)
kubectl rollout pause deployment/payment-service -n production
# deployment.apps/payment-service paused

# Resume after verification
kubectl rollout resume deployment/payment-service -n production

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of pods in a deployment based on observed metrics. The metrics-server must be installed in your cluster for CPU-based autoscaling. For custom metrics (requests per second, queue depth), you need Prometheus Adapter or KEDA.

# hpa.yaml — Horizontal Pod Autoscaler configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 5                    # Never scale below 5 pods
  maxReplicas: 50                   # Never scale above 50 pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60      # Scale out when avg CPU across pods > 60%
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 400Mi         # Scale out when avg memory > 400Mi
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Consider recommendations from the last 60s before scaling up (dampens short spikes)
      policies:
      - type: Pods
        value: 5                        # Add at most 5 pods at a time
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove at most 10% of pods at a time
        periodSeconds: 60

# Apply the HPA
kubectl apply -f hpa.yaml

# Watch the HPA in action
kubectl get hpa -n production -w
# NAME                    REFERENCE                     TARGETS   MINPODS   MAXPODS   REPLICAS
# payment-service-hpa     Deployment/payment-service    23%/60%   5         50        5

# Simulate load to trigger autoscaling
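# Run in the production namespace so the short name payment-service resolves
# (assumes a Service named payment-service exists there)
kubectl run load-generator --image=busybox --rm -it -n production -- \
  /bin/sh -c "while true; do wget -q -O- http://payment-service:8080/api/pay; done"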

# Watch pods scale up
kubectl get hpa -n production -w
# payment-service-hpa   Deployment/payment-service   78%/60%   5   50   5
# payment-service-hpa   Deployment/payment-service   81%/60%   5   50   8
# payment-service-hpa   Deployment/payment-service   65%/60%   5   50   13
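
The replica counts the HPA arrives at follow the formula from the Kubernetes documentation: desired = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured minimum and maximum. The sketch below is a simplification that ignores the controller's tolerance band, the stabilization windows and the scaling policies; `hpa_desired` is a hypothetical name.

```shell
#!/bin/sh
# hpa_desired CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC MIN MAX
# desired = ceil(CURRENT_REPLICAS * CURRENT_METRIC / TARGET_METRIC), clamped to [MIN, MAX].
hpa_desired() {
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  if [ "$desired" -lt "$4" ]; then
    desired=$4
  fi
  if [ "$desired" -gt "$5" ]; then
    desired=$5
  fi
  echo "$desired"
}

hpa_desired 5 78 60 5 50     # prints 7   (78% CPU against a 60% target)
hpa_desired 5 23 60 5 50     # prints 5   (below target, held at minReplicas)
hpa_desired 40 120 60 5 50   # prints 50  (capped at maxReplicas)
```

In a real cluster the `behavior` policies further limit how quickly these targets are reached, so observed replica counts may lag behind or differ from the raw formula.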

Chapter 5 — Project 1

Zero-Downtime E-Commerce API Deployment with Auto-Scaling

Scenario: ShopFast E-Commerce runs a product catalog API. During flash sales, traffic spikes 20x in minutes. They need: zero-downtime rolling deployments, auto-scaling from 3 to 100 pods, a PodDisruptionBudget to prevent all pods from being evicted at once during node maintenance, and the ability to instantly roll back bad releases.

# Full deployment stack for ShopFast

# 1. Deployment
cat > shopfast-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog-api
  namespace: production
  # deployment.kubernetes.io/revision is managed by the Deployment controller, so it is not set here
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0          # Never drop below the desired replica count during an update
  selector:
    matchLabels: { app: catalog-api }
  template:
    metadata:
      labels: { app: catalog-api, version: "1.0" }
    spec:
      containers:
      - name: catalog-api
        image: shopfast/catalog-api:v1.0.0
        resources:
          requests: { memory: "128Mi", cpu: "100m" }
          limits:   { memory: "256Mi", cpu: "500m" }
        readinessProbe:
          httpGet: { path: /health, port: 8080 }
          initialDelaySeconds: 10
          periodSeconds: 3
          successThreshold: 2     # Must pass 2 times in a row to become ready
        lifecycle:
          preStop:
            exec: { command: ["/bin/sh", "-c", "sleep 5"] }
EOF

# 2. PodDisruptionBudget - protects against voluntary disruptions
cat > shopfast-pdb.yaml <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: catalog-api-pdb
  namespace: production
spec:
  minAvailable: 2           # At least 2 pods must always be available
  selector:
    matchLabels: { app: catalog-api }
  # When a node is drained (maintenance/upgrade), Kubernetes will
  # evict pods one by one, but respect this budget
  # If you have 3 pods, it will only evict 1 at a time
EOF

# 3. HPA for auto-scaling
cat > shopfast-hpa.yaml <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
EOF

kubectl apply -f shopfast-deployment.yaml
kubectl apply -f shopfast-pdb.yaml
kubectl apply -f shopfast-hpa.yaml

# Deploy v2 with zero downtime
kubectl set image deployment/catalog-api \
  catalog-api=shopfast/catalog-api:v2.0.0 \
  -n production

# Monitor the rollout
kubectl rollout status deployment/catalog-api -n production

# Oops - v2 has a bug! Rollback immediately
kubectl rollout undo deployment/catalog-api -n production
# The rollback is itself a rolling update; with maxUnavailable: 0, ready pods keep serving while it proceeds
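
The effect of the PodDisruptionBudget in this project can be reasoned about with simple arithmetic: the number of voluntary evictions allowed at any moment is the count of healthy pods minus `minAvailable`, never below zero. A minimal sketch, with `pdb_allowed_disruptions` as a hypothetical helper name:

```shell
#!/bin/sh
# pdb_allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Number of pods that may be voluntarily evicted right now.
pdb_allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

pdb_allowed_disruptions 3 2   # prints 1 (one pod can be evicted at a time)
pdb_allowed_disruptions 2 2   # prints 0 (a node drain blocks until another pod is ready)
```

The live value is reported in the ALLOWED DISRUPTIONS column of `kubectl get pdb -n production`.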

Interview Questions — Chapter 5

What is the relationship between a Deployment, ReplicaSet, and Pod? Draw the hierarchy.
During a rolling update with maxSurge=1 and maxUnavailable=1, what is the minimum number of pods serving traffic if you have 5 replicas?
A deployment rollout is stuck at 3/5 new pods. How do you diagnose and fix this?
What is the difference between kubectl set image and editing the deployment YAML? Which is better for production and why?
Explain how the HPA works internally — what does it check, how often, and what does it do?
What is a PodDisruptionBudget and when is it enforced?
How do topologySpreadConstraints differ from podAntiAffinity?
You have a Deployment with 10 pods. You change the node selector so no current nodes match. What happens?
What is revisionHistoryLimit and what is the cost of setting it too high or too low?

Sumit Sharma
