Self:
In my previous position as a DevOps/SRE engineer, my day-to-day work was a mix of infrastructure operations and deployments, audits and analysis of system performance and logs, root cause analysis, and coordinating migration tasks.
It also included Kubernetes cluster management: node scaling, resource limits, and troubleshooting pod failures and network issues.
Kubernetes:
CrashLoopBackOff - Container fails repeatedly on startup
🔹 Check logs: kubectl logs <pod>
🔹 Check readiness/liveness probes
🔹 Validate CMD/ENTRYPOINT in Dockerfile
ImagePullBackOff / ErrImagePull - Image not found or authentication failed
🔹 Check image name/tag
🔹 Verify registry credentials / secret
🔹 Run kubectl describe pod <pod>
RunContainerError - Container failed to start
🔹 Check Dockerfile CMD/ENTRYPOINT
🔹 Ensure image isn't corrupted
CreateContainerConfigError - Pod spec config issue
🔹 Check ConfigMap/Secret references
🔹 Run: kubectl describe pod <pod>
Pending - Pod can’t be scheduled
🔹 No matching node (insufficient resources for the pod's requests, taints, or node selectors)
🔹 Check with: kubectl describe pod <pod> and kubectl get nodes
NodeNotReady - Node is not available
🔹 Check kubectl get nodes
🔹 Restart kubelet or fix node issues
Back-off restarting failed container - App inside container is crashing
🔹 Check logs
🔹 Debug locally or update health checks
OOMKilled - Out of memory (container exceeded its memory limit)
🔹 Increase memory limit in YAML
🔹 Optimize application memory usage
Evicted - Pod killed due to low node resources
🔹 Add more nodes
🔹 Tweak pod resource limits/requests
Unauthorized / Forbidden - RBAC issues
🔹 Review RBAC roles and bindings
🔹 Check service account permissions
dns: service not found - DNS resolution failed
🔹 Check Service/Pod name
🔹 Check CoreDNS status: kubectl get pods -n kube-system
no matches for kind - Typo or wrong API version in YAML
🔹 Use kubectl api-resources or kubectl explain <kind>
timeout connecting to service - Network or port issue
🔹 Validate service, selector, targetPort
🔹 Use kubectl port-forward to test
Secret not found - Missing secret in namespace
🔹 Run kubectl get secrets
🔹 Check namespace with -n <namespace>
Error from server (NotFound) - Resource doesn't exist
🔹 Double-check the name and namespace
invalid character 'i' looking for beginning of value - Bad YAML or JSON input
🔹 Validate YAML syntax (e.g. with yamllint)
kubectl describe pod <pod> - Deep inspection of pod issues
kubectl logs <pod> - View logs from containers
kubectl get events --sort-by='.metadata.creationTimestamp' - View cluster events
kubectl top pod - View resource usage
kubectl port-forward - Access services locally
kubectl get all -n <namespace> - Overview of resources in namespace
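The status-to-first-check pairs above can be folded into a small lookup helper. This is a minimal sketch, not a kubectl feature: the function name first_check and the pod name web-0 are hypothetical, and the helper only prints a command taken from these notes; it never contacts a cluster.

```shell
#!/usr/bin/env bash
# Map a pod status/reason to the first check suggested in the notes above.
# first_check is a hypothetical helper; it only prints a suggestion.
first_check() {
  local status="$1" pod="${2:-<pod>}"
  case "$status" in
    CrashLoopBackOff)
      echo "kubectl logs $pod" ;;
    ImagePullBackOff|ErrImagePull|RunContainerError|CreateContainerConfigError|Pending|Evicted)
      echo "kubectl describe pod $pod" ;;
    NodeNotReady)
      echo "kubectl get nodes" ;;
    OOMKilled)
      echo "kubectl top pod $pod" ;;
    *)
      # Unknown status: fall back to the cluster event stream.
      echo "kubectl get events --sort-by='.metadata.creationTimestamp'" ;;
  esac
}

first_check CrashLoopBackOff web-0   # prints: kubectl logs web-0
first_check Pending                  # prints: kubectl describe pod <pod>
```

Keeping the mapping in one case statement makes it easy to extend as new failure modes come up.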
Docker
1. Build and Run Containers
🔹 Task: docker build -t myapp .
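Error: failed to solve: failed to compute cache key: not found
Cause: A file referenced by COPY/ADD is missing from the build context (or excluded by .dockerignore), or the wrong context was passed
Fix: Make sure the referenced files exist in the build context directory.
If the Dockerfile lives elsewhere, use: docker build -f path/to/Dockerfile .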
🔹 Task: docker run -d -p 8080:80 myapp
Error: port is already allocated
Cause: Host port 8080 already in use
Fix: Use a different port: docker run -d -p 8081:80 myapp
Or find the process holding the port with sudo lsof -i :8080, then stop it
📦 2. Image & Container Management
🔹 Task: docker pull <image>
Error: repository does not exist or may require 'docker login'
Cause: Private repo or typo in name
Fix: Check image name spelling
Use: docker login if private registry
🔹 Task: docker rmi <image>
Error: image is being used by stopped container
Fix: Remove the container first: docker rm <container_id>
Then: docker rmi <image_id>
🛠️ 3. Working with Volumes and Mounts
🔹 Error:
Mounts denied: file not shared from host
Cause: macOS/Windows file sharing is restricted
Fix: Enable folder sharing in Docker Desktop → Settings → Resources → File Sharing
🔹 Error:
permission denied when writing inside mounted volume
Cause: Host file system permissions
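Fix: Correct ownership with chown (on the host directory for a bind mount, or in the Dockerfile for a path that backs a named volume)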
Or run container with same UID as host user
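Running the container with the host user's UID and GID, as suggested above, might look like the sketch below. --user and -v are standard docker run flags; the helper name run_as_host_user, the image name myapp, and the /data mount point are assumptions for illustration. The function only prints the command, so it can be inspected without a Docker daemon.

```shell
#!/usr/bin/env bash
# Build a `docker run` command that runs as the invoking host user, so files
# written to the bind mount are owned by that user on the host.
# run_as_host_user, myapp and /data are placeholders, not from the notes.
run_as_host_user() {
  local host_dir="$1" image="$2"
  local cmd=(docker run --rm
             --user "$(id -u):$(id -g)"
             -v "${host_dir}:/data"
             "$image")
  # Print instead of executing; replace this echo with "${cmd[@]}" to run it.
  echo "${cmd[*]}"
}

run_as_host_user "$PWD/data" myapp
```

Building the command as an array keeps each argument intact even when the host path contains spaces.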
📶 4. Network and Connectivity
🔹 Error: curl: (6) Could not resolve host
Cause: DNS resolution failed inside container
Fix: Restart Docker: systemctl restart docker
Check Docker’s DNS settings (in /etc/docker/daemon.json)
🔹 Error: connection refused
Cause: Target service in another container is not reachable
Fix: Ensure containers are on same network
Use Docker network aliases
🧼 5. Cleanup & Pruning
🔹 Task: docker system prune -a
Error: Error response from daemon: conflict: unable to remove repository
Fix: Remove dependent containers/images first
Use: docker container prune
docker image prune -a
🔍 6. Logs, Exec, and Debug
🔹 Task: docker logs <container>
Error: no such container
Fix: List containers: docker ps -a
Check name or ID is correct
🔹 Task: docker exec -it <container> bash
Error: exec failed: container_linux.go: no such file or directory
Cause: Image doesn’t have bash
Fix: Try: docker exec -it <container> sh
🧰 General Troubleshooting Commands
docker ps -a = List all containers
docker logs <id> = View container logs
docker inspect <id> = View detailed metadata
docker events = Monitor Docker daemon events
docker-compose logs = Check logs in multi-container setup
ANSIBLE
What is Ansible?
Agentless automation tool for configuration management, application deployment, and orchestration.
How is it different from Puppet or Chef?
No agents, simpler syntax (YAML), uses SSH, easier to set up.
What is idempotency?
Running the same playbook multiple times results in the same system state.
What is a handler?
A task triggered only when notified (e.g., restart a service after a change).
What are facts?
System information gathered by Ansible automatically (ansible_facts).
What is a role?
A way to structure playbooks into reusable components.
What are the Disadvantages of Ansible?
Stateless - it runs tasks to reach a desired state but does not inherently track resource state between runs.
No built-in rollback mechanism; a separate playbook has to be written for that.
Lack of a GUI.
Security weaknesses if Vault is not used properly; poor SSH key practices can also leave systems vulnerable.
Limited Windows support.
What are the advantages of Ansible?
Agentless
Idempotency
Modular and scalable
Easy integration with CI/CD
ansible all -m ping - Ping all hosts
ansible all -a "df -h" - Run an ad hoc command on all hosts (uses the command module by default)
ansible-playbook site.yml - Run playbook
ansible-playbook -i hosts.ini playbook.yml - Run playbook with custom inventory
ansible -i inventory web -m shell -a "uptime" - Run command on web group
ansible-inventory --list -i inventory.ini - View parsed inventory
ansible-doc -l - List all modules
ansible-vault - Encrypt/decrypt secrets (e.g. passwords, keys)
---
- name: sample yaml file to Install and configure Apache
  hosts: web
  become: yes
  tasks:
    - name: Install apache2
      apt:
        name: apache2
        state: present
      when: ansible_os_family == "Debian"
    - name: Copy custom config
      copy:
        src: apache.conf
        dest: /etc/apache2/sites-available/000-default.conf
      notify: Restart Apache
  handlers:
    - name: Restart Apache
      service:
        name: apache2
        state: restarted
TERRAFORM
What is Terraform?
An infrastructure as code (IaC) tool by HashiCorp used to provision and manage cloud resources declaratively. It was open source until the 2023 switch to the Business Source License.
What is a provider?
Plugin used to interact with APIs (e.g., AWS, Azure, Kubernetes).
What is terraform.tfstate?
File that holds the current state of the managed infrastructure.
What is the difference between terraform apply and terraform plan?
plan previews changes; apply executes them.
What is idempotency in Terraform?
Ensures applying the same code repeatedly does not change infrastructure unnecessarily.
How do you manage secrets?
Avoid hardcoding; use environment variables or an external secrets manager such as Vault.
Best Practice file structure
project/
├── main.tf            # Resource definitions
├── variables.tf       # Variable declarations
├── outputs.tf         # Output values
├── terraform.tfvars   # Variable values
├── backend.tf         # Remote backend config
└── modules/           # Reusable code modules
Sample Terraform Code
provider "aws" {
  region = "ap-south-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t2.micro"

  tags = {
    Name = "WebServer"
  }
}

output "instance_ip" {
  value = aws_instance.web.public_ip
}
SHELL SCRIPTING
What is a shell?
Interface to interact with the OS using commands (e.g., bash, zsh)
What’s the difference between bash and sh?
sh is the POSIX shell (often a link to dash or bash); bash is a superset with extras such as arrays, [[ ]] tests, and brace expansion
What is $??
Last command's exit status
What does #!/bin/bash mean?
Shebang — specifies script interpreter
What is && and `
Always quote variables: "$var" to avoid globbing or word splitting.
Use set -e at top of scripts to exit on error.
Use trap to handle script interrupts or clean up temporary files.
Test scripts with bash -x script.sh for step-by-step debugging.
Use cron with logs to debug scheduled script runs:
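The quoting, set -e, trap, and exit-status points above can be combined in one short script. A minimal sketch: the function name count_lines and the scratch file are illustrative only.

```shell
#!/usr/bin/env bash
set -e  # exit on the first failing command

# Create a scratch file and make sure it is removed however the script ends.
tmpfile="$(mktemp)"
trap 'rm -f "$tmpfile"' EXIT

# Quoting "$1" keeps a path containing spaces as a single argument.
count_lines() {
  wc -l < "$1" | tr -d ' '
}

printf 'one\ntwo\n' > "$tmpfile"
echo "lines: $(count_lines "$tmpfile")"   # prints: lines: 2

# $? holds the exit status of the last command. A failing command used as an
# `if` condition does not trigger set -e.
if grep -q 'three' "$tmpfile"; then
  echo "found"
else
  echo "grep exit status: $?"              # prints: grep exit status: 1
fi
```

Running it with bash -x shows each command as it executes, which makes the trap firing at exit easy to see.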
AWS
Common AWS CLI Commands
🔹 EC2
aws ec2 describe-instances
aws ec2 start-instances --instance-ids i-0123abcd
aws ec2 stop-instances --instance-ids i-0123abcd
🔹 S3
aws s3 ls
aws s3 cp file.txt s3://my-bucket/
aws s3 sync ./local-dir s3://my-bucket/ --delete
🔹 IAM
aws iam list-users
aws iam attach-user-policy --user-name dev --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
🔹 VPC & Networking
aws ec2 describe-vpcs
aws ec2 create-security-group --group-name my-sg --description "Allow SSH"
🔹 CloudWatch
aws logs describe-log-groups
aws logs get-log-events --log-group-name my-log-group --log-stream-name my-stream
🔹 Lambda
aws lambda list-functions
aws lambda invoke --function-name my-func output.json
EC2 - Instance types, EBS, key pairs, security groups
VPC - Subnets, route tables, internet/NAT gateways
S3 - Storage classes, lifecycle policies, versioning, encryption
IAM - Users, roles, policies, least privilege
CloudWatch - Monitoring, logs, metrics, alarms
Lambda - Triggers, memory/time limits, event sources
ELB & ASG - Load distribution and auto scaling logic
RDS & DynamoDB - Managed databases, backups, high availability
LINUX
How to patch a RHEL system?
yum update (RHEL 7)
dnf update (RHEL 8/9)
Use yum-cron / dnf-automatic for automation
How do you check what packages were updated recently?
yum history list
yum history info <ID>
dnf history list
How do you rollback a patch or update?
yum history undo <transaction-id>
dnf history rollback <id>
What tools would you use to manage kernel versions post-patching?
grubby to view and set default kernel
rpm -qa | grep kernel
uname -r to verify current kernel
What are the runlevels in RHEL and how do they map to systemd targets?
Runlevel 0 → poweroff.target
Runlevel 1 → rescue.target
Runlevels 2, 3, 4 → multi-user.target
Runlevel 5 → graphical.target
Runlevel 6 → reboot.target
How would you troubleshoot a server stuck at boot?
Interrupt GRUB at boot and select the rescue kernel
Append rd.break to the kernel command line to get an emergency shell in the initramfs
Check /var/log/messages or journalctl
Validate /etc/fstab and check for a corrupted initramfs
How do you regenerate initramfs?
dracut -f
How do you analyze system performance?
top, htop, vmstat, iostat, sar, pidstat
journalctl -xe for logs
systemctl status checks
How to identify and fix high memory usage?
free -m, top, smem, ps aux --sort=-%mem
Clear caches: sync; echo 3 > /proc/sys/vm/drop_caches (Use with caution)
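As a rough complement to free -m, the used-memory percentage can be computed from MemTotal and MemAvailable in /proc/meminfo. This is a sketch that assumes a Linux kernel exposing MemAvailable (3.14 or later); the function name mem_used_pct is hypothetical, and the optional file argument exists only so the calculation can be tried on sample data.

```shell
#!/usr/bin/env bash
# Print used memory as an integer percentage:
#   (MemTotal - MemAvailable) * 100 / MemTotal
# mem_used_pct is a hypothetical helper; $1 defaults to /proc/meminfo,
# and "-" makes awk read from standard input.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
      "${1:-/proc/meminfo}"
}

# Live value, when running on Linux.
if [ -r /proc/meminfo ]; then
  mem_used_pct
fi

# Sample data: 1000 kB total, 250 kB available.
printf 'MemTotal: 1000 kB\nMemAvailable: 250 kB\n' | mem_used_pct -   # prints: 75
```

MemAvailable already accounts for reclaimable page cache, so this figure is usually a better signal than "used" minus buffers when deciding whether dropping caches would help at all.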
After patching, the server is not reachable — what are your steps?
Access via console/IPMI/iLO
Boot into rescue mode or previous kernel
Check /boot/grub2/grub.cfg and initramfs
Validate network configs or firewall blocks
Rollback if required