What is Shell Scripting
Shell scripting is basically writing a series of commands in a text file that your computer's terminal can run automatically. Think of it like creating a recipe where, instead of cooking steps, you're telling your computer what tasks to do in order. It's super handy for automating repetitive stuff like backing up files, managing system tasks, or running multiple programs with just one command. Most people use Bash shell on Linux/Mac, but there's also PowerShell on Windows - either way, it saves tons of time once you get the hang of it.
How to prepare for a Shell Scripting Interview
Get your basics rock solid - Make sure you're comfortable with fundamental commands like ls, grep, sed, awk, and pipes. Most interview questions for shell scripting start with these building blocks, so practice combining them in different ways.
Practice writing actual scripts - Don't just memorize syntax. Write scripts for real tasks like file backups, log parsing, or system monitoring. Interviewers love seeing practical problem-solving skills.
Master the common shell scripting interview questions - You'll definitely get asked about variables, loops, conditionals, and functions. Practice explaining the difference between $@ and $* (see the short example after this list), when to use double vs single quotes, and how exit codes work.
Debug like crazy - Learn to use set -x for debugging and understand error handling with trap commands. Being able to troubleshoot broken scripts on the spot is a huge plus.
Know your shell differences - Understand what makes bash different from sh, ksh, or zsh. Some companies are picky about POSIX compliance.
Study real-world scenarios - Look up common automation tasks like monitoring disk space, rotating logs, or deploying applications. These make great discussion points during interviews.
Practice explaining your code - The best script in the world won't help if you can't walk through your logic clearly. Practice talking through your problem-solving approach out loud.
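For instance, the $@ versus $* difference mentioned above is easiest to see with a tiny throwaway script (show_args.sh is just an illustrative name):
bash
#!/bin/bash
echo 'With "$@" (each argument stays separate):'
for arg in "$@"; do
    echo "  [$arg]"
done
echo 'With "$*" (arguments joined into one word):'
for arg in "$*"; do
    echo "  [$arg]"
done
Run it as ./show_args.sh "hello world" foo - the "$@" loop prints two items, while "$*" collapses everything into a single string.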
Shell Scripting Interview Questions and Answers For Beginners
Starting your journey with shell scripting interviews? Don't worry - most bash scripting interview questions for beginners focus on practical basics rather than complex theory. We'll walk through the most common bash shell interview questions that entry-level positions typically cover, helping you understand not just what to answer, but why things work the way they do. These questions come up constantly because they test whether you can actually write useful scripts, not just memorize commands.
Q1: What exactly is a shell script?
Think of it as a text file filled with commands you'd normally type one by one in the terminal. Instead of typing them manually every time, you save them in a file and run them all at once. It's like creating a macro for your terminal - super helpful when you need to do the same tasks repeatedly.
Q2: How do I create my first shell script?
Open any text editor (nano, vim, or even notepad), type your commands, and save it with a .sh extension. Here's the simplest example:
bash
#!/bin/bash
echo "My first script!"
Save it as myscript.sh, make it executable with chmod +x myscript.sh, then run with ./myscript.sh.
Q3: Why do some variables have $ and others don't?
You use $ when you want to get the value from a variable, but not when you're setting it. Setting: name="John" (no $). Using: echo "Hello $name" (with $). It's like the difference between writing on a nametag versus reading what's already written there.
Q4: What are those -f, -d, -e things I see in if statements?
These are test operators that check file conditions. -f checks if something is a regular file, -d checks for directories, -e checks if anything exists at that path. There are tons more: -r for readable, -w for writable, -x for executable. They're your script's way of looking before it leaps.
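Here's a small example combining a few of these checks (the path is just a placeholder - point it at anything on your system):
bash
#!/bin/bash
path="/etc/hosts"
if [ -e "$path" ]; then
    echo "$path exists"
    [ -f "$path" ] && echo "...and it's a regular file"
    [ -d "$path" ] && echo "...and it's a directory"
    [ -r "$path" ] && echo "...and it's readable"
else
    echo "$path doesn't exist"
fi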
Q5: How do I do basic math in bash?
Bash is quirky with math - you need special syntax. Use $((expression)) for arithmetic: result=$((5 + 3)). For decimals, bash can't handle them natively - you'd need bc: echo "5.5 + 2.3" | bc. Remember, spaces don't matter inside $(( )), which is unusual for bash.
Q6: What's this 2>&1 thing I keep seeing?
This redirects error messages to wherever regular output is going. The 2 represents stderr (error messages), 1 represents stdout (normal output), and & tells bash "I mean the file descriptor, not a file named 1". So command > output.txt 2>&1 sends both regular output and errors to output.txt.
Q7: How do I check if a command succeeded?
Every command returns an exit code - 0 for success, anything else for failure. Check it right after the command:
bash
cp file1 file2
if [ $? -eq 0 ]; then
echo "Copy worked!"
else
echo "Copy failed!"
fi
Q8: What's the deal with spaces in bash?
Bash is super picky about spaces, especially in conditions. if [ $a = $b ] needs spaces around the brackets and operators. But a=5 can't have spaces around the =. It's annoying at first, but you'll develop muscle memory for it pretty quickly.
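A quick sketch of the spacing rules (quoting the variables inside [ ] is a good habit too, since it protects against empty values):
bash
#!/bin/bash
a=5            # no spaces around = in an assignment
b="5"
if [ "$a" = "$b" ]; then   # spaces around the brackets and the =
    echo "a and b match"
fi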
Q9: How do I loop through a list of items?
The for loop is your friend here. For a simple list:
bash
for fruit in apple banana orange; do
echo "I like $fruit"
done
For files: for file in *.txt; do something; done. Note that when you loop over an unquoted variable, its value is split on whitespace (controlled by IFS), while filenames matched by a glob like *.txt are passed through whole, even if they contain spaces.
Q10: Can I use variables from outside my script?
Yes! These are environment variables. Your script can read system variables like $HOME, $USER, $PATH. To pass your own, either export them first (export MYVAR="value") or set them inline: MYVAR="value" ./myscript.sh. Just remember they're readable, not writable - changes inside the script won't affect the outside.
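A minimal sketch of both ways of passing a value in (greet.sh and MYVAR are just illustrative names):
bash
#!/bin/bash
# greet.sh - reads an environment variable, with a default if it's unset
echo "Hello ${MYVAR:-stranger}, your home directory is $HOME"
Call it as MYVAR="Ada" ./greet.sh, or export MYVAR="Ada" first and then run it - either way the script sees the value, but anything it changes disappears when it exits.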
Q11: What happens if my script crashes midway?
By default, bash keeps going even if commands fail. To make it stop on errors, add set -e near the top. Want to see what's happening? Add set -x for debug mode. Combine them as set -ex. Just remember these make your script stricter - sometimes you want to handle errors yourself instead.
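A short sketch of how those options change behaviour (this assumes missing_file.txt doesn't exist, so the copy fails):
bash
#!/bin/bash
set -e          # stop at the first failing command
set -x          # print each command before it runs
echo "about to copy"
cp missing_file.txt /tmp/   # fails, so the script stops right here
echo "this line never runs"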
Q12: How do I get input from users?
The read command is what you need. Basic usage: read username waits for input and stores it. Fancy it up with read -p "What's your name? " username for a prompt. Want a password? Use read -s to hide what they type. Set a timeout with read -t 10 to wait only 10 seconds.
Shell Scripting Coding Interview Questions and Answers For Beginners
Ready to tackle actual shell scripting coding interview questions? These hands-on problems test whether you can write working scripts, not just talk about concepts. We'll cover the classic coding challenges that beginners face in interviews - the ones where you need to write real bash code to solve practical problems. Each solution includes the full script with explanations, so you can understand the logic and adapt it for similar questions.
Q1: Write a script to check if a number is even or odd
This tests your understanding of arithmetic operations and conditionals:
bash
#!/bin/bash
echo "Enter a number: "
read num
# Check if the input is actually a number
if ! [[ "$num" =~ ^[0-9]+$ ]]; then
echo "Please enter a valid number"
exit 1
fi
# Use modulo operator to check even/odd
if [ $((num % 2)) -eq 0 ]; then
echo "$num is even"
else
echo "$num is odd"
fi
The key here is using the modulo operator (%) inside $(( )). If a number divided by 2 has no remainder, it's even. The regex check prevents the script from breaking on non-numeric input.
Q2: Create a script that counts files in a directory by extension
This problem tests file handling and building text-processing pipelines:
bash
#!/bin/bash
# Check if directory argument provided
if [ $# -eq 0 ]; then
echo "Usage: $0 <directory>"
exit 1
fi
directory="$1"
# Check if directory exists
if [ ! -d "$directory" ]; then
echo "Directory $directory doesn't exist"
exit 1
fi
# Count files by extension
echo "File count by extension in $directory:"
for file in "$directory"/*; do
# Skip if no files found
[ -e "$file" ] || continue
# Skip directories
[ -f "$file" ] || continue
# Extract extension from the filename (not the full path, so a dot in the directory name can't confuse the check)
filename=$(basename "$file")
extension="${filename##*.}"
# Handle files without extension
if [ "$extension" = "$filename" ]; then
extension="no_extension"
fi
# Count using associative array would be ideal, but
# for beginners, let's use a simpler approach
echo "$extension"
done | sort | uniq -c | sort -rn
This solution uses parameter expansion (${filename##*.}) to extract extensions and pipes the output through sort and uniq to count occurrences. It's a common pattern in shell scripting - using Unix tools together.
Q3: Write a script to backup files modified in the last 7 days
This tests your knowledge of the find command and date handling:
bash
#!/bin/bash
# Set source and backup directories
source_dir="${1:-.}" # Use current directory if not specified
backup_dir="backup_$(date +%Y%m%d)"
# Create backup directory
mkdir -p "$backup_dir"
echo "Starting backup of files modified in last 7 days..."
# Find and copy files modified in last 7 days
count=0
find "$source_dir" -type f -mtime -7 | while read -r file; do
# Skip if file is in backup directory
[[ "$file" == *"$backup_dir"* ]] && continue
# Create subdirectory structure in backup
rel_path="${file#$source_dir/}"
backup_path="$backup_dir/$rel_path"
backup_subdir="$(dirname "$backup_path")"
mkdir -p "$backup_subdir"
# Copy file preserving timestamps
if cp -p "$file" "$backup_path" 2>/dev/null; then
((count++))
echo "Backed up: $rel_path"
fi
done < <(find "$source_dir" -type f -mtime -7)
echo "Backup complete! $count files copied to $backup_dir"
The script uses find with -mtime -7 to locate recent files. Feeding the while read loop from process substitution (rather than piping find into it) handles filenames with spaces and keeps the count variable in the current shell, so the final total is accurate. We preserve the directory structure in the backup and use cp -p to keep timestamps.
Q4: Create a script that monitors disk usage and alerts when it exceeds 80%
This demonstrates system monitoring and parsing command output:
bash
#!/bin/bash
# Set threshold
threshold=80
echo "Checking disk usage..."
echo "========================"
# Get disk usage and process each line
df -h | grep -E '^/dev/' | while read -r line; do
# Parse the df output
device=$(echo "$line" | awk '{print $1}')
usage=$(echo "$line" | awk '{print $5}' | sed 's/%//')
mount=$(echo "$line" | awk '{print $6}')
# Check if usage exceeds threshold
if [ "$usage" -gt "$threshold" ]; then
echo "WARNING: $mount ($device) is at $usage% capacity!"
# You could add email notification here
# echo "Disk $mount is at $usage%" | mail -s "Disk Alert" admin@example.com
else
echo "OK: $mount ($device) is at $usage% capacity"
fi
done
# Also check if any partition is critically full (>90%)
critical=$(df -h | grep -E '^/dev/' | awk '{print $5}' | sed 's/%//' | sort -n | tail -1)
if [ "$critical" -gt 90 ]; then
echo ""
echo "CRITICAL: At least one partition is over 90% full!"
exit 1
fi
This script parses df output using awk to extract specific fields. The percentage sign is stripped with sed so we can do numeric comparison. Real-world scripts might send emails or write to log files.
Q5: Write a script that validates and processes a CSV file
This tests file parsing and error handling:
bash
#!/bin/bash
# Check if file provided
if [ $# -eq 0 ]; then
echo "Usage: $0 <csv_file>"
exit 1
fi
csv_file="$1"
# Check if file exists and is readable
if [ ! -r "$csv_file" ]; then
echo "Error: Cannot read file $csv_file"
exit 1
fi
echo "Processing CSV file: $csv_file"
echo "================================"
# Process header
header=$(head -1 "$csv_file")
num_fields=$(echo "$header" | awk -F',' '{print NF}')
echo "Header fields ($num_fields): $header"
echo ""
# Process data lines
line_num=1
error_count=0
valid_count=0
# Read the raw line first so the field count is accurate, then split it
while IFS= read -r line; do
((line_num++))
# Check if line has correct number of fields
current_fields=$(echo "$line" | awk -F',' '{print NF}')
if [ "$current_fields" -ne "$num_fields" ]; then
echo "Error on line $line_num: Expected $num_fields fields, got $current_fields"
((error_count++))
continue
fi
# Split the validated line into its fields
IFS=',' read -r field1 field2 field3 remainder <<< "$line"
# Basic validation - check if fields are not empty
if [ -z "$field1" ] || [ -z "$field2" ] || [ -z "$field3" ]; then
echo "Error on line $line_num: Empty required fields"
((error_count++))
continue
fi
# Process valid line (example: just echo it)
echo "Processing: $field1 | $field2 | $field3"
((valid_count++))
done < <(tail -n +2 "$csv_file")
echo ""
echo "Summary: Processed $valid_count valid lines, found $error_count errors"
This script shows proper CSV handling using IFS (the Internal Field Separator) with read. It counts fields on the raw line before splitting, checks for empty values, and feeds the loop from process substitution so the line, error, and valid counters survive after the loop for the summary. The tail -n +2 skips the header.
Shell Scripting Interview Questions and Answers For Intermediate (2-4 years Exp)
Moving beyond the basics, intermediate shell scripting coding interview questions focus on real-world problem solving, performance optimization, and handling complex scenarios you've likely encountered in production. We'll explore questions that test your experience with error handling, process management, and writing maintainable scripts that other developers can understand and modify. These questions reflect the challenges you face when scripts move from personal tools to team resources that need to be reliable and efficient.
Q1: How do you handle errors gracefully in production scripts?
At this level, you need multiple strategies working together. I use trap to catch errors and run cleanup, set -euo pipefail to stop on failures, and custom error functions:
#!/bin/bash
set -euo pipefail
# Global error handler
error_exit() {
echo "Error on line $1: $2" >&2
cleanup
exit 1
}
cleanup() {
# Remove temp files, unlock resources, etc
rm -f /tmp/mylockfile
echo "Cleanup completed"
}
# Set trap for errors and exit
trap 'error_exit $LINENO "Command failed"' ERR
trap cleanup EXIT
# Your actual script logic here
The key is combining these techniques - set -e stops on errors, -u catches undefined variables, -o pipefail catches errors in pipes, and trap ensures cleanup happens no matter what.
Q2: Explain process substitution and when you'd use it
Process substitution <(command) creates a temporary file descriptor that acts like a file. It's brilliant for comparing outputs or feeding multiple processes:
# Compare two command outputs
diff <(ls dir1) <(ls dir2)
# Feed sorted data to a command expecting a file
join <(sort file1) <(sort file2)
# Multiple inputs to a single command
paste <(cut -d: -f1 /etc/passwd) <(cut -d: -f3 /etc/passwd)
I use it when I need to avoid temporary files or when working with commands that only accept file arguments but I want to give them dynamic data.
Q3: How do you implement proper logging in shell scripts?
Production scripts need structured logging with timestamps, levels, and rotation. Here's my go-to pattern:
LOG_FILE="/var/log/myscript.log"
LOG_LEVEL=${LOG_LEVEL:-"INFO"} # Can override via environment
log() {
local level=$1
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
# Check log level
case $LOG_LEVEL in
ERROR) [[ $level =~ ^(ERROR)$ ]] || return ;;
WARN) [[ $level =~ ^(ERROR|WARN)$ ]] || return ;;
INFO) [[ $level =~ ^(ERROR|WARN|INFO)$ ]] || return ;;
DEBUG) ;; # Log everything
esac
echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
}
# Usage
log INFO "Script started"
log ERROR "Connection failed"
log DEBUG "Variable X = $x"
Don't forget log rotation - either use logrotate or implement size checking in your script.
Q4: How do you handle concurrent script execution?
Preventing race conditions is crucial. I typically use flock for reliable locking:
LOCK_FILE="/var/run/myscript.lock"
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
echo "Another instance is running. Exiting."
exit 1
fi
# Alternative: wait for lock with timeout
if ! flock -w 10 200; then
echo "Could not acquire lock after 10 seconds"
exit 1
fi
# Script continues here with exclusive lock
For more complex scenarios, I might use mkdir as an atomic operation or implement a PID file system that checks if the process is actually still running.
Q5: What's your approach to parsing complex command line arguments?
For production scripts, I use getopts for short options and manual parsing for long options:
usage() {
cat << EOF
Usage: $0 [-h] [-v] [-f FILE] [--debug] [--config CONFIG]
Options:
-h, --help Show this help
-v, --verbose Enable verbose mode
-f FILE Input file
--debug Enable debug mode
--config FILE Configuration file
EOF
}
# Parse arguments
VERBOSE=0
DEBUG=0
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
usage
exit 0
;;
-v|--verbose)
VERBOSE=1
shift
;;
-f)
INPUT_FILE="$2"
shift 2
;;
--debug)
DEBUG=1
set -x # Enable bash debugging
shift
;;
--config)
CONFIG_FILE="$2"
shift 2
;;
--)
shift
break
;;
-*)
echo "Unknown option: $1" >&2
usage
exit 1
;;
*)
break
;;
esac
done
# Remaining arguments are in $@
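For comparison, when only short options are needed, plain getopts keeps the parsing much shorter. A minimal sketch reusing the same -h, -v, and -f flags and the usage function defined above:
VERBOSE=0
while getopts "hvf:" opt; do
    case $opt in
        h) usage; exit 0 ;;
        v) VERBOSE=1 ;;
        f) INPUT_FILE="$OPTARG" ;;
        *) usage; exit 1 ;;
    esac
done
shift $((OPTIND - 1))   # remaining arguments are now in $@
The trade-off is that getopts can't handle long options like --debug, which is why the manual while/case loop above is still common in production scripts.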
Q6: How do you optimize scripts that process large files?
Never load entire files into memory. Instead, stream process them:
# Bad: loads entire file
for line in $(cat hugefile.txt); do
process "$line"
done
# Good: streams line by line
while IFS= read -r line; do
process "$line"
done < hugefile.txt
# Better: parallel processing for CPU-bound tasks
cat hugefile.txt | parallel -j 4 process {}
# For simple transformations, use tools designed for it
awk '{sum += $3} END {print sum}' hugefile.txt # Sum third column
sed -i 's/old/new/g' hugefile.txt # In-place replacement
Also consider splitting large files and processing chunks in parallel, or using tools like split, sort -S for memory limits, and join instead of loading everything into associative arrays.
Q7: Explain your strategy for making scripts portable across different systems
Portability is tricky. I start with POSIX compliance when possible, but pragmatically handle differences:
#!/usr/bin/env bash
# Using env to locate bash is more portable than hard-coding /bin/bash
# Detect OS
if [[ "$OSTYPE" == "linux-gnu"* ]]; then
OS="linux"
elif [[ "$OSTYPE" == "darwin"* ]]; then
OS="macos"
elif [[ "$OSTYPE" == "freebsd"* ]]; then
OS="freebsd"
fi
# Handle command differences
if command -v gsed &> /dev/null; then
SED=gsed # GNU sed on Mac
else
SED=sed
fi
# Check for required commands
for cmd in awk grep "$SED"; do
if ! command -v "$cmd" &> /dev/null; then
echo "Required command '$cmd' not found" >&2
exit 1
fi
done
# Use portable options
find . -type f -name "*.txt" -print0 | xargs -0 grep "pattern" # Works everywhere
Q8: How do you implement timeout functionality for long-running commands?
The timeout command is great when available, but here's a portable solution:
# Using timeout command (if available)
if command -v timeout &> /dev/null; then
timeout 30s long_running_command
else
# Manual implementation
long_running_command &
pid=$!
count=0
while kill -0 $pid 2>/dev/null && [ $count -lt 30 ]; do
sleep 1
((count++))
done
if kill -0 $pid 2>/dev/null; then
echo "Command timed out after 30 seconds"
kill -TERM $pid
sleep 2
kill -0 $pid 2>/dev/null && kill -KILL $pid
fi
fi
# Alternative: using read with timeout for user input
if read -t 10 -p "Enter value (10 sec timeout): " value; then
echo "You entered: $value"
else
echo "Timeout or cancelled"
fi
Q9: What's your approach to debugging complex shell scripts?
Beyond set -x, I use structured debugging:
# Debug function with levels
DEBUG_LEVEL=${DEBUG_LEVEL:-0}
debug() {
local level=$1
shift
if [ "$DEBUG_LEVEL" -ge "$level" ]; then
echo "[DEBUG$level] $*" >&2
fi
}
# Trace function execution
trace() {
local func=$1
shift
debug 1 "Entering $func with args: $*"
local start=$(date +%s.%N)
"$func" "$@"
local ret=$?
local end=$(date +%s.%N)
local duration=$(echo "$end - $start" | bc)
debug 1 "Exiting $func (duration: ${duration}s, return: $ret)"
return $ret
}
# PS4 for better trace output
export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'
# Conditional debugging
[ "$DEBUG" = "1" ] && set -x
I also use ShellCheck religiously, and for really tough bugs, I'll use bashdb or add extensive logging.
Q10: How do you manage configuration in shell scripts?
I layer configurations from multiple sources with clear precedence:
# Default configuration
declare -A CONFIG=(
[db_host]="localhost"
[db_port]="5432"
[log_level]="INFO"
)
# Load system config
[ -f /etc/myapp/config ] && source /etc/myapp/config
# Load user config
[ -f ~/.myapp/config ] && source ~/.myapp/config
# Load project config
[ -f ./config ] && source ./config
# Environment variables override everything
[ -n "$MYAPP_DB_HOST" ] && CONFIG[db_host]=$MYAPP_DB_HOST
[ -n "$MYAPP_DB_PORT" ] && CONFIG[db_port]=$MYAPP_DB_PORT
# Validate required settings
for key in db_host db_port; do
if [ -z "${CONFIG[$key]}" ]; then
echo "Error: Missing required config: $key" >&2
exit 1
fi
done
# Export for child processes if needed
export MYAPP_CONFIG_DB_HOST="${CONFIG[db_host]}"
This gives flexibility while maintaining predictable behavior - defaults, system-wide settings, user preferences, project overrides, and finally environment variables.
Shell Scripting Coding Interview Questions and Answers For Intermediate (2-4 years Exp)
At the intermediate level, shell scripting coding interview questions shift from basic syntax to solving real production challenges - the kind where your script needs to handle thousands of files, recover from failures, and play nice with other systems. We'll tackle the practical problems that someone with a few years under their belt should handle confidently, focusing on efficiency, reliability, and maintainability. These are the scenarios where your experience shows through in how you structure your solution, not just whether it works.
Q1: Write a script that monitors a log file in real-time and sends alerts for specific error patterns
This tests your ability to handle continuous file monitoring and pattern matching:
#!/bin/bash
# Configuration
LOG_FILE="${1:-/var/log/application.log}"
ALERT_EMAIL="admin@company.com"
ERROR_PATTERNS=("ERROR" "CRITICAL" "FATAL" "OutOfMemory")
ALERT_THRESHOLD=5 # Alert after 5 errors in 60 seconds
WINDOW_SIZE=60
# Check if log file exists
if [ ! -f "$LOG_FILE" ]; then
echo "Error: Log file $LOG_FILE not found"
exit 1
fi
# Initialize error tracking
declare -A error_counts
declare -A error_timestamps
# Function to send alert
send_alert() {
local pattern=$1
local count=$2
local message="Alert: $count occurrences of '$pattern' in last $WINDOW_SIZE seconds"
# In real scenario, you'd send email or post to Slack
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $message"
# Example email command (commented out)
# echo "$message" | mail -s "Log Alert: $pattern" $ALERT_EMAIL
# Reset counter after alert
error_counts[$pattern]=0
error_timestamps[$pattern]=""
}
# Function to clean old timestamps
clean_old_timestamps() {
local pattern=$1
local current_time=$(date +%s)
local new_timestamps=""
# Keep only timestamps within window
for ts in ${error_timestamps[$pattern]}; do
if [ $((current_time - ts)) -lt $WINDOW_SIZE ]; then
new_timestamps="$new_timestamps $ts"
fi
done
error_timestamps[$pattern]="$new_timestamps"
error_counts[$pattern]=$(echo "$new_timestamps" | wc -w)
}
echo "Monitoring $LOG_FILE for error patterns..."
echo "Patterns: ${ERROR_PATTERNS[*]}"
echo "Alert threshold: $ALERT_THRESHOLD errors in $WINDOW_SIZE seconds"
echo "Press Ctrl+C to stop"
# Monitor log file
tail -F "$LOG_FILE" | while read -r line; do
# Check each pattern
for pattern in "${ERROR_PATTERNS[@]}"; do
if [[ "$line" =~ $pattern ]]; then
current_time=$(date +%s)
# Add timestamp
error_timestamps[$pattern]="${error_timestamps[$pattern]} $current_time"
# Clean old timestamps and update count
clean_old_timestamps "$pattern"
# Check if threshold exceeded
if [ "${error_counts[$pattern]}" -ge "$ALERT_THRESHOLD" ]; then
send_alert "$pattern" "${error_counts[$pattern]}"
fi
# Debug output
echo "[$(date '+%H:%M:%S')] Found: $pattern (count: ${error_counts[$pattern]})"
fi
done
done
The script uses tail -F (capital F) to follow log rotation, tracks errors within a time window, and only alerts when thresholds are exceeded. This prevents alert spam while catching real issues.
Q2: Create a script that performs parallel backup of multiple directories with progress tracking
This demonstrates process management and parallel execution:
#!/bin/bash
# Source directories to backup
SOURCE_DIRS=(
"/home/user/documents"
"/home/user/projects"
"/var/www/html"
"/etc"
)
BACKUP_ROOT="/backup/$(date +%Y%m%d_%H%M%S)"
MAX_PARALLEL=3
COMPRESSION="gzip" # or "bzip2", "xz"
# Create backup directory
mkdir -p "$BACKUP_ROOT"
# File to track progress
PROGRESS_FILE="/tmp/backup_progress_$$"
rm -f "$PROGRESS_FILE"
touch "$PROGRESS_FILE"
# Function to backup a single directory
backup_directory() {
local src_dir=$1
local dir_name=$(basename "$src_dir")
local backup_file="$BACKUP_ROOT/${dir_name}.tar.gz"
local pid=$$
local start_time=$(date +%s)
echo "$pid:$dir_name:0:STARTING" >> "$PROGRESS_FILE"
# Calculate directory size for progress
local total_size=$(du -sb "$src_dir" 2>/dev/null | awk '{print $1}')
# Create backup with progress monitoring
tar cf - "$src_dir" 2>/dev/null | \
pv -s "$total_size" -n 2>/tmp/pv_progress_$$ | \
gzip > "$backup_file" &
local tar_pid=$!
# Monitor progress
while kill -0 $tar_pid 2>/dev/null; do
if [ -f /tmp/pv_progress_$$ ]; then
progress=$(tail -1 /tmp/pv_progress_$$ 2>/dev/null || echo "0")
echo "$pid:$dir_name:$progress:RUNNING" >> "$PROGRESS_FILE"
fi
sleep 1
done
# Check if backup succeeded
wait $tar_pid
local exit_code=$?
local end_time=$(date +%s)
local duration=$((end_time - start_time))
if [ $exit_code -eq 0 ]; then
local final_size=$(stat -f%z "$backup_file" 2>/dev/null || stat -c%s "$backup_file")
echo "$pid:$dir_name:100:COMPLETED:$duration:$final_size" >> "$PROGRESS_FILE"
else
echo "$pid:$dir_name:0:FAILED:$duration:0" >> "$PROGRESS_FILE"
fi
rm -f /tmp/pv_progress_$$
}
# Function to display progress
show_progress() {
while true; do
clear
echo "Backup Progress - $(date '+%Y-%m-%d %H:%M:%S')"
echo "=================================================="
# Read current progress
while IFS=':' read -r pid dir progress status duration size; do
case $status in
STARTING)
printf "%-30s [%s]\n" "$dir" "Initializing..."
;;
RUNNING)
# Create progress bar
bar_length=30
filled=$((progress * bar_length / 100))
bar=$(printf '%*s' "$filled" | tr ' ' '=')
empty=$((bar_length - filled))
bar="${bar}$(printf '%*s' "$empty" | tr ' ' '-')"
printf "%-30s [%s] %3d%%\n" "$dir" "$bar" "$progress"
;;
COMPLETED)
size_mb=$((size / 1024 / 1024))
printf "%-30s [DONE] %dMB in %ds\n" "$dir" "$size_mb" "$duration"
;;
FAILED)
printf "%-30s [FAILED]\n" "$dir"
;;
esac
done < "$PROGRESS_FILE"
# Check if all backups are done
if ! grep -q "RUNNING\|STARTING" "$PROGRESS_FILE" 2>/dev/null; then
echo ""
echo "All backups completed!"
break
fi
sleep 1
done
}
# Start progress display in background
show_progress &
progress_pid=$!
# Run backups in parallel with job control
job_count=0
for dir in "${SOURCE_DIRS[@]}"; do
# Wait if we've hit the parallel limit
while [ $(jobs -r | wc -l) -ge $MAX_PARALLEL ]; do
sleep 0.5
done
# Start backup job
backup_directory "$dir" &
((job_count++))
done
# Wait for all backup jobs
wait
# Stop progress display
kill $progress_pid 2>/dev/null
show_progress # Show final status
# Cleanup
rm -f "$PROGRESS_FILE"
# Generate summary report
echo ""
echo "Backup Summary"
echo "=============="
echo "Location: $BACKUP_ROOT"
echo "Total archives: $(ls -1 "$BACKUP_ROOT"/*.tar.gz 2>/dev/null | wc -l)"
echo "Total size: $(du -sh "$BACKUP_ROOT" | awk '{print $1}')"
# List all backups with sizes
ls -lh "$BACKUP_ROOT"/*.tar.gz 2>/dev/null
This script showcases parallel job management, real-time progress tracking with pv, and proper cleanup. It handles multiple directories simultaneously while keeping the user informed.
Q3: Write a script that synchronizes configuration files across multiple servers
This tests your understanding of remote execution and error handling:
#!/bin/bash
# Configuration
CONFIG_DIR="/etc/myapp"
SERVERS=("web01.example.com" "web02.example.com" "web03.example.com")
SSH_USER="deploy"
SSH_KEY="$HOME/.ssh/deploy_key"
BACKUP_BEFORE_SYNC=true
DRY_RUN=false
# Parse command line
while getopts "dnh" opt; do
case $opt in
d) DRY_RUN=true ;;
n) BACKUP_BEFORE_SYNC=false ;;
h) echo "Usage: $0 [-d] [-n] [-h]"
echo " -d Dry run (show what would be done)"
echo " -n No backup before sync"
exit 0 ;;
esac
done
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Logging function
log() {
local level=$1
shift
local msg="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
case $level in
ERROR) echo -e "${RED}[$timestamp] ERROR: $msg${NC}" >&2 ;;
SUCCESS) echo -e "${GREEN}[$timestamp] SUCCESS: $msg${NC}" ;;
INFO) echo -e "[$timestamp] INFO: $msg" ;;
WARN) echo -e "${YELLOW}[$timestamp] WARN: $msg${NC}" ;;
esac
# Also log to file
echo "[$timestamp] $level: $msg" >> "/var/log/config_sync.log"
}
# Function to check server connectivity
check_server() {
local server=$1
ssh -q -o ConnectTimeout=5 -o BatchMode=yes \
-i "$SSH_KEY" "$SSH_USER@$server" exit 2>/dev/null
}
# Function to backup configs on remote server
backup_remote_configs() {
local server=$1
local backup_name="config_backup_$(date +%Y%m%d_%H%M%S).tar.gz"
log INFO "Creating backup on $server"
ssh -i "$SSH_KEY" "$SSH_USER@$server" "
if [ -d '$CONFIG_DIR' ]; then
sudo tar czf /tmp/$backup_name $CONFIG_DIR 2>/dev/null
sudo mv /tmp/$backup_name /var/backups/
echo 'Backup created: /var/backups/$backup_name'
else
echo 'Config directory not found, skipping backup'
fi
"
}
# Function to sync single file
sync_file() {
local server=$1
local file=$2
local rel_path=${file#$CONFIG_DIR/}
# Calculate checksum
local local_sum=$(md5sum "$file" | awk '{print $1}')
local remote_sum=$(ssh -i "$SSH_KEY" "$SSH_USER@$server" \
"sudo md5sum '$file' 2>/dev/null | awk '{print \$1}'")
if [ "$local_sum" = "$remote_sum" ]; then
echo " ✓ $rel_path (unchanged)"
return 0
fi
if [ "$DRY_RUN" = true ]; then
echo " → Would sync: $rel_path"
return 0
fi
# Copy file
if scp -i "$SSH_KEY" "$file" "$SSH_USER@$server:/tmp/$(basename "$file")" >/dev/null 2>&1; then
# Move to final location with sudo
if ssh -i "$SSH_KEY" "$SSH_USER@$server" "
sudo mkdir -p $(dirname "$file")
sudo mv /tmp/$(basename "$file") $file
sudo chown root:root $file
sudo chmod 644 $file
"; then
echo " ✓ $rel_path (updated)"
return 0
fi
fi
echo " ✗ $rel_path (failed)"
return 1
}
# Main sync process
log INFO "Starting configuration sync"
[ "$DRY_RUN" = true ] && log WARN "Running in DRY RUN mode"
# Check all servers first
log INFO "Checking server connectivity..."
available_servers=()
for server in "${SERVERS[@]}"; do
if check_server "$server"; then
available_servers+=("$server")
echo " ✓ $server"
else
log ERROR "Cannot connect to $server"
echo " ✗ $server"
fi
done
if [ ${#available_servers[@]} -eq 0 ]; then
log ERROR "No servers available for sync"
exit 1
fi
# Find all config files
config_files=$(find "$CONFIG_DIR" -type f \( -name "*.conf" -o -name "*.yml" -o -name "*.json" \))
file_count=$(echo "$config_files" | wc -l)
log INFO "Found $file_count configuration files to sync"
# Sync to each server
for server in "${available_servers[@]}"; do
log INFO "Syncing to $server"
# Backup if requested
if [ "$BACKUP_BEFORE_SYNC" = true ] && [ "$DRY_RUN" = false ]; then
backup_remote_configs "$server"
fi
# Sync each file
success_count=0
fail_count=0
while IFS= read -r file; do
if sync_file "$server" "$file"; then
((success_count++))
else
((fail_count++))
fi
done <<< "$config_files"
# Reload services if sync succeeded
if [ $fail_count -eq 0 ] && [ "$DRY_RUN" = false ]; then
log INFO "Reloading services on $server"
ssh -i "$SSH_KEY" "$SSH_USER@$server" "
sudo systemctl reload nginx 2>/dev/null || true
sudo systemctl reload myapp 2>/dev/null || true
"
fi
# Report for this server
if [ $fail_count -eq 0 ]; then
log SUCCESS "$server: $success_count files synced successfully"
else
log ERROR "$server: $fail_count files failed (${success_count} succeeded)"
fi
done
log INFO "Configuration sync completed"
This script handles multiple servers, compares checksums to avoid unnecessary transfers, creates backups, and includes proper error handling with dry-run support.
Q4: Create a script that analyzes system performance and generates an HTML report
This shows data collection, processing, and formatted output:
#!/bin/bash
# Output file
REPORT_FILE="system_report_$(hostname)_$(date +%Y%m%d_%H%M%S).html"
TEMP_DIR="/tmp/sysreport_$$"
mkdir -p "$TEMP_DIR"
# Thresholds for warnings
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90
# Function to get CSS color based on value
get_color() {
local value=$1
local threshold=$2
if [ $(echo "$value >= $threshold" | bc) -eq 1 ]; then
echo "#ff4444" # Red
elif [ $(echo "$value >= $threshold * 0.8" | bc) -eq 1 ]; then
echo "#ff9944" # Orange
else
echo "#44ff44" # Green
fi
}
# Collect system information
echo "Collecting system information..."
# Basic info
HOSTNAME=$(hostname)
UPTIME=$(uptime -p)
KERNEL=$(uname -r)
CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | cut -d: -f2)
CPU_CORES=$(nproc)
TOTAL_MEM=$(free -h | awk '/^Mem:/ {print $2}')
# CPU usage (over 5 seconds)
echo "Analyzing CPU usage..."
sar -u 1 5 > "$TEMP_DIR/cpu_stats.txt" 2>/dev/null || {
# Fallback if sar not available
top -b -n 2 -d 5 | grep "Cpu(s)" | tail -1 > "$TEMP_DIR/cpu_stats.txt"
}
CPU_USAGE=$(awk '/Average:/ {print 100 - $NF}' "$TEMP_DIR/cpu_stats.txt")
# Fall back to parsing the top output (idle %), then to 0 so the later bc comparisons don't break
[ -z "$CPU_USAGE" ] && CPU_USAGE=$(grep -o '[0-9.]* id' "$TEMP_DIR/cpu_stats.txt" | awk '{print 100 - $1}')
[ -z "$CPU_USAGE" ] && CPU_USAGE=0
# Memory usage
MEM_STATS=$(free -m | awk '/^Mem:/ {printf "%.1f", ($3/$2) * 100}')
# Disk usage
echo "Analyzing disk usage..."
DISK_DATA=""
while read -r line; do
device=$(echo "$line" | awk '{print $1}')
mount=$(echo "$line" | awk '{print $6}')
usage=$(echo "$line" | awk '{print $5}' | tr -d '%')
size=$(echo "$line" | awk '{print $2}')
used=$(echo "$line" | awk '{print $3}')
avail=$(echo "$line" | awk '{print $4}')
color=$(get_color "$usage" "$DISK_THRESHOLD")
DISK_DATA="${DISK_DATA}
<tr>
<td>$device</td>
<td>$mount</td>
<td>$size</td>
<td>$used</td>
<td>$avail</td>
<td style='color: $color; font-weight: bold;'>$usage%</td>
</tr>"
done < <(df -h | grep '^/dev/' | grep -v '/dev/loop')
# Top processes
echo "Analyzing processes..."
TOP_CPU=$(ps aux --sort=-%cpu | head -6 | tail -5)
TOP_MEM=$(ps aux --sort=-%mem | head -6 | tail -5)
# Network statistics
NETSTAT=$(ss -s 2>/dev/null | grep -E "TCP:|UDP:" | head -4)
# Recent system logs (errors/warnings)
RECENT_ERRORS=$(journalctl -p err -n 10 --no-pager 2>/dev/null || \
dmesg | grep -iE "error|fail|warn" | tail -10)
# Generate HTML report
cat > "$REPORT_FILE" << 'EOF'
<!DOCTYPE html>
<html>
<head>
<title>System Performance Report</title>
<meta charset="utf-8">
<style>
body {
font-family: Arial, sans-serif;
margin: 20px;
background-color: #f5f5f5;
}
.container {
max-width: 1200px;
margin: 0 auto;
background-color: white;
padding: 20px;
box-shadow: 0 0 10px rgba(0,0,0,0.1);
}
h1, h2 {
color: #333;
border-bottom: 2px solid #4CAF50;
padding-bottom: 10px;
}
.info-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 20px;
margin: 20px 0;
}
.info-box {
background-color: #f9f9f9;
padding: 15px;
border-radius: 5px;
border-left: 4px solid #4CAF50;
}
table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
}
th, td {
text-align: left;
padding: 12px;
border-bottom: 1px solid #ddd;
}
th {
background-color: #4CAF50;
color: white;
}
tr:hover {
background-color: #f5f5f5;
}
.metric {
font-size: 24px;
font-weight: bold;
margin: 10px 0;
}
.warning {
background-color: #fff3cd;
border-color: #ffeaa7;
color: #856404;
padding: 10px;
border-radius: 5px;
margin: 10px 0;
}
.timestamp {
text-align: right;
color: #666;
font-style: italic;
}
pre {
background-color: #f4f4f4;
padding: 10px;
border-radius: 5px;
overflow-x: auto;
}
</style>
</head>
<body>
<div class="container">
EOF
# Add content
cat >> "$REPORT_FILE" << EOF
<h1>System Performance Report - $HOSTNAME</h1>
<p class="timestamp">Generated on $(date '+%Y-%m-%d %H:%M:%S')</p>
<div class="info-grid">
<div class="info-box">
<strong>Hostname:</strong> $HOSTNAME<br>
<strong>Uptime:</strong> $UPTIME<br>
<strong>Kernel:</strong> $KERNEL
</div>
<div class="info-box">
<strong>CPU Model:</strong> $CPU_MODEL<br>
<strong>CPU Cores:</strong> $CPU_CORES<br>
<strong>Total Memory:</strong> $TOTAL_MEM
</div>
<div class="info-box">
<strong>CPU Usage:</strong>
<div class="metric" style="color: $(get_color $CPU_USAGE $CPU_THRESHOLD)">
${CPU_USAGE}%
</div>
</div>
<div class="info-box">
<strong>Memory Usage:</strong>
<div class="metric" style="color: $(get_color $MEM_STATS $MEM_THRESHOLD)">
${MEM_STATS}%
</div>
</div>
</div>
EOF
# Add warnings if thresholds exceeded
if [ $(echo "$CPU_USAGE >= $CPU_THRESHOLD" | bc) -eq 1 ]; then
echo '<div class="warning">⚠️ High CPU usage detected!</div>' >> "$REPORT_FILE"
fi
if [ $(echo "$MEM_STATS >= $MEM_THRESHOLD" | bc) -eq 1 ]; then
echo '<div class="warning">⚠️ High memory usage detected!</div>' >> "$REPORT_FILE"
fi
# Add disk usage table
cat >> "$REPORT_FILE" << EOF
<h2>Disk Usage</h2>
<table>
<tr>
<th>Device</th>
<th>Mount Point</th>
<th>Size</th>
<th>Used</th>
<th>Available</th>
<th>Usage %</th>
</tr>
$DISK_DATA
</table>
<h2>Top CPU Consuming Processes</h2>
<pre>$(echo "$TOP_CPU" | awk '{printf "%-8s %-8s %6s %s\n", $1, $2, $9, $11}')</pre>
<h2>Top Memory Consuming Processes</h2>
<pre>$(echo "$TOP_MEM" | awk '{printf "%-8s %-8s %6s %s\n", $1, $2, $10, $11}')</pre>
<h2>Network Statistics</h2>
<pre>$NETSTAT</pre>
<h2>Recent System Errors/Warnings</h2>
<pre>$(echo "$RECENT_ERRORS" | head -20)</pre>
</div>
</body>
</html>
EOF
# Cleanup
rm -rf "$TEMP_DIR"
echo "Report generated: $REPORT_FILE"
# Optional: open in browser
if command -v xdg-open >/dev/null 2>&1; then
xdg-open "$REPORT_FILE"
elif command -v open >/dev/null 2>&1; then
open "$REPORT_FILE"
fi
This creates a professional-looking HTML report with color-coded metrics, responsive design, and comprehensive system analysis.
Q5: Write a script that implements a job queue system with worker processes
This demonstrates advanced process control and inter-process communication:
#!/bin/bash
# Configuration
QUEUE_DIR="/var/spool/job_queue"
WORKERS=4
PID_FILE="/var/run/job_queue.pid"
LOG_FILE="/var/log/job_queue.log"
WORKER_TIMEOUT=300 # 5 minutes max per job
# Create necessary directories
mkdir -p "$QUEUE_DIR"/{pending,processing,completed,failed}
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}
# Signal handlers
cleanup() {
log "Shutting down job queue system..."
# Signal all workers to stop
touch "$QUEUE_DIR/.shutdown"
# Wait for workers to finish current jobs
for pid in $(jobs -p); do
log "Waiting for worker $pid to finish..."
wait $pid 2>/dev/null
done
rm -f "$PID_FILE" "$QUEUE_DIR/.shutdown"
log "Shutdown complete"
exit 0
}
trap cleanup SIGINT SIGTERM
# Function to add a job to queue
add_job() {
local job_type=$1
local job_data=$2
local priority=${3:-5} # Default priority 5 (1=highest, 9=lowest)
local job_id=$(date +%s%N)_$$
local job_file="$QUEUE_DIR/pending/${priority}_${job_id}.job"
cat > "$job_file" << EOF
JOB_ID=$job_id
JOB_TYPE=$job_type
JOB_DATA=$job_data
SUBMIT_TIME=$(date +%s)
SUBMIT_USER=$USER
STATUS=pending
EOF
log "Job added: $job_id (type: $job_type, priority: $priority)"
echo "$job_id"
}
# Worker function
worker_process() {
local worker_id=$1
log "Worker $worker_id started (PID: $$)"
while [ ! -f "$QUEUE_DIR/.shutdown" ]; do
# Find next job (sorted by priority and age)
local job_file=$(ls -1 "$QUEUE_DIR/pending/"*.job 2>/dev/null | sort | head -1)
if [ -z "$job_file" ]; then
# No jobs, wait a bit
sleep 2
continue
fi
# Try to claim the job (atomic operation)
local processing_file="$QUEUE_DIR/processing/$(basename "$job_file")"
if ! mv "$job_file" "$processing_file" 2>/dev/null; then
# Another worker got it first
continue
fi
# Load job details
source "$processing_file"
log "Worker $worker_id processing job $JOB_ID (type: $JOB_TYPE)"
# Update job status
echo "WORKER=$worker_id" >> "$processing_file"
echo "START_TIME=$(date +%s)" >> "$processing_file"
echo "STATUS=processing" >> "$processing_file"
# Execute job with timeout
local job_output="$QUEUE_DIR/processing/${JOB_ID}.out"
local job_result=0
case "$JOB_TYPE" in
"compress")
timeout $WORKER_TIMEOUT tar czf "${JOB_DATA}.tar.gz" "$JOB_DATA" \
> "$job_output" 2>&1
job_result=$?
;;
"backup")
timeout $WORKER_TIMEOUT rsync -av "$JOB_DATA" "/backup/$JOB_DATA" \
> "$job_output" 2>&1
job_result=$?
;;
"report")
timeout $WORKER_TIMEOUT /usr/local/bin/generate_report.sh "$JOB_DATA" \
> "$job_output" 2>&1
job_result=$?
;;
"custom")
# Execute custom command (careful with security!)
timeout $WORKER_TIMEOUT bash -c "$JOB_DATA" \
> "$job_output" 2>&1
job_result=$?
;;
*)
echo "Unknown job type: $JOB_TYPE" > "$job_output"
job_result=1
;;
esac
# Move to completed or failed
local end_time=$(date +%s)
local duration=$((end_time - START_TIME))
echo "END_TIME=$end_time" >> "$processing_file"
echo "DURATION=$duration" >> "$processing_file"
echo "EXIT_CODE=$job_result" >> "$processing_file"
if [ $job_result -eq 0 ]; then
echo "STATUS=completed" >> "$processing_file"
mv "$processing_file" "$QUEUE_DIR/completed/"
mv "$job_output" "$QUEUE_DIR/completed/"
log "Worker $worker_id completed job $JOB_ID in ${duration}s"
else
echo "STATUS=failed" >> "$processing_file"
mv "$processing_file" "$QUEUE_DIR/failed/"
mv "$job_output" "$QUEUE_DIR/failed/"
log "Worker $worker_id: job $JOB_ID failed (exit code: $job_result)"
fi
done
log "Worker $worker_id stopped"
}
# Status monitoring function
monitor_status() {
while [ ! -f "$QUEUE_DIR/.shutdown" ]; do
clear
echo "Job Queue Status - $(date)"
echo "================================"
# Count jobs by status
local pending=$(ls -1 "$QUEUE_DIR/pending/"*.job 2>/dev/null | wc -l)
local processing=$(ls -1 "$QUEUE_DIR/processing/"*.job 2>/dev/null | wc -l)
local completed=$(ls -1 "$QUEUE_DIR/completed/"*.job 2>/dev/null | wc -l)
local failed=$(ls -1 "$QUEUE_DIR/failed/"*.job 2>/dev/null | wc -l)
echo "Pending: $pending"
echo "Processing: $processing"
echo "Completed: $completed"
echo "Failed: $failed"
echo ""
# Show current processing jobs
if [ $processing -gt 0 ]; then
echo "Currently Processing:"
echo "--------------------"
for job in "$QUEUE_DIR/processing/"*.job; do
[ -f "$job" ] || continue
source "$job"
local runtime=$(($(date +%s) - START_TIME))
printf " Job %s (Worker %d): %s - %ds\n" \
"$JOB_ID" "$WORKER" "$JOB_TYPE" "$runtime"
done
echo ""
fi
# Show worker status
echo "Workers:"
echo "--------"
for i in $(seq 1 $WORKERS); do
if kill -0 ${WORKER_PIDS[$i]} 2>/dev/null; then
echo " Worker $i: Running (PID: ${WORKER_PIDS[$i]})"
else
echo " Worker $i: Stopped"
fi
done
sleep 5
done
}
# Main execution
if [ "$1" = "add" ]; then
# Add job mode
shift
add_job "$@"
exit 0
fi
if [ "$1" = "status" ]; then
# Status mode
monitor_status
exit 0
fi
# Start queue manager
log "Starting job queue system with $WORKERS workers"
echo $$ > "$PID_FILE"
# Start workers
declare -a WORKER_PIDS
for i in $(seq 1 $WORKERS); do
worker_process $i &
WORKER_PIDS[$i]=$!
done
# Wait for all workers
log "Job queue system running. Press Ctrl+C to stop."
wait
This implements a complete job queue with priority handling, multiple workers, atomic job claiming, timeout protection, and comprehensive logging. It's the kind of system you'd build to handle background tasks in production.
Shell Scripting Interview Questions and Answers For Experienced (5+ years Exp)
At the senior level, advanced shell scripting interview questions dig deep into architectural decisions, performance at scale, and the wisdom that comes from maintaining production systems through their entire lifecycle. These questions explore how you handle the messy realities of enterprise environments - legacy system integration, cross-platform compatibility nightmares, and scripts that need to run reliably for years. The focus shifts from "can you write it" to "have you lived through the consequences of different approaches" and whether you can design solutions that other engineers can maintain long after you've moved on.
Q1: How do you design shell scripts for high-availability environments where failure isn't an option?
After years of 3 AM calls, I've learned that HA scripts need multiple layers of protection. First, I implement circuit breakers - if a script fails repeatedly, it stops trying and alerts instead of hammering a broken system. I use distributed locking across nodes (often with Redis or etcd) to prevent split-brain scenarios. Here's my approach:
Health checks before actions: Never assume a service is up. Always verify endpoints are responding correctly before sending traffic.
Idempotency everywhere: Every operation should be safe to run multiple times. I use checksums, version checks, and state files to ensure this.
Graceful degradation: If a non-critical component fails, the script continues with reduced functionality rather than failing completely.
Audit trails: Every action gets logged with who/what/when/why, often shipped to a central logging system for correlation.
The key insight I've gained is that HA isn't about preventing failures - it's about failing gracefully and recovering automatically. I've seen too many "clever" scripts that made things worse during outages.
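As a concrete illustration of the circuit-breaker idea, here's a minimal sketch - the state-file path, the threshold, and the run_nightly_sync job are all placeholder examples:
FAIL_COUNT_FILE="/var/tmp/nightly_sync.failures"   # hypothetical state file tracking consecutive failures
MAX_FAILURES=3
run_nightly_sync() { rsync -a /srv/data/ backup-host:/srv/data/; }   # placeholder for the real job

failures=$(cat "$FAIL_COUNT_FILE" 2>/dev/null || echo 0)
if [ "$failures" -ge "$MAX_FAILURES" ]; then
    echo "Circuit open: $failures consecutive failures - alerting instead of retrying" >&2
    exit 1
fi
if run_nightly_sync; then
    echo 0 > "$FAIL_COUNT_FILE"              # success closes the circuit again
else
    echo $((failures + 1)) > "$FAIL_COUNT_FILE"
    exit 1
fi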
Q2: Explain your approach to handling sensitive data in shell scripts
This is where experience really shows. Never, ever put secrets in scripts. I've cleaned up too many messes where passwords were in version control. My approach:
Environment variables for local development only, never in production
Secrets management tools like HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets
Temporary credentials with short TTLs whenever possible
Audit logging for every secret access
I also use set +x before any line that might expose secrets in debug output, and I'm paranoid about temporary files - always created with mktemp and proper permissions, cleaned up in trap handlers. I've seen scripts that wrote database passwords to /tmp with world-readable permissions. That's a career-limiting move.
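The temporary-file discipline mentioned above looks roughly like this - a sketch of the cleanup pattern, not a complete secrets workflow, with the fetch step left as a placeholder comment:
set +x                                     # never trace around secret handling
umask 077                                  # files we create are readable only by the owner
secret_file=$(mktemp) || exit 1
trap 'rm -f "$secret_file"' EXIT INT TERM  # removed no matter how the script exits
# Fetch the credential into the temp file from your secrets manager
# (placeholder - e.g. vault kv get -field=password secret/db > "$secret_file")
# ...then pass the path around rather than the value itself, so it never appears in ps output...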
Q3: How do you optimize shell scripts that process millions of records?
The biggest lesson I've learned: the shell isn't always the answer. But when it is, here's what works:
Streaming over loading: Never load large datasets into memory. Use pipes and process line by line.
GNU Parallel for CPU-bound tasks. It's incredible how much faster things run with proper parallelization.
Sort/join over nested loops: Unix tools are optimized for this. A sort | join pipeline beats nested loops every time.
Minimize process spawning: That for loop calling sed 1000 times? Rewrite it as a single awk script.
Real example: I once replaced a 6-hour customer data processing script with a sort | join | awk pipeline that ran in 12 minutes. The original developer was reading a 2GB file into an associative array. Sometimes the old Unix philosophy of small, focused tools really shines.
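The shape of that pipeline was roughly the following (file names and column positions are illustrative - two comma-separated files keyed on an id in column 1):
# customers.csv: id,name    orders.csv: id,amount
join -t, \
    <(sort -t, -k1,1 customers.csv) \
    <(sort -t, -k1,1 orders.csv) |
awk -F, '{ total[$2] += $3 } END { for (name in total) print name, total[name] }'
Everything streams: sort handles spilling to disk, join walks both inputs once, and awk keeps only one small array in memory.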
Q4: Describe your most complex debugging experience with shell scripts
The worst one that comes to mind involved a script that worked perfectly for 3 years, then started randomly failing on Tuesdays. Turned out a log rotation job was creating a race condition, but only when the log file exceeded 2GB, which only happened on our busiest day.
My debugging toolkit has evolved from painful experiences:
strace/dtrace to see actual system calls
bash -x is just the starting point; I often add custom debug functions that can be toggled
Reproducing environments exactly - same shell version, same locale settings, same everything
Binary search debugging - comment out half the script, see if it still fails
The real skill is knowing when to stop debugging and rewrite. I've learned that if I'm spending more than a day debugging a complex script, it's probably too complex and needs to be simplified or rewritten in a proper programming language.
Q5: How do you handle shell script deployment and versioning across hundreds of servers?
Configuration management is crucial here. I've used Puppet, Ansible, and Chef, but the principles remain the same:
Immutable deployments: Scripts are versioned packages, not edited in place
Canary deployments: Roll out to 1%, then 10%, then 50%, monitoring metrics at each stage
Feature flags: Even in shell scripts, I implement toggles for risky changes
Rollback procedures: Always have a quick way back. I version with symlinks for instant rollback
One hard lesson: never trust system package managers alone. I've been burned by different versions of bash, different coreutils implementations, even different versions of basic commands like 'date'. Now I always include compatibility checks and sometimes ship specific binary versions with my scripts.
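The symlink-based rollback mentioned above is simple enough to sketch (the paths and version numbers are illustrative):
# Each version lives in its own directory; "current" is just a symlink
ln -sfn /opt/tools/releases/1.4.2 /opt/tools/current    # deploy 1.4.2
/opt/tools/current/run.sh                               # callers always use the stable path
ln -sfn /opt/tools/releases/1.4.1 /opt/tools/current    # rollback is just repointing the link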
Q6: What's your philosophy on when to use shell scripts versus a "real" programming language?
This is the question that separates experienced engineers from script kiddies. My rule of thumb:
Shell scripts are perfect for:
Gluing systems together
System administration tasks
Quick prototypes
Build and deployment automation
But I switch to Python/Go/Ruby when:
The script exceeds 200-300 lines
I need complex data structures
Error handling becomes more complex than the actual logic
I'm parsing anything more complex than simple delimited data
Performance is critical
I've maintained 2000-line bash scripts. It's not fun. The maintenance cost grows exponentially with complexity. Now I'm quick to recognize when bash has served its purpose and it's time to rewrite.
Q7: How do you ensure shell script security in a zero-trust environment?
Security has evolved way beyond just checking inputs. In zero-trust environments:
No permanent credentials: Everything uses temporary tokens with the minimum required scope
Mutual TLS for script-to-service communication
Signed scripts: We GPG sign critical scripts and verify signatures before execution
Minimal attack surface: Scripts run in containers or VMs with only required tools installed
Security scanning: All scripts go through static analysis tools looking for common vulnerabilities
I've also learned to be paranoid about seemingly innocent things. That curl command downloading a script from GitHub? What if someone compromises the repo? Now I verify checksums and use pinned versions for everything.
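For that download example, the pinned-version-plus-checksum habit looks something like this (the URL is a placeholder and EXPECTED_SHA256 stands in for the checksum published alongside that exact release):
INSTALLER_URL="https://example.com/tool/v1.2.3/install.sh"
EXPECTED_SHA256="<the checksum published for that exact version>"
curl -fsSL "$INSTALLER_URL" -o /tmp/install.sh
echo "$EXPECTED_SHA256  /tmp/install.sh" | sha256sum -c - || {
    echo "Checksum mismatch - refusing to execute downloaded script" >&2
    exit 1
}
bash /tmp/install.sh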
Q8: Describe your approach to making shell scripts cloud-agnostic
After migrating between AWS, GCP, and Azure multiple times, I've learned:
Abstract provider-specific calls: Create wrapper functions for cloud operations
Use cloud SDK tools carefully: They change. Always pin versions and have fallbacks
Metadata services are different: Each cloud has its own way. Abstract this early
Storage is never just storage: S3, GCS, and Azure Blob have subtle differences
Example approach:
cloud_provider_detect() {
if curl -s -f -m 1 http://169.254.169.254/latest/meta-data/instance-id >/dev/null 2>&1; then
echo "aws"
elif curl -s -f -m 1 -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/id >/dev/null 2>&1; then
echo "gcp"
else
echo "azure" # Or onprem, needs more logic
fi
}
The key is planning for portability from day one. Retrofitting cloud-agnostic behavior is painful.
Q9: How do you handle backwards compatibility when system tools get updated?
This is a constant battle. GNU vs BSD tools, different bash versions, changing command options. My strategies:
Feature detection over version detection: Don't check if it's bash 4.3, check if it supports associative arrays
Wrapper functions: Abstract system commands behind functions that handle differences
Comprehensive testing matrix: Test on minimum supported versions of everything
Document requirements explicitly: Every script starts with a requirements block
I maintain compatibility libraries for common issues:
# Example: readlink portability
portable_readlink() {
if readlink -f "$1" 2>/dev/null; then
return
elif command -v greadlink >/dev/null; then
greadlink -f "$1"
else
# Fallback implementation
python -c "import os; print(os.path.realpath('$1'))"
fi
}
Q10: What are the most important lessons you've learned about shell scripting in production?
The scars teach the best lessons:
Observability beats cleverness: A simple script with great logging beats a clever script you can't debug
Plan for failure from line 1: Not just error handling, but operational failure - what happens when your script runs during a datacenter failover?
Other people will maintain your code: Write for them, not to show off. Comment the "why", not the "what"
Test the unhappy paths: Everyone tests success. Test what happens when that API returns 500, when the disk is full, when DNS is flaking
Know when to stop: Some problems shouldn't be solved in bash. Recognizing this saves weeks of pain
The meta-lesson: shell scripting in production is 20% writing code and 80% thinking about what could go wrong. Every senior engineer has war stories about simple scripts causing major outages. The difference is we've learned to be paranoid in productive ways.
Shell Scripting Coding Interview Questions and Answers For Experienced (5+ years Exp)
Senior-level advanced shell scripting interview questions go beyond algorithms to test battle-hardened production experience - can you write code that survives server crashes, handles race conditions, and scales across distributed systems? These coding challenges reflect real scenarios you've probably debugged at 3 AM: service orchestration failures, data corruption recovery, and the kind of edge cases that only show up after months in production. The solutions here demonstrate not just working code, but the defensive programming and architectural thinking that comes from years of learning things the hard way.
Q1: Write a distributed lock manager for coordinating jobs across multiple servers
This solves the classic problem of preventing duplicate cron jobs in clustered environments:
#!/bin/bash
# Distributed lock implementation using shared filesystem or Redis
# Handles stale locks, network partitions, and crash recovery
LOCK_DIR="/shared/locks"
REDIS_HOST="${REDIS_HOST:-localhost}"
REDIS_PORT="${REDIS_PORT:-6379}"
BACKEND="${LOCK_BACKEND:-filesystem}" # or "redis"
# Configuration
readonly SCRIPT_NAME=$(basename "$0")
readonly HOSTNAME=$(hostname -f)
readonly PID=$$
readonly LOCK_TIMEOUT=${LOCK_TIMEOUT:-300} # 5 minutes default
readonly LOCK_RETRY_INTERVAL=2
readonly STALE_LOCK_THRESHOLD=600 # 10 minutes
# Generate unique lock identifier
generate_lock_id() {
echo "${HOSTNAME}-${PID}-$(date +%s%N)"
}
# Filesystem-based lock implementation
fs_acquire_lock() {
local lock_name=$1
local lock_file="$LOCK_DIR/$lock_name.lock"
local lock_id=$(generate_lock_id)
local temp_file=$(mktemp)
# Write lock metadata
cat > "$temp_file" << EOF
{
"holder": "$lock_id",
"hostname": "$HOSTNAME",
"pid": $PID,
"acquired": $(date +%s),
"timeout": $LOCK_TIMEOUT,
"command": "$0 $*"
}
EOF
# Try atomic lock creation
if ln "$temp_file" "$lock_file" 2>/dev/null; then
rm -f "$temp_file"
echo "$lock_id"
return 0
fi
# Lock exists, check if stale
if [ -f "$lock_file" ]; then
local lock_age=$(($(date +%s) - $(stat -c %Y "$lock_file" 2>/dev/null || stat -f %m "$lock_file")))
local lock_data=$(cat "$lock_file" 2>/dev/null)
local lock_pid=$(echo "$lock_data" | grep -o '"pid": [0-9]*' | grep -o '[0-9]*')
local lock_host=$(echo "$lock_data" | grep -o '"hostname": "[^"]*"' | cut -d'"' -f4)
# Check if lock is stale
if [ $lock_age -gt $STALE_LOCK_THRESHOLD ]; then
echo "Removing stale lock (age: ${lock_age}s)" >&2
rm -f "$lock_file"
# Retry
fs_acquire_lock "$lock_name"
return $?
fi
# Check if process still exists (same host only)
if [ "$lock_host" = "$HOSTNAME" ] && [ -n "$lock_pid" ]; then
if ! kill -0 "$lock_pid" 2>/dev/null; then
echo "Removing lock from dead process $lock_pid" >&2
rm -f "$lock_file"
# Retry
fs_acquire_lock "$lock_name"
return $?
fi
fi
fi
rm -f "$temp_file"
return 1
}
fs_release_lock() {
local lock_name=$1
local lock_id=$2
local lock_file="$LOCK_DIR/$lock_name.lock"
# Verify we own the lock before releasing
if [ -f "$lock_file" ]; then
local current_holder=$(grep -o '"holder": "[^"]*"' "$lock_file" 2>/dev/null | cut -d'"' -f4)
if [ "$current_holder" = "$lock_id" ]; then
rm -f "$lock_file"
return 0
else
echo "Cannot release lock owned by $current_holder" >&2
return 1
fi
fi
return 0
}
# Redis-based lock implementation
redis_acquire_lock() {
local lock_name=$1
local lock_id=$(generate_lock_id)
local lock_key="dlock:$lock_name"
# Try to acquire lock with NX (only if not exists) and EX (expiry)
local result=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
SET "$lock_key" "$lock_id" NX EX "$LOCK_TIMEOUT" 2>/dev/null)
if [ "$result" = "OK" ]; then
# Store metadata separately
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" >/dev/null 2>&1 <<EOF
HSET "${lock_key}:meta" holder "$lock_id"
HSET "${lock_key}:meta" hostname "$HOSTNAME"
HSET "${lock_key}:meta" pid "$PID"
HSET "${lock_key}:meta" acquired "$(date +%s)"
EXPIRE "${lock_key}:meta" "$LOCK_TIMEOUT"
EOF
echo "$lock_id"
return 0
fi
# Check if lock is held by dead process on same host
local lock_host=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
HGET "${lock_key}:meta" hostname 2>/dev/null)
local lock_pid=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
HGET "${lock_key}:meta" pid 2>/dev/null)
if [ "$lock_host" = "$HOSTNAME" ] && [ -n "$lock_pid" ]; then
if ! kill -0 "$lock_pid" 2>/dev/null; then
echo "Removing lock from dead process $lock_pid" >&2
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" DEL "$lock_key" "${lock_key}:meta" >/dev/null
# Retry
redis_acquire_lock "$lock_name"
return $?
fi
fi
return 1
}
redis_release_lock() {
local lock_name=$1
local lock_id=$2
local lock_key="dlock:$lock_name"
# Use Lua script for atomic check-and-delete
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" --eval - "$lock_key" "$lock_id" <<'EOF' >/dev/null
if redis.call("GET", KEYS[1]) == ARGV[1] then
redis.call("DEL", KEYS[1], KEYS[1] .. ":meta")
return 1
else
return 0
end
EOF
}
# Main lock interface
acquire_lock() {
local lock_name=$1
local max_wait=${2:-0}
local waited=0
while true; do
local lock_id
case "$BACKEND" in
filesystem)
lock_id=$(fs_acquire_lock "$lock_name")
;;
redis)
lock_id=$(redis_acquire_lock "$lock_name")
;;
*)
echo "Unknown backend: $BACKEND" >&2
return 1
;;
esac
if [ -n "$lock_id" ]; then
echo "$lock_id"
return 0
fi
if [ $max_wait -gt 0 ] && [ $waited -ge $max_wait ]; then
return 1
fi
sleep $LOCK_RETRY_INTERVAL
waited=$((waited + LOCK_RETRY_INTERVAL))
done
}
release_lock() {
local lock_name=$1
local lock_id=$2
case "$BACKEND" in
filesystem)
fs_release_lock "$lock_name" "$lock_id"
;;
redis)
redis_release_lock "$lock_name" "$lock_id"
;;
esac
}
# Lock manager with automatic cleanup
with_lock() {
local lock_name=$1
shift
local lock_id
lock_id=$(acquire_lock "$lock_name" 30) # Wait up to 30 seconds
if [ -z "$lock_id" ]; then
echo "Failed to acquire lock: $lock_name" >&2
return 1
fi
# Set up cleanup trap
trap "release_lock '$lock_name' '$lock_id'" EXIT INT TERM
echo "Lock acquired: $lock_name (id: $lock_id)" >&2
# Execute command
"$@"
local exit_code=$?
# Release lock
release_lock "$lock_name" "$lock_id"
trap - EXIT INT TERM
return $exit_code
}
# Example usage for the interview
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
# Demo: critical section protection
critical_task() {
echo "Starting critical task on $HOSTNAME..."
sleep 10
echo "Critical task completed"
}
# Use distributed lock
with_lock "my-critical-task" critical_task
fi
This implementation handles real-world edge cases such as stale locks from crashed processes and network partitions, and it provides both a filesystem backend and a Redis backend to fit different infrastructure setups.
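For example, a nightly cron job that must run on only one node in the cluster could source this lock manager and wrap its work in with_lock. A minimal sketch, assuming the script above is installed at the hypothetical path /usr/local/lib/dlock.sh and the job is a simple function:
#!/bin/bash
# nightly-report.sh - hypothetical cron job protected by the distributed lock above
source /usr/local/lib/dlock.sh # sourcing is safe: the demo block only runs when executed directly
generate_report() {
echo "Generating nightly report on $(hostname)..."
sleep 5
}
# Every node can schedule this; only the node that wins the lock runs the job,
# the others give up after the 30-second wait built into with_lock
with_lock "nightly-report" generate_report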
Q2: Create a self-healing service monitor that automatically restarts failed services with exponential backoff
This handles the complexity of service management in production:
#!/bin/bash
# Self-healing service monitor with intelligent restart logic
# Prevents restart loops and implements circuit breaker pattern
readonly CONFIG_DIR="/etc/service-monitor"
readonly STATE_DIR="/var/lib/service-monitor"
readonly LOG_FILE="/var/log/service-monitor.log"
# Initialize directories
mkdir -p "$CONFIG_DIR" "$STATE_DIR"
# Service configuration example:
# {
# "name": "web-app",
# "check_cmd": "curl -sf http://localhost:8080/health",
# "start_cmd": "systemctl start web-app",
# "stop_cmd": "systemctl stop web-app",
# "restart_cmd": "systemctl restart web-app",
# "max_restarts": 5,
# "restart_window": 3600,
# "backoff_base": 2,
# "max_backoff": 300
# }
# Logging with severity
log() {
local level=$1
shift
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a "$LOG_FILE"
# Also log to syslog if available
if command -v logger >/dev/null 2>&1; then
logger -t "service-monitor" -p "daemon.$level" "$*"
fi
}
# Load service configuration
load_service_config() {
local service_name=$1
local config_file="$CONFIG_DIR/${service_name}.json"
if [ ! -f "$config_file" ]; then
log "error" "Configuration not found for service: $service_name"
return 1
fi
# Parse JSON (using jq if available, fallback to python)
if command -v jq >/dev/null 2>&1; then
jq -r 'to_entries | .[] | "\(.key)=\(.value)"' "$config_file"
else
python -c "
import json, sys
with open('$config_file') as f:
config = json.load(f)
for k, v in config.items():
print(f'{k}={v}')
"
fi
}
# Get service state
get_service_state() {
local service_name=$1
local state_file="$STATE_DIR/${service_name}.state"
if [ ! -f "$state_file" ]; then
# Initialize state
cat > "$state_file" << EOF
restart_count=0
last_restart=0
consecutive_failures=0
backoff_seconds=1
status=unknown
last_check=0
total_restarts=0
circuit_breaker=closed
EOF
fi
source "$state_file"
}
# Update service state
update_service_state() {
local service_name=$1
local state_file="$STATE_DIR/${service_name}.state"
shift
# Read current state
local temp_file=$(mktemp)
cp "$state_file" "$temp_file" 2>/dev/null || touch "$temp_file"
# Update specified values
while [ $# -gt 0 ]; do
local key="${1%%=*}"
local value="${1#*=}"
# Update or add key
if grep -q "^${key}=" "$temp_file"; then
sed -i "s/^${key}=.*/${key}=${value}/" "$temp_file"
else
echo "${key}=${value}" >> "$temp_file"
fi
shift
done
# Atomic update
mv "$temp_file" "$state_file"
}
# Calculate exponential backoff
calculate_backoff() {
local base=$1
local attempt=$2
local max_backoff=$3
local backoff=$((base ** attempt))
if [ $backoff -gt $max_backoff ]; then
backoff=$max_backoff
fi
# Add jitter (±25%) to prevent thundering herd; skip when backoff is too small to avoid modulo by zero
local jitter=$((backoff / 4))
if [ "$jitter" -gt 0 ]; then
local random_jitter=$((RANDOM % (jitter * 2) - jitter))
backoff=$((backoff + random_jitter))
fi
echo $backoff
}
# Check if service is healthy
check_service_health() {
local check_cmd=$1
# Execute health check with timeout
timeout 30 bash -c "$check_cmd" >/dev/null 2>&1
}
# Restart service with backoff
restart_service() {
local service_name=$1
local restart_cmd=$2
local backoff_base=$3
local max_backoff=$4
# Get current state
get_service_state "$service_name"
# Calculate backoff
local backoff=$(calculate_backoff "$backoff_base" "$consecutive_failures" "$max_backoff")
log "warning" "Restarting $service_name (attempt $((consecutive_failures + 1)), waiting ${backoff}s)"
# Wait with exponential backoff
sleep "$backoff"
# Attempt restart
if $restart_cmd; then
log "info" "Successfully restarted $service_name"
update_service_state "$service_name" \
"last_restart=$(date +%s)" \
"backoff_seconds=1" \
"total_restarts=$((total_restarts + 1))"
return 0
else
log "error" "Failed to restart $service_name"
return 1
fi
}
# Main monitoring loop for a service
monitor_service() {
local service_name=$1
# Load configuration
eval "$(load_service_config "$service_name")"
# Configuration validation
for required in check_cmd restart_cmd max_restarts restart_window; do
if [ -z "${!required}" ]; then
log "error" "Missing required config: $required for $service_name"
return 1
fi
done
# Set defaults
backoff_base=${backoff_base:-2}
max_backoff=${max_backoff:-300}
log "info" "Starting monitor for $service_name"
while true; do
# Get current state
get_service_state "$service_name"
# Check if circuit breaker is open
if [ "$circuit_breaker" = "open" ]; then
local circuit_break_duration=$(($(date +%s) - last_restart))
if [ $circuit_break_duration -lt 3600 ]; then
sleep 60
continue
else
# Try to close circuit breaker
log "info" "Attempting to close circuit breaker for $service_name"
update_service_state "$service_name" "circuit_breaker=half-open"
fi
fi
# Perform health check
if check_service_health "$check_cmd"; then
# Service is healthy
if [ "$status" != "healthy" ] || [ "$consecutive_failures" -gt 0 ]; then
log "info" "$service_name is healthy"
update_service_state "$service_name" \
"status=healthy" \
"consecutive_failures=0" \
"circuit_breaker=closed"
fi
else
# Service is unhealthy
log "error" "$service_name health check failed"
# Update failure count
consecutive_failures=$((consecutive_failures + 1))
update_service_state "$service_name" \
"status=unhealthy" \
"consecutive_failures=$consecutive_failures" \
"last_check=$(date +%s)"
# Check restart window
current_time=$(date +%s)
window_start=$((current_time - restart_window))
# Count restarts in window
recent_restarts=0
if [ -f "$STATE_DIR/${service_name}.history" ]; then
recent_restarts=$(awk -v start="$window_start" '$1 > start' \
"$STATE_DIR/${service_name}.history" | wc -l)
fi
# Check if we should restart
if [ $recent_restarts -ge $max_restarts ]; then
if [ "$circuit_breaker" != "open" ]; then
log "error" "Circuit breaker OPEN for $service_name (too many restarts)"
update_service_state "$service_name" "circuit_breaker=open"
# Send alert
send_alert "$service_name" "Circuit breaker opened - manual intervention required"
fi
else
# Attempt restart
if restart_service "$service_name" "$restart_cmd" "$backoff_base" "$max_backoff"; then
# Log restart
echo "$(date +%s) restart" >> "$STATE_DIR/${service_name}.history"
# Wait before next check to allow service to stabilize
sleep 30
else
# Restart failed, increase backoff
update_service_state "$service_name" \
"consecutive_failures=$consecutive_failures"
fi
fi
fi
# Wait before next check
sleep "${check_interval:-60}"
done
}
# Send alerts (integrate with your alerting system)
send_alert() {
local service=$1
local message=$2
# Example integrations
if [ -n "$SLACK_WEBHOOK" ]; then
curl -X POST "$SLACK_WEBHOOK" \
-H "Content-Type: application/json" \
-d "{\"text\": \"Service Monitor Alert: $service - $message\"}" \
2>/dev/null || true
fi
# Email alert
if command -v mail >/dev/null 2>&1 && [ -n "$ALERT_EMAIL" ]; then
echo "$message" | mail -s "Service Monitor: $service" "$ALERT_EMAIL"
fi
log "alert" "$service: $message"
}
# Main execution
main() {
# Monitor all configured services
for config_file in "$CONFIG_DIR"/*.json; do
[ -f "$config_file" ] || continue
service_name=$(basename "$config_file" .json)
monitor_service "$service_name" &
# Store PID for management
echo $! > "$STATE_DIR/${service_name}.pid"
done
# Wait for all monitors
wait
}
# Handle signals
trap 'log "info" "Shutting down service monitor"; pkill -P $$; exit 0' TERM INT
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
main "$@"
fi
Q3: Implement a log aggregation and analysis system that detects anomalies across distributed services
This demonstrates handling big data streams and pattern recognition:
#!/bin/bash
# Distributed log analysis system with anomaly detection
# Handles multiple log formats, performs statistical analysis, and alerts on anomalies
readonly AGGREGATOR_PORT=9514
readonly ANALYSIS_INTERVAL=60
readonly BASELINE_WINDOW=3600 # 1 hour
readonly ANOMALY_THRESHOLD=3 # Standard deviations
readonly STATE_DIR="/var/lib/log-analyzer"
readonly PATTERNS_FILE="/etc/log-analyzer/patterns.conf"
mkdir -p "$STATE_DIR"
# Pattern definitions for different log types
declare -A LOG_PATTERNS=(
["nginx"]='(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
["apache"]='(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
["syslog"]='(?P<time>\S+ \S+ \S+) (?P<host>\S+) (?P<process>[^\[]+)\[(?P<pid>\d+)\]: (?P<message>.*)'
["json"]='^\{.*\}$'
)
# Statistical functions
calculate_stats() {
# Input: newline-separated numbers
# Output: count mean stddev min max p95
awk '
{
data[NR] = $1
sum += $1
if (NR == 1 || $1 < min) min = $1
if (NR == 1 || $1 > max) max = $1
}
END {
if (NR == 0) {
print "0 0 0 0 0 0"
exit
}
mean = sum / NR
# Calculate variance
for (i = 1; i <= NR; i++) {
variance += (data[i] - mean) ^ 2
}
variance = variance / NR
stddev = sqrt(variance)
# Calculate 95th percentile
n = int(NR * 0.95)
if (n < 1) n = 1
# Sort for percentile
for (i = 1; i <= NR; i++) {
for (j = i + 1; j <= NR; j++) {
if (data[i] > data[j]) {
temp = data[i]
data[i] = data[j]
data[j] = temp
}
}
}
print NR, mean, stddev, min, max, data[n]
}
'
}
# Z-score anomaly detection
is_anomaly() {
local value=$1
local mean=$2
local stddev=$3
local threshold=${4:-$ANOMALY_THRESHOLD}
# Avoid division by zero
if (( $(echo "$stddev == 0" | bc -l) )); then
return 1
fi
# Calculate z-score
local zscore=$(echo "scale=2; ($value - $mean) / $stddev" | bc -l)
local abs_zscore=$(echo "scale=2; if ($zscore < 0) -$zscore else $zscore" | bc -l)
if (( $(echo "$abs_zscore > $threshold" | bc -l) )); then
return 0 # Is anomaly
else
return 1 # Not anomaly
fi
}
# Parse log line based on format
parse_log_line() {
local line=$1
local log_type=$2
case "$log_type" in
json)
# Parse JSON log
if command -v jq >/dev/null 2>&1; then
echo "$line" | jq -r '
@tsv "\(.timestamp // now | todate) \(.level // "INFO") \(.service // "unknown") \(.message // "")"
' 2>/dev/null
else
# Fallback to python
python3 -c "
import json, sys, datetime
try:
data = json.loads('$line')
timestamp = data.get('timestamp', datetime.datetime.now().isoformat())
level = data.get('level', 'INFO')
service = data.get('service', 'unknown')
message = data.get('message', '')
print(f'{timestamp}\t{level}\t{service}\t{message}')
except:
pass
"
fi
;;
nginx|apache)
# Parse using regex
python3 -c "
import re
pattern = r'${LOG_PATTERNS[$log_type]}'
match = re.match(pattern, '''$line''')
if match:
data = match.groupdict()
print(f\"{data.get('time', '')}\t{data.get('status', '')}\t{data.get('method', '')}\t{data.get('path', '')}\")
"
;;
*)
# Generic parsing
echo "$line" | awk '{print $1, $2, $3, $0}'
;;
esac
}
# Aggregate logs from multiple sources
aggregate_logs() {
local output_file=$1
local duration=$2
# Start log receiver
nc -l -k -p "$AGGREGATOR_PORT" > "$output_file" 2>/dev/null &
local nc_pid=$!
# Also collect local logs
if command -v journalctl >/dev/null 2>&1; then
journalctl -f --since="$duration seconds ago" >> "$output_file" 2>/dev/null &
local journal_pid=$!
fi
# Collect for specified duration
sleep "$duration"
# Stop collectors
kill $nc_pid 2>/dev/null
[ -n "$journal_pid" ] && kill $journal_pid 2>/dev/null
# Return line count
wc -l < "$output_file"
}
# Analyze log patterns and detect anomalies
analyze_logs() {
local log_file=$1
local analysis_output="$STATE_DIR/analysis_$(date +%s).json"
# Initialize metrics
declare -A error_counts
declare -A response_times
declare -A status_codes
declare -A service_errors
# Process each log line
while IFS= read -r line; do
# Detect log type
local log_type="generic"
if [[ "$line" =~ ^\{ ]]; then
log_type="json"
elif [[ "$line" =~ \"[A-Z]+\ .*HTTP ]]; then
log_type="nginx"
fi
# Parse line
local parsed=$(parse_log_line "$line" "$log_type")
[ -z "$parsed" ] && continue
# Extract fields based on log type
case "$log_type" in
json)
IFS=$'\t' read -r timestamp level service message <<< "$parsed"
# Count errors by service
if [[ "$level" =~ ERROR|CRITICAL|FATAL ]]; then
((service_errors["$service"]++))
fi
;;
nginx|apache)
IFS=$'\t' read -r timestamp status method path <<< "$parsed"
# Count status codes
((status_codes["$status"]++))
# Track error rates
if [[ "$status" =~ ^[45] ]]; then
((error_counts["http_errors"]++))
fi
;;
esac
done < "$log_file"
# Load historical baselines
local baseline_file="$STATE_DIR/baseline.json"
if [ -f "$baseline_file" ]; then
# Compare with baseline
for service in "${!service_errors[@]}"; do
local current_errors=${service_errors[$service]}
local baseline_mean=$(jq -r ".services.$service.error_rate.mean // 0" "$baseline_file")
local baseline_stddev=$(jq -r ".services.$service.error_rate.stddev // 1" "$baseline_file")
if is_anomaly "$current_errors" "$baseline_mean" "$baseline_stddev"; then
log "alert" "Anomaly detected: $service error rate is $current_errors (baseline: $baseline_mean ± $baseline_stddev)"
send_anomaly_alert "$service" "error_rate" "$current_errors" "$baseline_mean" "$baseline_stddev"
fi
done
fi
# Generate analysis report
cat > "$analysis_output" << EOF
{
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"total_lines": $(wc -l < "$log_file"),
"services": {
EOF
# Add service metrics
local first=true
for service in "${!service_errors[@]}"; do
[ "$first" = true ] && first=false || echo "," >> "$analysis_output"
cat >> "$analysis_output" << EOF
"$service": {
"error_count": ${service_errors[$service]},
"error_rate": $(echo "scale=2; ${service_errors[$service]} * 100 / $(wc -l < "$log_file")" | bc)
}
EOF
done
echo -e "\n },\n \"status_codes\": {" >> "$analysis_output"
# Add status code distribution
first=true
for status in "${!status_codes[@]}"; do
[ "$first" = true ] && first=false || echo "," >> "$analysis_output"
echo -n " \"$status\": ${status_codes[$status]}" >> "$analysis_output"
done
echo -e "\n }\n}" >> "$analysis_output"
# Update baseline with exponential moving average
update_baseline "$analysis_output"
echo "$analysis_output"
}
# Update baseline metrics
update_baseline() {
local current_analysis=$1
local baseline_file="$STATE_DIR/baseline.json"
local alpha=0.3 # EMA weight for new data
if [ ! -f "$baseline_file" ]; then
# Initialize baseline
cp "$current_analysis" "$baseline_file"
return
fi
# Update baseline using exponential moving average
python3 - "$baseline_file" "$current_analysis" "$alpha" << 'EOF'
import json
import sys
baseline_file, current_file, alpha = sys.argv[1:4]
alpha = float(alpha)
with open(baseline_file) as f:
baseline = json.load(f)
with open(current_file) as f:
current = json.load(f)
# Update service metrics
for service, metrics in current.get('services', {}).items():
if service not in baseline.get('services', {}):
baseline.setdefault('services', {})[service] = {
'error_rate': {'mean': 0, 'stddev': 1}
}
# Update using EMA
old_mean = baseline['services'][service]['error_rate']['mean']
new_value = metrics['error_rate']
new_mean = alpha * new_value + (1 - alpha) * old_mean
# Update variance estimate
old_var = baseline['services'][service]['error_rate'].get('variance', 1)
new_var = alpha * ((new_value - new_mean) ** 2) + (1 - alpha) * old_var
baseline['services'][service]['error_rate'] = {
'mean': new_mean,
'stddev': new_var ** 0.5,
'variance': new_var
}
# Save updated baseline
with open(baseline_file, 'w') as f:
json.dump(baseline, f, indent=2)
EOF
}
# Send anomaly alerts
send_anomaly_alert() {
local service=$1
local metric=$2
local current_value=$3
local baseline_mean=$4
local baseline_stddev=$5
local zscore=$(echo "scale=2; ($current_value - $baseline_mean) / $baseline_stddev" | bc -l)
# Create alert payload
local alert_data=$(cat << EOF
{
"service": "$service",
"metric": "$metric",
"current_value": $current_value,
"baseline_mean": $baseline_mean,
"baseline_stddev": $baseline_stddev,
"z_score": $zscore,
"timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"severity": "$([ $(echo "$zscore > 5" | bc) -eq 1 ] && echo "critical" || echo "warning")"
}
EOF
)
# Send to various channels
if [ -n "$WEBHOOK_URL" ]; then
curl -X POST "$WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d "$alert_data" \
2>/dev/null || true
fi
# Log alert
echo "$alert_data" >> "$STATE_DIR/alerts.jsonl"
# Execute custom alert handler if defined
if [ -x "/etc/log-analyzer/alert-handler.sh" ]; then
echo "$alert_data" | /etc/log-analyzer/alert-handler.sh
fi
}
# Main monitoring loop
main() {
log "info" "Starting distributed log analyzer"
while true; do
local temp_log=$(mktemp)
# Aggregate logs
log "info" "Collecting logs for $ANALYSIS_INTERVAL seconds"
local line_count=$(aggregate_logs "$temp_log" "$ANALYSIS_INTERVAL")
log "info" "Collected $line_count log lines"
# Analyze logs
if [ "$line_count" -gt 0 ]; then
local analysis_file=$(analyze_logs "$temp_log")
log "info" "Analysis complete: $analysis_file"
# Clean up old analysis files (keep last 24 hours)
find "$STATE_DIR" -name "analysis_*.json" -mtime +1 -delete
fi
rm -f "$temp_log"
# Wait before next cycle
sleep 10
done
}
# Logging function
log() {
local level=$1
shift
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" >&2
}
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
main "$@"
fi
Q4: Build a zero-downtime deployment system with automatic rollback on failure
This handles complex deployment orchestration:
#!/bin/bash
# Zero-downtime deployment system with health checks and automatic rollback
# Supports blue-green and rolling deployments across multiple servers
readonly DEPLOY_DIR="/opt/deployments"
readonly CONFIG_FILE="/etc/deploy/config.json"
readonly STATE_FILE="/var/lib/deploy/state.json"
readonly HEALTH_CHECK_RETRIES=10
readonly HEALTH_CHECK_INTERVAL=5
readonly CANARY_PERCENTAGE=10
# Deployment strategies
declare -A STRATEGIES=(
["blue-green"]="deploy_blue_green"
["rolling"]="deploy_rolling"
["canary"]="deploy_canary"
)
# Initialize directories
mkdir -p "$(dirname "$STATE_FILE")" "$DEPLOY_DIR"
# Atomic state management
update_state() {
local temp_file=$(mktemp)
if [ -f "$STATE_FILE" ]; then
cp "$STATE_FILE" "$temp_file"
else
echo "{}" > "$temp_file"
fi
# Update state using jq
local key=$1
local value=$2
jq --arg k "$key" --arg v "$value" '.[$k] = $v' "$temp_file" > "${temp_file}.new"
mv "${temp_file}.new" "$STATE_FILE"
rm -f "$temp_file"
}
get_state() {
local key=$1
if [ -f "$STATE_FILE" ]; then
jq -r --arg k "$key" '.[$k] // empty' "$STATE_FILE"
fi
}
# Load balancer management (HAProxy example)
update_load_balancer() {
local action=$1
local server=$2
local backend=${3:-"webservers"}
case "$action" in
drain)
echo "set server $backend/$server state drain" | \
socat stdio /var/run/haproxy.sock
;;
ready)
echo "set server $backend/$server state ready" | \
socat stdio /var/run/haproxy.sock
;;
maint)
echo "disable server $backend/$server" | \
socat stdio /var/run/haproxy.sock
;;
enable)
echo "enable server $backend/$server" | \
socat stdio /var/run/haproxy.sock
;;
esac
}
# Wait for connection draining
wait_for_drain() {
local server=$1
local backend=${2:-"webservers"}
local max_wait=60
local waited=0
log "info" "Waiting for connections to drain from $server"
while [ $waited -lt $max_wait ]; do
local conn_count=$(echo "show servers state $backend" | \
socat stdio /var/run/haproxy.sock | \
grep "$server" | awk '{print $7}')
if [ "$conn_count" -eq 0 ]; then
log "info" "Server $server drained successfully"
return 0
fi
sleep 2
waited=$((waited + 2))
done
log "warning" "Timeout waiting for $server to drain (still $conn_count connections)"
return 1
}
# Health check implementation
perform_health_check() {
local server=$1
local health_endpoint=$2
local expected_response=${3:-"200"}
local attempt=1
while [ $attempt -le $HEALTH_CHECK_RETRIES ]; do
log "debug" "Health check attempt $attempt for $server"
local response=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 5 \
--max-time 10 \
"http://$server$health_endpoint")
if [[ "$response" =~ $expected_response ]]; then
log "info" "Health check passed for $server"
return 0
fi
log "warning" "Health check failed for $server (got $response, expected $expected_response)"
sleep $HEALTH_CHECK_INTERVAL
attempt=$((attempt + 1))
done
log "error" "Health check failed for $server after $HEALTH_CHECK_RETRIES attempts"
return 1
}
# Deploy to single server
deploy_to_server() {
local server=$1
local artifact=$2
local deploy_user=${3:-"deploy"}
local deploy_path=${4:-"/var/www/app"}
log "info" "Deploying $artifact to $server"
# Create deployment directory with timestamp
local timestamp=$(date +%Y%m%d_%H%M%S)
local new_release_path="${deploy_path}/releases/${timestamp}"
# Pre-deployment commands
ssh "$deploy_user@$server" "mkdir -p ${deploy_path}/releases"
# Transfer artifact
if ! scp "$artifact" "$deploy_user@$server:/tmp/deploy_${timestamp}.tar.gz"; then
log "error" "Failed to transfer artifact to $server"
return 1
fi
# Extract and prepare
if ! ssh "$deploy_user@$server" "
set -e
cd ${deploy_path}/releases
mkdir -p ${timestamp}
tar -xzf /tmp/deploy_${timestamp}.tar.gz -C ${timestamp}
rm -f /tmp/deploy_${timestamp}.tar.gz
# Run pre-deployment hooks if exists
if [ -x ${timestamp}/deploy/pre-deploy.sh ]; then
cd ${timestamp}
./deploy/pre-deploy.sh
fi
"; then
log "error" "Failed to prepare deployment on $server"
return 1
fi
# Store current version for rollback
local current_link=$(ssh "$deploy_user@$server" "readlink ${deploy_path}/current" 2>/dev/null || echo "")
if [ -n "$current_link" ]; then
update_state "previous_release_${server}" "$current_link"
fi
# Atomic switch
if ! ssh "$deploy_user@$server" "
ln -sfn ${new_release_path} ${deploy_path}/current
# Run post-deployment hooks
if [ -x ${new_release_path}/deploy/post-deploy.sh ]; then
cd ${new_release_path}
./deploy/post-deploy.sh
fi
# Reload/restart services
sudo systemctl reload nginx || true
sudo systemctl restart app || true
"; then
log "error" "Failed to activate deployment on $server"
return 1
fi
log "info" "Successfully deployed to $server"
return 0
}
# Rollback single server
rollback_server() {
local server=$1
local deploy_user=${2:-"deploy"}
local deploy_path=${3:-"/var/www/app"}
local previous_release=$(get_state "previous_release_${server}")
if [ -z "$previous_release" ]; then
log "error" "No previous release found for $server"
return 1
fi
log "warning" "Rolling back $server to $previous_release"
ssh "$deploy_user@$server" "
ln -sfn $previous_release ${deploy_path}/current
sudo systemctl reload nginx || true
sudo systemctl restart app || true
"
}
# Blue-Green deployment strategy
deploy_blue_green() {
local artifact=$1
local config=$2
# Parse configuration
local blue_servers=($(echo "$config" | jq -r '.blue_servers[]'))
local green_servers=($(echo "$config" | jq -r '.green_servers[]'))
local health_endpoint=$(echo "$config" | jq -r '.health_check.endpoint // "/health"')
# Determine which environment is currently live
local live_env=$(get_state "live_environment")
local target_env
local target_servers
if [ "$live_env" = "blue" ]; then
target_env="green"
target_servers=("${green_servers[@]}")
else
target_env="blue"
target_servers=("${blue_servers[@]}")
fi
log "info" "Starting blue-green deployment to $target_env environment"
update_state "deployment_id" "$(date +%s)"
update_state "deployment_status" "in_progress"
# Deploy to target environment
local failed_servers=()
for server in "${target_servers[@]}"; do
if ! deploy_to_server "$server" "$artifact"; then
failed_servers+=("$server")
fi
done
if [ ${#failed_servers[@]} -gt 0 ]; then
log "error" "Deployment failed on servers: ${failed_servers[*]}"
update_state "deployment_status" "failed"
return 1
fi
# Health check all target servers
log "info" "Running health checks on $target_env environment"
for server in "${target_servers[@]}"; do
if ! perform_health_check "$server" "$health_endpoint"; then
log "error" "Health check failed for $server"
update_state "deployment_status" "failed"
# Rollback is not needed in blue-green as we haven't switched traffic
return 1
fi
done
# Switch traffic
log "info" "Switching traffic to $target_env environment"
for server in "${target_servers[@]}"; do
update_load_balancer "enable" "$server"
done
# Drain old environment
local old_servers
if [ "$target_env" = "blue" ]; then
old_servers=("${green_servers[@]}")
else
old_servers=("${blue_servers[@]}")
fi
for server in "${old_servers[@]}"; do
update_load_balancer "drain" "$server"
done
# Wait for draining
for server in "${old_servers[@]}"; do
wait_for_drain "$server"
update_load_balancer "maint" "$server"
done
# Update state
update_state "live_environment" "$target_env"
update_state "deployment_status" "completed"
update_state "deployment_end" "$(date +%s)"
log "info" "Blue-green deployment completed successfully"
}
# Rolling deployment strategy
deploy_rolling() {
local artifact=$1
local config=$2
# Parse configuration
local servers=($(echo "$config" | jq -r '.servers[]'))
local batch_size=$(echo "$config" | jq -r '.batch_size // 1')
local health_endpoint=$(echo "$config" | jq -r '.health_check.endpoint // "/health"')
local pause_between_batches=$(echo "$config" | jq -r '.pause_seconds // 30')
log "info" "Starting rolling deployment (batch size: $batch_size)"
update_state "deployment_id" "$(date +%s)"
update_state "deployment_status" "in_progress"
# Process servers in batches
local total_servers=${#servers[@]}
local deployed=0
while [ $deployed -lt $total_servers ]; do
local batch_servers=()
local batch_end=$((deployed + batch_size))
# Get servers for this batch
for ((i=deployed; i<batch_end && i<total_servers; i++)); do
batch_servers+=("${servers[$i]}")
done
log "info" "Processing batch: ${batch_servers[*]}"
# Drain servers in batch
for server in "${batch_servers[@]}"; do
update_load_balancer "drain" "$server"
done
# Wait for all to drain
for server in "${batch_servers[@]}"; do
wait_for_drain "$server"
done
# Deploy to batch
local batch_failed=false
for server in "${batch_servers[@]}"; do
if ! deploy_to_server "$server" "$artifact"; then
log "error" "Deployment failed on $server"
batch_failed=true
break
fi
# Health check immediately
if ! perform_health_check "$server" "$health_endpoint"; then
log "error" "Health check failed for $server"
batch_failed=true
break
fi
# Re-enable in load balancer
update_load_balancer "ready" "$server"
done
if [ "$batch_failed" = true ]; then
log "error" "Batch deployment failed, initiating rollback"
# Rollback all deployed servers
for ((i=0; i<deployed+${#batch_servers[@]}; i++)); do
rollback_server "${servers[$i]}"
update_load_balancer "ready" "${servers[$i]}"
done
update_state "deployment_status" "failed_rollback_completed"
return 1
fi
deployed=$((deployed + ${#batch_servers[@]}))
# Pause between batches if not the last batch
if [ $deployed -lt $total_servers ]; then
log "info" "Pausing $pause_between_batches seconds before next batch"
sleep "$pause_between_batches"
fi
done
update_state "deployment_status" "completed"
update_state "deployment_end" "$(date +%s)"
log "info" "Rolling deployment completed successfully"
}
# Canary deployment strategy
deploy_canary() {
local artifact=$1
local config=$2
# Parse configuration
local servers=($(echo "$config" | jq -r '.servers[]'))
local canary_duration=$(echo "$config" | jq -r '.canary_duration // 300')
local success_rate_threshold=$(echo "$config" | jq -r '.success_rate_threshold // 99.5')
local health_endpoint=$(echo "$config" | jq -r '.health_check.endpoint // "/health"')
# Calculate canary size
local total_servers=${#servers[@]}
local canary_count=$((total_servers * CANARY_PERCENTAGE / 100))
[ $canary_count -lt 1 ] && canary_count=1
log "info" "Starting canary deployment ($canary_count of $total_servers servers)"
# Deploy to canary servers
local canary_servers=("${servers[@]:0:$canary_count}")
for server in "${canary_servers[@]}"; do
update_load_balancer "drain" "$server"
wait_for_drain "$server"
if ! deploy_to_server "$server" "$artifact"; then
log "error" "Canary deployment failed on $server"
rollback_server "$server"
update_load_balancer "ready" "$server"
return 1
fi
if ! perform_health_check "$server" "$health_endpoint"; then
log "error" "Canary health check failed for $server"
rollback_server "$server"
update_load_balancer "ready" "$server"
return 1
fi
update_load_balancer "ready" "$server"
done
# Monitor canary servers
log "info" "Monitoring canary servers for $canary_duration seconds"
local start_time=$(date +%s)
local end_time=$((start_time + canary_duration))
while [ $(date +%s) -lt $end_time ]; do
# Check error rates
local error_rate=$(calculate_error_rate "${canary_servers[@]}")
local success_rate=$(echo "100 - $error_rate" | bc)
if (( $(echo "$success_rate < $success_rate_threshold" | bc -l) )); then
log "error" "Canary success rate ($success_rate%) below threshold ($success_rate_threshold%)"
# Rollback canary servers
for server in "${canary_servers[@]}"; do
update_load_balancer "drain" "$server"
done
for server in "${canary_servers[@]}"; do
wait_for_drain "$server"
rollback_server "$server"
update_load_balancer "ready" "$server"
done
return 1
fi
sleep 10
done
log "info" "Canary phase successful, proceeding with full deployment"
# Deploy to remaining servers using rolling strategy
local remaining_servers=("${servers[@]:$canary_count}")
local remaining_config=$(echo "$config" | jq --argjson servers "$(printf '%s\n' "${remaining_servers[@]}" | jq -R . | jq -s .)" '.servers = $servers')
deploy_rolling "$artifact" "$remaining_config"
}
# Calculate error rate from logs/metrics
calculate_error_rate() {
local servers=("$@")
# This would integrate with your metrics system
# For demo, returning mock value
echo "0.5"
}
# Logging
log() {
local level=$1
shift
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a /var/log/deploy.log
}
# Main deployment orchestrator
main() {
local artifact=$1
local strategy=${2:-"rolling"}
if [ ! -f "$artifact" ]; then
log "error" "Artifact not found: $artifact"
exit 1
fi
if [ ! -f "$CONFIG_FILE" ]; then
log "error" "Configuration not found: $CONFIG_FILE"
exit 1
fi
local config=$(cat "$CONFIG_FILE")
# Validate strategy
if [ -z "${STRATEGIES[$strategy]}" ]; then
log "error" "Unknown deployment strategy: $strategy"
exit 1
fi
# Execute deployment
${STRATEGIES[$strategy]} "$artifact" "$config"
}
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
main "$@"
fi
Q5: Create a disaster recovery automation system that handles failover between regions
This demonstrates complex orchestration and state management:
#!/bin/bash
# Multi-region disaster recovery automation system
# Handles database replication, DNS failover, and application state synchronization
readonly DR_CONFIG="/etc/disaster-recovery/config.json"
readonly STATE_DIR="/var/lib/dr-automation"
readonly HEALTH_CHECK_INTERVAL=30
readonly FAILOVER_THRESHOLD=3 # Consecutive failures before failover
readonly RPO_WARNING_THRESHOLD=300 # 5 minutes
mkdir -p "$STATE_DIR"
# Region health tracking
declare -A REGION_HEALTH
declare -A REGION_FAILURES
declare -A LAST_REPLICATION_LAG
# Load configuration
load_config() {
if [ ! -f "$DR_CONFIG" ]; then
log "error" "DR configuration not found"
exit 1
fi
# Export configuration as environment variables
eval "$(jq -r '
.regions[] |
"REGION_\(.name | ascii_upcase)_ENDPOINT=\"\(.endpoint)\"",
"REGION_\(.name | ascii_upcase)_DB_HOST=\"\(.database.host)\"",
"REGION_\(.name | ascii_upcase)_DB_PORT=\"\(.database.port // 5432)\""
' "$DR_CONFIG")"
# Get all regions
REGIONS=($(jq -r '.regions[].name' "$DR_CONFIG"))
PRIMARY_REGION=$(jq -r '.primary_region' "$DR_CONFIG")
}
# Comprehensive health check for a region
check_region_health() {
local region=$1
local endpoint_var="REGION_${region^^}_ENDPOINT"
local endpoint="${!endpoint_var}"
log "debug" "Checking health of region: $region"
# Application health check
local app_health=false
local http_status=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 5 --max-time 10 \
"https://$endpoint/health" || echo "000")
if [ "$http_status" = "200" ]; then
app_health=true
fi
# Database health check
local db_health=false
local db_host_var="REGION_${region^^}_DB_HOST"
local db_port_var="REGION_${region^^}_DB_PORT"
local db_host="${!db_host_var}"
local db_port="${!db_port_var}"
if pg_isready -h "$db_host" -p "$db_port" -t 5 >/dev/null 2>&1; then
db_health=true
fi
# Check replication lag if not primary
local replication_ok=true
if [ "$region" != "$PRIMARY_REGION" ]; then
local lag=$(check_replication_lag "$region")
LAST_REPLICATION_LAG[$region]=$lag
if [ "$lag" -gt "$RPO_WARNING_THRESHOLD" ]; then
log "warning" "High replication lag for $region: ${lag}s"
replication_ok=false
fi
fi
# Overall health status
if [ "$app_health" = true ] && [ "$db_health" = true ] && [ "$replication_ok" = true ]; then
REGION_HEALTH[$region]="healthy"
REGION_FAILURES[$region]=0
return 0
else
REGION_HEALTH[$region]="unhealthy"
REGION_FAILURES[$region]=$((${REGION_FAILURES[$region]:-0} + 1))
log "error" "Region $region health check failed - App: $app_health, DB: $db_health, Replication: $replication_ok"
return 1
fi
}
# Check database replication lag
check_replication_lag() {
local region=$1
local db_host_var="REGION_${region^^}_DB_HOST"
local db_host="${!db_host_var}"
# Get replication lag in seconds
local lag=$(psql -h "$db_host" -U postgres -t -c "
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int
WHERE pg_is_in_recovery();" 2>/dev/null | tr -d ' ')
if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
echo "0"
else
echo "$lag"
fi
}
# Perform DNS failover
perform_dns_failover() {
local from_region=$1
local to_region=$2
local domain=$(jq -r '.domain' "$DR_CONFIG")
local dns_provider=$(jq -r '.dns_provider' "$DR_CONFIG")
log "warning" "Initiating DNS failover from $from_region to $to_region"
case "$dns_provider" in
route53)
# AWS Route53 failover
local hosted_zone=$(aws route53 list-hosted-zones-by-name \
--query "HostedZones[?Name=='${domain}.'].Id" \
--output text | cut -d'/' -f3)
local to_endpoint_var="REGION_${to_region^^}_ENDPOINT"
local to_endpoint="${!to_endpoint_var}"
# Update DNS record
aws route53 change-resource-record-sets \
--hosted-zone-id "$hosted_zone" \
--change-batch "{
\"Changes\": [{
\"Action\": \"UPSERT\",
\"ResourceRecordSet\": {
\"Name\": \"${domain}\",
\"Type\": \"CNAME\",
\"TTL\": 60,
\"ResourceRecords\": [{\"Value\": \"${to_endpoint}\"}]
}
}]
}"
;;
cloudflare)
# Cloudflare API
local zone_id=$(curl -s -X GET "https://api.cloudflare.com/client/v4/zones?name=$domain" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | \
jq -r '.result[0].id')
local record_id=$(curl -s -X GET \
"https://api.cloudflare.com/client/v4/zones/$zone_id/dns_records?name=$domain" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | \
jq -r '.result[0].id')
local to_endpoint_var="REGION_${to_region^^}_ENDPOINT"
local to_endpoint="${!to_endpoint_var}"
curl -X PUT "https://api.cloudflare.com/client/v4/zones/$zone_id/dns_records/$record_id" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
-H "Content-Type: application/json" \
--data "{\"type\":\"CNAME\",\"name\":\"$domain\",\"content\":\"$to_endpoint\",\"ttl\":60}"
;;
esac
log "info" "DNS failover initiated, propagation may take up to 60 seconds"
}
# Promote standby database to primary
promote_database() {
local region=$1
local db_host_var="REGION_${region^^}_DB_HOST"
local db_host="${!db_host_var}"
log "warning" "Promoting database in region $region to primary"
# Execute promotion (PostgreSQL example)
ssh "postgres@$db_host" "pg_ctl promote -D /var/lib/postgresql/data"
# Update application configuration to point to new primary
local app_servers=($(jq -r ".regions[] | select(.name==\"$region\") | .app_servers[]" "$DR_CONFIG"))
for server in "${app_servers[@]}"; do
ssh "deploy@$server" "
sed -i 's/^DB_HOST=.*/DB_HOST=$db_host/' /etc/app/config
sudo systemctl restart app
"
done
}
# Sync application state
sync_application_state() {
local from_region=$1
local to_region=$2
log "info" "Syncing application state from $from_region to $to_region"
# Sync Redis/cache data
local from_redis=$(jq -r ".regions[] | select(.name==\"$from_region\") | .redis.host" "$DR_CONFIG")
local to_redis=$(jq -r ".regions[] | select(.name==\"$to_region\") | .redis.host" "$DR_CONFIG")
# Create Redis dump and restore
ssh "redis@$from_redis" "redis-cli BGSAVE && sleep 2"
ssh "redis@$from_redis" "cat /var/lib/redis/dump.rdb" | \
ssh "redis@$to_redis" "cat > /var/lib/redis/dump.rdb && redis-cli FLUSHALL && redis-server --loadmodule"
# Sync session data if using file-based sessions
local from_servers=($(jq -r ".regions[] | select(.name==\"$from_region\") | .app_servers[]" "$DR_CONFIG"))
local to_servers=($(jq -r ".regions[] | select(.name==\"$to_region\") | .app_servers[]" "$DR_CONFIG"))
# Use parallel for efficiency
printf '%s\n' "${from_servers[@]}" | parallel -j 4 "
rsync -av
Bash Shell Scripting Interview Questions And Answers
Whether you're just starting out or have years of experience, bash shell interview questions test your understanding of the shell's unique features, syntax quirks, and best practices that make bash the go-to choice for system automation. We'll cover the essential bash-specific concepts that interviewers love to explore - from array manipulation and string operations to process substitution and advanced parameter expansion. These questions focus on bash's distinctive capabilities that set it apart from basic POSIX shell scripting.
Q1: What's the difference between [ ] and [[ ]] in bash?
This is a classic that trips up many people. The single bracket [ ] is actually a command (same as test), while double brackets [[ ]] are a bash keyword with enhanced functionality:
[[ ]] supports pattern matching: [[ $string == pa* ]] works, but [ $string == pa* ] doesn't
[[ ]] handles empty variables better: [[ -z $empty ]] works fine, while [ -z $empty ] needs quotes
[[ ]] supports regex matching with =~ operator
[[ ]] allows && and || inside: [[ $a -eq 1 && $b -eq 2 ]]
The key insight: always use [[ ]] in bash unless you need POSIX compatibility. It's safer and more powerful.
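A quick sketch showing these differences in practice:
string="patience"
[[ $string == pa* ]] && echo "glob match" # pattern matching works unquoted
[[ $string =~ ^pa.+ce$ ]] && echo "regex match" # =~ regex support
unset empty
[[ -z $empty ]] && echo "empty handled safely" # no quoting needed inside [[ ]]
[[ $string == pa* && ${#string} -gt 5 ]] && echo "&& works inside [[ ]]"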
Q2: Explain bash arrays and how they differ from regular variables
Bash supports both indexed and associative arrays, which many people don't fully utilize:
Indexed arrays:
arr=(apple banana cherry)
echo ${arr[0]} # apple
echo ${arr[@]} # all elements
echo ${#arr[@]} # array length (3)
Associative arrays (bash 4+):
declare -A hash
hash[name]="John"
hash[age]=30
echo ${hash[name]} # John
The tricky part is that bash arrays are sparse - you can have arr[0] and arr[10] without arr[1-9]. Also, array indices start at 0, not 1 like in some shells.
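For example, sparse indices look like this:
arr=(apple banana cherry)
unset 'arr[1]' # remove banana; the indices are now sparse
arr[10]="kiwi"
echo ${#arr[@]} # 3 elements, even though the highest index is 10
echo ${!arr[@]} # 0 2 10 - the indices actually in use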
Q3: How does command substitution work and what's the difference between `` and $()?
Both capture command output, but $() is strongly preferred:
$() nests easily: echo $(echo $(date))
Backticks require escaping for nesting: echo `echo \`date\``
$() is clearer to read, especially with syntax highlighting
$() handles quotes more predictably
Example showing the difference:
# This works fine
result=$(echo "hello $(echo "world")")
# This is a nightmare with backticks
result=`echo "hello \`echo \"world\"\`"`
Q4: What are bash parameter expansions and give some advanced examples?
Parameter expansion is bash's Swiss Army knife for string manipulation:
${var:-default} - Use default if var is unset/null
${var:=default} - Set var to default if unset/null
${var:?error} - Exit with error if var is unset/null
${var:+alternate} - Use alternate if var is set
Advanced string manipulation:
path="/home/user/documents/file.txt"
echo ${path##*/} # file.txt (basename)
echo ${path%/*} # /home/user/documents (dirname)
echo ${path/documents/archive} # String replacement
echo ${path^^} # Convert to uppercase
echo ${path,,} # Convert to lowercase
The pattern: # removes from beginning, % from end. Double it (##, %%) for greedy matching.
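A quick illustration with a double extension:
file="archive.tar.gz"
echo ${file%.*} # archive.tar (shortest suffix match removed)
echo ${file%%.*} # archive (longest suffix match removed)
echo ${file#*.} # tar.gz (shortest prefix match removed)
echo ${file##*.} # gz (longest prefix match removed)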
Q5: Explain the different types of expansions in bash and their order
Bash performs expansions in a specific order, and understanding this prevents many bugs:
Brace expansion: {a,b,c} or {1..10}
Tilde expansion: ~ becomes home directory
Parameter/variable expansion: $var or ${var}
Command substitution: $(command) or `command`
Arithmetic expansion: $((2 + 2))
Word splitting: Based on IFS
Pathname expansion: *.txt (globbing)
Why this matters:
# Brace expansion runs before parameter expansion, so braces stored in a variable never expand
range="{1..3}"
echo $range # prints {1..3}, not 1 2 3
echo {1..3} # prints 1 2 3
# Pathname expansion runs last, so an unquoted glob stored in a variable does expand
var="*.txt"
ls $var # lists the .txt files; quote it (ls "$var") to pass the literal pattern
Q6: What is process substitution and when would you use it?
Process substitution <(command) creates a temporary file descriptor that acts like a file:
# Compare outputs of two commands
diff <(ls dir1) <(ls dir2)
# Read from multiple commands simultaneously
while read -u 3 line1 && read -u 4 line2; do
echo "File1: $line1 | File2: $line2"
done 3< <(cat file1) 4< <(cat file2)
# Use with commands that only accept files
grep "pattern" <(curl -s https://example.com)
It's incredibly useful when you need to treat command output as a file without creating temporary files. The key limitation: it's not POSIX, so it's bash/zsh specific.
Q7: How do you handle signals and traps in bash?
Traps let you intercept signals and run cleanup code:
# Basic trap syntax
trap 'echo "Ctrl+C pressed"' INT
trap 'cleanup_function' EXIT
trap 'echo "Error on line $LINENO"' ERR
# Remove trap
trap - INT
# List all signals
trap -l
# Useful pattern: cleanup on any exit
cleanup() {
rm -f /tmp/tempfile.$$
echo "Cleaned up"
}
trap cleanup EXIT INT TERM
Pro tip: Always use single quotes in trap commands to prevent premature expansion. The ERR trap is especially useful with set -e for debugging.
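Here's the quoting pitfall in action:
tempfile=$(mktemp)
trap "rm -f $tempfile" EXIT # expands now - breaks if $tempfile is reassigned later
trap 'rm -f $tempfile' EXIT # expands when the trap fires - usually what you want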
Q8: Explain bash's set options and their impact on script behavior
The set command changes bash's behavior fundamentally:
set -e (errexit): Exit on any command failure
set -u (nounset): Exit on undefined variable
set -o pipefail: Pipe fails if any command fails
set -x (xtrace): Print commands before execution
Best practice combo:
set -euo pipefail # Strict mode
set -E # ERR trap inherits to functions
set -T # DEBUG/RETURN traps inherit to functions
You can also use set +e to temporarily disable, and $- contains current options. The gotcha: set -e doesn't work in certain contexts like if conditions.
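For example, errexit is suspended while a command runs as part of an if test:
set -e
check() {
false # would normally abort the script under set -e
echo "still running" # but inside an if condition, execution continues
}
if check; then # the whole function runs with errexit disabled here
echo "check reported success"
fi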
Q9: What are coprocesses in bash and how do you use them?
Coprocesses (bash 4+) let you run background processes with two-way communication:
# Start a coprocess
coproc BC { bc; }
# Send data to coprocess
echo "2+2" >&${BC[1]}
# Read result
read result <&${BC[0]}
echo $result # 4
# More complex example with named coprocess
coproc GREP { grep --line-buffered "error"; }
echo "this is fine" >&${GREP[1]}
echo "error occurred" >&${GREP[1]}
read -t 1 line <&${GREP[0]} && echo "Got: $line"
Coprocesses are useful for maintaining persistent connections to programs like bc, awk, or database clients without the overhead of starting new processes.
Q10: How does bash handle arithmetic and what are the different ways to perform calculations?
Bash offers multiple arithmetic methods, each with different capabilities:
# Arithmetic expansion - integers only
result=$((5 + 3))
((count++)) # C-style increment
# let command
let "result = 5 + 3"
# expr command (external, avoid)
result=$(expr 5 + 3)
# bc for floating point
result=$(echo "scale=2; 5/3" | bc)
# Arithmetic comparison in conditions
if ((5 > 3)); then echo "yes"; fi
# C-style for loop
for ((i=0; i<10; i++)); do
echo $i
done
Bash only supports integer arithmetic natively; for floating point, use bc or awk. The ((...)) construct exits with status 0 when the expression evaluates to non-zero (true) and with status 1 when it evaluates to zero, which is the reverse of how the numbers themselves read.
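You can see the exit-status behavior directly:
(( 5 > 3 )); echo $? # 0 - the expression is true, so the command succeeds
(( 0 )); echo $? # 1 - the expression is zero/false, so the command fails
count=0
(( count++ )) || true # post-increment evaluates to 0 first, so guard it under set -e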
Q11: What's the difference between source, ., and executing a script?
These three methods have crucial differences:
./script.sh - Runs in a subshell, can't modify parent environment
source script.sh or . script.sh - Runs in current shell, can modify environment
exec script.sh - Replaces current shell process entirely
Example showing the difference:
# In script.sh:
export MY_VAR="hello"
# Method 1: Won't see MY_VAR after
./script.sh
echo $MY_VAR # Empty
# Method 2: Will see MY_VAR
source script.sh
echo $MY_VAR # hello
# Method 3: Never returns to original shell
exec script.sh
echo "Never printed"
Use source for configuration files, ./ for independent scripts, and exec for wrapper scripts.
Q12: How do you properly handle spaces and special characters in bash?
This is where most scripts break. Key strategies:
# Always quote variables
rm $file # bad - breaks on spaces and globs
rm "$file" # good
# Use arrays for multiple items
files=("file 1.txt" "file 2.txt")
for f in "${files[@]}"; do
cat "$f"
done
# Read with proper IFS handling
while IFS= read -r line; do
echo "$line"
done < file.txt
# Handle filenames with newlines
find . -name "*.txt" -print0 | while IFS= read -r -d '' file; do
echo "Processing: $file"
done
# Safe command construction
cmd=(ls -la "/path with spaces/")
"${cmd[@]}" # Executes correctly
The golden rule: If it's not a numeric value you're absolutely sure about, quote it. Use "$@" not $* for arguments, and always use -r with read to prevent backslash interpretation.
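A tiny demo of why "$@" matters:
print_args() {
for arg in "$@"; do echo "[$arg]"; done
}
print_args "file one.txt" "file two.txt" # two arguments, spaces preserved
# With unquoted $* the same call would be split into four words:
# [file] [one.txt] [file] [two.txt]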
Conclusion
We have discussed the top shell scripting interview questions and answers that may come up in a DevOps interview. For brevity, we have listed only some of the most important questions and answers. More questions and answers are covered when you take up the DevOps course at StarAgile institute. StarAgile is an industry-recognized institute for online DevOps training, which you can take from the comfort of your home or office. We recommend that you enquire about the DevOps training at StarAgile institute and register to gain the various certification benefits.