Shell Scripting Interview Questions for DevOps

Prepare for your DevOps interview with these shell scripting interview questions and position yourself for success in your career.
Published on Nov 20, 2020

What is Shell Scripting

Shell scripting is basically writing a series of commands in a text file that your computer's terminal can run automatically. Think of it like creating a recipe where, instead of cooking steps, you're telling your computer what tasks to do in order. It's super handy for automating repetitive stuff like backing up files, managing system tasks, or running multiple programs with just one command. Most people use Bash shell on Linux/Mac, but there's also PowerShell on Windows - either way, it saves tons of time once you get the hang of it.

How to prepare for a Shell Scripting Interview

Get your basics rock solid - Make sure you're comfortable with fundamental commands like ls, grep, sed, awk, and pipes. Most interview questions for shell scripting start with these building blocks, so practice combining them in different ways.
Practice writing actual scripts - Don't just memorize syntax. Write scripts for real tasks like file backups, log parsing, or system monitoring. Interviewers love seeing practical problem-solving skills.
Master the common shell scripting interview questions - You'll definitely get asked about variables, loops, conditionals, and functions. Practice explaining the difference between $@ and $*, when to use double vs single quotes, and how exit codes work.
Debug like crazy - Learn to use set -x for debugging and understand error handling with trap commands. Being able to troubleshoot broken scripts on the spot is a huge plus.
Know your shell differences - Understand what makes bash different from sh, ksh, or zsh. Some companies are picky about POSIX compliance.
Study real-world scenarios - Look up common automation tasks like monitoring disk space, rotating logs, or deploying applications. These make great discussion points during interviews.
Practice explaining your code - The best script in the world won't help if you can't walk through your logic clearly. Practice talking through your problem-solving approach out loud.
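The "$@" versus "$*" distinction mentioned above trips up many candidates, so it helps to have a concrete sketch ready. In this minimal example (count_args is a made-up helper), the only difference is whether an argument containing a space survives as one argument:

```bash
#!/bin/bash
# Illustrative sketch: how "$@" and "$*" differ when arguments contain spaces.
count_args() { echo $#; }

set -- "one two" three      # two positional parameters; the first contains a space
count_args "$@"             # "$@" keeps arguments separate -> prints 2
count_args "$*"             # "$*" joins them into a single word -> prints 1
```

Unquoted, $@ and $* behave identically (both word-split); the difference only shows up inside double quotes.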

Shell Scripting Interview Questions and Answers For Beginners
Starting your journey with shell scripting interviews? Don't worry - most bash scripting interview questions for beginners focus on practical basics rather than complex theory. We'll walk through the most common bash shell interview questions that entry-level positions typically cover, helping you understand not just what to answer, but why things work the way they do. These questions come up constantly because they test whether you can actually write useful scripts, not just memorize commands.
Q1: What exactly is a shell script?
Think of it as a text file filled with commands you'd normally type one by one in the terminal. Instead of typing them manually every time, you save them in a file and run them all at once. It's like creating a macro for your terminal - super helpful when you need to do the same tasks repeatedly.
Q2: How do I create my first shell script?
Open any text editor (nano, vim, or even notepad), type your commands, and save it with a .sh extension. Here's the simplest example:
bash
#!/bin/bash
echo "My first script!"
Save it as myscript.sh, make it executable with chmod +x myscript.sh, then run with ./myscript.sh.
Q3: Why do some variables have $ and others don't?
You use $ when you want to get the value from a variable, but not when you're setting it. Setting: name="John" (no $). Using: echo "Hello $name" (with $). It's like the difference between writing on a nametag versus reading what's already written there.
Q4: What are those -f, -d, -e things I see in if statements?
These are test operators that check file conditions. -f checks if something is a regular file, -d checks for directories, -e checks if anything exists at that path. There are tons more: -r for readable, -w for writable, -x for executable. They're your script's way of looking before it leaps.
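A short sketch puts these operators in order (check_path is a hypothetical helper name, not a standard command):

```bash
#!/bin/bash
# Classify a path using the file test operators described above.
check_path() {
    if   [ -d "$1" ]; then echo "directory"
    elif [ -f "$1" ]; then echo "regular file"
    elif [ -e "$1" ]; then echo "exists (other type)"
    else                   echo "missing"
    fi
}

check_path /etc            # "directory" on most Unix systems
check_path /no/such/path   # "missing"
```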
Q5: How do I do basic math in bash?
Bash is quirky with math - you need special syntax. Use $((expression)) for arithmetic: result=$((5 + 3)). For decimals, bash can't handle them natively - you'd need bc: echo "5.5 + 2.3" | bc. Remember, spaces don't matter inside $(( )), which is unusual for bash.
Q6: What's this 2>&1 thing I keep seeing?
This redirects error messages to wherever regular output is going. The 2 represents stderr (error messages), 1 represents stdout (normal output), and & tells bash "I mean the file descriptor, not a file named 1". So command > output.txt 2>&1 sends both regular output and errors to output.txt.
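A quick sketch shows both streams landing in the same place (make_noise is an illustrative function name):

```bash
#!/bin/bash
# Sketch: capture stdout and stderr together with 2>&1.
make_noise() {
    echo "normal output"          # goes to stdout (fd 1)
    echo "error output" >&2       # goes to stderr (fd 2)
}

tmp=$(mktemp)
make_noise > "$tmp" 2>&1          # both streams end up in the same file
cat "$tmp"
rm -f "$tmp"
```

Order matters: `2>&1 > output.txt` redirects stderr to the old stdout destination first, which is usually not what you want.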
Q7: How do I check if a command succeeded?
Every command returns an exit code - 0 for success, anything else for failure. Check it right after the command:
bash
cp file1 file2
if [ $? -eq 0 ]; then
    echo "Copy worked!"
else
    echo "Copy failed!"
fi
Q8: What's the deal with spaces in bash?
Bash is super picky about spaces, especially in conditions. if [ $a = $b ] needs spaces around the brackets and operators. But a=5 can't have spaces around the =. It's annoying at first, but you'll develop muscle memory for it pretty quickly.
Q9: How do I loop through a list of items?
The for loop is your friend here. For a simple list:
bash
for fruit in apple banana orange; do
    echo "I like $fruit"
done
For files: for file in *.txt; do something; done. The loop automatically splits on spaces unless you mess with IFS.
Q10: Can I use variables from outside my script?
Yes! These are environment variables. Your script can read system variables like $HOME, $USER, $PATH. To pass your own, either export them first (export MYVAR="value") or set them inline: MYVAR="value" ./myscript.sh. Just remember they're readable, not writable - changes inside the script won't affect the outside.
Q11: What happens if my script crashes midway?
By default, bash keeps going even if commands fail. To make it stop on errors, add set -e near the top. Want to see what's happening? Add set -x for debug mode. Combine them as set -ex. Just remember these make your script stricter - sometimes you want to handle errors yourself instead.
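To see the difference, here is a minimal sketch of the same failing command with and without set -e (the function names are illustrative, and each runs in a subshell so set -e stays contained):

```bash
#!/bin/bash
# Sketch: how set -e changes behavior after a failing command.
lenient() (
    false                      # fails, but execution continues
    echo "still running"
)
strict() (
    set -e
    false                      # set -e aborts the subshell here
    echo "never reached"
)

lenient                        # prints: still running
strict || echo "stopped at the first error"
```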
Q12: How do I get input from users?
The read command is what you need. Basic usage: read username waits for input and stores it. Fancy it up with read -p "What's your name? " username for a prompt. Want a password? Use read -s to hide what they type. Set a timeout with read -t 10 to wait only 10 seconds.
Shell Scripting Coding Interview Questions and Answers For Beginners
Ready to tackle actual shell scripting coding interview questions? These hands-on problems test whether you can write working scripts, not just talk about concepts. We'll cover the classic coding challenges that beginners face in interviews - the ones where you need to write real bash code to solve practical problems. Each solution includes the full script with explanations, so you can understand the logic and adapt it for similar questions.
Q1: Write a script to check if a number is even or odd
This tests your understanding of arithmetic operations and conditionals:
bash
#!/bin/bash

echo "Enter a number: "
read num

# Check if the input is actually a number
if ! [[ "$num" =~ ^[0-9]+$ ]]; then
    echo "Please enter a valid number"
    exit 1
fi

# Use modulo operator to check even/odd
if [ $((num % 2)) -eq 0 ]; then
    echo "$num is even"
else
    echo "$num is odd"
fi
The key here is using the modulo operator (%) inside $(( )). If a number divided by 2 has no remainder, it's even. The regex check prevents the script from breaking on non-numeric input.
Q2: Create a script that counts files in a directory by extension
This problem tests file handling and associative arrays:
bash
#!/bin/bash

# Check if directory argument provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <directory>"
    exit 1
fi

directory="$1"

# Check if directory exists
if [ ! -d "$directory" ]; then
    echo "Directory $directory doesn't exist"
    exit 1
fi

# Count files by extension
echo "File count by extension in $directory:"
for file in "$directory"/*; do
    # Skip if no files found
    [ -e "$file" ] || continue
    
    # Skip directories
    [ -f "$file" ] || continue
    
    # Extract extension
    extension="${file##*.}"
    
    # Handle files without extension
    if [ "$extension" = "$file" ]; then
        extension="no_extension"
    fi
    
    # Count using associative array would be ideal, but 
    # for beginners, let's use a simpler approach
    echo "$extension"
done | sort | uniq -c | sort -rn
This solution uses parameter expansion (${file##*.}) to extract extensions and pipes the output through sort and uniq to count occurrences. It's a common pattern in shell scripting - using Unix tools together.
Q3: Write a script to backup files modified in the last 7 days
This tests your knowledge of the find command and date handling:
bash
#!/bin/bash

# Set source and backup directories
source_dir="${1:-.}"  # Use current directory if not specified
backup_dir="backup_$(date +%Y%m%d)"

# Create backup directory
mkdir -p "$backup_dir"

echo "Starting backup of files modified in last 7 days..."

# Find and copy files modified in last 7 days
count=0
while IFS= read -r file; do
    # Skip if file is in backup directory
    [[ "$file" == *"$backup_dir"* ]] && continue
    
    # Create subdirectory structure in backup
    rel_path="${file#$source_dir/}"
    backup_path="$backup_dir/$rel_path"
    backup_subdir="$(dirname "$backup_path")"
    
    mkdir -p "$backup_subdir"
    
    # Copy file preserving timestamps
    if cp -p "$file" "$backup_path" 2>/dev/null; then
        ((count++))
        echo "Backed up: $rel_path"
    fi
done < <(find "$source_dir" -type f -mtime -7)

echo "Backup complete! $count files copied to $backup_dir"
The script uses find with -mtime -7 to locate recent files. The while read loop handles filenames with spaces correctly, and reading from process substitution (< <(find ...)) instead of a pipe keeps the loop in the current shell, so the count variable still holds its value after the loop ends. We preserve the directory structure in the backup and use cp -p to keep timestamps.
Q4: Create a script that monitors disk usage and alerts when it exceeds 80%
This demonstrates system monitoring and parsing command output:
bash
#!/bin/bash

# Set threshold
threshold=80

echo "Checking disk usage..."
echo "========================"

# Get disk usage and process each line
df -h | grep -E '^/dev/' | while read -r line; do
    # Parse the df output
    device=$(echo "$line" | awk '{print $1}')
    usage=$(echo "$line" | awk '{print $5}' | sed 's/%//')
    mount=$(echo "$line" | awk '{print $6}')
    
    # Check if usage exceeds threshold
    if [ "$usage" -gt "$threshold" ]; then
        echo "WARNING: $mount ($device) is at $usage% capacity!"
        
        # You could add email notification here
        # echo "Disk $mount is at $usage%" | mail -s "Disk Alert" admin@example.com
    else
        echo "OK: $mount ($device) is at $usage% capacity"
    fi
done

# Also check if any partition is critically full (>90%)
critical=$(df -h | grep -E '^/dev/' | awk '{print $5}' | sed 's/%//' | sort -n | tail -1)
if [ "$critical" -gt 90 ]; then
    echo ""
    echo "CRITICAL: At least one partition is over 90% full!"
    exit 1
fi
This script parses df output using awk to extract specific fields. The percentage sign is stripped with sed so we can do numeric comparison. Real-world scripts might send emails or write to log files.
Q5: Write a script that validates and processes a CSV file
This tests file parsing and error handling:
bash
#!/bin/bash

# Check if file provided
if [ $# -eq 0 ]; then
    echo "Usage: $0 <csv_file>"
    exit 1
fi

csv_file="$1"

# Check if file exists and is readable
if [ ! -r "$csv_file" ]; then
    echo "Error: Cannot read file $csv_file"
    exit 1
fi

echo "Processing CSV file: $csv_file"
echo "================================"

# Process header
header=$(head -1 "$csv_file")
num_fields=$(echo "$header" | awk -F',' '{print NF}')
echo "Header fields ($num_fields): $header"
echo ""

# Process data lines
line_num=1
error_count=0
valid_count=0

while IFS= read -r line; do
    ((line_num++))
    
    # Count fields on the raw line before splitting it
    current_fields=$(echo "$line" | awk -F',' '{print NF}')
    
    if [ "$current_fields" -ne "$num_fields" ]; then
        echo "Error on line $line_num: Expected $num_fields fields, got $current_fields"
        ((error_count++))
        continue
    fi
    
    # Split the line into its fields
    IFS=',' read -r field1 field2 field3 remainder <<< "$line"
    
    # Basic validation - check if fields are not empty
    if [ -z "$field1" ] || [ -z "$field2" ] || [ -z "$field3" ]; then
        echo "Error on line $line_num: Empty required fields"
        ((error_count++))
        continue
    fi
    
    # Process valid line (example: just echo it)
    echo "Processing: $field1 | $field2 | $field3"
    ((valid_count++))
done < <(tail -n +2 "$csv_file")

echo ""
echo "Summary: Processed $valid_count valid lines, found $error_count errors"
This script shows proper CSV handling using IFS (Internal Field Separator) and read. Counting fields on the raw line keeps the check accurate, and feeding the loop from process substitution instead of a pipe means the counters still hold their values when the summary prints. The tail -n +2 skips the header for processing.

Shell Scripting Interview Questions and Answers For Intermediate (2-4 years Exp)
Moving beyond the basics, intermediate shell scripting coding interview questions focus on real-world problem solving, performance optimization, and handling complex scenarios you've likely encountered in production. We'll explore questions that test your experience with error handling, process management, and writing maintainable scripts that other developers can understand and modify. These questions reflect the challenges you face when scripts move from personal tools to team resources that need to be reliable and efficient.
Q1: How do you handle errors gracefully in production scripts?
At this level, you need multiple strategies working together. I use trap to catch errors and cleanup, set -euo pipefail to stop on failures, and custom error functions:
#!/bin/bash
set -euo pipefail

# Global error handler
error_exit() {
    echo "Error on line $1: $2" >&2
    cleanup
    exit 1
}

cleanup() {
    # Remove temp files, unlock resources, etc
    rm -f /tmp/mylockfile
    echo "Cleanup completed"
}

# Set trap for errors and exit
trap 'error_exit $LINENO "Command failed"' ERR
trap cleanup EXIT

# Your actual script logic here

The key is combining these techniques - set -e stops on errors, -u catches undefined variables, -o pipefail catches errors in pipes, and trap ensures cleanup happens no matter what.
Q2: Explain process substitution and when you'd use it
Process substitution <(command) creates a temporary file descriptor that acts like a file. It's brilliant for comparing outputs or feeding multiple processes:
# Compare two command outputs
diff <(ls dir1) <(ls dir2)

# Feed sorted data to a command expecting a file
join <(sort file1) <(sort file2)

# Multiple inputs to a single command
paste <(cut -d: -f1 /etc/passwd) <(cut -d: -f3 /etc/passwd)

I use it when I need to avoid temporary files or when working with commands that only accept file arguments but I want to give them dynamic data.
Q3: How do you implement proper logging in shell scripts?
Production scripts need structured logging with timestamps, levels, and rotation. Here's my go-to pattern:
LOG_FILE="/var/log/myscript.log"
LOG_LEVEL=${LOG_LEVEL:-"INFO"}  # Can override via environment

log() {
    local level=$1
    shift
    local message="$*"
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    
    # Check log level
    case $LOG_LEVEL in
        ERROR) [[ $level =~ ^(ERROR)$ ]] || return ;;
        WARN)  [[ $level =~ ^(ERROR|WARN)$ ]] || return ;;
        INFO)  [[ $level =~ ^(ERROR|WARN|INFO)$ ]] || return ;;
        DEBUG) ;; # Log everything
    esac
    
    echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
}

# Usage
log INFO "Script started"
log ERROR "Connection failed"
log DEBUG "Variable X = $x"

Don't forget log rotation - either use logrotate or implement size checking in your script.
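A minimal size-based check might look like this (the path and the 1 MB limit are illustrative values, not part of any standard):

```bash
#!/bin/bash
# Sketch: rotate the log once it grows past a size limit.
LOG_FILE="${LOG_FILE:-/tmp/myscript.log}"
MAX_BYTES=$((1024 * 1024))            # 1 MB, adjust to taste

rotate_if_needed() {
    [ -f "$LOG_FILE" ] || return 0
    local size
    size=$(wc -c < "$LOG_FILE")       # wc -c is portable across GNU/BSD
    if [ "$size" -ge "$MAX_BYTES" ]; then
        mv "$LOG_FILE" "${LOG_FILE}.1"   # keep one previous generation
        : > "$LOG_FILE"                   # start a fresh, empty log
    fi
}
```

Call rotate_if_needed at script start or before each batch of writes; for anything beyond a single previous generation, logrotate remains the better tool.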
Q4: How do you handle concurrent script execution?
Preventing race conditions is crucial. I typically use flock for reliable locking:
LOCK_FILE="/var/run/myscript.lock"
exec 200>"$LOCK_FILE"

if ! flock -n 200; then
    echo "Another instance is running. Exiting."
    exit 1
fi

# Alternative: wait for lock with timeout
if ! flock -w 10 200; then
    echo "Could not acquire lock after 10 seconds"
    exit 1
fi

# Script continues here with exclusive lock

For more complex scenarios, I might use mkdir as an atomic operation or implement a PID file system that checks if the process is actually still running.
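The mkdir approach can be sketched like this (LOCK_DIR and the function names are illustrative): mkdir either creates the directory or fails, atomically, so only one process can ever hold the lock.

```bash
#!/bin/bash
# Sketch: mkdir-based locking as an atomic operation.
LOCK_DIR="${LOCK_DIR:-/tmp/myscript.lock.d}"

acquire_lock() {
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$$" > "$LOCK_DIR/pid"   # record the owner for debugging
        return 0
    fi
    return 1                           # someone else holds the lock
}

release_lock() {
    rm -rf "$LOCK_DIR"
}
```

Pair acquire_lock with `trap release_lock EXIT` so the lock is freed even if the script dies; checking the recorded PID with kill -0 lets you detect and clean up stale locks.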
Q5: What's your approach to parsing complex command line arguments?
For production scripts, I use getopts for short options and manual parsing for long options:
usage() {
    cat << EOF
Usage: $0 [-h] [-v] [-f FILE] [--debug] [--config CONFIG]
Options:
    -h, --help      Show this help
    -v, --verbose   Enable verbose mode
    -f FILE         Input file
    --debug         Enable debug mode
    --config FILE   Configuration file
EOF
}

# Parse arguments
VERBOSE=0
DEBUG=0
while [[ $# -gt 0 ]]; do
    case $1 in
        -h|--help)
            usage
            exit 0
            ;;
        -v|--verbose)
            VERBOSE=1
            shift
            ;;
        -f)
            INPUT_FILE="$2"
            shift 2
            ;;
        --debug)
            DEBUG=1
            set -x  # Enable bash debugging
            shift
            ;;
        --config)
            CONFIG_FILE="$2"
            shift 2
            ;;
        --)
            shift
            break
            ;;
        -*)
            echo "Unknown option: $1" >&2
            usage
            exit 1
            ;;
        *)
            break
            ;;
    esac
done

# Remaining arguments are in $@

Q6: How do you optimize scripts that process large files?
Never load entire files into memory. Instead, stream process them:
# Bad: reads the whole file into memory and word-splits on whitespace
for line in $(cat hugefile.txt); do
    process "$line"
done

# Good: streams line by line
while IFS= read -r line; do
    process "$line"
done < hugefile.txt

# Better: parallel processing for CPU-bound tasks
cat hugefile.txt | parallel -j 4 process {}

# For simple transformations, use tools designed for it
awk '{sum += $3} END {print sum}' hugefile.txt  # Sum third column
sed -i 's/old/new/g' hugefile.txt  # In-place replacement

Also consider splitting large files and processing chunks in parallel, or using tools like split, sort -S for memory limits, and join instead of loading everything into associative arrays.
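The split-and-parallelize idea can be sketched as follows (process_chunk is a stand-in for real per-chunk work, here it just counts lines, and the 1000-line chunk size is illustrative):

```bash
#!/bin/bash
# Sketch: split a large file into chunks, process chunks in parallel
# background jobs, then combine the per-chunk results.
process_chunk() { wc -l < "$1"; }

parallel_count() {
    local input=$1 workdir
    workdir=$(mktemp -d)
    split -l 1000 "$input" "$workdir/chunk_"
    for chunk in "$workdir"/chunk_*; do
        process_chunk "$chunk" > "$chunk.out" &
    done
    wait                                  # block until every job finishes
    cat "$workdir"/*.out | awk '{s += $1} END {print s}'
    rm -rf "$workdir"
}
```

This only pays off when per-chunk work is CPU-bound; for I/O-bound streaming, a single awk or sed pass is usually faster.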
Q7: Explain your strategy for making scripts portable across different systems
Portability is tricky. I start with POSIX compliance when possible, but pragmatically handle differences:
#!/usr/bin/env bash
# env-based shebang is more portable than hard-coding /bin/bash

# Detect OS
if [[ "$OSTYPE" == "linux-gnu"* ]]; then
    OS="linux"
elif [[ "$OSTYPE" == "darwin"* ]]; then
    OS="macos"
elif [[ "$OSTYPE" == "freebsd"* ]]; then
    OS="freebsd"
fi

# Handle command differences
if command -v gsed &> /dev/null; then
    SED=gsed  # GNU sed on Mac
else
    SED=sed
fi

# Check for required commands
for cmd in awk grep "$SED"; do
    if ! command -v "$cmd" &> /dev/null; then
        echo "Required command '$cmd' not found" >&2
        exit 1
    fi
done

# Use portable options
find . -type f -name "*.txt" -print0 | xargs -0 grep "pattern"  # Works everywhere

Q8: How do you implement timeout functionality for long-running commands?
The timeout command is great when available, but here's a portable solution:
# Using timeout command (if available)
if command -v timeout &> /dev/null; then
    timeout 30s long_running_command
else
    # Manual implementation
    long_running_command &
    pid=$!
    count=0
    while kill -0 $pid 2>/dev/null && [ $count -lt 30 ]; do
        sleep 1
        ((count++))
    done
    
    if kill -0 $pid 2>/dev/null; then
        echo "Command timed out after 30 seconds"
        kill -TERM $pid
        sleep 2
        kill -0 $pid 2>/dev/null && kill -KILL $pid
    fi
fi

# Alternative: using read with timeout for user input
if read -t 10 -p "Enter value (10 sec timeout): " value; then
    echo "You entered: $value"
else
    echo "Timeout or cancelled"
fi

Q9: What's your approach to debugging complex shell scripts?
Beyond set -x, I use structured debugging:
# Debug function with levels
DEBUG_LEVEL=${DEBUG_LEVEL:-0}

debug() {
    local level=$1
    shift
    if [ "$DEBUG_LEVEL" -ge "$level" ]; then
        echo "[DEBUG$level] $*" >&2
    fi
}

# Trace function execution
trace() {
    local func=$1
    shift
    debug 1 "Entering $func with args: $*"
    local start=$(date +%s.%N)
    
    "$func" "$@"
    local ret=$?
    
    local end=$(date +%s.%N)
    local duration=$(echo "$end - $start" | bc)
    debug 1 "Exiting $func (duration: ${duration}s, return: $ret)"
    return $ret
}

# PS4 for better trace output
export PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'

# Conditional debugging
[ "$DEBUG" = "1" ] && set -x

I also use ShellCheck religiously, and for really tough bugs, I'll use bashdb or add extensive logging.
Q10: How do you manage configuration in shell scripts?
I layer configurations from multiple sources with clear precedence:
# Default configuration
declare -A CONFIG=(
    [db_host]="localhost"
    [db_port]="5432"
    [log_level]="INFO"
)

# Load system config
[ -f /etc/myapp/config ] && source /etc/myapp/config

# Load user config
[ -f ~/.myapp/config ] && source ~/.myapp/config

# Load project config
[ -f ./config ] && source ./config

# Environment variables override everything
[ -n "$MYAPP_DB_HOST" ] && CONFIG[db_host]=$MYAPP_DB_HOST
[ -n "$MYAPP_DB_PORT" ] && CONFIG[db_port]=$MYAPP_DB_PORT

# Validate required settings
for key in db_host db_port; do
    if [ -z "${CONFIG[$key]}" ]; then
        echo "Error: Missing required config: $key" >&2
        exit 1
    fi
done

# Export for child processes if needed
export MYAPP_CONFIG_DB_HOST="${CONFIG[db_host]}"

This gives flexibility while maintaining predictable behavior - defaults, system-wide settings, user preferences, project overrides, and finally environment variables.

Shell Scripting Coding Interview Questions and Answers For Intermediate (2-4 years Exp)
At the intermediate level, shell scripting coding interview questions shift from basic syntax to solving real production challenges - the kind where your script needs to handle thousands of files, recover from failures, and play nice with other systems. We'll tackle the practical problems that someone with a few years under their belt should handle confidently, focusing on efficiency, reliability, and maintainability. These are the scenarios where your experience shows through in how you structure your solution, not just whether it works.
Q1: Write a script that monitors a log file in real-time and sends alerts for specific error patterns
This tests your ability to handle continuous file monitoring and pattern matching:
#!/bin/bash

# Configuration
LOG_FILE="${1:-/var/log/application.log}"
ALERT_EMAIL="admin@company.com"
ERROR_PATTERNS=("ERROR" "CRITICAL" "FATAL" "OutOfMemory")
ALERT_THRESHOLD=5  # Alert after 5 errors in 60 seconds
WINDOW_SIZE=60

# Check if log file exists
if [ ! -f "$LOG_FILE" ]; then
    echo "Error: Log file $LOG_FILE not found"
    exit 1
fi

# Initialize error tracking
declare -A error_counts
declare -A error_timestamps

# Function to send alert
send_alert() {
    local pattern=$1
    local count=$2
    local message="Alert: $count occurrences of '$pattern' in last $WINDOW_SIZE seconds"
    
    # In real scenario, you'd send email or post to Slack
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $message"
    
    # Example email command (commented out)
    # echo "$message" | mail -s "Log Alert: $pattern" $ALERT_EMAIL
    
    # Reset counter after alert
    error_counts[$pattern]=0
    error_timestamps[$pattern]=""
}

# Function to clean old timestamps
clean_old_timestamps() {
    local pattern=$1
    local current_time=$(date +%s)
    local new_timestamps=""
    
    # Keep only timestamps within window
    for ts in ${error_timestamps[$pattern]}; do
        if [ $((current_time - ts)) -lt $WINDOW_SIZE ]; then
            new_timestamps="$new_timestamps $ts"
        fi
    done
    
    error_timestamps[$pattern]="$new_timestamps"
    error_counts[$pattern]=$(echo "$new_timestamps" | wc -w)
}

echo "Monitoring $LOG_FILE for error patterns..."
echo "Patterns: ${ERROR_PATTERNS[*]}"
echo "Alert threshold: $ALERT_THRESHOLD errors in $WINDOW_SIZE seconds"
echo "Press Ctrl+C to stop"

# Monitor log file
tail -F "$LOG_FILE" | while read -r line; do
    # Check each pattern
    for pattern in "${ERROR_PATTERNS[@]}"; do
        if [[ "$line" =~ $pattern ]]; then
            current_time=$(date +%s)
            
            # Add timestamp
            error_timestamps[$pattern]="${error_timestamps[$pattern]} $current_time"
            
            # Clean old timestamps and update count
            clean_old_timestamps "$pattern"
            
            # Check if threshold exceeded
            if [ "${error_counts[$pattern]}" -ge "$ALERT_THRESHOLD" ]; then
                send_alert "$pattern" "${error_counts[$pattern]}"
            fi
            
            # Debug output
            echo "[$(date '+%H:%M:%S')] Found: $pattern (count: ${error_counts[$pattern]})"
        fi
    done
done

The script uses tail -F (capital F) to follow log rotation, tracks errors within a time window, and only alerts when thresholds are exceeded. This prevents alert spam while catching real issues.
Q2: Create a script that performs parallel backup of multiple directories with progress tracking
This demonstrates process management and parallel execution:
#!/bin/bash

# Source directories to backup
SOURCE_DIRS=(
    "/home/user/documents"
    "/home/user/projects"
    "/var/www/html"
    "/etc"
)

BACKUP_ROOT="/backup/$(date +%Y%m%d_%H%M%S)"
MAX_PARALLEL=3
COMPRESSION="gzip"  # or "bzip2", "xz"

# Create backup directory
mkdir -p "$BACKUP_ROOT"

# File to track progress
PROGRESS_FILE="/tmp/backup_progress_$$"
rm -f "$PROGRESS_FILE"
touch "$PROGRESS_FILE"

# Function to backup a single directory
backup_directory() {
    local src_dir=$1
    local dir_name=$(basename "$src_dir")
    local backup_file="$BACKUP_ROOT/${dir_name}.tar.gz"
    local pid=$BASHPID   # subshell PID; $$ would be the same parent PID in every job
    local progress_tmp="/tmp/pv_progress_$pid"
    local start_time=$(date +%s)
    
    echo "$pid:$dir_name:0:STARTING" >> "$PROGRESS_FILE"
    
    # Calculate directory size for progress
    local total_size=$(du -sb "$src_dir" 2>/dev/null | awk '{print $1}')
    
    # Create backup with progress monitoring
    tar cf - "$src_dir" 2>/dev/null | \
    pv -s "$total_size" -n 2>"$progress_tmp" | \
    gzip > "$backup_file" &
    
    local tar_pid=$!
    
    # Monitor progress
    while kill -0 $tar_pid 2>/dev/null; do
        if [ -f "$progress_tmp" ]; then
            progress=$(tail -1 "$progress_tmp" 2>/dev/null || echo "0")
            echo "$pid:$dir_name:$progress:RUNNING" >> "$PROGRESS_FILE"
        fi
        sleep 1
    done
    
    # Check if backup succeeded
    wait $tar_pid
    local exit_code=$?
    
    local end_time=$(date +%s)
    local duration=$((end_time - start_time))
    
    if [ $exit_code -eq 0 ]; then
        local final_size=$(stat -c%s "$backup_file" 2>/dev/null || stat -f%z "$backup_file")
        echo "$pid:$dir_name:100:COMPLETED:$duration:$final_size" >> "$PROGRESS_FILE"
    else
        echo "$pid:$dir_name:0:FAILED:$duration:0" >> "$PROGRESS_FILE"
    fi
    
    rm -f "$progress_tmp"
}

# Function to display progress
show_progress() {
    while true; do
        clear
        echo "Backup Progress - $(date '+%Y-%m-%d %H:%M:%S')"
        echo "=================================================="
        
        # Read current progress
        while IFS=':' read -r pid dir progress status duration size; do
            case $status in
                STARTING)
                    printf "%-30s [%s]\n" "$dir" "Initializing..."
                    ;;
                RUNNING)
                    # Create progress bar
                    bar_length=30
                    filled=$((progress * bar_length / 100))
                    bar=$(printf '%*s' "$filled" | tr ' ' '=')
                    empty=$((bar_length - filled))
                    bar="${bar}$(printf '%*s' "$empty" | tr ' ' '-')"
                    printf "%-30s [%s] %3d%%\n" "$dir" "$bar" "$progress"
                    ;;
                COMPLETED)
                    size_mb=$((size / 1024 / 1024))
                    printf "%-30s [DONE] %dMB in %ds\n" "$dir" "$size_mb" "$duration"
                    ;;
                FAILED)
                    printf "%-30s [FAILED]\n" "$dir"
                    ;;
            esac
        done < "$PROGRESS_FILE"
        
        # Check if all backups are done
        if ! grep -q "RUNNING\|STARTING" "$PROGRESS_FILE" 2>/dev/null; then
            echo ""
            echo "All backups completed!"
            break
        fi
        
        sleep 1
    done
}

# Start progress display in background
show_progress &
progress_pid=$!

# Run backups in parallel with job control
job_count=0
for dir in "${SOURCE_DIRS[@]}"; do
    # Wait if we've hit the parallel limit
    while [ $(jobs -r | wc -l) -ge $MAX_PARALLEL ]; do
        sleep 0.5
    done
    
    # Start backup job
    backup_directory "$dir" &
    ((job_count++))
done

# Wait for all backup jobs
wait

# Stop progress display
kill $progress_pid 2>/dev/null
show_progress  # Show final status

# Cleanup
rm -f "$PROGRESS_FILE"

# Generate summary report
echo ""
echo "Backup Summary"
echo "=============="
echo "Location: $BACKUP_ROOT"
echo "Total archives: $(ls -1 "$BACKUP_ROOT"/*.tar.gz 2>/dev/null | wc -l)"
echo "Total size: $(du -sh "$BACKUP_ROOT" | awk '{print $1}')"

# List all backups with sizes
ls -lh "$BACKUP_ROOT"/*.tar.gz 2>/dev/null

This script showcases parallel job management, real-time progress tracking with pv, and proper cleanup. It handles multiple directories simultaneously while keeping the user informed.
Q3: Write a script that synchronizes configuration files across multiple servers
This tests your understanding of remote execution and error handling:
#!/bin/bash

# Configuration
CONFIG_DIR="/etc/myapp"
SERVERS=("web01.example.com" "web02.example.com" "web03.example.com")
SSH_USER="deploy"
SSH_KEY="$HOME/.ssh/deploy_key"
BACKUP_BEFORE_SYNC=true
DRY_RUN=false

# Parse command line
while getopts "dnh" opt; do
    case $opt in
        d) DRY_RUN=true ;;
        n) BACKUP_BEFORE_SYNC=false ;;
        h) echo "Usage: $0 [-d] [-n] [-h]"
           echo "  -d  Dry run (show what would be done)"
           echo "  -n  No backup before sync"
           exit 0 ;;
    esac
done

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# Logging function
log() {
    local level=$1
    shift
    local msg="$*"
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    
    case $level in
        ERROR) echo -e "${RED}[$timestamp] ERROR: $msg${NC}" >&2 ;;
        SUCCESS) echo -e "${GREEN}[$timestamp] SUCCESS: $msg${NC}" ;;
        INFO) echo -e "[$timestamp] INFO: $msg" ;;
        WARN) echo -e "${YELLOW}[$timestamp] WARN: $msg${NC}" ;;
    esac
    
    # Also log to file
    echo "[$timestamp] $level: $msg" >> "/var/log/config_sync.log"
}

# Function to check server connectivity
check_server() {
    local server=$1
    ssh -q -o ConnectTimeout=5 -o BatchMode=yes \
        -i "$SSH_KEY" "$SSH_USER@$server" exit 2>/dev/null
}

# Function to backup configs on remote server
backup_remote_configs() {
    local server=$1
    local backup_name="config_backup_$(date +%Y%m%d_%H%M%S).tar.gz"
    
    log INFO "Creating backup on $server"
    
    ssh -i "$SSH_KEY" "$SSH_USER@$server" "
        if [ -d '$CONFIG_DIR' ]; then
            sudo tar czf /tmp/$backup_name $CONFIG_DIR 2>/dev/null
            sudo mv /tmp/$backup_name /var/backups/
            echo 'Backup created: /var/backups/$backup_name'
        else
            echo 'Config directory not found, skipping backup'
        fi
    "
}

# Function to sync single file
sync_file() {
    local server=$1
    local file=$2
    local rel_path=${file#$CONFIG_DIR/}
    
    # Calculate checksum
    local local_sum=$(md5sum "$file" | awk '{print $1}')
    local remote_sum=$(ssh -i "$SSH_KEY" "$SSH_USER@$server" \
        "sudo md5sum '$file' 2>/dev/null | awk '{print \$1}'")
    
    if [ "$local_sum" = "$remote_sum" ]; then
        echo "  ✓ $rel_path (unchanged)"
        return 0
    fi
    
    if [ "$DRY_RUN" = true ]; then
        echo "  → Would sync: $rel_path"
        return 0
    fi
    
    # Copy file
    if scp -i "$SSH_KEY" "$file" "$SSH_USER@$server:/tmp/$(basename "$file")" >/dev/null 2>&1; then
        # Move to final location with sudo
        if ssh -i "$SSH_KEY" "$SSH_USER@$server" "
            sudo mkdir -p '$(dirname "$file")'
            sudo mv '/tmp/$(basename "$file")' '$file'
            sudo chown root:root '$file'
            sudo chmod 644 '$file'
        "; then
            echo "  ✓ $rel_path (updated)"
            return 0
        fi
    fi
    
    echo "  ✗ $rel_path (failed)"
    return 1
}

# Main sync process
log INFO "Starting configuration sync"
[ "$DRY_RUN" = true ] && log WARN "Running in DRY RUN mode"

# Check all servers first
log INFO "Checking server connectivity..."
available_servers=()
for server in "${SERVERS[@]}"; do
    if check_server "$server"; then
        available_servers+=("$server")
        echo "  ✓ $server"
    else
        log ERROR "Cannot connect to $server"
        echo "  ✗ $server"
    fi
done

if [ ${#available_servers[@]} -eq 0 ]; then
    log ERROR "No servers available for sync"
    exit 1
fi

# Find all config files (parentheses group the -name tests so -type f applies to every pattern)
config_files=$(find "$CONFIG_DIR" -type f \( -name "*.conf" -o -name "*.yml" -o -name "*.json" \))
if [ -z "$config_files" ]; then
    log ERROR "No configuration files found in $CONFIG_DIR"
    exit 1
fi
file_count=$(echo "$config_files" | wc -l)

log INFO "Found $file_count configuration files to sync"

# Sync to each server
for server in "${available_servers[@]}"; do
    log INFO "Syncing to $server"
    
    # Backup if requested
    if [ "$BACKUP_BEFORE_SYNC" = true ] && [ "$DRY_RUN" = false ]; then
        backup_remote_configs "$server"
    fi
    
    # Sync each file
    success_count=0
    fail_count=0
    
    while IFS= read -r file; do
        if sync_file "$server" "$file"; then
            ((success_count++))
        else
            ((fail_count++))
        fi
    done <<< "$config_files"
    
    # Reload services if sync succeeded
    if [ $fail_count -eq 0 ] && [ "$DRY_RUN" = false ]; then
        log INFO "Reloading services on $server"
        ssh -i "$SSH_KEY" "$SSH_USER@$server" "
            sudo systemctl reload nginx 2>/dev/null || true
            sudo systemctl reload myapp 2>/dev/null || true
        "
    fi
    
    # Report for this server
    if [ $fail_count -eq 0 ]; then
        log SUCCESS "$server: $success_count files synced successfully"
    else
        log ERROR "$server: $fail_count files failed (${success_count} succeeded)"
    fi
done

log INFO "Configuration sync completed"

This script handles multiple servers, does checksums to avoid unnecessary transfers, creates backups, and includes proper error handling with dry-run support.
Q4: Create a script that analyzes system performance and generates an HTML report
This shows data collection, processing, and formatted output:
#!/bin/bash

# Output file
REPORT_FILE="system_report_$(hostname)_$(date +%Y%m%d_%H%M%S).html"
TEMP_DIR="/tmp/sysreport_$$"
mkdir -p "$TEMP_DIR"

# Thresholds for warnings
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90

# Function to get CSS color based on value
get_color() {
    local value=${1:-0}
    local threshold=$2
    
    if [ $(echo "$value >= $threshold" | bc) -eq 1 ]; then
        echo "#ff4444"  # Red
    elif [ $(echo "$value >= $threshold * 0.8" | bc) -eq 1 ]; then
        echo "#ff9944"  # Orange
    else
        echo "#44ff44"  # Green
    fi
}

# Collect system information
echo "Collecting system information..."

# Basic info
HOSTNAME=$(hostname)
UPTIME=$(uptime -p)
KERNEL=$(uname -r)
CPU_MODEL=$(grep "model name" /proc/cpuinfo | head -1 | cut -d: -f2 | sed 's/^ *//')
CPU_CORES=$(nproc)
TOTAL_MEM=$(free -h | awk '/^Mem:/ {print $2}')

# CPU usage (over 5 seconds)
echo "Analyzing CPU usage..."
sar -u 1 5 > "$TEMP_DIR/cpu_stats.txt" 2>/dev/null || {
    # Fallback if sar not available
    top -b -n 2 -d 5 | grep "Cpu(s)" | tail -1 > "$TEMP_DIR/cpu_stats.txt"
}
CPU_USAGE=$(awk '/Average:/ {print 100 - $NF}' "$TEMP_DIR/cpu_stats.txt")
# Fallback for the `top` output format ("... 97.5 id, ..."); awk exits 0 even
# with no match, so test the result instead of chaining with ||
[ -z "$CPU_USAGE" ] && CPU_USAGE=$(grep -o '[0-9.]* id' "$TEMP_DIR/cpu_stats.txt" | awk '{print 100 - $1}')
CPU_USAGE=${CPU_USAGE:-0}

# Memory usage
MEM_STATS=$(free -m | awk '/^Mem:/ {printf "%.1f", ($3/$2) * 100}')

# Disk usage
echo "Analyzing disk usage..."
DISK_DATA=""
while read -r line; do
    device=$(echo "$line" | awk '{print $1}')
    mount=$(echo "$line" | awk '{print $6}')
    usage=$(echo "$line" | awk '{print $5}' | tr -d '%')
    size=$(echo "$line" | awk '{print $2}')
    used=$(echo "$line" | awk '{print $3}')
    avail=$(echo "$line" | awk '{print $4}')
    
    color=$(get_color "$usage" "$DISK_THRESHOLD")
    DISK_DATA="${DISK_DATA}
    <tr>
        <td>$device</td>
        <td>$mount</td>
        <td>$size</td>
        <td>$used</td>
        <td>$avail</td>
        <td style='color: $color; font-weight: bold;'>$usage%</td>
    </tr>"
done < <(df -h | grep '^/dev/' | grep -v '/dev/loop')

# Top processes
echo "Analyzing processes..."
TOP_CPU=$(ps aux --sort=-%cpu | head -6 | tail -5)
TOP_MEM=$(ps aux --sort=-%mem | head -6 | tail -5)

# Network statistics
NETSTAT=$(ss -s 2>/dev/null | grep -E "TCP:|UDP:" | head -4)

# Recent system logs (errors/warnings)
RECENT_ERRORS=$(journalctl -p err -n 10 --no-pager 2>/dev/null || \
                dmesg | grep -iE "error|fail|warn" | tail -10)

# Generate HTML report
cat > "$REPORT_FILE" << 'EOF'
<!DOCTYPE html>
<html>
<head>
    <title>System Performance Report</title>
    <meta charset="utf-8">
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
            background-color: #f5f5f5;
        }
        .container {
            max-width: 1200px;
            margin: 0 auto;
            background-color: white;
            padding: 20px;
            box-shadow: 0 0 10px rgba(0,0,0,0.1);
        }
        h1, h2 {
            color: #333;
            border-bottom: 2px solid #4CAF50;
            padding-bottom: 10px;
        }
        .info-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
            gap: 20px;
            margin: 20px 0;
        }
        .info-box {
            background-color: #f9f9f9;
            padding: 15px;
            border-radius: 5px;
            border-left: 4px solid #4CAF50;
        }
        table {
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
        }
        th, td {
            text-align: left;
            padding: 12px;
            border-bottom: 1px solid #ddd;
        }
        th {
            background-color: #4CAF50;
            color: white;
        }
        tr:hover {
            background-color: #f5f5f5;
        }
        .metric {
            font-size: 24px;
            font-weight: bold;
            margin: 10px 0;
        }
        .warning {
            background-color: #fff3cd;
            border-color: #ffeaa7;
            color: #856404;
            padding: 10px;
            border-radius: 5px;
            margin: 10px 0;
        }
        .timestamp {
            text-align: right;
            color: #666;
            font-style: italic;
        }
        pre {
            background-color: #f4f4f4;
            padding: 10px;
            border-radius: 5px;
            overflow-x: auto;
        }
    </style>
</head>
<body>
<div class="container">
EOF

# Add content
cat >> "$REPORT_FILE" << EOF
    <h1>System Performance Report - $HOSTNAME</h1>
    <p class="timestamp">Generated on $(date '+%Y-%m-%d %H:%M:%S')</p>
    
    <div class="info-grid">
        <div class="info-box">
            <strong>Hostname:</strong> $HOSTNAME<br>
            <strong>Uptime:</strong> $UPTIME<br>
            <strong>Kernel:</strong> $KERNEL
        </div>
        <div class="info-box">
            <strong>CPU Model:</strong> $CPU_MODEL<br>
            <strong>CPU Cores:</strong> $CPU_CORES<br>
            <strong>Total Memory:</strong> $TOTAL_MEM
        </div>
        <div class="info-box">
            <strong>CPU Usage:</strong>
            <div class="metric" style="color: $(get_color $CPU_USAGE $CPU_THRESHOLD)">
                ${CPU_USAGE}%
            </div>
        </div>
        <div class="info-box">
            <strong>Memory Usage:</strong>
            <div class="metric" style="color: $(get_color $MEM_STATS $MEM_THRESHOLD)">
                ${MEM_STATS}%
            </div>
        </div>
    </div>
EOF

# Add warnings if thresholds exceeded
if [ $(echo "$CPU_USAGE >= $CPU_THRESHOLD" | bc) -eq 1 ]; then
    echo '<div class="warning">⚠️ High CPU usage detected!</div>' >> "$REPORT_FILE"
fi
if [ $(echo "$MEM_STATS >= $MEM_THRESHOLD" | bc) -eq 1 ]; then
    echo '<div class="warning">⚠️ High memory usage detected!</div>' >> "$REPORT_FILE"
fi

# Add disk usage table
cat >> "$REPORT_FILE" << EOF
    <h2>Disk Usage</h2>
    <table>
        <tr>
            <th>Device</th>
            <th>Mount Point</th>
            <th>Size</th>
            <th>Used</th>
            <th>Available</th>
            <th>Usage %</th>
        </tr>
        $DISK_DATA
    </table>
    
    <h2>Top CPU Consuming Processes</h2>
    <pre>$(echo "$TOP_CPU" | awk '{printf "%-8s %-8s %6s  %s\n", $1, $2, $9, $11}')</pre>
    
    <h2>Top Memory Consuming Processes</h2>
    <pre>$(echo "$TOP_MEM" | awk '{printf "%-8s %-8s %6s  %s\n", $1, $2, $10, $11}')</pre>
    
    <h2>Network Statistics</h2>
    <pre>$NETSTAT</pre>
    
    <h2>Recent System Errors/Warnings</h2>
    <pre>$(echo "$RECENT_ERRORS" | head -20)</pre>
    
    </div>
</body>
</html>
EOF

# Cleanup
rm -rf "$TEMP_DIR"

echo "Report generated: $REPORT_FILE"
# Optional: open in browser
if command -v xdg-open >/dev/null 2>&1; then
    xdg-open "$REPORT_FILE"
elif command -v open >/dev/null 2>&1; then
    open "$REPORT_FILE"
fi

This creates a professional-looking HTML report with color-coded metrics, responsive design, and comprehensive system analysis.
Q5: Write a script that implements a job queue system with worker processes
This demonstrates advanced process control and inter-process communication:
#!/bin/bash

# Configuration
QUEUE_DIR="/var/spool/job_queue"
WORKERS=4
PID_FILE="/var/run/job_queue.pid"
LOG_FILE="/var/log/job_queue.log"
WORKER_TIMEOUT=300  # 5 minutes max per job

# Create necessary directories
mkdir -p "$QUEUE_DIR"/{pending,processing,completed,failed}

# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"
}

# Signal handlers
cleanup() {
    log "Shutting down job queue system..."
    
    # Signal all workers to stop
    touch "$QUEUE_DIR/.shutdown"
    
    # Wait for workers to finish current jobs
    for pid in $(jobs -p); do
        log "Waiting for worker $pid to finish..."
        wait $pid 2>/dev/null
    done
    
    rm -f "$PID_FILE" "$QUEUE_DIR/.shutdown"
    log "Shutdown complete"
    exit 0
}

trap cleanup SIGINT SIGTERM

# Function to add a job to queue
add_job() {
    local job_type=$1
    local job_data=$2
    local priority=${3:-5}  # Default priority 5 (1=highest, 9=lowest)
    local job_id=$(date +%s%N)_$$
    local job_file="$QUEUE_DIR/pending/${priority}_${job_id}.job"
    
    cat > "$job_file" << EOF
JOB_ID=$job_id
JOB_TYPE=$job_type
JOB_DATA=$job_data
SUBMIT_TIME=$(date +%s)
SUBMIT_USER=$USER
STATUS=pending
EOF
    
    log "Job added: $job_id (type: $job_type, priority: $priority)"
    echo "$job_id"
}

# Worker function
worker_process() {
    local worker_id=$1
    log "Worker $worker_id started (PID: $$)"
    
    while [ ! -f "$QUEUE_DIR/.shutdown" ]; do
        # Find next job (sorted by priority and age)
        local job_file=$(ls -1 "$QUEUE_DIR/pending/"*.job 2>/dev/null | sort | head -1)
        
        if [ -z "$job_file" ]; then
            # No jobs, wait a bit
            sleep 2
            continue
        fi
        
        # Try to claim the job (atomic operation)
        local processing_file="$QUEUE_DIR/processing/$(basename "$job_file")"
        if ! mv "$job_file" "$processing_file" 2>/dev/null; then
            # Another worker got it first
            continue
        fi
        
        # Load job details
        source "$processing_file"
        log "Worker $worker_id processing job $JOB_ID (type: $JOB_TYPE)"
        
        # Update job status
        echo "WORKER=$worker_id" >> "$processing_file"
        echo "START_TIME=$(date +%s)" >> "$processing_file"
        echo "STATUS=processing" >> "$processing_file"
        
        # Execute job with timeout
        local job_output="$QUEUE_DIR/processing/${JOB_ID}.out"
        local job_result=0
        
        case "$JOB_TYPE" in
            "compress")
                timeout $WORKER_TIMEOUT tar czf "${JOB_DATA}.tar.gz" "$JOB_DATA" \
                    > "$job_output" 2>&1
                job_result=$?
                ;;
            
            "backup")
                timeout $WORKER_TIMEOUT rsync -av "$JOB_DATA" "/backup/$JOB_DATA" \
                    > "$job_output" 2>&1
                job_result=$?
                ;;
            
            "report")
                timeout $WORKER_TIMEOUT /usr/local/bin/generate_report.sh "$JOB_DATA" \
                    > "$job_output" 2>&1
                job_result=$?
                ;;
            
            "custom")
                # Execute custom command (careful with security!)
                timeout $WORKER_TIMEOUT bash -c "$JOB_DATA" \
                    > "$job_output" 2>&1
                job_result=$?
                ;;
            
            *)
                echo "Unknown job type: $JOB_TYPE" > "$job_output"
                job_result=1
                ;;
        esac
        
        # Move to completed or failed
        # Re-source the job file to pick up START_TIME, which was appended
        # after the initial source above
        source "$processing_file"
        local end_time=$(date +%s)
        local duration=$((end_time - START_TIME))
        
        echo "END_TIME=$end_time" >> "$processing_file"
        echo "DURATION=$duration" >> "$processing_file"
        echo "EXIT_CODE=$job_result" >> "$processing_file"
        
        if [ $job_result -eq 0 ]; then
            echo "STATUS=completed" >> "$processing_file"
            mv "$processing_file" "$QUEUE_DIR/completed/"
            mv "$job_output" "$QUEUE_DIR/completed/"
            log "Worker $worker_id completed job $JOB_ID in ${duration}s"
        else
            echo "STATUS=failed" >> "$processing_file"
            mv "$processing_file" "$QUEUE_DIR/failed/"
            mv "$job_output" "$QUEUE_DIR/failed/"
            log "Worker $worker_id: job $JOB_ID failed (exit code: $job_result)"
        fi
    done
    
    log "Worker $worker_id stopped"
}

# Status monitoring function
monitor_status() {
    while [ ! -f "$QUEUE_DIR/.shutdown" ]; do
        clear
        echo "Job Queue Status - $(date)"
        echo "================================"
        
        # Count jobs by status
        local pending=$(ls -1 "$QUEUE_DIR/pending/"*.job 2>/dev/null | wc -l)
        local processing=$(ls -1 "$QUEUE_DIR/processing/"*.job 2>/dev/null | wc -l)
        local completed=$(ls -1 "$QUEUE_DIR/completed/"*.job 2>/dev/null | wc -l)
        local failed=$(ls -1 "$QUEUE_DIR/failed/"*.job 2>/dev/null | wc -l)
        
        echo "Pending:    $pending"
        echo "Processing: $processing"
        echo "Completed:  $completed"
        echo "Failed:     $failed"
        echo ""
        
        # Show current processing jobs
        if [ $processing -gt 0 ]; then
            echo "Currently Processing:"
            echo "--------------------"
            for job in "$QUEUE_DIR/processing/"*.job; do
                [ -f "$job" ] || continue
                source "$job"
                local runtime=$(($(date +%s) - START_TIME))
                printf "  Job %s (Worker %d): %s - %ds\n" \
                    "$JOB_ID" "$WORKER" "$JOB_TYPE" "$runtime"
            done
            echo ""
        fi
        
        # Show worker status
        echo "Workers:"
        echo "--------"
        for i in $(seq 1 $WORKERS); do
            if [ -n "${WORKER_PIDS[$i]:-}" ] && kill -0 "${WORKER_PIDS[$i]}" 2>/dev/null; then
                echo "  Worker $i: Running (PID: ${WORKER_PIDS[$i]})"
            else
                # WORKER_PIDS is only populated inside the queue manager process;
                # a standalone `status` invocation has no PID information
                echo "  Worker $i: Stopped or unknown"
            fi
        done
        
        sleep 5
    done
}

# Main execution
if [ "$1" = "add" ]; then
    # Add job mode
    shift
    add_job "$@"
    exit 0
fi

if [ "$1" = "status" ]; then
    # Status mode
    monitor_status
    exit 0
fi

# Start queue manager
log "Starting job queue system with $WORKERS workers"
echo $$ > "$PID_FILE"

# Start workers
declare -a WORKER_PIDS
for i in $(seq 1 $WORKERS); do
    worker_process $i &
    WORKER_PIDS[$i]=$!
done

# Wait for all workers
log "Job queue system running. Press Ctrl+C to stop."
wait

This implements a complete job queue with priority handling, multiple workers, atomic job claiming, timeout protection, and comprehensive logging. It's the kind of system you'd build to handle background tasks in production.

Shell Scripting Interview Questions and Answers For Experienced (5+ years Exp)
At the senior level, advanced shell scripting interview questions dig deep into architectural decisions, performance at scale, and the wisdom that comes from maintaining production systems through their entire lifecycle. These questions explore how you handle the messy realities of enterprise environments - legacy system integration, cross-platform compatibility nightmares, and scripts that need to run reliably for years. The focus shifts from "can you write it" to "have you lived through the consequences of different approaches" and whether you can design solutions that other engineers can maintain long after you've moved on.
Q1: How do you design shell scripts for high-availability environments where failure isn't an option?
After years of 3 AM calls, I've learned that HA scripts need multiple layers of protection. First, I implement circuit breakers - if a script fails repeatedly, it stops trying and alerts instead of hammering a broken system. I use distributed locking across nodes (often with Redis or etcd) to prevent split-brain scenarios. Here's my approach:
Health checks before actions: Never assume a service is up. Always verify endpoints are responding correctly before sending traffic.
Idempotency everywhere: Every operation should be safe to run multiple times. I use checksums, version checks, and state files to ensure this.
Graceful degradation: If a non-critical component fails, the script continues with reduced functionality rather than failing completely.
Audit trails: Every action gets logged with who/what/when/why, often shipped to a central logging system for correlation.
The key insight I've gained is that HA isn't about preventing failures - it's about failing gracefully and recovering automatically. I've seen too many "clever" scripts that made things worse during outages.
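One minimal way to sketch the circuit-breaker idea is a file-backed failure counter; the `run_guarded` name, state path, and threshold here are illustrative, not a fixed convention:

```shell
#!/bin/bash
# Minimal circuit-breaker sketch: after MAX_FAILURES consecutive failures,
# stop retrying and report instead of hammering a broken system.
STATE_FILE="${STATE_FILE:-/tmp/healthcheck.failures}"
MAX_FAILURES="${MAX_FAILURES:-3}"

run_guarded() {
    local failures=0
    [ -f "$STATE_FILE" ] && failures=$(cat "$STATE_FILE")

    if [ "$failures" -ge "$MAX_FAILURES" ]; then
        echo "circuit open after $failures failures; alerting instead of retrying" >&2
        return 2
    fi

    if "$@"; then
        rm -f "$STATE_FILE"                 # success closes the circuit
        return 0
    else
        echo $((failures + 1)) > "$STATE_FILE"
        return 1
    fi
}
```

Call it as `run_guarded curl -fsS https://service/health`; a return code of 2 means the breaker is open and the caller should alert rather than retry.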
Q2: Explain your approach to handling sensitive data in shell scripts
This is where experience really shows. Never, ever put secrets in scripts. I've cleaned up too many messes where passwords were in version control. My approach:
Environment variables for local development only, never in production
Secrets management tools like HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets
Temporary credentials with short TTLs whenever possible
Audit logging for every secret access
I also use set +x before any line that might expose secrets in debug output, and I'm paranoid about temporary files - always created with mktemp and proper permissions, cleaned up in trap handlers. I've seen scripts that wrote database passwords to /tmp with world-readable permissions. That's a career-limiting move.
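The temp-file discipline above can be sketched as follows; `fetch_secret` is an illustrative stub standing in for a real backend call such as `vault kv get` or `aws secretsmanager get-secret-value`:

```shell
#!/bin/bash
set -euo pipefail

# Defensive secret handling: a 0600 temp file, cleaned up by a trap,
# never passed on a command line.
SECRET_FILE=$(mktemp)                       # mktemp creates the file mode 0600
trap 'rm -f "$SECRET_FILE"' EXIT INT TERM   # removed even if the script fails

fetch_secret() { printf 'example-password'; }   # illustrative stub

# Write the secret to the file; command-line arguments would be visible in
# `ps` output and in `set -x` traces.
fetch_secret > "$SECRET_FILE"

# Consumers read $SECRET_FILE instead of receiving the password directly.
```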
Q3: How do you optimize shell scripts that process millions of records?
The biggest lesson I've learned: the shell isn't always the answer. But when it is, here's what works:
Streaming over loading: Never load large datasets into memory. Use pipes and process line by line.
GNU Parallel for CPU-bound tasks. It's incredible how much faster things run with proper parallelization.
Sort/join over nested loops: Unix tools are optimized for this. A sort | join pipeline beats nested loops every time.
Minimize process spawning: That for loop calling sed 1000 times? Rewrite it as a single awk script.
Real example: I once replaced a 6-hour customer data processing script with a sort | join | awk pipeline that ran in 12 minutes. The original developer was reading a 2GB file into an associative array. Sometimes the old Unix philosophy of small, focused tools really shines.
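The sort | join pattern from that example can be sketched on toy data (the file names and fields are made up):

```shell
#!/bin/bash
# Merge two delimited files on a shared key without loading either into
# memory. join requires both inputs sorted on the join field.
tmp=$(mktemp -d)
printf 'u1,alice\nu2,bob\n'              > "$tmp/users.csv"
printf 'u2,2024-01-05\nu1,2024-01-02\n'  > "$tmp/logins.csv"

join -t, \
    <(sort -t, -k1,1 "$tmp/users.csv") \
    <(sort -t, -k1,1 "$tmp/logins.csv")
# u1,alice,2024-01-02
# u2,bob,2024-01-05

rm -rf "$tmp"
```

Because `sort` spills to disk and `join` streams both inputs, this scales to files far larger than memory.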
Q4: Describe your most complex debugging experience with shell scripts
The worst one that comes to mind involved a script that worked perfectly for 3 years, then started randomly failing on Tuesdays. Turned out a log rotation job was creating a race condition, but only when the log file exceeded 2GB, which only happened on our busiest day.
My debugging toolkit has evolved from painful experiences:
strace/dtrace to see actual system calls
bash -x is just the starting point; I often add custom debug functions that can be toggled
Reproducing environments exactly - same shell version, same locale settings, same everything
Binary search debugging - comment out half the script, see if it still fails
The real skill is knowing when to stop debugging and rewrite. I've learned that if I'm spending more than a day debugging a complex script, it's probably too complex and needs to be simplified or rewritten in a proper programming language.
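A toggleable debug function of the kind mentioned above might look like this (the `DEBUG` variable name is just a convention):

```shell
#!/bin/bash
# Richer than plain `bash -x`: reports file, line, and calling function,
# and can stay in production scripts, switched on per run with DEBUG=1.
DEBUG="${DEBUG:-0}"

debug() {
    [ "$DEBUG" = "1" ] || return 0
    echo "[DEBUG ${BASH_SOURCE[1]##*/}:${BASH_LINENO[0]} ${FUNCNAME[1]}] $*" >&2
}

process() {
    debug "processing $1"
    echo "done: $1"
}
```

Run the script normally for quiet output, or `DEBUG=1 ./script.sh` to get per-call context on stderr without touching the code.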
Q5: How do you handle shell script deployment and versioning across hundreds of servers?
Configuration management is crucial here. I've used Puppet, Ansible, and Chef, but the principles remain the same:
Immutable deployments: Scripts are versioned packages, not edited in place
Canary deployments: Roll out to 1%, then 10%, then 50%, monitoring metrics at each stage
Feature flags: Even in shell scripts, I implement toggles for risky changes
Rollback procedures: Always have a quick way back. I version with symlinks for instant rollback
One hard lesson: never trust system package managers alone. I've been burned by different versions of bash, different coreutils implementations, even different versions of basic commands like 'date'. Now I always include compatibility checks and sometimes ship specific binary versions with my scripts.
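The symlink-based rollback mentioned above can be sketched as follows; the directory layout is illustrative, and `mv -T` assumes GNU coreutils:

```shell
#!/bin/bash
# Versioned releases live side by side; `current` is a symlink that is
# switched atomically, so rollback is one rename away.
BASE="${BASE:-/opt/scripts}"   # illustrative root

switch_to() {
    local version=$1
    # Build the new symlink beside the target, then rename it over the old
    # one; the rename is atomic, so readers never see a missing `current`.
    ln -sfn "$BASE/releases/$version" "$BASE/current.tmp"
    mv -Tf "$BASE/current.tmp" "$BASE/current"
}

deploy() {
    local version=$1 src=$2
    mkdir -p "$BASE/releases/$version"
    cp -r "$src/." "$BASE/releases/$version/"
    switch_to "$version"
}

rollback() {
    switch_to "$1"
}
```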
Q6: What's your philosophy on when to use shell scripts versus a "real" programming language?
This is the question that separates experienced engineers from script kiddies. My rule of thumb:
Shell scripts are perfect for:
Gluing systems together
System administration tasks
Quick prototypes
Build and deployment automation
But I switch to Python/Go/Ruby when:
The script exceeds 200-300 lines
I need complex data structures
Error handling becomes more complex than the actual logic
I'm parsing anything more complex than simple delimited data
Performance is critical
I've maintained 2000-line bash scripts. It's not fun. The maintenance cost grows exponentially with complexity. Now I'm quick to recognize when bash has served its purpose and it's time to rewrite.
Q7: How do you ensure shell script security in a zero-trust environment?
Security has evolved way beyond just checking inputs. In zero-trust environments:
No permanent credentials: Everything uses temporary tokens with the minimum required scope
Mutual TLS for script-to-service communication
Signed scripts: We GPG sign critical scripts and verify signatures before execution
Minimal attack surface: Scripts run in containers or VMs with only required tools installed
Security scanning: All scripts go through static analysis tools looking for common vulnerabilities
I've also learned to be paranoid about seemingly innocent things. That curl command downloading a script from GitHub? What if someone compromises the repo? Now I verify checksums and use pinned versions for everything.
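Pinned-checksum verification before execution can be sketched like this; the `fetch_and_verify` helper is illustrative:

```shell
#!/bin/bash
# Download a script and refuse to keep it unless its SHA-256 matches a
# checksum pinned in the calling code - the safe alternative to `curl | bash`.
fetch_and_verify() {
    local url=$1 expected_sha256=$2 out=$3
    curl -fsSL "$url" -o "$out" || return 1
    local actual
    actual=$(sha256sum "$out" | awk '{print $1}')
    if [ "$actual" != "$expected_sha256" ]; then
        echo "checksum mismatch for $url (got $actual)" >&2
        rm -f "$out"          # never leave an unverified file behind
        return 1
    fi
}
```

The expected hash lives in version control next to the pinned URL, so a compromised upstream can change the file but not make it pass verification.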
Q8: Describe your approach to making shell scripts cloud-agnostic
After migrating between AWS, GCP, and Azure multiple times, I've learned:
Abstract provider-specific calls: Create wrapper functions for cloud operations
Use cloud SDK tools carefully: They change. Always pin versions and have fallbacks
Metadata services are different: Each cloud has its own way. Abstract this early
Storage is never just storage: S3, GCS, and Azure Blob have subtle differences
Example approach:
cloud_provider_detect() {
    if curl -s -f -m 1 http://169.254.169.254/latest/meta-data/instance-id >/dev/null 2>&1; then
        echo "aws"
    elif curl -s -f -m 1 -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/id >/dev/null 2>&1; then
        echo "gcp"
    else
        echo "azure"  # Or onprem, needs more logic
    fi
}

The key is planning for portability from day one. Retrofitting cloud-agnostic behavior is painful.
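Building on the detection function above, provider-specific calls can be hidden behind one wrapper. This is a sketch: `upload_object` is a made-up name, and the Azure invocation in particular varies with your auth setup:

```shell
#!/bin/bash
# One interface for object uploads; each provider's CLI hides behind a case
# statement, so callers never touch aws/gsutil/az directly.
upload_object() {
    local provider=$1 src=$2 dest=$3   # dest is "bucket/key"
    case "$provider" in
        aws)   aws s3 cp "$src" "s3://$dest" ;;
        gcp)   gsutil cp "$src" "gs://$dest" ;;
        azure) az storage blob upload \
                   --container-name "${dest%%/*}" \
                   --name "${dest#*/}" --file "$src" ;;
        *)     echo "unknown provider: $provider" >&2; return 1 ;;
    esac
}
```

Swapping clouds then means changing one function, not hunting `aws s3` strings through a hundred scripts.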
Q9: How do you handle backwards compatibility when system tools get updated?
This is a constant battle. GNU vs BSD tools, different bash versions, changing command options. My strategies:
Feature detection over version detection: Don't check if it's bash 4.3, check if it supports associative arrays
Wrapper functions: Abstract system commands behind functions that handle differences
Comprehensive testing matrix: Test on minimum supported versions of everything
Document requirements explicitly: Every script starts with a requirements block
I maintain compatibility libraries for common issues:
# Example: readlink portability
portable_readlink() {
    if readlink -f "$1" 2>/dev/null; then
        return
    elif command -v greadlink >/dev/null; then
        greadlink -f "$1"
    else
        # Fallback: pass the path as argv rather than interpolating it into
        # the Python source, which would break on quotes in the path
        python3 -c 'import os, sys; print(os.path.realpath(sys.argv[1]))' "$1"
    fi
}
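The "feature detection over version detection" rule can be sketched the same way (the probe name is illustrative):

```shell
#!/bin/bash
# Probe for a capability instead of parsing version strings: here, whether
# the running bash supports associative arrays (true on bash >= 4).
supports_assoc_arrays() {
    # Run the probe in a subshell so a failing declare can't affect the caller
    (declare -A _probe) 2>/dev/null
}

if supports_assoc_arrays; then
    echo "using associative arrays"
else
    echo "falling back to parallel indexed arrays"
fi
```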

Q10: What are the most important lessons you've learned about shell scripting in production?
The scars teach the best lessons:
Observability beats cleverness: A simple script with great logging beats a clever script you can't debug
Plan for failure from line 1: Not just error handling, but operational failure - what happens when your script runs during a datacenter failover?
Other people will maintain your code: Write for them, not to show off. Comment the "why", not the "what"
Test the unhappy paths: Everyone tests success. Test what happens when that API returns 500, when the disk is full, when DNS is flaking
Know when to stop: Some problems shouldn't be solved in bash. Recognizing this saves weeks of pain
The meta-lesson: shell scripting in production is 20% writing code and 80% thinking about what could go wrong. Every senior engineer has war stories about simple scripts causing major outages. The difference is we've learned to be paranoid in productive ways.

Shell Scripting Coding Interview Questions and Answers For Experienced (5+ years Exp)
Senior-level advanced shell scripting interview questions go beyond algorithms to test battle-hardened production experience - can you write code that survives server crashes, handles race conditions, and scales across distributed systems? These coding challenges reflect real scenarios you've probably debugged at 3 AM: service orchestration failures, data corruption recovery, and the kind of edge cases that only show up after months in production. The solutions here demonstrate not just working code, but the defensive programming and architectural thinking that comes from years of learning things the hard way.
Q1: Write a distributed lock manager for coordinating jobs across multiple servers
This solves the classic problem of preventing duplicate cron jobs in clustered environments:
#!/bin/bash

# Distributed lock implementation using shared filesystem or Redis
# Handles stale locks, network partitions, and crash recovery

LOCK_DIR="/shared/locks"
REDIS_HOST="${REDIS_HOST:-localhost}"
REDIS_PORT="${REDIS_PORT:-6379}"
BACKEND="${LOCK_BACKEND:-filesystem}"  # or "redis"

# Configuration
readonly SCRIPT_NAME=$(basename "$0")
readonly HOSTNAME=$(hostname -f)
readonly PID=$$
readonly LOCK_TIMEOUT=${LOCK_TIMEOUT:-300}  # 5 minutes default
readonly LOCK_RETRY_INTERVAL=2
readonly STALE_LOCK_THRESHOLD=600  # 10 minutes

# Generate unique lock identifier
generate_lock_id() {
    echo "${HOSTNAME}-${PID}-$(date +%s%N)"
}

# Filesystem-based lock implementation
fs_acquire_lock() {
    local lock_name=$1
    local lock_file="$LOCK_DIR/$lock_name.lock"
    local lock_id=$(generate_lock_id)
    local temp_file=$(mktemp)
    
    # Write lock metadata
    cat > "$temp_file" << EOF
{
    "holder": "$lock_id",
    "hostname": "$HOSTNAME",
    "pid": $PID,
    "acquired": $(date +%s),
    "timeout": $LOCK_TIMEOUT,
    "command": "$0 $*"
}
EOF
    
    # Try atomic lock creation
    if ln "$temp_file" "$lock_file" 2>/dev/null; then
        rm -f "$temp_file"
        echo "$lock_id"
        return 0
    fi
    
    # Lock exists, check if stale
    if [ -f "$lock_file" ]; then
        local lock_age=$(($(date +%s) - $(stat -c %Y "$lock_file" 2>/dev/null || stat -f %m "$lock_file")))
        local lock_data=$(cat "$lock_file" 2>/dev/null)
        local lock_pid=$(echo "$lock_data" | grep -o '"pid": [0-9]*' | grep -o '[0-9]*')
        local lock_host=$(echo "$lock_data" | grep -o '"hostname": "[^"]*"' | cut -d'"' -f4)
        
        # Check if lock is stale
        if [ $lock_age -gt $STALE_LOCK_THRESHOLD ]; then
            echo "Removing stale lock (age: ${lock_age}s)" >&2
            rm -f "$lock_file" "$temp_file"
            # Retry
            fs_acquire_lock "$lock_name"
            return $?
        fi
        
        # Check if process still exists (same host only)
        if [ "$lock_host" = "$HOSTNAME" ] && [ -n "$lock_pid" ]; then
            if ! kill -0 "$lock_pid" 2>/dev/null; then
                echo "Removing lock from dead process $lock_pid" >&2
                rm -f "$lock_file" "$temp_file"
                # Retry
                fs_acquire_lock "$lock_name"
                return $?
            fi
        fi
    fi
    
    rm -f "$temp_file"
    return 1
}

fs_release_lock() {
    local lock_name=$1
    local lock_id=$2
    local lock_file="$LOCK_DIR/$lock_name.lock"
    
    # Verify we own the lock before releasing
    if [ -f "$lock_file" ]; then
        local current_holder=$(grep -o '"holder": "[^"]*"' "$lock_file" 2>/dev/null | cut -d'"' -f4)
        if [ "$current_holder" = "$lock_id" ]; then
            rm -f "$lock_file"
            return 0
        else
            echo "Cannot release lock owned by $current_holder" >&2
            return 1
        fi
    fi
    return 0
}

# Redis-based lock implementation
redis_acquire_lock() {
    local lock_name=$1
    local lock_id=$(generate_lock_id)
    local lock_key="dlock:$lock_name"
    
    # Try to acquire lock with NX (only if not exists) and EX (expiry)
    local result=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
        SET "$lock_key" "$lock_id" NX EX "$LOCK_TIMEOUT" 2>/dev/null)
    
    if [ "$result" = "OK" ]; then
        # Store metadata separately
        redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" >/dev/null 2>&1 <<EOF
HSET "${lock_key}:meta" holder "$lock_id"
HSET "${lock_key}:meta" hostname "$HOSTNAME"
HSET "${lock_key}:meta" pid "$PID"
HSET "${lock_key}:meta" acquired "$(date +%s)"
EXPIRE "${lock_key}:meta" "$LOCK_TIMEOUT"
EOF
        echo "$lock_id"
        return 0
    fi
    
    # Check if lock is held by dead process on same host
    local lock_host=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
        HGET "${lock_key}:meta" hostname 2>/dev/null)
    local lock_pid=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
        HGET "${lock_key}:meta" pid 2>/dev/null)
    
    if [ "$lock_host" = "$HOSTNAME" ] && [ -n "$lock_pid" ]; then
        if ! kill -0 "$lock_pid" 2>/dev/null; then
            echo "Removing lock from dead process $lock_pid" >&2
            redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" DEL "$lock_key" "${lock_key}:meta" >/dev/null
            # Retry
            redis_acquire_lock "$lock_name"
            return $?
        fi
    fi
    
    return 1
}

redis_release_lock() {
    local lock_name=$1
    local lock_id=$2
    local lock_key="dlock:$lock_name"
    
    # Use a Lua script for atomic check-and-delete. Note that redis-cli
    # --eval takes a script *file*, and KEYS are separated from ARGV
    # by a literal comma on the command line.
    local script_file
    script_file=$(mktemp)
    cat > "$script_file" <<'EOF'
if redis.call("GET", KEYS[1]) == ARGV[1] then
    redis.call("DEL", KEYS[1], KEYS[1] .. ":meta")
    return 1
else
    return 0
end
EOF
    redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" \
        --eval "$script_file" "$lock_key" , "$lock_id" >/dev/null
    local rc=$?
    rm -f "$script_file"
    return $rc
}

# Main lock interface
acquire_lock() {
    local lock_name=$1
    local max_wait=${2:-0}
    local waited=0
    
    while true; do
        local lock_id
        case "$BACKEND" in
            filesystem)
                lock_id=$(fs_acquire_lock "$lock_name")
                ;;
            redis)
                lock_id=$(redis_acquire_lock "$lock_name")
                ;;
            *)
                echo "Unknown backend: $BACKEND" >&2
                return 1
                ;;
        esac
        
        if [ -n "$lock_id" ]; then
            echo "$lock_id"
            return 0
        fi
        
        if [ $max_wait -gt 0 ] && [ $waited -ge $max_wait ]; then
            return 1
        fi
        
        sleep $LOCK_RETRY_INTERVAL
        waited=$((waited + LOCK_RETRY_INTERVAL))
    done
}

release_lock() {
    local lock_name=$1
    local lock_id=$2
    
    case "$BACKEND" in
        filesystem)
            fs_release_lock "$lock_name" "$lock_id"
            ;;
        redis)
            redis_release_lock "$lock_name" "$lock_id"
            ;;
    esac
}

# Lock manager with automatic cleanup
with_lock() {
    local lock_name=$1
    shift
    
    local lock_id
    lock_id=$(acquire_lock "$lock_name" 30)  # Wait up to 30 seconds
    
    if [ -z "$lock_id" ]; then
        echo "Failed to acquire lock: $lock_name" >&2
        return 1
    fi
    
    # Set up cleanup trap
    trap "release_lock '$lock_name' '$lock_id'" EXIT INT TERM
    
    echo "Lock acquired: $lock_name (id: $lock_id)" >&2
    
    # Execute command
    "$@"
    local exit_code=$?
    
    # Release lock
    release_lock "$lock_name" "$lock_id"
    trap - EXIT INT TERM
    
    return $exit_code
}

# Example usage for the interview
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
    # Demo: critical section protection
    critical_task() {
        echo "Starting critical task on $HOSTNAME..."
        sleep 10
        echo "Critical task completed"
    }
    
    # Use distributed lock
    with_lock "my-critical-task" critical_task
fi

This implementation handles real-world edge cases such as stale locks from crashed processes and lock expiry during network partitions, and it provides both filesystem and Redis backends for different infrastructure setups.
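For contrast, interviewers often expect you to know the simpler single-host primitive too: flock(1) delegates locking to the kernel, so a crashed holder never leaves a stale lock behind. This is a minimal sketch (the lock path /tmp/my-critical-task.lock is illustrative); unlike the Redis backend above, it cannot coordinate across hosts:

```shell
#!/bin/bash
# Kernel-backed advisory locking with flock(1): open a lock file on
# file descriptor 200, then attempt a non-blocking exclusive lock.
exec 200>/tmp/my-critical-task.lock
if flock -n 200; then
    echo "lock acquired, running critical section"
    # ... critical work here ...
    flock -u 200   # explicit release (also released automatically on exit)
else
    echo "another instance holds the lock" >&2
    exit 1
fi
```

Because the kernel releases the lock when the holding process dies, none of the stale-lock or dead-PID probing logic from the filesystem backend is needed.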
Q2: Create a self-healing service monitor that automatically restarts failed services with exponential backoff
This handles the complexity of service management in production:
#!/bin/bash

# Self-healing service monitor with intelligent restart logic
# Prevents restart loops and implements circuit breaker pattern

readonly CONFIG_DIR="/etc/service-monitor"
readonly STATE_DIR="/var/lib/service-monitor"
readonly LOG_FILE="/var/log/service-monitor.log"

# Initialize directories
mkdir -p "$CONFIG_DIR" "$STATE_DIR"

# Service configuration example:
# {
#   "name": "web-app",
#   "check_cmd": "curl -sf http://localhost:8080/health",
#   "start_cmd": "systemctl start web-app",
#   "stop_cmd": "systemctl stop web-app",
#   "restart_cmd": "systemctl restart web-app",
#   "max_restarts": 5,
#   "restart_window": 3600,
#   "backoff_base": 2,
#   "max_backoff": 300
# }

# Logging with severity
log() {
    local level=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a "$LOG_FILE"
    
    # Also log to syslog if available
    if command -v logger >/dev/null 2>&1; then
        logger -t "service-monitor" -p "daemon.$level" "$*"
    fi
}

# Load service configuration
load_service_config() {
    local service_name=$1
    local config_file="$CONFIG_DIR/${service_name}.json"
    
    if [ ! -f "$config_file" ]; then
        log "error" "Configuration not found for service: $service_name"
        return 1
    fi
    
    # Parse JSON (using jq if available, fallback to python)
    if command -v jq >/dev/null 2>&1; then
        jq -r 'to_entries | .[] | "\(.key)=\(.value)"' "$config_file"
    else
        # f-strings require Python 3; pass the path via argv instead of
        # interpolating it into the code
        python3 - "$config_file" <<'PYEOF'
import json, sys

with open(sys.argv[1]) as f:
    config = json.load(f)
for k, v in config.items():
    print(f'{k}={v}')
PYEOF
    fi
}

# Get service state
get_service_state() {
    local service_name=$1
    local state_file="$STATE_DIR/${service_name}.state"
    
    if [ ! -f "$state_file" ]; then
        # Initialize state
        cat > "$state_file" << EOF
restart_count=0
last_restart=0
consecutive_failures=0
backoff_seconds=1
status=unknown
last_check=0
total_restarts=0
circuit_breaker=closed
EOF
    fi
    
    source "$state_file"
}

# Update service state
update_service_state() {
    local service_name=$1
    local state_file="$STATE_DIR/${service_name}.state"
    shift
    
    # Read current state
    local temp_file=$(mktemp)
    cp "$state_file" "$temp_file" 2>/dev/null || touch "$temp_file"
    
    # Update specified values
    while [ $# -gt 0 ]; do
        local key="${1%%=*}"
        local value="${1#*=}"
        
        # Update or add key
        if grep -q "^${key}=" "$temp_file"; then
            sed -i "s/^${key}=.*/${key}=${value}/" "$temp_file"
        else
            echo "${key}=${value}" >> "$temp_file"
        fi
        shift
    done
    
    # Atomic update
    mv "$temp_file" "$state_file"
}

# Calculate exponential backoff
calculate_backoff() {
    local base=$1
    local attempt=$2
    local max_backoff=$3
    
    local backoff=$((base ** attempt))
    if [ $backoff -gt $max_backoff ]; then
        backoff=$max_backoff
    fi
    
    # Add jitter (±25%) to prevent thundering herd; guard against
    # "RANDOM % 0", which is an arithmetic error for small backoffs
    local jitter=$((backoff / 4))
    if [ "$jitter" -gt 0 ]; then
        local random_jitter=$((RANDOM % (jitter * 2) - jitter))
        backoff=$((backoff + random_jitter))
    fi
    
    echo $backoff
}

# Check if service is healthy
check_service_health() {
    local check_cmd=$1
    
    # Execute health check with timeout
    timeout 30 bash -c "$check_cmd" >/dev/null 2>&1
}

# Restart service with backoff
restart_service() {
    local service_name=$1
    local restart_cmd=$2
    local backoff_base=$3
    local max_backoff=$4
    
    # Get current state
    get_service_state "$service_name"
    
    # Calculate backoff
    local backoff=$(calculate_backoff "$backoff_base" "$consecutive_failures" "$max_backoff")
    
    log "warning" "Restarting $service_name (attempt $((consecutive_failures + 1)), waiting ${backoff}s)"
    
    # Wait with exponential backoff
    sleep "$backoff"
    
    # Attempt restart (run through a shell so quoted/compound commands work)
    if bash -c "$restart_cmd"; then
        log "info" "Successfully restarted $service_name"
        update_service_state "$service_name" \
            "last_restart=$(date +%s)" \
            "backoff_seconds=1" \
            "total_restarts=$((total_restarts + 1))"
        return 0
    else
        log "error" "Failed to restart $service_name"
        return 1
    fi
}

# Main monitoring loop for a service
monitor_service() {
    local service_name=$1
    
    # Load configuration (note: eval trusts the config file's contents,
    # so config files must be root-owned and not world-writable)
    eval "$(load_service_config "$service_name")"
    
    # Configuration validation
    for required in check_cmd restart_cmd max_restarts restart_window; do
        if [ -z "${!required}" ]; then
            log "error" "Missing required config: $required for $service_name"
            return 1
        fi
    done
    
    # Set defaults
    backoff_base=${backoff_base:-2}
    max_backoff=${max_backoff:-300}
    
    log "info" "Starting monitor for $service_name"
    
    while true; do
        # Get current state
        get_service_state "$service_name"
        
        # Check if circuit breaker is open
        if [ "$circuit_breaker" = "open" ]; then
            local circuit_break_duration=$(($(date +%s) - last_restart))
            if [ $circuit_break_duration -lt 3600 ]; then
                sleep 60
                continue
            else
                # Try to close circuit breaker
                log "info" "Attempting to close circuit breaker for $service_name"
                update_service_state "$service_name" "circuit_breaker=half-open"
            fi
        fi
        
        # Perform health check
        if check_service_health "$check_cmd"; then
            # Service is healthy
            if [ "$status" != "healthy" ] || [ "$consecutive_failures" -gt 0 ]; then
                log "info" "$service_name is healthy"
                update_service_state "$service_name" \
                    "status=healthy" \
                    "consecutive_failures=0" \
                    "circuit_breaker=closed"
            fi
        else
            # Service is unhealthy
            log "error" "$service_name health check failed"
            
            # Update failure count
            consecutive_failures=$((consecutive_failures + 1))
            update_service_state "$service_name" \
                "status=unhealthy" \
                "consecutive_failures=$consecutive_failures" \
                "last_check=$(date +%s)"
            
            # Check restart window
            current_time=$(date +%s)
            window_start=$((current_time - restart_window))
            
            # Count restarts in window
            recent_restarts=0
            if [ -f "$STATE_DIR/${service_name}.history" ]; then
                recent_restarts=$(awk -v start="$window_start" '$1 > start' \
                    "$STATE_DIR/${service_name}.history" | wc -l)
            fi
            
            # Check if we should restart
            if [ $recent_restarts -ge $max_restarts ]; then
                if [ "$circuit_breaker" != "open" ]; then
                    log "error" "Circuit breaker OPEN for $service_name (too many restarts)"
                    update_service_state "$service_name" "circuit_breaker=open"
                    
                    # Send alert
                    send_alert "$service_name" "Circuit breaker opened - manual intervention required"
                fi
            else
                # Attempt restart
                if restart_service "$service_name" "$restart_cmd" "$backoff_base" "$max_backoff"; then
                    # Log restart
                    echo "$(date +%s) restart" >> "$STATE_DIR/${service_name}.history"
                    
                    # Wait before next check to allow service to stabilize
                    sleep 30
                else
                    # Restart failed, increase backoff
                    update_service_state "$service_name" \
                        "consecutive_failures=$consecutive_failures"
                fi
            fi
        fi
        
        # Wait before next check
        sleep "${check_interval:-60}"
    done
}

# Send alerts (integrate with your alerting system)
send_alert() {
    local service=$1
    local message=$2
    
    # Example integrations
    if [ -n "$SLACK_WEBHOOK" ]; then
        curl -X POST "$SLACK_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d "{\"text\": \"Service Monitor Alert: $service - $message\"}" \
            2>/dev/null || true
    fi
    
    # Email alert
    if command -v mail >/dev/null 2>&1 && [ -n "$ALERT_EMAIL" ]; then
        echo "$message" | mail -s "Service Monitor: $service" "$ALERT_EMAIL"
    fi
    
    log "alert" "$service: $message"
}

# Main execution
main() {
    # Monitor all configured services
    for config_file in "$CONFIG_DIR"/*.json; do
        [ -f "$config_file" ] || continue
        
        service_name=$(basename "$config_file" .json)
        monitor_service "$service_name" &
        
        # Store PID for management
        echo $! > "$STATE_DIR/${service_name}.pid"
    done
    
    # Wait for all monitors
    wait
}

# Handle signals
trap 'log "info" "Shutting down service monitor"; pkill -P $$; exit 0' TERM INT

if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
    main "$@"
fi

Q3: Implement a log aggregation and analysis system that detects anomalies across distributed services
This demonstrates handling high-volume log streams and statistical pattern detection:
#!/bin/bash

# Distributed log analysis system with anomaly detection
# Handles multiple log formats, performs statistical analysis, and alerts on anomalies

readonly AGGREGATOR_PORT=9514
readonly ANALYSIS_INTERVAL=60
readonly BASELINE_WINDOW=3600  # 1 hour
readonly ANOMALY_THRESHOLD=3   # Standard deviations
readonly STATE_DIR="/var/lib/log-analyzer"
readonly PATTERNS_FILE="/etc/log-analyzer/patterns.conf"

mkdir -p "$STATE_DIR"

# Pattern definitions for different log types
declare -A LOG_PATTERNS=(
    ["nginx"]='(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
    ["apache"]='(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
    ["syslog"]='(?P<time>\S+ \S+ \S+) (?P<host>\S+) (?P<process>[^\[]+)\[(?P<pid>\d+)\]: (?P<message>.*)'
    ["json"]='^\{.*\}$'
)

# Statistical functions
calculate_stats() {
    # Input: newline-separated numbers
    # Output: count mean stddev min max p95
    awk '
    {
        data[NR] = $1
        sum += $1
        if (NR == 1 || $1 < min) min = $1
        if (NR == 1 || $1 > max) max = $1
    }
    END {
        if (NR == 0) {
            print "0 0 0 0 0 0"
            exit
        }
        mean = sum / NR
        
        # Calculate variance
        for (i = 1; i <= NR; i++) {
            variance += (data[i] - mean) ^ 2
        }
        variance = variance / NR
        stddev = sqrt(variance)
        
        # Calculate 95th percentile
        n = int(NR * 0.95)
        if (n < 1) n = 1
        
        # Sort for percentile
        for (i = 1; i <= NR; i++) {
            for (j = i + 1; j <= NR; j++) {
                if (data[i] > data[j]) {
                    temp = data[i]
                    data[i] = data[j]
                    data[j] = temp
                }
            }
        }
        
        print NR, mean, stddev, min, max, data[n]
    }
    '
}

# Z-score anomaly detection
is_anomaly() {
    local value=$1
    local mean=$2
    local stddev=$3
    local threshold=${4:-$ANOMALY_THRESHOLD}
    
    # Avoid division by zero
    if (( $(echo "$stddev == 0" | bc -l) )); then
        return 1
    fi
    
    # Calculate z-score
    local zscore=$(echo "scale=2; ($value - $mean) / $stddev" | bc -l)
    local abs_zscore=${zscore#-}  # absolute value: strip a leading minus sign
    
    if (( $(echo "$abs_zscore > $threshold" | bc -l) )); then
        return 0  # Is anomaly
    else
        return 1  # Not anomaly
    fi
}

# Parse log line based on format
parse_log_line() {
    local line=$1
    local log_type=$2
    
    case "$log_type" in
        json)
            # Parse JSON log
            if command -v jq >/dev/null 2>&1; then
                echo "$line" | jq -r '
                    [(.timestamp // (now | todate)), (.level // "INFO"),
                     (.service // "unknown"), (.message // "")] | @tsv
                ' 2>/dev/null
            else
                # Fallback to python; pass the line via argv to avoid
                # quoting problems with embedded quotes
                python3 - "$line" <<'PYEOF'
import json, sys, datetime
try:
    data = json.loads(sys.argv[1])
    timestamp = data.get('timestamp', datetime.datetime.now().isoformat())
    level = data.get('level', 'INFO')
    service = data.get('service', 'unknown')
    message = data.get('message', '')
    print(f'{timestamp}\t{level}\t{service}\t{message}')
except (ValueError, TypeError):
    pass
PYEOF
            fi
            ;;
        
        nginx|apache)
            # Parse using regex; pass the line via argv to avoid quoting issues
            python3 - "$line" <<PYEOF
import re, sys
pattern = r'${LOG_PATTERNS[$log_type]}'
match = re.match(pattern, sys.argv[1])
if match:
    data = match.groupdict()
    print('\t'.join(data.get(k, '') for k in ('time', 'status', 'method', 'path')))
PYEOF
            ;;
        
        *)
            # Generic parsing
            echo "$line" | awk '{print $1, $2, $3, $0}'
            ;;
    esac
}

# Aggregate logs from multiple sources
aggregate_logs() {
    local output_file=$1
    local duration=$2
    
    # Start log receiver (netcat flags vary by variant; -l/-k/-p here
    # assume a traditional/GNU-style nc)
    nc -l -k -p "$AGGREGATOR_PORT" > "$output_file" 2>/dev/null &
    local nc_pid=$!
    
    # Also collect local logs
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -f --since="$duration seconds ago" >> "$output_file" 2>/dev/null &
        local journal_pid=$!
    fi
    
    # Collect for specified duration
    sleep "$duration"
    
    # Stop collectors
    kill $nc_pid 2>/dev/null
    [ -n "$journal_pid" ] && kill $journal_pid 2>/dev/null
    
    # Return line count
    wc -l < "$output_file"
}

# Analyze log patterns and detect anomalies
analyze_logs() {
    local log_file=$1
    local analysis_output="$STATE_DIR/analysis_$(date +%s).json"
    
    # Initialize metrics
    declare -A error_counts
    declare -A response_times
    declare -A status_codes
    declare -A service_errors
    
    # Process each log line
    while IFS= read -r line; do
        # Detect log type
        local log_type="generic"
        if [[ "$line" =~ ^\{ ]]; then
            log_type="json"
        elif [[ "$line" =~ \"[A-Z]+\ .*HTTP ]]; then
            log_type="nginx"
        fi
        
        # Parse line
        local parsed=$(parse_log_line "$line" "$log_type")
        [ -z "$parsed" ] && continue
        
        # Extract fields based on log type
        case "$log_type" in
            json)
                IFS=$'\t' read -r timestamp level service message <<< "$parsed"
                
                # Count errors by service
                if [[ "$level" =~ ERROR|CRITICAL|FATAL ]]; then
                    ((service_errors["$service"]++))
                fi
                ;;
            
            nginx|apache)
                IFS=$'\t' read -r timestamp status method path <<< "$parsed"
                
                # Count status codes
                ((status_codes["$status"]++))
                
                # Track error rates
                if [[ "$status" =~ ^[45] ]]; then
                    ((error_counts["http_errors"]++))
                fi
                ;;
        esac
    done < "$log_file"
    
    # Load historical baselines
    local baseline_file="$STATE_DIR/baseline.json"
    if [ -f "$baseline_file" ]; then
        # Compare with baseline
        for service in "${!service_errors[@]}"; do
            local current_errors=${service_errors[$service]}
            local baseline_mean=$(jq -r --arg s "$service" '.services[$s].error_rate.mean // 0' "$baseline_file")
            local baseline_stddev=$(jq -r --arg s "$service" '.services[$s].error_rate.stddev // 1' "$baseline_file")
            
            if is_anomaly "$current_errors" "$baseline_mean" "$baseline_stddev"; then
                log "alert" "Anomaly detected: $service error rate is $current_errors (baseline: $baseline_mean ± $baseline_stddev)"
                send_anomaly_alert "$service" "error_rate" "$current_errors" "$baseline_mean" "$baseline_stddev"
            fi
        done
    fi
    
    # Generate analysis report
    cat > "$analysis_output" << EOF
{
    "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
    "total_lines": $(wc -l < "$log_file"),
    "services": {
EOF
    
    # Add service metrics
    local first=true
    for service in "${!service_errors[@]}"; do
        [ "$first" = true ] && first=false || echo "," >> "$analysis_output"
        cat >> "$analysis_output" << EOF
        "$service": {
            "error_count": ${service_errors[$service]},
            "error_rate": $(awk -v e="${service_errors[$service]}" -v t="$(wc -l < "$log_file")" 'BEGIN { printf "%.2f", e * 100 / t }')
        }
EOF
    done
    
    echo -e "\n    },\n    \"status_codes\": {" >> "$analysis_output"
    
    # Add status code distribution
    first=true
    for status in "${!status_codes[@]}"; do
        [ "$first" = true ] && first=false || echo "," >> "$analysis_output"
        echo -n "        \"$status\": ${status_codes[$status]}" >> "$analysis_output"
    done
    
    echo -e "\n    }\n}" >> "$analysis_output"
    
    # Update baseline with exponential moving average
    update_baseline "$analysis_output"
    
    echo "$analysis_output"
}

# Update baseline metrics
update_baseline() {
    local current_analysis=$1
    local baseline_file="$STATE_DIR/baseline.json"
    local alpha=0.3  # EMA weight for new data
    
    if [ ! -f "$baseline_file" ]; then
        # Initialize baseline
        cp "$current_analysis" "$baseline_file"
        return
    fi
    
    # Update baseline using exponential moving average
    python3 - "$baseline_file" "$current_analysis" "$alpha" << 'EOF'
import json
import sys

baseline_file, current_file, alpha = sys.argv[1:4]
alpha = float(alpha)

with open(baseline_file) as f:
    baseline = json.load(f)

with open(current_file) as f:
    current = json.load(f)

# Update service metrics
for service, metrics in current.get('services', {}).items():
    if service not in baseline.get('services', {}):
        baseline.setdefault('services', {})[service] = {
            'error_rate': {'mean': 0, 'stddev': 1}
        }
    
    # Update using EMA
    old_mean = baseline['services'][service]['error_rate']['mean']
    new_value = metrics['error_rate']
    new_mean = alpha * new_value + (1 - alpha) * old_mean
    
    # Update variance estimate
    old_var = baseline['services'][service]['error_rate'].get('variance', 1)
    new_var = alpha * ((new_value - new_mean) ** 2) + (1 - alpha) * old_var
    
    baseline['services'][service]['error_rate'] = {
        'mean': new_mean,
        'stddev': new_var ** 0.5,
        'variance': new_var
    }

# Save updated baseline
with open(baseline_file, 'w') as f:
    json.dump(baseline, f, indent=2)
EOF
}

# Send anomaly alerts
send_anomaly_alert() {
    local service=$1
    local metric=$2
    local current_value=$3
    local baseline_mean=$4
    local baseline_stddev=$5
    
    local zscore=$(echo "scale=2; ($current_value - $baseline_mean) / $baseline_stddev" | bc -l)
    
    # Create alert payload
    local alert_data=$(cat << EOF
{
    "service": "$service",
    "metric": "$metric",
    "current_value": $current_value,
    "baseline_mean": $baseline_mean,
    "baseline_stddev": $baseline_stddev,
    "z_score": $zscore,
    "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
    "severity": "$([ $(echo "$zscore > 5" | bc) -eq 1 ] && echo "critical" || echo "warning")"
}
EOF
)
    
    # Send to various channels
    if [ -n "$WEBHOOK_URL" ]; then
        curl -X POST "$WEBHOOK_URL" \
            -H "Content-Type: application/json" \
            -d "$alert_data" \
            2>/dev/null || true
    fi
    
    # Log alert
    echo "$alert_data" >> "$STATE_DIR/alerts.jsonl"
    
    # Execute custom alert handler if defined
    if [ -x "/etc/log-analyzer/alert-handler.sh" ]; then
        echo "$alert_data" | /etc/log-analyzer/alert-handler.sh
    fi
}

# Main monitoring loop
main() {
    log "info" "Starting distributed log analyzer"
    
    while true; do
        local temp_log=$(mktemp)
        
        # Aggregate logs
        log "info" "Collecting logs for $ANALYSIS_INTERVAL seconds"
        local line_count=$(aggregate_logs "$temp_log" "$ANALYSIS_INTERVAL")
        log "info" "Collected $line_count log lines"
        
        # Analyze logs
        if [ "$line_count" -gt 0 ]; then
            local analysis_file=$(analyze_logs "$temp_log")
            log "info" "Analysis complete: $analysis_file"
            
            # Clean up old analysis files (keep last 24 hours)
            find "$STATE_DIR" -name "analysis_*.json" -mtime +1 -delete
        fi
        
        rm -f "$temp_log"
        
        # Wait before next cycle
        sleep 10
    done
}

# Logging function
log() {
    local level=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" >&2
}

if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
    main "$@"
fi

Q4: Build a zero-downtime deployment system with automatic rollback on failure
This handles complex deployment orchestration:
#!/bin/bash

# Zero-downtime deployment system with health checks and automatic rollback
# Supports blue-green and rolling deployments across multiple servers

readonly DEPLOY_DIR="/opt/deployments"
readonly CONFIG_FILE="/etc/deploy/config.json"
readonly STATE_FILE="/var/lib/deploy/state.json"
readonly HEALTH_CHECK_RETRIES=10
readonly HEALTH_CHECK_INTERVAL=5
readonly CANARY_PERCENTAGE=10

# Deployment strategies
declare -A STRATEGIES=(
    ["blue-green"]="deploy_blue_green"
    ["rolling"]="deploy_rolling"
    ["canary"]="deploy_canary"
)

# Initialize directories
mkdir -p "$(dirname "$STATE_FILE")" "$DEPLOY_DIR"

# Atomic state management
update_state() {
    local temp_file=$(mktemp)
    if [ -f "$STATE_FILE" ]; then
        cp "$STATE_FILE" "$temp_file"
    else
        echo "{}" > "$temp_file"
    fi
    
    # Update state using jq
    local key=$1
    local value=$2
    
    jq --arg k "$key" --arg v "$value" '.[$k] = $v' "$temp_file" > "${temp_file}.new"
    mv "${temp_file}.new" "$STATE_FILE"
    rm -f "$temp_file"
}

get_state() {
    local key=$1
    if [ -f "$STATE_FILE" ]; then
        jq -r --arg k "$key" '.[$k] // empty' "$STATE_FILE"
    fi
}

# Load balancer management (HAProxy example)
update_load_balancer() {
    local action=$1
    local server=$2
    local backend=${3:-"webservers"}
    
    case "$action" in
        drain)
            echo "set server $backend/$server state drain" | \
                socat stdio /var/run/haproxy.sock
            ;;
        ready)
            echo "set server $backend/$server state ready" | \
                socat stdio /var/run/haproxy.sock
            ;;
        maint)
            echo "disable server $backend/$server" | \
                socat stdio /var/run/haproxy.sock
            ;;
        enable)
            echo "enable server $backend/$server" | \
                socat stdio /var/run/haproxy.sock
            ;;
    esac
}

# Wait for connection draining
wait_for_drain() {
    local server=$1
    local backend=${2:-"webservers"}
    local max_wait=60
    local waited=0
    
    log "info" "Waiting for connections to drain from $server"
    
    while [ $waited -lt $max_wait ]; do
        local conn_count=$(echo "show servers state $backend" | \
            socat stdio /var/run/haproxy.sock | \
            grep "$server" | awk '{print $7}')
        
        if [ "${conn_count:-0}" -eq 0 ]; then
            log "info" "Server $server drained successfully"
            return 0
        fi
        
        sleep 2
        waited=$((waited + 2))
    done
    
    log "warning" "Timeout waiting for $server to drain (still $conn_count connections)"
    return 1
}

# Health check implementation
perform_health_check() {
    local server=$1
    local health_endpoint=$2
    local expected_response=${3:-"200"}
    
    local attempt=1
    while [ $attempt -le $HEALTH_CHECK_RETRIES ]; do
        log "debug" "Health check attempt $attempt for $server"
        
        local response=$(curl -s -o /dev/null -w "%{http_code}" \
            --connect-timeout 5 \
            --max-time 10 \
            "http://$server$health_endpoint")
        
        if [[ "$response" =~ $expected_response ]]; then
            log "info" "Health check passed for $server"
            return 0
        fi
        
        log "warning" "Health check failed for $server (got $response, expected $expected_response)"
        sleep $HEALTH_CHECK_INTERVAL
        attempt=$((attempt + 1))
    done
    
    log "error" "Health check failed for $server after $HEALTH_CHECK_RETRIES attempts"
    return 1
}

# Deploy to single server
deploy_to_server() {
    local server=$1
    local artifact=$2
    local deploy_user=${3:-"deploy"}
    local deploy_path=${4:-"/var/www/app"}
    
    log "info" "Deploying $artifact to $server"
    
    # Create deployment directory with timestamp
    local timestamp=$(date +%Y%m%d_%H%M%S)
    local new_release_path="${deploy_path}/releases/${timestamp}"
    
    # Pre-deployment commands
    ssh "$deploy_user@$server" "mkdir -p ${deploy_path}/releases"
    
    # Transfer artifact
    if ! scp "$artifact" "$deploy_user@$server:/tmp/deploy_${timestamp}.tar.gz"; then
        log "error" "Failed to transfer artifact to $server"
        return 1
    fi
    
    # Extract and prepare
    if ! ssh "$deploy_user@$server" "
        set -e
        cd ${deploy_path}/releases
        mkdir -p ${timestamp}
        tar -xzf /tmp/deploy_${timestamp}.tar.gz -C ${timestamp}
        rm -f /tmp/deploy_${timestamp}.tar.gz
        
        # Run pre-deployment hooks if exists
        if [ -x ${timestamp}/deploy/pre-deploy.sh ]; then
            cd ${timestamp}
            ./deploy/pre-deploy.sh
        fi
    "; then
        log "error" "Failed to prepare deployment on $server"
        return 1
    fi
    
    # Store current version for rollback
    local current_link=$(ssh "$deploy_user@$server" "readlink ${deploy_path}/current" 2>/dev/null || echo "")
    if [ -n "$current_link" ]; then
        update_state "previous_release_${server}" "$current_link"
    fi
    
    # Atomic switch
    if ! ssh "$deploy_user@$server" "
        ln -sfn ${new_release_path} ${deploy_path}/current
        
        # Run post-deployment hooks
        if [ -x ${new_release_path}/deploy/post-deploy.sh ]; then
            cd ${new_release_path}
            ./deploy/post-deploy.sh
        fi
        
        # Reload/restart services
        sudo systemctl reload nginx || true
        sudo systemctl restart app || true
    "; then
        log "error" "Failed to activate deployment on $server"
        return 1
    fi
    
    log "info" "Successfully deployed to $server"
    return 0
}

# Rollback single server
rollback_server() {
    local server=$1
    local deploy_user=${2:-"deploy"}
    local deploy_path=${3:-"/var/www/app"}
    
    local previous_release=$(get_state "previous_release_${server}")
    if [ -z "$previous_release" ]; then
        log "error" "No previous release found for $server"
        return 1
    fi
    
    log "warning" "Rolling back $server to $previous_release"
    
    ssh "$deploy_user@$server" "
        ln -sfn $previous_release ${deploy_path}/current
        sudo systemctl reload nginx || true
        sudo systemctl restart app || true
    "
}

# Blue-Green deployment strategy
deploy_blue_green() {
    local artifact=$1
    local config=$2
    
    # Parse configuration
    local blue_servers=($(echo "$config" | jq -r '.blue_servers[]'))
    local green_servers=($(echo "$config" | jq -r '.green_servers[]'))
    local health_endpoint=$(echo "$config" | jq -r '.health_check.endpoint // "/health"')
    
    # Determine which environment is currently live
    local live_env=$(get_state "live_environment")
    local target_env
    local target_servers
    
    if [ "$live_env" = "blue" ]; then
        target_env="green"
        target_servers=("${green_servers[@]}")
    else
        target_env="blue"
        target_servers=("${blue_servers[@]}")
    fi
    
    log "info" "Starting blue-green deployment to $target_env environment"
    update_state "deployment_id" "$(date +%s)"
    update_state "deployment_status" "in_progress"
    
    # Deploy to target environment
    local failed_servers=()
    for server in "${target_servers[@]}"; do
        if ! deploy_to_server "$server" "$artifact"; then
            failed_servers+=("$server")
        fi
    done
    
    if [ ${#failed_servers[@]} -gt 0 ]; then
        log "error" "Deployment failed on servers: ${failed_servers[*]}"
        update_state "deployment_status" "failed"
        return 1
    fi
    
    # Health check all target servers
    log "info" "Running health checks on $target_env environment"
    for server in "${target_servers[@]}"; do
        if ! perform_health_check "$server" "$health_endpoint"; then
            log "error" "Health check failed for $server"
            update_state "deployment_status" "failed"
            
            # Rollback is not needed in blue-green as we haven't switched traffic
            return 1
        fi
    done
    
    # Switch traffic
    log "info" "Switching traffic to $target_env environment"
    for server in "${target_servers[@]}"; do
        update_load_balancer "enable" "$server"
    done
    
    # Drain old environment
    local old_servers
    if [ "$target_env" = "blue" ]; then
        old_servers=("${green_servers[@]}")
    else
        old_servers=("${blue_servers[@]}")
    fi
    
    for server in "${old_servers[@]}"; do
        update_load_balancer "drain" "$server"
    done
    
    # Wait for draining
    for server in "${old_servers[@]}"; do
        wait_for_drain "$server"
        update_load_balancer "maint" "$server"
    done
    
    # Update state
    update_state "live_environment" "$target_env"
    update_state "deployment_status" "completed"
    update_state "deployment_end" "$(date +%s)"
    
    log "info" "Blue-green deployment completed successfully"
}

# Rolling deployment strategy
deploy_rolling() {
    local artifact=$1
    local config=$2
    
    # Parse configuration
    local servers=($(echo "$config" | jq -r '.servers[]'))
    local batch_size=$(echo "$config" | jq -r '.batch_size // 1')
    local health_endpoint=$(echo "$config" | jq -r '.health_check.endpoint // "/health"')
    local pause_between_batches=$(echo "$config" | jq -r '.pause_seconds // 30')
    
    log "info" "Starting rolling deployment (batch size: $batch_size)"
    update_state "deployment_id" "$(date +%s)"
    update_state "deployment_status" "in_progress"
    
    # Process servers in batches
    local total_servers=${#servers[@]}
    local deployed=0
    
    while [ $deployed -lt $total_servers ]; do
        local batch_servers=()
        local batch_end=$((deployed + batch_size))
        
        # Get servers for this batch
        for ((i=deployed; i<batch_end && i<total_servers; i++)); do
            batch_servers+=("${servers[$i]}")
        done
        
        log "info" "Processing batch: ${batch_servers[*]}"
        
        # Drain servers in batch
        for server in "${batch_servers[@]}"; do
            update_load_balancer "drain" "$server"
        done
        
        # Wait for all to drain
        for server in "${batch_servers[@]}"; do
            wait_for_drain "$server"
        done
        
        # Deploy to batch
        local batch_failed=false
        for server in "${batch_servers[@]}"; do
            if ! deploy_to_server "$server" "$artifact"; then
                log "error" "Deployment failed on $server"
                batch_failed=true
                break
            fi
            
            # Health check immediately
            if ! perform_health_check "$server" "$health_endpoint"; then
                log "error" "Health check failed for $server"
                batch_failed=true
                break
            fi
            
            # Re-enable in load balancer
            update_load_balancer "ready" "$server"
        done
        
        if [ "$batch_failed" = true ]; then
            log "error" "Batch deployment failed, initiating rollback"
            
            # Rollback all deployed servers
            for ((i=0; i<deployed+${#batch_servers[@]}; i++)); do
                rollback_server "${servers[$i]}"
                update_load_balancer "ready" "${servers[$i]}"
            done
            
            update_state "deployment_status" "failed_rollback_completed"
            return 1
        fi
        
        deployed=$((deployed + ${#batch_servers[@]}))
        
        # Pause between batches if not the last batch
        if [ $deployed -lt $total_servers ]; then
            log "info" "Pausing $pause_between_batches seconds before next batch"
            sleep "$pause_between_batches"
        fi
    done
    
    update_state "deployment_status" "completed"
    update_state "deployment_end" "$(date +%s)"
    log "info" "Rolling deployment completed successfully"
}

# Canary deployment strategy
deploy_canary() {
    local artifact=$1
    local config=$2
    
    # Parse configuration
    local servers=($(echo "$config" | jq -r '.servers[]'))
    local canary_duration=$(echo "$config" | jq -r '.canary_duration // 300')
    local success_rate_threshold=$(echo "$config" | jq -r '.success_rate_threshold // 99.5')
    local health_endpoint=$(echo "$config" | jq -r '.health_check.endpoint // "/health"')
    
    # Calculate canary size
    local total_servers=${#servers[@]}
    local canary_count=$((total_servers * CANARY_PERCENTAGE / 100))
    [ $canary_count -lt 1 ] && canary_count=1
    
    log "info" "Starting canary deployment ($canary_count of $total_servers servers)"
    
    # Deploy to canary servers
    local canary_servers=("${servers[@]:0:$canary_count}")
    for server in "${canary_servers[@]}"; do
        update_load_balancer "drain" "$server"
        wait_for_drain "$server"
        
        if ! deploy_to_server "$server" "$artifact"; then
            log "error" "Canary deployment failed on $server"
            rollback_server "$server"
            update_load_balancer "ready" "$server"
            return 1
        fi
        
        if ! perform_health_check "$server" "$health_endpoint"; then
            log "error" "Canary health check failed for $server"
            rollback_server "$server"
            update_load_balancer "ready" "$server"
            return 1
        fi
        
        update_load_balancer "ready" "$server"
    done
    
    # Monitor canary servers
    log "info" "Monitoring canary servers for $canary_duration seconds"
    local start_time=$(date +%s)
    local end_time=$((start_time + canary_duration))
    
    while [ $(date +%s) -lt $end_time ]; do
        # Check error rates
        local error_rate=$(calculate_error_rate "${canary_servers[@]}")
        local success_rate=$(echo "100 - $error_rate" | bc)
        
        if (( $(echo "$success_rate < $success_rate_threshold" | bc -l) )); then
            log "error" "Canary success rate ($success_rate%) below threshold ($success_rate_threshold%)"
            
            # Rollback canary servers
            for server in "${canary_servers[@]}"; do
                update_load_balancer "drain" "$server"
            done
            
            for server in "${canary_servers[@]}"; do
                wait_for_drain "$server"
                rollback_server "$server"
                update_load_balancer "ready" "$server"
            done
            
            return 1
        fi
        
        sleep 10
    done
    
    log "info" "Canary phase successful, proceeding with full deployment"
    
    # Deploy to remaining servers using rolling strategy
    local remaining_servers=("${servers[@]:$canary_count}")
    local remaining_config=$(echo "$config" | jq --argjson servers "$(printf '%s\n' "${remaining_servers[@]}" | jq -R . | jq -s .)" '.servers = $servers')
    
    deploy_rolling "$artifact" "$remaining_config"
}

# Calculate error rate from logs/metrics
calculate_error_rate() {
    local servers=("$@")
    
    # This would integrate with your metrics system
    # For demo, returning mock value
    echo "0.5"
}

# Logging
log() {
    local level=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a /var/log/deploy.log
}

# Main deployment orchestrator
main() {
    local artifact=$1
    local strategy=${2:-"rolling"}
    
    if [ ! -f "$artifact" ]; then
        log "error" "Artifact not found: $artifact"
        exit 1
    fi
    
    if [ ! -f "$CONFIG_FILE" ]; then
        log "error" "Configuration not found: $CONFIG_FILE"
        exit 1
    fi
    
    local config=$(cat "$CONFIG_FILE")
    
    # Validate strategy
    if [ -z "${STRATEGIES[$strategy]}" ]; then
        log "error" "Unknown deployment strategy: $strategy"
        exit 1
    fi
    
    # Execute deployment
    ${STRATEGIES[$strategy]} "$artifact" "$config"
}

if [ "${BASH_SOURCE[0]}" = "${0}" ]; then
    main "$@"
fi

Q5: Create a disaster recovery automation system that handles failover between regions
This demonstrates complex orchestration and state management:
#!/bin/bash

# Multi-region disaster recovery automation system
# Handles database replication, DNS failover, and application state synchronization

readonly DR_CONFIG="/etc/disaster-recovery/config.json"
readonly STATE_DIR="/var/lib/dr-automation"
readonly HEALTH_CHECK_INTERVAL=30
readonly FAILOVER_THRESHOLD=3  # Consecutive failures before failover
readonly RPO_WARNING_THRESHOLD=300  # 5 minutes

mkdir -p "$STATE_DIR"

# Region health tracking
declare -A REGION_HEALTH
declare -A REGION_FAILURES
declare -A LAST_REPLICATION_LAG

# Load configuration
load_config() {
    if [ ! -f "$DR_CONFIG" ]; then
        log "error" "DR configuration not found"
        exit 1
    fi
    
    # Export configuration as environment variables
    eval "$(jq -r '
        .regions[] | 
        "REGION_\(.name | ascii_upcase)_ENDPOINT=\"\(.endpoint)\"",
        "REGION_\(.name | ascii_upcase)_DB_HOST=\"\(.database.host)\"",
        "REGION_\(.name | ascii_upcase)_DB_PORT=\"\(.database.port // 5432)\""
    ' "$DR_CONFIG")"
    
    # Get all regions
    REGIONS=($(jq -r '.regions[].name' "$DR_CONFIG"))
    PRIMARY_REGION=$(jq -r '.primary_region' "$DR_CONFIG")
}

# Comprehensive health check for a region
check_region_health() {
    local region=$1
    local endpoint_var="REGION_${region^^}_ENDPOINT"
    local endpoint="${!endpoint_var}"
    
    log "debug" "Checking health of region: $region"
    
    # Application health check
    local app_health=false
    local http_status=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 5 --max-time 10 \
        "https://$endpoint/health" || echo "000")
    
    if [ "$http_status" = "200" ]; then
        app_health=true
    fi
    
    # Database health check
    local db_health=false
    local db_host_var="REGION_${region^^}_DB_HOST"
    local db_port_var="REGION_${region^^}_DB_PORT"
    local db_host="${!db_host_var}"
    local db_port="${!db_port_var}"
    
    if pg_isready -h "$db_host" -p "$db_port" -t 5 >/dev/null 2>&1; then
        db_health=true
    fi
    
    # Check replication lag if not primary
    local replication_ok=true
    if [ "$region" != "$PRIMARY_REGION" ]; then
        local lag=$(check_replication_lag "$region")
        LAST_REPLICATION_LAG[$region]=$lag
        
        if [ "$lag" -gt "$RPO_WARNING_THRESHOLD" ]; then
            log "warning" "High replication lag for $region: ${lag}s"
            replication_ok=false
        fi
    fi
    
    # Overall health status
    if [ "$app_health" = true ] && [ "$db_health" = true ] && [ "$replication_ok" = true ]; then
        REGION_HEALTH[$region]="healthy"
        REGION_FAILURES[$region]=0
        return 0
    else
        REGION_HEALTH[$region]="unhealthy"
        REGION_FAILURES[$region]=$((${REGION_FAILURES[$region]:-0} + 1))
        
        log "error" "Region $region health check failed - App: $app_health, DB: $db_health, Replication: $replication_ok"
        return 1
    fi
}

# Check database replication lag
check_replication_lag() {
    local region=$1
    local db_host_var="REGION_${region^^}_DB_HOST"
    local db_host="${!db_host_var}"
    
    # Get replication lag in seconds
    local lag=$(psql -h "$db_host" -U postgres -t -c "
        SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int
        WHERE pg_is_in_recovery();" 2>/dev/null | tr -d ' ')
    
    if [ -z "$lag" ] || [ "$lag" = "NULL" ]; then
        echo "0"
    else
        echo "$lag"
    fi
}

# Perform DNS failover
perform_dns_failover() {
    local from_region=$1
    local to_region=$2
    local domain=$(jq -r '.domain' "$DR_CONFIG")
    local dns_provider=$(jq -r '.dns_provider' "$DR_CONFIG")
    
    log "warning" "Initiating DNS failover from $from_region to $to_region"
    
    case "$dns_provider" in
        route53)
            # AWS Route53 failover
            local hosted_zone=$(aws route53 list-hosted-zones-by-name \
                --query "HostedZones[?Name=='${domain}.'].Id" \
                --output text | cut -d'/' -f3)
            
            local to_endpoint_var="REGION_${to_region^^}_ENDPOINT"
            local to_endpoint="${!to_endpoint_var}"
            
            # Update DNS record
            aws route53 change-resource-record-sets \
                --hosted-zone-id "$hosted_zone" \
                --change-batch "{
                    \"Changes\": [{
                        \"Action\": \"UPSERT\",
                        \"ResourceRecordSet\": {
                            \"Name\": \"${domain}\",
                            \"Type\": \"CNAME\",
                            \"TTL\": 60,
                            \"ResourceRecords\": [{\"Value\": \"${to_endpoint}\"}]
                        }
                    }]
                }"
            ;;
        
        cloudflare)
            # Cloudflare API
            local zone_id=$(curl -s -X GET "https://api.cloudflare.com/client/v4/zones?name=$domain" \
                -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | \
                jq -r '.result[0].id')
            
            local record_id=$(curl -s -X GET \
                "https://api.cloudflare.com/client/v4/zones/$zone_id/dns_records?name=$domain" \
                -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" | \
                jq -r '.result[0].id')
            
            local to_endpoint_var="REGION_${to_region^^}_ENDPOINT"
            local to_endpoint="${!to_endpoint_var}"
            
            curl -X PUT "https://api.cloudflare.com/client/v4/zones/$zone_id/dns_records/$record_id" \
                -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
                -H "Content-Type: application/json" \
                --data "{\"type\":\"CNAME\",\"name\":\"$domain\",\"content\":\"$to_endpoint\",\"ttl\":60}"
            ;;
    esac
    
    log "info" "DNS failover initiated, propagation may take up to 60 seconds"
}

# Promote standby database to primary
promote_database() {
    local region=$1
    local db_host_var="REGION_${region^^}_DB_HOST"
    local db_host="${!db_host_var}"
    
    log "warning" "Promoting database in region $region to primary"
    
    # Execute promotion (PostgreSQL example)
    ssh "postgres@$db_host" "pg_ctl promote -D /var/lib/postgresql/data"
    
    # Update application configuration to point to new primary
    local app_servers=($(jq -r ".regions[] | select(.name==\"$region\") | .app_servers[]" "$DR_CONFIG"))
    
    for server in "${app_servers[@]}"; do
        ssh "deploy@$server" "
            sed -i 's/^DB_HOST=.*/DB_HOST=$db_host/' /etc/app/config
            sudo systemctl restart app
        "
    done
}

# Sync application state
sync_application_state() {
    local from_region=$1
    local to_region=$2
    
    log "info" "Syncing application state from $from_region to $to_region"
    
    # Sync Redis/cache data
    local from_redis=$(jq -r ".regions[] | select(.name==\"$from_region\") | .redis.host" "$DR_CONFIG")
    local to_redis=$(jq -r ".regions[] | select(.name==\"$to_region\") | .redis.host" "$DR_CONFIG")
    
    # Create Redis dump and restore
    ssh "redis@$from_redis" "redis-cli BGSAVE && sleep 2"
    ssh "redis@$from_redis" "cat /var/lib/redis/dump.rdb" | \
        ssh "redis@$to_redis" "sudo systemctl stop redis && cat > /var/lib/redis/dump.rdb && sudo systemctl start redis"
    
    # Sync session data if using file-based sessions
    local from_servers=($(jq -r ".regions[] | select(.name==\"$from_region\") | .app_servers[]" "$DR_CONFIG"))
    local to_servers=($(jq -r ".regions[] | select(.name==\"$to_region\") | .app_servers[]" "$DR_CONFIG"))
    
    # Use parallel for efficiency
    printf '%s\n' "${from_servers[@]}" | parallel -j 4 "
        rsync -av

Bash Shell Scripting Interview Questions And Answers
Whether you're just starting out or have years of experience, bash shell interview questions test your understanding of the shell's unique features, syntax quirks, and best practices that make bash the go-to choice for system automation. We'll cover the essential bash-specific concepts that interviewers love to explore - from array manipulation and string operations to process substitution and advanced parameter expansion. These questions focus on bash's distinctive capabilities that set it apart from basic POSIX shell scripting.
Q1: What's the difference between [ ] and [[ ]] in bash?
This is a classic that trips up many people. The single bracket [ ] is actually a command (same as test), while double brackets [[ ]] are a bash keyword with enhanced functionality:
[[ ]] supports pattern matching: [[ $string == pa* ]] works, but [ $string == pa* ] doesn't
[[ ]] handles empty variables better: [[ -z $empty ]] works fine, while [ -z $empty ] needs quotes
[[ ]] supports regex matching with =~ operator
[[ ]] allows && and || inside: [[ $a -eq 1 && $b -eq 2 ]]
The key insight: always use [[ ]] in bash unless you need POSIX compatibility. It's safer and more powerful.
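These differences are easy to demonstrate in a few lines; here's a minimal runnable sketch (bash-only, since [[ ]] is not POSIX):

```shell
#!/usr/bin/env bash

string="pattern"
empty=""
a=1; b=2

# Glob-style pattern matching works inside [[ ]]
[[ $string == pa* ]] && echo "glob match"

# Empty/unset variables are safe unquoted inside [[ ]]
[[ -z $empty ]] && echo "empty detected"

# Regex matching with the =~ operator
[[ $string =~ ^pat ]] && echo "regex match"

# && is allowed inside a single [[ ]]
[[ $a -eq 1 && $b -eq 2 ]] && echo "both true"
```

All four tests succeed, so the script prints all four messages. Try the same patterns with single brackets and the glob and regex lines fail outright.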
Q2: Explain bash arrays and how they differ from regular variables
Bash supports both indexed and associative arrays, which many people don't fully utilize:
Indexed arrays:
arr=(apple banana cherry)
echo ${arr[0]}  # apple
echo ${arr[@]}  # all elements
echo ${#arr[@]} # array length (3)

Associative arrays (bash 4+):
declare -A hash
hash[name]="John"
hash[age]=30
echo ${hash[name]}  # John

The tricky part is that bash arrays are sparse - you can have arr[0] and arr[10] without arr[1-9]. Also, array indices start at 0, not 1 like in some shells.
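The sparseness is easy to see with ${!arr[@]}, which lists the indices actually in use:

```shell
#!/usr/bin/env bash

arr=(apple banana cherry)
arr[10]="kiwi"          # indices 3-9 simply never exist

echo "${#arr[@]}"       # 4 - length counts elements, not highest index + 1
echo "${!arr[@]}"       # 0 1 2 10 - the indices actually in use

unset 'arr[1]'          # deleting an element leaves a hole
echo "${!arr[@]}"       # 0 2 10
```

This is why iterating with `for ((i=0; i<${#arr[@]}; i++))` is unsafe on arrays you've deleted from; loop over "${!arr[@]}" or "${arr[@]}" instead.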
Q3: How does command substitution work and what's the difference between `` and $()?
Both capture command output, but $() is strongly preferred:
$() nests easily: echo $(echo $(date))
Backticks require escaping for nesting: echo `echo \`date\``
$() is clearer to read, especially with syntax highlighting
$() handles quotes more predictably
Example showing the difference:
# This works fine
result=$(echo "hello $(echo "world")")

# This is a nightmare with backticks
result=`echo "hello \`echo \"world\"\`"`

Q4: What are bash parameter expansions and give some advanced examples?
Parameter expansion is bash's Swiss Army knife for string manipulation:
${var:-default} - Use default if var is unset/null
${var:=default} - Set var to default if unset/null
${var:?error} - Exit with error if var is unset/null
${var:+alternate} - Use alternate if var is set
Advanced string manipulation:
path="/home/user/documents/file.txt"
echo ${path##*/}  # file.txt (basename)
echo ${path%/*}   # /home/user/documents (dirname)
echo ${path/documents/archive}  # String replacement
echo ${path^^}    # Convert to uppercase
echo ${path,,}    # Convert to lowercase

The pattern: # removes from beginning, % from end. Double it (##, %%) for greedy matching.
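A side-by-side example makes the greedy vs non-greedy behavior concrete:

```shell
#!/usr/bin/env bash

archive="backup.tar.gz"
echo "${archive%.*}"    # backup.tar - % strips the shortest suffix match
echo "${archive%%.*}"   # backup     - %% strips the longest suffix match

path="/home/user/documents/file.txt"
echo "${path#*/}"       # home/user/documents/file.txt - # strips shortest prefix
echo "${path##*/}"      # file.txt - ## strips longest prefix (up to the last /)
```

Since these run inside the shell itself, they're much faster than spawning basename, dirname, or sed in a loop.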
Q5: Explain the different types of expansions in bash and their order
Bash performs expansions in a specific order, and understanding this prevents many bugs:
Brace expansion: {a,b,c} or {1..10}
Tilde expansion: ~ becomes home directory
Parameter/variable expansion: $var or ${var}
Command substitution: $(command) or `command`
Arithmetic expansion: $((2 + 2))
Word splitting: Based on IFS
Pathname expansion: *.txt (globbing)
Why this matters:
# This surprises people: brace expansion runs BEFORE parameter expansion
var="{1..3}"
echo $var   # prints {1..3} literally - the braces arrive too late

# This works
echo {1..3}  # 1 2 3 - brace expansion sees the literal braces

Q6: What is process substitution and when would you use it?
Process substitution <(command) creates a temporary file descriptor that acts like a file:
# Compare outputs of two commands
diff <(ls dir1) <(ls dir2)

# Read from multiple commands simultaneously
while read -u 3 line1 && read -u 4 line2; do
    echo "File1: $line1 | File2: $line2"
done 3< <(cat file1) 4< <(cat file2)

# Use with commands that only accept files
grep "pattern" <(curl -s https://example.com)

It's incredibly useful when you need to treat command output as a file without creating temporary files. The key limitation: it's not POSIX, so it's bash/zsh specific.
Q7: How do you handle signals and traps in bash?
Traps let you intercept signals and run cleanup code:
# Basic trap syntax
trap 'echo "Ctrl+C pressed"' INT
trap 'cleanup_function' EXIT
trap 'echo "Error on line $LINENO"' ERR

# Remove trap
trap - INT

# List all signals
trap -l

# Useful pattern: cleanup on any exit
cleanup() {
    rm -f /tmp/tempfile.$$
    echo "Cleaned up"
}
trap cleanup EXIT INT TERM

Pro tip: Always use single quotes in trap commands to prevent premature expansion. The ERR trap is especially useful with set -e for debugging.
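The quoting difference is visible in a pair of subshells: with double quotes the variable expands when the trap is set; with single quotes it expands when the trap fires:

```shell
#!/usr/bin/env bash

(
    file="original"
    trap "echo early: $file" EXIT   # $file expands NOW
    file="changed"
)   # prints: early: original

(
    file="original"
    trap 'echo late: $file' EXIT    # $file expands when the trap runs
    file="changed"
)   # prints: late: changed
```

For cleanup traps you almost always want the single-quoted (late-binding) form, so the trap sees the final state of the script.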
Q8: Explain bash's set options and their impact on script behavior
The set command changes bash's behavior fundamentally:
set -e (errexit): Exit on any command failure
set -u (nounset): Exit on undefined variable
set -o pipefail: Pipe fails if any command fails
set -x (xtrace): Print commands before execution
Best practice combo:
set -euo pipefail  # Strict mode
set -E  # ERR trap inherits to functions
set -T  # DEBUG/RETURN traps inherit to functions

You can also use set +e to temporarily disable, and $- contains current options. The gotcha: set -e doesn't work in certain contexts like if conditions.
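The if-condition gotcha is worth seeing in action - errexit is suspended for the entire command being tested, including every command inside a function it calls:

```shell
#!/usr/bin/env bash
set -e

fails() {
    false            # would normally abort the script under set -e
    echo "still ran" # ...but runs anyway when called from an if condition
}

if fails; then       # errexit is disabled for the whole call
    echo "condition was true"   # fails() returned echo's status: 0
fi
echo "script continues"
```

All three lines print, which is exactly why relying on set -e inside functions used as conditions is a well-known trap.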
Q9: What are coprocesses in bash and how do you use them?
Coprocesses (bash 4+) let you run background processes with two-way communication:
# Start a coprocess
coproc BC { bc; }

# Send data to coprocess
echo "2+2" >&${BC[1]}

# Read result
read result <&${BC[0]}
echo $result  # 4

# More complex example with named coprocess
coproc GREP { grep --line-buffered "error"; }
echo "this is fine" >&${GREP[1]}
echo "error occurred" >&${GREP[1]}
read -t 1 line <&${GREP[0]} && echo "Got: $line"

Coprocesses are useful for maintaining persistent connections to programs like bc, awk, or database clients without the overhead of starting new processes.
Q10: How does bash handle arithmetic and what are the different ways to perform calculations?
Bash offers multiple arithmetic methods, each with different capabilities:
# Arithmetic expansion - integers only
result=$((5 + 3))
((count++))  # C-style increment

# let command
let "result = 5 + 3"

# expr command (external, avoid)
result=$(expr 5 + 3)

# bc for floating point
result=$(echo "scale=2; 5/3" | bc)

# Arithmetic comparison in conditions
if ((5 > 3)); then echo "yes"; fi

# C-style for loop
for ((i=0; i<10; i++)); do
    echo $i
done

Bash only supports integer arithmetic natively. For floating-point, use bc or awk. The ((...)) construct exits with status 0 when the expression evaluates to nonzero (true) and status 1 when it evaluates to zero - consistent with bash's usual success convention, but it means expressions like ((count++)) starting from 0 return status 1 and can abort a set -e script.
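The exit-status behavior of (( )) is easy to verify directly - and it explains a classic set -e surprise:

```shell
#!/usr/bin/env bash

(( 5 > 3 )); echo $?   # 0 - true expression (nonzero value) succeeds
(( 0 ));     echo $?   # 1 - expression evaluating to zero "fails"

# Classic gotcha: post-increment returns the OLD value, so when
# count starts at 0, ((count++)) exits with status 1
count=0
(( count++ )) || true  # guard it, or use ((++count)) / count=$((count+1))
echo "$count"          # 1 - the increment itself still happened
```

Under set -e, an unguarded ((count++)) with count=0 kills the script even though the variable is updated correctly.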
Q11: What's the difference between source, ., and executing a script?
These three methods have crucial differences:
./script.sh - Runs in a separate child process, can't modify parent environment
source script.sh or . script.sh - Runs in current shell, can modify environment
exec script.sh - Replaces current shell process entirely
Example showing the difference:
# In script.sh:
export MY_VAR="hello"

# Method 1: Won't see MY_VAR after
./script.sh
echo $MY_VAR  # Empty

# Method 2: Will see MY_VAR
source script.sh
echo $MY_VAR  # hello

# Method 3: Never returns to original shell
exec script.sh
echo "Never printed"

Use source for configuration files, ./ for independent scripts, and exec for wrapper scripts.
Q12: How do you properly handle spaces and special characters in bash?
This is where most scripts break. Key strategies:
# Always quote variables
bad:  rm $file
good: rm "$file"

# Use arrays for multiple items
files=("file 1.txt" "file 2.txt")
for f in "${files[@]}"; do
    cat "$f"
done

# Read with proper IFS handling
while IFS= read -r line; do
    echo "$line"
done < file.txt

# Handle filenames with newlines
find . -name "*.txt" -print0 | while IFS= read -r -d '' file; do
    echo "Processing: $file"
done

# Safe command construction
cmd=(ls -la "/path with spaces/")
"${cmd[@]}"  # Executes correctly

The golden rule: If it's not a numeric value you're absolutely sure about, quote it. Use "$@" not $* for arguments, and always use -r with read to prevent backslash interpretation.
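The "$@" vs $* difference shows up the moment an argument contains a space - a quick sketch:

```shell
#!/usr/bin/env bash

count_args() { echo "$#"; }

via_star() { count_args $*; }    # unquoted $* re-splits on whitespace
via_at()   { count_args "$@"; }  # "$@" preserves argument boundaries

via_star "one two" three   # 3 - "one two" got split into two words
via_at   "one two" three   # 2 - both arguments survive intact
```

This is why wrapper scripts and functions should pass arguments along with "$@" almost without exception.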

Conclusion

We have discussed the top shell scripting interview questions and answers that may be asked in a DevOps interview. For brevity, we have listed only some of the most important questions and answers. More questions and answers are covered when you take up the DevOps course at StarAgile institute. StarAgile is an industry-recognized institute for DevOps training online, which can be taken from the comfort of your home or office. We recommend enquiring about the DevOps training at StarAgile institute and registering for the various certification benefits.

About Author
Karan Gupta

Cloud Engineer

AWS DevOps Engineer with 6 years of experience in designing, implementing, automating, and maintaining cloud infrastructure on Amazon Web Services (AWS).