1 - Administration

Tools and guides for administering Linux systems.

Find automation helpers and reference material that streamline Linux administration tasks.

1.1 - Linux Boot/Shutdown Quick-Analyzer

Parse Linux boot and shutdown events from system and serial console logs.

What is Linux Boot/Shutdown Quick-Analyzer and why do you need it?

The Linux Boot/Shutdown Quick-Analyzer is a Python script that analyzes Linux boot and shutdown logs.


It makes it easy to find boot and shutdown events in system logs, reporting how many of each occurred along with the timestamp, filename, and line number of every event.
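Conceptually, the script scans each log line for boot and shutdown markers and records where it found them. A minimal sketch of that idea in Python (the regex patterns and function name here are illustrative, not the script's actual implementation, which recognizes many more log formats):

```python
import re

# Illustrative markers only; the real script matches many more patterns.
BOOT_RE = re.compile(r"(Linux version|systemd\[1\]: Startup finished)")
SHUTDOWN_RE = re.compile(r"(Power-Off|reboot: |systemd-shutdown)")

def find_events(filename, lines):
    """Return (kind, timestamp, filename, line_number) for each event found."""
    events = []
    for num, line in enumerate(lines, start=1):
        timestamp = line[:15]  # classic syslog prefix, e.g. "Jan 01 12:00:00"
        if BOOT_RE.search(line):
            events.append(("boot", timestamp, filename, num))
        elif SHUTDOWN_RE.search(line):
            events.append(("shutdown", timestamp, filename, num))
    return events

sample = [
    "Jan 01 12:00:00 host kernel: Linux version 6.1.0",
    "Jan 01 12:00:05 host sshd[910]: Server listening",
    "Jan 02 08:30:00 host systemd-shutdown[1]: Syncing filesystems",
]
for event in find_events("messages", sample):
    print(event)
```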


How to run it

  1. You need to have Python 3 installed on your machine.

  2. To use the script, navigate to the directory containing your logs (usually /var/log) and run the following command:

# Make sure you are working inside your system logs dir
cd /var/log

# Execute the script
curl https://raw.githubusercontent.com/samatild/linuxrebootcheck/main/linuxrebootcheck.py | python3

Special arguments for special Serial Console logs!

What about those logs that are not in the usual format?

To analyze a Serial Console log file, download the script and use the --serial-console-log argument followed by the filename:

curl -O https://raw.githubusercontent.com/samatild/linuxrebootcheck/main/linuxrebootcheck.py
python3 linuxrebootcheck.py --serial-console-log <your_log_file>

Conclusion

Overall, the Linux Boot/Shutdown Quick-Analyzer is a useful tool for analyzing Linux boot and shutdown logs. It provides a quick and easy way to find boot and shutdown events in system logs, making it easier to troubleshoot issues and identify potential problems. If you’re looking for a simple and efficient way to analyze your system logs, give it a try!

Source

🔗 samatild/linuxrebootcheck

1.2 - How Antivirus Software Can Prevent Linux Boot: Troubleshooting Guide

Learn how antivirus software can interfere with Linux system boot, including readonly filesystem problems, LSM conflicts, and CrowdStrike Falcon Sensor troubleshooting.

Understanding Antivirus Boot Interference

Antivirus software, while crucial for system security, can sometimes interfere with the Linux boot process. This occurs when security modules become overly aggressive during system initialization, potentially causing boot failures, readonly filesystem mounts, or service startup issues.

Common Symptoms

  • System fails to boot completely
  • Filesystem mounts as readonly (ro) instead of read-write (rw)
  • Critical services fail to start
  • Boot hangs at specific points
  • SELinux/AppArmor policy violations during boot

Filesystem Readonly Issues

One of the most common problems occurs when antivirus software causes the root filesystem to mount readonly. This prevents the system from writing critical boot files and can halt the initialization process.
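You can confirm this symptom quickly by checking how the root filesystem is currently mounted. A small sketch that parses /proc/mounts-style output (shown here against a sample string so the logic is clear; on a live system you would feed it the contents of /proc/mounts):

```python
def root_is_readonly(mounts_text):
    """Return True if the '/' mount carries the 'ro' option."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mountpoint, fstype, options, ...
        if len(fields) >= 4 and fields[1] == "/":
            options = fields[3].split(",")
            return "ro" in options
    return False

sample = "/dev/sda1 / ext4 ro,relatime 0 0\n/dev/sda2 /home ext4 rw,relatime 0 0"
print(root_is_readonly(sample))  # True for this sample

# Live usage: root_is_readonly(open("/proc/mounts").read())
```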

Root Cause Analysis

Antivirus software often implements filesystem integrity checking or real-time scanning that can interfere with:

  • Journal replay during filesystem mounting
  • Metadata updates during boot
  • Temporary file creation in /tmp, /var, /run

Example Scenarios

Scenario 1: Journal Corruption Detection

[   12.345678] EXT4-fs (sda1): INFO: recovery required on readonly filesystem
[   12.345678] EXT4-fs (sda1): write access unavailable, cannot proceed
[   12.345678] EXT4-fs (sda1): recovery failed, mounting readonly

Scenario 2: Real-time Scanner Blocking Writes

[   15.678901] systemd[1]: Failed to start Local File Systems.
[   15.678901] systemd[1]: Dependency failed for Remote File Systems.
[   15.678901] mount[1234]: mount: / cannot be mounted read-write

Recovery Steps

  1. Boot into recovery mode or single-user mode:
# At GRUB menu, press 'e' to edit
# Add 'single' or 'recovery' to kernel parameters
linux /boot/vmlinuz-... ro single
  2. Check filesystem integrity:
# Run filesystem check
fsck -f /dev/sda1

# If issues persist, check dmesg for antivirus-related messages
dmesg | grep -i "antivirus\|security\|scanner"
  3. Temporarily disable antivirus during boot:
# For systemd-based systems, mask the service temporarily
systemctl mask antivirus-service-name
systemctl reboot

Linux Security Modules (LSM) Conflicts

Linux Security Modules (LSM) provide the framework for security subsystems like SELinux, AppArmor, and various antivirus solutions. When multiple LSMs are active or improperly configured, they can conflict during boot.

LSM Architecture Overview

LSM hooks into the kernel at critical points:

  • Process creation and execution
  • File access operations
  • Network operations
  • Memory management

Common LSM Boot Conflicts

SELinux + Antivirus LSM:

  • Both may attempt to enforce policies on the same resources
  • Race conditions during policy loading
  • Conflicting access decisions

AppArmor Profile Loading:

[FAILED] Failed to load AppArmor profiles
[FAILED] apparmor.service: Main process exited, code=exited, status=1/FAILURE

Troubleshooting LSM Issues

  1. Check LSM status:
# View active LSMs
cat /sys/kernel/security/lsm

# Check SELinux status
sestatus

# Check AppArmor status
apparmor_status
  2. Boot with permissive mode:
# For SELinux
linux /boot/vmlinuz-... selinux=0

# For AppArmor
linux /boot/vmlinuz-... apparmor=0
  3. Review security logs:
# Check audit logs for LSM denials
ausearch -m avc -ts boot

# View journal for security module errors
journalctl -b | grep -i "security\|lsm\|selinux\|apparmor"

CrowdStrike Falcon Sensor Boot Issues

CrowdStrike Falcon Sensor is a common enterprise antivirus solution that can cause boot problems when misconfigured. The sensor requires proper licensing and network connectivity to function correctly.

The Critical Error

When CrowdStrike Falcon Sensor fails during boot, you may see:

[FAILED] Failed to start CrowdStrike Falcon Sensor.

This failure can cascade into other issues:

  • System may continue booting but without security protection
  • Network services may fail if the sensor blocks them
  • Filesystem operations may be restricted

Root Causes

  1. Missing or invalid license
  2. Network connectivity issues during sensor initialization
  3. Conflicting security policies
  4. Outdated sensor version
  5. Improper installation or configuration

Immediate Fix: Masking the Service

When the CrowdStrike service fails and blocks system access, you can temporarily mask it to allow the system to boot:

# Check the exact service name
systemctl list-units --all | grep -i crowdstrike

# Mask the service to prevent automatic startup
sudo systemctl mask falcon-sensor

# Reboot the system
sudo systemctl reboot

Permanent Solutions

  1. Verify licensing:
# Check CrowdStrike status
/opt/CrowdStrike/falconctl -g --cid

# If CID is missing, contact your administrator
  2. Update sensor:
# Update CrowdStrike sensor
/opt/CrowdStrike/falconctl -s --update

# Or reinstall if update fails
  3. Network configuration:
# Ensure DNS resolution works
nslookup falcon.crowdstrike.com

# Check proxy settings if applicable
env | grep -i proxy
  4. Configuration validation:
# Check sensor configuration
/opt/CrowdStrike/falconctl -g --tags
/opt/CrowdStrike/falconctl -g --version

Prevention Best Practices

  • Test updates in staging environments
  • Maintain current licensing
  • Monitor sensor health regularly
  • Have rollback procedures documented

General Troubleshooting Framework

Boot Analysis Steps

  1. Collect boot logs:
# View current boot logs
journalctl -b

# Save logs for analysis
journalctl -b > boot_logs.txt
  2. Identify the failing component:
# Check failed services
systemctl --failed

# Review systemd boot timeline
systemd-analyze blame
  3. Isolate antivirus components:
# List security-related services
systemctl list-units --type=service | grep -E "(security|antivirus|falcon|clamav)"

# Temporarily disable for testing
sudo systemctl stop antivirus-service
sudo systemctl disable antivirus-service

Recovery Options

Option 1: Clean Boot

  • Disable all non-essential services
  • Boot with minimal security modules
  • Gradually re-enable components

Option 2: Recovery Environment

  • Use live USB/CD for filesystem repair
  • Access encrypted volumes if necessary
  • Reinstall antivirus software if corrupted

Option 3: Kernel Parameters

# Boot parameters for troubleshooting
linux /boot/vmlinuz-... ro quiet splash security= selinux=0 apparmor=0

Prevention and Best Practices

System Configuration

  1. Proper service ordering:
# Ensure antivirus starts after critical filesystems
# Edit service files to add proper dependencies
systemctl edit antivirus-service
  2. Exclude system paths:
# Configure antivirus to exclude boot-critical paths
# Examples: /boot, /sys, /proc, /dev
  3. Regular maintenance:
# Update antivirus definitions
antivirus-update-command

# Monitor system logs for early warnings
logwatch --service antivirus

Monitoring and Alerting

  • Set up log monitoring for antivirus-related errors
  • Configure alerts for service failures
  • Regular health checks of security components
  • Documentation of emergency procedures

Conclusion

Antivirus software is essential for Linux security but requires careful configuration to avoid boot interference. Understanding LSM interactions, filesystem behavior, and specific tool requirements (like CrowdStrike Falcon Sensor) is crucial for maintaining system stability.

When issues occur, systematic troubleshooting—starting with log analysis and service isolation—usually reveals the root cause. Temporary fixes like service masking provide immediate relief while permanent solutions address underlying configuration problems.

Remember: security and stability aren’t mutually exclusive with proper planning and monitoring.

1.3 - Understanding IO Delays in Linux - Performance Testing with io-delayer

A deep dive into Linux IO performance degradation using kernel modules to simulate VFS and block device delays, with engineering analysis of path resolution and queue congestion impacts.

GitHub  •  Linux Kernel Module

Simulate and analyze IO performance degradation at multiple kernel layers to understand system bottlenecks.

Why IO Delays Matter in System Performance

Input/Output operations form the backbone of system performance, yet they represent one of the most complex and often misunderstood aspects of Linux performance engineering. When applications experience slowdowns, the root cause frequently traces back to IO delays introduced at various kernel layers.

The io-delayer kernel module provides a controlled environment to study these delays, helping engineers understand how different types of IO interception impact overall system performance.

The io-delayer Tool

io-delayer is a Linux kernel module that demonstrates IO performance degradation by intercepting operations at two critical kernel layers:

Figure: Visualization of IO delays at different Linux kernel layers (VFS vs Block Device).
  1. VFS (Virtual File System) Layer: Where path resolution and high-level file operations occur
  2. Block Device Layer: Where actual storage device interactions happen

Using kprobes, the module inserts configurable microsecond delays into the IO path, allowing precise measurement of performance impact across different scenarios.

VFS Layer Delays: The Path Resolution Bottleneck

Understanding VFS in the IO Stack

The Virtual File System layer sits between system calls and actual filesystem implementations. When an application calls open(), read(), or write(), these requests pass through VFS before reaching storage devices.

According to the Linux kernel VFS documentation, VFS provides a unified interface that abstracts different filesystem types while handling core operations like:

  • Path name resolution (translating /home/user/file.txt to inode references)
  • Permission checking
  • File descriptor management
  • Caching coordination

How VFS Delays Degrade Performance

VFS delays primarily impact path resolution times, which become critical in scenarios involving:

1. Directory Traversal Operations

// Every path component requires VFS resolution
int fd = open("/home/user/documents/large-dataset/file-001.dat", O_RDONLY);

Each directory in the path (home, user, documents, large-dataset) triggers VFS operations. A 100μs delay per VFS call on a deep path like /var/lib/docker/containers/... can accumulate to milliseconds.
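The accumulation is easy to quantify: in the worst case, each path component costs one delayed lookup (dentry-cache hits would reduce this in practice). A quick back-of-the-envelope calculation:

```python
def path_resolution_cost_us(path, delay_per_lookup_us):
    """Worst-case added latency: one delayed VFS lookup per path component."""
    components = [part for part in path.split("/") if part]
    return len(components) * delay_per_lookup_us

deep_path = "/home/user/documents/large-dataset/file-001.dat"
print(path_resolution_cost_us(deep_path, 100))  # 5 components -> 500 us

# Over many files this adds up: opening 10,000 such files costs ~5 s extra.
print(path_resolution_cost_us(deep_path, 100) * 10_000 / 1_000_000, "seconds")
```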

2. Metadata-Heavy Workloads

Applications that perform many stat() calls or directory listings suffer disproportionately:

  • Package managers during dependency resolution
  • File browsers indexing directories
  • Backup tools scanning filesystem trees

3. Process Startup Overhead

Binary execution involves resolving library paths through VFS:

# Dynamic linker resolves ~50-100 library paths per process start
execve("/usr/bin/python3", ...)

Real-World VFS Delay Scenarios

Antivirus Software Impact: Modern endpoint protection often hooks VFS operations to scan files during access. A 50-200μs delay per file operation can make boot times 2-3x slower.

Container Filesystem Overhead: Docker’s layered filesystem adds VFS traversal through multiple overlay layers. Additional delays here compound with existing virtualization overhead.

Network Filesystem Latency: NFS and CIFS clients already introduce network round-trips; VFS delays can make interactive applications feel unresponsive.

Block Device Layer Delays: Queue Congestion Dynamics

Block Device Architecture

Below VFS lies the block device layer, responsible for managing IO queues, scheduling, and hardware interaction. The Linux block IO subsystem uses:

  • IO Schedulers: CFQ, deadline, noop, mq-deadline
  • Request Queues: Buffering and reordering IO requests
  • Elevator Algorithms: Optimizing disk head movement patterns

Queue Congestion Mechanics

Block device delays create queue congestion that affects the entire storage subsystem:

1. Request Queue Backlog

// Block device delays cause queue depth to increase
// Normal queue depth: 1-32 requests
// With delays: Can exceed 1000+ pending requests

When delays are introduced at the block layer, incoming requests accumulate faster than they can be processed, leading to:

  • Memory Pressure: IO requests consume kernel memory
  • Increased Latency: Requests wait longer in queue
  • Throughput Degradation: Device saturation occurs at lower loads

2. IO Scheduler Inefficiency

Modern IO schedulers optimize for:

  • Seek Time Minimization: Grouping nearby requests
  • Fairness: Preventing starvation between processes
  • Priority Handling: Real-time vs. normal priority requests

Delays disrupt these optimizations, causing schedulers to make suboptimal decisions.

3. Cascading Effects

Block delays create feedback loops:

  1. Delays increase queue depth
  2. Deeper queues increase average wait time
  3. Higher latency causes applications to issue more concurrent requests
  4. More concurrency deepens queues further
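A toy discrete simulation makes the feedback visible: hold the arrival rate constant, slow the service rate, and watch the queue depth grow (illustrative numbers only, not a model of any real IO scheduler):

```python
def simulate_queue(arrival_per_tick, service_per_tick, ticks):
    """Track peak queue depth when a device completes fewer requests per
    tick than arrive (e.g. because block-layer delays slow completions)."""
    depth = 0
    max_depth = 0
    for _ in range(ticks):
        depth += arrival_per_tick                 # requests enter the queue
        max_depth = max(max_depth, depth)         # record peak backlog
        depth -= min(depth, service_per_tick)     # device drains what it can
    return max_depth

# Device keeps up (10 in, 12 out per tick): the queue stays shallow.
print(simulate_queue(10, 12, 1000))   # 10
# A block-layer delay cuts throughput to 6/tick: backlog grows without bound.
print(simulate_queue(10, 6, 1000))    # 4006
```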

Real-World Block Device Scenarios

Storage Security Appliances: Enterprise storage systems with inline deduplication, compression, or encryption add block-level delays. Understanding these helps size hardware appropriately.

Virtualization Storage Stacks: Hypervisor storage filters, vSAN layers, and distributed storage introduce block delays that compound with virtualization overhead.

Hardware RAID Controllers: Battery-backed write caches and consistency checks add measurable delays that affect application performance expectations.

Engineering Methodology: Testing with io-delayer

Installation and Setup

# Clone the repository
git clone https://github.com/samatild/io-delayer.git
cd io-delayer

# Build and install (requires kernel headers)
make all
sudo make install
sudo make load

VFS Delay Testing

# Set 100 microsecond VFS delay
sudo ./io-delayer-cli set 100

# Test path resolution impact
time find /usr -name "*.so" 2>/dev/null | head -10

# Measure application startup time
time python3 -c "import sys; print('Python started')"
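You can also quantify the per-call impact directly from user space by timing repeated metadata operations before and after setting a delay. This is a generic measurement sketch, not part of io-delayer itself:

```python
import os
import time

def mean_stat_latency_us(path, iterations=10_000):
    """Average wall-clock cost of one os.stat() call on `path`, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1_000_000

# Run once with delays off and once after `io-delayer-cli set 100`;
# the difference approximates the injected per-lookup delay.
print(f"{mean_stat_latency_us('.'):.1f} us per stat()")
```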

Block Device Delay Testing

# Set 500 microsecond block device delay
sudo ./io-delayer-cli set-blk 500

# Test sequential throughput
dd if=/dev/zero of=/tmp/test bs=1M count=100

# Test random IO performance
fio --name=randread --rw=randread --bs=4k --size=1G --numjobs=4 --runtime=30

Performance Analysis Techniques

1. SystemTap/BPF Tracing

# Trace VFS operations with delays
stap -e '
probe vfs.read.return {
  if (execname() == "target_app") {
    printf("%s: VFS read latency %d us\n", execname(), gettimeofday_us() - @entry(gettimeofday_us()));
  }
}
'

2. IO Queue Monitoring

# Monitor block device queue depths
iostat -x 1

# Check /proc/diskstats for queue congestion
cat /proc/diskstats | grep sda

3. Flame Graph Analysis

# Generate IO flame graphs during testing
perf record -F 99 -a -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > io-flame.svg

Real-World Engineering Applications

Security Software Performance Characterization

Understanding VFS delays helps security teams balance protection with performance:

# Simulate antivirus overhead
sudo ./io-delayer-cli set 150  # 150μs per file operation

# Test impact on common workflows
time grep -r "pattern" /var/log/  # Log analysis
time find /home -name "*.pdf"     # Document indexing

Key Insight: 100-200μs VFS delays can make interactive applications feel “sluggish,” explaining user complaints about security software impact.

Storage System Capacity Planning

Block device delays help size storage arrays:

# Simulate storage appliance latency
sudo ./io-delayer-cli set-blk 2000  # 2ms storage latency

# Test application tolerance
fio --name=oltp_simulation --rw=randrw --rwmixread=70 --bs=8k --size=10G --numjobs=8 --runtime=300

Key Insight: Applications tolerant of 1-2ms latency can handle most enterprise storage, but 5+ms delays require architectural changes.

Container Performance Optimization

Docker and Kubernetes environments add multiple IO layers:

# Simulate container filesystem overhead
sudo ./io-delayer-cli set 50   # OverlayFS traversal
sudo ./io-delayer-cli set-blk 100  # Storage driver delays

# Test container startup time
time docker run --rm hello-world

Key Insight: Container startup times degrade linearly with VFS delays, explaining performance differences between bare metal and containerized deployments.

Kernel Implementation Details

Kprobe Integration

The module uses kprobes to intercept kernel functions without modification:

// VFS layer hooking
static struct kprobe kp_vfs_read = {
    .symbol_name = "vfs_read",
    .pre_handler = pre_handler_vfs_read,
};

// Block device layer hooking  
static struct kprobe kp_blk_start_request = {
    .symbol_name = "blk_start_request",
    .pre_handler = pre_handler_blk_start_request,
};

Delay Injection Mechanism

static void inject_delay(unsigned int delay_us)
{
    if (delay_us > 0) {
        udelay(delay_us);
    }
}

This uses udelay() for microsecond-precision delays, calibrated against the system timer.

Sysfs Interface

Runtime configuration via sysfs:

# VFS delay control
/sys/kernel/io_delayer/vfs_delay_us
# Block device delay control  
/sys/kernel/io_delayer/blk_delay_us

Performance Benchmarking Best Practices

Controlled Testing Environment

  1. Baseline Measurement: Always establish performance without delays first
  2. Incremental Testing: Start with small delays (10-50μs) and increase gradually
  3. Statistical Analysis: Run tests multiple times, use percentiles over averages
  4. System Monitoring: Track CPU usage, memory pressure, and IO queue depths
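For the statistical step, prefer percentiles: a handful of slow outliers disappears into a mean but shows clearly at p95/p99. A small helper, assuming you have collected per-operation latencies in a list:

```python
import statistics

def latency_summary(samples_us):
    """Report p50/p95/p99 alongside the mean, which hides tail latency."""
    p = statistics.quantiles(samples_us, n=100)  # p[i] is the (i+1)th percentile
    return {
        "p50": p[49],
        "p95": p[94],
        "p99": p[98],
        "mean": statistics.mean(samples_us),
    }

# 99 fast operations plus one 50 ms stall: the mean barely moves,
# but p99 exposes the outlier.
samples = [100.0] * 99 + [50_000.0]
print(latency_summary(samples))
```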

Realistic Delay Values

Based on real-world measurements:

  • VFS delays: 10-200μs (antivirus, DLP software)
  • Block delays: 100-2000μs (storage appliances, encryption)
  • Network delays: 50-500μs (WAN acceleration, proxy servers)
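To translate a per-operation delay into workload impact, multiply by the operation rate. A quick sketch (the workload numbers are illustrative):

```python
def added_seconds_per_second(ops_per_sec, delay_us):
    """Fraction of each wall-clock second spent inside the injected delay."""
    return ops_per_sec * delay_us / 1_000_000

# A build doing 20,000 file opens/sec under a 100 us VFS delay accrues
# 2 extra seconds of latency per second of work: heavily serialized.
print(added_seconds_per_second(20_000, 100))   # 2.0
# The same workload under a 10 us delay is a 20% overhead.
print(added_seconds_per_second(20_000, 10))    # 0.2
```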

Performance Metrics to Monitor

# System-level metrics
vmstat 1
iostat -x 1
sar -u 1

# Application-level timing
strace -c -p $(pidof target_app)
perf stat -p $(pidof target_app)

Troubleshooting and Validation

Verifying Module Operation

# Check module loading
lsmod | grep io_delayer

# Verify kprobe attachment
cat /sys/kernel/debug/kprobes/list | grep io_delayer

# Monitor delay settings
cat /sys/kernel/io_delayer/vfs_delay_us
cat /sys/kernel/io_delayer/blk_delay_us

Performance Impact Assessment

# Before delays (baseline)
sudo ./io-delayer-cli set 0
sudo ./io-delayer-cli set-blk 0
fio --name=baseline --rw=randread --bs=4k --size=1G --runtime=30 --output=baseline.json

# With VFS delays
sudo ./io-delayer-cli set 100
fio --name=vfs_delayed --rw=randread --bs=4k --size=1G --runtime=30 --output=vfs_delayed.json

# With block device delays
sudo ./io-delayer-cli set 0
sudo ./io-delayer-cli set-blk 500
fio --name=blk_delayed --rw=randread --bs=4k --size=1G --runtime=30 --output=blk_delayed.json

# Calculate degradation - compare IOPS, latency percentiles, CPU usage

Security and Production Considerations

Important: This module is for testing purposes only. Production use can cause system instability or data loss.

  • Requires root privileges for installation
  • Can impact system responsiveness during high-delay testing
  • Not intended for security enforcement (unlike actual antivirus software)
  • MIT licensed for educational and research use

Advanced Use Cases

Custom Kernel Instrumentation

The module serves as a template for building custom IO tracing tools:

// Add custom probes for specific filesystems
static struct kprobe kp_ext4_writepages = {
    .symbol_name = "ext4_writepages",
    .pre_handler = custom_ext4_handler,
};

Performance Regression Testing

# Automated testing script
#!/bin/bash
for delay in 0 50 100 200; do
    sudo ./io-delayer-cli set $delay
    sudo ./io-delayer-cli set-blk $delay

    # Run your favorite benchmark utility (fio, tuxtoaster, etc.)
    echo "Testing with ${delay}us delays..."
    fio --name=regression_test --rw=randrw --rwmixread=70 --bs=4k --size=1G --runtime=30 --output=results_${delay}us.json

    # Or use tuxtoaster for comprehensive system testing
    # tuxtoaster --category disk --test "Random IOPS R/W"

    # Reset delays between tests
    sudo ./io-delayer-cli set 0
    sudo ./io-delayer-cli set-blk 0
    sleep 5
done

Conclusion

The io-delayer kernel module provides essential visibility into how IO delays at different kernel layers impact overall system performance. By understanding VFS path resolution bottlenecks and block device queue congestion, engineers can make informed decisions about:

  • Security software deployment strategies
  • Storage system architecture choices
  • Container performance optimization
  • Application performance expectations

The tool bridges the gap between theoretical kernel knowledge and practical performance engineering, enabling data-driven decisions in complex Linux environments.

Key Takeaways:

  • VFS delays primarily affect interactive performance through path resolution overhead
  • Block device delays create queue congestion that impacts throughput and scalability
  • Real-world delays range from microseconds (software) to milliseconds (hardware)
  • Systematic testing with tools like io-delayer prevents performance surprises


1.4 - Streamlining Linux Diagnostics with SOSParser

A comprehensive guide to using SOSParser, an automated analysis tool for Linux sosreport and supportconfig diagnostic files.

GitHub  •  Docker Hub

Parse, analyze, and understand Linux diagnostic reports with automated intelligence.

What is SOSParser?

SOSParser is a powerful web application designed to automatically parse and analyze Linux sosreport and supportconfig diagnostic files, converting them into comprehensive, interactive HTML reports. Created to streamline the often tedious process of manually reviewing system diagnostic data, SOSParser transforms raw diagnostic archives into structured, searchable insights that accelerate troubleshooting and system analysis.

Whether you’re a system administrator, DevOps engineer, or support technician dealing with complex Linux environments, SOSParser provides an automated approach to understanding what’s happening inside your systems.

SOSParser screenshot: Animated workflow of analysis and reporting

The Problem SOSParser Solves

When Linux systems encounter issues, the standard diagnostic approach involves generating comprehensive reports using tools like:

  • sosreport - A utility that collects detailed system information from Red Hat-based distributions
  • supportconfig - SUSE’s equivalent diagnostic collection tool

These reports contain thousands of files with critical system information, but analyzing them manually is:

  • Time-consuming: Hours of sifting through logs, configurations, and system data
  • Error-prone: Easy to miss important correlations between different system components
  • Inconsistent: Different analysts may interpret the same data differently
  • Repetitive: Common patterns and issues require rediscovery each time

SOSParser addresses these challenges by providing automated, intelligent analysis that surfaces key insights immediately.

How SOSParser Works

Input Processing

SOSParser accepts standard diagnostic archives in various compressed formats:

  • .tar.xz (most common)
  • .tar.gz
  • .tar.bz2
  • .tar

Analysis Pipeline

Once uploaded, SOSParser processes the diagnostic data through multiple analysis modules:

  1. Data Extraction: Automatically unpacks and organizes the diagnostic archive
  2. Content Parsing: Extracts and structures data from hundreds of system files
  3. Correlation Analysis: Identifies relationships between different system components
  4. Insight Generation: Applies heuristics and rules to identify potential issues
  5. Report Generation: Creates an interactive HTML report with visualizations and recommendations
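The first two stages can be sketched with the Python standard library: unpack the archive, then walk it and route known files to parser modules. This is a conceptual sketch only; SOSParser's real parser map and modules are far more extensive, and the file-to-parser mapping below is hypothetical:

```python
import tarfile
from pathlib import Path

# Hypothetical mapping from paths inside the report to parser names.
PARSERS = {
    "etc/os-release": "os_info",
    "proc/meminfo": "memory",
    "sos_commands/networking": "network",
}

def extract_report(archive_path, dest):
    """Stage 1: unpack; mode 'r:*' lets tarfile auto-detect xz/gz/bz2."""
    with tarfile.open(archive_path, mode="r:*") as tar:
        tar.extractall(dest)

def route_files(root):
    """Stage 2: match extracted paths against the parser map."""
    matched = {}
    for path in Path(root).rglob("*"):
        rel = path.relative_to(root).as_posix()
        for pattern, parser in PARSERS.items():
            if rel.startswith(pattern):
                matched.setdefault(parser, []).append(rel)
    return matched
```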

What SOSParser Analyzes

System Information

  • Hardware Details: CPU architecture, memory configuration, disk layout
  • OS Information: Distribution, version, kernel details
  • System Resources: Current utilization, capacity planning insights

System Configuration

  • Boot Configuration: GRUB settings, init systems, startup services
  • Authentication: PAM configuration, user management, security policies
  • Services: Systemd units, cron jobs, running processes
  • Security: SELinux/AppArmor status, firewall rules, package integrity

Filesystem Analysis

  • Mount Points: Filesystem types, mount options, capacity usage
  • LVM Configuration: Volume groups, logical volumes, physical volumes
  • Disk Usage: Largest directories, file ownership patterns, permission issues
  • Filesystem Health: Journal status, inode usage, fragmentation indicators

Network Analysis

  • Interface Configuration: IP addresses, subnet masks, gateway settings
  • Routing Tables: Static and dynamic routes, network connectivity
  • DNS Configuration: Resolvers, search domains, DNS query patterns
  • Firewall Rules: iptables/nftables configuration, active rulesets
  • Network Services: Listening ports, connection states, network statistics

Log Analysis

  • System Logs: /var/log/messages, /var/log/syslog, journald entries
  • Kernel Logs: dmesg output, kernel ring buffer analysis
  • Authentication Logs: Login attempts, sudo usage, security events
  • Service Logs: Application-specific log analysis and error pattern detection
  • Security Events: Failed access attempts, intrusion indicators

Cloud Services Integration

  • AWS: EC2 instance metadata, IAM roles, VPC configuration
  • Azure: VM extensions, resource groups, networking setup
  • GCP: Compute Engine metadata, service accounts, network configuration
  • Oracle Cloud: Instance details, VNICs, storage configuration

Getting Started with SOSParser

The easiest way to run SOSParser is using Docker:

# Pull the official image
docker pull samuelmatildes/sosparser:latest

# Run the container
docker run -d -p 8000:8000 --name sosparser samuelmatildes/sosparser:latest

Then open http://localhost:8000 in your browser.

Persisting Data

For production use, mount volumes to persist uploads and generated reports:

# Using bind mounts
docker run -d -p 8000:8000 --name sosparser \
  -v $(pwd)/data/uploads:/app/webapp/uploads \
  -v $(pwd)/data/outputs:/app/webapp/outputs \
  samuelmatildes/sosparser:latest

# Using named volumes
docker run -d -p 8000:8000 --name sosparser \
  -v sosparser_uploads:/app/webapp/uploads \
  -v sosparser_outputs:/app/webapp/outputs \
  samuelmatildes/sosparser:latest

Local Development

To build and run locally:

git clone https://github.com/samatild/SOSParser.git
cd SOSParser
docker build -t sosparser:local .
docker run -d -p 8000:8000 sosparser:local

Using SOSParser

Web Interface Workflow

  1. Upload: Select your sosreport or supportconfig file (supports multiple formats)
  2. Analyze: Click “Analyze Report” to start automated processing
  3. Review: Browse the generated interactive HTML report
  4. Export: Download reports for sharing or archival

Report Features

The generated reports include:

  • Interactive Navigation: Collapsible sections, searchable content
  • Visual Indicators: Color-coded severity levels for issues
  • Cross-References: Links between related system components
  • Recommendations: Actionable suggestions based on findings
  • Export Options: PDF generation, data extraction

Common Use Cases

Incident Response

  • Rapid triage of production system issues
  • Correlation of symptoms across multiple subsystems
  • Identification of root cause patterns

Capacity Planning

  • Resource utilization analysis
  • Performance bottleneck identification
  • Growth trend assessment

Security Audits

  • Configuration compliance checking
  • Vulnerability assessment
  • Access pattern analysis

Change Validation

  • Pre/post-change comparison
  • Configuration drift detection
  • Impact assessment

Advanced Features and Roadmap

Currently Available

  • Multi-format diagnostic file support
  • Cloud platform detection and analysis
  • Comprehensive system health scoring
  • Interactive HTML report generation

Planned Enhancements

  • Advanced Disk Diagnostics: SMART data analysis, ATA command integration
  • Application Server Analysis: Apache/Nginx configuration parsing, database connectivity
  • Container Orchestration: Kubernetes pod analysis, Docker container inspection
  • Backup System Integration: Backup status validation, recovery testing
  • Monitoring Integration: Prometheus metrics correlation, alerting rule validation
  • Machine Learning: Anomaly detection, predictive issue identification

Performance and Scalability

SOSParser is designed to handle large diagnostic reports efficiently:

  • Processing Speed: Most reports analyzed in under 2 minutes
  • Memory Usage: Optimized for systems with 2GB+ RAM
  • Storage: Reports typically 10-20% of original archive size
  • Concurrency: Supports multiple simultaneous analyses

Security Considerations

  • Local Processing: All analysis occurs locally - no data sent to external services
  • Container Isolation: Docker deployment provides additional security boundaries
  • Data Privacy: Sensitive information remains within your infrastructure
  • Audit Trail: Processing logs available for compliance requirements

Integration and Automation

API Access

SOSParser provides REST API endpoints for integration with existing workflows:

# Upload and analyze via API
curl -X POST -F "file=@<your_report_file>" http://localhost:8000/api/analyze

CI/CD Integration

  • Automated analysis of system snapshots
  • Regression testing for configuration changes
  • Compliance validation pipelines

Monitoring Integration

  • Alert generation based on analysis results
  • Dashboard integration for system health overview
  • Trend analysis across multiple systems

Contributing and Community

SOSParser is an open-source project that welcomes contributions:

  • Bug Reports: Use GitHub Issues for problems or feature requests
  • Code Contributions: Pull requests are reviewed and merged regularly
  • Documentation: Help improve guides and examples
  • Testing: Report compatibility with different Linux distributions

Join the community on Telegram for updates, discussions, and support.

Troubleshooting SOSParser

Common Issues

Upload Failures

  • Check file size limits (typically 500MB max)
  • Verify archive integrity before upload
  • Ensure proper file permissions
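The first checks can be scripted before uploading. A minimal Python sketch (the 500 MB limit and the function name are assumptions; check your deployment's actual setting):

```python
import os
import tarfile

# Assumed upload limit -- verify against your SOSParser deployment
MAX_SIZE = 500 * 1024 * 1024

def preflight(path: str, max_size: int = MAX_SIZE) -> list:
    """Return a list of problems found before uploading a diagnostic archive."""
    problems = []
    if not os.access(path, os.R_OK):
        return ["file is not readable"]
    if os.path.getsize(path) > max_size:
        problems.append("archive exceeds size limit")
    try:
        # Opening and reading the first member verifies basic archive integrity
        with tarfile.open(path) as tar:
            tar.next()
    except tarfile.TarError:
        problems.append("archive is corrupted or not a tar archive")
    return problems
```

An empty list means the archive is at least structurally sound and within the size limit; anything else should be fixed before retrying the upload.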

Analysis Errors

  • Confirm the diagnostic file was generated correctly
  • Check for corrupted archives
  • Review Docker logs for processing errors

Performance Issues

  • Allocate sufficient CPU and memory resources
  • Process large reports during off-peak hours
  • Consider horizontal scaling for high-volume environments

Conclusion

SOSParser represents a significant advancement in Linux system diagnostics, transforming the traditionally manual and time-intensive process of analyzing sosreport and supportconfig files into an automated, intelligent workflow. By providing comprehensive analysis, actionable insights, and interactive reports, it empowers system administrators and support teams to resolve issues faster and maintain healthier Linux environments.

Whether you’re managing a single server or overseeing enterprise-scale deployments, SOSParser provides the tools needed to understand your systems at a deeper level, identify potential issues before they become critical, and maintain optimal system health.


Learn More

1.5 - Linux Benchmarking Made Easy with Tux Toaster

A practical guide to stress testing and benchmarking Linux systems using the Tux Toaster toolkit.

GitHub  •  PyPI  

Benchmark smarter, not harder — with Tux Toaster.

What is Tux Toaster?

Tux Toaster is an all-in-one performance toolkit for Linux. It triggers various load tests (“toasters”) to help you evaluate the performance and stability of your system across CPU, memory, disk, and network. It offers an interactive terminal menu with multi-select support and clear, stoppable workloads.

Preview

When to use it

  • Hardware bring-up and burn-in
  • Post-maintenance validation (kernel/firmware/driver updates)
  • Capacity planning and instance comparison
  • Performance regression investigations
  • Reproducible stress scenarios for bug reports

Requirements

Tux Toaster targets Linux and relies on:

  • Python 3.8+
  • System utilities: dd, lsblk, taskset, pkill
  • Internet connectivity for network tests

Optional/privileged:

  • Root privileges for the “Unclean GC” runaway memory test to adjust oom_score_adj

Installation

Install from PyPI (recommended):

pip install tuxtoaster
tuxtoaster

If your shell cannot find tuxtoaster, add the install directory to PATH:

# User installs (scripts in ~/.local/bin)
export PATH="$HOME/.local/bin:$PATH"

Alternative: install from source (editable):

git clone https://github.com/samatild/tuxtoaster.git
cd tuxtoaster
pip install -U pip setuptools wheel
pip install -e .
tuxtoaster

Menu controls:

  • Use arrow keys to navigate, Enter to select.
  • Many submenus support multi-select; hints appear in the UI.
  • Press q, x, or Esc in a menu to go back.
  • During tests, press Enter to stop.

Quick start

From the main menu, pick a category and test(s) to run.

CPU

  • Single Core
  • All Cores
  • Custom Number of Cores (uses taskset; experimental)

Memory

  • Single Runaway Thread
  • Multiple Runaway Threads
  • Memory spikes
  • Unclean GC (requires root to set oom_score_adj)

Disk

  • IOPS Reads (4K, direct I/O)
  • IOPS Writes (4K, direct I/O)
  • Random IOPS R/W (4K, direct I/O)
  • IOPS 50-50 R/W (4K, direct I/O)
  • Throughput Reads (4MB, direct I/O)
  • Throughput Writes (4MB, direct I/O)
  • Random Throughput R/W (4MB, direct I/O)
  • Throughput 50-50 R/W (4MB, direct I/O)
  • Read while write cache is getting flushed

Network

  • Network IN (Single) — downloads https://proof.ovh.net/files/100Mb.dat
  • Network OUT (Single) — UDP to 8.8.8.8:53
  • Network IN (Multiple) — N parallel downloads
  • Network OUT (Multiple) — N parallel UDP senders
  • Socket Exhaustion — exhaust process/kernel sockets or reserve listener port ranges
  • Simulate Latencies — local TCP proxy that injects base latency and jitter (check project Git for instructions)

Multiple tests at once

Run CPU, Memory, Network, and Disk concurrently:

  • From the main menu: Multiple → select items with SPACE → Enter to start.
  • Press Enter to stop all running tests.

Behaviors:

  • CPU: one busy process per core (like All Cores)
  • Memory: threads allocate 500 MB chunks repeatedly (requires psutil)
  • Network: continuous download of https://proof.ovh.net/files/100Mb.dat
  • Disk: alternating dd read/write with 4MB direct I/O under a temp file

Reading results

Tux Toaster prints live progress and a summary when you stop a test. Disk tests create temporary files under a dedicated directory on the selected mount points and clean up on exit. Network tests report bandwidth per socket in multi-socket modes.

Tips:

  • Run tests at least 3 times and use medians for comparisons.
  • Keep a record of CPU governor, kernel version, microcode, and thermal state.
  • Pin CPU frequency when comparing hardware to reduce variance.
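The median-based comparison needs nothing beyond the standard library. A small sketch, assuming you have already collected per-run throughput numbers:

```python
from statistics import median

def compare_runs(baseline: list, candidate: list) -> float:
    """Return the candidate/baseline ratio of median throughput.

    Medians damp the effect of a single noisy run better than means do.
    """
    return median(candidate) / median(baseline)

# Example: three runs each, in MB/s
ratio = compare_runs([512.0, 498.0, 505.0], [530.0, 541.0, 528.0])
print(f"candidate is {ratio:.2f}x baseline")  # -> candidate is 1.05x baseline
```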

Good benchmarking hygiene

  • Stop noisy services (package updates, indexing, backup agents)

Troubleshooting

  • Missing psutil: pip3 install psutil
  • Permission errors: some memory tests and taskset pinning may require sudo
  • Inconsistent results: check CPU governor, temperature, and background load
  • Direct I/O errors: some filesystems/containers may not honor oflag=direct
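You can probe up front whether a mount point honors direct I/O. A Linux-only sketch using O_DIRECT (tmpfs and some container filesystems will refuse it with EINVAL):

```python
import os
import tempfile

def supports_direct_io(directory: str) -> bool:
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        # Linux-only flag; the open itself fails on filesystems without O_DIRECT
        os.close(os.open(path, os.O_RDONLY | os.O_DIRECT))
        return True
    except OSError:
        return False
    finally:
        os.unlink(path)
```

Run it against the target mount point before a disk test to distinguish a real I/O problem from a filesystem that simply does not support `oflag=direct`.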

Learn more


2 - Kernel

Linux Kernel Information

Reference material and deep dives on Linux kernel behavior, diagnostics, and troubleshooting.

2.1 - Soft and Hard Lockups

Detect, troubleshoot, and simulate Linux kernel soft and hard lockups with the watchdog.

Introduction

Kernel Watchdog

According to Linux Kernel Documentation, the Linux kernel can act as a watchdog to detect both soft and hard lockups.

The Linux Kernel Watchdog is a mechanism that monitors the system for both soft and hard lockups. It is designed to detect when a CPU is stuck in a loop and unable to make progress. When a lockup is detected, the watchdog prints a warning message and stack trace to the system log.

Soft vs. Hard Lockups

Soft Lockup

A ‘softlockup’ is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run. The current stack trace is displayed upon detection and, by default, the system will stay locked up.

Linux Kernel Watchdog - Softlockup detection

By default, the system displays the stack trace and stays locked up. The watchdog can instead be configured to panic the system when a soft lockup is detected by setting the softlockup_panic parameter to a value greater than zero; combined with kernel.panic, this produces an automatic reboot.

Common error messages:

BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
Modules linked in: ...

Hard Lockup

A ‘hardlockup’ is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds, without letting other interrupts have a chance to run.

Linux Kernel Watchdog - Hardlockup detection

As with soft lockups, the system displays the stack trace and stays locked up. The watchdog can be configured to panic the system when a hard lockup is detected by setting the hardlockup_panic parameter to a value greater than zero; combined with kernel.panic, this produces an automatic reboot.

Common error messages:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Modules linked in: ...
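Hard-lockup detection relies on the NMI watchdog, which needs a working PMU and is often unavailable under hypervisors. You can check whether it is enabled:

```shell
# 1 = hard-lockup detector enabled, 0 or missing = disabled/unavailable
cat /proc/sys/kernel/nmi_watchdog 2>/dev/null \
  || echo "nmi_watchdog interface not present"
```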

Deep Dive into Soft/Hard Lockups

Reacting to soft/hard lockups

Ok, so you have a soft or hard lockup, what now? Here are some steps you can take to troubleshoot and resolve the issue:

The Basics

1. Check the system logs: Look for any error messages or warnings that may indicate the cause of the lockup.

2. Check the CPU load: High CPU load can cause lockups. Use tools like top or htop to monitor CPU usage.

3. Read the watchdog warning message: The watchdog prints a warning message to the system log when a lockup is detected. This message may provide clues about the cause of the lockup: it usually includes a stack trace showing where the CPU is stuck, along with the name and PID of the process that triggered it.

Thresholds

The watchdog has default thresholds for detecting soft and hard lockups. These thresholds can be adjusted to suit your system’s needs. Before changing thresholds, it is important to understand the implications of changing these values. Setting the thresholds too low may result in false positives, while setting them too high may result in missed lockups.

Implementation

A periodic hrtimer runs to generate interrupts and kick the watchdog job. An NMI perf event is generated every “watchdog_thresh” seconds (compile-time initialized to 10 and configurable through the sysctl of the same name) to check for hardlockups. If any CPU in the system does not receive any hrtimer interrupt during that time, the ‘hardlockup detector’ (the handler for the NMI perf event) will generate a kernel warning or call panic, depending on the configuration.

The watchdog job runs in a stop scheduling thread that updates a timestamp every time it is scheduled. If that timestamp is not updated for 2*watchdog_thresh seconds (the softlockup threshold), the ‘softlockup detector’ (coded inside the hrtimer callback function) will dump useful debug information to the system log, after which it will call panic if it was instructed to do so, or resume execution of other kernel code.

Check current threshold:

cat /proc/sys/kernel/watchdog_thresh
10

Update threshold to 30 seconds:

# For temporary change
echo 30 > /proc/sys/kernel/watchdog_thresh

# For permanent change
echo "kernel.watchdog_thresh = 30" >> /etc/sysctl.conf
sysctl -p
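As the implementation above describes, watchdog_thresh drives both detectors: hard lockups are checked every watchdog_thresh seconds, and the soft lockup threshold is twice that. A trivial helper makes the relationship explicit:

```python
def lockup_thresholds(watchdog_thresh: int) -> dict:
    """Effective detection thresholds derived from kernel.watchdog_thresh."""
    return {
        "hardlockup_seconds": watchdog_thresh,       # NMI perf event period
        "softlockup_seconds": 2 * watchdog_thresh,   # watchdog task timestamp age
    }

print(lockup_thresholds(10))  # kernel default
# -> {'hardlockup_seconds': 10, 'softlockup_seconds': 20}
```

This is why the default messages report roughly 10 s for hard lockups and 20+ s for soft lockups, and why raising watchdog_thresh to 30 moves the soft lockup threshold to 60 seconds.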

Advanced Troubleshooting

The watchdog can be configured to panic the system when a lockup is detected by setting the softlockup_panic or hardlockup_panic parameters to a value greater than zero:

# Enable panic on soft lockup
echo 1 > /proc/sys/kernel/softlockup_panic

# Enable panic on hard lockup
echo 1 > /proc/sys/kernel/hardlockup_panic

In combination with crash dump, you can get a full dump of the system state when a lockup is detected. This can be useful for debugging the issue.

Step-by-step guide for this recipe:

1. Enable crash dump (kdump) in the kernel configuration and verify that it collects a memory dump when a kernel panic occurs.
2. Configure the kernel to panic when a lockup is detected.
3. Analyze the resulting crash dump to identify the cause of the lockup.
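The panic settings can be persisted across reboots with a sysctl.d drop-in (the file name below is arbitrary; run as root):

```shell
# Persist panic-on-lockup behavior
cat > /etc/sysctl.d/90-lockup-panic.conf <<'EOF'
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
EOF
sysctl --system
```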

Demo - Creating a Softlockup and understanding the watchdog output

Simulating a Soft Lockup

For the example below, we will simulate a soft lockup by loading a kernel module designed to test whether the watchdog and lockup detection are working as expected.

Test module to generate lockups: CONFIG_TEST_LOCKUP

Note: When building and installing Linux kernel modules, make sure you use the module source matching your running kernel version (browse the kernel source code and match your kernel version: Example for v6.9.5), and that the necessary kernel headers are installed.

The following parameters are available to simulate a variety of lockups:


# modinfo soft.ko
filename:       /root/CONFIG_TEST_LOCKUP/soft.ko
description:    Test module to generate lockups
author:         Konstantin Khlebnikov <[email protected]>
license:        GPL
srcversion:     302B4AE69F898F7B25CABF8
depends:
retpoline:      Y
name:           soft
vermagic:       5.15.0-1064-azure SMP mod_unload modversions
parm:           time_secs:lockup time in seconds, default 0 (uint)
parm:           time_nsecs:nanoseconds part of lockup time, default 0 (uint)
parm:           cooldown_secs:cooldown time between iterations in seconds, default 0 (uint)
parm:           cooldown_nsecs:nanoseconds part of cooldown, default 0 (uint)
parm:           iterations:lockup iterations, default 1 (uint)
parm:           all_cpus:trigger lockup at all cpus at once (bool)
parm:           state:wait in 'R' running (default), 'D' uninterruptible, 'K' killable, 'S' interruptible state (charp)
parm:           use_hrtimer:use high-resolution timer for sleeping (bool)
parm:           iowait:account sleep time as iowait (bool)
parm:           lock_read:lock read-write locks for read (bool)
parm:           lock_single:acquire locks only at one cpu (bool)
parm:           reacquire_locks:release and reacquire locks/irq/preempt between iterations (bool)
parm:           touch_softlockup:touch soft-lockup watchdog between iterations (bool)
parm:           touch_hardlockup:touch hard-lockup watchdog between iterations (bool)
parm:           call_cond_resched:call cond_resched() between iterations (bool)
parm:           measure_lock_wait:measure lock wait time (bool)
parm:           lock_wait_threshold:print lock wait time longer than this in nanoseconds, default off (ulong)
parm:           disable_irq:disable interrupts: generate hard-lockups (bool)
parm:           disable_softirq:disable bottom-half irq handlers (bool)
parm:           disable_preempt:disable preemption: generate soft-lockups (bool)
parm:           lock_rcu:grab rcu_read_lock: generate rcu stalls (bool)
parm:           lock_mmap_sem:lock mm->mmap_lock: block procfs interfaces (bool)
parm:           lock_rwsem_ptr:lock rw_semaphore at address (ulong)
parm:           lock_mutex_ptr:lock mutex at address (ulong)
parm:           lock_spinlock_ptr:lock spinlock at address (ulong)
parm:           lock_rwlock_ptr:lock rwlock at address (ulong)
parm:           alloc_pages_nr:allocate and free pages under locks (uint)
parm:           alloc_pages_order:page order to allocate (uint)
parm:           alloc_pages_gfp:allocate pages with this gfp_mask, default GFP_KERNEL (uint)
parm:           alloc_pages_atomic:allocate pages with GFP_ATOMIC (bool)
parm:           reallocate_pages:free and allocate pages between iterations (bool)
parm:           file_path:file path to test (string)
parm:           lock_inode:lock file -> inode -> i_rwsem (bool)
parm:           lock_mapping:lock file -> mapping -> i_mmap_rwsem (bool)
parm:           lock_sb_umount:lock file -> sb -> s_umount (bool)

We will simulate a soft lockup by loading the module with the following parameters:

insmod soft.ko time_secs=35 iterations=1 all_cpus=0 state="R"

Where:

  • time_secs=35 - Lockup time in seconds
  • iterations=1 - Lockup iterations
  • all_cpus=0 - Do not trigger the lockup on all CPUs at once (lock up a single CPU only)
  • state="R" - Wait in ‘R’ running state

After loading the module, the system will be locked up for 35 seconds. The watchdog will detect the soft lockup and print a warning message to the system log.

Understanding the watchdog output

The watchdog warning message

[  568.503455] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [insmod:5912]
[  568.508018] Modules linked in: soft(OE+) nls_iso8859_1 kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd joydev hid_generic serio_raw hyperv_drm drm_kms_helper syscopyarea hid_hyperv sysfillrect sysimgblt hid fb_sys_fops hv_netvsc hyperv_keyboard cec rc_core sch_fq_codel drm i2c_core efi_pstore ip_tables x_tables autofs4

The watchdog warning message includes the following information:

  • BUG: soft lockup - Indicates that a soft lockup was detected
  • CPU#1 stuck for 26s! - Indicates that CPU#1 was stuck for 26 seconds
  • [insmod:5912] - Indicates that the process with PID 5912 caused the lockup (in this case, the insmod process)
  • Modules linked in (...) - Lists the kernel modules that were loaded at the time of the lockup
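When triaging many logs, these fields can be extracted mechanically. A regex sketch matching the format shown above:

```python
import re
from typing import Optional

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [insmod:5912]
LOCKUP_RE = re.compile(
    r"BUG: (?P<kind>soft|hard) lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)

def parse_lockup(line: str) -> Optional[dict]:
    """Extract kind, CPU, duration, command name, and PID from a lockup line."""
    m = LOCKUP_RE.search(line)
    if not m:
        return None
    d = m.groupdict()
    d["cpu"], d["secs"], d["pid"] = int(d["cpu"]), int(d["secs"]), int(d["pid"])
    return d
```

Feeding it the message above yields `{'kind': 'soft', 'cpu': 1, 'secs': 26, 'comm': 'insmod', 'pid': 5912}`; the greedy `comm` match also handles kernel thread names that contain colons, such as `kworker/u16:2`.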

The stack trace

[  568.508074] Call Trace:
[  568.508075]  <IRQ>
[  568.508079]  ? show_regs.cold+0x1a/0x1f
[  568.508085]  ? watchdog_timer_fn+0x1c4/0x220
[  568.508089]  ? softlockup_fn+0x30/0x30
[  568.508092]  ? __hrtimer_run_queues+0xca/0x1d0
[  568.508095]  ? hrtimer_interrupt+0x109/0x230
[  568.508097]  ? hv_stimer0_isr+0x20/0x30
[  568.508101]  ? __sysvec_hyperv_stimer0+0x32/0x70
[  568.508104]  ? sysvec_hyperv_stimer0+0x7b/0x90
[  568.508109]  </IRQ>
[  568.508110]  <TASK>
[  568.508111]  ? asm_sysvec_hyperv_stimer0+0x1b/0x20
[  568.508116]  ? delay_tsc+0x2b/0x70
[  568.508118]  __const_udelay+0x43/0x50
[  568.508122]  test_wait+0xc6/0xe0 [soft]
[  568.508129]  test_lockup+0xd9/0x270 [soft]
[  568.508133]  test_lockup_init+0x891/0x1000 [soft]
[  568.508137]  ? 0xffffffffc084f000
[  568.508139]  do_one_initcall+0x48/0x1e0
[  568.508143]  ? __cond_resched+0x19/0x40
[  568.508146]  ? kmem_cache_alloc_trace+0x15a/0x420
[  568.508150]  do_init_module+0x52/0x230
[  568.508154]  load_module+0x1294/0x1500
[  568.508159]  __do_sys_finit_module+0xbf/0x120
[  568.508162]  ? __do_sys_finit_module+0xbf/0x120
[  568.508165]  __x64_sys_finit_module+0x1a/0x20
[  568.508168]  x64_sys_call+0x1ac3/0x1fa0
[  568.508170]  do_syscall_64+0x54/0xb0
[  568.508174]  ? __audit_syscall_exit+0x265/0x2d0
[  568.508177]  ? ksys_mmap_pgoff+0x14b/0x2a0
[  568.508182]  ? exit_to_user_mode_prepare+0x54/0x270
[  568.508185]  ? syscall_exit_to_user_mode+0x27/0x40
[  568.508187]  ? do_syscall_64+0x61/0xb0
[  568.508189]  entry_SYSCALL_64_after_hwframe+0x67/0xd1
[  568.508191] RIP: 0033:0x7f860d3fa95d
[  568.508194] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 03 35 0d 00 f7 d8 64 89 01 48
[  568.508196] RSP: 002b:00007ffd21043e88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  568.508198] RAX: ffffffffffffffda RBX: 0000562af1c447d0 RCX: 00007f860d3fa95d
[  568.508199] RDX: 0000000000000000 RSI: 0000562af1c442a0 RDI: 0000000000000003
[  568.508200] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffd21044048
[  568.508201] R10: 0000000000000003 R11: 0000000000000246 R12: 0000562af1c442a0
[  568.508202] R13: 0000000000000000 R14: 0000562af1c447a0 R15: 0000562af1c442a0
[  568.508204]  </TASK>
[  577.859869] soft: Finish on CPU1 in 34727950300 ns
[  577.859874] soft: FINISH in 34727963600 ns

The stack trace shows where the CPU was stuck and provides information about the process that caused the lockup.

Important information from the stack trace:

  • IRQ - Indicates that the CPU was in an interrupt context when the lockup occurred.

IRQ Example

  • TASK - Indicates which calls were being executed when the lockup occurred.

TASK Example

When comparing with source code, you can identify the function that caused the lockup. In this case, the test_wait function in the soft module caused the lockup.

test_lockup.c

static void test_wait(unsigned int secs, unsigned int nsecs)
{
	if (wait_state == TASK_RUNNING) {
		if (secs)
			mdelay(secs * MSEC_PER_SEC);
		if (nsecs)
			ndelay(nsecs);
		return;
	}

	__set_current_state(wait_state);
	if (use_hrtimer) {
		ktime_t time;

		time = ns_to_ktime((u64)secs * NSEC_PER_SEC + nsecs);
		schedule_hrtimeout(&time, HRTIMER_MODE_REL);
	} else {
		schedule_timeout(secs * HZ + nsecs_to_jiffies(nsecs));
	}
}

Why did it hang?

  • The test_wait function was waiting in the TASK_RUNNING state, which caused the CPU to be stuck in a loop.
  • Since we are loading a module, the execution will be in kernel mode, and the watchdog will detect the lockup.
  • The mdelay function was used to delay the execution of the function for a specified number of milliseconds. In this case, the function was delayed for 35 seconds, which caused the lockup.

 

Final Thoughts

The Linux Kernel Watchdog is a powerful tool for detecting soft and hard lockups. By understanding how the watchdog works and how to react to lockups, you can troubleshoot and resolve issues more effectively. Remember to check the system logs, CPU load, and watchdog warning messages when a lockup occurs. By following best practices and using advanced troubleshooting techniques, you can identify the root cause of lockups and take steps to prevent them in the future.

2.2 - Why Kernel Crash Dumps Are Critical for Root Cause Analysis

Deep-dive on using vmcore crash dumps for postmortem kernel debugging, including real-world kernel bug and OOM workflows.

Postmortem Kernel Forensics with vmcore

Summary

When the Linux kernel panics, there is no userspace stack, no application logs, and often no intact filesystems. The only canonical, lossless record of the kernel’s terminal state is the crash dump (vmcore). Without vmcore, you are constrained to heuristics and guesswork; with vmcore, you can deterministically reconstruct CPU state, task scheduling, memory allocators, locks, timers, and subsystems at the exact point of failure. This is the difference between timeline narratives and hard proof.

What a vmcore Captures (and Why It Matters)

  • CPU architectural state: general-purpose registers, control registers, MSRs, per-CPU contexts.
  • Full kernel virtual memory snapshot: page tables, slab caches, VFS dentries/inodes, networking stacks, block layer queues, and device driver state.
  • Task list and scheduler state: task_struct, runqueues, RT/DL classes, stop machine contexts.
  • Lock state: mutex, spinlock_t owners, wait queues, and contention points.
  • Timers/workqueues/interrupts: pending timers, softirqs, tasklets, IRQ threads, NMI backtraces.

With unstripped vmlinux and kernel debuginfo, these structures become symbol-resolved and type-aware in tools like crash and gdb.

Minimal Prerequisites for a Useful Dump

  • Reserve crash kernel memory at boot: crashkernel=auto (or a fixed size appropriate to RAM and distro guidance).
  • Ensure kdump service is active and the dump target has write bandwidth and space (prefer raw disk/LVM or fast local FS; only use NFS/SSH if necessary).
  • Keep exact-matching debuginfo for the running kernel build:
    • Uncompressed vmlinux with full DWARF and symbols.
    • Matching System.map and all loaded module debuginfo (e.g., kernel-debuginfo, kernel-debuginfo-common on RHEL/Fedora; linux-image-…-dbgsym on Debian/Ubuntu repositories).
  • Persist critical panic policies:
sysctl -w kernel.panic_on_oops=1
sysctl -w kernel.unknown_nmi_panic=1
sysctl -w kernel.panic_on_unrecovered_nmi=1
sysctl -w vm.panic_on_oom=2   # 1=panic on system-wide OOM; 2=always panic, even on cpuset/memcg-constrained OOM
sysctl -w kernel.panic=10     # auto-reboot N seconds after panic

Persist via /etc/sysctl.d/*.conf as needed. For manual testing, enable SysRq and force a controlled crash:

echo 1 | sudo tee /proc/sys/kernel/sysrq
echo c | sudo tee /proc/sysrq-trigger

Acquisition Pipeline and Size Reduction

makedumpfile can filter non-essential pages to reduce vmcore size and I/O time without destroying forensics value. Recommended options:

makedumpfile -l --message-level 1 \
  -d 31 \
  /proc/vmcore /var/crash/vmcore.filtered
  • -d 31 drops cache pages, free pages, user pages, and unused memory; tune masks per incident.
  • Always retain an unfiltered copy during critical investigations if space allows.
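The -d value is a bitmask of page classes to exclude; per makedumpfile(8), the bits are 1 = zero pages, 2 = non-private cache, 4 = private cache, 8 = user-space pages, and 16 = free pages. A small helper decodes a given level:

```python
# Bit meanings per the makedumpfile(8) dump_level table
EXCLUDED = {
    1: "zero pages",
    2: "cache pages (non-private)",
    4: "cache pages (private)",
    8: "user-space pages",
    16: "free pages",
}

def decode_dump_level(level: int) -> list:
    """List the page classes a given makedumpfile -d value excludes."""
    return [name for bit, name in EXCLUDED.items() if level & bit]

print(decode_dump_level(31))  # excludes all five classes
```

So `-d 31` is the most aggressive filter; dropping bit 8 (`-d 23`) would keep user pages, which you may want when a userspace process is suspected of driving the panic.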

Core Tooling

  • crash: purpose-built kernel postmortem shell using vmlinux DWARF types.
  • gdb with vmlinux: useful for advanced symbol work and scripted analysis.
  • vmcore-dmesg: extracts oops logs and last-kmsg from the dump.

Launch crash with debuginfo and module path:

crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore

Typical initial commands in crash:

sys                # kernel, uptime, panic info
ps                 # task list summary
bt                 # backtrace of current task (switch tasks with 'set <pid>', or use 'bt <pid>')
log                # kernel ring buffer extracted from vmcore
kmem -i            # memory info: zones, nodes, reclaimers
files -p <PID>     # per-process file descriptors
dev -d             # device list & drivers
irq                # IRQ and softirq state
foreach bt         # backtrace all tasks (can be heavy on large systems)

Example 1 — Kernel Bug/Oops Leading to Panic

Symptoms at runtime: abrupt reboot, serial console shows BUG/oops with taint flags; no userspace core dumps.

Postmortem workflow:

vmcore-dmesg /var/crash/vmcore | less

Look for signatures such as:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
RIP: 0010:driver_xyz_process+0x5a/0x120 [driver_xyz]
Call Trace:
  worker_thread+0x8f/0x1a0
  kthread+0xef/0x120
  ret_from_fork+0x2c/0x40
Tainted: G    B  W  OE  5.14.0-xyz #1

Correlate symbols and inspect the faulting frame:

crash> sym driver_xyz_process
crash> dis -l driver_xyz_process+0x5a
crash> bt
crash> set <pid_of_worker>
crash> bt -f            # show full frames with arguments
crash> struct task_struct <task_addr>

Representative outputs:

# vmcore-dmesg (panic excerpt)
[ 1234.567890] Kernel panic - not syncing: Fatal exception
[ 1234.567891] CPU: 7 PID: 4123 Comm: kworker/u16:2 Tainted: G    B  W  OE   5.14.0-xyz #1
[ 1234.567893] Hardware name: Generic XYZ/ABC, BIOS 1.2.3 01/01/2025
[ 1234.567895] Workqueue: events_unbound driver_xyz_wq
[ 1234.567897] RIP: 0010:driver_xyz_process+0x5a/0x120 [driver_xyz]
[ 1234.567901] Call Trace:
[ 1234.567905]  worker_thread+0x8f/0x1a0
[ 1234.567906]  kthread+0xef/0x120
[ 1234.567907]  ret_from_fork+0x2c/0x40
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/5.14.0-xyz/vmlinux
    DUMPFILE: /var/crash/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Tue Oct 14 10:22:31 2025
      UPTIME: 02:14:58
LOAD AVERAGE: 6.14, 6.02, 5.77
       PANIC: "Kernel panic - not syncing: Fatal exception"
         PID: 4123
     COMMAND: "kworker/u16:2"
        TASK: ffff8b2a7f1f0c00  [THREAD_INFO: ffffb2f1c2d2a000]
         CPU: 7
       STATE: TASK_RUNNING (PANIC)
crash> ps
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
> 4123       2    7  ffff8b2a7f1f0c00  RU   0.1   0      0    kworker/u16:2
   1         0    0  ffff8b2a70000180  IN   0.0  16272  1308  systemd
   532       1    2  ffff8b2a703f9b40  IN   0.2  912312 80324 containerd
   987     532    5  ffff8b2a7a2fcd00  IN   0.4  1452312 231212 kubelet
crash> bt
PID: 4123  TASK: ffff8b2a7f1f0c00  CPU: 7  COMMAND: "kworker/u16:2"
 #0 [ffffb2f1c2d2be78] machine_kexec at ffffffff914b3e10
 #1 [ffffb2f1c2d2bec8] __crash_kexec at ffffffff915a1c32
 #2 [ffffb2f1c2d2bf28] panic at ffffffff914c2a9d
 #3 [ffffb2f1c2d2bf80] oops_end at ffffffff9148df90
 #4 [ffffb2f1c2d2bfb0] page_fault_oops at ffffffff9148e4b5
 #5 [ffffb2f1c2d2bfe0] exc_page_fault at ffffffff91abc7e1
 #6 [ffffb2f1c2d2c018] asm_exc_page_fault at ffffffff91c0133e
 #7 [ffffb2f1c2d2c048] driver_xyz_process+0x5a/0x120 [driver_xyz]
 #8 [ffffb2f1c2d2c0a0] worker_thread+0x8f/0x1a0
 #9 [ffffb2f1c2d2c0e0] kthread+0xef/0x120
#10 [ffffb2f1c2d2c110] ret_from_fork+0x2c/0x40
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM    3276800      12.5 GB   100%
         FREE     152345       595 MB     4%
         USED    3124455      11.9 GB    96%
       SHARED      80312       313 MB     2%
      BUFFERS      49152       192 MB     1%
       CACHED     842304       3.2 GB    26%
        SLAB      921600       3.5 GB    28%
      PAGECACHE   655360       2.5 GB    20%
ZONE DMA32: min 16224, low 20280, high 24336, scanned 1e6, order 3 allocs failing
Reclaimers: kswapd0: active, direct reclaim: observed
crash> log | head -n 6
<0>[ 1234.567890] Kernel panic - not syncing: Fatal exception
<4>[ 1234.567900] CPU: 7 PID: 4123 Comm: kworker/u16:2 Tainted: G    B  W  OE
<4>[ 1234.567905] RIP: 0010:driver_xyz_process+0x5a/0x120 [driver_xyz]
<6>[ 1234.567950] Workqueue: events_unbound driver_xyz_wq

Actionable patterns:

  • Null-dereference at a deref site → check expected invariants and lifetime rules for the object; validate RCU usage (rcu_read_lock()/_unlock() pairs) and reference counting (kref, refcount_t).
  • Use-after-free → examine slab allocator metadata around the pointer; kmem and rd -p (raw reads) can validate freelist poisoning.
  • Interrupt vs thread context → verify hardirq/softirq context in bt; ensure lock acquisition order obeys documented lockdep dependencies.

If tainted by proprietary modules (OE), ensure matching module debuginfo is loaded so frames resolve cleanly. Validate module list:

crash> mod

From here, produce a minimal repro and map the faulting path to specific source lines using dis -l and DWARF line tables; attach exact register state and call trace to the fix.

Example 2 — Out-Of-Memory (OOM) and Panic-on-OOM

By default, OOM does not produce a vmcore because the kernel kills a task to free memory and continues. For deterministic forensics on pathological memory pressure, set vm.panic_on_oom=1 or 2 so the system panics and kdump captures a vmcore.

Pre-incident configuration:

sysctl -w vm.panic_on_oom=2
sysctl -w kernel.panic=15

After the event, extract the OOM report:

vmcore-dmesg /var/crash/vmcore | grep -A40 -B10 -n "Out of memory"

Example OOM excerpt:

[ 4321.000001] Out of memory: Killed process 29876 (java) total-vm:16384000kB, anon-rss:15500000kB, file-rss:12000kB, shmem-rss:0kB, UID:1000 pgtables:30240kB oom_score_adj:0
[ 4321.000015] oom_reaper: reaped process 29876 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 4321.000120] Memory cgroup out of memory: Killed process 30555 (python) in cgroup /kubepods/besteffort/pod123/xyz
[ 4321.000200] Node 0 DMA32: free:152kB min:16224kB low:20280kB high:24336kB active_anon:10123456kB inactive_anon:1123456kB active_file:123456kB inactive_file:654321kB unevictable:0kB
[ 4321.000250] kswapd0: node 0, oom: task failed order:3, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)

Selected crash views for OOM analysis:

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM    3276800      12.5 GB   100%
         FREE      20480        80 MB     0%
         USED    3256320      12.4 GB    99%
       CACHED     131072       512 MB     4%
        SLAB      983040       3.7 GB    30%
Direct reclaim active; high-order allocations failing (order:3)
crash> ps -m | head -n 5
   PID    VSZ      RSS COMM
 29876  16384000 15500000 java
 30555   2048000  1800000 python
   987   1452312   231212 kubelet

Interpretation checklist inside crash:

crash> kmem -i               # zones, watermarks, reclaimers state
crash> kmem -s               # slab usage; look for runaway caches
crash> ps -m                 # memory stats per task
crash> vtop <task> <va>      # translate VA to PFN to inspect mapping
crash> files -p <PID>        # fd pressure and mmaps
crash> p sysctl_oom_dump_tasks
crash> log                   # OOM killer selection rationale, constraints

Indicators:

  • oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=... shows the policy path; score/oom_score_adj determine the victim.
  • Stalled reclaim (kswapd, direct reclaim) with high order allocations failing → likely hugepages, GFP_ATOMIC depletion, or CMA stress.
  • One slab consuming disproportionate memory → e.g., runaway dentry or kmalloc-64 due to leak; confirm with kmem -S and inspect suspects via object walkers if available.

If OOM was triggered by a specific container/cgroup, use cgroup-aware views (kernel dependent):

crash> p memory.stat @<memcg_addr>

Correlating vmcore with Source and Binaries

Always analyze with the exact build artifacts of the panicked kernel:

  • vmlinux and module .debug files must match the uname -r and build ID of the running kernel at the time of panic.
  • Mismatches lead to wrong type layouts, invalid offsets, and misleading backtraces.
  • On distros with split debuginfo, install the debuginfo packages for the precise NVR (Name-Version-Release) string.

Crash Analysis Cheat Sheet

  • vmcore-dmesg: panic reason, oops, OOM logs; fastest high-signal overview.
  • sys: kernel build, CPU count, uptime, panic string, current task/CPU.
  • ps / ps -m: runnable tasks; -m adds memory stats per task.
  • bt / bt -f: backtrace of current or selected task with frames/args.
  • kmem -i / -s / -S: memory inventory; slabs by cache; cache detail.
  • log: kernel ring buffer reconstructed from vmcore.
  • mod: loaded modules and taint state.
  • files -p <PID>: file descriptors and mmaps for a task.
  • irq: hardirq/softirq state.
  • vtop <task> <va>: VA→PFN translation; inspect mappings around suspect pages.


2.3 - Enabling Automatic Kernel Crash Collection with kdump

How to automatically enable and configure kdump crash collection on Linux systems using the kdump-enabler script.

Automatic Enablement of Kernel Crash Dump Collection with kdump-enabler

This article explains how to automatically enable and configure kernel crash dump (kdump) collection on Linux systems using the kdump-enabler script. This approach works across multiple distributions and simplifies the process of preparing your system to collect crash dumps for troubleshooting and analysis.

Overview

kdump-enabler is a Bash script that automates the setup of kdump:

  • Installs required packages
  • Configures the crashkernel parameter in GRUB
  • Enables and starts the kdump service
  • Sets up SysRq for manual crash triggering
  • Creates backups of configuration files before changes
  • Supports Ubuntu, Debian, RHEL, CentOS, Fedora, openSUSE, Arch Linux, and more

Prerequisites

  • Root privileges (run with sudo)
  • systemd-based Linux distribution
  • GRUB bootloader
  • Sufficient disk space in /var/crash for crash dumps

Installation

Clone the repository and run the script:

git clone https://github.com/samatild/kdump-enabler.git
cd kdump-enabler
sudo ./kdump-enabler.sh

Or download and run directly:

curl -O https://raw.githubusercontent.com/samatild/kdump-enabler/main/kdump-enabler.sh
chmod +x kdump-enabler.sh
sudo ./kdump-enabler.sh

Usage

Run the script interactively:

sudo ./kdump-enabler.sh

Or use options for automation:

sudo ./kdump-enabler.sh -y           # Non-interactive mode
sudo ./kdump-enabler.sh --check-only  # Only check current configuration
sudo ./kdump-enabler.sh --no-sysrq    # Skip SysRq crash enablement

What the Script Does

  1. Detects your Linux distribution and package manager
  2. Checks current kdump status
  3. Installs required packages
  4. Configures crashkernel parameter in GRUB based on system RAM
  5. Enables kdump service at boot and starts it
  6. Enables SysRq for manual crash triggering
  7. Creates crash dump directory at /var/crash

Post-Installation

After running the script, reboot your system for the crashkernel parameter to take effect:

sudo reboot

Verify kdump is working:

  • Ubuntu/Debian:
    sudo kdump-tools test
    sudo systemctl status kdump-tools
    
  • RHEL/CentOS/Fedora:
    sudo kdumpctl showmem
    sudo systemctl status kdump
    
  • Check crashkernel:
    cat /proc/cmdline | grep crashkernel
    
  • Check SysRq:
    cat /proc/sys/kernel/sysrq
    # Should output: 1
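Note that sysrq is a bitmask, not just 0/1, so values other than 1 can still allow the crash trigger. A small Python sketch of the interpretation (the 0x8 "debugging dumps" bit corresponds to the kernel's SYSRQ_ENABLE_DUMP mask, which gates the 'c' trigger):

```python
# Hedged sketch: interpret /proc/sys/kernel/sysrq the way the kernel does.
# 0 disables SysRq, 1 enables everything, any other value is a bitmask.
SYSRQ_ENABLE_DUMP = 0x8  # "debugging dumps" bit; gates the crash ('c') trigger

def crash_trigger_allowed(sysrq_value: int) -> bool:
    if sysrq_value == 1:
        return True  # all SysRq functions enabled
    return bool(sysrq_value & SYSRQ_ENABLE_DUMP)

# On a live system: int(open("/proc/sys/kernel/sysrq").read())
print(crash_trigger_allowed(1))    # value set by kdump-enabler
print(crash_trigger_allowed(176))  # a common distro default (16+32+128)
```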
    

Examples

Below are examples of running the script on different distributions and with various options, along with the kinds of output you can expect.

Interactive run (Ubuntu/Debian)

sudo ./kdump-enabler.sh

# Output (abridged):
╔══════════════════════════════════════════════════════════════╗
║                        KDUMP ENABLER v1.0.0                  ║
╚══════════════════════════════════════════════════════════════╝
[INFO] Detecting Linux distribution...
[SUCCESS] Detected: Ubuntu 22.04
[INFO] Package manager: apt

[INFO] Checking current kdump configuration...
[WARNING] No crashkernel parameter found in kernel command line
[WARNING] kdump service exists but is not active
[WARNING] System requires kdump configuration

[WARNING] This script will:
  1. Install kdump packages (linux-crashdump kdump-tools kexec-tools)
  2. Configure crashkernel parameter in GRUB
  3. Enable and start kdump service
  4. Enable SysRq crash trigger
  5. Require a system reboot to complete setup
Do you want to continue? [y/N] y

[INFO] Installing required packages...
... apt-get update -qq
... apt-get install -y linux-crashdump kdump-tools kexec-tools
[SUCCESS] Packages installed successfully

[INFO] Configuring crashkernel parameter...
[INFO] Recommended crashkernel size: 384M (Total RAM: 12GB)
... updating /etc/default/grub
... running update-grub
[SUCCESS] GRUB configuration updated

[INFO] Configuring kdump settings...
... setting USE_KDUMP=1 in /etc/default/kdump-tools
[SUCCESS] kdump-tools configured

[INFO] Enabling kdump service...
[SUCCESS] kdump service enabled at boot
[WARNING] kdump service will start after reboot (crashkernel parameter needs to be loaded)

[INFO] Enabling SysRq crash trigger...
[SUCCESS] SysRq enabled for current session
[SUCCESS] SysRq configuration persisted to /etc/sysctl.conf

╔════════════════════════════════════════════════════════════════╗
║                    KDUMP SETUP COMPLETED                        ║
╚════════════════════════════════════════════════════════════════╝
IMPORTANT: A system reboot is required to apply all changes!

Non-interactive run (auto-confirm)

sudo ./kdump-enabler.sh -y

# Output differences:
# - Skips confirmation prompts
# - Performs install/configuration immediately

Check-only mode (no changes)

sudo ./kdump-enabler.sh --check-only

# Output (abridged):
[INFO] Checking current kdump configuration...
[WARNING] No crashkernel parameter found in kernel command line
[WARNING] kdump service not found
[INFO] Crash dump directory: /var/crash (0 dumps found)

# Exits after status check without installing or modifying anything

Skip SysRq enablement

sudo ./kdump-enabler.sh -y --no-sysrq

# Output differences:
# - Does not enable SysRq or persist sysctl settings
# - All other steps (packages, GRUB, service) proceed

RHEL/Fedora example highlights

sudo ./kdump-enabler.sh -y

# Output (abridged):
[SUCCESS] Detected: Red Hat Enterprise Linux 9
[INFO] Package manager: yum

[INFO] Installing required packages...
... yum install -y kexec-tools
[SUCCESS] Packages installed successfully

[INFO] Configuring crashkernel parameter...
... updating /etc/default/grub
... grub2-mkconfig -o /boot/grub2/grub.cfg
[SUCCESS] GRUB2 configuration updated

[INFO] Configuring kdump settings...
... ensuring path /var/crash in /etc/kdump.conf
... setting core_collector makedumpfile -l --message-level 1 -d 31
[SUCCESS] kdump.conf configured

Testing Crash Dumps

⚠️ Warning: The following will immediately crash your system and generate a dump.

echo c | sudo tee /proc/sysrq-trigger

After reboot, check for crash dumps:

ls -lh /var/crash/

Troubleshooting

  • Ensure crashkernel is loaded: cat /proc/cmdline | grep crashkernel
  • Reboot after running the script
  • Check available memory and disk space
  • View service logs: sudo journalctl -u kdump -xe
  • Update GRUB if needed and reboot

References


For more details, see the kdump-enabler GitHub repository.

2.4 - Kernel Mode vs User Mode: Privilege Levels and System Call Execution

Deep technical explanation of CPU privilege levels, kernel mode vs user mode execution contexts, system call mechanisms, memory protection, and security implications in the Linux kernel.

CPU Privilege Levels and Execution Contexts

Summary

Modern processors implement hardware-enforced privilege levels to isolate untrusted user code from critical kernel services. Linux uses two primary modes: kernel mode (ring 0, CPL 0, EL1) and user mode (ring 3, CPL 3, EL0). Kernel mode grants unrestricted access to CPU features, physical memory, I/O ports, and privileged instructions. User mode restricts access to a virtualized, isolated address space and requires kernel mediation for hardware resources. The transition between modes occurs via system calls, interrupts, and exceptions, all managed by the kernel’s interrupt and system call handlers. Understanding this separation is fundamental to security, performance optimization, and kernel debugging.

CPU Privilege Levels (Architecture Overview)

x86/x86-64 Privilege Rings

The x86 architecture defines four privilege levels (rings 0-3), though Linux uses only rings 0 and 3:

Figure: x86/AMD64 CPU Privilege Rings 0–3
  • Ring 0 (Kernel Mode): Highest privilege, executes kernel code, device drivers, interrupt handlers
  • Ring 1-2: Unused by Linux (historically used for device drivers or hypervisors)
  • Ring 3 (User Mode): Lowest privilege, executes user applications

Current Privilege Level (CPL): Stored in the lower 2 bits of the CS (Code Segment) register:

  • CPL=0: Kernel mode
  • CPL=3: User mode

Kernel source: arch/x86/include/asm/segment.h

// Simplified CPL check macros
#define user_mode(regs) (((regs)->cs & SEGMENT_RPL_MASK) == USER_RPL)
#define kernel_mode(regs) (!user_mode(regs))
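Reduced to its essence, the test reads the Requested Privilege Level in the low two bits of the CS selector. A minimal Python sketch (the selector values are illustrative):

```python
# Sketch of the x86-64 mode check: the low 2 bits of the CS selector hold
# the Requested Privilege Level (RPL); ring 3 means user mode.
SEGMENT_RPL_MASK = 0x3
USER_RPL = 3  # ring 3

def user_mode(cs: int) -> bool:
    return (cs & SEGMENT_RPL_MASK) == USER_RPL

print(user_mode(0x33))  # typical 64-bit user-mode CS selector
print(user_mode(0x10))  # typical kernel CS selector
```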

ARM64 Exception Levels

ARM64 uses Exception Levels (EL0-EL3):

Figure: ARM64 Exception Levels EL0–EL3
  • EL0 (User Mode): Applications and user-space code
  • EL1 (Kernel Mode): Linux kernel, hypervisor in some configurations
  • EL2 (Hypervisor): Virtualization layer (KVM, Xen)
  • EL3 (Secure Monitor): TrustZone secure world

Current Exception Level (EL): Stored in CurrentEL system register (bits [3:2])

Kernel source: arch/arm64/include/asm/ptrace.h

#define PSR_MODE_EL0t  0x00000000
#define PSR_MODE_EL1t  0x00000004
#define PSR_MODE_EL1h  0x00000005

Kernel Mode Characteristics

Privileges and Capabilities

1. Unrestricted Memory Access

  • Direct access to physical memory addresses
  • Can read/write kernel data structures
  • Access to kernel address space (typically 0xffff800000000000 and above on x86-64)
  • Can modify page tables, MMU configuration

2. Privileged Instructions

  • cli/sti (disable/enable interrupts)
  • lgdt/lidt (load GDT/IDT)
  • mov crX (control register access)
  • in/out (I/O port access)
  • hlt (halt CPU)
  • wrmsr/rdmsr (Model-Specific Registers)

3. Interrupt and Exception Handling

  • Can install interrupt handlers
  • Access to interrupt controller (APIC, GIC)
  • Exception vector table modification

4. System Control

  • CPU scheduling decisions
  • Process creation/destruction
  • File system operations
  • Network stack management

Kernel example - Direct memory access:

// Kernel can directly access physical memory
void *phys_addr = __va(0x1000000);  // Convert physical to virtual
unsigned long value = *(unsigned long *)phys_addr;

// User mode cannot do this - would cause segmentation fault

Kernel Mode Execution Contexts

1. Process Context

  • Executing on behalf of a user process (system call handler)
  • Has associated task_struct, user-space memory mappings
  • Can sleep, be preempted
  • Example: read(), write(), open() syscall handlers

2. Interrupt Context

  • Executing in response to hardware interrupt
  • No associated process (current may point to interrupted process)
  • Cannot sleep, very limited blocking operations
  • Must be fast and non-blocking
  • Example: Network packet received, timer interrupt

3. Softirq/Tasklet Context

  • Deferred interrupt processing
  • Can run in interrupt context or ksoftirqd thread
  • Similar constraints to interrupt context
  • Example: Network packet processing, timer callbacks

Kernel source - Process context check:

// arch/x86/kernel/entry_64.S
ENTRY(entry_SYSCALL_64)
    // Save user context
    SWAPGS
    movq    %rsp, PER_CPU_VAR(cpu_current_top_of_stack)
    
    // Switch to kernel stack
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
    
    // Now in kernel mode, process context
    call    do_syscall_64

User Mode Characteristics

Restrictions and Isolation

1. Virtual Memory Only

  • Cannot access physical addresses directly
  • Limited to virtual address space (typically 0x0000000000000000 to 0x00007fffffffffff on x86-64)
  • Page tables managed by kernel, not user-space
  • Segmentation fault on invalid access

2. No Privileged Instructions

  • Attempting privileged instructions causes General Protection Fault (#GP)
  • Trapped by kernel, typically results in SIGSEGV to process

3. System Call Interface

  • Must use system calls to request kernel services
  • System calls are the controlled entry point to kernel mode
  • Each syscall has defined parameters and return values

4. Process Isolation

  • Each process has independent virtual address space
  • Cannot directly access other processes’ memory
  • Inter-process communication requires kernel mediation (shared memory, pipes, sockets)
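Even the simplest IPC primitive illustrates this mediation: a pipe round-trip is nothing but system calls, with the kernel copying the bytes between the two file descriptors. A minimal Python sketch:

```python
import os

# Sketch: os.pipe() is a pipe2() system call; every write()/read() below
# crosses the user->kernel boundary, and the kernel moves the bytes
# between the two descriptors' buffers.
r, w = os.pipe()
os.write(w, b"hello via the kernel")
data = os.read(r, 64)
os.close(r)
os.close(w)
print(data)
```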

User-space example - Attempting privileged operation:

// This will fail in user mode
int main() {
    unsigned long cr0;
    asm volatile("mov %%cr0, %0" : "=r"(cr0));  // Privileged instruction
    // Result: SIGSEGV - Segmentation fault
    return 0;
}

User-space example - Valid system call:

// User mode must use system calls
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = open("/etc/passwd", O_RDONLY);  // System call
    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);  // System call
    return 0;
}

Mode Transitions: System Calls

System Call Mechanism

System calls are the controlled mechanism for transitioning from user mode to kernel mode. The transition involves:

  1. User-space preparation: Set up syscall number and arguments
  2. Hardware trap: CPU switches to kernel mode via special instruction
  3. Kernel handler: Kernel validates and executes requested operation
  4. Return: Kernel switches back to user mode with result

x86-64 System Call Interface

Figure: Transition from User Mode to Kernel Mode via System Call

Syscall instruction (syscall):

  • Fastest method on x86-64
  • System call number in %rax
  • Arguments in %rdi, %rsi, %rdx, %r10, %r8, %r9
  • Return value in %rax
  • Return address saved in %rcx, saved RFLAGS in %r11 (both clobbered by syscall)
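This convention can be exercised from user space through libc's generic syscall(2) wrapper. A hedged Python sketch using ctypes (39 is __NR_getpid on x86-64 Linux only, so the code falls back to the portable wrapper on other platforms):

```python
import ctypes
import os
import platform

# Sketch: invoke the raw system call interface via libc's syscall(2)
# wrapper. Syscall numbers are architecture-specific: __NR_getpid is 39
# on x86-64 but 172 on arm64, hence the guard below.
SYS_getpid_x86_64 = 39

if platform.system() == "Linux" and platform.machine() == "x86_64":
    libc = ctypes.CDLL(None, use_errno=True)
    pid = libc.syscall(SYS_getpid_x86_64)
else:
    pid = os.getpid()  # portable fallback for other architectures/OSes

print(pid == os.getpid())
```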

Kernel source: arch/x86/entry/entry_64.S

// User-space syscall invocation (glibc wrapper)
static inline long __syscall0(long n) {
    long ret;
    asm volatile ("syscall" : "=a"(ret) : "a"(n) : "rcx", "r11", "memory");
    return ret;
}

// Kernel entry point
ENTRY(entry_SYSCALL_64)
    UNWIND_HINT_EMPTY
    
    // Swap GS to kernel space
    swapgs
    
    // Save user-space registers
    movq    %rsp, PER_CPU_VAR(cpu_current_top_of_stack)
    
    // Load kernel stack
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
    
    // Call C handler
    call    do_syscall_64

Alternative: int 0x80 (legacy):

  • Slower, uses interrupt mechanism
  • Still supported for 32-bit compatibility
  • Uses interrupt descriptor table (IDT)

ARM64 System Call Interface

svc (Supervisor Call) instruction:

  • System call number in x8 (W8)
  • Arguments in x0-x7
  • Return value in x0

Kernel source: arch/arm64/kernel/entry.S

// User-space syscall invocation
long syscall(long n, ...) {
    register long ret asm("x0");
    register long syscall_nr asm("x8") = n;
    asm volatile ("svc #0" : "=r"(ret) : "r"(syscall_nr) : "memory");
    return ret;
}

System Call Handler Flow

Kernel source: kernel/sys.c, arch/x86/entry/common.c

// Simplified syscall handler
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    // Validate syscall number
    if (likely(nr < NR_syscalls)) {
        // Get syscall function pointer
        syscall_fn_t syscall_fn = syscall_table[nr];
        
        // Execute syscall (still in kernel mode)
        regs->ax = syscall_fn(regs);
    } else {
        regs->ax = -ENOSYS;
    }
    
    // Return to user mode
    syscall_exit_to_user_mode(regs);
}

Practical example - Tracing system calls:

# Trace all syscalls made by a process
strace -e trace=all ls -l

# Example output:
# openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\3\2\0\0\0\0\0"..., 832) = 832
# close(3)                                = 0
# openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
# getdents64(3, /* 5 entries */, 32768)  = 144
# close(3)                                = 0

Memory Protection and Address Spaces

Virtual Address Space Layout

x86-64 User-Space (Linux):

0x0000000000000000 - 0x00007fffffffffff: User-space (128 TB)
    ├─ 0x0000000000400000: Code segment (.text)
    ├─ 0x00007fffff000000: Stack (grows downward)
    ├─ 0x00007ffffffde000: vDSO (virtual dynamic shared object)
    └─ 0x00007ffffffff000: Stack guard page

x86-64 Kernel-Space:

0xffff800000000000 - 0xffffffffffffffff: Kernel-space (128 TB)
    ├─ 0xffff888000000000: Direct mapping of physical memory
    ├─ 0xffffc90000000000: vmalloc area
    ├─ 0xffffea0000000000: Vmemmap (struct page array)
    └─ 0xffffffff80000000: Kernel code (text segment)

Kernel source: arch/x86/include/asm/page_64_types.h

#define __PAGE_OFFSET_BASE      _AC(0xffff880000000000, UL)
#define __PAGE_OFFSET           __PAGE_OFFSET_BASE
#define __START_KERNEL_map      _AC(0xffffffff80000000, UL)

Page Table Protection

Page Table Entry (PTE) flags:

  • User/Supervisor (U/S) bit: Controls user vs kernel access
    • U/S=0: Kernel-only pages (kernel mode access)
    • U/S=1: User-accessible pages (both modes)
  • Read/Write (R/W) bit: Controls write permissions
  • Execute Disable (NX) bit: Prevents code execution (DEP/XN)

Kernel source: arch/x86/include/asm/pgtable_types.h

#define _PAGE_BIT_PRESENT   0
#define _PAGE_BIT_RW        1
#define _PAGE_BIT_USER      2
#define _PAGE_BIT_PWT       3
#define _PAGE_BIT_PCD       4
#define _PAGE_BIT_ACCESSED  5
#define _PAGE_BIT_DIRTY     6
#define _PAGE_BIT_PSE       7
#define _PAGE_BIT_PAT       7
#define _PAGE_BIT_GLOBAL    8
#define _PAGE_BIT_SOFTW1    9
#define _PAGE_BIT_SOFTW2    10
#define _PAGE_BIT_SOFTW3    11
#define _PAGE_BIT_PAT_LARGE 12
#define _PAGE_BIT_SPECIAL   _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST  _PAGE_BIT_SOFTW1
#define _PAGE_BIT_NX        63
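These bit positions make it straightforward to decode a raw PTE value by hand. A Python sketch (the PFN extraction assumes frame-number bits 12–51; real CPUs cap this at MAXPHYADDR):

```python
# Hedged sketch: decode a raw x86-64 PTE using the bit positions above.
_PAGE_BIT_PRESENT, _PAGE_BIT_RW, _PAGE_BIT_USER, _PAGE_BIT_NX = 0, 1, 2, 63

def decode_pte(pte: int) -> dict:
    return {
        "present": bool(pte >> _PAGE_BIT_PRESENT & 1),
        "writable": bool(pte >> _PAGE_BIT_RW & 1),
        "user": bool(pte >> _PAGE_BIT_USER & 1),      # U/S bit
        "no_exec": bool(pte >> _PAGE_BIT_NX & 1),     # NX bit
        "pfn": (pte >> 12) & ((1 << 40) - 1),         # bits 12..51: frame number
    }

# Kernel-only, writable, non-executable data page at physical frame 0x1234:
pte = (1 << 63) | (0x1234 << 12) | 0b011
print(decode_pte(pte))
```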

Practical example - Checking page permissions:

# View memory mappings of a process
cat /proc/self/maps

# Example output:
# 00400000-00401000 r-xp 00000000 08:01 123456 /bin/cat      (executable, read-only)
# 00600000-00601000 r--p 00000000 08:01 123456 /bin/cat      (read-only data)
# 00601000-00602000 rw-p 00001000 08:01 123456 /bin/cat      (read-write data)
# 7fff00000000-7fff00001000 rw-p 00000000 00:00 0            (stack, read-write)
# ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0    (vsyscall, kernel-managed)
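Each maps line follows a fixed field layout, which a few lines of Python can parse. A sketch (the dictionary keys are my own labels, not kernel terminology):

```python
# Sketch: parse one /proc/<pid>/maps line into its fields.
def parse_maps_line(line: str) -> dict:
    parts = line.split(maxsplit=5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    perms = parts[1]  # e.g. "r-xp": read/write/execute + private|shared
    return {
        "start": start, "end": end, "size": end - start,
        "readable": "r" in perms, "writable": "w" in perms,
        "executable": "x" in perms, "private": perms.endswith("p"),
        "path": parts[5] if len(parts) > 5 else "",
    }

line = "00400000-00401000 r-xp 00000000 08:01 123456 /bin/cat"
print(parse_maps_line(line))
```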

Segmentation Fault Handling

When user-mode code attempts invalid access:

  1. CPU generates page fault (#PF exception)
  2. Kernel page fault handler (do_page_fault()) executes
  3. Kernel checks:
    • Address validity
    • Permissions (U/S bit, R/W bit)
    • If invalid → send SIGSEGV to process
    • If valid (copy-on-write, lazy allocation) → fix mapping and resume

Kernel source: arch/x86/mm/fault.c

static void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
                unsigned long address) {
    struct vm_area_struct *vma;
    struct mm_struct *mm = current->mm;
    
    // Check if fault occurred in kernel mode
    if (unlikely(fault_in_kernel_space(address))) {
        // Kernel-mode fault handling
        do_kern_addr_fault(regs, error_code, address);
        return;
    }
    
    // User-mode fault
    vma = find_vma(mm, address);
    if (!vma) {
        // Invalid address - send SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }
    
    // Check permissions
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address, vma);
        return;
    }
    
    // Valid fault - handle (COW, lazy allocation, etc.)
    fault = handle_mm_fault(vma, address, flags, regs);
}

Interrupts and Exceptions

Interrupt Handling Flow

Hardware interrupt sequence:

  1. Hardware device asserts interrupt line
  2. CPU finishes current instruction
  3. CPU switches to kernel mode (CPL=0)
  4. CPU saves user context (registers, return address)
  5. Kernel interrupt handler executes
  6. Kernel returns to user mode (or schedules different task)

Kernel source: arch/x86/entry/entry_64.S

// Interrupt entry point
ENTRY(common_interrupt)
    // Save all registers
    SAVE_ALL
    
    // Disable interrupts (if needed)
    cli
    
    // Call C interrupt handler
    call    do_IRQ
    
    // Restore and return
    RESTORE_ALL
    iretq

Exception Types

Faults (correctable):

  • Page fault (#PF): Virtual memory access violation
  • General Protection Fault (#GP): Invalid memory access, privilege violation
  • Division Error (#DE): Divide by zero

Traps (instruction completes):

  • Breakpoint (#BP): int3 instruction, debugger breakpoints
  • Overflow (#OF): Arithmetic overflow

Aborts (severe errors):

  • Double Fault (#DF): Exception during exception handling
  • Machine Check (#MC): Hardware error

All exceptions transition to kernel mode for handling.

Security Implications

Why Separation Matters

1. Process Isolation

  • User processes cannot access each other’s memory
  • Kernel enforces access control via page tables
  • Prevents malicious or buggy programs from affecting others

2. Resource Protection

  • Hardware resources (I/O ports, MSRs) protected from user access
  • File system integrity maintained by kernel
  • Network stack isolation prevents packet manipulation

3. Privilege Escalation Prevention

  • User code cannot directly execute privileged instructions
  • System calls are the only controlled entry point
  • Kernel validates all requests before execution

Security example - Attempted privilege escalation:

// User-space code attempting to access kernel memory
int main() {
    // Attempt to read kernel address space
    unsigned long *kernel_addr = (unsigned long *)0xffffffff80000000;
    unsigned long value = *kernel_addr;  // Page fault!
    // Result: SIGSEGV - Segmentation fault
    // Kernel prevents user access to kernel memory
    return 0;
}

System Call Validation

Kernel validates all syscall parameters:

// Example: open() syscall validation
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) {
    // Validate filename pointer (user-space address)
    if (get_user(path, filename))
        return -EFAULT;
    
    // Validate path length
    if (strnlen_user(path, MAX_PATH) > MAX_PATH)
        return -ENAMETOOLONG;
    
    // Check permissions
    if (!may_open(&nd.path, acc_mode, open_flag))
        return -EACCES;
    
    // Proceed with file opening
    return do_filp_open(dfd, tmp, &op);
}

Practical Examples and Demonstrations

Example 1: Observing Mode Transitions

Trace system calls and mode switches:

# Install perf tools (package name varies by distribution)
sudo apt-get install linux-perf   # Debian; on Ubuntu: linux-tools-$(uname -r)

# Trace syscalls with timing
sudo perf trace -e syscalls:sys_enter_openat,syscalls:sys_exit_openat ls

# Output shows:
# 0.000 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# 0.012 <... openat resumed> )      = 3
# Each line shows user->kernel->user transition

Example 2: Kernel vs User Memory Access

User-space program:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main() {
    // User-space memory allocation
    void *user_addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
                          MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    
    printf("User-space address: %p\n", user_addr);
    *(int *)user_addr = 42;  // OK - user can write to own memory
    
    // Attempt kernel-space access
    void *kernel_addr = (void *)0xffffffff80000000;
    // *(int *)kernel_addr = 42;  // Would cause SIGSEGV
    
    return 0;
}

Kernel module (for comparison):

// Kernel module can access both
static int __init test_init(void) {
    void *user_addr = (void *)0x400000;  // User-space address
    void *kernel_addr = (void *)0xffffffff80000000;  // Kernel address
    
    // Access kernel memory (requires proper mapping)
    // unsigned long val = *(unsigned long *)kernel_addr;
    
    // Access user memory (requires copy_from_user)
    // unsigned long val;
    // copy_from_user(&val, user_addr, sizeof(val));
    
    return 0;
}

Example 3: System Call Overhead

Measure syscall overhead:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main() {
    struct timeval start, end;
    long iterations = 1000000;
    
    gettimeofday(&start, NULL);
    for (long i = 0; i < iterations; i++) {
        getpid();  // System call
    }
    gettimeofday(&end, NULL);
    
    long us = (end.tv_sec - start.tv_sec) * 1000000 +
              (end.tv_usec - start.tv_usec);
    printf("Average syscall time: %.2f nanoseconds\n",
           (us * 1000.0) / iterations);
    
    return 0;
}

// Typical output: ~200-500 nanoseconds per syscall
// This includes: user->kernel transition + handler + kernel->user return

Example 4: Inspecting Current Mode

Check if code is running in kernel or user mode:

# From user-space (always shows user mode)
cat /proc/self/status | grep State

# From kernel module (can check current mode)
# In kernel code:
if (user_mode(regs)) {
    printk("Running in user mode\n");
} else {
    printk("Running in kernel mode\n");
}

Kernel Internals: Mode Switching Code

Context Switch and Mode Return

Returning to user mode after syscall:

// arch/x86/entry/common.c
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) {
    // Check for pending work (signals, preemption)
    exit_to_user_mode_prepare(regs);
    
    // Restore user-space registers
    __syscall_exit_to_user_mode_work(regs);
}

static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs) {
    // Restore GS (swap back to user GS)
    swapgs;
    
    // Return to user mode
    // IRET instruction restores:
    // - CS (code segment) -> CPL=3
    // - SS (stack segment)
    // - RFLAGS (flags register)
    // - RIP (instruction pointer)
    // - RSP (stack pointer)
}

Interrupt Return

Returning from interrupt handler:

// arch/x86/entry/entry_64.S
ENTRY(common_interrupt_return)
    // Check if returning to user mode
    testb   $3, CS(%rsp)
    jnz     swapgs_restore_regs_and_return_to_usermode
    
    // Returning to kernel mode
    RESTORE_ALL
    iretq

swapgs_restore_regs_and_return_to_usermode:
    // Swap GS back to user mode
    swapgs
    RESTORE_ALL
    iretq

Performance Considerations

System Call Overhead

Factors affecting syscall performance:

  1. Mode transition cost: ~50-100 CPU cycles
  2. Register save/restore: ~100-200 cycles
  3. Cache effects: TLB flushes, cache misses
  4. Handler execution: Varies by syscall (nanoseconds to microseconds)

Optimization techniques:

  • vDSO (Virtual Dynamic Shared Object): Some syscalls (e.g., gettimeofday()) execute in user-space
  • Batch syscalls: io_uring, sendmmsg() reduce per-call overhead
  • Avoid unnecessary syscalls: Cache results, use memory-mapped I/O
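The difference is easy to measure from user space. A rough Python sketch timing a real syscall (os.getpid) against a clock read that glibc usually satisfies from the vDSO; results vary by CPU and mitigations, so treat the numbers as indicative only:

```python
import os
import time

# Sketch: average the per-call cost of a real syscall vs a vDSO-backed
# clock read. time.monotonic maps to clock_gettime(CLOCK_MONOTONIC),
# which glibc typically serves from the vDSO without entering the kernel.
def avg_ns(fn, iterations=100_000):
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations

syscall_ns = avg_ns(os.getpid)
vdso_ns = avg_ns(time.monotonic)
print(f"getpid: ~{syscall_ns:.0f} ns/call, monotonic: ~{vdso_ns:.0f} ns/call")
```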

Reducing Mode Transitions

Example - Reducing gettimeofday() overhead:

// Traditional approach (syscall each time)
for (int i = 0; i < 1000; i++) {
    gettimeofday(&tv, NULL);  // 1000 syscalls
}

// Optimized approach (vDSO - no syscall)
#include <sys/time.h>
#include <time.h>

for (int i = 0; i < 1000; i++) {
    clock_gettime(CLOCK_REALTIME, &ts);  // May use vDSO, no syscall
}

Debugging and Inspection Tools

Identifying Mode Transitions

1. Using strace:

# Trace all syscalls (shows user->kernel transitions)
strace -c ls -l

# Output:
# % time     seconds  usecs/call     calls    errors syscall
# ------ ----------- ----------- --------- --------- ----------------
#  23.45    0.234000         234      1000           openat
#  15.23    0.152000         152      1000           read
#  10.12    0.101000         101      1000           close

2. Using perf:

# Sample syscall events
sudo perf record -e syscalls:sys_enter_* -a sleep 1
sudo perf report

# Trace syscall latency
sudo perf trace --duration 1

3. Using gdb with kernel:

# Attach to running kernel (requires kgdb)
(gdb) target remote /dev/ttyS0
(gdb) break do_syscall_64
(gdb) continue
# Breakpoint shows entry to kernel mode

Kernel Logging

Monitor mode transitions in kernel logs:

# Enable kernel debugging
echo 8 > /proc/sys/kernel/printk  # Enable all log levels

# Watch for syscall-related messages
dmesg -w | grep -i "syscall\|fault\|segfault"

Common Pitfalls and Misconceptions

Misconception 1: “Kernel mode is always faster”

Reality: Mode transitions have overhead. Kernel mode itself isn’t faster; it’s necessary for privileged operations. User-space code can be faster for pure computation.

Misconception 2: “System calls are expensive”

Reality: Modern CPUs optimize syscall transitions. Overhead is typically 200-500 nanoseconds, negligible for most applications. I/O operations (disk, network) dominate latency, not syscall overhead.

Misconception 3: “Kernel code always runs in kernel mode”

Reality: Kernel code executes in kernel mode, but kernel threads can be scheduled like user processes. Interrupt handlers run in interrupt context, not process context.

Misconception 4: “User-space cannot access kernel memory”

Reality: User-space can access kernel memory through /proc, /sys, and mmap() of kernel-exported memory (e.g., /dev/mem with proper permissions). The kernel controls what is exposed.

References and Further Reading

  • Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3A (System Programming Guide) - Privilege levels and protection
  • ARM Architecture Reference Manual: Exception levels and privilege model
  • Linux Kernel Source: arch/x86/entry/ (system call and interrupt entry code)
  • Understanding the Linux Kernel (3rd ed.): Chapter 1 (Introduction), Chapter 3 (Processes)
  • Linux Device Drivers (3rd ed.): Chapter 9 (Interrupt Handling)
  • man 2 syscalls: System call documentation
  • man 2 syscall: Generic system call wrapper

Note: The distinction between kernel mode and user mode is fundamental to operating system security and stability. All privileged operations must occur in kernel mode, while user applications run in isolated user mode with controlled access to system resources via system calls.

2.5 - Understanding CPU Statistics in Linux (/proc/stat)

Deep technical dive into CPU time accounting in Linux, covering user, nice, system, idle, iowait, irq, softirq, steal, guest, and guest_nice statistics with practical examples and kernel internals.

Kernel-Level CPU Time Accounting

Summary

The Linux kernel maintains precise, per-CPU time accounting across ten distinct time categories. These statistics, exposed via /proc/stat, are cumulative tick counters (USER_HZ, typically 100 ticks per second) since system boot. Understanding these counters is essential for performance analysis, capacity planning, and diagnosing CPU contention, I/O bottlenecks, interrupt storms, and virtualization overhead.

The /proc/stat Interface

/proc/stat is a virtual file provided by the kernel’s proc filesystem. It contains system-wide statistics aggregated across all CPUs, plus individual per-CPU lines. Reads never block, and the counters are updated by the kernel scheduler’s tick handler.

View the raw statistics:

cat /proc/stat

Example output:

cpu  225748 1981 87654 3456789 1234 456 789 1011 2222 3333
cpu0 56387 495 21916 864197 308 114 197 252 555 833
cpu1 56390 495 21917 864198 309 114 197 253 556 833
cpu2 56385 495 21910 864197 308 114 197 252 555 833
cpu3 56386 496 21911 864197 309 114 198 254 556 834
intr 1234567890 123456 123456 123456 ...
ctxt 87654321
btime 1705123456
processes 23456
procs_running 2
procs_blocked 0
softirq 123456 789 12345 1234 5678 9012 3456 7890 123 456 789

The first line (cpu) aggregates all CPUs; subsequent cpuN lines show per-CPU statistics. Each CPU line contains ten fields:

cpuX user nice system idle iowait irq softirq steal guest guest_nice

Note: All values are cumulative counters measured in clock ticks (USER_HZ; see getconf CLK_TCK, typically 100 per second). To calculate percentages or rates, you must sample at two points in time and compute deltas.
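The delta computation reads naturally as a small Python helper (field names follow the order above; the sample counters are made up for illustration):

```python
# Sketch: compute per-field CPU percentages from two samples of a
# /proc/stat "cpu" line (the 10 jiffy counters, in order).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def cpu_percentages(sample1, sample2):
    """Each sample is the tuple of 10 counters taken at one point in time."""
    deltas = [b - a for a, b in zip(sample1, sample2)]
    total = sum(deltas) or 1  # avoid division by zero on identical samples
    return {f: 100.0 * d / total for f, d in zip(FIELDS, deltas)}

# e.g. parsed from "cpu  225748 1981 ..." at two points in time:
t1 = (100, 0, 50, 800, 10, 5, 5, 0, 0, 0)
t2 = (150, 0, 75, 1000, 10, 5, 5, 0, 0, 0)
pct = cpu_percentages(t1, t2)
print(f"user {pct['user']:.1f}%  system {pct['system']:.1f}%  idle {pct['idle']:.1f}%")
```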

Field-by-Field Breakdown

1. user (usr)

Kernel context: Time spent executing user-space code in normal priority processes.

Increment condition: Kernel tick handler (account_process_tick()) counts time when a process is running in user mode with a non-positive nice value (≤ 0); positively niced time is charged to the nice field instead.

Kernel source: kernel/sched/cputime.c::account_user_time()

Example interpretation:

  • High user time indicates CPU-bound applications
  • Typical range: 20-80% on interactive systems
  • The aggregate cpu line sums all CPUs, so its deltas can exceed 100% of a single CPU when multiple cores are busy

Practical example:

# Monitor user CPU time growth over 5 seconds
T1=$(grep '^cpu ' /proc/stat | awk '{print $2}')
sleep 5
T2=$(grep '^cpu ' /proc/stat | awk '{print $2}')
DELTA=$((T2 - T1))
HZ=$(getconf CLK_TCK)  # Usually 100 or 1000
PERCENT=$(echo "scale=2; ($DELTA * 100) / ($HZ * 5)" | bc)
echo "User CPU usage: ${PERCENT}%"

Kernel implementation detail:

// Simplified representation of kernel accounting
void account_user_time(struct task_struct *p, u64 cputime)
{
    // Positively niced tasks are charged to the nice bucket,
    // everything else to user
    int index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;

    p->utime += cputime;
    task_group_account_field(p, index, cputime);
}

2. nice

Kernel context: Time spent executing user-space code in positively niced (lowered-priority) processes (nice value > 0). Time for negatively niced (raised-priority) processes is accounted as user.

Increment condition: Same as user, but only when task_nice(p) > 0 (positive nice value).

Kernel source: kernel/sched/cputime.c::account_user_time() with nice check

Example interpretation:

  • Non-zero values indicate tasks running with lowered priority (nice > 0)
  • Negatively niced (raised-priority) processes are accounted under user, not nice
  • Typical contributors: background batch jobs, nice -n 19 workloads
  • Reads 0 if no positively niced process has consumed CPU time

Practical example:

# Find processes contributing to nice time
ps -eo pid,ni,comm,pcpu --sort=-pcpu | head -10

# Generate nice time by running a niced process
nice -n 19 dd if=/dev/zero of=/dev/null bs=1M count=1000 &
NICE_PID=$!
sleep 2
# Check nice time increment
grep '^cpu ' /proc/stat | awk '{print $3}'

Advanced: Nice value to priority mapping:

# Show nice values and their static priority (PRIO = 120 + nice for normal tasks)
for nice in -20 -10 0 10 19; do
    echo "Nice $nice -> Priority: $((120 + nice))"
done

3. system (sys)

Kernel context: Time spent executing kernel code on behalf of user processes (system calls, kernel services).

Increment condition: Kernel tick handler counts time when the process is in kernel mode (handling syscalls, page faults, exceptions, etc.).

Kernel source: kernel/sched/cputime.c::account_system_time()

Example interpretation:

  • High system time indicates frequent syscalls or kernel processing
  • Typical range: 5-30% on normal systems
  • Spikes suggest I/O-bound workloads, context switching, or kernel-intensive operations
  • Sustained values above 50% may indicate kernel bottlenecks or driver issues

Practical example:

# Monitor system call rate (indirectly via system time)
T1=$(grep '^cpu ' /proc/stat | awk '{print $4}')
strace -c -e trace=all sleep 1 2>&1 | tail -1
T2=$(grep '^cpu ' /proc/stat | awk '{print $4}')
echo "System time delta: $((T2 - T1)) jiffies"

# High system time scenarios:
# 1. Frequent file I/O
dd if=/dev/urandom of=/tmp/test bs=4K count=10000

# 2. Network operations
curl -s https://example.com > /dev/null

# 3. Process creation (use the external binary so each iteration forks)
for i in {1..1000}; do /bin/true; done

Kernel code path:

// System time is charged at tick granularity: if the timer tick
// lands while the task is executing in kernel mode, the full jiffy
// is accounted as system time (simplified)
void account_system_time(struct task_struct *p, int hardirq_offset,
                         u64 cputime)
{
    p->stime += cputime;
    task_group_account_field(p, CPUTIME_SYSTEM, cputime);
}

4. idle

Kernel context: Time the CPU spent idle (no runnable tasks, waiting in idle loop).

Increment condition: Kernel idle loop (do_idle()) executes when the runqueue is empty. The idle task (PID 0, swapper) runs and increments this counter.

Kernel source: kernel/sched/idle.c::do_idle()

Example interpretation:

  • High idle = low CPU utilization
  • Idle time should decrease under load
  • 100% - idle% ≈ total CPU utilization
  • On SMP systems, one CPU can be idle while others are busy

Practical example:

# Calculate CPU utilization percentage
# Note: uses since-boot totals; sample twice and diff for an instantaneous figure
get_cpu_usage() {
    local cpu_line=$(grep '^cpu ' /proc/stat)
    local user=$(echo $cpu_line | awk '{print $2}')
    local nice=$(echo $cpu_line | awk '{print $3}')
    local system=$(echo $cpu_line | awk '{print $4}')
    local idle=$(echo $cpu_line | awk '{print $5}')
    local iowait=$(echo $cpu_line | awk '{print $6}')
    local irq=$(echo $cpu_line | awk '{print $7}')
    local softirq=$(echo $cpu_line | awk '{print $8}')
    local steal=$(echo $cpu_line | awk '{print $9}')

    local total=$((user + nice + system + idle + iowait + irq + softirq + steal))
    local used=$((user + nice + system + irq + softirq + steal))

    echo "scale=2; ($used * 100) / $total" | bc
}

# Monitor over time
while true; do
    echo "$(date): CPU Usage: $(get_cpu_usage)%"
    sleep 1
done

Idle loop internals:

// Simplified idle loop (arch-specific)
static void do_idle(void) {
    while (1) {
        if (need_resched()) {
            schedule_idle();
            continue;
        }
        // Enter low-power state (HLT, MWAIT, etc.)
        arch_cpu_idle();
        account_idle_time(cpu, cputime);
    }
}

5. iowait

Kernel context: Time the CPU spent idle while waiting for I/O operations to complete.

Increment condition: CPU is idle (idle would increment) but there are outstanding I/O requests in flight. This is a special case of idle time.

Kernel source: kernel/sched/cputime.c::account_idle_time() with I/O pending check

Example interpretation:

  • Indicates I/O-bound workloads
  • High iowait suggests disk/network bottlenecks
  • Important: iowait does NOT mean the CPU is busy; it’s idle time waiting for I/O
  • Combined with low user/system = I/O bottleneck, not CPU bottleneck
  • Can be misleading on systems with async I/O (io_uring, etc.)

Practical example:

# Generate iowait by saturating disk I/O
dd if=/dev/zero of=/tmp/stress bs=1M count=10000 oflag=direct &
DD_PID=$!

# Monitor iowait growth
watch -n 1 "grep '^cpu ' /proc/stat | awk '{print \"iowait: \" \$6 \" jiffies\"}'"

# Stop the stress
kill $DD_PID

# Compare with actual I/O stats
iostat -x 1 5

Kernel accounting logic:

void account_idle_time(struct rq *rq, u64 cputime) {
    if (nr_iowait_cpu(smp_processor_id()) > 0)
        account_cputime_iowait(rq, cputime);
    else
        account_cputime_idle(rq, cputime);
}

Common misconception: iowait is NOT CPU time spent on I/O. The CPU is idle; the I/O device (disk controller, NIC) is busy.
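
One way to observe this on a live system is to watch procs_blocked (already present in /proc/stat, counting tasks in uninterruptible sleep) alongside the iowait counter while disk I/O is in flight. A rough sketch, with an arbitrary scratch path:

```shell
# Print iowait (field 6 of the cpu line) and procs_blocked together
snapshot() {
    awk '/^cpu /          { print "iowait_jiffies", $6 }
         /^procs_blocked/ { print $1, $2 }' /proc/stat
}

snapshot
# Direct-I/O writes put the writer into D state while the CPU idles
dd if=/dev/zero of=/tmp/iowait_demo bs=1M count=512 oflag=direct 2>/dev/null &
sleep 2
snapshot
wait
rm -f /tmp/iowait_demo
```

Between the two snapshots, iowait grows while user/system stay flat: the CPU is idle and the disk is busy.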

6. irq (hardirq)

Kernel context: Time spent servicing hardware interrupts (IRQs).

Increment condition: Kernel interrupt handler executes. Each IRQ handler increments per-CPU and per-IRQ counters.

Kernel source: kernel/softirq.c, interrupt handlers

Example interpretation:

  • High irq indicates interrupt-heavy workloads (network, storage, timers)
  • Typical range: <1% on idle systems, 1-5% under load
  • Spikes suggest hardware issues or misconfigured interrupt affinity
  • Can be distributed via smp_affinity masks

Practical example:

# View interrupt distribution
cat /proc/interrupts

# Generate high IRQ load (network interrupts)
iperf3 -s &
SERVER_PID=$!
iperf3 -c localhost -t 30 -P 8  # 8 parallel streams
kill $SERVER_PID

# Monitor IRQ time
T1=$(grep '^cpu ' /proc/stat | awk '{print $7}')
sleep 5
T2=$(grep '^cpu ' /proc/stat | awk '{print $7}')
echo "IRQ time delta: $((T2 - T1)) jiffies"

# Set interrupt affinity (example: bind IRQ 24 to CPU 0)
echo 1 > /proc/irq/24/smp_affinity

Interrupt handler accounting:

irqreturn_t handle_irq_event(struct irq_desc *desc) {
    u64 start = local_clock();
    // ... handle interrupt ...
    account_hardirq_time(current, local_clock() - start);
    return IRQ_HANDLED;
}

7. softirq

Kernel context: Time spent executing softirqs (deferred interrupt processing, bottom halves).

Increment condition: Kernel softirq daemon (ksoftirqd) or direct softirq execution in interrupt context processes pending softirqs.

Kernel source: kernel/softirq.c::__do_softirq()

Softirq types (from include/linux/interrupt.h):

  • HI_SOFTIRQ: High-priority tasklets
  • TIMER_SOFTIRQ: Timer callbacks
  • NET_TX_SOFTIRQ: Network transmit
  • NET_RX_SOFTIRQ: Network receive
  • BLOCK_SOFTIRQ: Block device I/O completion
  • IRQ_POLL_SOFTIRQ: IRQ polling
  • TASKLET_SOFTIRQ: Normal tasklets
  • SCHED_SOFTIRQ: Scheduler callbacks
  • HRTIMER_SOFTIRQ: High-resolution timers
  • RCU_SOFTIRQ: RCU callbacks

Example interpretation:

  • High softirq indicates deferred processing load (network, timers, RCU)
  • Network-heavy workloads show high NET_RX_SOFTIRQ/NET_TX_SOFTIRQ
  • RCU-heavy systems (many CPUs) show high RCU_SOFTIRQ
  • Softirq time can exceed 10% on network servers

Practical example:

# View softirq breakdown
cat /proc/softirqs

# Example output:
#                    CPU0       CPU1       CPU2       CPU3
#          HI:          0          0          0          0
#       TIMER:     123456     123457     123458     123459
#      NET_TX:       1234       1235       1236       1237
#      NET_RX:      56789      56790      56791      56792
#       BLOCK:          0          0          0          0
#    IRQ_POLL:          0          0          0          0
#     TASKLET:         12         13         14         15
#       SCHED:      23456      23457      23458      23459
#     HRTIMER:          0          0          0          0
#         RCU:     345678     345679     345680     345681

# Generate NET_RX softirq load
iperf3 -s &
SERVER_PID=$!
iperf3 -c localhost -t 10 -P 16
kill $SERVER_PID

# Monitor softirq time growth
watch -n 1 "grep '^cpu ' /proc/stat | awk '{print \"softirq: \" \$8}'"

Softirq execution contexts:

// Softirqs can run in:
// 1. Interrupt return path (if pending)
irq_exit() {
    if (in_interrupt() && local_softirq_pending())
        invoke_softirq();
}

// 2. ksoftirqd kernel thread
static int ksoftirqd(void *data) {
    while (!kthread_should_stop()) {
        __do_softirq();
        schedule();
    }
}

// 3. Explicit raise (local_bh_enable, etc.)

8. steal

Kernel context: Time “stolen” by the hypervisor from a virtual CPU (only in virtualized environments).

Increment condition: Hypervisor preempts the guest VM’s virtual CPU to schedule other VMs or host tasks. The guest kernel detects this via paravirtualized time sources (e.g., KVM’s kvm_steal_time).

Kernel source: arch/x86/kernel/kvm.c::kvm_steal_time_setup()

Example interpretation:

  • Only non-zero in VMs (KVM, Xen, VMware, Hyper-V)
  • High steal indicates host CPU overcommit or noisy neighbors
  • Sustained steal above ~10% suggests the VM should be migrated or host CPU capacity increased
  • Steal time is “lost” from the guest’s perspective
Practical example:

# Check if running in a VM
if [ -d /sys/devices/virtual/dmi/id ]; then
    if grep -q "KVM\|VMware\|Xen\|Microsoft" /sys/devices/virtual/dmi/id/product_name 2>/dev/null; then
        echo "Running in virtualized environment"
    fi
fi

# Monitor steal time
watch -n 1 "grep '^cpu ' /proc/stat | awk '{print \"steal: \" \$9 \" jiffies\"}'"

# Calculate steal percentage
calc_steal_pct() {
    local cpu_line=$(grep '^cpu ' /proc/stat)
    # Sum fields 2-9 only: guest/guest_nice are already included in user/nice
    local total=$(echo $cpu_line | awk '{sum=0; for(i=2;i<=9;i++) sum+=$i; print sum}')
    local steal=$(echo $cpu_line | awk '{print $9}')
    echo "scale=2; ($steal * 100) / $total" | bc
}

KVM steal time mechanism:

// Host side (KVM)
static void record_steal_time(struct kvm_vcpu *vcpu) {
    struct kvm_steal_time *st = vcpu->arch.st;
    st->steal += current->sched_info.run_delay;
}

// Guest side (Linux kernel)
static void kvm_steal_time_setup(void) {
    // Read steal time from shared page
    steal = st->steal;
    account_steal_time(steal);
}

9. guest

Kernel context: Time spent running a guest OS (nested virtualization or KVM guest time accounting).

Increment condition: Host kernel accounts time when a guest VM’s virtual CPU is executing. This is the inverse of steal from the host’s perspective.

Kernel source: kernel/sched/cputime.c::account_guest_time()

Example interpretation:

  • Non-zero on hypervisor hosts running VMs
  • Represents CPU time consumed by guest VMs
  • In nested virtualization, a guest VM can itself host VMs
  • Typically only relevant for hypervisor monitoring

Practical example:

# On a KVM host, monitor guest time
watch -n 1 "grep '^cpu ' /proc/stat | awk '{print \"guest: \" \$10}'"

# Compare with VM CPU usage (from host perspective)
virsh domstats --cpu <domain>

Guest time accounting:

// When guest VM executes on host CPU
void account_guest_time(struct task_struct *p, u64 cputime) {
    account_cputime_guest(p, cputime);
    // Guest time is also counted as user time from host perspective
    account_user_time(p, cputime);
}

10. guest_nice

Kernel context: Time spent running niced guest OS processes (nested virtualization).

Increment condition: Same as guest, but for processes with non-zero nice values in the guest.

Kernel source: kernel/sched/cputime.c::account_guest_time() with nice check

Example interpretation:

  • Parallel to nice but for guest VM processes
  • Rarely non-zero unless guests run niced workloads
  • Mostly relevant for hypervisor capacity planning

Practical example:

# Monitor guest_nice (typically zero)
grep '^cpu ' /proc/stat | awk '{print "guest_nice: " $11}'

Complete CPU Utilization Calculation

To calculate accurate CPU percentages, sample /proc/stat at two points and compute deltas:

#!/bin/bash
# Comprehensive CPU statistics calculator

get_cpu_stats() {
    grep '^cpu ' /proc/stat | awk '{
        user=$2; nice=$3; system=$4; idle=$5;
        iowait=$6; irq=$7; softirq=$8; steal=$9;
        guest=$10; guest_nice=$11;
        
        # Total is sum of all fields (idle included)
        total = user + nice + system + idle + iowait + irq + softirq + steal
        
        # Active time (excluding idle and iowait)
        active = user + nice + system + irq + softirq
        
        # Idle time (true idle + iowait)
        idle_total = idle + iowait
        
        print user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice, total, active, idle_total
    }'
}

# Sample 1
S1=($(get_cpu_stats))
sleep 1

# Sample 2
S2=($(get_cpu_stats))

# Calculate deltas
for i in {0..12}; do
    DELTA[$i]=$((${S2[$i]} - ${S1[$i]}))
done

# Calculate percentages from the deltas (the tick rate cancels out)
TOTAL_DELTA=${DELTA[10]}

echo "CPU Statistics (1 second sample):"
echo "================================"
printf "User:      %6.2f%%\n" $(echo "scale=2; (${DELTA[0]} * 100) / $TOTAL_DELTA" | bc)
printf "Nice:      %6.2f%%\n" $(echo "scale=2; (${DELTA[1]} * 100) / $TOTAL_DELTA" | bc)
printf "System:    %6.2f%%\n" $(echo "scale=2; (${DELTA[2]} * 100) / $TOTAL_DELTA" | bc)
printf "Idle:      %6.2f%%\n" $(echo "scale=2; (${DELTA[3]} * 100) / $TOTAL_DELTA" | bc)
printf "IOWait:    %6.2f%%\n" $(echo "scale=2; (${DELTA[4]} * 100) / $TOTAL_DELTA" | bc)
printf "IRQ:       %6.2f%%\n" $(echo "scale=2; (${DELTA[5]} * 100) / $TOTAL_DELTA" | bc)
printf "SoftIRQ:   %6.2f%%\n" $(echo "scale=2; (${DELTA[6]} * 100) / $TOTAL_DELTA" | bc)
printf "Steal:     %6.2f%%\n" $(echo "scale=2; (${DELTA[7]} * 100) / $TOTAL_DELTA" | bc)
printf "Guest:     %6.2f%%\n" $(echo "scale=2; (${DELTA[8]} * 100) / $TOTAL_DELTA" | bc)
printf "GuestNice: %6.2f%%\n" $(echo "scale=2; (${DELTA[9]} * 100) / $TOTAL_DELTA" | bc)
echo "================================"
printf "Total CPU Usage: %6.2f%%\n" $(echo "scale=2; (${DELTA[11]} * 100) / $TOTAL_DELTA" | bc)

Per-CPU Analysis

Individual CPU cores can have vastly different statistics:

# Show per-CPU breakdown
for cpu in $(grep '^cpu[0-9]' /proc/stat | cut -d' ' -f1); do
    echo "=== $cpu ==="
    grep "^$cpu " /proc/stat | awk '{
        total = $2+$3+$4+$5+$6+$7+$8+$9
        printf "User: %.1f%%, System: %.1f%%, Idle: %.1f%%, IOWait: %.1f%%\n",
            ($2*100)/total, ($4*100)/total, ($5*100)/total, ($6*100)/total
    }'
done

Kernel Implementation Details

Tick Accounting

The kernel updates /proc/stat during periodic timer interrupts (tick handler):

// Simplified tick handler flow
void update_process_times(int user_tick) {
    struct task_struct *p = current;
    u64 cputime = cputime_one_jiffy;
    
    if (user_tick) {
        account_user_time(p, cputime);
    } else {
        account_system_time(p, cputime);
    }
    
    // Update per-CPU stats
    account_cputime_index(cpu, index, cputime);
}

Counter Precision

  • Jiffy resolution: Typically 1ms (HZ=1000) or 10ms (HZ=100)
  • Cumulative counters: Never reset, wrap around after ~497 days (32-bit) or effectively never (64-bit)
  • Atomic updates: Counters updated atomically to prevent race conditions
  • Per-CPU storage: Reduces cache line contention
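
The ~497-day figure follows directly from the arithmetic; a quick sanity check at a 100-tick-per-second rate:

```shell
# A 32-bit counter holds 2^32 ticks; at 100 ticks/second that is:
HZ=100
seconds=$(( (1 << 32) / HZ ))
days=$(( seconds / 86400 ))
echo "32-bit jiffy counter wraps after ${days} days"   # ~497 days
```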

Reading /proc/stat Safely

The kernel provides a consistent snapshot:

// Kernel side: seq_file interface ensures atomic reads
static int stat_show(struct seq_file *p, void *v) {
    // Lockless read of per-CPU counters
    for_each_possible_cpu(cpu) {
        sum_cpu_stats(cpu, &stats);
    }
    seq_printf(p, "cpu %llu %llu ...\n", stats.user, stats.nice, ...);
}

Advanced Use Cases

1. Real-time CPU Monitoring Script

#!/bin/bash
# Continuous CPU statistics monitor

HZ=$(getconf CLK_TCK)
INTERVAL=1

while true; do
    clear
    echo "CPU Statistics (refreshing every ${INTERVAL}s)"
    echo "=============================================="
    
    # Aggregate all CPUs
    S1=$(grep '^cpu ' /proc/stat)
    sleep $INTERVAL
    S2=$(grep '^cpu ' /proc/stat)
    
    # Parse and calculate
    read -r u1 n1 s1 i1 w1 x1 y1 z1 g1 gn1 <<< $(echo $S1 | awk '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11}')
    read -r u2 n2 s2 i2 w2 x2 y2 z2 g2 gn2 <<< $(echo $S2 | awk '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11}')
    
    total=$(( (u2+n2+s2+i2+w2+x2+y2+z2) - (u1+n1+s1+i1+w1+x1+y1+z1) ))
    
    if [ $total -gt 0 ]; then
        printf "User:    %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($u2-$u1)*100)/$total" | bc) $((u2-u1))
        printf "Nice:    %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($n2-$n1)*100)/$total" | bc) $((n2-n1))
        printf "System:  %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($s2-$s1)*100)/$total" | bc) $((s2-s1))
        printf "Idle:    %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($i2-$i1)*100)/$total" | bc) $((i2-i1))
        printf "IOWait:  %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($w2-$w1)*100)/$total" | bc) $((w2-w1))
        printf "IRQ:     %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($x2-$x1)*100)/$total" | bc) $((x2-x1))
        printf "SoftIRQ: %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($y2-$y1)*100)/$total" | bc) $((y2-y1))
        printf "Steal:   %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($z2-$z1)*100)/$total" | bc) $((z2-z1))
        printf "Guest:   %5.1f%%  [%8d jiffies]\n" $(echo "scale=1; (($g2-$g1)*100)/$total" | bc) $((g2-g1))
    fi
    
    echo "=============================================="
    echo "Press Ctrl+C to exit"
    sleep $((INTERVAL - 1))
done

2. Historical CPU Statistics Collection

#!/bin/bash
# Collect CPU stats over time for analysis

LOG_FILE="/tmp/cpu_stats.log"
INTERVAL=5
DURATION=3600  # 1 hour

echo "timestamp,user,nice,system,idle,iowait,irq,softirq,steal,guest,guest_nice" > $LOG_FILE

START=$(date +%s)
while [ $(($(date +%s) - START)) -lt $DURATION ]; do
    TIMESTAMP=$(date +%s)
    STATS=$(grep '^cpu ' /proc/stat | awk '{print $2","$3","$4","$5","$6","$7","$8","$9","$10","$11}')
    echo "$TIMESTAMP,$STATS" >> $LOG_FILE
    sleep $INTERVAL
done

# Analyze collected data
echo "Analysis complete. Log: $LOG_FILE"

3. Detect CPU Anomalies

#!/bin/bash
# Alert on unusual CPU statistics

THRESHOLD_IOWAIT=20  # Alert if iowait > 20%
THRESHOLD_STEAL=10   # Alert if steal > 10%
THRESHOLD_SOFTIRQ=15 # Alert if softirq > 15%

check_cpu_stats() {
    S1=$(grep '^cpu ' /proc/stat)
    sleep 1
    S2=$(grep '^cpu ' /proc/stat)
    
    read -r u1 n1 s1 i1 w1 x1 y1 z1 g1 gn1 <<< $(echo $S1 | awk '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11}')
    read -r u2 n2 s2 i2 w2 x2 y2 z2 g2 gn2 <<< $(echo $S2 | awk '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11}')
    
    total=$(( (u2+n2+s2+i2+w2+x2+y2+z2) - (u1+n1+s1+i1+w1+x1+y1+z1) ))
    
    if [ $total -gt 0 ]; then
        iowait_pct=$(echo "scale=1; (($w2-$w1)*100)/$total" | bc)
        steal_pct=$(echo "scale=1; (($z2-$z1)*100)/$total" | bc)
        softirq_pct=$(echo "scale=1; (($y2-$y1)*100)/$total" | bc)
        
        if (( $(echo "$iowait_pct > $THRESHOLD_IOWAIT" | bc -l) )); then
            echo "WARNING: High IOWait detected: ${iowait_pct}%"
        fi
        
        if (( $(echo "$steal_pct > $THRESHOLD_STEAL" | bc -l) )); then
            echo "WARNING: High steal time detected: ${steal_pct}% (possible host overcommit)"
        fi
        
        if (( $(echo "$softirq_pct > $THRESHOLD_SOFTIRQ" | bc -l) )); then
            echo "WARNING: High softirq time detected: ${softirq_pct}% (possible interrupt storm)"
        fi
    fi
}

check_cpu_stats

Relationship to Other Kernel Interfaces

/proc/loadavg

Load average is related but distinct:

  • Load average = runnable tasks + uninterruptible sleep (I/O wait)
  • CPU statistics = actual CPU time spent in each state
  • High load + low CPU usage = I/O bottleneck
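
A quick sketch combining the two signals (this uses since-boot averages, so it is only a coarse first look; interpretation thresholds are up to the operator):

```shell
# Compare 1-minute load against CPU count and non-idle share since boot
load1=$(awk '{print $1}' /proc/loadavg)
ncpu=$(getconf _NPROCESSORS_ONLN)

read -r _ u n s idle iow rest < /proc/stat
busy_pct=$(awk -v u="$u" -v n="$n" -v s="$s" -v i="$idle" -v w="$iow" \
    'BEGIN { t = u+n+s+i+w; if (t > 0) printf "%.0f", (u+n+s)*100/t }')

echo "load1=${load1} cpus=${ncpu} busy=${busy_pct}% (since boot)"
# High load with low busy% points at I/O wait, not CPU saturation
```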

/proc/uptime

System uptime used to calculate rates:

uptime_seconds=$(awk '{print $1}' /proc/uptime)
# The aggregate cpu line sums all CPUs, so total CPU time ≈ uptime × CPU count
cpu_jiffies=$(grep '^cpu ' /proc/stat | awk '{sum=0; for(i=2;i<=9;i++) sum+=$i; print sum}')
HZ=$(getconf CLK_TCK)
cpu_time=$((cpu_jiffies / HZ))
echo "CPU time accumulated: ${cpu_time}s over ${uptime_seconds}s uptime"

top/htop Internals

Tools like top read /proc/stat and /proc/[pid]/stat to compute CPU percentages:

// Simplified top calculation
cpu_percent = (delta_user + delta_system) / delta_total * 100
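
The same delta technique works per process: utime and stime (fields 14 and 15 of /proc/[pid]/stat) are the process’s own tick counters. A minimal sketch measuring the current shell, chosen only as a convenient target:

```shell
# Per-process CPU% = delta(utime+stime) / delta(total cpu jiffies) * 100
pid=$$

proc_jiffies() {
    # utime and stime are fields 14 and 15 of the full stat line.
    # The comm field (2) may contain spaces, so strip through the
    # closing ')' first; they then become fields 12 and 13.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{print $12 + $13}'
}

total_jiffies() {
    # Fields 2-9 of the aggregate cpu line (guest already in user)
    awk '/^cpu / { t=0; for (i=2; i<=9; i++) t+=$i; print t }' /proc/stat
}

p1=$(proc_jiffies "$pid"); t1=$(total_jiffies)
sleep 1
p2=$(proc_jiffies "$pid"); t2=$(total_jiffies)

awk -v dp=$((p2-p1)) -v dt=$((t2-t1)) \
    'BEGIN { if (dt > 0) printf "pid cpu: %.1f%%\n", dp*100/dt }'
```

This is essentially what top does for its per-process %CPU column, sampled at its refresh interval.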

Troubleshooting Scenarios

Scenario 1: High System Time

Symptoms: system time > 30%, slow application response

Investigation:

# Check syscall rate
strace -c -p $(pgrep -f <application>) 2>&1 | head -20

# Check context switch rate
vmstat 1 5

# Check kernel function profiling (requires perf)
perf top -g -p $(pgrep -f <application>)

Common causes:

  • Frequent page faults
  • Excessive system calls
  • Kernel lock contention
  • Driver issues

Scenario 2: High IOWait

Symptoms: iowait > 20%, system feels sluggish

Investigation:

# Check I/O wait queue depth
iostat -x 1 5

# Check block device statistics
cat /proc/diskstats

# Identify I/O-intensive processes
iotop -o

# Check filesystem I/O
df -h
mount | grep -E '(noatime|nodiratime)'

Common causes:

  • Slow storage (HDD vs SSD)
  • Saturated I/O bandwidth
  • Swapping (check /proc/meminfo)
  • Network filesystem latency
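
To rule swapping in or out specifically, the pswpin/pswpout counters in /proc/vmstat (cumulative, in pages) can be sampled with the same delta approach:

```shell
# Non-zero deltas mean the system is actively swapping,
# which surfaces as iowait in /proc/stat
swap_counters() {
    awk '/^pswpin |^pswpout / { print $2 }' /proc/vmstat
}

read -r in1 out1 <<< "$(swap_counters | tr '\n' ' ')"
sleep 2
read -r in2 out2 <<< "$(swap_counters | tr '\n' ' ')"

echo "swapped in:  $((in2 - in1)) pages"
echo "swapped out: $((out2 - out1)) pages"
```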

Scenario 3: High SoftIRQ Time

Symptoms: softirq > 15%, network performance issues

Investigation:

# Break down softirq types
watch -n 1 'cat /proc/softirqs'

# Check network receive drops
cat /proc/net/softnet_stat

# Monitor network interrupts
cat /proc/interrupts | grep -i eth

# Check for receive livelock
ethtool -S <interface> | grep -i drop

Common causes:

  • High network packet rate
  • Small receive buffers
  • Interrupt coalescing misconfiguration
  • Network driver issues

Scenario 4: High Steal Time in VM

Symptoms: steal > 10%, VM performance degradation

Investigation:

# Confirm virtualization
systemd-detect-virt

# Check CPU pinning
virsh vcpuinfo <domain>

# Monitor from host perspective
virsh domstats <domain> --cpu

# Check for CPU overcommit
# (On host) virsh nodeinfo

Solutions:

  • Migrate VM to less loaded host
  • Increase host CPU capacity
  • Use CPU pinning/affinity
  • Enable CPU features (virtio, paravirtualization)

References and Further Reading

  • Kernel Documentation: Documentation/filesystems/proc.rst (CPU statistics section)
  • Kernel Source: kernel/sched/cputime.c (CPU time accounting implementation)
  • man proc(5): /proc/stat format documentation
  • Linux Performance and Tuning Guide: CPU accounting and analysis
  • Understanding the Linux Kernel (3rd ed.): Chapter 4 (Interrupts and Exceptions), Chapter 7 (Kernel Synchronization)

Note: All statistics are cumulative since boot. To calculate rates or percentages, always sample at two points in time and compute deltas. Values are reported in USER_HZ clock ticks, queryable via getconf CLK_TCK (typically 100, independent of the kernel’s internal HZ).