Performance Engineering

Resources

General

Specific

Algorithms

Cache

Memory

Performance Testing

Load Testing

Tools

ab (ApacheBench)

Bombardier

  • bombardier: a fast cross-platform HTTP benchmarking tool written in Go (see the illustrative sketch below)
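
For context, here is a minimal Go sketch of the kind of concurrent request loop a bombardier-style benchmarker runs. It is illustrative only, not bombardier's actual implementation; the target URL, connection count, and request count are placeholder values.

  // Illustrative sketch of a concurrent HTTP benchmarking loop (not bombardier itself).
  package main

  import (
  	"fmt"
  	"net/http"
  	"sync"
  	"sync/atomic"
  	"time"
  )

  func main() {
  	const (
  		target      = "http://localhost:8080/" // placeholder target URL
  		connections = 25                       // concurrent workers
  		requests    = 1000                     // total requests to attempt
  	)

  	var completed int64
  	var wg sync.WaitGroup
  	start := time.Now()

  	for c := 0; c < connections; c++ {
  		wg.Add(1)
  		go func() {
  			defer wg.Done()
  			client := &http.Client{Timeout: 5 * time.Second}
  			for i := 0; i < requests/connections; i++ {
  				resp, err := client.Get(target)
  				if err != nil {
  					continue // count only successful requests
  				}
  				resp.Body.Close()
  				atomic.AddInt64(&completed, 1)
  			}
  		}()
  	}
  	wg.Wait()

  	elapsed := time.Since(start).Seconds()
  	fmt.Printf("%d requests in %.2fs (%.1f req/s)\n", completed, elapsed, float64(completed)/elapsed)
  }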

Koi Pond

Gatling

Megaload

Tsung

Siege

Denis Bakhvalov


Tools

perf

oprofile

tracy

valgrind

kcachegrind

flamegraph

fgtrace

C

Java

Go

Books

The Art of Computer Systems Performance Analysis (Raj Jain)

The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling

System Performance Tuning (Mike Loukides)

Computer Systems Performance Evaluation and Prediction (Paul Fortier, Howard Michel)

Systems Performance (Brendan Gregg)

  • 1 Introduction
    • 1.1 Systems Performance: systems performance is the study of the entire system, including all physical components and the full software stack.
    • 1.2 Roles
    • 1.3 Activities
    • 1.4 Perspectives
    • 1.5 Performance Is Challenging
    • 1.5.1 Performance Is Subjective
    • 1.5.2 Systems Are Complex
    • 1.5.3 There Can Be Multiple Performance Issues
    • 1.6 Latency
    • 1.7 Dynamic Tracing
    • 1.8 Cloud Computing
    • 1.9 Case Studies
    • 1.9.1 Slow Disks
    • 1.9.2 Software Change
    • 1.9.3 More Reading
  • 2 Methodology
    • 2.1 Terminology
    • 2.2 Models
    • 2.2.1 System under Test
    • 2.2.2 Queueing System
    • 2.3 Concepts
    • 2.3.1 Latency
    • 2.3.2 Time Scales
    • 2.3.3 Trade-offs
    • 2.3.4 Tuning Efforts
    • 2.3.5 Level of Appropriateness
    • 2.3.6 Point-in-Time Recommendations
    • 2.3.7 Load versus Architecture
    • 2.3.8 Scalability
    • 2.3.9 Known-Unknowns
    • 2.3.10 Metrics
    • 2.3.11 Utilization
    • 2.3.12 Saturation
    • 2.3.13 Profiling
    • 2.3.14 Caching
    • 2.4 Perspectives
    • 2.4.1 Resource Analysis
    • 2.4.2 Workload Analysis
    • 2.5 Methodology
    • 2.5.1 Streetlight Anti-Method
    • 2.5.2 Random Change Anti-Method
    • 2.5.3 Blame-Someone-Else Anti-Method
    • 2.5.4 Ad Hoc Checklist Method
    • 2.5.5 Problem Statement
    • 2.5.6 Scientific Method
    • 2.5.7 Diagnosis Cycle
    • 2.5.8 Tools Method
    • 2.5.9 The USE Method
    • 2.5.10 Workload Characterization
    • 2.5.11 Drill-Down Analysis
    • 2.5.12 Latency Analysis
    • 2.5.13 Method R
    • 2.5.14 Event Tracing
    • 2.5.15 Baseline Statistics
    • 2.5.16 Static Performance Tuning
    • 2.5.17 Cache Tuning
    • 2.5.18 Micro-Benchmarking
    • 2.6 Modeling
    • 2.6.1 Enterprise versus Cloud
    • 2.6.2 Visual Identification
    • 2.6.3 Amdahl’s Law of Scalability
    • 2.6.4 Universal Scalability Law
    • 2.6.5 Queueing Theory
    • 2.7 Capacity Planning
    • 2.7.1 Resource Limits
    • 2.7.2 Factor Analysis
    • 2.7.3 Scaling Solutions
    • 2.8 Statistics
    • 2.8.1 Quantifying Performance
    • 2.8.2 Averages
    • 2.8.3 Standard Deviations, Percentiles, Median
    • 2.8.4 Coefficient of Variation
    • 2.8.5 Multimodal Distributions
    • 2.8.6 Outliers
    • 2.9 Monitoring
    • 2.9.1 Time-Based Patterns
    • 2.9.2 Monitoring Products
    • 2.9.3 Summary-since-Boot
    • 2.10 Visualizations
    • 2.10.1 Line Chart
    • 2.10.2 Scatter Plots
    • 2.10.3 Heat Maps
    • 2.10.4 Surface Plot
    • 2.10.5 Visualization Tools
    • 2.11 Exercises
    • 2.12 References
  • 3 Operating Systems
    • 3.1 Terminology
    • 3.2 Background
    • 3.2.1 Kernel
    • 3.2.2 Stacks
    • 3.2.3 Interrupts and Interrupt Threads
    • 3.2.4 Interrupt Priority Level
    • 3.2.5 Processes
    • 3.2.6 System Calls
    • 3.2.7 Virtual Memory
    • 3.2.8 Memory Management
    • 3.2.9 Schedulers
    • 3.2.10 File Systems
    • 3.2.11 Caching
    • 3.2.12 Networking
    • 3.2.13 Device Drivers
    • 3.2.14 Multiprocessor
    • 3.2.15 Preemption
    • 3.2.16 Resource Management
    • 3.2.17 Observability
    • 3.3 Kernels
    • 3.3.1 Unix
    • 3.3.2 Solaris-Based
    • 3.3.3 Linux-Based
    • 3.3.4 Differences
    • 3.4 Exercises
    • 3.5 References
  • 4 Observability Tools
    • 4.1 Tool Types
    • 4.1.1 Counters
    • 4.1.2 Tracing
    • 4.1.3 Profiling
    • 4.1.4 Monitoring (sar)
    • 4.2 Observability Sources
    • 4.2.1 /proc
    • 4.2.2 /sys
    • 4.2.3 kstat
    • 4.2.4 Delay Accounting
    • 4.2.5 Microstate Accounting
    • 4.2.6 Other Observability Sources
    • 4.3 DTrace
    • 4.3.1 Static and Dynamic Tracing
    • 4.3.2 Probes
    • 4.3.3 Providers
    • 4.3.4 Arguments
    • 4.3.5 D Language
    • 4.3.6 Built-in Variables
    • 4.3.7 Actions
    • 4.3.8 Variable Types
    • 4.3.9 One-Liners
    • 4.3.10 Scripting
    • 4.3.11 Overheads
    • 4.3.12 Documentation and Resources
    • 4.4 SystemTap
    • 4.4.1 Probes
    • 4.4.2 Tapsets
    • 4.4.3 Actions and Built-ins
    • 4.4.4 Examples
    • 4.4.5 Overheads
    • 4.4.6 Documentation and Resources
    • 4.5 perf
    • 4.6 Observing Observability
    • 4.7 Exercises
    • 4.8 References
  • 5 Applications
    • 5.1 Application Basics
    • 5.1.1 Objectives
    • 5.1.2 Optimize the Common Case
    • 5.1.3 Observability
    • 5.1.4 Big O Notation
    • 5.2 Application Performance Techniques
    • 5.2.1 Selecting an I/O Size
    • 5.2.2 Caching
    • 5.2.3 Buffering
    • 5.2.4 Polling
    • 5.2.5 Concurrency and Parallelism
    • 5.2.6 Non-Blocking I/O
    • 5.2.7 Processor Binding
    • 5.3 Programming Languages
    • 5.3.1 Compiled Languages
    • 5.3.2 Interpreted Languages
    • 5.3.3 Virtual Machines
    • 5.3.4 Garbage Collection
    • 5.4 Methodology and Analysis
    • 5.4.1 Thread State Analysis
    • 5.4.2 CPU Profiling
    • 5.4.3 Syscall Analysis
    • 5.4.4 I/O Profiling
    • 5.4.5 Workload Characterization
    • 5.4.6 USE Method
    • 5.4.7 Drill-Down Analysis
    • 5.4.8 Lock Analysis
    • 5.4.9 Static Performance Tuning
    • 5.5 Exercises
    • 5.6 References
  • 6 CPUs
    • 6.1 Terminology
    • 6.2 Models
    • 6.2.1 CPU Architecture
    • 6.2.2 CPU Memory Caches
    • 6.2.3 CPU Run Queues
    • 6.3 Concepts
    • 6.3.1 Clock Rate
    • 6.3.2 Instruction
    • 6.3.3 Instruction Pipeline
    • 6.3.4 Instruction Width
    • 6.3.5 CPI, IPC
    • 6.3.6 Utilization
    • 6.3.7 User-Time/Kernel-Time
    • 6.3.8 Saturation
    • 6.3.9 Preemption
    • 6.3.10 Priority Inversion
    • 6.3.11 Multiprocess, Multithreading
    • 6.3.12 Word Size
    • 6.3.13 Compiler Optimization
    • 6.4 Architecture
    • 6.4.1 Hardware
    • 6.4.2 Software
    • 6.5 Methodology
    • 6.5.1 Tools Method
    • 6.5.2 USE Method
    • 6.5.3 Workload Characterization
    • 6.5.4 Profiling
    • 6.5.5 Cycle Analysis
    • 6.5.6 Performance Monitoring
    • 6.5.7 Static Performance Tuning
    • 6.5.8 Priority Tuning
    • 6.5.9 Resource Controls
    • 6.5.10 CPU Binding
    • 6.5.11 Micro-Benchmarking
    • 6.5.12 Scaling
    • 6.6 Analysis
    • 6.6.1 uptime
    • 6.6.2 vmstat
    • 6.6.3 mpstat
    • 6.6.4 sar
    • 6.6.5 ps
    • 6.6.6 top
    • 6.6.7 prstat
    • 6.6.8 pidstat
    • 6.6.9 time, ptime
    • 6.6.10 DTrace
    • 6.6.11 SystemTap
    • 6.6.12 perf
    • 6.6.13 cpustat
    • 6.6.14 Other Tools
    • 6.6.15 Visualizations
    • 6.7 Experimentation
    • 6.7.1 Ad Hoc
    • 6.7.2 SysBench
    • 6.8 Tuning
    • 6.8.1 Compiler Options
    • 6.8.2 Scheduling Priority and Class
    • 6.8.3 Scheduler Options
    • 6.8.4 Process Binding
    • 6.8.5 Exclusive CPU Sets
    • 6.8.6 Resource Controls
    • 6.8.7 Processor Options (BIOS Tuning)
    • 6.9 Exercises
    • 6.10 References
  • 7 Memory
    • 7.1 Terminology
    • 7.2 Concepts
    • 7.2.1 Virtual Memory
    • 7.2.2 Paging
    • 7.2.3 Demand Paging
    • 7.2.4 Overcommit
    • 7.2.5 Swapping
    • 7.2.6 File System Cache Usage
    • 7.2.7 Utilization and Saturation
    • 7.2.8 Allocators
    • 7.2.9 Word Size
    • 7.3 Architecture
    • 7.3.1 Hardware
    • 7.3.2 Software
    • 7.3.3 Process Address Space
    • 7.4 Methodology
    • 7.4.1 Tools Method
    • 7.4.2 USE Method
    • 7.4.3 Characterizing Usage
    • 7.4.4 Cycle Analysis
    • 7.4.5 Performance Monitoring
    • 7.4.6 Leak Detection
    • 7.4.7 Static Performance Tuning
    • 7.4.8 Resource Controls
    • 7.4.9 Micro-Benchmarking
    • 7.5 Analysis
    • 7.5.1 vmstat
    • 7.5.2 sar
    • 7.5.3 slabtop
    • 7.5.4 ::kmastat
    • 7.5.5 ps
    • 7.5.6 top
    • 7.5.7 prstat
    • 7.5.8 pmap
    • 7.5.9 DTrace
    • 7.5.10 SystemTap
    • 7.5.11 Other Tools
    • 7.6 Tuning
    • 7.6.1 Tunable Parameters
    • 7.6.2 Multiple Page Sizes
    • 7.6.3 Allocators
    • 7.6.4 Resource Controls
    • 7.7 Exercises
    • 7.8 References
  • 8 File Systems
    • 8.1 Terminology
    • 8.2 Models
    • 8.2.1 File System Interfaces
    • 8.2.2 File System Cache
    • 8.2.3 Second-Level Cache
    • 8.3 Concepts
    • 8.3.1 File System Latency
    • 8.3.2 Caching
    • 8.3.3 Random versus Sequential I/O
    • 8.3.4 Prefetch
    • 8.3.5 Read-Ahead
    • 8.3.6 Write-Back Caching
    • 8.3.7 Synchronous Writes
    • 8.3.8 Raw and Direct I/O
    • 8.3.9 Non-Blocking I/O
    • 8.3.10 Memory-Mapped Files
    • 8.3.11 Metadata
    • 8.3.12 Logical versus Physical I/O
    • 8.3.13 Operations Are Not Equal
    • 8.3.14 Special File Systems
    • 8.3.15 Access Timestamps
    • 8.3.16 Capacity
    • 8.4 Architecture
    • 8.4.1 File System I/O Stack
    • 8.4.2 VFS
    • 8.4.3 File System Caches
    • 8.4.4 File System Features
    • 8.4.5 File System Types
    • 8.4.6 Volumes and Pools
    • 8.5 Methodology
    • 8.5.1 Disk Analysis
    • 8.5.2 Latency Analysis
    • 8.5.3 Workload Characterization
    • 8.5.4 Performance Monitoring
    • 8.5.5 Event Tracing
    • 8.5.6 Static Performance Tuning
    • 8.5.7 Cache Tuning
    • 8.5.8 Workload Separation
    • 8.5.9 Memory-Based File Systems
    • 8.5.10 Micro-Benchmarking
    • 8.6 Analysis
    • 8.6.1 vfsstat
    • 8.6.2 fsstat
    • 8.6.3 strace, truss
    • 8.6.4 DTrace
    • 8.6.5 SystemTap
    • 8.6.6 LatencyTOP
    • 8.6.7 free
    • 8.6.8 top
    • 8.6.9 vmstat
    • 8.6.10 sar
    • 8.6.11 slabtop
    • 8.6.12 mdb ::kmastat
    • 8.6.13 fcachestat
    • 8.6.14 /proc/meminfo
    • 8.6.15 mdb ::memstat
    • 8.6.16 kstat
    • 8.6.17 Other Tools
    • 8.6.18 Visualizations
    • 8.7 Experimentation
    • 8.7.1 Ad Hoc
    • 8.7.2 Micro-Benchmark Tools
    • 8.7.3 Cache Flushing
    • 8.8 Tuning
    • 8.8.1 Application Calls
    • 8.8.2 ext3
    • 8.8.3 ZFS
    • 8.9 Exercises
    • 8.10 References
  • 9 Disks
    • 9.1 Terminology
    • 9.2 Models
    • 9.2.1 Simple Disk
    • 9.2.2 Caching Disk
    • 9.2.3 Controller
    • 9.3 Concepts
    • 9.3.1 Measuring Time
    • 9.3.2 Time Scales
    • 9.3.3 Caching
    • 9.3.4 Random versus Sequential I/O
    • 9.3.5 Read/Write Ratio
    • 9.3.6 I/O Size
    • 9.3.7 IOPS Are Not Equal
    • 9.3.8 Non-Data-Transfer Disk Commands
    • 9.3.9 Utilization
    • 9.3.10 Saturation
    • 9.3.11 I/O Wait
    • 9.3.12 Synchronous versus Asynchronous
    • 9.3.13 Disk versus Application I/O
    • 9.4 Architecture
    • 9.4.1 Disk Types
    • 9.4.2 Interfaces
    • 9.4.3 Storage Types
    • 9.4.4 Operating System Disk I/O Stack
    • 9.5 Methodology
    • 9.5.1 Tools Method
    • 9.5.2 USE Method
    • 9.5.3 Performance Monitoring
    • 9.5.4 Workload Characterization
    • 9.5.5 Latency Analysis
    • 9.5.6 Event Tracing
    • 9.5.7 Static Performance Tuning
    • 9.5.8 Cache Tuning
    • 9.5.9 Resource Controls
    • 9.5.10 Micro-Benchmarking
    • 9.5.11 Scaling
    • 9.6 Analysis
    • 9.6.1 iostat
    • 9.6.2 sar
    • 9.6.3 pidstat
    • 9.6.4 DTrace
    • 9.6.5 SystemTap
    • 9.6.6 perf
    • 9.6.7 iotop
    • 9.6.8 iosnoop
    • 9.6.9 blktrace
    • 9.6.10 MegaCli
    • 9.6.11 smartctl
    • 9.6.12 Visualizations
    • 9.7 Experimentation
    • 9.7.1 Ad Hoc
    • 9.7.2 Custom Load Generators
    • 9.7.3 Micro-Benchmark Tools
    • 9.7.4 Random Read Example
    • 9.8 Tuning
    • 9.8.1 Operating System Tunables
    • 9.8.2 Disk Device Tunables
    • 9.8.3 Disk Controller Tunables
    • 9.9 Exercises
    • 9.10 References
  • 10 Network
    • 10.1 Terminology
    • 10.2 Models
    • 10.2.1 Network Interface
    • 10.2.2 Controller
    • 10.2.3 Protocol Stack
    • 10.3 Concepts
    • 10.3.1 Networks and Routing
    • 10.3.2 Protocols
    • 10.3.3 Encapsulation
    • 10.3.4 Packet Size
    • 10.3.5 Latency
    • 10.3.6 Buffering
    • 10.3.7 Connection Backlog
    • 10.3.8 Interface Negotiation
    • 10.3.9 Utilization
    • 10.3.10 Local Connections
    • 10.4 Architecture
    • 10.4.1 Protocols
    • 10.4.2 Hardware
    • 10.4.3 Software
    • 10.5 Methodology
    • 10.5.1 Tools Method
    • 10.5.2 USE Method
    • 10.5.3 Workload Characterization
    • 10.5.4 Latency Analysis
    • 10.5.5 Performance Monitoring
    • 10.5.6 Packet Sniffing
    • 10.5.7 TCP Analysis
    • 10.5.8 Drill-Down Analysis
    • 10.5.9 Static Performance Tuning
    • 10.5.10 Resource Controls
    • 10.5.11 Micro-Benchmarking
    • 10.6 Analysis
    • 10.6.1 netstat
    • 10.6.2 sar
    • 10.6.3 ifconfig
    • 10.6.4 ip
    • 10.6.5 nicstat
    • 10.6.6 dladm
    • 10.6.7 ping
    • 10.6.8 traceroute
    • 10.6.9 pathchar
    • 10.6.10 tcpdump
    • 10.6.11 snoop
    • 10.6.12 Wireshark
    • 10.6.13 DTrace
    • 10.6.14 SystemTap
    • 10.6.15 perf
    • 10.6.16 Other Tools
    • 10.7 Experimentation
    • 10.7.1 iperf
    • 10.8 Tuning
    • 10.8.1 Linux
    • 10.8.2 Solaris
    • 10.8.3 Configuration
    • 10.9 Exercises
    • 10.10 References
  • 11 Cloud Computing
    • 11.1 Background
    • 11.1.1 Price/Performance Ratio
    • 11.1.2 Scalable Architecture
    • 11.1.3 Capacity Planning
    • 11.1.4 Storage
    • 11.1.5 Multitenancy
    • 11.2 OS Virtualization
    • 11.2.1 Overhead
    • 11.2.2 Resource Controls
    • 11.2.3 Observability
    • 11.3 Hardware Virtualization
    • 11.3.1 Overhead
    • 11.3.2 Resource Controls
    • 11.3.3 Observability
    • 11.4 Comparisons
    • 11.5 Exercises
    • 11.6 References
  • 12 Benchmarking
    • 12.1 Background
    • 12.1.1 Activities
    • 12.1.2 Effective Benchmarking
    • 12.1.3 Benchmarking Sins
    • 12.2 Benchmarking Types
    • 12.2.1 Micro-Benchmarking
    • 12.2.2 Simulation
    • 12.2.3 Replay
    • 12.2.4 Industry Standards
    • 12.3 Methodology
    • 12.3.1 Passive Benchmarking
    • 12.3.2 Active Benchmarking
    • 12.3.3 CPU Profiling
    • 12.3.4 USE Method
    • 12.3.5 Workload Characterization
    • 12.3.6 Custom Benchmarks
    • 12.3.7 Ramping Load
    • 12.3.8 Sanity Check
    • 12.3.9 Statistical Analysis
    • 12.4 Benchmark Questions
    • 12.5 Exercises
    • 12.6 References
  • 13 Case Study
    • 13.1 Case Study: The Red Whale
    • 13.1.1 Problem Statement
    • 13.1.2 Support
    • 13.1.3 Getting Started
    • 13.1.4 Choose Your Own Adventure
    • 13.1.5 The USE Method
    • 13.1.6 Are We Done?
    • 13.1.7 Take 2
    • 13.1.8 The Basics
    • 13.1.9 Ignoring the Red Whale
    • 13.1.10 Interrogating the Kernel

BPF Performance Tools (Brendan Gregg)

DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD (Brendan Gregg, Jim Mauro)

The Art of Performance Engineering (Vashistha)

  • 1 Getting Started
  • 2 Infrastructure Design
  • 3 Client Side Optimization
  • 4 Web Server Optimization
  • 5 Application Server Optimization
  • 6 Database Server Optimization
  • 7 JVM Tuning
  • 8 Performance Monitoring
  • 9 Performance Counters

Performance Analysis and Tuning on Modern CPUs (Denis Bakhvalov)

  • 1 Introduction
  • Part 1 Performance Analysis on a Modern CPU
  • 2 Measuring Performance
  • 3 CPU Microarchitecture
  • 4 Terminology and Metrics in Performance Analysis
  • 5 Performance Analysis Approaches
  • 6 CPU Features for Performance Analysis
  • Part 2 Source Code Tuning for CPU
  • 7 CPU Front-End Optimizations
  • 8 CPU Back-End Optimizations
  • 9 Optimizing Bad Speculation
  • 10 Other Tuning Areas
  • 11 Optimizing Multithreaded Applications