Performance Engineering
Table of Contents
- Resources
- General
- Specific
- Algorithms
- Cache
- Memory
- Performance Testing
- Load Testing
- Denis Bakhvalov
- Tools
- C
- Java
- Go
- Books
- The Art of Computer Systems Performance Analysis (Raj Jain)
- System Performance Tuning (Mike Loukides)
- Computer Systems Performance Evaluation and Prediction (Fortier, Michel)
- Systems Performance (Brendan Gregg)
- BPF Performance Tools (Brendan Gregg)
- DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD (Brendan Gregg)
- The Art of Performance Engineering (Vashistha)
- Performance Analysis and Tuning on Modern CPUs (Denis Bakhvalov)
Resources
- Agner's Software Optimization Resources
- Performance Matters Blog
- Gentoo Wiki: GCC Optimization
- The takeaways of GCC optimization
- An Overview of Software Performance Analysis Tools and Techniques: From GProf to DTrace
- Brendan Gregg
- perf Examples
- Linux Performance
- Systems Performance Book
- The USE Method
- 2017/10/28 System Performance Analysis Methodologies
- 2018/06/30 Evaluating the Evaluation: A Benchmarking Checklist
- 2020/03/08 LISA2019 Linux Systems Performance
- 2022/03/19 Why Don't You Use ... HN
- 2022/04/15 Netflix End of Series 1 HN
General
- Fundamentals of Performance Profiling
- 2012/07 Things to Optimize Besides Speed and Memory
- 2013/02 Efficiency is fundamentally at odds with elegance
- 2018/05 Performance is a shape, not a number
- 2020/01 Some Useful Probability Facts for Systems Programming
- 2020/02 Thoughts on performance & optimization
- 2020/03 Low latency tuning guide
- 2020/12 Performance engineering requires stable benchmarks
- 2022/06 What Metric to Use When Benchmarking?
Specific
- 2013 Sep Coding for Performance: Data alignment and structures
- 2016 Nov A bug story: data alignment on x86
- 2020 Mar JITs are un-ergonomic
- 2020 May Performance mystery: delete unused string constant
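The alignment articles above reduce to one mechanical rule: each field is placed at an offset that is a multiple of its alignment, and compilers for C-family languages and Go do not reorder fields, so declaration order determines padding. A small Go illustration of this (a sketch; the sizes shown assume a typical 64-bit platform):

```go
package main

import (
	"fmt"
	"unsafe"
)

// padded: the first bool is followed by 7 bytes of padding so the int64
// lands on an 8-byte boundary, and the trailing bool adds 7 more bytes
// of padding to keep the struct size a multiple of its alignment.
type padded struct {
	a bool
	b int64
	c bool
}

// packed: same fields, widest first, leaving only 6 bytes of trailing padding.
type packed struct {
	b int64
	a bool
	c bool
}

func main() {
	fmt.Println(unsafe.Sizeof(padded{})) // 24 on 64-bit platforms
	fmt.Println(unsafe.Sizeof(packed{})) // 16 on 64-bit platforms
}
```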
Algorithms
Cache
- 2014 Mar Turkey Dinners and Cache Conscious Programming
- 2019 Jun Avoiding instruction cache misses
- 2020 Jan CPU caches and data locality: a small demonstration
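The demonstrations linked above all hinge on the same effect: sequential access uses every byte of each cache line and keeps the hardware prefetcher busy, while large-stride access wastes most of each line it pulls in. A toy Go version (a sketch; absolute times and the ratio between them vary by machine):

```go
package main

import (
	"fmt"
	"time"
)

const n = 2048 // 2048x2048 int64 matrix, ~32 MB

func main() {
	m := make([][]int64, n)
	for i := range m {
		m[i] = make([]int64, n)
	}

	var sum int64

	// Row-major: walks memory sequentially, so each 64-byte cache line
	// is fully consumed and the prefetcher can stay ahead of the loop.
	start := time.Now()
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += m[i][j]
		}
	}
	fmt.Println("row-major:   ", time.Since(start), sum)

	// Column-major: each access jumps to a different row slice, touching
	// a new cache line on nearly every read.
	start = time.Now()
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum += m[i][j]
		}
	}
	fmt.Println("column-major:", time.Since(start), sum)
}
```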
Memory
Performance Testing
- 2017/08/01 Reflecting on performance testing
Load Testing
- Types (a sketch of these load shapes follows the list):
- Soak testing: a steady load held for a long time
- Spike testing: a sudden load spike over a short time
- Scale testing: determines the system configuration required to support the anticipated load
- Stress testing: gradually increases load to unrealistic levels to expose defects that appear only under extreme load
- Load testing: measures responsiveness and stability under a particular load
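The types differ mainly in the shape of the offered load over time. A minimal Go sketch driving two of these shapes (the target URL and rates are hypothetical; a real harness such as JMeter, Gatling, or bombardier would also record latencies and error rates):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// phase holds a constant request rate for a duration.
type phase struct {
	name     string
	rps      int // requests per second
	duration time.Duration
}

// run fires GET requests at each phase's rate against url.
func run(url string, phases []phase) {
	client := &http.Client{Timeout: 5 * time.Second}
	for _, p := range phases {
		fmt.Printf("phase %q: %d rps for %s\n", p.name, p.rps, p.duration)
		ticker := time.NewTicker(time.Second / time.Duration(p.rps))
		deadline := time.Now().Add(p.duration)
		for time.Now().Before(deadline) {
			<-ticker.C
			go func() {
				if resp, err := client.Get(url); err == nil {
					resp.Body.Close() // discarded; a real harness records status and latency
				}
			}()
		}
		ticker.Stop()
	}
}

func main() {
	url := "http://localhost:8080/" // hypothetical system under test
	// Soak: modest, steady load for a long time (shortened here).
	run(url, []phase{{"soak", 50, 30 * time.Second}})
	// Spike: a brief burst well above the steady rate, then recovery.
	run(url, []phase{
		{"baseline", 50, 5 * time.Second},
		{"spike", 500, 2 * time.Second},
		{"recover", 50, 5 * time.Second},
	})
}
```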
- Goals:
- Determine scenarios that exercise the most functionality
- Define system parameters to measure
- Configuration
- Deployment
- Load balancing
- Monitors and alerts
- Dashboards
- Load generation
- Protocols & interacting with the SUT (system under test)
- Resources
- 12 Best JMeter Alternatives for Load Testing in 2022
- 9 BEST Performance Testing Tools (Load Testing Tool) in 2022
- 15 BEST Performance Testing Tools (Load Testing Tools) in 2022
- Top Load Testing Tools: 50 Useful Tools for Load Testing Websites, Apps, and More
- 2019/03/13 Sourceforge: Load Testing Tools
- 2020/05/07 What are the best Performance Testing Tools?
- Let’s Have Some Fun With Load Tests
Tools
AB (ApacheBench)
Bombardier
- bombardier: fast cross-platform HTTP benchmarking tool written in Go
Koi Pond
Gatling
Megaload
Tsung
Siege
Denis Bakhvalov
- EasyPerf.net
- Performance Analysis and Tuning on Modern CPUs
- CPU performance monitoring features
- Top-Down performance analysis methodology
- Intel Last Branch Record (LBR) feature
- Intel Processor Trace
- Performance analysis and tuning contests
- Performance monitoring counters and profiling
- Performance analysis of multithreaded applications
- How to conduct fair performance experiments: How to get consistent results when benchmarking on Linux?
- Benchmarking: How to process measurements and compare results
- Performance analysis terminology
- Articles related to CPU microarchitecture
- Vectorization
Tools
perf
oprofile
tracy
valgrind
kcachegrind
flamegraph
fgtrace
- fgtrace: an experimental profiler/tracer that captures wallclock timelines for each goroutine; it's very similar to the Chrome profiler.
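A minimal usage sketch, based on the project's README (the API is experimental and may change):

```go
package main

import "github.com/felixge/fgtrace"

func main() {
	// Write a wallclock timeline of all goroutines to fgtrace.json,
	// viewable in the Chrome DevTools performance tab or Perfetto.
	defer fgtrace.Config{Dst: fgtrace.File("fgtrace.json")}.Trace().Stop()

	// ... code to trace ...
}
```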
Clang Build Analyzer
C
- 2002 Jul The New C: Inline Functions
- 2004 Feb Writing Efficient C and C Code Optimization
- 2016 Oct Small-Size Optimization in C
- 2018 Nov How to optimize C and C++ code in 2018
Java
Go
- go-perfbook Thoughts on Go performance optimization
- Performance without the event loop | Dave Cheney
- Go Performance Tales
- Optimizing Concurrent Map Access in Go - Misframe
- Causal Profiling for Go - Morsing's blog
- ThornyDev: An Exercise in Profiling a Go Program
- So You Wanna Go Fast? – Brave New Geek
Books
The Art of Computer Systems Performance Analysis (Raj Jain)
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling
System Performance Tuning (Mike Loukides)
Computer Systems Performance Evaluation and Prediction (Fortier, Michel)
Systems Performance (Brendan Gregg)
- 1 Introduction
- 1.1 Systems Performance: systems performance is the study of the entire system, including all physical components and the full software stack.
- 1.2 Roles
- 1.3 Activities
- 1.4 Perspectives
- 1.5 Performance Is Challenging
- 1.5.1 Performance Is Subjective
- 1.5.2 Systems Are Complex
- 1.5.3 There Can Be Multiple Performance Issues
- 1.6 Latency
- 1.7 Dynamic Tracing
- 1.8 Cloud Computing
- 1.9 Case Studies
- 1.9.1 Slow Disks
- 1.9.2 Software Change
- 1.9.3 More Reading
- 2 Methodology
- 2.1 Terminology
- 2.2 Models
- 2.2.1 System under Test
- 2.2.2 Queueing System
- 2.3 Concepts
- 2.3.1 Latency
- 2.3.2 Time Scales
- 2.3.3 Trade-offs
- 2.3.4 Tuning Efforts
- 2.3.5 Level of Appropriateness
- 2.3.6 Point-in-Time Recommendations
- 2.3.7 Load versus Architecture
- 2.3.8 Scalability
- 2.3.9 Known-Unknowns
- 2.3.10 Metrics
- 2.3.11 Utilization
- 2.3.12 Saturation
- 2.3.13 Profiling
- 2.3.14 Caching
- 2.4 Perspectives
- 2.4.1 Resource Analysis
- 2.4.2 Workload Analysis
- 2.5 Methodology
- 2.5.1 Streetlight Anti-Method
- 2.5.2 Random Change Anti-Method
- 2.5.3 Blame-Someone-Else Anti-Method
- 2.5.4 Ad Hoc Checklist Method
- 2.5.5 Problem Statement
- 2.5.6 Scientific Method
- 2.5.7 Diagnosis Cycle
- 2.5.8 Tools Method
- 2.5.9 The USE Method
- 2.5.10 Workload Characterization
- 2.5.11 Drill-Down Analysis
- 2.5.12 Latency Analysis
- 2.5.13 Method R
- 2.5.14 Event Tracing
- 2.5.15 Baseline Statistics
- 2.5.16 Static Performance Tuning
- 2.5.17 Cache Tuning
- 2.5.18 Micro-Benchmarking
- 2.6 Modeling
- 2.6.1 Enterprise versus Cloud
- 2.6.2 Visual Identification
- 2.6.3 Amdahl’s Law of Scalability
- 2.6.4 Universal Scalability Law
- 2.6.5 Queueing Theory
- 2.7 Capacity Planning
- 2.7.1 Resource Limits
- 2.7.2 Factor Analysis
- 2.7.3 Scaling Solutions
- 2.8 Statistics
- 2.8.1 Quantifying Performance
- 2.8.2 Averages
- 2.8.3 Standard Deviations, Percentiles, Median
- 2.8.4 Coefficient of Variation
- 2.8.5 Multimodal Distributions
- 2.8.6 Outliers
- 2.9 Monitoring
- 2.9.1 Time-Based Patterns
- 2.9.2 Monitoring Products
- 2.9.3 Summary-since-Boot
- 2.10 Visualizations
- 2.10.1 Line Chart
- 2.10.2 Scatter Plots
- 2.10.3 Heat Maps
- 2.10.4 Surface Plot
- 2.10.5 Visualization Tools
- 2.11 Exercises
- 2.12 References
- 3 Operating Systems
- 3.1 Terminology
- 3.2 Background
- 3.2.1 Kernel
- 3.2.2 Stacks
- 3.2.3 Interrupts and Interrupt Threads
- 3.2.4 Interrupt Priority Level
- 3.2.5 Processes
- 3.2.6 System Calls
- 3.2.7 Virtual Memory
- 3.2.8 Memory Management
- 3.2.9 Schedulers
- 3.2.10 File Systems
- 3.2.11 Caching
- 3.2.12 Networking
- 3.2.13 Device Drivers
- 3.2.14 Multiprocessor
- 3.2.15 Preemption
- 3.2.16 Resource Management
- 3.2.17 Observability
- 3.3 Kernels
- 3.3.1 Unix
- 3.3.2 Solaris-Based
- 3.3.3 Linux-Based
- 3.3.4 Differences
- 3.4 Exercises
- 3.5 References
- 4 Observability Tools
- 4.1 Tool Types
- 4.1.1 Counters
- 4.1.2 Tracing
- 4.1.3 Profiling
- 4.1.4 Monitoring (sar)
- 4.2 Observability Sources
- 4.2.1 /proc
- 4.2.2 /sys
- 4.2.3 kstat
- 4.2.4 Delay Accounting
- 4.2.5 Microstate Accounting
- 4.2.6 Other Observability Sources
- 4.3 DTrace
- 4.3.1 Static and Dynamic Tracing
- 4.3.2 Probes
- 4.3.3 Providers
- 4.3.4 Arguments
- 4.3.5 D Language
- 4.3.6 Built-in Variables
- 4.3.7 Actions
- 4.3.8 Variable Types
- 4.3.9 One-Liners
- 4.3.10 Scripting
- 4.3.11 Overheads
- 4.3.12 Documentation and Resources
- 4.4 SystemTap
- 4.4.1 Probes
- 4.4.2 Tapsets
- 4.4.3 Actions and Built-ins
- 4.4.4 Examples
- 4.4.5 Overheads
- 4.4.6 Documentation and Resources
- 4.5 perf
- 4.6 Observing Observability
- 4.7 Exercises
- 4.8 References
- 5 Applications
- 5.1 Application Basics
- 5.1.1 Objectives
- 5.1.2 Optimize the Common Case
- 5.1.3 Observability
- 5.1.4 Big O Notation
- 5.2 Application Performance Techniques
- 5.2.1 Selecting an I/O Size
- 5.2.2 Caching
- 5.2.3 Buffering
- 5.2.4 Polling
- 5.2.5 Concurrency and Parallelism
- 5.2.6 Non-Blocking I/O
- 5.2.7 Processor Binding
- 5.3 Programming Languages
- 5.3.1 Compiled Languages
- 5.3.2 Interpreted Languages
- 5.3.3 Virtual Machines
- 5.3.4 Garbage Collection
- 5.4 Methodology and Analysis
- 5.4.1 Thread State Analysis
- 5.4.2 CPU Profiling
- 5.4.3 Syscall Analysis
- 5.4.4 I/O Profiling
- 5.4.5 Workload Characterization
- 5.4.6 USE Method
- 5.4.7 Drill-Down Analysis
- 5.4.8 Lock Analysis
- 5.4.9 Static Performance Tuning
- 5.5 Exercises
- 5.6 References
- 6 CPUs
- 6.1 Terminology
- 6.2 Models
- 6.2.1 CPU Architecture
- 6.2.2 CPU Memory Caches
- 6.2.3 CPU Run Queues
- 6.3 Concepts
- 6.3.1 Clock Rate
- 6.3.2 Instruction
- 6.3.3 Instruction Pipeline
- 6.3.4 Instruction Width
- 6.3.5 CPI, IPC
- 6.3.6 Utilization
- 6.3.7 User-Time/Kernel-Time
- 6.3.8 Saturation
- 6.3.9 Preemption
- 6.3.10 Priority Inversion
- 6.3.11 Multiprocess, Multithreading
- 6.3.12 Word Size
- 6.3.13 Compiler Optimization
- 6.4 Architecture
- 6.4.1 Hardware
- 6.4.2 Software
- 6.5 Methodology
- 6.5.1 Tools Method
- 6.5.2 USE Method
- 6.5.3 Workload Characterization
- 6.5.4 Profiling
- 6.5.5 Cycle Analysis
- 6.5.6 Performance Monitoring
- 6.5.7 Static Performance Tuning
- 6.5.8 Priority Tuning
- 6.5.9 Resource Controls
- 6.5.10 CPU Binding
- 6.5.11 Micro-Benchmarking
- 6.5.12 Scaling
- 6.6 Analysis
- 6.6.1 uptime
- 6.6.2 vmstat
- 6.6.3 mpstat
- 6.6.4 sar
- 6.6.5 ps
- 6.6.6 top
- 6.6.7 prstat
- 6.6.8 pidstat
- 6.6.9 time, ptime
- 6.6.10 DTrace
- 6.6.11 SystemTap
- 6.6.12 perf
- 6.6.13 cpustat
- 6.6.14 Other Tools
- 6.6.15 Visualizations
- 6.7 Experimentation
- 6.7.1 Ad Hoc
- 6.7.2 SysBench
- 6.8 Tuning
- 6.8.1 Compiler Options
- 6.8.2 Scheduling Priority and Class
- 6.8.3 Scheduler Options
- 6.8.4 Process Binding
- 6.8.5 Exclusive CPU Sets
- 6.8.6 Resource Controls
- 6.8.7 Processor Options (BIOS Tuning)
- 6.9 Exercises
- 6.10 References
- 7 Memory
- 7.1 Terminology
- 7.2 Concepts
- 7.2.1 Virtual Memory
- 7.2.2 Paging
- 7.2.3 Demand Paging
- 7.2.4 Overcommit
- 7.2.5 Swapping
- 7.2.6 File System Cache Usage
- 7.2.7 Utilization and Saturation
- 7.2.8 Allocators
- 7.2.9 Word Size
- 7.3 Architecture
- 7.3.1 Hardware
- 7.3.2 Software
- 7.3.3 Process Address Space
- 7.4 Methodology
- 7.4.1 Tools Method
- 7.4.2 USE Method
- 7.4.3 Characterizing Usage
- 7.4.4 Cycle Analysis
- 7.4.5 Performance Monitoring
- 7.4.6 Leak Detection
- 7.4.7 Static Performance Tuning
- 7.4.8 Resource Controls
- 7.4.9 Micro-Benchmarking
- 7.5 Analysis
- 7.5.1 vmstat
- 7.5.2 sar
- 7.5.3 slabtop
- 7.5.4 ::kmastat
- 7.5.5 ps
- 7.5.6 top
- 7.5.7 prstat
- 7.5.8 pmap
- 7.5.9 DTrace
- 7.5.10 SystemTap
- 7.5.11 Other Tools
- 7.6 Tuning
- 7.6.1 Tunable Parameters
- 7.6.2 Multiple Page Sizes
- 7.6.3 Allocators
- 7.6.4 Resource Controls
- 7.7 Exercises
- 7.8 References
- 8 File Systems
- 8.1 Terminology
- 8.2 Models
- 8.2.1 File System Interfaces
- 8.2.2 File System Cache
- 8.2.3 Second-Level Cache
- 8.3 Concepts
- 8.3.1 File System Latency
- 8.3.2 Caching
- 8.3.3 Random versus Sequential I/O
- 8.3.4 Prefetch
- 8.3.5 Read-Ahead
- 8.3.6 Write-Back Caching
- 8.3.7 Synchronous Writes
- 8.3.8 Raw and Direct I/O
- 8.3.9 Non-Blocking I/O
- 8.3.10 Memory-Mapped Files
- 8.3.11 Metadata
- 8.3.12 Logical versus Physical I/O
- 8.3.13 Operations Are Not Equal
- 8.3.14 Special File Systems
- 8.3.15 Access Timestamps
- 8.3.16 Capacity
- 8.4 Architecture
- 8.4.1 File System I/O Stack
- 8.4.2 VFS
- 8.4.3 File System Caches
- 8.4.4 File System Features
- 8.4.5 File System Types
- 8.4.6 Volumes and Pools
- 8.5 Methodology
- 8.5.1 Disk Analysis
- 8.5.2 Latency Analysis
- 8.5.3 Workload Characterization
- 8.5.4 Performance Monitoring
- 8.5.5 Event Tracing
- 8.5.6 Static Performance Tuning
- 8.5.7 Cache Tuning
- 8.5.8 Workload Separation
- 8.5.9 Memory-Based File Systems
- 8.5.10 Micro-Benchmarking
- 8.6 Analysis
- 8.6.1 vfsstat
- 8.6.2 fsstat
- 8.6.3 strace, truss
- 8.6.4 DTrace
- 8.6.5 SystemTap
- 8.6.6 LatencyTOP
- 8.6.7 free
- 8.6.8 top
- 8.6.9 vmstat
- 8.6.10 sar
- 8.6.11 slabtop
- 8.6.12 mdb ::kmastat
- 8.6.13 fcachestat
- 8.6.14 /proc/meminfo
- 8.6.15 mdb ::memstat
- 8.6.16 kstat
- 8.6.17 Other Tools
- 8.6.18 Visualizations
- 8.7 Experimentation
- 8.7.1 Ad Hoc
- 8.7.2 Micro-Benchmark Tools
- 8.7.3 Cache Flushing
- 8.8 Tuning
- 8.8.1 Application Calls
- 8.8.2 ext3
- 8.8.3 ZFS
- 8.9 Exercises
- 8.10 References
- 9 Disks
- 9.1 Terminology
- 9.2 Models
- 9.2.1 Simple Disk
- 9.2.2 Caching Disk
- 9.2.3 Controller
- 9.3 Concepts
- 9.3.1 Measuring Time
- 9.3.2 Time Scales
- 9.3.3 Caching
- 9.3.4 Random versus Sequential I/O
- 9.3.5 Read/Write Ratio
- 9.3.6 I/O Size
- 9.3.7 IOPS Are Not Equal
- 9.3.8 Non-Data-Transfer Disk Commands
- 9.3.9 Utilization
- 9.3.10 Saturation
- 9.3.11 I/O Wait
- 9.3.12 Synchronous versus Asynchronous
- 9.3.13 Disk versus Application I/O
- 9.4 Architecture
- 9.4.1 Disk Types
- 9.4.2 Interfaces
- 9.4.3 Storage Types
- 9.4.4 Operating System Disk I/O Stack
- 9.5 Methodology
- 9.5.1 Tools Method
- 9.5.2 USE Method
- 9.5.3 Performance Monitoring
- 9.5.4 Workload Characterization
- 9.5.5 Latency Analysis
- 9.5.6 Event Tracing
- 9.5.7 Static Performance Tuning
- 9.5.8 Cache Tuning
- 9.5.9 Resource Controls
- 9.5.10 Micro-Benchmarking
- 9.5.11 Scaling
- 9.6 Analysis
- 9.6.1 iostat
- 9.6.2 sar
- 9.6.3 pidstat
- 9.6.4 DTrace
- 9.6.5 SystemTap
- 9.6.6 perf
- 9.6.7 iotop
- 9.6.8 iosnoop
- 9.6.9 blktrace
- 9.6.10 MegaCli
- 9.6.11 smartctl
- 9.6.12 Visualizations
- 9.7 Experimentation
- 9.7.1 Ad Hoc
- 9.7.2 Custom Load Generators
- 9.7.3 Micro-Benchmark Tools
- 9.7.4 Random Read Example
- 9.8 Tuning
- 9.8.1 Operating System Tunables
- 9.8.2 Disk Device Tunables
- 9.8.3 Disk Controller Tunables
- 9.9 Exercises
- 9.10 References
- 10 Network
- 10.1 Terminology
- 10.2 Models
- 10.2.1 Network Interface
- 10.2.2 Controller
- 10.2.3 Protocol Stack
- 10.3 Concepts
- 10.3.1 Networks and Routing
- 10.3.2 Protocols
- 10.3.3 Encapsulation
- 10.3.4 Packet Size
- 10.3.5 Latency
- 10.3.6 Buffering
- 10.3.7 Connection Backlog
- 10.3.8 Interface Negotiation
- 10.3.9 Utilization
- 10.3.10 Local Connections
- 10.4 Architecture
- 10.4.1 Protocols
- 10.4.2 Hardware
- 10.4.3 Software
- 10.5 Methodology
- 10.5.1 Tools Method
- 10.5.2 USE Method
- 10.5.3 Workload Characterization
- 10.5.4 Latency Analysis
- 10.5.5 Performance Monitoring
- 10.5.6 Packet Sniffing
- 10.5.7 TCP Analysis
- 10.5.8 Drill-Down Analysis
- 10.5.9 Static Performance Tuning
- 10.5.10 Resource Controls
- 10.5.11 Micro-Benchmarking
- 10.6 Analysis
- 10.6.1 netstat
- 10.6.2 sar
- 10.6.3 ifconfig
- 10.6.4 ip
- 10.6.5 nicstat
- 10.6.6 dladm
- 10.6.7 ping
- 10.6.8 traceroute
- 10.6.9 pathchar
- 10.6.10 tcpdump
- 10.6.11 snoop
- 10.6.12 Wireshark
- 10.6.13 DTrace
- 10.6.14 SystemTap
- 10.6.15 perf
- 10.6.16 Other Tools
- 10.7 Experimentation
- 10.7.1 iperf
- 10.8 Tuning
- 10.8.1 Linux
- 10.8.2 Solaris
- 10.8.3 Configuration
- 10.9 Exercises
- 10.10 References
- 11 Cloud Computing
- 11.1 Background
- 11.1.1 Price/Performance Ratio
- 11.1.2 Scalable Architecture
- 11.1.3 Capacity Planning
- 11.1.4 Storage
- 11.1.5 Multitenancy
- 11.2 OS Virtualization
- 11.2.1 Overhead
- 11.2.2 Resource Controls
- 11.2.3 Observability
- 11.3 Hardware Virtualization
- 11.3.1 Overhead
- 11.3.2 Resource Controls
- 11.3.3 Observability
- 11.4 Comparisons
- 11.5 Exercises
- 11.6 References
- 12 Benchmarking
- 12.1 Background
- 12.1.1 Activities
- 12.1.2 Effective Benchmarking
- 12.1.3 Benchmarking Sins
- 12.2 Benchmarking Types
- 12.2.1 Micro-Benchmarking
- 12.2.2 Simulation
- 12.2.3 Replay
- 12.2.4 Industry Standards
- 12.3 Methodology
- 12.3.1 Passive Benchmarking
- 12.3.2 Active Benchmarking
- 12.3.3 CPU Profiling
- 12.3.4 USE Method
- 12.3.5 Workload Characterization
- 12.3.6 Custom Benchmarks
- 12.3.7 Ramping Load
- 12.3.8 Sanity Check
- 12.3.9 Statistical Analysis
- 12.4 Benchmark Questions
- 12.5 Exercises
- 12.6 References
- 13 Case Study
- 13.1 Case Study: The Red Whale
- 13.1.1 Problem Statement
- 13.1.2 Support
- 13.1.3 Getting Started
- 13.1.4 Choose Your Own Adventure
- 13.1.5 The USE Method
- 13.1.6 Are We Done?
- 13.1.7 Take 2
- 13.1.8 The Basics
- 13.1.9 Ignoring the Red Whale
- 13.1.10 Interrogating the Kernel
BPF Performance Tools (Brendan Gregg)
DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD (Brendan Gregg)
The Art of Performance Engineering (Vashistha)
- 1 Getting Started
- 2 Infrastructure Design
- 3 Client Side Optimization
- 4 Web Server Optimization
- 5 Application Server Optimization
- 6 Database Server Optimization
- 7 JVM Tuning
- 8 Performance Monitoring
- 9 Performance Counters
Performance Analysis and Tuning on Modern CPUs (Denis Bakhvalov)
- 1 Introduction
- Part 1 Performance Analysis on a Modern CPU
- 2 Measuring Performance
- 3 CPU Microarchitecture
- 4 Terminology and Metrics in Performance Analysis
- 5 Performance Analysis Approaches
- 6 CPU Features for Performance Analysis
- Part 2 Source Code Tuning for CPU
- 7 CPU Front-End Optimizations
- 8 CPU Back-End Optimizations
- 9 Optimizing Bad Speculation
- 10 Other Tuning Areas
- 11 Optimizing Multithreaded Applications