The causes of performance problems in distributed software systems

A performance problem has three parts:

- The event that introduces the problem (e.g., application configuration change)
- The symptoms of the problem (e.g., CPU usage spike)
- The cause (e.g., logging was left in DEBUG mode)

This page focuses on the causes of performance problems.

Performance problems fall into two broad categories

Failures
- These rapidly take all or part of the system to zero health (i.e., crashes, outages).
- It is difficult or impossible to provide advance warning that a failure will occur.
- Examples: uncaught insufficient-permission exceptions, network cable failures.

Resource saturation issues
- These gradually take all or part of the system to zero health.
- You can often identify these early by monitoring resource metrics like CPU or memory utilization.
- Examples: memory leaks, inefficient DB queries.
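The early-warning point above can be sketched in code. This is a minimal illustration, not a production detector: it assumes utilization samples are taken at equal intervals and that growth is roughly linear, and the function name `samples_until_saturation` and the `limit` parameter are hypothetical names chosen for this example.

```python
from typing import Optional, Sequence


def samples_until_saturation(
    utilization: Sequence[float], limit: float = 100.0
) -> Optional[float]:
    """Estimate how many more samples pass before utilization reaches `limit`.

    Fits a least-squares line through equally spaced samples and extrapolates.
    Returns None when there are too few samples or the trend is flat or falling.
    """
    n = len(utilization)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(utilization) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(utilization))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None  # no upward trend, so nothing to forecast
    intercept = mean_y - slope * mean_x
    # Sample index where the fitted line crosses the limit,
    # expressed relative to the most recent sample.
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical memory utilization (%) sampled once per minute.
leaking = [40, 45, 50, 55, 60]
steady = [50, 50, 50, 50]
print(samples_until_saturation(leaking))
print(samples_until_saturation(steady))
```

A failure, by contrast, leaves no such trend to extrapolate, which is why it is hard to warn about in advance.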

Performance problems can affect any of these five parts of a system

- Application (e.g., a bug that introduces a memory leak)
- Middleware, such as a web server, load balancer, or message queue (e.g., web server plug-in failure)
- Container or server infrastructure (e.g., disk failure)
- Network (e.g., incorrect DNS server settings)
- External resources (e.g., a third-party payment processing service is down)
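As a rough sketch, the vocabulary introduced so far (the three parts of a problem, the two categories, and the five parts of a system) could be captured in a small data model. The class and field names below (`Category`, `Part`, `PerformanceProblem`, `summary`) are hypothetical and exist only for this illustration. The sample record reuses the examples given at the top of this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Category(Enum):
    FAILURE = "failure"
    RESOURCE_SATURATION = "resource saturation"


class Part(Enum):
    APPLICATION = "application"
    MIDDLEWARE = "middleware"
    INFRASTRUCTURE = "container or server infrastructure"
    NETWORK = "network"
    EXTERNAL = "external resources"


@dataclass(frozen=True)
class PerformanceProblem:
    event: str                 # what introduced the problem
    symptoms: Tuple[str, ...]  # what was observed
    cause: str                 # what was actually wrong
    category: Category
    part: Part

    def summary(self) -> str:
        """One-line description suitable for an incident note."""
        return f"{self.part.value} / {self.category.value}: {self.cause}"


debug_logging = PerformanceProblem(
    event="application configuration change",
    symptoms=("CPU usage spike",),
    cause="logging was left in DEBUG mode",
    category=Category.RESOURCE_SATURATION,
    part=Part.APPLICATION,
)
print(debug_logging.summary())
```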

A breakdown of causes by the type of problem they produce and the part of the system they affect

Once you've isolated the potential source of a performance problem, you can use this table to form a hypothesis about what the problem is and how to remediate it.

Part of stack	Ways it can fail	Ways it can experience resource saturation
Application	Uncaught exception: divide by 0, access an invalid resource, insufficient permissions; Deadlock (causes hang); Data race (causes data corruption); Configuration: configured for the wrong environment, queries a resource that doesn’t exist, wrong library packaged into application, insufficient permissions when accessing a resource	Resource leak: memory leak, DB connection leak, handle (e.g., file, TCP) leak; Database query issue: inefficient database query, N+1 query problem; Third-party library issue; Unbounded cache; Inefficient algorithms; Configuration: logging left in DEBUG mode, DB connections not being pooled, too many DB connections in connection pool, too many threads in threadpool
Middleware	Bug/crash (typically unable to diagnose further); Plug-in/extension failure; Configuration: missing required parameters or contains directory/permissions errors, configured for the wrong environment, queries a resource that doesn’t exist	Dead letter queue full or unavailable; Configuration: too many web server worker threads, web server passes down static content requests, slow plug-in/extension
Server or container	Disk/memory hardware failure; Data deletion or reformat; Power surge	Phantom external process (e.g., cron job); Unused images or libraries hogging disk space; Configuration: image type/footprint incompatible with architecture, antivirus scan coincides with peak load
Network	Link failure (i.e., physical cable failure); Power outage; ISP outage; Cloud provider outage; Configuration: incorrect DNS server settings, incorrect security policy, failed firmware upgrade, incorrect hardware installed	Uncompressed files/images sent across network; Security attack (e.g., DDOS, virus); Configuration: restrictive throttling setting
External Service	Service failure (typically unable to diagnose further)	Service issue (typically unable to diagnose further)
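One of the saturation causes listed for the application, the N+1 query problem, is easier to recognize with a concrete case. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `authors`/`books` schema and both function names are made up for this illustration. Each function returns the same data along with the number of queries it issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1'), (4, 3, 'C1');
    """
)


def titles_n_plus_one(conn):
    """N+1 pattern: one query for the authors, then one more per author."""
    queries = 0
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """Same data fetched with a single JOIN, regardless of author count."""
    result = {}
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a "
        "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    ).fetchall()
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without books yield a NULL title
            result[name].append(title)
    return result, 1


slow_result, slow_queries = titles_n_plus_one(conn)
fast_result, fast_queries = titles_single_join(conn)
print(slow_queries, fast_queries)
```

With three authors the first version issues four queries and the second issues one. The gap grows linearly with the number of parent rows, which is why the symptom tends to appear gradually as data volume increases.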

Data sources

- Methodologies and frameworks
- Web framework/service configuration parameters & tuning
- AppDynamics SaaS operations guides and interviews with AppDynamics colleagues
