@rxgpt
Last active August 18, 2019 17:15

The causes of performance problems in distributed software systems

A performance problem has three parts:

  1. The event that introduces the problem (e.g., application configuration change)
  2. The symptoms of the problem (e.g., CPU usage spike)
  3. The cause (e.g., logging was left in DEBUG mode)

This page focuses on the causes of performance problems.

Performance problems fall into two broad categories

  • Failures
    • These rapidly take all or part of the system to zero health (i.e., crashes, outages).
    • It is difficult or impossible to provide advance warning that a failure will occur.
    • Examples: uncaught insufficient permission exceptions, network cable failures.
  • Resource saturation issues
    • These gradually take all or part of the system to zero health.
    • You can often identify these early by monitoring resource metrics like CPU or memory utilization.
    • Examples: memory leaks, inefficient DB queries.
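
Because saturation develops gradually, a trend in a resource metric can be extrapolated to estimate how much time is left before the resource runs out. The following is a minimal sketch of that idea, assuming utilization is sampled as a percentage at known timestamps and that growth is roughly linear. The function name `seconds_until_saturation` and its parameters are hypothetical and not part of the original notes.

```python
from typing import Optional, Sequence


def seconds_until_saturation(
    timestamps: Sequence[float],
    utilization: Sequence[float],
    limit: float = 100.0,
) -> Optional[float]:
    """Estimate seconds until utilization reaches `limit`.

    Fits a least-squares line through the samples and extrapolates it.
    Returns None when the trend is flat or decreasing (no saturation
    predicted), and 0.0 when the fitted line has already reached the limit.
    """
    n = len(timestamps)
    if n != len(utilization) or n < 2:
        raise ValueError("need at least two paired samples")

    mean_t = sum(timestamps) / n
    mean_u = sum(utilization) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Slope of the fitted line, in utilization units per second.
    cov = sum((t - mean_t) * (u - mean_u) for t, u in zip(timestamps, utilization))
    slope = cov / var_t
    if slope <= 0:
        return None

    # Time at which the fitted line crosses the limit, relative to the last sample.
    intercept = mean_u - slope * mean_t
    t_limit = (limit - intercept) / slope
    return max(0.0, t_limit - timestamps[-1])
```

For example, memory utilization samples of 40, 45, 50, and 55 percent taken one minute apart give an estimate of roughly 540 seconds until 100 percent. A failure such as a network cable fault produces no comparable trend, which is why this kind of early warning applies mainly to saturation issues.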

Performance problems affect any of these five parts of a system

  • Application (e.g., a bug that introduces a memory leak)
  • Middleware, such as a web server, load balancer, or message queue (e.g., web server plug-in failure)
  • Container or server infrastructure (e.g., disk failure)
  • Network (e.g., incorrect DNS server settings)
  • External resources (e.g., a third-party payment processing service is down)

A breakdown of causes by the type of problem they produce and the part of the system they affect

Once you've isolated the likely source of a performance problem, you can use this table to form a hypothesis about what the problem is and how to remediate it.

Application

Ways it can fail:
  • Uncaught exception
    • Divide by 0
    • Access an invalid resource
    • Insufficient permissions
  • Deadlock (causes hang)
  • Data race (causes data corruption)
  • Configuration
    • Configured for the wrong environment
    • Queries a resource that doesn’t exist
    • Wrong library packaged into application
    • Insufficient permissions when accessing a resource

Ways it can experience resource saturation:
  • Resource leak
    • Memory leak
    • DB connection leak
    • Handle (e.g., file, TCP) leak
  • Database query issue
  • Third-party library issue
  • Unbounded cache
  • Inefficient algorithms
  • Configuration
    • Logging left in DEBUG mode
    • DB connections not being pooled
    • Too many DB connections in connection pool
    • Too many threads in thread pool

Middleware

Ways it can fail:
  • Bug/crash (typically unable to diagnose further)
  • Plug-in/extension failure
  • Configuration
    • Missing required parameters, or directory/permissions errors
    • Configured for the wrong environment
    • Queries a resource that doesn’t exist

Ways it can experience resource saturation:
  • Dead letter queue full or unavailable
  • Configuration
    • Too many web server worker threads
    • Web server passes down static content requests
    • Slow plug-in/extension

Server or container

Ways it can fail:
  • Disk/memory hardware failure
  • Data deletion or reformat
  • Power surge

Ways it can experience resource saturation:
  • Phantom external process (e.g., cron job)
  • Unused images or libraries hogging disk space
  • Configuration
    • Image type/footprint incompatible with architecture
    • Antivirus scan coincides with peak load

Network

Ways it can fail:
  • Link failure (i.e., physical cable failure)
  • Power outage
  • ISP outage
  • Cloud provider outage
  • Configuration
    • Incorrect DNS server settings
    • Incorrect security policy
    • Failed firmware upgrade
    • Incorrect hardware installed

Ways it can experience resource saturation:
  • Uncompressed files/images sent across network
  • Security attack (e.g., DDoS, virus)
  • Configuration
    • Restrictive throttling setting

External service

Ways it can fail:
  • Service failure (typically unable to diagnose further)

Ways it can experience resource saturation:
  • Service issue (typically unable to diagnose further)
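
Some of the application-level saturation causes listed above have small, local remediations. As one illustration, an unbounded cache can be given a size cap with least-recently-used eviction so that its memory footprint stops growing with traffic. This is only a sketch under stated assumptions: the class name `BoundedCache` and the `max_entries` parameter are hypothetical, and a real system might instead rely on an existing cache library or on `functools.lru_cache`.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class BoundedCache:
    """A cache that evicts the least recently used entry once it is full."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None on a miss."""
        if key not in self._entries:
            return None
        # Mark the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting the oldest entry if the cap is exceeded."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            # popitem(last=False) removes the least recently used entry.
            self._entries.popitem(last=False)

    def __len__(self) -> int:
        return len(self._entries)
```

The cap trades hit rate for a predictable memory ceiling, so `max_entries` would need to be sized against the memory actually available to the process.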
