In my first week at AppDynamics, my manager gave me a copy of The Phoenix Project to help me understand our customers. The book tells the story of a fictional company, Parts Unlimited, that’s struggling financially and losing business to competitors that adapted to the shift of business online by building great online retail experiences.
- Bill, the newly promoted VP of IT operations, has 3 months to transform the IT organization; otherwise, the entire IT department will be outsourced.
- Bill meets and talks regularly with Erik, a board member who gradually helps him realize what manufacturing plant work can teach him about IT operations.
- Bill gradually comes to understand the analogy between IT work and manufacturing work, and makes dramatic improvements in the reliability, efficiency, uptime, and deployment speed of the IT department.
- The rest of the company eventually comes to realize the importance of IT to the entire business, and it becomes financially successful again.
- No visibility into priorities, status of work in progress, resource availability, or level of internal demand for any IT projects.
- Very long deployment windows (several months or more) & no appreciation of tradeoffs when new requests come in, resulting in constantly delayed commitments.
- Adversarial culture among development, information security, audit, the rest of the business, and within the operations team itself, resulting in blaming each other for outages rather than working together to resolve them.
- Lack of repeatable processes for deploying new applications, leading to botched releases.
- Knowledgeable operations engineers who don’t document their work and are constantly barraged with requests from the rest of the company & fighting fires, which holds up the work they are accountable for.
The VP of IT operations and the manufacturing plant manager have the same job.
- To ensure the fast, predictable, and uninterrupted flow of planned work that delivers value to the business while minimizing the impact and disruption of unplanned work, in order to provide stable, predictable, and secure operations.
They have to solve the same fundamental problem.
- When a new order arrives at a plant, the manufacturing resource planning coordinator controls the flow of work by first looking at the order and the bill of materials & routings, then looking at the loadings of relevant work centers in the plant and deciding whether accepting the order jeopardizes existing commitments.
- The IT operations organization has to do the same thing with IT work: scoping, sequencing, prioritizing, and releasing across many teams.
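The coordinator's accept/reject check described above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the book: the `WorkCenter` class, the `can_accept` and `overloaded_centers` functions, and the hour-based loading model are all assumptions made for the example. The same check applies to IT work if you read "work center" as a team or a key engineer.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WorkCenter:
    """Hypothetical model of one work center's loading in a planning window."""
    name: str
    capacity_hours: float   # hours available in the planning window
    committed_hours: float  # hours already promised to accepted orders

    def spare_hours(self) -> float:
        return self.capacity_hours - self.committed_hours


def overloaded_centers(routing: dict[str, float],
                       centers: dict[str, WorkCenter]) -> list[str]:
    """Return the work centers on the routing that cannot absorb the order.

    `routing` maps a work center name to the hours the new order needs there.
    A center missing from `centers` is treated as having no capacity.
    """
    return [
        name
        for name, hours in routing.items()
        if name not in centers or centers[name].spare_hours() < hours
    ]


def can_accept(routing: dict[str, float],
               centers: dict[str, WorkCenter]) -> bool:
    """Accept only if no work center on the routing would be overloaded."""
    return not overloaded_centers(routing, centers)


if __name__ == "__main__":
    plant = {
        "stamping": WorkCenter("stamping", capacity_hours=40, committed_hours=30),
        "paint": WorkCenter("paint", capacity_hours=40, committed_hours=38),
    }
    new_order = {"stamping": 8, "paint": 4}
    print(can_accept(new_order, plant))          # False
    print(overloaded_centers(new_order, plant))  # ['paint']
```

Returning the list of overloaded centers, rather than only a yes/no answer, shows which commitments the new order would put at risk.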
They have the same top-level business goal.
- To match the pace of output with customer demand.
- In manufacturing, takt time is the cycle time needed to keep up with customer demand. If any operation in the flow of work takes longer than the takt time, you cannot keep up with customer demand.
- The entire plant must be understood as a holistic system and every part of it should be directed toward the ultimate goal, which is a high output of finished products.
- The speed of the entire plant is dictated by its slowest work center:
- Any improvement made anywhere besides the constraint is an illusion.
- Improvements made before the constraint just increase the size of the constraint’s queue.
- Improvements made after the constraint lead to those resources sitting idle or operating at less than full capacity.
- Unplanned work is the most destructive type of work, since it takes you away from your goals.
- If unplanned work is an issue, the plant manager’s highest priority should be to ensure the orderly handling of incidents and outages to prevent interrupting key resources.
- Time is wasted and processes break down if queues get too large.
- If too much work gets queued up waiting for other work to get completed, certain work will have to get manually escalated in order to jump the queue, compromising planning & prioritization.
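- Wait times rise steeply and nonlinearly as work center utilization approaches 100%: by the book’s rule of thumb, wait time is proportional to the percentage of time a resource is busy divided by the percentage of time it is idle, so a resource that is 90% busy has nine times the wait of one that is 50% busy.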
- Thus, in order to improve efficiency, you must:
- Identify the constraint & make wait times for it visible.
- Proactively determine which work streams depend on it.
- Relentlessly prioritize the work that flows through it.
- Set the tempo of work according to it.
- Continually optimize and improve it.
- Improving daily work is even more important than doing daily work: if your processes are not continually improving, the law of entropy guarantees that you are actually getting worse.
- Thus, you should shorten and amplify feedback loops so you can fix quality issues at the source since any defects will have to be sent back upstream to be fixed anyway.
- Periodically inject chaos into the system to ensure that practices are continually reinforced and improved upon.
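Three of the ideas in the list above (takt time, the constraint, and how wait time grows with utilization) reduce to simple arithmetic. The Python sketch below is only an illustration under stated assumptions: the function names and sample numbers are invented for the example, and `relative_wait` uses the book's busy-over-idle rule of thumb as its model rather than a full queueing formula.

```python
from __future__ import annotations


def takt_time(available_minutes: float, units_demanded: int) -> float:
    """Minutes of production time available per unit of customer demand."""
    if units_demanded <= 0:
        raise ValueError("units_demanded must be positive")
    return available_minutes / units_demanded


def find_constraint(cycle_times: dict[str, float]) -> str:
    """The work center with the longest cycle time limits the whole plant."""
    return max(cycle_times, key=cycle_times.get)


def keeps_up(cycle_times: dict[str, float], takt: float) -> bool:
    """Demand is met only if even the slowest work center is within takt time."""
    return cycle_times[find_constraint(cycle_times)] <= takt


def relative_wait(busy_percent: int) -> float:
    """Rule-of-thumb wait: percentage of time busy / percentage of time idle."""
    if not 0 <= busy_percent < 100:
        raise ValueError("busy_percent must be in the range 0..99")
    return busy_percent / (100 - busy_percent)


if __name__ == "__main__":
    # Hypothetical plant: minutes per unit at each work center.
    cycle_times = {"molding": 4.0, "heat_treat": 9.0, "paint": 5.0}

    takt = takt_time(available_minutes=480, units_demanded=60)
    print(takt)                          # 8.0 minutes per unit
    print(find_constraint(cycle_times))  # heat_treat
    print(keeps_up(cycle_times, takt))   # False: 9.0 > 8.0

    for busy in (50, 80, 90, 95, 99):
        print(busy, relative_wait(busy))
```

In this example, speeding up molding or paint changes nothing, because heat treatment still needs 9 minutes per unit against a takt time of 8. The loop at the end shows why a constraint kept near full utilization has long queues: the busy-over-idle ratio is 1 at 50% busy but 99 at 99% busy.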
Toyota wanted to be able to quickly change which cars they were producing so they could more quickly adapt to market demand. They empowered any worker on the assembly line to stop the entire plant if they identified an improvement (or a problem) so that the change could be addressed at the source.
For example, the hood-stamping process took 3 days to change out, because the industrial dies weighed several tons and took 30 steps to move. The factory workers studied this constraint and followed it all the way up to the beginning of the plant, eventually reducing the change time to 10 minutes.
- Create a high-trust culture that fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery.
- Ensure that 10–20% of manufacturing cycles are spent on crucial non-production tasks such as process improvement & maintenance.
- Understanding what “work” is so that it can be managed, sequenced, and prioritized. There are 4 types of work in the typical IT organization:
- Business project work (e.g., pre-planned development tasks for applications)
- Internal projects (e.g., provisioning servers, deploying applications)
- Changes (e.g., database upgrades, firewall configuration changes)
- Unplanned work (e.g., outages)
- Gaining visibility into all IT work: Tasking project managers to understand levels of demand and capacity by getting visibility into the pipeline of all work & providing resource cost estimates. Ensuring all work is documented in the ticketing system and setting up kanban boards.
- Understanding dependencies & sequencing work appropriately: Setting up a change board to plan work for each day of the week, and requiring all change requests to be written on notecards and put up on the board.
- Identifying & improving the constraint: Assigning a team of engineers to handle escalations that would otherwise go to the bottleneck operations engineers, so the bottleneck engineers can deliver on their pre-committed work, and requiring every step taken to be documented so that IT work is no longer held up by a few individuals.
- Creating the bill of materials for IT work: Fully documenting all processes, equipment, and people required to complete each incoming task in order to keep work flowing in one direction, minimize unplanned work, and finally be able to properly schedule work.
- Applying systems thinking at the company level: Understanding the top-level company goals & metrics by talking to executives, defining the specific role the IT department plays in achieving each of them, and ensuring that the rest of the business includes IT managers when planning.
- Aligning IT objectives with the business’s objectives by prioritizing all IT work based on its impact on the entire business.
- Planning around capacity constraints: Temporarily freezing the flow of work to IT operations and requiring all new work to be fully scoped out and prioritized against existing commitments so an accept/reject decision can be made.
- Creating and practicing procedures for disruptions: Minimizing the disruptive impact of outages by including all relevant people upfront, developing and testing hypotheses based on data, and ensuring development and operations collaborate rather than blame each other.
- Shortening feedback loops and creating a culture of experimentation: Establishing a continuous delivery pipeline that reduced deployment cycles from months to minutes by:
- reducing batch sizes for each release.
- using automation and cloud computing to reduce the steps required for builds and deployments.
- deploying code that deliberately causes large-scale faults in order to proactively uncover and fix performance issues.
- in-housing previously outsourced IT infrastructure so that it could be included in the continuous delivery pipeline.
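The earlier steps in this list (classifying work into the four types, making demand visible, and prioritizing by business impact) can be made concrete with a small data model. The Python sketch below is an assumption-laden illustration rather than anything prescribed by the book: `WorkItem`, the hour estimates, and the 1 to 5 `business_impact` score are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from enum import Enum


class WorkType(Enum):
    """The four types of IT work named in the list above."""
    BUSINESS_PROJECT = "business project"
    INTERNAL_PROJECT = "internal project"
    CHANGE = "change"
    UNPLANNED = "unplanned work"


@dataclass
class WorkItem:
    title: str
    work_type: WorkType
    estimate_hours: float
    business_impact: int  # hypothetical score: 1 (low) to 5 (high)


def demand_by_type(items: list[WorkItem]) -> dict[WorkType, float]:
    """Total estimated hours per work type, to make demand visible."""
    totals: Counter = Counter()
    for item in items:
        totals[item.work_type] += item.estimate_hours
    return dict(totals)


def prioritize(items: list[WorkItem]) -> list[WorkItem]:
    """Order work by business impact, highest first (stable for ties)."""
    return sorted(items, key=lambda item: item.business_impact, reverse=True)


if __name__ == "__main__":
    backlog = [
        WorkItem("Provision staging servers", WorkType.INTERNAL_PROJECT, 16, 2),
        WorkItem("Checkout redesign", WorkType.BUSINESS_PROJECT, 80, 5),
        WorkItem("Database upgrade", WorkType.CHANGE, 12, 3),
        WorkItem("Payment outage", WorkType.UNPLANNED, 6, 5),
    ]
    for work_type, hours in demand_by_type(backlog).items():
        print(work_type.value, hours)
    print([item.title for item in prioritize(backlog)])
```

Recording unplanned work in the same structure as planned work is the point of the exercise: once outages carry hour estimates like everything else, their cost to planned commitments shows up in the totals.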