Edge Cloud Operations: A Systems Approach

Peterson, Baker, Bavier, Williams and Davie

Table of Contents

Chapter 1: Introduction


How to Operationalize a Cloud

  • Starting with bare-metal hardware, all the way to offering one or more managed services to users
  • The edge is the place where the cloud services connect to the real world, e.g., via sensors and actuators, and where latency-sensitive services are deployed to be close to the consumers of those services.
  • It is possible to build a cloud-and all the associated lifecycle management and runtime controls that are required to operate it-using readily available open source software packages.

Terminology

  • The terminology used to talk about operating cloud services represents a mix of "modern" concepts that are native to the cloud and "traditional" concepts from earlier systems
  • Operations & Maintenance (O&M): A traditional term used to characterize the overall challenge of operationalizing a network
  • FCAPS: An acronym (Fault, Configuration, Accounting, Performance, Security) historically used in the Telco industry to enumerate the requirements for an operational system
  • OSS/BSS: Another Telco acronym (Operations Support System, Business Support System), referring to the subsystem that implements both operational logic (OSS) and business logic (BSS)
  • EMS: Yet another Telco acronym (Element Management System), corresponding to an intermediate layer in the overall O&M hierarchy
  • Orchestration: Involves assembling (e.g., allocating, configuring, connecting) a collection of physical or logical resources on behalf of some workload
  • Playbook/Workflow: A program or script that implements a multi-step orchestration process
  • Provisioning: Adding capacity (either physical or virtual resources) to a system, usually in response to changes in workload, including the initial deployment
  • Zero-Touch Provisioning: Adding new hardware without requiring a human to configure it

Disaggregation

  • Broadly speaking, disaggregation is the process of breaking large bundled components into a set of smaller constituent parts.
  • The microservice architecture is another example of disaggregation that breaks monolithic cloud applications into a mesh of single-function components, which is an essential step in accelerating feature velocity.

Cloud Technology

  • Being able to operationalize a cloud starts with the building blocks used to construct the cloud in the first place
  • This section summarizes the available technology, with the goal of identifying the baseline capabilities of the underlying system
  • Before identifying these building blocks, we need to acknowledge that we are venturing into a gray area
  • Where you draw the line shifts over time as technology matures and becomes ubiquitous

Hardware Platform

  • Building blocks for a cloud include bare-metal servers and switches, built using merchant silicon chips
  • A physical cloud cluster is then constructed with the hardware building blocks arranged as shown in Figure 1: one or more racks of servers connected by a leaf-spine switching fabric
  • The servers are shown above the switching fabric to emphasize that software running on the servers controls the switches

Software Building Blocks

  • Linux is the OS that runs on the bare metal systems
  • Docker is a container runtime that leverages OS isolation APIs to instantiate and run multiple containers, each of which is an instance defined by a Docker image
  • Kubernetes, a container management system, provides a programmatic interface for scaling container instances up and down, allocating server resources to them, setting up virtual networks to interconnect those instances, and opening service ports that external clients can use to access those instances (a programmatic example appears after this list)
  • Helm is a configuration manager that issues calls against the Kubernetes API according to an operator-provided specification known as a Helm Chart; it is now common practice to publish a Helm Chart for each microservice-based application
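
To make the Kubernetes piece of this list concrete, here is a minimal sketch that drives the scaling API programmatically using the official kubernetes Python client. The Deployment name, namespace, and replica count are hypothetical placeholders, not values taken from the book.

```python
# Minimal sketch: scaling a container workload via the Kubernetes API, using
# the official "kubernetes" Python client. The Deployment name ("my-service"),
# namespace, and replica count are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()          # load credentials from ~/.kube/config
apps = client.AppsV1Api()

# Scale the Deployment to three replicas ("scaling container instances up and down").
apps.patch_namespaced_deployment_scale(
    name="my-service",
    namespace="default",
    body={"spec": {"replicas": 3}},
)

# List the Services that expose those instances to external clients.
core = client.CoreV1Api()
for svc in core.list_namespaced_service(namespace="default").items:
    print(svc.metadata.name, svc.spec.type)
```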

Switching Fabric

  • We assume the cloud is constructed using an SDN-based switching fabric, with a disaggregated control plane running in the same cloud as the fabric interconnects.
  • For the purpose of this book, we assume the following SDN software stack: A Network OS hosts a set of control applications, including a control application that manages the leaf-spine switching fabric
  • ONOS hosts the SD-Fabric control app
  • A Switch OS runs on each switch, providing a northbound gNMI and gNOI interface through which the Network OS controls and configures each switch
  • Switches often include a commodity processor, typically running Linux and hosting control software, in addition to any switching chip that implements the data plane.

Repositories

  • Nearly every mechanism described in this book takes advantage of cloud-hosted repositories such as GitHub, DockerHub, and ArtifactHub

Other Options

  • There is no master plan for what a cloud management stack should look like
  • It is important to take a first principles approach
  • Identify the set of requirements and explore the design space
  • Only as a final step do you select an existing software component
  • This approach naturally results in an end-to-end solution with many smaller components

Future of the Sysadmin

  • System administrators have been responsible for operating enterprise networks since the first file servers, client workstations, and LANs were deployed over 30 years ago
  • The introduction of virtualization technology led to server consolidation, but did not greatly reduce the management overhead
  • Cloud providers, because of the scale of the systems they build, cannot survive with operational silos
  • They introduced increasingly sophisticated cloud orchestration technologies
  • These cloud best-practices are now available to enterprises as well, but they are often bundled as a managed service, with the cloud provider playing an ever-greater role in operating the enterprise's services

Chapter 2: Architecture


Aether is a Kubernetes-based edge cloud, augmented with a 5G-based connectivity service

  • Aether is targeted at enterprises that want to take advantage of 5G connectivity in support of mission-critical edge applications requiring predictable, low-latency connectivity
  • The combination of features to support deployment of edge applications, coupled with Aether being offered as a managed service, makes Aether a Platform-as-a-Service (PaaS).

Edge Cloud

  • Aether Connected Edge (ACE) is a Kubernetes-based cluster similar to the one shown in Chapter 1. It consists of one or more server racks interconnected by a leaf-spine switching fabric, with an SDN control plane (denoted SD-Fabric) managing the fabric.
  • ACE hosts two additional microservice-based subsystems on top of this platform: SD-RAN and SD-Core, which collectively implement 5G-Connectivity-as-a-Service.

Hybrid Cloud

  • Aether is designed to support multiple ACE deployments, all of which are managed from the central cloud.
  • Each ACE site corresponds to a physical cluster built out of bare-metal components, while each of the SD-Core CP subsystems shown in Figure 4 is actually deployed in a logical Kubernetes cluster on a commodity cloud.

Stakeholders

  • Because the target environment is a collection of Kubernetes clusters-some running on bare-metal hardware at edge sites and some running in central datacenters-there is an orthogonal issue of how decision-making responsibility for those clusters is shared among multiple stakeholders.
  • For Aether, we care about two primary stakeholders: cloud operators who manage the hybrid cloud as a whole and enterprise users who decide on a per-site basis how to take advantage of the local cloud resources (e.g., what edge applications to run and how to slice connectivity resources among those apps).

Control and Management

  • AMP includes four subsystems: Resource Provisioning, Lifecycle Management, Runtime Control, Monitoring & Telemetry
  • The design is cloud-agnostic, so AMP can be deployed in a public cloud, an operator-owned Telco cloud, or an enterprise-owned private cloud.

Resource Provisioning

  • Configures and bootstraps resources (both physical and virtual), bringing them up to a state so Lifecycle Management can take over and manage the software running on those resources.
  • As a consequence of the operations team physically connecting resources to the cloud and recording attributes for those resources in an Inventory Repo, a Zero-Touch Provisioning system (a) generates a set of configuration artifacts that are stored in a Config Repo and used during Lifecycle Management, and (b) initializes the newly deployed resources so they are in a state that Lifecycle Management is able to control.

Lifecycle Management

  • The process of integrating debugged, extended, and refactored components (often microservices) into a set of artifacts (e.g., Docker containers and Helm charts), and subsequently deploying those artifacts to the operational cloud.
  • It includes a comprehensive testing regime, and typically, a procedure by which developers inspect and comment on each others' code.
  • Version control includes evaluating dependencies, rolling out new versions of software, and rolling back to old versions when necessary.

Runtime Control

  • Once deployed and running, Runtime Control provides a programmatic API that can be used by various stakeholders to manage whatever abstract service(s) the system offers (e.g., 5G connectivity in the case of Aether).
  • Partially addresses the "management silo" issue raised in Chapter 1 by not requiring users to know how to control/configure each component of the service.

Monitoring and Telemetry

  • A running system has to be continuously monitored so operators can diagnose and respond to failures, tune performance, do root cause analysis, perform security audits, and understand when it is necessary to provision additional capacity.
  • This requires mechanisms to observe system behavior, collect and archive the resulting data, analyze the data and trigger various actions in response, and visualize the data in human consumable dashboards.

Summary

  • The system evolved bottom up, solving the next immediate problem one at a time, all the while creating a large ecosystem of open source components that can be used in different combinations.
  • Each aspect of management has to be supported in a well-defined, efficient, and repeatable way.

DevOps

  • Operational processes and procedures in a cloud setting are now commonly organized around the DevOps model
  • When it comes to a set of services (or user-visible features), developers play a role in deploying and operating those services
  • All of the activity outlined in the previous paragraph is possible only because of the rich set of capabilities built into the Control and Management Platform
  • Someone had to build that platform, which includes a testing framework that individual tests can be plugged into, and an automated deployment framework able to roll upgrades out to a scalable number of servers and sites without manual intervention

Chapter 3: Resource Provisioning


Resource Provisioning is the process of bringing virtual and physical resources online. It has both a hands-on component (racking and connecting devices) and a bootstrap component (configuring how the resources boot into a "ready" state).

  • The goal is to minimize the number and complexity of configuration steps required beyond physically connecting the device.
  • When a cloud is built from virtual resources (e.g., VMs instantiated on a commercial cloud), the "rack and connect" step is carried out by a sequence of API calls.

Physical Infrastructure

  • The process of stacking and racking hardware is inherently human-intensive, and includes considerations such as airflow and cable management.
  • Defining logical groupings of hardware resources is not unique to Aether; we can ask a commercial cloud provider to provision multiple logical clusters in the same way that a private cloud provider can do so.

Document Infrastructure

  • Documenting the physical infrastructure's logical structure in a database is how we cross the physical-to-virtual divide. It involves both defining a set of models for the information being collected and entering the corresponding facts about the physical devices.
  • NetBox is an open-source tool that supports IP address management, inventory-related information about types of devices and where they are installed, and how infrastructure is organized by group and site.
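
As a rough illustration of entering those facts programmatically, the sketch below records a newly racked server in NetBox using the pynetbox client. The NetBox URL, token, and all attribute values are hypothetical, and the exact required fields vary across NetBox versions.

```python
# Hedged sketch: recording a newly racked server in NetBox via pynetbox.
# URL, token, and all attribute values are hypothetical placeholders; the
# referenced site, role, and device type must already exist in NetBox.
import pynetbox

nb = pynetbox.api("https://netbox.example.com", token="REPLACE_ME")

# Look up the IDs of the related objects this device references.
site = nb.dcim.sites.get(name="ace-site-1")
role = nb.dcim.device_roles.get(name="server")
dtype = nb.dcim.device_types.get(model="generic-1u-server")

# Create the inventory record (older NetBox versions name the role field
# "device_role" instead of "role").
device = nb.dcim.devices.create(
    name="edge-server-01",
    site=site.id,
    role=role.id,
    device_type=dtype.id,
    serial="SN-0001",
)
print("created device id:", device.id)
```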

Configure and Boot

  • After installing the hardware and recording the relevant facts about the installation, the next step is to configure and boot the hardware so that it is "ready" for the automated procedures that follow.
  • The goal is to minimize manual configuration required to onboard physical infrastructure like that shown in Figure 12. The automated aspects of configuration are implemented as a set of Ansible roles and playbooks.

Provisioning API

  • This involves setting up a GCP-like API for the bare-metal edge clouds.
  • The API needs to provide a means to install and configure Kubernetes on each physical cluster, and to set up accounts (and associated credentials).
  • It also needs to manage independent projects that are to be deployed on a given cluster, such as managing namespaces.
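
To illustrate the last point, a provisioning API typically maps each independent project onto its own Kubernetes namespace. Below is a minimal sketch using the official kubernetes Python client; the project and label names are hypothetical.

```python
# Hedged sketch: giving each project its own Kubernetes namespace, one of the
# per-cluster setup tasks a provisioning API has to automate.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

ns = client.V1Namespace(
    metadata=client.V1ObjectMeta(
        name="project-video-analytics",                  # hypothetical project
        labels={"aether/project": "video-analytics"},    # illustrative label
    )
)
core.create_namespace(body=ns)
```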

Provisioning VMs

  • VMs are a way to isolate Kubernetes workloads on a limited number of physical servers.
  • Being able to "split" one or more servers between multiple uses-by instantiating VMs-gives the operator more flexibility in allocating resources, which usually translates into requiring fewer overall resources.

Infrastructure-as-Code

  • The provisioning interface for each of the Kubernetes variants includes a programmatic API, a Command Line Interface (CLI), and a Graphical User Interface (GUI).
  • For operational deployments, however, having a human operator interact with a CLI or GUI is problematic.
  • To solve this, find a declarative way of saying what your infrastructure is to look like, and automate the task of making calls against the API to make it so.
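
The essence of this declarative approach (which tools like Terraform implement in earnest) is a reconciliation loop: compare the declared infrastructure against what actually exists, and issue API calls to converge the two. The following Python sketch is purely illustrative; the spec format and the create/destroy helpers are made-up stubs, not any real provider's API.

```python
# Illustrative reconciliation loop behind Infrastructure-as-Code: compare a
# declarative spec against observed state and issue calls to converge them.
DESIRED = {"vm-jenkins": {"cpus": 4}, "vm-registry": {"cpus": 2}}

def observe_actual():            # stub: would query the cloud/cluster API
    return {"vm-jenkins": {"cpus": 4}}

def create(name, spec):          # stub: would call the provider's create API
    print(f"creating {name} with {spec}")

def destroy(name):               # stub: would call the provider's delete API
    print(f"destroying {name}")

def reconcile():
    actual = observe_actual()
    for name, spec in DESIRED.items():
        if name not in actual or actual[name] != spec:
            create(name, spec)   # create or re-create drifted resources
    for name in actual:
        if name not in DESIRED:
            destroy(name)        # remove anything not declared

reconcile()
```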

Platform Definition

  • Being explicit and consistent about what is platform and what is application is a prerequisite for a sound overall design
  • Aether draws two lines: Aether's base platform (Kubernetes plus SD-Fabric) and Aether PaaS, which includes SD-Core and SD-RAN running on top of the platform, plus AMP managing the whole system

Chapter 4: Lifecycle Management


  • Concerned with updating and evolving a running system over time
  • Starts with the development process-the creation of new features and capabilities
  • The innovation can come from many sources, including open source, so the real objective is to democratize the integration and deployment end of the pipeline

Design Overview

  • An overview of the pipeline/toolchain that make up the two halves of Lifecycle Management-Continuous Integration (CI) and Continuous Deployment (CD)-expanding on the high-level introduction presented in Chapter 2.
  • CI/CD keeps both the software-related components in the underlying cloud platform and the microservice workloads that run on top of that platform up to date
  • There are three takeaways from this overview
  • First, by having well-defined artifacts passed between CI and CD (and between Resource Provisioning and CD), all three subsystems are loosely coupled and able to perform their respective tasks independently
  • Second, all authoritative state needed to successfully build and deploy the system is contained within the pipeline, specifically as declarative specifications in the Config Repo
  • Third, there is an opportunity for operators to apply discretion to the pipeline

Testing Strategy

  • Ensuring code quality requires that it be subjected to a battery of tests, but the linchpin for doing so "at speed" is the effective use of automation
  • The best-practice for testing in the Cloud/DevOps environment is to adopt a Shift Left strategy
  • First, understand what types of tests you need, then set up the infrastructure required to automate those tests

Categories of Tests

  • Integration Gate: These tests are run against every attempt to check in a patch set, and so must complete quickly
  • Unit Tests: Developer-written tests that narrowly test a single module (a minimal example appears after this list)
  • Smoke Tests: A form of functional testing, typically run against a set of related modules, but in a shallow/superficial way
  • QA Cluster: Tests are run periodically (e.g., once a day, once a week) and so can be more extensive
  • Performance Tests: Measure quantifiable performance parameters, including the ability to scale the workload, rather than correctness
  • Staging Cluster: Candidate releases are run on the Staging cluster for an extensive period of time before being rolled out to Production
  • Soak Tests: Sometimes referred to as Canary Tests, these require realistic workloads to be placed on a complete system
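
To make the first two categories concrete, here is a hypothetical example of the kind of developer-written unit test run at the integration gate: it exercises a single, invented module and completes in milliseconds, so it can run on every patch set. Neither the module nor the tests come from Aether.

```python
# Hypothetical unit test suite (pytest) for a made-up QoS helper module.
import pytest

def select_traffic_class(latency_sensitive: bool) -> str:
    """Toy function under test; stands in for a real microservice module."""
    return "low-latency" if latency_sensitive else "best-effort"

def test_latency_sensitive_traffic_gets_low_latency_class():
    assert select_traffic_class(True) == "low-latency"

def test_default_traffic_gets_best_effort_class():
    assert select_traffic_class(False) == "best-effort"

def test_argument_is_required():
    # Even toy tests document assumptions about the interface.
    with pytest.raises(TypeError):
        select_traffic_class()  # missing argument
```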

Testing Framework

  • The purpose of a testing framework is to provide a means to (1) automate the execution of a range of tests; (2) collect and archive the resulting test results; (3) evaluate and analyze the test results.
  • Each of these testing frameworks requires a set of resources.

Continuous Integration

  • This is all about translating source code checked in by developers into a deployable set of Docker Images
  • The code is tested first to determine whether it is ready to be integrated, and then tested again to verify that it was integrated successfully
  • The integration itself is entirely carried out according to a declarative specification

Code Repositories

  • Provide a means to tentatively submit a patch set, triggering a set of static checks (e.g., passes linter, license, and CLA checks), and giving code reviewers a chance to inspect and comment on the code.
  • Once all such checks complete to the satisfaction of the engineers responsible for the affected modules, the patch set is merged.

Build-Integrate-Test

  • The heart of the CI pipeline is a mechanism for executing a set of processes that (a) build the component(s) impacted by a given patch set, (b) integrate the resulting executable images with other images to construct larger subsystems, (c) run tests against those integrated subsystems and post the results, and (d) optionally publish new deployment artifacts to the downstream image repository.
  • There are no special cases, just different "off-ramps" for the end-to-end CI/CD pipeline.

Continuous Deployment

  • Terraform Templates specify the underlying infrastructure, and Helm Charts specify the collection of microservices (sometimes called applications) that are to be deployed on that infrastructure.
  • Fleet can be viewed as the mechanism that implements the Deployment Gate shown in Figure 18, although other factors can also be taken into account (e.g., not starting a rollout at 5pm on a Friday afternoon).

Versioning Strategy

  • The CI/CD toolchain introduced in this chapter works only when applied in concert with an end-to-end versioning strategy, ensuring that the right combination of source modules get integrated, and later, the best combination of images gets deployed.
  • There are three main phases of the software lifecycle: development, integration, and deployment.
  • Development: Developers assign each component a semantic version number (MAJOR.MINOR.PATCH) when they check in source code.
  • Integration: The CI toolchain does a sanity check on each component’s version number, ensuring it doesn’t regress, and when it sees a new number for a microservice, builds a new image and uploads it to the image repo (a sketch of such a check appears after this list).
  • Deployment: The CD toolchain instantiates the set of Docker Images specified by name in one or more Helm Charts; each Helm Chart is also checked into a repository and has its own version number.
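
Below is a minimal sketch of the version sanity check mentioned under Integration, using the Python packaging library to compare semantic versions. The version strings and the check itself are illustrative, not Aether's actual CI logic.

```python
# Hedged sketch of a CI-side semantic-version sanity check: verify that a
# component's newly declared version does not regress relative to the latest
# published image tag. Version strings are illustrative.
from packaging.version import Version

def check_no_regression(declared: str, latest_published: str) -> bool:
    """Return True only if the declared version is strictly newer."""
    return Version(declared) > Version(latest_published)

assert check_no_regression("1.4.0", "1.3.7")        # new build is allowed
assert not check_no_regression("1.3.7", "1.3.7")    # unchanged: nothing to publish
assert not check_no_regression("1.2.0", "1.3.7")    # regression: fail the pipeline
```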

Managing Secrets

  • Secrets are part of the hybrid cloud’s configuration state, so they should be stored in the Config Repo, but repositories are not designed to be secure
  • There are two ways to address this issue: the git-crypt tool and Kubernetes’ SealedSecrets mechanism
  • One approach is to trust a process running within the cluster to manage secrets
  • This comes with the downside of putting significant trust in Jenkins, or more to the point, in DevOps practices

GitOps

  • The CI/CD pipeline described in this chapter is consistent with GitOps, an approach to DevOps designed around the idea of Configuration-as-Code-making the code repo the single source of truth for building and deploying a cloud native system.
  • Several considerations point to there being a distinction between build-time configuration state and runtime control state:
  • One is the distinction between people who develop software and people who build and operate systems using that software. DevOps (in its simplest formulation) implies there should be no such distinction, but in practice, developers are often far removed from operators, or more to the point, far removed from design decisions about exactly how others will end up using their software.
  • Another is the possibility that not all state is created equal; there is a continuum of configuration state variables.

Chapter 5: Runtime Control


Runtime Control provides an API by which various principals, such as end-users, enterprise admins, and cloud operators, can make changes to a running system by specifying new values for one or more runtime parameters.

  • Runtime Control defines an abstraction layer on top of a collection of backend components, effectively turning them into externally visible (and controllable) cloud services.

Design Overview

  • The purpose of Runtime Control is to offer an API that various stakeholders can use to configure and control cloud services.
  • Central to this role is the requirement that Runtime Control be able to represent a set of abstract objects, which is to say, it implements a data model.
  • Runtime Control must support new end-to-end abstractions that may cross multiple backend subsystems, associate control and configuration state with those abstractions, and adopt best practices of performance, high availability, reliability, and security in how this abstraction layer is implemented
  • Support Role-Based Access Controls (RBAC), so that different principals have different visibility into and control over the underlying abstract objects
  • Be extensible and able to incorporate new services and new abstractions for existing services over time

Models & State

  • x-config is the core of Runtime Control. Its job is to store and version configuration data.
  • Configuration is pushed to x-config through its northbound gNMI interface, stored in a persistent key-value store, and pushed to backend subsystems using a southbound gNMI interface.
  • A collection of YANG-based models define the schema for this configuration state, and collectively define the data model for all the configuration and control state that Runtime Control is responsible for.
  • Four important aspects of this mechanism include:
  • Persistent Store
  • Loading Models
  • Versioning and Migration
  • Synchronization

Runtime Control API

  • An API provides an interface wrapper that sits between x-config and higher-layer portals and applications.
  • gNMI is not well suited to GUI development, where a RESTful API is expected.
  • The API layer defines a "gate" that can be used to audit the history of who performs what operation when (also taking advantage of the identity management mechanism described next).
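
As a purely hypothetical illustration of what such a RESTful wrapper might look like to a stakeholder, the sketch below posts a runtime change using the requests library. The endpoint path, JSON fields, and token handling are invented for illustration; they are not Aether's actual API.

```python
# Hypothetical sketch of a stakeholder using a RESTful Runtime Control API to
# change a runtime parameter. Endpoint, fields, and token are invented.
import requests

API = "https://aether-roc.example.com/api/v1"
TOKEN = "REPLACE_WITH_OAUTH_TOKEN"   # identity established via the auth layer

resp = requests.post(
    f"{API}/slice/enterprise-a/iot-slice",
    json={"template": "small-cell-low-latency", "enabled": True},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print("runtime change accepted:", resp.json())
```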

Identity Management

  • Runtime Control leverages an external identity database (an LDAP server) to store user data such as account names and passwords for users who are able to log in.
  • It has the capability to associate users with groups, so adding administrators to the AetherAdmin group would be an obvious way to grant those individuals administrative privileges within Runtime Control.

Adapters

  • An adapter is one means of taking an abstraction that spans multiple services and applying it to each of those services.
  • Some care is needed to deal with partial failure, in case one service accepts the change, but the other does not. In this case, the adapter keeps trying the failed backend service until it succeeds.
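
A minimal sketch of that retry behavior follows, with the backend services represented by stub callables rather than real SD-Core/SD-RAN interfaces.

```python
# Sketch of an adapter applying one abstract change to multiple backend
# services and retrying any backend that rejects it until it converges.
import time

def apply_to_backends(change, backends, delay_seconds=5):
    pending = list(backends)
    while pending:
        still_failing = []
        for backend in pending:
            try:
                backend(change)              # push the change to one service
            except Exception as err:         # partial failure: remember and retry
                print(f"backend rejected change ({err}); will retry")
                still_failing.append(backend)
        pending = still_failing
        if pending:
            time.sleep(delay_seconds)        # back off before the next attempt

# Usage (with hypothetical backend callables):
# apply_to_backends({"slice": "iot", "mbr": 10_000_000}, [sd_core_api, sd_ran_api])
```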

Workflow Engine

  • This is where multi-step workflows are implemented.
  • The current implementation is ad hoc, with imperative code watching a target set of models and taking appropriate action whenever they change. Defining a more rigorous approach to workflows is a subject of ongoing development.

Secure Communication

  • gNMI naturally lends itself to mutual TLS for authentication, and that is the recommended way to secure communications between components that speak gNMI.
  • Distributing certificates between components is a problem outside the scope of Runtime Control, so another tool will be responsible for doing this.

Modeling Connectivity Service

  • This section sketches the data model for Aether's connectivity service as a way of illustrating the role Runtime Control plays.
  • Each object is an instance of one of the YANG-defined models, where every object contains an id field that is used to identify the object.
  • In addition to the id field, several other fields are also common to all models. These include description, display-name, and object-specific identifiers.

Enterprises

  • Aether is deployed in enterprises, and so defines a representative set of organizational abstractions. These include Enterprise, which forms the root of a customer-specific hierarchy.
  • The Enterprise model contains the following fields: connectivity-service, enterprise, small-cell, address, enable, and format.

Slices

  • Aether models 5G connectivity as a Slice, which represents an isolated communication channel (and associated QoS parameters) that connects a set of devices (modeled as a Device-Group) to a set of applications (each of which is modeled as an Application).
  • Each slice is nested within some site (which is in turn nested inside some enterprise), where for example, an enterprise might configure one slice to carry IoT traffic.
  • The Slice model has the following fields: device-group, app-list, template, upf, and application-specific fields.
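
For illustration only, here is what a Slice instance with those fields might look like, written as the JSON-like structure the YANG-defined models imply; all identifiers and values are invented.

```python
# Hypothetical Slice instance, expressed as the kind of JSON-like structure the
# YANG-defined models imply. IDs and values are invented for illustration.
iot_slice = {
    "id": "slice-iot",
    "description": "Isolated channel for IoT traffic",
    "display-name": "IoT Slice",
    "device-group": ["dg-cameras", "dg-sensors"],   # devices attached to the slice
    "app-list": ["app-video-analytics"],            # applications the devices may reach
    "template": "template-low-latency",             # QoS profile governing the slice
    "upf": "upf-ace-site-1",                        # User Plane Function serving the slice
}
```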

Templates and Traffic Classes

  • Associated with each Slice is a QoS-related profile that governs how traffic that slice carries is to be treated.
  • The Traffic-Class model specifies the classes of traffic and includes the following fields:
  • arp: Allocation and retention priority
  • qci: QoS class identifier
  • pelr: Packet error loss rate
  • pdb: Packet delay budget
  • For completeness, the corresponding YANG for the Template model is also given.

Other Models

  • IP-Domain: specifies IP and DNS settings
  • UPF: specifies the User Plane Function (the data plane element of the SD-Core)
  • The UPF model is necessary because an Aether deployment can run many UPF instances
  • Multiple microservice-based UPFs can be instantiated at any given time, each isolating a distinct traffic flow

Revisiting GitOps

  • One critical factor is whether or not a programmatic interface (coupled with an access control mechanism) is required for accessing and changing that state.
  • Cloud operators and DevOps teams are perfectly capable of checking configuration changes into a Config Repo, which can make it tempting to view all state that could be specified in a configuration file as Lifecycle-managed configuration state.
  • But any state that might be touched by someone other than an operator-including enterprise admins and runtime control applications-needs to be accessed via a well-defined API.

Chapter 6: Monitoring and Telemetry


Collecting telemetry data for a running system is an essential function of the management platform

  • Metrics are quantitative data about a system
  • Logs are qualitative data that is generated whenever a noteworthy event occurs
  • Traces are a record of causal relationships (e.g., Service A calls Service B) resulting from user-initiated transactions or jobs
  • The more aspects of monitoring and troubleshooting that can be automated, the better
  • Examples include alerts that automatically detect potential problems, dashboards that make it easy for humans to see patterns and drill down for relevant details across all three types of data, and closed-loop control

Metrics and Alerts

  • A popular open source monitoring stack uses Prometheus to collect and store platform and service metrics, Grafana to visualize metrics over time, and Alertmanager to notify the operations team of events that require attention.

Exporting metrics

  • Individual components implement a Prometheus Exporter to provide the current value of the component's metrics.
  • A component's Exporter is queried via HTTP, with the corresponding metrics returned using a simple text format.
  • Prometheus periodically scrapes the Exporter's HTTP endpoint and stores the metrics in its Time Series Database (TSDB).
  • Many client libraries are available for instrumenting code to produce metrics in Prometheus format.
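
As an example of such a client library, the sketch below uses prometheus_client to expose two illustrative metrics over HTTP for Prometheus to scrape; the metric names and port are placeholders.

```python
# Minimal sketch of a Prometheus Exporter using the prometheus_client library:
# it serves current metric values over HTTP so Prometheus can scrape them.
import random
import time
from prometheus_client import Gauge, Counter, start_http_server

ACTIVE_SESSIONS = Gauge("component_active_sessions", "Sessions currently open")
REQUESTS_TOTAL = Counter("component_requests_total", "Requests handled")

if __name__ == "__main__":
    start_http_server(8000)          # exposes /metrics in the Prometheus text format
    while True:
        REQUESTS_TOTAL.inc()
        ACTIVE_SESSIONS.set(random.randint(0, 50))
        time.sleep(5)
```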

Creating Dashboards

  • The metrics collected by Prometheus are visualized using Grafana dashboards.
  • A dashboard is constructed from a set of panels, where each panel has a well-defined type (e.g., graph, table, gauge, heatmap) bound to a particular Prometheus query.
  • New dashboards are created using the Grafana GUI, with the resulting configuration saved in a JSON file.

Defining Alerts

  • An alert for a particular component is defined by an alerting rule, an expression involving a Prometheus query, such that whenever it evaluates to true for the indicated time period, it triggers a corresponding message to be routed to a set of receivers.
  • These rules are recorded in a YAML file that is checked into the Config Repo and loaded into Prometheus; the Helm chart for an individual component can define the alerting rules for that component.

Logging

  • OS programmers have been writing diagnostic messages to a syslog since the earliest days of Unix
  • One typical open source logging stack uses Fluentd to collect (aggregate, buffer, and route) log messages written by a set of components, with Fluentbit serving as a client-side agent that runs with each component and helps developers normalize their log messages
  • ElasticSearch is then used to store, search, and analyze those messages

Common Schema

  • The key challenge in logging is to adopt a uniform message format across all components
  • Fluentbit plays a role in normalizing these messages by supporting a set of filters
  • These filters parse "raw" log messages written by the component (an ASCII string) and output canonical log messages as structured JSON (a sketch of emitting such structured messages appears after this list)
  • This example highlights the challenge the DevOps team faces in building the management platform
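
To make the idea of a canonical structured message concrete, the sketch below emits JSON-formatted log records directly from a hypothetical component using only the Python standard library, leaving less "raw" text for downstream filters to normalize. The field names are illustrative, not a schema prescribed by the book.

```python
# Sketch of emitting structured JSON log messages from a component (stdlib only).
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),      # timestamp created by the logger
            "level": record.levelname,
            "component": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)     # log shipping handled by the platform
handler.setFormatter(JsonFormatter())
log = logging.getLogger("sd-core.upf")          # hypothetical component name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("PDU session established")
```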

Best Practices

  • Establishing a shared logging platform is of little value unless all the individual components are properly instrumented to write log messages
  • Log shipping is handled by the platform
  • File logging should be disabled
  • Asynchronous logging is encouraged; that is, components should write logs asynchronously
  • Timestamps should be created by the program's logger
  • It must be possible to change log levels without interrupting service

Distributed Tracing

  • Tracing is challenging in a cloud setting because it involves following the flow of control for each user-initiated request across multiple microservices.
  • The good news is that activating tracing support in the underlying language runtime system-typically in the RPC stubs-is more efficient than asking app developers to explicitly instrument their programs.
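
For a sense of what a trace looks like, here is a hedged sketch using the OpenTelemetry Python API/SDK to create a parent span and a causally related child span. In practice, as noted above, this instrumentation usually lives in the RPC stubs or runtime rather than in application code, and the span and attribute names here are invented.

```python
# Hedged sketch of creating trace spans with OpenTelemetry, exported to the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("service-a")

with tracer.start_as_current_span("handle-request") as parent:
    parent.set_attribute("user.id", "enterprise-admin")      # illustrative attribute
    with tracer.start_as_current_span("call-service-b"):     # causal child span
        pass  # the RPC to Service B would happen here
```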

Integrated Dashboards

  • Creating useful panels and organizing them into intuitive dashboards is part of the solution
  • Integrating information across the subsystems of the management platform is also a requirement
  • There are two general strategies
  • Both Kibana and Grafana can be configured to display telemetry data from multiple sources
  • Having access to the data needed to know what changes (if any) need to be made is a prerequisite for making informed decisions
  • To this end, it is ideal to have access to both the "knobs" and the "dials" on an integrated dashboard

Observability

  • Knowing what telemetry data to collect, so that you have exactly the right information when you need it, without negatively impacting system performance, is a difficult problem.
  • Observability is the quality of a system that makes visible the facts about its internal operation needed to make informed management and control decisions.