Data Engineering Roadmap

Hi there! I built this roadmap with the topics I consider essential to becoming an awesome data engineer. I also use it to organize my own studies and fill some gaps I have identified in myself.

The roadmap follows the order I think is ideal. However, consider it a suggestion, not a requirement.

I also divided the roadmap into three parts:

Essentials
Advanced Topics
Extras (Optional)

1. Essentials

In this section, I put all the topics that I consider essential to becoming a good Data Engineer. By building these skills, you can get your first job, earn promotions, and be recognized within your company. However, if you want to keep growing, I strongly recommend that you check the Advanced Topics section as well!

The Basic Concepts

This is probably where you will spend most of your time, but don't lose heart, keep the pace! 🚀 🚀 It's important to spend a lot of your precious time studying the basics, because they will help you build solid knowledge, develop critical thinking, and avoid skill gaps in the future.

Python Programming
- Python is a popular choice for data workflows because of its rich ecosystem of libraries and ease of use.
Git and GitHub
Networking Fundamentals
Operating Systems
- Don't forget to learn a Linux distro after understanding how an operating system works

Database Basic Concepts

As a Data Engineer, you must know a lot about databases. If you don't, it's time to learn. Here are some basic database concepts that you need to know:

Why do I need a database?
What is a table?
What is Normalization?
What are indexes?
OLAP vs OLTP
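
To make the "What are indexes?" question concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table, its columns, and the index name `idx_users_email` are made up for illustration. It asks SQLite for the query plan of the same lookup before and after creating an index.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def query_plan(sql, params=()):
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT * FROM users WHERE email = ?"
plan_before = query_plan(lookup, ("user42@example.com",))

# Without an index SQLite has to scan the whole table; with one it can seek.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, ("user42@example.com",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a table scan and the second should mention the index.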

Data Engineering Lifecycle

For building good data pipelines, you need to understand the entire data engineering lifecycle.

Data Generation
Data Storage
Data Ingestion
Data Serving

Relational Databases

You will very likely spend a lot of time here too, because relational databases are a big part of the foundation of most existing systems on the internet. Even some non-relational databases were inspired by ideas from the relational world, including SQL, and many of them are SQL-compatible.

Indexes and Keys
Data Modeling
Normal Forms (up to Boyce-Codd Normal Form)
Relational Algebra
SQL
- Nowadays, SQL is not limited to querying relational databases. It is also used for data transformation, testing, and many other tasks.
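
As a small illustration of SQL used for transformation and testing, the sketch below runs both through Python's built-in `sqlite3` module. The table names `raw_orders` and `customer_totals` and the sample rows are assumptions for the example. The test follows a common convention: a test query selects the rows that break a rule, so an empty result means the test passes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "ana", 10.0), (2, "ana", 5.5), (3, "bob", 20.0)],
)

# Transformation: build an aggregated table from the raw one, entirely in SQL.
conn.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT customer, COUNT(*) AS orders, SUM(amount) AS total
    FROM raw_orders
    GROUP BY customer
    """
)

# Test: select rows that violate our expectations; no rows means success.
violations = conn.execute(
    "SELECT customer FROM customer_totals WHERE customer IS NULL OR total <= 0"
).fetchall()

totals = dict(conn.execute("SELECT customer, total FROM customer_totals").fetchall())
print(totals, "violations:", violations)
```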

Introduction to Distributed Systems

As systems scale, distributed systems become an option for handling heavy workloads. It's essential to understand them, so you know how to coordinate and distribute tasks across different computers to speed up execution.

What is a Distributed System?
Nodes and Clusters
Load Balancing
Microservices
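
To give a feel for how tasks get spread across nodes, here is a minimal round-robin load balancer sketch. The class name `RoundRobinBalancer` and the node names are hypothetical, and real load balancers also consider health checks, weights, and current load.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation, one per incoming task."""

    def __init__(self, nodes):
        self._rotation = cycle(nodes)

    def route(self, task):
        # The task itself is ignored here; every task simply goes to the next node.
        return next(self._rotation)


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
assignments = [balancer.route(f"task-{i}") for i in range(6)]
load = Counter(assignments)

print(assignments)
print(load)
```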

ACID and CAP Theorem

Here we dive into the principles that guarantee database reliability and how distributed systems make trade-offs. It’s about understanding how consistency, availability, and partition tolerance play together, and what happens when things go wrong.

ACID Principles
CAP Theorem
Is it possible to build a fully CAP system?
- This is a controversial one. There's plenty of content and debate about it in internet forums.
Locks and Distributed Locks
Database Replication
Database Sharding
Byzantine Fault
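
The "A" in ACID (atomicity) is easy to see in a small experiment. The sketch below uses Python's built-in `sqlite3` module with a made-up `accounts` table: a transfer either applies both updates or none of them. It only covers a single-node database, so it says nothing about the CAP trade-offs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move money between accounts; returns False if the transfer was rolled back."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst was undone as well.
        return False


ok = transfer("alice", "bob", 30)
failed = transfer("alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts").fetchall())
print(ok, failed, balances)
```

Note that the second transfer credits `bob` first and only then fails on `alice`, yet `bob` does not keep the money: the whole transaction is rolled back.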

Client-Server communication

Any system depends on how the client and server talk to each other. Here we’ll see the main ways this communication happens, from the traditional request-response style to real-time event-driven models.

REST APIs
GraphQL
SOAP
WebSockets
Server-Sent Events

NoSQL Databases

Not all data fits nicely into relational tables. NoSQL databases offer different approaches to storage and querying, each one optimized for specific use cases like documents, graphs, or key-value access.

Document Databases
Key-Value Databases
Graph Databases
NewSQL Databases

Row vs Columnar format

The way data is stored has a huge impact on performance. Here we compare row-oriented and column-oriented storage, when to use each, and examples of systems that combine both.

Understanding the difference
Columnar Store Examples
Row Store Examples
Hybrid Store Examples
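
The difference between the two layouts can be sketched with plain Python structures. The sample records are invented, and real engines add compression, encoding, and on-disk formats on top of this idea.

```python
# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "country": "BR", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "BR", "amount": 5},
]

# Column-oriented: each column is stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Analytical query (sum one column): the columnar layout touches only "amount",
# while the row layout has to walk through every full record.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])

# Point lookup (fetch one whole record): the row layout has it in one place,
# while the columnar layout must reassemble it from every column.
record_from_rows = rows[1]
record_from_columns = {key: values[1] for key, values in columns.items()}

print(total_from_rows, total_from_columns)
print(record_from_rows, record_from_columns)
```

This is why row stores tend to suit transactional workloads (OLTP) and columnar stores tend to suit analytical ones (OLAP).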

Data Architectures (Lake, Ponds, Warehouse, Marts, Mesh)

Learn the different ways organizations store and manage their data. From raw lakes to structured warehouses and modern concepts like mesh, each architecture has its trade-offs.

Data Lakes
Data Warehouses
Data Marts
Data Mesh
Data Ponds (niche/local storage for experiments)

Data Pipelines

Pipelines are the backbone of data engineering. Here you’ll learn how to build, maintain, and optimize the flow of data from source to destination.

Batch Pipelines
Real-Time Pipelines
ETL vs ELT
Popular Tools
- It's very important to learn the main tools, so you can build great data pipelines and meet job market expectations. Some of these tools are:
  - Apache Airflow
  - Dagster
  - Prefect
  - dbt
  - Apache Kafka
  - Apache Spark
  - Airbyte
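
To illustrate the ETL vs ELT distinction without any of the tools above, here is a minimal sketch using only Python and its built-in `sqlite3` module as a stand-in for the target system. The `source` records and table names are invented. In ETL the data is cleaned before it is loaded; in ELT the raw data is loaded first and then transformed inside the target with SQL.

```python
import sqlite3

# Hypothetical messy source data.
source = [{"name": " Ana ", "amount": "10.5"}, {"name": "bob", "amount": "4"}]


def clean(record):
    """Transform step written in Python: trim, lowercase, and cast."""
    return (record["name"].strip().lower(), float(record["amount"]))


# ETL: Extract -> Transform (in the pipeline) -> Load.
etl = sqlite3.connect(":memory:")
etl.execute("CREATE TABLE sales (name TEXT, amount REAL)")
etl.executemany("INSERT INTO sales VALUES (?, ?)", [clean(r) for r in source])

# ELT: Extract -> Load raw -> Transform (inside the target, with SQL).
elt = sqlite3.connect(":memory:")
elt.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
elt.executemany("INSERT INTO raw_sales VALUES (:name, :amount)", source)
elt.execute(
    """
    CREATE TABLE sales AS
    SELECT lower(trim(name)) AS name, CAST(amount AS REAL) AS amount
    FROM raw_sales
    """
)

query = "SELECT name, amount FROM sales ORDER BY name"
etl_rows = etl.execute(query).fetchall()
elt_rows = elt.execute(query).fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same clean table. The difference is where the transformation runs and whether the raw data is kept in the target.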

Cloud Computing

More and more companies are leaving on-premises infrastructure and moving to the cloud. For this reason, it is important that you understand how the major cloud providers work and what their main data engineering tools are.

Main Concepts
SLAs and SLOs
Amazon Web Services
Google Cloud
Microsoft Azure

Containers & Orchestration

Modern data engineering depends heavily on containerization and orchestration. Understanding Docker, Kubernetes, and other orchestrators will give you the ability to scale and manage workflows efficiently.

Docker Basics
Kubernetes Essentials
Workflow Orchestration (Airflow, Prefect, Dagster)
CI/CD Integration

2. Advanced Topics

Now that you’ve mastered the essentials, it’s time to dive deeper. These topics will help you refine your craft, optimize your solutions, and deal with large-scale, complex environments. Mastering these can make you stand out as a senior or lead data engineer.

Monitoring

Monitoring ensures that your data systems are reliable and performant. Learn how to implement observability with metrics, logging, and alerting to detect and fix issues quickly.

Metrics & Dashboards (Prometheus, Grafana)
Logging (ELK Stack, OpenTelemetry)
Alerting & Incident Management
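
Before reaching for Prometheus or Grafana, the idea of metrics, logging, and alerting can be sketched with the standard library alone. The helper names `run_with_metrics` and `check_alerts` and the thresholds are assumptions for illustration. A real setup would export the metrics to a monitoring system instead of returning them.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def run_with_metrics(step_name, func, *args):
    """Run one pipeline step and return its result plus basic metrics."""
    started = time.perf_counter()
    result = func(*args)
    metrics = {
        "step": step_name,
        "rows": len(result),
        "seconds": time.perf_counter() - started,
    }
    logger.info("step finished: %s", metrics)
    return result, metrics


def check_alerts(metrics, min_rows=1, max_seconds=60.0):
    """Return a list of alert messages; an empty list means the step looks healthy."""
    alerts = []
    if metrics["rows"] < min_rows:
        alerts.append(f"{metrics['step']}: only {metrics['rows']} rows")
    if metrics["seconds"] > max_seconds:
        alerts.append(f"{metrics['step']}: took {metrics['seconds']:.1f}s")
    return alerts


rows, metrics = run_with_metrics("extract", lambda: [{"id": 1}, {"id": 2}])
healthy = check_alerts(metrics)
empty_run = check_alerts({"step": "extract", "rows": 0, "seconds": 0.2})
print(healthy, empty_run)
```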

Data Quality & Governance

Building trust in data requires governance. Understand how to ensure accuracy, consistency, and compliance with standards such as GDPR, HIPAA, or company-specific policies.

Data Validation & Testing
Metadata Management
Building a Data Catalog
Master Data Management (MDM)
Compliance & Security
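
As a taste of data validation, here is a minimal sketch of a record checker in plain Python. The rules (a unique, non-null `order_id` and a non-negative numeric `amount`) and the sample data are assumptions. Dedicated validation tools offer far richer checks, but the core idea is the same.

```python
def validate(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen_ids = set()
    for position, record in enumerate(records):
        order_id = record.get("order_id")
        if order_id is None:
            problems.append(f"row {position}: order_id is missing")
        elif order_id in seen_ids:
            problems.append(f"row {position}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)

        amount = record.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            problems.append(f"row {position}: invalid amount {amount!r}")
    return problems


orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 5.0},    # duplicate id
    {"order_id": None, "amount": -3},  # missing id and negative amount
]
problems = validate(orders)
for problem in problems:
    print(problem)
```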

Massively Parallel Processing (MPP)

MPP systems distribute workloads across multiple nodes, making them crucial for handling big data. Learning them will help you design scalable and performant solutions.

Shared-Nothing Architecture
Query Optimization in MPP Systems
Common MPP Tools (Snowflake, Redshift, BigQuery)
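
The shared-nothing idea can be simulated in a few lines: data is partitioned by key, each "node" aggregates only its own partition, and a coordinator merges the partial results. This is a single-process toy with invented data and function names, not how any particular MPP engine is implemented.

```python
import zlib
from collections import Counter


def partition(records, num_nodes):
    """Assign each record to a node by hashing its key (stable across runs)."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        node_index = zlib.crc32(key.encode()) % num_nodes
        nodes[node_index].append((key, value))
    return nodes


def local_aggregate(records):
    """Work done independently on one node, using only its own data."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return totals


def merge(partials):
    """Coordinator step: combine the partial results from every node."""
    result = Counter()
    for partial in partials:
        result.update(partial)
    return dict(result)


records = [("BR", 10), ("US", 25), ("BR", 5), ("DE", 7)]
nodes = partition(records, num_nodes=3)
totals = merge(local_aggregate(node) for node in nodes)
print(totals)
```

Because all records with the same key land on the same node, each node can aggregate without talking to the others, which is what lets this pattern scale out.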

Data Streaming

Streaming technologies like Kafka, Flink, or Spark Streaming are key for real-time processing. Knowing how to implement and manage streaming pipelines is an advanced yet very in-demand skill.

Message Brokers (Kafka, RabbitMQ)
Stream Processing Engines (Flink, Spark Streaming, ksqlDB)
Exactly-Once Processing & Checkpointing
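
Checkpointing is easier to grasp with a toy consumer. In the sketch below, the `log` list stands in for a topic, and `state` holds both the running result and the offset of the next message. Updating the two together is what lets a restart resume without skipping or double-counting. The function name and the simulated crash are illustrative, and real engines achieve this with durable, distributed snapshots.

```python
def process(log, state, crash_after=None):
    """Consume messages from state['offset'], updating result and checkpoint together."""
    processed = 0
    while state["offset"] < len(log):
        if crash_after is not None and processed == crash_after:
            raise RuntimeError("simulated crash")
        value = log[state["offset"]]
        # The result and the checkpoint change in a single step, so a crash
        # can never leave one updated without the other.
        state.update(total=state["total"] + value, offset=state["offset"] + 1)
        processed += 1
    return state


log = [5, 3, 7, 1]
state = {"offset": 0, "total": 0}

try:
    process(log, state, crash_after=2)  # fails after two messages
except RuntimeError:
    checkpoint_at_crash = dict(state)

process(log, state)  # restart: resumes from the saved offset
print(checkpoint_at_crash, state)
```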

Trade-off analysis between data engineering tools

There’s no silver bullet in data engineering. Every tool has strengths and weaknesses. Learn to analyze trade-offs and make informed decisions that fit your company’s needs.

Cost vs Performance
Scalability vs Simplicity
Vendor Lock-in vs Open Source

Data Visualization

While not the primary job of a data engineer, visualization helps you understand and communicate data insights. Learning how to build clear dashboards and reports adds great value.

BI Tools (Tableau, Power BI, Looker)
Custom Dashboards (Superset, Metabase)
Storytelling through Visualization

Database Internals

Going under the hood helps you understand performance at a deeper level. Study query optimizers, storage engines, and concurrency control to better design your data systems.

Query Execution & Optimizers
Storage Engines (InnoDB, RocksDB)
Concurrency Control & Transactions
B-Tree vs B+Tree vs LSM-Tree
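
To get an intuition for the LSM-Tree side of that comparison, here is a deliberately tiny sketch. Writes go to an in-memory table that is flushed into immutable sorted segments, and reads check the newest data first. The class `TinyLSM` is hypothetical and leaves out everything that makes real engines practical (write-ahead log, compaction, bloom filters, deletes).

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, kept in memory
        self.segments = []   # flushed, sorted, immutable segments (oldest first)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write the memtable out as one sorted segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search segments from newest to oldest so the latest write wins.
        for segment in reversed(self.segments):
            index = bisect.bisect_left(segment, (key,))
            if index < len(segment) and segment[index][0] == key:
                return segment[index][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # triggers the first flush
db.put("a", 3)  # newer value for "a" stays in the memtable for now
print(db.get("a"), db.get("b"), db.get("z"))
```

Writes are cheap because they never modify existing segments, while reads may have to look in several places. A B-Tree makes the opposite trade-off by updating data in place.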

Programming in Scala

Scala is widely used in big data frameworks such as Spark. Learning it allows you to extend and optimize your usage of these tools.

Functional Programming Basics
Using Scala in Spark
Performance Benefits over Python

Stakeholder management

Technical skills aren’t enough. A senior data engineer must also manage expectations, communicate clearly with stakeholders, and align data initiatives with business goals.

Communicating with Non-Technical Stakeholders
Translating Business Needs into Technical Requirements
Prioritization & Expectation Management

3. Extras (Optional)

These are not mandatory, but if you want to make your profile even stronger, these topics can differentiate you. They’re great for polishing your career and expanding beyond the typical data engineering scope.

Introduction to Statistics

Understanding statistics helps you reason about data distributions, variability, and significance. Even as a data engineer, this knowledge strengthens your analytical thinking.

Probability Basics
Descriptive vs Inferential Statistics
Hypothesis Testing
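
For descriptive statistics, Python's built-in `statistics` module is enough to get started. The sample values below are made up, for instance as daily row counts of a table.

```python
import statistics

# Hypothetical sample, e.g. rows loaded per day (in thousands).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)       # central tendency
median = statistics.median(values)   # robust to outliers
spread = statistics.pstdev(values)   # population standard deviation

print(f"mean={mean}, median={median}, pstdev={spread}")
```

Numbers like these help you notice when a new load is unusually small or large compared to what the pipeline normally sees.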

Introduction to Machine Learning

Not every data engineer needs ML, but having a basic understanding will help you collaborate better with data scientists and even build ML pipelines.

Supervised vs Unsupervised Learning
Feature Engineering Basics
Model Deployment Basics

Infrastructure as Code

IaC tools like Terraform or Ansible allow you to manage infrastructure programmatically. This knowledge bridges the gap between DevOps and data engineering.

Terraform Basics
Configuration Management (Ansible, Puppet, Chef)
Cloud Infrastructure as Code (AWS CDK, Pulumi)

Data Viz with JavaScript

Beyond dashboards, JavaScript libraries like D3.js or ECharts let you create highly interactive data visualizations for custom applications.

D3.js Fundamentals
ECharts & Plotly
Integrating with React/Vue

Data Storytelling and Design Concepts

Storytelling transforms raw numbers into meaningful insights. Learn how to design intuitive dashboards and communicate findings effectively.

Principles of Data Storytelling
Dashboard Design Best Practices
Cognitive Bias & Visual Perception in Data

josethz00/data-eng-roadmap.md
