@josethz00
Last active September 17, 2025 21:41
Data Engineering Roadmap

Hi there! I built this roadmap with the topics I consider essential for becoming an awesome data engineer. I also use it to organize my own studies and fill some gaps I have identified in my own knowledge.

The roadmap follows the order I think is ideal. However, consider it a suggestion, not a requirement.

Also, I divided the roadmap in three parts:

  1. Essentials
  2. Advanced Topics
  3. Extras (Optional)

1. Essentials

In this section, I put all the topics that I judge essential for becoming a good Data Engineer. By building these skills, you can get your first job, be promoted, and be recognized within your company. However, if you want to keep growing, I strongly recommend checking the Advanced Topics section as well!

The Basic Concepts

Here's probably where you will spend most of your time, but don't lose heart, keep the pace! 🚀 🚀 It's important that you spend a lot of your precious time studying the basics, because they will help you build solid knowledge, develop critical thinking, and avoid skill gaps in the future.

  • Python Programming

    • Python is one of the most widely used languages for data workflows, thanks to its rich ecosystem of libraries and its ease of use.
  • Git and GitHub

  • Networking Fundamentals

  • Operating Systems

    • Don't forget to learn a Linux distro after understanding how an operating system works

Database Basic Concepts

As a Data Engineer, you must know a lot about databases. If you don't, it's time to learn. Here are some basic database concepts that you need to know:

  • Why do I need a database?

  • What is a table?

  • What is Normalization?

  • What are indexes?

  • OLAP vs OLTP
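
To make the "What are indexes?" question concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name are hypothetical, chosen only for illustration, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(conn, sql):
    """Return the steps SQLite plans to run for `sql` as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the description


query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index, SQLite has to scan every row of the table.
plan_without_index = query_plan(conn, query)

# With an index on the filtered column, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, query)

print(plan_without_index)
print(plan_with_index)
```

The first plan should describe a full table scan, while the second should mention the index by name.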

Data Engineering Lifecycle

For building good data pipelines, you need to understand the entire data engineering lifecycle.

  • Data Generation

  • Data Storage

  • Data Ingestion

  • Data Serving

Relational Databases

You will very likely spend a lot of time here too, because relational databases underpin most systems on the internet. Even some non-relational databases borrowed ideas from the relational world, including SQL, and many of them are SQL-compatible.

  • Indexes and Keys

  • Data Modeling

  • Normal Forms (1NF to Boyce-Codd Normal Form)

  • Relational Algebra

  • SQL

    • Nowadays, SQL is not limited to querying relational databases. It is also used for data transformation, data testing, and many other tasks
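
As a rough illustration of SQL used for transformation and testing, the sketch below builds an aggregated table from a raw one and then runs a check query that returns the rows violating an expectation. The table and column names are hypothetical, and SQLite stands in for whatever engine you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (order_id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [(1, "BR", 10.0), (2, "BR", 15.0), (3, "US", 7.5)],
)

# Transformation: derive an aggregated table from the raw one using only SQL.
conn.execute(
    """
    CREATE TABLE sales_by_country AS
    SELECT country, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM raw_sales
    GROUP BY country
    """
)

# Test: select the rows that break an expectation.
# An empty result means the check passes.
bad_rows = conn.execute(
    "SELECT * FROM sales_by_country WHERE total_amount < 0 OR country IS NULL"
).fetchall()

totals = dict(conn.execute("SELECT country, total_amount FROM sales_by_country"))
print(totals, "failing rows:", len(bad_rows))
```

Tools such as dbt follow a similar idea: models are SELECT statements, and a test fails when its query returns rows.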

Introduction to Distributed Systems

As systems scale, distributed systems become an option for handling heavy workloads. It's essential to understand them, so you know how to coordinate and distribute tasks across different computers to speed up execution.

  • What is a Distributed System?

  • Nodes and Clusters

  • Load Balancing

  • Microservices
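
One simple way to picture load balancing is round-robin assignment, where requests are handed to nodes in turn. The sketch below is a toy, in-process version with hypothetical node names. Real load balancers also handle health checks, weights, and failures.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hand out nodes one after another, wrapping around at the end."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = cycle(nodes)

    def next_node(self):
        return next(self._nodes)


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])

# Route nine incoming requests and count how many each node received.
assignments = [balancer.next_node() for _ in range(9)]
load = Counter(assignments)
print(load)
```

With nine requests and three nodes, each node ends up with the same share of the work.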

ACID and CAP Theorem

Here we dive into the principles that guarantee database reliability and how distributed systems make trade-offs. It’s about understanding how consistency, availability, and partition tolerance play together, and what happens when things go wrong.

  • ACID Principles

  • CAP Theorem

  • Is it possible to build a system that guarantees all three CAP properties?

    • This is a controversial one. There's plenty of content and debate about it in internet forums
  • Locks and Distributed Locks

  • Database Replication

  • Database Sharding

  • Byzantine Fault
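
The "A" in ACID (atomicity) can be seen in a few lines with Python's built-in sqlite3 module. This is a minimal sketch with a hypothetical accounts table: a transfer either applies both updates or neither, even when the second update fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance, so the credit is undone too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
print(ok, failed, balances(conn))
```

After the failed transfer, bob's balance does not keep the credit that was applied before the error, which is exactly what atomicity promises.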

Client-Server communication

Any system depends on how the client and server talk to each other. Here we’ll see the main ways this communication happens, from the traditional request-response style to real-time event-driven models.

  • REST APIs

  • GraphQL

  • SOAP

  • Websockets

  • Server-Sent Events

NoSQL Databases

Not all data fits nicely into relational tables. NoSQL databases offer different approaches to storage and querying, each one optimized for specific use cases like documents, graphs, or key-value access.

  • Document Databases

  • Key-Value Databases

  • Graph Databases

  • NewSQL Databases

Row vs Columnar format

The way data is stored has a huge impact on performance. Here we compare row-oriented and column-oriented storage, when to use each, and examples of systems that combine both.

  • Understanding the difference

  • Columnar Store Examples

  • Row Store Examples

  • Hybrid Store Examples
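
The difference between the two layouts can be sketched with plain Python data structures. The dataset below is hypothetical, and real engines add compression, encoding, and on-disk formats on top of this basic idea.

```python
# Row-oriented: each record is stored together (good for reading or writing whole rows).
rows = [
    {"id": 1, "city": "Lisbon", "amount": 10.0},
    {"id": 2, "city": "Recife", "amount": 20.0},
    {"id": 3, "city": "Lisbon", "amount": 5.0},
]

# Column-oriented: each column is stored together (good for scanning a few columns).
columns = {
    "id": [1, 2, 3],
    "city": ["Lisbon", "Recife", "Lisbon"],
    "amount": [10.0, 20.0, 5.0],
}


def to_columns(rows):
    """Pivot a list of records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


# Analytical query (sum of one column): the row layout visits every record...
total_from_rows = sum(row["amount"] for row in rows)
# ...while the columnar layout reads a single list.
total_from_columns = sum(columns["amount"])

# Point lookup (fetch one full record): trivial in the row layout...
record_from_rows = rows[1]
# ...but the columnar layout has to stitch the record back together.
record_from_columns = {name: values[1] for name, values in columns.items()}

print(total_from_rows, total_from_columns, record_from_columns)
```

This is why analytical (OLAP) systems tend to favor columnar storage, while transactional (OLTP) systems tend to favor row storage.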

Data Architectures (Lake, Ponds, Warehouse, Marts, Mesh)

Learn the different ways organizations store and manage their data. From raw lakes to structured warehouses and modern concepts like mesh, each architecture has its trade-offs.

  • Data Lakes

  • Data Warehouses

  • Data Marts

  • Data Mesh

  • Data Ponds (niche/local storage for experiments)

Data Pipelines

Pipelines are the backbone of data engineering. Here you’ll learn how to build, maintain, and optimize the flow of data from source to destination.

  • Batch Pipelines

  • Real-Time Pipelines

  • ETL vs ELT

  • Popular Tools

    • It's very important to learn the main tools, so you can build great data pipelines and meet job market expectations. Some of these tools are:
      • Apache Airflow
      • Dagster
      • Prefect
      • dbt
      • Apache Kafka
      • Apache Spark
      • Airbyte
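
The ETL vs ELT distinction is mostly about where the transformation runs. The sketch below is a simplified illustration with a hypothetical CSV payload and table names: the ETL path transforms in Python before loading, while the ELT path loads the raw strings and transforms them with SQL inside the database.

```python
import csv
import io
import sqlite3

RAW_CSV = "user_id,amount\n1,10.50\n2,4.00\n1,2.25\n"


def extract(text):
    """Parse the CSV payload into a list of dicts (all values are still strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def etl(conn, text):
    """ETL: clean and aggregate in Python, then load only the final result."""
    totals = {}
    for row in extract(text):
        user_id = int(row["user_id"])
        totals[user_id] = totals.get(user_id, 0.0) + float(row["amount"])
    conn.execute("CREATE TABLE etl_totals (user_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO etl_totals VALUES (?, ?)", sorted(totals.items()))


def elt(conn, text):
    """ELT: load the raw strings first, then transform inside the database."""
    conn.execute("CREATE TABLE raw_payments (user_id TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO raw_payments VALUES (?, ?)",
        [(row["user_id"], row["amount"]) for row in extract(text)],
    )
    conn.execute(
        """
        CREATE TABLE elt_totals AS
        SELECT CAST(user_id AS INTEGER) AS user_id,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_payments
        GROUP BY CAST(user_id AS INTEGER)
        """
    )


conn = sqlite3.connect(":memory:")
etl(conn, RAW_CSV)
elt(conn, RAW_CSV)

etl_result = conn.execute("SELECT user_id, total FROM etl_totals ORDER BY user_id").fetchall()
elt_result = conn.execute("SELECT user_id, total FROM elt_totals ORDER BY user_id").fetchall()
print(etl_result, elt_result)
```

Both paths arrive at the same totals. The ELT path also keeps the raw data in the database, which makes it possible to rerun or change the transformation later.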

Cloud Computing

More and more companies are leaving on-premises infrastructure and moving to the cloud. For this reason, it is important that you understand how the major cloud providers work and what their main data engineering tools are.

  • Main Concepts

  • SLAs and SLOs

  • Amazon Web Services

  • Google Cloud

  • Microsoft Azure

Containers & Orchestration

Modern data engineering depends heavily on containerization and orchestration. Understanding Docker, Kubernetes, and other orchestrators will give you the ability to scale and manage workflows efficiently.

  • Docker Basics

  • Kubernetes Essentials

  • Workflow Orchestration (Airflow, Prefect, Dagster)

  • CI/CD Integration


2. Advanced Topics

Now that you’ve mastered the essentials, it’s time to dive deeper. These topics will help you refine your craft, optimize your solutions, and deal with large-scale, complex environments. Mastering these can make you stand out as a senior or lead data engineer.

Monitoring

Monitoring ensures that your data systems are reliable and performant. Learn how to implement observability with metrics, logging, and alerting to detect and fix issues quickly.

  • Metrics & Dashboards (Prometheus, Grafana)

  • Logging (ELK Stack, OpenTelemetry)

  • Alerting & Incident Management

Data Quality & Governance

Building trust in data requires governance. Understand how to ensure accuracy, consistency, and compliance with standards such as GDPR, HIPAA, or company-specific policies.

  • Data Validation & Testing

  • Metadata Management

  • Building a Data Catalog

  • Master Data Management (MDM)

  • Compliance & Security
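
Data validation can start as simply as a set of named rules applied to every record. The sketch below uses hypothetical rules and a hypothetical users dataset. Dedicated data quality tools offer much richer checks and reporting.

```python
def validate(rows, rules):
    """Return a list of (row_index, rule_name) for every failed check."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures.append((index, name))
    return failures


# Hypothetical rules: each one takes a row and returns True when the row is valid.
rules = {
    "id_not_null": lambda row: row.get("id") is not None,
    "age_in_range": lambda row: isinstance(row.get("age"), int) and 0 <= row["age"] <= 130,
    "email_has_at": lambda row: "@" in (row.get("email") or ""),
}

rows = [
    {"id": 1, "age": 34, "email": "ana@example.com"},
    {"id": None, "age": 29, "email": "bob@example.com"},
    {"id": 3, "age": -5, "email": "not-an-email"},
]

failures = validate(rows, rules)
print(failures)
```

In a pipeline, a non-empty failure list would typically block the load or raise an alert.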

Massively Parallel Processing (MPP)

MPP systems distribute workloads across multiple nodes, making them crucial for handling big data. Learning them will help you design scalable and performant solutions.

  • Shared-Nothing Architecture

  • Query Optimization in MPP Systems

  • Common MPP Tools (Snowflake, Redshift, BigQuery)
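
The shared-nothing idea can be sketched in a single process: rows are hash-partitioned across nodes, each node aggregates only its own slice, and a coordinator merges the partial results. The function names and data below are hypothetical, and the "nodes" are just Python lists rather than separate machines.

```python
import zlib
from collections import Counter


def partition(rows, node_count):
    """Hash-partition rows by key so each node owns a disjoint slice of the data."""
    nodes = [[] for _ in range(node_count)]
    for key, value in rows:
        # crc32 is used because it is stable across runs, unlike the built-in hash().
        nodes[zlib.crc32(key.encode()) % node_count].append((key, value))
    return nodes


def local_aggregate(node_rows):
    """Work done independently on one node, touching only its own rows."""
    totals = Counter()
    for key, value in node_rows:
        totals[key] += value
    return totals


def merge(partials):
    """Coordinator step: combine the per-node partial results."""
    result = Counter()
    for partial in partials:
        result.update(partial)  # Counter.update adds the counts together
    return dict(result)


rows = [("br", 10), ("us", 5), ("br", 7), ("pt", 3), ("us", 1)]
nodes = partition(rows, node_count=3)
result = merge(local_aggregate(node_rows) for node_rows in nodes)
print(result)
```

Because all rows with the same key land on the same node, the local aggregation step needs no communication between nodes.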

Data Streaming

Streaming technologies like Kafka, Flink, or Spark Streaming are key for real-time processing. Knowing how to implement and manage streaming pipelines is an advanced yet very in-demand skill.

  • Message Brokers (Kafka, RabbitMQ)

  • Stream Processing Engines (Flink, Spark Streaming, ksqlDB)

  • Exactly-Once Processing & Checkpointing
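
Checkpointing can be pictured as saving the read position together with the computed state, so a restart resumes from the last saved point instead of reprocessing everything. The sketch below is a heavily simplified, in-memory illustration with hypothetical names. Real engines persist checkpoints to durable storage and coordinate them across many operators.

```python
def run_consumer(events, checkpoint, crash_after=None):
    """Process events from the checkpointed offset.

    `checkpoint` is an (offset, total) tuple. The new offset and the new result
    are always replaced together, which keeps a restart from double-counting.
    `crash_after` stops the loop early to mimic a crash.
    """
    offset, total = checkpoint
    handled = 0
    while offset < len(events):
        if crash_after is not None and handled == crash_after:
            break  # simulated crash: nothing after the last checkpoint is kept
        total += events[offset]
        offset += 1
        handled += 1
        checkpoint = (offset, total)  # offset and state saved as one unit
    return checkpoint


events = [5, 10, 15, 20]

# First run "crashes" after two events, leaving a checkpoint behind.
after_crash = run_consumer(events, checkpoint=(0, 0), crash_after=2)

# The restart resumes from that checkpoint, so no event is applied twice.
after_restart = run_consumer(events, checkpoint=after_crash)
print(after_crash, after_restart)
```

The final total equals the sum of all events, which is the effect that exactly-once processing aims for.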

Trade-off analysis between data engineering tools

There’s no silver bullet in data engineering. Every tool has strengths and weaknesses. Learn to analyze trade-offs and make informed decisions that fit your company’s needs.

  • Cost vs Performance

  • Scalability vs Simplicity

  • Vendor Lock-in vs Open Source

Data Visualization

While not the primary job of a data engineer, visualization helps you understand and communicate data insights. Learning how to build clear dashboards and reports adds great value.

  • BI Tools (Tableau, Power BI, Looker)

  • Custom Dashboards (Superset, Metabase)

  • Storytelling through Visualization

Database Internals

Going under the hood helps you understand performance at a deeper level. Study query optimizers, storage engines, and concurrency control to better design your data systems.

  • Query Execution & Optimizers

  • Storage Engines (InnoDB, RocksDB)

  • Concurrency Control & Transactions

  • B-Tree vs B+Tree vs LSM-Tree
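
To get a feel for how an LSM-Tree differs from a B-Tree, the toy sketch below buffers writes in an in-memory table and flushes them as immutable sorted runs, with reads checking the newest data first. The class name and its parameters are hypothetical, and it deliberately leaves out compaction, write-ahead logging, deletes, and disk storage.

```python
import bisect


class TinyLSM:
    """A toy LSM-style key-value store kept entirely in memory."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}  # recent writes, mutable
        self.runs = []      # flushed (keys, values) pairs, oldest first, each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never modify old runs; they only touch the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into an immutable sorted run."""
        if self.memtable:
            items = sorted(self.memtable.items())
            keys = [key for key, _ in items]
            values = [value for _, value in items]
            self.runs.append((keys, values))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)  # binary search in the sorted run
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full, so it is flushed as the first run
db.put("a", 3)  # newer value for "a" shadows the one in the old run
print(db.get("a"), db.get("b"), db.get("z"))
```

Writes are cheap because they are sequential appends, while reads may need to check several runs. A B-Tree makes roughly the opposite trade-off by updating pages in place.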

Programming in Scala

Scala is widely used in big data frameworks such as Spark. Learning it allows you to extend and optimize your usage of these tools.

  • Functional Programming Basics

  • Using Scala in Spark

  • Performance Benefits over Python

Stakeholder management

Technical skills aren’t enough. A senior data engineer must also manage expectations, communicate clearly with stakeholders, and align data initiatives with business goals.

  • Communicating with Non-Technical Stakeholders

  • Translating Business Needs into Technical Requirements

  • Prioritization & Expectation Management


3. Extras (Optional)

These are not mandatory, but if you want to make your profile even stronger, these topics can differentiate you. They’re great for polishing your career and expanding beyond the typical data engineering scope.

Introduction to Statistics

Understanding statistics helps you reason about data distributions, variability, and significance. Even as a data engineer, this knowledge strengthens your analytical thinking.

  • Probability Basics

  • Descriptive vs Inferential Statistics

  • Hypothesis Testing
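
As a small taste of descriptive statistics, Python's built-in statistics module can summarize a dataset in a few lines. The latency values below are made up for illustration.

```python
import statistics

# Hypothetical sample: request latencies in milliseconds.
latencies_ms = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(latencies_ms),                # central tendency
    "median": statistics.median(latencies_ms),            # middle value, robust to outliers
    "population_stdev": statistics.pstdev(latencies_ms),  # spread around the mean
}
print(summary)
```

Comparing the mean and the median is a quick way to spot skew in a distribution, which often matters when you monitor pipeline run times or data volumes.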

Introduction to Machine Learning

Not every data engineer needs ML, but having a basic understanding will help you collaborate better with data scientists and even build ML pipelines.

  • Supervised vs Unsupervised Learning

  • Feature Engineering Basics

  • Model Deployment Basics

Infrastructure as Code

IaC tools like Terraform or Ansible allow you to manage infrastructure programmatically. This knowledge bridges the gap between DevOps and data engineering.

  • Terraform Basics

  • Configuration Management (Ansible, Puppet, Chef)

  • Cloud Infrastructure as Code (AWS CDK, Pulumi)

Data Viz with JavaScript

Beyond dashboards, JavaScript libraries like D3.js or ECharts let you create highly interactive data visualizations for custom applications.

  • D3.js Fundamentals

  • ECharts & Plotly

  • Integrating with React/Vue

Data Storytelling and Design Concepts

Storytelling transforms raw numbers into meaningful insights. Learn how to design intuitive dashboards and communicate findings effectively.

  • Principles of Data Storytelling

  • Dashboard Design Best Practices

  • Cognitive Bias & Visual Perception in Data
