Hi there! I built this roadmap with the topics I consider essential for becoming an awesome data engineer. It also helps me organize my own studies and fill some gaps I have identified in my own knowledge.
The roadmap follows an order, the one I consider ideal; however, treat it as a suggestion, not as mandatory.
Also, I divided the roadmap into three parts:
- Essentials
- Advanced Topics
- Extras (Optional)
In this section, I put all the topics that I judge essential for becoming a good Data Engineer. By mastering these skills, you can land your first job and be promoted and recognized within your company. However, if you want to keep growing, I strongly recommend checking the Advanced Topics section as well!
Here's probably where you will spend most of your time, but don't lose heart, keep the pace! 🚀 🚀 It's important to spend a lot of your precious time studying the basics, because they will help you build solid knowledge, develop critical thinking, and avoid skill gaps in the future.
- Python is the go-to language for data workflows because of its rich ecosystem of libraries and its simplicity.
- Don't forget to learn a Linux distro after understanding how an operating system works.
As a Data Engineer, you must know a lot about databases; if you don't, it's time to learn. Here are some basic database concepts that you need to know:
For building good data pipelines, you need to understand the entire data engineering lifecycle.
It's very likely that you will spend a lot of time here too, because relational databases form a big part of the foundation of most existing systems on the internet. Even many non-relational databases were inspired by concepts from the relational world, including SQL, and many of them are SQL-compatible.
- Nowadays, SQL is not limited to querying relational databases; it is also used for data transformation, testing, and many other tasks.
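To make the "SQL beyond querying" point concrete, here is a minimal sketch using Python's built-in `sqlite3` module with a hypothetical `orders` table: one statement queries the data, and another uses SQL itself to build a derived table, which is the transformation pattern tools like dbt automate.

```python
import sqlite3

# Hypothetical example: a tiny in-memory table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "paid"), (2, 80.0, "refunded"), (3, 200.0, "paid")],
)

# Querying: total revenue from paid orders.
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
).fetchone()[0]
print(total)  # 320.0

# Transformation: SQL producing a new, derived table.
conn.execute("""
    CREATE TABLE revenue_by_status AS
    SELECT status, SUM(amount) AS total_amount
    FROM orders
    GROUP BY status
""")
rows = conn.execute("SELECT * FROM revenue_by_status ORDER BY status").fetchall()
print(rows)  # [('paid', 320.0), ('refunded', 80.0)]
```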
As systems scale, distributed systems emerge as an option for handling heavy workloads. It's essential to understand them, so you know how to coordinate and distribute tasks across different machines, speeding up execution.
Here we dive into the principles that guarantee database reliability and how distributed systems make trade-offs. It’s about understanding how consistency, availability, and partition tolerance play together, and what happens when things go wrong.
- This is a contentious one. There's plenty of content and debate about it on internet forums.
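A toy sketch (not a real database) can make the consistency vs. availability trade-off tangible: two replicas of a key-value store face a network partition, and the write path must choose between rejecting the write (the CP choice) or accepting it and letting replicas diverge (the AP choice). All names here are invented for illustration.

```python
class Replica:
    def __init__(self):
        self.data = {}

def write(primary, secondary, key, value, partitioned, prefer_consistency):
    """Write to the primary and try to replicate to the secondary."""
    if partitioned and prefer_consistency:
        # CP choice: refuse the write rather than risk divergence.
        return False
    primary.data[key] = value
    if not partitioned:
        secondary.data[key] = value  # replication succeeds
    # AP choice: accept the write; replicas disagree until the partition heals.
    return True

a, b = Replica(), Replica()

# Healthy network: the write replicates to both replicas.
write(a, b, "x", 1, partitioned=False, prefer_consistency=True)
assert a.data == b.data == {"x": 1}

# Partition + consistency preference: the write is rejected (unavailable).
assert write(a, b, "x", 2, partitioned=True, prefer_consistency=True) is False

# Partition + availability preference: the write is accepted, replicas diverge.
assert write(a, b, "x", 3, partitioned=True, prefer_consistency=False) is True
assert a.data["x"] == 3 and b.data["x"] == 1
```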
Any system depends on how the client and server talk to each other. Here we’ll see the main ways this communication happens, from the traditional request-response style to real-time event-driven models.
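The two styles above can be contrasted with plain Python callables; real systems would use HTTP, gRPC, or a message broker, so treat this purely as a shape sketch with made-up names.

```python
# Request-response: the client calls and blocks until it gets an answer.
def server_handle(request: dict) -> dict:
    return {"status": 200, "echo": request["body"]}

response = server_handle({"body": "hello"})
print(response["echo"])  # hello

# Event-driven: producers publish events; subscribers react when they arrive.
subscribers = []

def subscribe(callback):
    subscribers.append(callback)

def publish(event: dict):
    for callback in subscribers:
        callback(event)

received = []
subscribe(lambda event: received.append(event["type"]))
publish({"type": "order_created"})
publish({"type": "order_shipped"})
print(received)  # ['order_created', 'order_shipped']
```

Note the inversion: in request-response the client drives the interaction, while in the event-driven model the producer doesn't know (or care) who is listening.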
Not all data fits nicely into relational tables. NoSQL databases offer different approaches to storage and querying, each one optimized for specific use cases like documents, graphs, or key-value access.
The way data is stored has a huge impact on performance. Here we compare row-oriented and column-oriented storage, when to use each, and examples of systems that combine both.
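A minimal sketch of the same toy table in both layouts shows why the distinction matters: fetching one whole record is natural in the row layout (typical OLTP access), while aggregating a single column touches only one array in the columnar layout (typical OLAP access), which is why analytical warehouses favor it. The data here is invented.

```python
# Row-oriented: each record stored together.
rows = [
    {"id": 1, "name": "Ana",  "amount": 120.0},
    {"id": 2, "name": "Bia",  "amount": 80.0},
    {"id": 3, "name": "Caio", "amount": 200.0},
]

# Column-oriented: one contiguous array per column.
columns = {
    "id": [1, 2, 3],
    "name": ["Ana", "Bia", "Caio"],
    "amount": [120.0, 80.0, 200.0],
}

# OLTP-style access: fetch one whole record (easy in the row layout).
record = rows[1]
print(record["name"])  # Bia

# OLAP-style access: aggregate one column. The columnar layout reads only
# the bytes it needs, skipping "id" and "name" entirely.
total = sum(columns["amount"])
print(total)  # 400.0
```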
Learn the different ways organizations store and manage their data. From raw lakes to structured warehouses and modern concepts like mesh, each architecture has its trade-offs.
Pipelines are the backbone of data engineering. Here you’ll learn how to build, maintain, and optimize the flow of data from source to destination.
- It's very important that you learn the main tools, so you can build great data pipelines and meet job market expectations. Some of these tools are:
- Apache Airflow
- Dagster
- Prefect
- dbt
- Apache Kafka
- Apache Spark
- Airbyte
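Before reaching for those tools, it helps to see the extract-transform-load shape they all orchestrate. Below is a minimal sketch with plain functions and invented data; orchestrators like Airflow or Dagster wrap steps like these into scheduled, retried, and monitored tasks.

```python
def extract() -> list[dict]:
    # Hypothetical source; in practice this would be an API, file, or database.
    return [
        {"user": "ana", "clicks": "10"},
        {"user": "bia", "clicks": "7"},
        {"user": "ana", "clicks": "5"},
    ]

def transform(records: list[dict]) -> dict:
    # Cast the string counts to integers and aggregate clicks per user.
    totals: dict[str, int] = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + int(r["clicks"])
    return totals

def load(totals: dict, destination: dict) -> None:
    # Stand-in for a warehouse table; real code would issue INSERTs or COPYs.
    destination.update(totals)

warehouse: dict = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'ana': 15, 'bia': 7}
```

Each step being a separate, pure-ish function is deliberate: it is what lets an orchestrator retry, schedule, and observe them independently.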
More and more, companies are leaving on-premises infrastructure and moving to the cloud. For this reason, it's important that you understand how the major cloud providers work and their main data engineering tools.
Modern data engineering depends heavily on containerization and orchestration. Understanding Docker, Kubernetes, and other orchestrators will give you the ability to scale and manage workflows efficiently.
Now that you’ve mastered the essentials, it’s time to dive deeper. These topics will help you refine your craft, optimize your solutions, and deal with large-scale, complex environments. Mastering these can make you stand out as a senior or lead data engineer.
Monitoring ensures that your data systems are reliable and performant. Learn how to implement observability with metrics, logging, and alerting to detect and fix issues quickly.
Building trust in data requires governance. Understand how to ensure accuracy, consistency, and compliance with standards such as GDPR, HIPAA, or company-specific policies.
MPP systems distribute workloads across multiple nodes, making them crucial for handling big data. Learning them will help you design scalable and performant solutions.
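The core MPP pattern is scatter-gather: partition the data, let each node compute a partial result independently, then have a coordinator merge the partials. This toy single-process sketch (all names invented) shows the shape; real engines such as Redshift, BigQuery, or Trino run the per-node step on separate machines.

```python
def partition(data, n_nodes):
    # Round-robin partitioning, just for the sketch; real systems often
    # hash-partition on a distribution key.
    return [data[i::n_nodes] for i in range(n_nodes)]

def node_partial_sum(shard):
    # The work each node does independently (in parallel, on a real cluster).
    return sum(shard)

def coordinator_sum(data, n_nodes=3):
    # Scatter, compute partials, then gather and merge.
    partials = [node_partial_sum(shard) for shard in partition(data, n_nodes)]
    return sum(partials)

print(coordinator_sum(list(range(1, 101))))  # 5050
```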
Streaming technologies like Kafka, Flink, or Spark Streaming are key for real-time processing. Knowing how to implement and manage streaming pipelines is an advanced yet very in-demand skill.
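One core idea behind those engines is windowed aggregation over an unbounded stream. Here is a minimal sketch of a tumbling (fixed, non-overlapping) window count over hypothetical click events given as `(unix_timestamp, payload)` pairs; real streaming engines add state management, watermarks, and fault tolerance on top of this.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed, non-overlapping time window."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Bucket each event by the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Invented events: three land in [0, 60), one in [60, 120), one in [120, 180).
stream = [(0, "a"), (30, "b"), (59, "c"), (60, "d"), (125, "e")]
print(tumbling_window_counts(stream))  # {0: 3, 60: 1, 120: 1}
```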
There’s no silver bullet in data engineering. Every tool has strengths and weaknesses. Learn to analyze trade-offs and make informed decisions that fit your company’s needs.
While not the primary job of a data engineer, visualization helps you understand and communicate data insights. Learning how to build clear dashboards and reports adds great value.
Going under the hood helps you understand performance at a deeper level. Study query optimizers, storage engines, and concurrency control to better design your data systems.
Scala is widely used in big data frameworks such as Spark. Learning it allows you to extend and optimize your usage of these tools.
Technical skills aren’t enough. A senior data engineer must also manage expectations, communicate clearly with stakeholders, and align data initiatives with business goals.
These are not mandatory, but if you want to make your profile even stronger, these topics can differentiate you. They’re great for polishing your career and expanding beyond the typical data engineering scope.
Understanding statistics helps you reason about data distributions, variability, and significance. Even as a data engineer, this knowledge strengthens your analytical thinking.
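Even the standard library goes a long way for this kind of reasoning. The sketch below (with invented latency numbers) computes mean, median, and sample standard deviation; a large gap between mean and median is a quick hint that outliers are skewing the data.

```python
import statistics

# Hypothetical request latencies in milliseconds, with one outlier spike.
latencies_ms = [110, 95, 102, 98, 250, 105, 99, 101]

mean = statistics.mean(latencies_ms)
median = statistics.median(latencies_ms)
stdev = statistics.stdev(latencies_ms)  # sample standard deviation

# The 250 ms spike drags the mean well above the median.
print(mean, median, round(stdev, 1))
```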
Not every data engineer needs ML, but having a basic understanding will help you collaborate better with data scientists and even build ML pipelines.
IaC tools like Terraform or Ansible allow you to manage infrastructure programmatically. This knowledge bridges the gap between DevOps and data engineering.
Beyond dashboards, JavaScript libraries like D3.js or ECharts let you create highly interactive data visualizations for custom applications.
Storytelling transforms raw numbers into meaningful insights. Learn how to design intuitive dashboards and communicate findings effectively.