- Processing frameworks
- Automation (scheduling)
- MySQL
- PostgreSQL
- MongoDB
- Data cleaning
- Data aggregation
- Clustering
- Batch and Streaming
- Spark
- Hive
- Flink and Kafka
- Set up and manage workflows (see the DAG sketch after this list)
- Plan jobs with a schedule
- Resolve dependency requirements
- Apache Airflow
- Oozie
- Luigi
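
A minimal sketch of what scheduling and dependency management look like in Airflow, assuming a recent Airflow 2.x release (the `schedule` argument replaced `schedule_interval` in 2.4). The DAG id, task ids, and callables are illustrative placeholders, not a real pipeline:

```python
# Minimal Airflow DAG: a daily schedule with explicit task dependencies.
# DAG id, task ids, and callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator


def _clean():
    print("cleaning raw data")         # stand-in for the real cleaning step


def _aggregate():
    print("aggregating cleaned data")  # stand-in for the real aggregation step


with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # plan jobs with a schedule
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    clean = PythonOperator(task_id="clean", python_callable=_clean)
    aggregate = PythonOperator(task_id="aggregate", python_callable=_aggregate)

    # Dependency requirement: cleaning must finish before aggregation runs.
    start >> clean >> aggregate
```

Luigi and Oozie express the same ideas with tasks/`requires()` and XML workflow definitions respectively; the scheduling and dependency concepts carry over.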
- Get data from different sources (databases, APIs, etc.)
- Extract, process, and load data using Apache Spark, saving the result to an analytics-ready store (see the PySpark sketch after this list)
- Schedule the jobs using Apache Airflow
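
A sketch of the Spark ETL step; the S3 paths, schema, and column names are assumptions made for illustration:

```python
# Sketch of a PySpark ETL job: extract raw events, transform them, and
# load an analytics-ready Parquet dataset. Paths and column names are
# hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_job").getOrCreate()

# Extract: read raw events (could equally come from JDBC, an API dump, etc.)
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: basic cleaning plus a daily aggregation
cleaned = (
    raw.dropna(subset=["user_id", "event_time"])
       .withColumn("event_date", F.to_date("event_time"))
)
daily_counts = cleaned.groupBy("event_date", "event_type").count()

# Load: write the result partitioned by date for downstream analytics
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/analytics/daily_event_counts/"
)

spark.stop()
```

In practice, Airflow would submit a script like this on a schedule, for example with a spark-submit task defined via the Apache Spark provider package.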
Hive: offers extraction and query features for the ETL pipeline
It lets you query data on a cluster with SQL-like syntax (HiveQL), which is compiled into Hadoop jobs
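
One way to run a Hive-style query from Python is through PySpark with Hive support enabled; the `sales` table and its columns below are hypothetical, and on a classic Hive deployment the same HiveQL would be compiled into Hadoop jobs:

```python
# Sketch: running a Hive-style (HiveQL) query from PySpark with Hive
# support enabled. The `sales` table and its columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive_query")
    .enableHiveSupport()   # read tables registered in the Hive metastore
    .getOrCreate()
)

top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_customers.show()

spark.stop()
```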
Hadoop MapReduce performs many write operations to disk
It distributes data processing tasks across the nodes of a cluster
Spark keeps the processing in memory
Spark improves on the limitations of MapReduce
A Resilient Distributed Dataset (RDD) is a Spark data structure that manages data across multiple nodes.
RDDs are read-only, partitioned collections of elements
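
A small PySpark sketch of those RDD properties: the data is partitioned, each transformation produces a new (read-only) RDD, and nothing runs until an action is called. The values and partition count are arbitrary:

```python
# Sketch: creating and transforming an RDD. Each transformation returns
# a new RDD (RDDs are read-only); nothing executes until an action is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11), numSlices=4)   # partitioned collection
squares = numbers.map(lambda x: x * x)                # lazy transformation -> new RDD
even_squares = squares.filter(lambda x: x % 2 == 0)   # another derived RDD

print(even_squares.collect())          # action: triggers the in-memory computation
print(numbers.getNumPartitions())      # 4 partitions spread across the cluster

spark.stop()
```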
You might use Airflow to schedule a complex ML pipeline. Within that pipeline, you could use Beam for data processing and preparation, and then use MLflow to track the experiments and manage the deployed model.
Data engineering workflows: Airflow can manage the scheduling of data pipelines that process data using Beam.
You can use Airflow to author and schedule the execution of a Beam pipeline, which is then run on a processing engine like Google Cloud Dataflow.
Airflow acts as the "project manager" and Beam defines the "work" that needs to be done within that schedule.
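
A sketch of the Beam side of that split: a small Python pipeline that an Airflow task could launch on a schedule (for instance by submitting it to Dataflow). The bucket paths and the three-column CSV layout are assumptions:

```python
# Sketch of a small Apache Beam pipeline that Airflow could schedule.
# The bucket paths and CSV layout are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Runner, project, region, etc. would normally be passed in by the
    # Airflow task that launches this pipeline (e.g. on Dataflow).
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadRaw" >> beam.io.ReadFromText("gs://example-bucket/raw/events.csv")
            | "ParseAmount" >> beam.Map(lambda line: float(line.split(",")[2]))
            | "SumAmounts" >> beam.CombineGlobally(sum)
            | "WriteResult" >> beam.io.WriteToText("gs://example-bucket/prepared/total")
        )


if __name__ == "__main__":
    run()
```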
- Apache NiFi
- Airflow
- Prefect
- Dagster
- Airbyte
- AWS Step Functions
- Google Cloud Composer
- Amazon MWAA