Data Engineering notes

Data engineering

  • Processing frameworks
  • Automation (scheduling)

Storage

  • MySQL
  • PostgreSQL
  • MongoDB

Processing Data

  • Data cleaning
  • Data aggregation
  • Clustering
  • Batch and Streaming

Tools

  • Spark
  • Hive
  • Flink and Kafka

Automation (scheduling)

  • Set up and manage workflows
  • Plan jobs with a schedule
  • Resolve dependency requirements (see the DAG sketch after the tooling list)

Tooling

  • Apache Airflow
  • Oozie
  • Luigi
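
A minimal sketch of those three points in Apache Airflow (assumes Airflow 2.4+; the dag_id, schedule, and task callables are hypothetical, for illustration only):

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.4+).
# The dag_id, schedule, and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the sources")


def transform():
    print("clean and aggregate the data")


with DAG(
    dag_id="example_etl",             # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",                # plan jobs with a schedule
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    # ">>" declares the dependency: transform runs only after extract succeeds
    t_extract >> t_transform
```

Oozie and Luigi cover the same ground with XML workflow definitions and Python task classes, respectively.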

Pipeline Example

  • Get data from different sources (databases, APIs, etc.)
  • Extract, process, and load data using Apache Spark, saving the result into an analytics-ready store (see the PySpark sketch below)
  • Schedule the jobs using Apache Airflow
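
A minimal PySpark sketch of the middle step; the paths, column names, and aggregation are hypothetical:

```python
# Minimal PySpark ETL sketch. Paths, columns, and the aggregation are
# hypothetical. Assumes pyspark is installed; a JDBC read from
# MySQL/PostgreSQL would use spark.read.format("jdbc") instead of CSV.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract: read raw data from one of the sources
raw = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

# Process: clean and aggregate
daily = (
    raw.dropna(subset=["user_id"])                 # data cleaning
       .groupBy("event_date")                      # data aggregation
       .agg(F.countDistinct("user_id").alias("daily_users"))
)

# Load: save into an analytics-ready columnar format
daily.write.mode("overwrite").parquet("/data/analytics/daily_users")

spark.stop()
```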

Hadoop and MapReduce

Hive offers extraction and transformation features for ETL pipelines.

It lets you query data on a cluster with an SQL-like syntax (HiveQL), translating queries into Hadoop jobs.
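
Hive itself is queried in HiveQL, which classically compiles down to MapReduce jobs. As a rough illustration of the same SQL-like syntax driven from Python, here is a sketch using Spark's Hive integration (assumes a configured Hive metastore; the table name is hypothetical, and execution here is handled by Spark rather than MapReduce):

```python
# Illustrative only: a HiveQL-style query run through Spark's Hive
# integration. Assumes a Hive metastore is configured; the web_logs
# table is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-example")
    .enableHiveSupport()   # lets Spark read Hive tables and run HiveQL
    .getOrCreate()
)

# The SQL-like syntax is compiled into distributed jobs on the cluster
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM web_logs
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```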

Hadoop's MapReduce writes intermediate results to disk between processing stages, which makes it I/O heavy.

Spark

Distributes data processing tasks across the nodes of a cluster.

Spark keeps intermediate data in memory instead of writing it to disk.

Spark is an improvement on the limitations of MapReduce.

A Resilient Distributed Dataset (RDD) is Spark's core data structure; it manages data across multiple nodes.

RDDs are read-only, partitioned collections of elements.
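
A small RDD sketch (assumes a local pyspark installation). Transformations return new RDDs rather than mutating the original, which is what "read-only" means here:

```python
# Small RDD sketch (assumes a local pyspark install).
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

# A partitioned collection of elements, split across 4 partitions
numbers = sc.parallelize(range(10), numSlices=4)

# Each transformation returns a new, read-only RDD
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# Transformations are lazy; collect() triggers the actual in-memory job
print(evens_squared.collect())  # [0, 4, 16, 36, 64]

sc.stop()
```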

ML pipelines

You might use Airflow to schedule a complex ML pipeline. Within that pipeline, you could use Beam for data processing and preparation, and then use MLflow to track the experiments and manage the deployed model.
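
A minimal sketch of the MLflow tracking part (assumes mlflow and scikit-learn are installed; the model, parameter, and metric names are illustrative):

```python
# Minimal MLflow experiment-tracking sketch. The model choice and the
# logged parameter/metric names are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)

    mlflow.log_param("n_estimators", n_estimators)               # experiment config
    mlflow.log_metric("accuracy", model.score(X_test, y_test))   # experiment result
    mlflow.sklearn.log_model(model, "model")                     # deployable artifact
```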

Data engineering workflows: Airflow can manage the scheduling of pipelines that process data with Beam.

You can use Airflow to author and schedule the execution of a Beam pipeline, which then runs on a processing engine like Google Cloud Dataflow.

Airflow acts as the "project manager" and Beam defines the "work" that needs to be done within that schedule.
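
A sketch of the "work": a minimal Beam pipeline (assumes apache-beam is installed; the file paths are hypothetical). Airflow would schedule this file, for example via a BashOperator or the Beam provider package's operators, swapping DirectRunner for DataflowRunner to execute on Google Cloud Dataflow:

```python
# Minimal Apache Beam pipeline sketch. Paths are hypothetical; use
# runner="DataflowRunner" (plus GCP options) to run on Cloud Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # local test runner

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("/data/raw/events.txt")
        | "NonEmpty" >> beam.Filter(lambda line: line.strip())
        | "Count" >> beam.combiners.Count.Globally()
        | "Write" >> beam.io.WriteToText("/data/prepared/event_count")
    )
```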

Tools

  • Apache NiFi
  • Apache Airflow
  • Prefect
  • Dagster
  • Airbyte
  • AWS Step Functions
  • Google Cloud Composer
  • Amazon MWAA