- Processing frameworks
- Automation (scheduling)
- MySQL
- PostgreSQL
- MongoDB
- Data cleaning
- Data aggregation
- Clustering
- Batch and Streaming
- Spark
- Hive
- Flink and Kafka
- Set up and manage workflows (see the DAG sketch after this list)
- Plan jobs with a schedule
- Resolve dependency requirements
- Apache Airflow
- Oozie
- Luigi
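
A minimal sketch of what scheduling and dependency management look like in Airflow, assuming a recent Airflow 2.x release (the `schedule` argument replaced `schedule_interval` in 2.4). The DAG id, task ids, and callables are illustrative placeholders, not a real pipeline:

```python
# Minimal Airflow DAG: a daily schedule with explicit task dependencies.
# DAG id, task ids, and callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator


def _clean():
    print("cleaning raw data")         # stand-in for the real cleaning step


def _aggregate():
    print("aggregating cleaned data")  # stand-in for the real aggregation step


with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # plan jobs with a schedule
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    clean = PythonOperator(task_id="clean", python_callable=_clean)
    aggregate = PythonOperator(task_id="aggregate", python_callable=_aggregate)

    # Dependency requirement: cleaning must finish before aggregation runs.
    start >> clean >> aggregate
```

Luigi and Oozie express the same ideas with tasks/`requires()` and XML workflow definitions respectively; the scheduling and dependency concepts carry over.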
- Get data from different sources (databases, APIs, etc.)
- Extract, process, and load data using Apache Spark, saving the result to an analytics-ready store (see the PySpark sketch after this list)
- Schedule the jobs using Apache Airflow
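
A sketch of the Spark ETL step; the S3 paths, schema, and column names are assumptions made for illustration:

```python
# Sketch of a PySpark ETL job: extract raw events, transform them, and
# load an analytics-ready Parquet dataset. Paths and column names are
# hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_job").getOrCreate()

# Extract: read raw events (could equally come from JDBC, an API dump, etc.)
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: basic cleaning plus a daily aggregation
cleaned = (
    raw.dropna(subset=["user_id", "event_time"])
       .withColumn("event_date", F.to_date("event_time"))
)
daily_counts = cleaned.groupBy("event_date", "event_type").count()

# Load: write the result partitioned by date for downstream analytics
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/analytics/daily_event_counts/"
)

spark.stop()
```

In practice, Airflow would submit a script like this on a schedule, for example with a spark-submit task defined via the Apache Spark provider package.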
Hive: offers extraction and query features for the ETL pipeline
It lets you query data on a cluster with SQL-like syntax (HiveQL), which is compiled into Hadoop jobs
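
One way to run a Hive-style query from Python is through PySpark with Hive support enabled; the `sales` table and its columns below are hypothetical, and on a classic Hive deployment the same HiveQL would be compiled into Hadoop jobs:

```python
# Sketch: running a Hive-style (HiveQL) query from PySpark with Hive
# support enabled. The `sales` table and its columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive_query")
    .enableHiveSupport()   # read tables registered in the Hive metastore
    .getOrCreate()
)

top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_customers.show()

spark.stop()
```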
Hadoop MapReduce performs many write operations to disk
It distributes data processing tasks across the nodes of a cluster
Spark keeps the processing in memory
Spark improves on the limitations of MapReduce
A Resilient Distributed Dataset (RDD) is a Spark data structure that manages data across multiple nodes.
RDDs are read-only, partitioned collections of elements
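
A small PySpark sketch of those RDD properties: the data is partitioned, each transformation produces a new (read-only) RDD, and nothing runs until an action is called. The values and partition count are arbitrary:

```python
# Sketch: creating and transforming an RDD. Each transformation returns
# a new RDD (RDDs are read-only); nothing executes until an action is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11), numSlices=4)   # partitioned collection
squares = numbers.map(lambda x: x * x)                # lazy transformation -> new RDD
even_squares = squares.filter(lambda x: x % 2 == 0)   # another derived RDD

print(even_squares.collect())          # action: triggers the in-memory computation
print(numbers.getNumPartitions())      # 4 partitions spread across the cluster

spark.stop()
```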
You might use Airflow to schedule a complex ML pipeline. Within that pipeline, you could use Beam for data processing and preparation, and then use MLflow to track the experiments and manage the deployed model.
Data engineering workflows: Airflow can manage the scheduling of data pipelines that process data using Beam.
You can use Airflow to author and schedule the execution of a Beam pipeline, which is then run on a processing engine like Google Cloud Dataflow.
Airflow acts as the "project manager" and Beam defines the "work" that needs to be done within that schedule.
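
A sketch of the Beam side of that split: a small Python pipeline that an Airflow task could launch on a schedule (for instance by submitting it to Dataflow). The bucket paths and the three-column CSV layout are assumptions:

```python
# Sketch of a small Apache Beam pipeline that Airflow could schedule.
# The bucket paths and CSV layout are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Runner, project, region, etc. would normally be passed in by the
    # Airflow task that launches this pipeline (e.g. on Dataflow).
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadRaw" >> beam.io.ReadFromText("gs://example-bucket/raw/events.csv")
            | "ParseAmount" >> beam.Map(lambda line: float(line.split(",")[2]))
            | "SumAmounts" >> beam.CombineGlobally(sum)
            | "WriteResult" >> beam.io.WriteToText("gs://example-bucket/prepared/total")
        )


if __name__ == "__main__":
    run()
```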
- Apache NiFi
- Airflow
- Prefect
- Dagster
- Airbyte
- AWS Step Functions
- Google Cloud Composer
- Amazon MWAA