Here’s a brief explanation of the key elements you’ll see in the Apache Spark History Server:
🔹 Application
A Spark application is a complete program that runs on the cluster. It includes everything from reading data to processing and writing results. Each application has its own driver and executors.
🧠 Think of it as: The entire job submitted to Spark.

🔹 Job
A job is triggered by an action (e.g., collect(), save(), count()). Spark breaks the application down into one or more jobs, and each job consists of stages.
🧠 Think of it as: A unit of work triggered by a user action.

🔹 Stage
A stage is a set of tasks that can be executed in parallel. Spark divides jobs into stages at shuffle boundaries; there are typically map stages and reduce stages.
🧠 Think of it as: A phase of the job in which tasks run in parallel.

🔹 Task
A task is the smallest unit of execution in Spark. Each task runs on one partition of the data, and tasks are distributed across executors.
🧠 Think of it as: One piece of work on one chunk of data.
Spark Jobs
As soon as an action such as collect() or count() is triggered, the driver program, which launches the Spark application and serves as its entry point, submits the work required by that action as a job, as shown in the figure below.
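To make that concrete, here is a minimal Scala sketch (the application name and the data are illustrative assumptions): the transformations are lazy, and only the count() action makes the driver submit a job, which then shows up on the History Server's Jobs tab.

```scala
import org.apache.spark.sql.SparkSession

object JobExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("job-example").getOrCreate()

    // Transformations are lazy: nothing runs on the cluster yet.
    val evens = spark.range(0, 1000000).filter("id % 2 = 0")

    // The action triggers the driver to submit a job for this computation,
    // which then appears as an entry on the History Server's Jobs tab.
    println(s"even numbers: ${evens.count()}")

    spark.stop()
  }
}
```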
Spark Stages
Now comes the concept of a stage. Whenever data has to be shuffled across the network, Spark splits the job into multiple stages, so a new stage boundary is created wherever a shuffle takes place. Stages can run in parallel or sequentially, depending on the dependencies between them.
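For example, in the following minimal sketch (reusing a SparkSession named spark like the one above; the sample data is made up), the groupBy aggregation forces a shuffle, so the job triggered by show() is split into a stage before the shuffle and a stage after it.

```scala
import org.apache.spark.sql.functions.count

// groupBy is a wide transformation: its aggregation requires a shuffle.
val words = spark.createDataFrame(Seq(
  ("spark", 1), ("stage", 1), ("spark", 1), ("task", 1)
)).toDF("word", "n")

// The shuffle introduces a stage boundary: one stage reads and hashes the
// rows, and a second stage aggregates the shuffled data per word.
val counts = words.groupBy("word").agg(count("*").as("occurrences"))
counts.show()
```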
Spark Tasks
A task is the single unit of computation performed on a single data partition, and it runs on a single core of a worker node. Whenever Spark performs a computation such as a transformation, it executes one task per partition of the data.
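As a small illustration (same assumed spark session; the partition count of 8 is arbitrary), the number of partitions directly determines how many tasks each stage runs:

```scala
// repartition(8) shuffles the data into exactly 8 partitions, so the stage
// that processes the repartitioned data runs 8 tasks, one per partition,
// each on a single executor core.
val data = spark.range(0, 10000000).repartition(8)

println(s"partitions: ${data.rdd.getNumPartitions}")  // 8
println(s"rows: ${data.count()}")
```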
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR (AWS Big Data Blog, 2019-04-09)
You can use two DataFrameReader APIs to specify partitioning:
- jdbc(url: String, table: String, partitionColumn: String, lowerBound: Long, upperBound: Long, numPartitions: Int, ...)
- jdbc(url: String, table: String, predicates: Array[String], ...)
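A minimal sketch of both forms (the connection URL, credentials, table, and column names are all hypothetical, and an existing SparkSession named spark is assumed):

```scala
import java.util.Properties

val url = "jdbc:postgresql://db-host:5432/sales"   // hypothetical database
val props = new Properties()
props.setProperty("user", "spark_user")
props.setProperty("password", "secret")

// Form 1: partition on a numeric column. Spark issues numPartitions range
// queries between lowerBound and upperBound, producing one partition each.
val byColumn = spark.read.jdbc(
  url, "orders",
  "order_id",   // partitioning column
  1L,           // lowerBound
  1000000L,     // upperBound
  8,            // numPartitions
  props)

// Form 2: explicit predicates. Each predicate becomes the WHERE clause of
// one partition, so this read produces 2 partitions.
val predicates = Array(
  "order_date >= '2019-01-01' AND order_date < '2019-07-01'",
  "order_date >= '2019-07-01' AND order_date < '2020-01-01'"
)
val byPredicates = spark.read.jdbc(url, "orders", predicates, props)

println(s"byColumn partitions: ${byColumn.rdd.getNumPartitions}")
println(s"byPredicates partitions: ${byPredicates.rdd.getNumPartitions}")
```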