@YannMoisan
Created November 10, 2016 13:58
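Does Catalyst optimize consecutive unionAll?
I was wondering whether Catalyst optimizes consecutive unionAll calls, so that a chain of unions is executed as a single operator rather than as a deep nest of binary unions. (A union by itself does not shuffle data, so the concern here is the depth of the plan, not the number of shuffles.)
So let's write a test: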
@ val dfs = (1 to 5).map(i => sc.parallelize((1 to 3).map(j => i + 3*j)).toDF("a"))
dfs: collection.immutable.IndexedSeq[org.apache.spark.sql.DataFrame] = Vector([a: int], [a: int], [a: int], [a: int], [a: int])
@ val merged = dfs.reduceLeft(_ unionAll _)
merged: org.apache.spark.sql.DataFrame = [a: int]
And show the plans:
@ merged.explain(true)
== Parsed Logical Plan ==
Union
:- Union
:  :- Union
:  :  :- Union
:  :  :  :- Project [_1#32 AS a#33]
:  :  :  :  +- LogicalRDD [_1#32], MapPartitionsRDD[33] at intRddToDataFrameHolder at cmd18.sc:1
:  :  :  +- Project [_1#34 AS a#35]
:  :  :     +- LogicalRDD [_1#34], MapPartitionsRDD[35] at intRddToDataFrameHolder at cmd18.sc:1
:  :  +- Project [_1#36 AS a#37]
:  :     +- LogicalRDD [_1#36], MapPartitionsRDD[37] at intRddToDataFrameHolder at cmd18.sc:1
:  +- Project [_1#38 AS a#39]
:     +- LogicalRDD [_1#38], MapPartitionsRDD[39] at intRddToDataFrameHolder at cmd18.sc:1
+- Project [_1#40 AS a#41]
   +- LogicalRDD [_1#40], MapPartitionsRDD[41] at intRddToDataFrameHolder at cmd18.sc:1
== Analyzed Logical Plan ==
a: int
Union
:- Union
:  :- Union
:  :  :- Union
:  :  :  :- Project [_1#32 AS a#33]
:  :  :  :  +- LogicalRDD [_1#32], MapPartitionsRDD[33] at intRddToDataFrameHolder at cmd18.sc:1
:  :  :  +- Project [_1#34 AS a#35]
:  :  :     +- LogicalRDD [_1#34], MapPartitionsRDD[35] at intRddToDataFrameHolder at cmd18.sc:1
:  :  +- Project [_1#36 AS a#37]
:  :     +- LogicalRDD [_1#36], MapPartitionsRDD[37] at intRddToDataFrameHolder at cmd18.sc:1
:  +- Project [_1#38 AS a#39]
:     +- LogicalRDD [_1#38], MapPartitionsRDD[39] at intRddToDataFrameHolder at cmd18.sc:1
+- Project [_1#40 AS a#41]
   +- LogicalRDD [_1#40], MapPartitionsRDD[41] at intRddToDataFrameHolder at cmd18.sc:1
== Optimized Logical Plan ==
Union
:- Union
:  :- Union
:  :  :- Union
:  :  :  :- Project [_1#32 AS a#33]
:  :  :  :  +- LogicalRDD [_1#32], MapPartitionsRDD[33] at intRddToDataFrameHolder at cmd18.sc:1
:  :  :  +- Project [_1#34 AS a#35]
:  :  :     +- LogicalRDD [_1#34], MapPartitionsRDD[35] at intRddToDataFrameHolder at cmd18.sc:1
:  :  +- Project [_1#36 AS a#37]
:  :     +- LogicalRDD [_1#36], MapPartitionsRDD[37] at intRddToDataFrameHolder at cmd18.sc:1
:  +- Project [_1#38 AS a#39]
:     +- LogicalRDD [_1#38], MapPartitionsRDD[39] at intRddToDataFrameHolder at cmd18.sc:1
+- Project [_1#40 AS a#41]
   +- LogicalRDD [_1#40], MapPartitionsRDD[41] at intRddToDataFrameHolder at cmd18.sc:1
== Physical Plan ==
Union
:- Project [_1#32 AS a#33]
:  +- Scan ExistingRDD[_1#32]
:- Project [_1#34 AS a#35]
:  +- Scan ExistingRDD[_1#34]
:- Project [_1#36 AS a#37]
:  +- Scan ExistingRDD[_1#36]
:- Project [_1#38 AS a#39]
:  +- Scan ExistingRDD[_1#38]
+- Project [_1#40 AS a#41]
   +- Scan ExistingRDD[_1#40]
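So the parsed, analyzed and optimized logical plans all keep the four nested binary Union nodes, but the physical plan contains a single Union operator with the five Project/Scan branches as its direct children. That's good news!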
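
To make the difference between the nested logical plan and the flat physical plan concrete, here is a toy sketch in plain Scala with no Spark dependency. The names `Plan`, `Leaf`, `BinaryUnion`, `collectChildren` and `depth` are hypothetical and are not Spark classes; the sketch only illustrates how a planner could collect the children of adjacent unions into one flat list, and does not claim to reproduce Spark's actual code.

```scala
// Toy model of a plan tree. These are hypothetical stand-ins, NOT Spark's classes.
sealed trait Plan
final case class Leaf(name: String) extends Plan
final case class BinaryUnion(left: Plan, right: Plan) extends Plan

object FlattenUnions {
  // Walk a chain of adjacent unions and return their non-union children, left to right.
  def collectChildren(plan: Plan): List[Plan] = plan match {
    case BinaryUnion(l, r) => collectChildren(l) ++ collectChildren(r)
    case other             => List(other)
  }

  // Number of union nodes on the longest path from the root.
  def depth(plan: Plan): Int = plan match {
    case BinaryUnion(l, r) => 1 + math.max(depth(l), depth(r))
    case _                 => 0
  }

  def main(args: Array[String]): Unit = {
    // Five inputs, mirroring the five DataFrames in the test above.
    val leaves: Seq[Plan] = (1 to 5).map(i => Leaf(s"df$i"))

    // reduceLeft builds a left-deep tree: ((((df1 U df2) U df3) U df4) U df5).
    val nested = leaves.reduceLeft[Plan](BinaryUnion(_, _))

    println(depth(nested))           // 4 nested unions, like the logical plans
    println(collectChildren(nested)) // the five leaves in their original order
  }
}
```

In this model, a single n-ary union over `collectChildren(nested)` has the same shape as the physical plan shown above: one Union with five direct children.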