Created
November 10, 2016 13:58
-
-
Save YannMoisan/5b8f6cd51399db1a5e2f6a93a2dbfba2 to your computer and use it in GitHub Desktop.
Does catalyst optimize consecutive unionAll ?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| I was wondering if catalyst optimize consecutive unionAll (so that the number of shuffles will be limited). | |
| So let's write a test | |
| @ val dfs = (1 to 5).map(i => sc.parallelize((1 to 3).map(j => i + 3*j)).toDF("a")) | |
| dfs: collection.immutable.IndexedSeq[org.apache.spark.sql.DataFrame] = Vector([a: int], [a: int], [a: int], [a: int], [a: int]) | |
| @ val merged = dfs.reduceLeft(_ unionAll _) | |
| merged: org.apache.spark.sql.DataFrame = [a: int] | |
| And show the plan | |
| @ merged.explain(true) | |
| == Parsed Logical Plan == | |
| Union | |
| :- Union | |
| : :- Union | |
| : : :- Union | |
| : : : :- Project [_1#32 AS a#33] | |
| : : : : +- LogicalRDD [_1#32], MapPartitionsRDD[33] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : : : +- Project [_1#34 AS a#35] | |
| : : : +- LogicalRDD [_1#34], MapPartitionsRDD[35] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : : +- Project [_1#36 AS a#37] | |
| : : +- LogicalRDD [_1#36], MapPartitionsRDD[37] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : +- Project [_1#38 AS a#39] | |
| : +- LogicalRDD [_1#38], MapPartitionsRDD[39] at intRddToDataFrameHolder at cmd18.sc:1 | |
| +- Project [_1#40 AS a#41] | |
| +- LogicalRDD [_1#40], MapPartitionsRDD[41] at intRddToDataFrameHolder at cmd18.sc:1 | |
| == Analyzed Logical Plan == | |
| a: int | |
| Union | |
| :- Union | |
| : :- Union | |
| : : :- Union | |
| : : : :- Project [_1#32 AS a#33] | |
| : : : : +- LogicalRDD [_1#32], MapPartitionsRDD[33] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : : : +- Project [_1#34 AS a#35] | |
| : : : +- LogicalRDD [_1#34], MapPartitionsRDD[35] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : : +- Project [_1#36 AS a#37] | |
| : : +- LogicalRDD [_1#36], MapPartitionsRDD[37] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : +- Project [_1#38 AS a#39] | |
| : +- LogicalRDD [_1#38], MapPartitionsRDD[39] at intRddToDataFrameHolder at cmd18.sc:1 | |
| +- Project [_1#40 AS a#41] | |
| +- LogicalRDD [_1#40], MapPartitionsRDD[41] at intRddToDataFrameHolder at cmd18.sc:1 | |
| == Optimized Logical Plan == | |
| Union | |
| :- Union | |
| : :- Union | |
| : : :- Union | |
| : : : :- Project [_1#32 AS a#33] | |
| : : : : +- LogicalRDD [_1#32], MapPartitionsRDD[33] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : : : +- Project [_1#34 AS a#35] | |
| : : : +- LogicalRDD [_1#34], MapPartitionsRDD[35] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : : +- Project [_1#36 AS a#37] | |
| : : +- LogicalRDD [_1#36], MapPartitionsRDD[37] at intRddToDataFrameHolder at cmd18.sc:1 | |
| : +- Project [_1#38 AS a#39] | |
| : +- LogicalRDD [_1#38], MapPartitionsRDD[39] at intRddToDataFrameHolder at cmd18.sc:1 | |
| +- Project [_1#40 AS a#41] | |
| +- LogicalRDD [_1#40], MapPartitionsRDD[41] at intRddToDataFrameHolder at cmd18.sc:1 | |
| == Physical Plan == | |
| Union | |
| :- Project [_1#32 AS a#33] | |
| : +- Scan ExistingRDD[_1#32] | |
| :- Project [_1#34 AS a#35] | |
| : +- Scan ExistingRDD[_1#34] | |
| :- Project [_1#36 AS a#37] | |
| : +- Scan ExistingRDD[_1#36] | |
| :- Project [_1#38 AS a#39] | |
| : +- Scan ExistingRDD[_1#38] | |
| +- Project [_1#40 AS a#41] | |
| +- Scan ExistingRDD[_1#40] | |
| So the physical plan contains a single operation. That's good news ! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment