
@smdmts
Last active June 29, 2018 01:23
Spark 2.0: how to add a hash column to each row
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import java.security._

val df = sc.parallelize(Seq(
  (1.0, 2.0), (0.0, -1.0),
  (3.0, 4.0), (6.0, -2.3))).toDF("x", "y")

// Hex-encoded MD5 digest of a string.
def mkMD5(text: String): String =
  MessageDigest.getInstance("MD5").digest(text.getBytes).map("%02x".format(_)).mkString

// Append the MD5 of the comma-joined row values as an extra field.
def transformRow(row: Row): Row = Row.fromSeq(row.toSeq :+ mkMD5(row.mkString(",")))
def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)

// Extend the schema with the new non-nullable string column, then rebuild the DataFrame.
val newSchema = StructType(df.schema.fields :+ StructField("row_hash", StringType, nullable = false))
sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show()
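As an alternative sketch, Spark 2.0 also ships built-in `md5` and `concat_ws` SQL functions, which avoid the RDD round-trip entirely. This assumes the same `df` with columns `x` and `y` as above; note that `concat_ws` uses Spark's own string casting, which matches `row.mkString(",")` for these simple doubles but may format other types differently.

```scala
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// Hash the comma-joined string form of every column, all inside the DataFrame API.
val withHash = df.withColumn("row_hash", md5(concat_ws(",", df.columns.map(col): _*)))
withHash.show(false)
```

Because this stays in the DataFrame API, Catalyst can optimize it and no schema has to be rebuilt by hand.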
smdmts commented Jun 29, 2018

+---+----+--------------------+
|  x|   y|            row_hash|
+---+----+--------------------+
|1.0| 2.0|9a7ed087de322c808...|
|0.0|-1.0|7bdc66657c67783ba...|
|3.0| 4.0|91734025503db2a16...|
|6.0|-2.3|c4e852625fee76eca...|
+---+----+--------------------+
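The `mkMD5` helper can be sanity-checked on the plain JVM, without Spark: the first row `(1.0, 2.0)` is serialized by `row.mkString(",")` as `"1.0,2.0"`, and its digest should match the `row_hash` prefix shown for that row above.

```scala
import java.security.MessageDigest

// Same helper as in the gist: hex-encode the MD5 digest of a string.
def mkMD5(text: String): String =
  MessageDigest.getInstance("MD5").digest(text.getBytes).map("%02x".format(_)).mkString

// "1.0,2.0" is what row.mkString(",") produces for the first row.
println(mkMD5("1.0,2.0"))  // starts with 9a7ed087de322c808, as in the table above
```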
