@nomoa
Created August 31, 2020 07:40
extract turtle from parquet
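import java.io.BufferedOutputStream
import java.nio.file.{Files, Paths}
import java.util.zip.GZIPOutputStream
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}
import org.eclipse.rdf4j.rio.{RDFFormat, Rio}

// StatementEncoder is not defined here; it is assumed to be in scope, serializable, and to decode a Row into an RDF4J Statement.
def extract(implicit spark: SparkSession): Unit = {
  val df = spark.read.parquet("...")
  val prefix = "/path/file" // each partition writes <prefix>-<partition>.ttl.gz on its executor's local filesystem
  val encoder = new StatementEncoder()
  // The explicit Iterator[Row] type helps Scala 2.12 pick the Scala overload of foreachPartition.
  df.foreachPartition((rows: Iterator[Row]) => {
    val partition = TaskContext.getPartitionId()
    val writer = new GZIPOutputStream(new BufferedOutputStream(Files.newOutputStream(Paths.get(s"$prefix-$partition.ttl.gz"))))
    try {
      val rdfWriter = Rio.createWriter(RDFFormat.TURTLE, writer)
      rdfWriter.startRDF()
      rows.foreach(row => rdfWriter.handleStatement(encoder.decode(row)))
      rdfWriter.endRDF()
    } finally {
      writer.close() // close the stream even if decoding or writing fails
    }
  })
}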