LZO is a compression format well suited to files stored in Hadoop's HDFS. It offers a valuable combination of speed and compression ratio, and thanks to hadoop-lzo the .lzo files can be made splittable as well.
- Install the lzo and lzop packages [OSX].
$ brew install lzo lzop
- Find where the headers and libraries are installed.
$ brew list lzo
The output should look as follows:
/usr/local/Cellar/lzo/2.06/include/lzo/ (13 files)
/usr/local/Cellar/lzo/2.06/lib/liblzo2.2.dylib
/usr/local/Cellar/lzo/2.06/lib/ (2 other files)
/usr/local/Cellar/lzo/2.06/share/doc/ (7 files)
- Clone the hadoop-lzo repository.
$ git clone https://github.com/twitter/hadoop-lzo
$ cd hadoop-lzo
- Build the project (Maven required).
$ C_INCLUDE_PATH=/usr/local/Cellar/lzo/2.06/include/lzo/ LIBRARY_PATH=/usr/local/Cellar/lzo/2.06/lib/ mvn clean install
- Copy the libraries into the Hadoop installation directory. We assume that HADOOP_INSTALL points to the Hadoop installation folder (for example /usr/local/hadoop).
$ cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar $HADOOP_INSTALL/lib
$ mkdir -p $HADOOP_INSTALL/lib/lzo
$ cp -r target/native/* $HADOOP_INSTALL/lib/lzo
- Add the hadoop-lzo jar and the native libraries to Hadoop's classpath and library path. Do it either in ~/.bash_profile or in $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo"
- Add the lzo compression codecs to Hadoop's $HADOOP_INSTALL/etc/hadoop/core-site.xml.
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec
  </value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
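As a quick sanity check, Hadoop's CompressionCodecFactory should now resolve the .lzo extension to the LzopCodec registered above. This is a minimal sketch, not part of the original setup; the class name LzoCodecCheck is an assumption, and it presumes the core-site.xml above is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Minimal sketch: verify that the codecs registered in core-site.xml are
// visible to Hadoop's codec factory (assumes core-site.xml is on the classpath).
public class LzoCodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // LzopCodec registers the .lzo extension, so it should be returned here.
        CompressionCodec codec = factory.getCodec(new Path("file.lzo"));
        System.out.println(codec == null ? "lzo codec not found" : codec.getClass().getName());
    }
}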
- Add the lzo dependencies to the Apache Spark configuration $SPARK_INSTALL/conf/spark-env.sh.
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_INSTALL/lib/native/osx:$HADOOP_INSTALL/lib/native/lzo
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar
- Add the lzo compression codec to the Hadoop Configuration instance that you pass to the SparkContext (driver).
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
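With the codec in place, .lzo files can be read in Spark through hadoop-lzo's LzoTextInputFormat, which also honours the index files created by the LzoIndexer step below. The following is a minimal sketch using Spark's Java API; the path input/file.lzo, the application name and the class name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.hadoop.mapreduce.LzoTextInputFormat;

// Minimal sketch: read an .lzo file as splittable text in Spark.
public class LzoSparkRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lzo-read"));

        Configuration conf = new Configuration();
        conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");

        // LzoTextInputFormat uses the .index files, so an indexed file
        // is read in parallel instead of as a single split.
        JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
                "input/file.lzo", LzoTextInputFormat.class,
                LongWritable.class, Text.class, conf);

        JavaRDD<String> lines = records.map(pair -> pair._2().toString());
        System.out.println(lines.count());

        sc.stop();
    }
}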
- Convert a file (for example a bz2 one) to the lzo format and import the new file into Hadoop's HDFS.
$ bzip2 --decompress --stdout file.bz2 | lzop -o file.lzo
$ hdfs dfs -put file.lzo input
- Index the lzo compressed files directly in HDFS.
$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input/file.lzo
or index all lzo files in the input folder:
$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer input
or index the lzo files with a MapReduce job:
$ hadoop jar $HADOOP_INSTALL/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer input
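To actually benefit from the index in a MapReduce job, the job has to use LzoTextInputFormat instead of the default TextInputFormat. Below is a minimal, map-only sketch; the class names and the input/output paths are illustrative, not part of the original guide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

// Minimal sketch of a map-only job that reads the indexed .lzo input in parallel.
public class LzoReadJob {

    // Identity mapper: passes each line through unchanged.
    public static class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo-read-job");
        job.setJarByClass(LzoReadJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Indexed .lzo files in "input" are split across mappers.
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}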