Background
Probably the largest free dataset available on the Internet is the full XML dump of the English Wikipedia. This dataset in it’s uncompressed form is about 5.5Tb and still growing. The sheer size of this dataset poses some serious challenges to analyze the data. In theory, Hadoop would be a great tool to analyze this dataset but it turns out that this is not necessarily the case.
Jimmy Lin wrote the Cloud9 a Hadoop InputReader that can handle the stub Wikipedia dump files (the stub dump files contain all variables as in the full dump file with the exception of the text of each revision). Unfortanutely, his InputReader does not support the full XML dump files.
The way that the XML dump files are organized is as follows: each dump file starts with some metadata tags and after that come the
- Some pages are so large (10′s of Gb’s) that you will run inevitable into out of memory errors.
- Splitting by
tag leads to serious information loss as you don’t know to which page a revision belongs.
Hence, Hadoop’s StreamXmlRecordReader is not suitable to analyze the full Wikipedia dump files.
During the last couple of weeks, the Wikimedia Foundation fellows of the Summer of Research have been working hard on tackling this problem. In particular a big thank you to Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin. We have released a customized InputFormat for the full Wikipedia dump files that supports both the compressed (bz2) and uncompressed files. The project is called WikiHadoop and the code is available on Github at https://github.com/whym/wikihadoop
Features of WikiHadoop
Wikihadoop offers the following:
- WikiHadoop uses Hadoop’s streaming interface, so you can write your own mapper in Python, Ruby, Hadoop Pipes or Java.
- You can choose between sending 1 or 2 revisions to a mapper. If you choose two revisions then it will send two consecutive revisions from a single page to a mapper. These two revisions can be used to create a diff between them (what has been added / removed). The syntax for this option is:
-D org.wikimedia.wikihadoop.previousRevision=false (true is the default)
- You can specify which namespaces to include when parsing the XML files. Default behavior is to include all namespaces. You can specify this by entering a regular expression. The syntax for this option is:
-D org.wikimedia.wikihadoop.ignorePattern='xxxx'
- You can parse both bz2 compressed and uncompressed files using WikiHadoop.
Getting Ready
- Install and configure Hadoop 0.21. The reason you need Hadoop 0.21 is that it has streaming support for bz2 files and Hadoop 0.20 does not support this. Good places to look for help on configuration can be found http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ and http://hadoop.apache.org/common/docs/current/cluster_setup.html.
- Download WikiHadoop and extract the source tree. Confirm there is a directory called mapreduce.
- Download Hadoop Common and extract the source tree. Confirm there is a directory called mapreduce.
- Move to the top directory of the source tree of your copy of Hadoop Common.
- Merge the mapreduce directory of your copy of WikiHadoop into that of Hadoop Common.
rsync -r ../wikihadoop/mapreduce/ mapreduce/
- Move to the directory called mapreduce/src/contrib/streaming under the source tree of Hadoop Common.
cd mapreduce/src/contrib/streaming
- Run Ant to build a jar file.
ant jar
- Find the jar file at mapreduce/build/contrib/streaming/hadoop-${version}-streaming.jar under the Hadoop common source tree.
If everything went smoothly then you should now have built the Wikihadoop InputReader and a functioning installation of Hadoop. If you have difficulties compiling WikiHadoop then please contact us, we are happy to help you out.
Tutorial
So now we are ready to start crunching some serious data!
- Download the latest full dump from http://dumps.wikimedia.org/enwiki/latest/. Look for the files that start with enwiki-latest-pages-meta-history and end with ‘bz2′. You can also download the 7z files but then you will need to decompress them. Hadoop cannot stream 7z files at the moment.
- Copy the bz2 files to HDFS. Make sure you have enough space, you can delete the bz2 files from your regular partition after they have been copied to HDFS.
hdfs dfs -copyFromLocal /path/to/dump/files/enwiki-
-pages-meta-history .xml.bz2 /path/on/hdfs/ You can check to see if the files were successfully copy to hdfs via:
hdfs dfs -ls /path/on/hdfs/
- Once the files are in HDFS, you can launch Hadoop by entering the following command:
hadoop jar hadoop-0.<version>-streaming.jar -D mapred.child.ulimit=3145728 -D mapreduce.task.timeout=0 -D mapreduce.input.fileinputformat.split.minsize=400000000 #Sets the file split size a smaller size will mean more seeking and SLOWER processing time -D mapred.output.compress=true -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 -output /path/on/hdfs/out -mapper <name of mapper> -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
- You can customize your job with the following parameters:
- -D
. This is a regular expression that determines which namespaces to include. The default is to include all the namspaces. - -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec. This compresses the output of the mapper using the LZO compression algorithm. This is optional but it saves hard disk space.
Real Life Application
At the Wikimedia Foundation we wanted to have a more fine-grained understanding of the different types of editors that we have. To analyze this, we need to generate the diffs between two revisions to see what type of content an editor has removed and added. In the examples folder of https://github.com/whym/wikihadoop you can find our mapper function that creates diffs based on the two revisions it receives from WikiHadoop. We set the number of reducers to 0 as there is no aggregation over the diffs, we want just want to store them.
You can launch this as follows:
hadoop jar hadoop-0.<version>-streaming.jar -D mapred.child.ulimit=3145728 -D mapreduce.task.timeout=0 -D mapreduce.input.fileinputformat.split.minsize=400000000 -D mapred.reduce.tasks=0 -D mapred.output.compress=true -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 -output /path/on/hdfs/out -mapper /path/to/wikihadoop/examples/mapper.py -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
Depending on the number of nodes in your cluster, the number of cores in each node, and memory on each node, this job will run for quite a while. We are running it on a three node mini-cluster with quad-core machines and the job takes about 14 days to parse the entire English Wikpedia dump files.
Pingback: pinboard August 16, 2011 — arghh.net
Nice job! Can I use the InputFormat in Hadoop 0.20.2 if I have the dump uncompressed? We’re running only this version on our cluster at the moment. Or maybe I can backport it?
Making WikiHadoop 0.20-compatible seems a bit of work. To make it backwards compatible, you have to use classes and methods from 0.20 that will be deprecated in 0.21. So yes, it is possible but migrating to 0.21 might be more easy.
Pingback: Strata Week: Cracking a book’s genetic code | National Cyber Security
Very nice, thanks a lot!
What about the latest stable version 0.20.203.0 ? Since version 0.21 is still considered unstable, admins from bigger cluster setups are likely avoid migrating in the moment.
Pingback: Strata Week: Cracking a book’s genetic code - NEW BIOTECHNOLOGIES – NEW BIOTECHNOLOGIES
Could you specify which version of Hadoop-Common 0.21 you are using? rc0?
There is no directory named mapreduce in the hadoop commons package.
But there is a directory with name mapred.
I cannot find a directory mapreduce in hadoop common. Which one should I download and where is the directory?
To those of you who could not compile it:
A compiled copy of WikiHadoop is now available on GitHub. Please try using it, to avoid the pain of building it by yourself:
https://github.com/whym/wikihadoop/downloads
klvn says:
> September 25, 2011 at 7:04 pm
> Could you specify which version of Hadoop-Common 0.21 you are using? rc0?
Please use the repository version of Hadoop-common 0.21:
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21/
or Hadoop-common 0.21.0 on the GitHub site:
https://github.com/apache/hadoop-common/downloads
I have been using the repository version of Hadoop 0.21. The one I recently compiled against was revision 1135003.
I have seen some problems in building against other versions of Hadoop, even if they are called 0.21.x.
It would be great if there was a version of wikihadoop which compiles towards hadoop 1.0.1. Is there any working going on to get this done?
Or can you give me some more background information on the splitting feature?
Kind regards,
Herbert
Hi Herbert,
Yes we are looking in updating the code and make it work with more recent Hadoop versions. Hopefully I can detail progress soon.
Best,
Diederik
Hi Diederik.
Can you keep me updated on the development, I wouldn’t mind to be a beta tester for the updated code.
Best wishes,
Herbert
Hi Herbert,
I will connect you with our developer! Thanks so much for the offer.
Best,
Diederik
Hi Herbert,
You might want to visit and see what wil happen at https://github.com/whym/wikihadoop/issues/8 where we are trying to port WikiHadoop to Hadoop 0.20.x. Hopefully it will help us get it work with 1.0.0 too, since 1.0.0 was essentially a renamed 0.20-security.
Best,
Yusuke
I like to participate in this project as a Developer.
Contact me via email.
Best,
Mihir
Hi Mihir!
That’s an amazing offer, probably the easiest thing to do is to clone the repository on github: https://github.com/whym/wikihadoop
or check CDH4 compatibility. Please let me know if you have any questions, we are here to help you!
Then, have a look at the open issues list to see what bugs need to be fixed or you can work on Hadoop 1.0 compatibility
Best,
Diederik
Pingback: Strata Week: Cracking a book's genetic code - O'Reilly Radar
Thank you very much for the tutorial, Diederik. However, I have some problem from merging mapreduce/ directory.
So, I downloaded wikihadoop from here: https://github.com/whym/wikihadoop, and after building, there’s a directory called mapred/ that contains auto.bz2
However, hadoop-common (downloaded from here https://github.com/apache/hadoop-common) does not have a mapreduce/ directory. There’s a hadoop-mapreduce-project/ directory but it does not have mapreduce/src/contrib/streaming/ subdirectory as mentioned above.
Did I misunderstand something? Can you clarify please?
Thanks in advance.
Khanh
(responding to Khanh’s post above)
I think the most likely situation would be that the instruction you followed was outdated. Could you try the “How to use” section onhttps://github.com/whym/wikihadoop/blob/master/README.rst? That would give you most up-to-date information.
We have made a significant change in the building procedure (namely from ant to maven) since this blog post, and if you followed the “Getting Ready” section here, things would not have work properly.
Cheers,
Yusuke
Thanks for your response, Yusuke. I ended up using the jar file and it seems working fine. It ran well with Hadoop’s default mapper. I’m trying to write mapper and reducer for it now.
Thanks,
Thanks a lot for the tutorial, Diederik. Though, I am getting an error message when I run it with a sample bz2 dump. I am on CDH4.1.2. Here are the steps:
- I followed the instructions and cloned the repo for the latest wikihadoop
- built it fine
- changed the mvn’s settings.xml and added the repo as instructed
- ran the following command, it ran for a while then gave the following error (received the same error even with another bz2 dump file):
Command:
hadoop jar /home/james/hadoop-2.0.0-cdh4.1.2/share/hadoop/tools/lib/hadoop-streaming-2.0.0-cdh4.1.2.jar -libjars target/wikihadoop-0.2.jar -D mapreduce.input.fileinputformat.split.minsize=300000000 -D mapreduce.task.timeout=6000000 -input /home/james/Class-Project/enwiki-latest-pages-meta-history21.xml-p015526402p015725000.bz2 -output output-wiki -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
Error message:
13/04/15 13:08:57 WARN mapred.LocalJobRunner: job_local_0001
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:998)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:550)
at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:43)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:232)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
13/04/15 13:08:57 INFO mapreduce.Job: Job job_local_0001 failed with state FAILED due to: NA
13/04/15 13:08:58 INFO mapreduce.Job: Counters: 23
File System Counters
FILE: Number of bytes read=4393078884
FILE: Number of bytes written=1362860
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=5
Map output records=0
Map output bytes=0
Map output materialized bytes=0
Input split bytes=745
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=19636
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2842165248
File Input Format Counters
Bytes Read=1299193764
org.wikimedia.wikihadoop.StreamWikiDumpInputFormat$WikiDumpCounters
FOUND_PAGES=152920
WRITTEN_PAGES=5
WRITTEN_REVISIONS=5
13/04/15 13:08:58 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Any help is appreciated.
James
Hey James! Thanks for sending this to me, maybe you could file this as a bug report on the Github repo? I don’t have a solution off the cuff of my hand
Diederik
Hi Diederik
we can change the input format to resolve it by adding ” -D mapred.mapoutput.key.class=org.apache.hadoop.io.Text”