Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop

Background

Probably the largest free dataset available on the Internet is the full XML dump of the English Wikipedia. In its uncompressed form this dataset is about 5.5 TB and still growing. The sheer size of this dataset poses serious challenges for analysis. In theory, Hadoop would be a great tool for analyzing it, but it turns out that this is not necessarily the case.

Jimmy Lin wrote Cloud9, a Hadoop InputReader that can handle the stub Wikipedia dump files (the stub dump files contain all the fields of the full dump files with the exception of the text of each revision). Unfortunately, his InputReader does not support the full XML dump files.

The XML dump files are organized as follows: each dump file starts with some metadata tags, followed by the page tags that contain the revisions. Hadoop has a StreamXmlRecordReader that allows you to grab an XML fragment and send it as input to a mapper. This poses two problems:

  • Some pages are so large (tens of gigabytes) that you will inevitably run into out-of-memory errors.
  • Splitting by tag leads to serious information loss as you don’t know to which page a revision belongs.

Hence, Hadoop’s StreamXmlRecordReader is not suitable to analyze the full Wikipedia dump files.
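
To make the structure concrete: every <revision> element is nested inside a <page> element, and only the <page> carries the title and page id, so any reader that splits the file on <revision> tags alone throws that context away. The following is a minimal sketch, independent of Hadoop and WikiHadoop, of a streaming parser that has to carry the page title along while it walks a dump file; the export namespace URI is an assumption and varies between dump versions.

    # Minimal sketch, not part of WikiHadoop: stream a dump file and pair each
    # revision with the title of its enclosing <page> element.
    import bz2
    import xml.etree.ElementTree as ET

    NS = '{http://www.mediawiki.org/xml/export-0.10/}'  # assumption: the namespace differs per dump version

    def iter_revisions(path):
        """Yield (page_title, revision_id) pairs from a bz2-compressed dump file."""
        title = None
        with bz2.open(path, 'rb') as dump:
            for _, elem in ET.iterparse(dump, events=('end',)):
                if elem.tag == NS + 'title':
                    title = elem.text                  # remember the enclosing page
                elif elem.tag == NS + 'revision':
                    yield title, elem.find(NS + 'id').text
                    elem.clear()                       # free the (possibly huge) revision text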

During the last couple of weeks, the Wikimedia Foundation fellows of the Summer of Research have been working hard on tackling this problem. In particular, a big thank you to Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin. We have released a customized InputFormat for the full Wikipedia dump files that supports both compressed (bz2) and uncompressed files. The project is called WikiHadoop and the code is available on GitHub at https://github.com/whym/wikihadoop

Features of WikiHadoop

WikiHadoop offers the following:

  • WikiHadoop uses Hadoop’s streaming interface, so you can write your own mapper in Python, Ruby, Hadoop Pipes or Java (a minimal mapper sketch follows this list).
  • You can choose between sending one or two revisions to a mapper. If you choose two revisions, WikiHadoop sends two consecutive revisions of a single page to the mapper, so the mapper can compute a diff between them (what has been added/removed). The syntax for this option is:
    -D org.wikimedia.wikihadoop.previousRevision=false (true is the default)
    
  • You can specify which namespaces to ignore when parsing the XML files by providing a regular expression; by default, no namespaces are ignored. The syntax for this option is:
    -D org.wikimedia.wikihadoop.ignorePattern='xxxx'
    
  • You can parse both bz2 compressed and uncompressed files using WikiHadoop.
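
Because WikiHadoop plugs into Hadoop’s streaming interface, a mapper is simply a program that reads records from standard input and writes tab-separated key/value lines to standard output. The following is a hypothetical sketch, not the mapper shipped with WikiHadoop: it assumes each record arrives as a <page> fragment containing the page header plus one or two <revision> elements and simply emits a revision count per page title; consult the examples in the WikiHadoop repository for the exact record framing.

    #!/usr/bin/env python
    # Hypothetical streaming mapper sketch; assumes one <page> ... </page>
    # fragment per record on stdin and emits "title<TAB>revision_count" lines.
    import re
    import sys

    def emit(fragment):
        """Write one tab-separated key/value line for a page fragment."""
        match = re.search(r'<title>(.*?)</title>', fragment, re.DOTALL)
        title = match.group(1) if match else 'UNKNOWN'
        sys.stdout.write('%s\t%d\n' % (title, fragment.count('<revision')))

    buffered = []
    for line in sys.stdin:
        buffered.append(line)
        if '</page>' in line:              # end of one record
            emit(''.join(buffered))
            buffered = []
    if buffered:                           # trailing partial record, if any
        emit(''.join(buffered))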

Getting Ready

  • Install and configure Hadoop 0.21. You need Hadoop 0.21 because it supports streaming bz2 files, which Hadoop 0.20 does not. Good places to look for help with configuration are http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ and http://hadoop.apache.org/common/docs/current/cluster_setup.html.
  • Download WikiHadoop and extract the source tree. Confirm there is a directory called mapreduce.
  • Download Hadoop Common and extract the source tree. Confirm there is a directory called mapreduce.
  • Move to the top directory of the source tree of your copy of Hadoop Common.
  • Merge the mapreduce directory of your copy of WikiHadoop into that of Hadoop Common.
    rsync -r ../wikihadoop/mapreduce/ mapreduce/
    
  • Move to the directory called mapreduce/src/contrib/streaming under the source tree of Hadoop Common.
    cd mapreduce/src/contrib/streaming
    
  • Run Ant to build a jar file.
    ant jar
    
  • Find the jar file at mapreduce/build/contrib/streaming/hadoop-${version}-streaming.jar under the Hadoop common source tree.

If everything went smoothly, you should now have built the WikiHadoop InputReader and have a functioning installation of Hadoop. If you have difficulties compiling WikiHadoop, please contact us; we are happy to help you out.

Tutorial

So now we are ready to start crunching some serious data!

  • Download the latest full dump from http://dumps.wikimedia.org/enwiki/latest/. Look for the files that start with enwiki-latest-pages-meta-history and end with ‘bz2’. You can also download the 7z files, but then you will need to decompress them; Hadoop cannot stream 7z files at the moment.
  • Copy the bz2 files to HDFS. Make sure you have enough space; you can delete the bz2 files from your regular partition after they have been copied to HDFS.
    hdfs dfs -copyFromLocal /path/to/dump/files/enwiki--pages-meta-history.xml.bz2 /path/on/hdfs/
    

    You can check whether the files were successfully copied to HDFS via:

    hdfs dfs -ls /path/on/hdfs/
    
  • Once the files are in HDFS, you can launch Hadoop by entering the following command:
    hadoop jar hadoop-0.<version>-streaming.jar \
    -D mapred.child.ulimit=3145728 \
    -D mapreduce.task.timeout=0 \
    -D mapreduce.input.fileinputformat.split.minsize=400000000 \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
    -input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 \
    -output /path/on/hdfs/out \
    -mapper <name of mapper> \
    -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat

    The split.minsize parameter sets the file split size; a smaller split size means more seeking and slower processing.
    
  • You can customize your job with the following parameters:
  • -D org.wikimedia.wikihadoop.ignorePattern='xxxx'. This is a regular expression that determines which namespaces to ignore. The default is to not ignore any namespaces.
  • -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec. This compresses the output of the mapper using the LZO compression algorithm. This is optional but it saves hard disk space.

Real Life Application

At the Wikimedia Foundation we wanted to have a more fine-grained understanding of the different types of editors that we have. To analyze this, we need to generate the diffs between two revisions to see what type of content an editor has removed and added. In the examples folder of https://github.com/whym/wikihadoop you can find our mapper function that creates diffs based on the two revisions it receives from WikiHadoop. We set the number of reducers to 0 as there is no aggregation over the diffs; we just want to store them.
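
The snippet below is only an illustrative sketch of that diff step, not the mapper from the examples folder: it assumes the wikitext of the two consecutive revisions has already been extracted from the record WikiHadoop passes to the mapper, and it uses Python’s difflib to separate added from removed lines.

    # Illustrative sketch of the diff step; old_text and new_text are assumed to
    # be the wikitext of two consecutive revisions of the same page.
    import difflib

    def revision_diff(old_text, new_text):
        """Return the lines added to and removed from a page between two revisions."""
        added, removed = [], []
        diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(), lineterm='')
        for line in diff:
            if line.startswith('+') and not line.startswith('+++'):
                added.append(line[1:])
            elif line.startswith('-') and not line.startswith('---'):
                removed.append(line[1:])
        return added, removed

    if __name__ == '__main__':
        added, removed = revision_diff('Hello world.\nFirst try.', 'Hello world!\nSecond try.')
        print('added:', added)      # ['Hello world!', 'Second try.']
        print('removed:', removed)  # ['Hello world.', 'First try.']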

You can launch this as follows:

hadoop jar hadoop-0.<version>-streaming.jar \
-D mapred.child.ulimit=3145728 \
-D mapreduce.task.timeout=0 \
-D mapreduce.input.fileinputformat.split.minsize=400000000 \
-D mapred.reduce.tasks=0 \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec \
-input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 \
-output /path/on/hdfs/out \
-mapper /path/to/wikihadoop/examples/mapper.py \
-inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat

Depending on the number of nodes in your cluster, the number of cores in each node, and the memory in each node, this job will run for quite a while. We are running it on a three-node mini-cluster with quad-core machines, and the job takes about 14 days to parse the entire set of English Wikipedia dump files.


24 Responses to Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop

  1. Pingback: pinboard August 16, 2011 — arghh.net

  2. Evert says:

    Nice job! Can I use the InputFormat in Hadoop 0.20.2 if I have the dump uncompressed? We’re running only this version on our cluster at the moment. Or maybe I can backport it?

    • Diederik says:

      Making WikiHadoop 0.20-compatible seems a bit of work. To make it backwards compatible, you have to use classes and methods from 0.20 that will be deprecated in 0.21. So yes, it is possible, but migrating to 0.21 might be easier.

  3. Pingback: Strata Week: Cracking a book’s genetic code | National Cyber Security

  4. Kalvin says:

    Very nice, thanks a lot!

    What about the latest stable version, 0.20.203.0? Since version 0.21 is still considered unstable, admins of bigger cluster setups are likely to avoid migrating at the moment.

  5. Pingback: Strata Week: Cracking a book’s genetic code - NEW BIOTECHNOLOGIES – NEW BIOTECHNOLOGIES

  6. klvn says:

    Could you specify which version of Hadoop-Common 0.21 you are using? rc0?

  7. ablimit says:

    There is no directory named mapreduce in the Hadoop Common package.
    But there is a directory named mapred.

  8. chun says:

    I cannot find a directory mapreduce in hadoop common. Which one should I download and where is the directory?

  9. To those of you who could not compile it:
    A compiled copy of WikiHadoop is now available on GitHub. Please try using it, to avoid the pain of building it by yourself:
    https://github.com/whym/wikihadoop/downloads

    klvn says:
    > September 25, 2011 at 7:04 pm
    > Could you specify which version of Hadoop-Common 0.21 you are using? rc0?

    Please use the repository version of Hadoop-common 0.21:
    https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21/
    or Hadoop-common 0.21.0 on the GitHub site:
    https://github.com/apache/hadoop-common/downloads

    I have been using the repository version of Hadoop 0.21. The one I recently compiled against was revision 1135003.

    I have seen some problems in building against other versions of Hadoop, even if they are called 0.21.x.

  10. It would be great if there were a version of WikiHadoop that compiles against Hadoop 1.0.1. Is there any work going on to get this done?

    Or can you give me some more background information on the splitting feature?

    Kind regards,
    Herbert

    • Diederik says:

      Hi Herbert,
      Yes, we are looking into updating the code and making it work with more recent Hadoop versions. Hopefully I can detail progress soon.
      Best,
      Diederik

  11. Hi Diederik.
    Can you keep me updated on the development? I wouldn’t mind being a beta tester for the updated code. :-)

    Best wishes,
    Herbert

  12. Mihir Patel says:

    I would like to participate in this project as a developer.
    Contact me via email.

    Best,
    Mihir

    • Diederik says:

      Hi Mihir!

      That’s an amazing offer! Probably the easiest thing to do is to clone the repository on GitHub: https://github.com/whym/wikihadoop
      Then have a look at the open issues list to see what bugs need to be fixed, or you can work on Hadoop 1.0 compatibility :) or check CDH4 compatibility. Please let me know if you have any questions; we are here to help you!
      Best,
      Diederik

  13. Pingback: Strata Week: Cracking a book's genetic code - O'Reilly Radar

  14. Khanh says:

    Thank you very much for the tutorial, Diederik. However, I have some problems merging the mapreduce/ directory.
    So, I downloaded WikiHadoop from here: https://github.com/whym/wikihadoop, and after building, there’s a directory called mapred/ that contains auto.bz2.
    However, hadoop-common (downloaded from here: https://github.com/apache/hadoop-common) does not have a mapreduce/ directory. There’s a hadoop-mapreduce-project/ directory, but it does not have the mapreduce/src/contrib/streaming/ subdirectory mentioned above.
    Did I misunderstand something? Can you clarify please?
    Thanks in advance.
    Khanh

  15. (responding to Khanh’s post above)

    I think the most likely situation would be that the instructions you followed are outdated. Could you try the “How to use” section on https://github.com/whym/wikihadoop/blob/master/README.rst? That would give you the most up-to-date information.

    We have made a significant change to the build procedure (namely from Ant to Maven) since this blog post, and if you followed the “Getting Ready” section here, things would not have worked properly.

    Cheers,
    Yusuke

    • Khanh says:

      Thanks for your response, Yusuke. I ended up using the jar file and it seems to be working fine. It ran well with Hadoop’s default mapper. I’m trying to write a mapper and reducer for it now.

      Thanks,

  16. James Derrick says:

    Thanks a lot for the tutorial, Diederik. Though, I am getting an error message when I run it with a sample bz2 dump. I am on CDH4.1.2. Here are the steps:
    - I followed the instructions and cloned the repo for the latest wikihadoop
    - built it fine
    - changed the mvn’s settings.xml and added the repo as instructed
    - ran the following command; it ran for a while and then gave the following error (I received the same error even with another bz2 dump file):

    Command:
    hadoop jar /home/james/hadoop-2.0.0-cdh4.1.2/share/hadoop/tools/lib/hadoop-streaming-2.0.0-cdh4.1.2.jar -libjars target/wikihadoop-0.2.jar -D mapreduce.input.fileinputformat.split.minsize=300000000 -D mapreduce.task.timeout=6000000 -input /home/james/Class-Project/enwiki-latest-pages-meta-history21.xml-p015526402p015725000.bz2 -output output-wiki -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat

    Error message:
    13/04/15 13:08:57 WARN mapred.LocalJobRunner: job_local_0001
    java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:400)
    Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:998)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:550)
    at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:43)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:232)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
    13/04/15 13:08:57 INFO mapreduce.Job: Job job_local_0001 failed with state FAILED due to: NA
    13/04/15 13:08:58 INFO mapreduce.Job: Counters: 23
    File System Counters
    FILE: Number of bytes read=4393078884
    FILE: Number of bytes written=1362860
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    Map-Reduce Framework
    Map input records=5
    Map output records=0
    Map output bytes=0
    Map output materialized bytes=0
    Input split bytes=745
    Combine input records=0
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=19636
    CPU time spent (ms)=0
    Physical memory (bytes) snapshot=0
    Virtual memory (bytes) snapshot=0
    Total committed heap usage (bytes)=2842165248
    File Input Format Counters
    Bytes Read=1299193764
    org.wikimedia.wikihadoop.StreamWikiDumpInputFormat$WikiDumpCounters
    FOUND_PAGES=152920
    WRITTEN_PAGES=5
    WRITTEN_REVISIONS=5
    13/04/15 13:08:58 ERROR streaming.StreamJob: Job not Successful!
    Streaming Command Failed!

    Any help is appreciated.

    James

    • Diederik says:

      Hey James! Thanks for sending this to me. Maybe you could file this as a bug report on the GitHub repo? I don’t have a solution off the top of my head :(
      Diederik

      • shaw says:

        Hi Diederik
        we can change the input format to resolve it by adding ” -D mapred.mapoutput.key.class=org.apache.hadoop.io.Text”
