Sunday, 5 October 2014

Hadoop - A Map/Reduce Implementation

Hadoop is a framework for processing and analysing large volumes of data and extracting useful results from them.

1.1. The Magic of HDFS

The idea underpinning map/reduce (bringing the compute to the data rather than dragging the data to the compute) should sound like a very simple solution to the I/O bottleneck inherent in traditional parallelism. However, the devil is in the details, and implementing a framework where a single large file is transparently diced up and distributed across multiple physical computing elements (all while appearing to remain a single file to the user) is not trivial.
Hadoop, perhaps the most widely used map/reduce framework, accomplishes this feat using HDFS, the Hadoop Distributed File System. HDFS is fundamental to Hadoop because it provides the data chunking and distribution across compute elements necessary for map/reduce applications to be efficient. Since we're now talking about an actual map/reduce implementation rather than an abstract concept, let's refer to those abstract compute elements as compute nodes.
HDFS is a filesystem into and out of which you can copy files, much as you would with any other filesystem. Many of the typical commands for manipulating files (ls, mkdir, rm, mv, cp, cat, tail, and chmod, to name a few) behave as you might expect in any other standard filesystem (e.g., Linux's ext4).
The magical part of HDFS is what is going on just underneath the surface. Although it appears to be a filesystem that contains files like any other, in reality those files are distributed across multiple physical compute nodes:
Schematic depicting the magic of HDFS
When you copy a file into HDFS as depicted above, that file is transparently sliced into 64 MB "chunks" and replicated three times for reliability. Each of these chunks is distributed to various compute nodes in the Hadoop cluster so that a given 64 MB chunk exists on three independent nodes. Although the file is physically chunked up and distributed in triplicate, all of your interactions with it on HDFS still make it appear to be the same single file you copied into HDFS initially. Thus, HDFS handles all of the burden of slicing, distributing, and recombining your data for you.
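To put rough numbers on this behavior, here is a small Python sketch of how many default-sized chunks a file occupies and how much raw storage its three replicas consume; the 1 GB file size is just an illustrative assumption, not a value from this post:
  import math

  BLOCK_SIZE_MB = 64    # HDFS default chunk (block) size
  REPLICATION = 3       # HDFS default replication factor
  file_size_mb = 1024   # illustrative 1 GB input file (an assumption for this sketch)

  num_chunks = int(math.ceil(float(file_size_mb) / BLOCK_SIZE_MB))  # chunks the file is sliced into
  raw_storage_mb = file_size_mb * REPLICATION                       # total MB stored across the cluster

  print("%d MB file -> %d chunks of %d MB, %d MB of raw storage with %dx replication"
        % (file_size_mb, num_chunks, BLOCK_SIZE_MB, raw_storage_mb, REPLICATION))
  # 1024 MB file -> 16 chunks of 64 MB, 3072 MB of raw storage with 3x replication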
HDFS's chunk size and replication
The 64 MB chunk (block) size and the choice to replicate your data three times are only HDFS's default values. These decisions can be changed:
  • the 64 MB block size can be modified by changing the dfs.block.size property in hdfs-site.xml. It is common to increase this to 128 MB in production environments.
  • the replication factor can be modified by changing the dfs.replication property in hdfs-site.xml. It can also be changed on a per-file basis by specifying -D dfs.replication=1 on your -put command line, or by using the hadoop dfs -setrep -w 1 command (both properties are shown in the sketch after this list).
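For reference, a minimal hdfs-site.xml sketch that overrides both defaults might look like the following; the 128 MB block size matches the common production choice mentioned above, while the replication factor of 2 is purely illustrative:
  <!-- hdfs-site.xml: illustrative overrides of the two HDFS defaults -->
  <configuration>
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>  <!-- 128 MB, expressed in bytes -->
    </property>
    <property>
      <name>dfs.replication</name>
      <value>2</value>          <!-- keep two copies instead of the default three -->
    </property>
  </configuration>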

1.2. Map/Reduce Jobs

HDFS is an interesting technology in that it provides data distribution, replication, and automatic recovery in a user-space filesystem that is relatively easy to configure and, conceptually, easy to understand. However, its true utility comes to light when map/reduce jobs are executed on data stored in HDFS.
As the name implies, map/reduce jobs are principally comprised of two steps: the map step and the reduce step. The overall workflow generally looks something like this:
Program flow of a map/reduce application
The left half of the diagram depicts the HDFS magic described in the previous section, where the hadoop dfs -copyFromLocal command is used to move a large data file into HDFS and it is automatically replicated and distributed across multiple physical compute nodes. While this step of moving data into HDFS is not strictly a part of a map/reduce job (i.e., your dataset may already have a permanent home on HDFS just as it would on any other filesystem), a map/reduce job's input data must already exist on HDFS before the job can be started.

1.2.1. The Map Step

Once a map/reduce job is initiated, the map step:
  1. launches a number of parallel mappers across the compute nodes that contain chunks of your input data
  2. has each mapper "split" its chunk of data into individual lines of text on newline characters (\n)
  3. hands each split (a line of text that was terminated by \n) to your mapper function
  4. expects your mapper function to turn each line into zero or more key-value pairs and then "emit" these key-value pairs for the subsequent reduce step
That is, the map step's job is to transform your raw input data into a series of key-value pairs with the expectation that these parsed key-value pairs can be analyzed meaningfully by the reduce step. It's perfectly fine for duplicate keys to be emitted by mappers.
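To make this concrete, here is a minimal word-count mapper sketch for the Hadoop streaming interface, written in Python; the file name mapper.py and the word-count task itself are illustrative assumptions, not code from this post. It reads raw lines on standard input and emits tab-separated key-value pairs on standard output:
  #!/usr/bin/env python
  # mapper.py: word-count mapper sketch for Hadoop streaming
  import sys

  for line in sys.stdin:                 # each split arrives as one line on stdin
      for word in line.strip().split():  # break the line into words
          print("%s\t%d" % (word, 1))    # emit a key (the word) and a value (1), tab-separated
By default, Hadoop streaming treats everything up to the first tab character on an emitted line as the key and the rest of the line as the value.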
Input splitting
The decision to split your input data along newline characters is just the default behavior, which assumes your input data is a plain ASCII text file. You can change how input data is split before being passed to your mapper function by using alternate InputFormats.

1.2.2. The Reduce Step

Once all of the mappers have finished digesting the input data and have emitted all of their key-value pairs, those key-value pairs are sorted according to their keys and then passed on to the reducers. The reducers are given key-value pairs in such a way that all key-value pairs sharing the same key always go to the same reducer. The corollary is that if a particular reducer holds one specific key, it is guaranteed to hold every other key-value pair sharing that key, and all of those pairs will appear as a contiguous run in the sorted input that reducer receives.
Your job's reducer function then does some sort of calculation based on all of the values that share a common key. For example, the reducer might calculate the sum of all values for each key (e.g., the word count example). The reducers then emit key-value pairs back to HDFS, where each key is unique and each key's value is the result of the reducer function's calculation.
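Continuing the word-count illustration from the previous section, a matching streaming reducer sketch in Python could look like the following (again an illustrative assumption, not code from this post); it relies on the fact, explained below, that its input arrives sorted by key, so all counts for a given word are adjacent:
  #!/usr/bin/env python
  # reducer.py: word-count reducer sketch for Hadoop streaming
  import sys

  current_word, current_count = None, 0

  for line in sys.stdin:                           # input is sorted by key (the word)
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current_word:
          current_count += int(count)              # same key: keep accumulating
      else:
          if current_word is not None:             # a new key means the previous one is finished
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)

  if current_word is not None:                     # flush the final key
      print("%s\t%d" % (current_word, current_count))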
The Sort and Shuffle
The process of sorting and distributing the mapper's output to the reducers can be seen as a separate step often called the "shuffle". What really happens is that as mappers emit key-value pairs, the keys are passed through the Partitioner to determine which reducer they are sent to.
The default Partitioner is a function which hashes the key and then takes that hash modulo the number of reducers to determine which reducer gets that key-value pair. Since the hash of a given key will always be the same, all key-value pairs sharing the same key will get the same output value from the Partitioner and therefore wind up on the same reducer.
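Expressed as a Python sketch of the behavior just described (not Hadoop's actual Java HashPartitioner source), the default partitioning decision amounts to:
  def default_partition(key, num_reducers):
      """Choose a reducer for a key the way the default hash partitioner does:
      hash the key, then take that hash modulo the number of reducers."""
      return hash(key) % num_reducers   # same key -> same hash -> same reducer

  # e.g., with four reducers, every occurrence of the key "hadoop" lands on the
  # same reducer, because default_partition("hadoop", 4) returns the same value
  # every time it is called within a single run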
Once all key-value pairs are assigned to their reducers, each reducer sorts its input by key so that a single loop over its key-value pairs examines all of the values for one key before moving on to the next key. As you will see in my tutorial on writing mappers and reducers in Python, this is an essential feature of the Hadoop streaming interface.
