MapReduce (M/R) is a technique for dividing work across a distributed system. It takes advantage of the parallel processing power of distributed systems, and it also reduces network traffic, because the algorithm is shipped to where the data lives rather than a potentially huge dataset being shipped to a client-side algorithm. Developers can use MapReduce for things like filtering documents by tags, counting words in documents, and extracting links to related data.
In Riak, MapReduce is one method of non-key-based querying. MapReduce jobs can be submitted through either the HTTP API or the Protocol Buffers API. Note, however, that Riak's MapReduce is intended for batch processing, not real-time querying.
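As a concrete (if minimal) example, here is a sketch of a word-count job submitted from Python to Riak's /mapred HTTP endpoint. It assumes a Riak node listening on localhost at the default HTTP port 8098; the docs bucket and its keys are hypothetical, and the JSON shape matches Riak's classic JavaScript MapReduce interface, so details may vary by Riak version:

```python
import json
import urllib.request

# A word-count MapReduce job expressed as JSON and POSTed to Riak's
# /mapred HTTP endpoint. The map phase runs next to each stored object,
# so only the small per-phase results cross the network.
job = {
    # Hypothetical bucket ("docs") and keys; these name stored objects.
    "inputs": [["docs", "doc1"], ["docs", "doc2"]],
    "query": [
        {"map": {
            "language": "javascript",
            # Emit one {word: 1} object per word in the stored value.
            "source": """function(v) {
                var words = v.values[0].data.match(/\\w+/g) || [];
                return words.map(function(w) {
                    var o = {}; o[w] = 1; return o;
                });
            }""",
        }},
        {"reduce": {
            "language": "javascript",
            # Fold all of the per-word objects into a single tally.
            "source": """function(values) {
                var acc = {};
                values.forEach(function(v) {
                    for (var w in v) { acc[w] = (acc[w] || 0) + v[w]; }
                });
                return [acc];
            }""",
        }},
    ],
}

req = urllib.request.Request(
    "http://localhost:8098/mapred",            # default Riak HTTP port
    data=json.dumps(job).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```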
Comparing Map/Reduce to Traditional Parallelism
In order to appreciate what map/reduce brings to the table, I think it is most meaningful to contrast it to what I call traditional computing problems. I define "traditional" computing problems as those which use libraries like MPI, OpenMP, CUDA, or pthreads to produce results by utilizing multiple CPUs to perform some sort of numerical calculation concurrently. Problems that are well suited to being solved with these traditional methods typically share two common features:
1. They are CPU-bound: the part of the problem that takes the most time is doing calculations involving floating-point or integer arithmetic
2. Input data is gigabyte-scale: the data necessary to describe the conditions of the calculation is typically less than a hundred gigabytes, and often only a few hundred megabytes
Item #1 may seem trivial; after all, computers are meant to compute, so wouldn't all of the problems that need to be parallelized be fundamentally limited by how quickly the computer can do numerical calculations?
Traditionally, the answer to this question has been yes, but the technological landscape has been changing rapidly over the last decade. Sources of vast, unending data (e.g., social media, inexpensive genome sequencing) have converged with inexpensive, high-capacity hard drives and the advanced filesystems needed to support them, and data-intensive computing problems have emerged as a result. In contrast to the aforementioned traditional computing problems, data-intensive problems demonstrate the following features:
1. Input data is far beyond gigabyte-scale: datasets are commonly on the order of tens, hundreds, or thousands of terabytes
2. They are I/O-bound: it takes longer for the computer to get data from its permanent location to the CPU than it takes for the CPU to operate on that data, as the quick calculation below illustrates
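To put rough numbers on that I/O bottleneck, here is a back-of-envelope sketch. The 150 MB/s figure is my assumption for a single spinning disk's sustained sequential read bandwidth, not a measured benchmark:

```python
# Back-of-envelope: time to read a 100 TB dataset from disk.
dataset_bytes = 100e12    # 100 TB of input data
disk_bandwidth = 150e6    # assumed 150 MB/s sustained read per disk

# One disk reading the whole dataset serially:
one_disk_hours = dataset_bytes / disk_bandwidth / 3600
print(f"one disk:   {one_disk_hours:7.1f} hours")    # ~185 hours

# The same data spread across 1000 disks, each read locally in parallel:
print(f"1000 disks: {one_disk_hours / 1000 * 60:7.1f} minutes")  # ~11 minutes
```

At that scale, no realistic amount of CPU work per byte changes the conclusion: moving the bytes dominates the runtime.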
Traditional Parallel Applications:
To illustrate these differences, the following schematic depicts how your typical traditionally parallel application works.
The input data is stored on some sort of remote storage device (a SAN, a file server serving files over NFS, a parallel Lustre or GPFS filesystem, etc.; grey cylinders). The compute resources or elements (blue boxes) are abstract units that can represent MPI ranks, compute nodes, or threads on a shared-memory system.
Upon launching a traditionally parallel application,
1. A master parallel worker (MPI rank, thread, etc.) reads the input data from disk (green arrow).
NOTE: In some cases multiple ranks may use a parallel I/O API like MPI-IO to collectively read input data, but the filesystem on which the input data resides must be a high-performance filesystem that can sustain the required device- and network-read bandwidth.
2. The master worker then divides up the input data into chunks and sends a part to each of the other workers (red arrows).
3. All of the parallel workers compute their chunk of the input data.
4. All of the parallel workers communicate their results with each other, then continue with the next iteration of the calculation.
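For concreteness, here is a minimal sketch of this read-scatter-compute-communicate pattern in Python with mpi4py. The file name input.dat and the sum-of-squares kernel are hypothetical stand-ins for real input data and a real calculation:

```python
# Minimal traditional-parallelism sketch using mpi4py.
# Run with, e.g.:  mpiexec -n 4 python traditional.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Step 1: the master alone reads the input from shared storage
    # (the serial "green arrow" in the schematic).
    with open("input.dat") as f:           # hypothetical input file
        numbers = [float(line) for line in f]
    # Step 2: split the data into one chunk per worker.
    chunks = [numbers[i::size] for i in range(size)]
else:
    chunks = None

# Step 2 (continued): ship the chunks over the network (red arrows).
chunk = comm.scatter(chunks, root=0)

# Step 3: every worker computes on its own chunk.
partial_sum = sum(x * x for x in chunk)

# Step 4: workers exchange results before the next iteration.
total = comm.allreduce(partial_sum, op=MPI.SUM)

if rank == 0:
    print(f"sum of squares: {total}")
```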
Data-intensive Applications:
The map/reduce paradigm is a completely different way of solving a certain subset of parallelizable problems, one that gets around the bottleneck of ingesting input data from disk (that pesky green arrow). Whereas traditional parallelism brings the data to the compute, map/reduce does the opposite: it brings the compute to the data.
In map/reduce, the input data is not stored on a separate, high-capacity storage system. Rather, the data exists in little pieces and is permanently stored on the compute elements. This allows our parallel procedure to follow these steps:
1. We don't have to move any data, since it is pre-divided and already exists on nodes capable of acting as compute elements
2. The parallel worker functions are sent to the nodes where their respective pieces of the input data already exist, and each performs its calculations there
3. All of the parallel workers communicate their results with each other, move data if necessary, then continue with the next step of the calculation
Thus, the only time data needs to be moved is when the parallel workers communicate their results with each other in step #3. There is no longer a serial step where data is loaded from a storage device before being distributed to the compute resources, because the data already exists on those resources.
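The division of labor is easiest to see in miniature. This toy sketch runs both phases in a single Python process, so nothing is actually distributed, but the structure is the same: independent map calls per document, then one reduce over the compact intermediate results:

```python
from collections import Counter
from functools import reduce

# Toy corpus; in a real system each map call would run on the node
# that already stores its document.
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}

# Map phase: each document is processed independently, producing a
# small intermediate result. No document ever leaves "its" node.
def map_phase(text):
    return Counter(text.split())

intermediate = [map_phase(text) for text in documents.values()]

# Reduce phase: only the compact per-document counts are combined.
# This is the lone step where data moves between workers.
totals = reduce(lambda a, b: a + b, intermediate, Counter())
print(totals.most_common(3))  # [('the', 3), ('fox', 2), ...]
```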
Of course, for the compute elements to be able to do their calculations on these chunks of input data, each element's calculations must be completely independent of the input data residing on other compute elements. This is the principal constraint in map/reduce jobs: map/reduce is ideally suited for trivially parallel calculations over large quantities of data, but if each worker's calculations depend on data that resides on other nodes, you will begin to encounter rapidly diminishing returns.