Hadoop: The Definitive Guide

by

Duct Tape Programmer


Book Summary

This is a summary of and notes from Hadoop: The Definitive Guide, 4th Edition.

Introduction

  • more data usually beats better algorithms
  • In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis.
  • MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
  • seek time is improving more slowly than transfer rate.

Chapter 2: MapReduce

Data Flow

  • A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information.
  • Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
  • The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be automatically rescheduled to run on a different node.
  • Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
  • Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
  • Having many splits means the time taken to process each split is small compared to the time to process the whole input.
  • On the other hand, if splits are too small, the overhead of managing the splits and map task creation begins to dominate the total job execution time.
  • For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default.
  • Hadoop does its best to run the map task on a node where the input data resides in HDFS, so that it doesn’t use valuable cluster bandwidth. This is called the data locality optimization.
  • Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill.

  • Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. In the simplest case, a single reduce task is fed by all of the map tasks.

  • When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition.
  • This data flow between map and reduce tasks is colloquially known as “the shuffle,” because each reduce task is fed by many map tasks (a minimal job sketch illustrating this flow follows this list).
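To make the data flow concrete, here is a minimal word-count job sketch against the standard org.apache.hadoop.mapreduce API. The class names (WordCount, TokenMapper, SumReducer) and the choice of two reducers are illustrative assumptions, not taken from the book; the framework creates one map task per input split and partitions the map output by key across the configured number of reduce tasks.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // One map task is created per input split; the framework calls map() once per record.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE); // map output is written to local disk, not HDFS
            }
          }
        }
      }

      // Each reduce task receives all values for the keys in its partition.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // map output is partitioned by key across 2 reducers
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }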

Combiner Functions

  • Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function’s output forms the input to the reduce function.
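As a sketch of how a combiner is wired into a job, assuming the hypothetical SumReducer from the word-count sketch above: because summing is associative and commutative, the reducer class can safely be reused as the combiner.

    // In the driver, alongside the other job setup calls:
    // the combiner runs on each map task's output before the shuffle,
    // pre-aggregating counts so less data is transferred to the reducers.
    job.setCombinerClass(SumReducer.class);

A combiner may be run zero, one, or many times on a given map task's output, so the job's final result must not depend on how often it runs.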

Chapter 3: The Hadoop Distributed Filesystem
