“MapReduce exploits a restricted programming model to parallelize the user
program automatically and to provide transparent fault-tolerance”
Restricted programming model: a model whose designers deliberately omit features because they're "too complex" or because programmers have used them to write bad code. In short, restricting the API lets the library give stronger guarantees (here: automatic parallelization and transparent fault tolerance).
- Intermediate keys and values must be the same types as the output keys and
values. (This is not the case in Hadoop, where intermediate and output types may differ.)
- The authors also mention that MapReduce could be applied to a shared memory
system or NUMA multi-processor system
- Notes on Execution
- Intermediate key/value pairs are periodically written to the mapper's local
disk, partitioned by key
- The mapper sends the locations of these partitions to the master
- Reducer learns about the locations from the master and reads the appropriate
partitions via RPC directly from the Mapper
- Once all of the data is received, the reducer sorts the data, possibly with
an external sort
- I wonder if some speculative partial reduction could be done, or a
heap-style merge sort that starts as soon as the first data arrives
- However, it's probably a better use of the machine to run other work
instead
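The heap-style idea above can be sketched with Python's standard library: each mapper produces a sorted run for the reducer's partition, and `heapq.merge` streams the runs together lazily, so grouped reduction can begin before all data is materialized in one sorted array. The data here is illustrative, not from the paper.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper delivered a sorted run for this reducer's partition.
# (Toy word-count data; real runs would come over RPC from mappers.)
runs = [
    sorted([("apple", 1), ("cat", 1)]),
    sorted([("apple", 1), ("bee", 1)]),
]

# heapq.merge lazily merges already-sorted iterables, so the reducer
# could start consuming keys as soon as every run is open.
merged = heapq.merge(*runs, key=itemgetter(0))
counts = {key: sum(v for _, v in group)
          for key, group in groupby(merged, key=itemgetter(0))}
# counts == {"apple": 2, "bee": 1, "cat": 1}
```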
- Reducer output goes to GFS
- Once the reducer finishes, it atomically renames the output file
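A minimal sketch of the write-then-rename commit, using a local POSIX filesystem in place of GFS (the paper relies on GFS's atomic rename; the function name here is my own):

```python
import os
import tempfile

def commit_reduce_output(data: str, final_path: str) -> None:
    # Write the reducer's output to a temporary file in the destination
    # directory, then atomically rename it into place. If speculative
    # duplicate executions of the same reduce task both finish, the
    # rename guarantees the final file is one complete copy of the output.
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, final_path)  # atomic on POSIX
```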
- The master does not currently checkpoint its state; if it dies, the user must
rerun the job
- The authors say adding checkpointing would be easy
- At the time, network bandwidth was scarce
- They had the cluster scheduler assign map tasks on the same machines that
held the data in GFS or very close to the same machine
- I wonder how this changes now that Google has much more network bandwidth
- ability to skip bad records
- When an error occurs (e.g. a segfault), a signal handler sends a quick
UDP packet to the master containing the offset of the record being processed
- If the master sees more than one failure report for the same record, it
tells the next re-execution of the task to skip that record
- https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/SkipBadRecords.html
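The master-side bookkeeping for this can be sketched as follows (class and method names are illustrative, not from the paper):

```python
from collections import defaultdict

class Master:
    """Sketch of the master's skip-bad-records bookkeeping."""

    def __init__(self, failure_threshold: int = 2):
        # (task id, record offset) -> number of crash reports seen
        self.failures = defaultdict(int)
        self.threshold = failure_threshold

    def report_crash(self, task_id: str, offset: int) -> None:
        # Called when a worker's signal handler sends its "last gasp"
        # UDP packet naming the record offset it was processing.
        self.failures[(task_id, offset)] += 1

    def records_to_skip(self, task_id: str) -> set:
        # When rescheduling task_id, tell the new worker to skip any
        # record that has crashed the task at least `threshold` times.
        return {off for (tid, off), n in self.failures.items()
                if tid == task_id and n >= self.threshold}
```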
- Local execution: an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine, where the user can debug
with GDB
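The essence of local execution is easy to sketch: run every map call, group by key in memory, then run every reduce call, all in one process so a debugger can step through everything (a toy sketch, not the paper's implementation):

```python
from collections import defaultdict

def local_mapreduce(mapper, reducer, inputs):
    """Sequential single-process MapReduce, useful for debugging:
    no partitioning, no RPC, no fault tolerance."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values)
            for key, values in sorted(groups.items())}

# Word count, the canonical example:
wc_map = lambda line: [(w, 1) for w in line.split()]
wc_reduce = lambda key, values: sum(values)
result = local_mapreduce(wc_map, wc_reduce, ["a b a", "b c"])
# result == {"a": 2, "b": 2, "c": 1}
```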
- Counter values are piggybacked on the heartbeat response
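A sketch of the piggybacking idea: the worker accumulates counter deltas locally and attaches them to each periodic ping reply instead of making separate RPCs, and the master aggregates them (class and field names are my own, not from the paper):

```python
class Worker:
    """Worker side: accumulate counter deltas between heartbeats."""

    def __init__(self):
        self.counters = {}

    def increment(self, name, amount=1):
        self.counters[name] = self.counters.get(name, 0) + amount

    def heartbeat_payload(self):
        # Counter deltas ride along with the periodic ping reply,
        # avoiding extra network round trips; reset after sending.
        payload, self.counters = self.counters, {}
        return {"status": "alive", "counters": payload}

class MasterCounters:
    """Master side: fold worker deltas into global totals."""

    def __init__(self):
        self.totals = {}

    def on_heartbeat(self, payload):
        for name, delta in payload["counters"].items():
            self.totals[name] = self.totals.get(name, 0) + delta
```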