-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathmapreduce.html
More file actions
217 lines (97 loc) · 79 KB
/
mapreduce.html
File metadata and controls
217 lines (97 loc) · 79 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- iOS Safari -->
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<!-- Chrome, Firefox OS and Opera Status Bar Color -->
<meta name="theme-color" content="#FFFFFF">
<link rel="stylesheet" type="text/css" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.11.1/katex.min.css">
<link rel="stylesheet" type="text/css"
href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.19.0/themes/prism.min.css">
<link rel="stylesheet" type="text/css" href="css/SourceSansPro.css">
<link rel="stylesheet" type="text/css" href="css/theme.css">
<link rel="stylesheet" type="text/css" href="css/notablog.css">
<!-- Favicon -->
<link rel="shortcut icon" href="https://www.notion.so/signed/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Ffc9b3a94-67d3-4485-bdf3-5e0c0b341ebe%2FAA238E8485C55D168DCF034BC7482B61.png?table=collection&id=c97ea4eb-3d30-4977-8edc-ee98d0f07149">
<style>
:root {
font-size: 20px;
}
</style>
<title>MapReduce | Patrick’s Blog</title>
<meta property="og:type" content="blog">
<meta property="og:title" content="MapReduce">
<meta name="description" content="MapReduce: Simplified Data Processing on Large Clusters">
<meta property="og:description" content="MapReduce: Simplified Data Processing on Large Clusters">
<meta property="og:image" content="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text text-anchor=%22middle%22 dominant-baseline=%22middle%22 x=%2250%22 y=%2255%22 font-size=%2280%22>🥏</text></svg>">
<style>
.DateTagBar {
margin-top: 1.0rem;
}
</style>
</head>
<body>
<nav class="Navbar">
<a href="index.html">
<div class="Navbar__Btn">
<span><img class="inline-img-icon" src="https://www.notion.so/signed/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2Ffc9b3a94-67d3-4485-bdf3-5e0c0b341ebe%2FAA238E8485C55D168DCF034BC7482B61.png?table=collection&id=c97ea4eb-3d30-4977-8edc-ee98d0f07149"></span>
<span>Home</span>
</div>
</a>
<span class="Navbar__Delim">·</span>
<a href="about.html">
<div class="Navbar__Btn">
<span><img class="inline-img-icon" src="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text text-anchor=%22middle%22 dominant-baseline=%22middle%22 x=%2250%22 y=%2255%22 font-size=%2280%22>😀</text></svg>"></span>
<span>About me</span>
</div>
</a>
<span class="Navbar__Delim">·</span>
<a href="categories.html">
<div class="Navbar__Btn">
<span><img class="inline-img-icon" src="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text text-anchor=%22middle%22 dominant-baseline=%22middle%22 x=%2250%22 y=%2255%22 font-size=%2280%22>📃</text></svg>"></span>
<span>Categories</span>
</div>
</a>
</nav>
<header class="Header">
<div class="Header__Spacer Header__Spacer--NoCover">
</div>
<div class="Header__Icon">
<span><img class="inline-img-icon" src="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text text-anchor=%22middle%22 dominant-baseline=%22middle%22 x=%2250%22 y=%2255%22 font-size=%2280%22>🥏</text></svg>"></span>
</div>
<h1 class="Header__Title">MapReduce</h1>
<div class="DateTagBar">
<span class="DateTagBar__Item DateTagBar__Date">Posted on Sat, Apr 16, 2022</span>
<span class="DateTagBar__Item DateTagBar__Tag DateTagBar__Tag--gray">
<a href="tag/📖 Note.html">📖 Note</a>
</span>
<span class="DateTagBar__Item DateTagBar__Tag DateTagBar__Tag--brown">
<a href="tag/Distributed.html">Distributed</a>
</span>
</div>
</header>
<article id="https://www.notion.so/8ce269f74da54a35bade61d8d7cda45e" class="PageRoot"><h1 id="https://www.notion.so/a3845b11235a4f90bb59d51beec36373" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--1"><a class="Anchor" href="#https://www.notion.so/a3845b11235a4f90bb59d51beec36373"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">MapReduce: Simplified Data Processing on Large Clusters</span></span></h1><div id="https://www.notion.so/2d530e91814440deac81810c08f8f0f8" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">MapReduce is a programming model and an associated implementation for processing and generating large data sets.</span></span></p></div><ul class="BulletedListWrapper"><li id="https://www.notion.so/3be036d359ca4f03af8ba7def3e0ea72" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">map</em></span><span class="SemanticString"> function: processes a key/value pair to generate a set of intermediate key/value pairs</span></span></li><li id="https://www.notion.so/4d19e42693e845a88110eabbc6e7a7e8" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">reduce</em></span><span class="SemanticString"> function: merges all intermediate values associated with the same intermediate key</span></span></li></ul><div id="https://www.notion.so/723e76d3e539459fad0d06cb204d4476" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication</span></span></p></div><h2 id="https://www.notion.so/764876b628944e0296ca7a81c1ae7876" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--2"><a class="Anchor" href="#https://www.notion.so/764876b628944e0296ca7a81c1ae7876"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">1 Introduction</span></span></h2><div id="https://www.notion.so/9d25920332dc49f4853547b218afe24e" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.</span></span></p></div><div id="https://www.notion.so/f18b05f8daf4432fa23e188a053606eb" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">new abstraction:</span></span></p></div><ul class="BulletedListWrapper"><li id="https://www.notion.so/fc04724b52094c688b140f760cd82fe6" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">map</em></span><span class="SemanticString"> operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs</span></span></li><li id="https://www.notion.so/7d7adde24b4c4c5bb23d255898f28ebd" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">reduce</em></span><span class="SemanticString"> operation to all the values that shared the same key, in order to combine the derived data appropriately</span></span></li></ul><div id="https://www.notion.so/4fcb1bd5506b478a8e721542d99f3826" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">(inspired by the </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">map</em></span><span class="SemanticString"> and </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">reduce</em></span><span class="SemanticString"> primitives present in Lisp and many other functional languages)</span></span></p></div><h2 id="https://www.notion.so/b63c8898ee8c4897a478dbc6af5debee" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--2"><a class="Anchor" href="#https://www.notion.so/b63c8898ee8c4897a478dbc6af5debee"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">2 Programming Model</span></span></h2><div id="https://www.notion.so/4bf2ff610c3c43a0b7180c6516aed4d7" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">input</em></span><span class="SemanticString">: key/value pairs</span></span></p></div><div id="https://www.notion.so/1eebd5fd51664892b47d0484053b810a" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">output</em></span><span class="SemanticString">: key/value pairs</span></span></p></div><ul class="BulletedListWrapper"><li id="https://www.notion.so/a94ced7f3c4f46bf8fd7bdf92044c959" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">Map</em></span><span class="SemanticString">: written by the user, take an input pair and produces a set of </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">intermediate</em></span><span class="SemanticString"> key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">I</em></span><span class="SemanticString"> and passes them to the </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">Reduce</em></span><span class="SemanticString"> function.</span></span></li><li id="https://www.notion.so/d2ed940287294db5879aed15e374a956" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">Reduce</em></span><span class="SemanticString">: also written by the user, accepts an intermediate key </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">I</em></span><span class="SemanticString"> and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">Reduce</em></span><span class="SemanticString"> invocation. The intermediate values are supplied to the user’s reduce function via an iterator. (to handle lists of values that are too large to fit in memory)</span></span></li></ul><h3 id="https://www.notion.so/a8e3049c84a64733845f17b0f3b5f006" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/a8e3049c84a64733845f17b0f3b5f006"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">2.1 Example</span></span></h3><div id="https://www.notion.so/c2aab4034cc94d188172cee255194c40" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">Counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:</span></span></p></div><pre id="https://www.notion.so/b430126d2bcc45e9963fa2f544fcdf26" class="Code Code--NoWrap"><code><span class="SemanticStringArray"><span class="SemanticString"><span><span class="token function">map</span><span class="token punctuation">(</span>String key<span class="token punctuation">,</span> String value<span class="token punctuation">)</span><span class="token operator">:</span>
<span class="token comment">// key: document name</span>
<span class="token comment">// value: document contents</span>
<span class="token keyword">for</span> each word w in value<span class="token operator">:</span>
<span class="token function">EmitIntermediate</span><span class="token punctuation">(</span>w<span class="token punctuation">,</span> <span class="token string">"1"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token function">reduce</span><span class="token punctuation">(</span>String key<span class="token punctuation">,</span> Iterator values<span class="token punctuation">)</span><span class="token operator">:</span>
<span class="token comment">// key: a word</span>
<span class="token comment">// values: a list of counts</span>
<span class="token keyword">int</span> result <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token keyword">for</span> each v in values<span class="token operator">:</span>
result <span class="token operator">+=</span> <span class="token function">ParseInt</span><span class="token punctuation">(</span>v<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token function">Emit</span><span class="token punctuation">(</span><span class="token function">AsString</span><span class="token punctuation">(</span>result<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span></span></span></span></code></pre><div id="https://www.notion.so/3d4cc5cfb2644ab5871d788fe6ca0874" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">In addition, the user writes code to fill in a </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">mapreduce specification</em></span><span class="SemanticString"> object with the names of the input and output files, and optional tuning parameters. The user then invokes the </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">MapReduce</em></span><span class="SemanticString"> function, passing it the specification object.</span></span></p></div><h3 id="https://www.notion.so/3de2ecbab92d4f1b8aa73814605d2a07" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/3de2ecbab92d4f1b8aa73814605d2a07"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">2.2 Types</span></span></h3><div id="https://www.notion.so/825b804ff86d40928802023896e30696" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">Conceptually the map and reduce functions supplied by the user have associated types:</span></span></p></div><pre id="https://www.notion.so/f7a2b4720bf34ab0ace04738e33b44be" class="Code Code--NoWrap"><code><span class="SemanticStringArray"><span class="SemanticString"><span><span class="token function">map</span> <span class="token punctuation">(</span>k1<span class="token punctuation">,</span> v1<span class="token punctuation">)</span> <span class="token operator">-></span> <span class="token function">list</span><span class="token punctuation">(</span>k2<span class="token punctuation">,</span> v2<span class="token punctuation">)</span>
<span class="token function">reduce</span> <span class="token punctuation">(</span>k2<span class="token punctuation">,</span> <span class="token function">list</span><span class="token punctuation">(</span>v2<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">-></span> <span class="token function">list</span><span class="token punctuation">(</span>v2<span class="token punctuation">)</span></span></span></span></code></pre><div id="https://www.notion.so/86b8f6f0d64c408088cbced09a732bae" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The input keys and values are drawn from a different domain than the output keys and values. The intermediate keys and values are from the same domain as the output keys and values.</span></span></p></div><h3 id="https://www.notion.so/3de0335119c2430a913d94783c2553a6" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/3de0335119c2430a913d94783c2553a6"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">2.3 More Examples</span></span></h3><ul class="BulletedListWrapper"><li id="https://www.notion.so/25c11baf4c424b28a0fa00af0bee92c9" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Distributed Grep</strong></span></span><div id="https://www.notion.so/98f5408ea10b45f286d3ef22c802ec9c" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The map function emits a line if it matches a supplied pattern.</span></span></p></div><div id="https://www.notion.so/d1b5a4999a17421bb4ac0b0dca199977" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The reduce function is an identity function that just copies the supplied intermediate data to the output.</span></span></p></div></li><li id="https://www.notion.so/3945727a2afc4dfbaeb51acb49388194" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Count of URL Access Frequency</strong></span></span><div id="https://www.notion.so/ea72a37ac6a94317909b7f320246b064" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The map function processes logs of web page requests and outputs </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><URL, 1></code></span><span class="SemanticString">.</span></span></p></div><div id="https://www.notion.so/a3a9308d4e1f424791c7e1afd4082fca" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The reduce function adds together all values for the same URL and emits a </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><URL, total count></code></span><span class="SemanticString"> pair.</span></span></p></div></li><li id="https://www.notion.so/2c64f89bc5a24d659a6ac5b491645891" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Reverse Web-Link Graph</strong></span></span><div id="https://www.notion.so/6d73b28137dc414c98f7761d23f85f8a" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The map function outputs </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><target, source></code></span><span class="SemanticString"> pairs for each link to a </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">target</code></span><span class="SemanticString"> URL found in a page named </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">source</code></span><span class="SemanticString">.</span></span></p></div><div id="https://www.notion.so/0fae65199aa041c385497125f704f3aa" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><target, list(source)></code></span><span class="SemanticString">.</span></span></p></div></li><li id="https://www.notion.so/4e36e815ab2a43ec989629b518446622" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Term-Vector per Host</strong></span></span><div id="https://www.notion.so/81cc552c48504edcab2db833e3db1d9f" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">A term vector summarizes the most important words that occur in a document or a set of documents as a list of </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="⟨word, frequency⟩"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false">⟨</mo><mi>w</mi><mi>o</mi><mi>r</mi><mi>d</mi><mo separator="true">,</mo><mi>f</mi><mi>r</mi><mi>e</mi><mi>q</mi><mi>u</mi><mi>e</mi><mi>n</mi><mi>c</mi><mi>y</mi><mo stretchy="false">⟩</mo></mrow><annotation encoding="application/x-tex">⟨word, frequency⟩</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">⟨</span><span class="mord mathdefault" style="margin-right:0.02691em;">w</span><span class="mord mathdefault">o</span><span class="mord mathdefault" style="margin-right:0.02778em;">r</span><span class="mord mathdefault">d</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathdefault" style="margin-right:0.10764em;">f</span><span class="mord mathdefault" style="margin-right:0.02778em;">r</span><span class="mord mathdefault">e</span><span class="mord mathdefault" style="margin-right:0.03588em;">q</span><span class="mord mathdefault">u</span><span class="mord mathdefault">e</span><span class="mord mathdefault">n</span><span class="mord mathdefault">c</span><span class="mord mathdefault" style="margin-right:0.03588em;">y</span><span class="mclose">⟩</span></span></span></span></span></span><span class="SemanticString"> pairs.</span></span></p></div><div id="https://www.notion.so/5a8af4e0df574144b1fb13019e8dd2a4" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The map function emits a </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><hostname, term vector></code></span><span class="SemanticString"> pair for each input document (where the hostname is extracted from the URL of the document).</span></span></p></div><div id="https://www.notion.so/d9fc6ca3a7f847419c4c1deefad1f52b" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><hostname, term vector></code></span><span class="SemanticString"> pair.</span></span></p></div></li><li id="https://www.notion.so/f1bedcd17e2b4804aee4f6ab43258b99" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Inverted Index</strong></span></span><div id="https://www.notion.so/2c5a9173a1f946a88ae891349ee340ab" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The map function parses each document, and emits a sequence of </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><word, document ID></code></span><span class="SemanticString"> pairs.</span></span></p></div><div id="https://www.notion.so/5a9742ad2acf4835af8d00a377a68da3" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><word, list(document ID)></code></span><span class="SemanticString"> pair.</span></span></p></div><div id="https://www.notion.so/b205c5eb4b2e4c33a6d8806c2c122b10" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word position.</span></span></p></div></li><li id="https://www.notion.so/96cc2598eab5438aaa0d4223d5351f02" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Distributed Sort</strong></span></span><div id="https://www.notion.so/7e55ad0f2d55457fbb056128f4411f25" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The map function extracts the key from each record, and emits a </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code"><key, record></code></span><span class="SemanticString"> pair.</span></span></p></div><div id="https://www.notion.so/6a59e3391e7c49589a8cfeb8dfaa9518" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The reduce function emits all pairs unchanged.</span></span></p></div><div id="https://www.notion.so/89ba3725d7024cd0933ad1de78696f6c" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.</span></span></p></div></li></ul><h2 id="https://www.notion.so/6a3961d672e542979f65b8e115d3d8c1" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--2"><a class="Anchor" href="#https://www.notion.so/6a3961d672e542979f65b8e115d3d8c1"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">3 Implementation</span></span></h2><div id="https://www.notion.so/681f62df8a4441d0ad750a28c3a75c69" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">This section describes an implementation targeted to the computing environment: large clusters of commodity PCs connected together with switched Ethernet.</span></span></p></div><ul class="BulletedListWrapper"><li id="https://www.notion.so/0f8fe3a0a5df4dc3b2df22bf23c68f93" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">A cluster consists of hundreds of thousands of machines, and therefore machine failures are common.</span></span></li><li id="https://www.notion.so/9ab9fd21199140448ae0f55743ad4ba3" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">A distributed file system (GFS) is used to manage the data stored on the disks attached directly to individual machines. The file system uses replication to provide availability and reliability on top of unreliable hardware.</span></span></li><li id="https://www.notion.so/ec3539c99a794afa92bdf75fba082341" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.</span></span></li></ul><h3 id="https://www.notion.so/ffd71347b36741a084756a50e61f4bb4" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/ffd71347b36741a084756a50e61f4bb4"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">3.1 Execution Overview</span></span></h3><div id="https://www.notion.so/e3faf1eae93a49a0a011a33b85f23138" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">Map</em></span><span class="SemanticString"> invocations are distributed across multiple machines by automatically partitioning the input data into a set of </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">M splits</em></span><span class="SemanticString">. The input splits can be processed in parallel by different machines.</span></span></p></div><div id="https://www.notion.so/75081ac5922141beaaa6ed306ac8566c" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">Reduce</em></span><span class="SemanticString"> invocations are distributed by partitioning the intermediate key space into </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> pieces using a partitioning function (e.g., </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="hash(key) \mod R"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>h</mi><mi>a</mi><mi>s</mi><mi>h</mi><mo stretchy="false">(</mo><mi>k</mi><mi>e</mi><mi>y</mi><mo stretchy="false">)</mo><mspace></mspace><mspace width="0.6666666666666666em"/><mrow><mi mathvariant="normal">m</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">d</mi></mrow><mtext> </mtext><mtext> </mtext><mi>R</mi></mrow><annotation encoding="application/x-tex">hash(key) \mod R</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault">h</span><span class="mord mathdefault">a</span><span class="mord mathdefault">s</span><span class="mord mathdefault">h</span><span class="mopen">(</span><span class="mord mathdefault" style="margin-right:0.03148em;">k</span><span class="mord mathdefault">e</span><span class="mord mathdefault" style="margin-right:0.03588em;">y</span><span class="mclose">)</span><span class="mspace allowbreak"></span><span class="mspace" style="margin-right:0.6666666666666666em;"></span></span><span class="base"><span class="strut" style="height:0.69444em;vertical-align:0em;"></span><span class="mord"><span class="mord"><span class="mord mathrm">m</span><span class="mord mathrm">o</span><span class="mord mathrm">d</span></span></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mspace" style="margin-right:0.16666666666666666em;"></span><span class="mord mathdefault" style="margin-right:0.00773em;">R</span></span></span></span></span></span><span class="SemanticString">). The number of partitions (</span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString">) and the partitioning function are specified by the user.</span></span></p></div><div id="https://www.notion.so/38025bc81ea341fe8d53b250cec7417d" class="Image Image--PageWidth"><figure><a href="https://www.notion.so/signed/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F6f0b4f65-582f-4fc2-9e59-e058a924a615%2Fpdfresizer.com-pdf-crop.svg?width=672&table=block&id=38025bc8-1ea3-41fe-8d53-b250cec7417d"><img src="https://www.notion.so/signed/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F6f0b4f65-582f-4fc2-9e59-e058a924a615%2Fpdfresizer.com-pdf-crop.svg?width=672&table=block&id=38025bc8-1ea3-41fe-8d53-b250cec7417d" style="width:100%"/></a><figcaption><span class="SemanticStringArray"></span></figcaption></figure></div><div id="https://www.notion.so/c8e44bce772f4e7b874e86f22c3b5943" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">When the user program calls the </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">MapReduce</code></span><span class="SemanticString"> function, the following sequence of actions occurs:</span></span></p></div><ol class="NumberedListWrapper"><li id="https://www.notion.so/508f5ec8f8c54f14893aad04c1108775" class="NumberedList" value="1"><span class="SemanticStringArray"><span class="SemanticString">The MapReduce library in the user program first </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">splits the input files into </span></span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">M</span></em></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown"> pieces of typically 16 megabytes to 64 megabytes (MB) per piece</span></span><span class="SemanticString"> (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.</span></span></li><li id="https://www.notion.so/0d2a15ffdd2241acb19a1e37488dddbb" class="NumberedList" value="2"><span class="SemanticStringArray"><span class="SemanticString">One of the copies of the program is special — the master. The rest are workers that are assigned work by the master. There are </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">M</em></span><span class="SemanticString"> map tasks and </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.</span></span></li><li id="https://www.notion.so/960c721780e343f2a61ceda1f7535036" class="NumberedList" value="3"><span class="SemanticStringArray"><span class="SemanticString">A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">Map</code></span><span class="SemanticString"> function. The intermediate key/value pairs produced by the </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">Map</code></span><span class="SemanticString"> function are buffered in memory.</span></span></li><li id="https://www.notion.so/c6630496c94947f3b96cb9dc95d026b0" class="NumberedList" value="4"><span class="SemanticStringArray"><span class="SemanticString">Periodically, the buffered pairs are written to local disk, </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">partitioned into </span></span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">R</span></em></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown"> regions by the partitioning function</span></span><span class="SemanticString">. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.</span></span></li><li id="https://www.notion.so/83c5cd2b24d64cc999048707fa76d3b3" class="NumberedList" value="5"><span class="SemanticStringArray"><span class="SemanticString">When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.</span></span></li><li id="https://www.notion.so/7a93110926414d79a1348d9bb58a6cab" class="NumberedList" value="6"><span class="SemanticStringArray"><span class="SemanticString">The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">Reduce</code></span><span class="SemanticString"> function. The output of the </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">Reduce</code></span><span class="SemanticString"> function is appended to a final output file for this reduce partition.</span></span></li><li id="https://www.notion.so/c319e46b3d8e434c898faf5113ff5fe6" class="NumberedList" value="7"><span class="SemanticStringArray"><span class="SemanticString">When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the </span><span class="SemanticString"><code class="SemanticString__Fragment SemanticString__Fragment--Code">MapReduce</code></span><span class="SemanticString"> call in the user program returns back to the user code.</span></span></li></ol><div id="https://www.notion.so/abc8ef36c2ca45bd8cc42a6993e61687" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">After successful completion, the output of the mapreduce execution is available in the </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> output files into one file — they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.</span></span></p></div><h3 id="https://www.notion.so/0005e16fd3dd403a95cccec17001a332" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/0005e16fd3dd403a95cccec17001a332"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">3.2 Master Data Structures</span></span></h3><div id="https://www.notion.so/8d7a6fc5d9b641e281336f284b94f9b0" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">For each map task and reduce task, the master stores </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">the state (</span></span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">idle</span></em></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">, </span></span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">in-progress</span></em></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">, or </span></span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">completed</span></em></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">)</span></span><span class="SemanticString">, and </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">the identity of the worker machine (for non-idle tasks)</span></span><span class="SemanticString">.</span></span></p></div><div id="https://www.notion.so/ef0b9be82ff64a54960b447cc8f8d8aa" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">the locations and sizes of the </span></span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">R</span></em></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown"> intermediate file regions produced by the map task</span></span><span class="SemanticString">. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">in-progress</em></span><span class="SemanticString"> reduce tasks.</span></span></p></div><h3 id="https://www.notion.so/aecd2b296e74458a976af89054407ae9" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/aecd2b296e74458a976af89054407ae9"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">3.3 Fault Tolerance</span></span></h3><div id="https://www.notion.so/f389d7e9fcbe4d1e999258ff2607ce06" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Worker Failure</strong></span></span></p></div><div id="https://www.notion.so/83f45aae2f6a473aad2b3461ea361bf1" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed.</span></span></p></div><div id="https://www.notion.so/05d40ab2a9b5490bbba85aff6d776351" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">Completed map tasks</span></span><span class="SemanticString"> are re-executed on a failure because their </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">output is stored on the </span></span><span class="SemanticString"><mark class="SemanticString__Fragment SemanticString__Fragment--HighlightedColor SemanticString__Fragment--ColorRed"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">local disk(s)</span></mark></span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown"> of the failed machine</span></span><span class="SemanticString"> and is therefore inaccessible. </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">Completed reduce tasks</span></span><span class="SemanticString"> do not need to be re-executed since their </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">output is stored in </span></span><span class="SemanticString"><mark class="SemanticString__Fragment SemanticString__Fragment--HighlightedColor SemanticString__Fragment--ColorRed"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">a global file system</span></mark></span><span class="SemanticString">.</span></span></p></div><div><div></div><div></div><div></div></div><div id="https://www.notion.so/aa340d0873c64fd38f8de9b1273ee992" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">When a map task is executed first by worker </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">A</em></span><span class="SemanticString"> and then later executed by worker </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">B</em></span><span class="SemanticString"> (because </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">A</em></span><span class="SemanticString"> failed), all workers executing reduce tasks are notified of the re-execution. Any reduce tasks that has not already read the data from worker </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">A</em></span><span class="SemanticString"> will read the data from worker </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">B</em></span><span class="SemanticString">.</span></span></p></div><div id="https://www.notion.so/54e4e441467d459b9882e277cec591a3" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Master Failure</strong></span></span></p></div><div id="https://www.notion.so/4194ef6cf77c4a6dad3270713ce09862" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The master writes periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state.</span></span></p></div><div id="https://www.notion.so/5492c9221cdc47ae9342aa24977716ee" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">(However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.)</span></span></p></div><div id="https://www.notion.so/5394dfb0ea424b269f62efb487bd6c3f" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString"><strong class="SemanticString__Fragment SemanticString__Fragment--Bold">Semantics in the Presence of Failures</strong></span></span></p></div><div id="https://www.notion.so/1b82b40e32e046469104e420240ab7c3" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">When the user-supplied </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">map</em></span><span class="SemanticString"> and </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">reduce</em></span><span class="SemanticString"> operators are </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">deterministic</span></span><span class="SemanticString"> functions of their input values, our distributed implementation produces the same output as would have been produced by a non-faulting </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">sequential execution</span></span><span class="SemanticString"> of the entire program.</span></span></p></div><div id="https://www.notion.so/c9ca5bf6a30345cfb441e09d8c5471a6" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">We rely on </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">atomic commits of map and reduce task outputs</span></span><span class="SemanticString"> to achieve this property.</span></span></p></div><div id="https://www.notion.so/4dca4cf9230c43c89aa819f955a9c246" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">Each in-progress task writes its output to private temporary files. A reduce task produces one such file, and a map task produces </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> such files (one per reduce task).</span></span></p></div><ul class="BulletedListWrapper"><li id="https://www.notion.so/d393e0b0a8a1461692153cbf1e5622e4" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">When a map task completes, the worker sends a message to the master and includes the names of the </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> temporary files in the message. </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">If the master receives a completion message for an already completed map task, it ignores the message.</span></span><span class="SemanticString"> Otherwise, it records the name of </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> files in a master data structure.</span></span></li><li id="https://www.notion.so/4b6304e25f584447b442a748aaa8d8cc" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file. We rely on the </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">atomic rename operation provided by the underlying file system</span></span><span class="SemanticString"> to guarantee that the final file system state contains just the data produced by one execution of the reduce task.</span></span></li></ul><div id="https://www.notion.so/dcc0c769e06d40608c70ec556a0a7c09" class="Divider"></div><div id="https://www.notion.so/5cf55bc5089b4dcc870fc0434fe8ef7c" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">When the </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">map</em></span><span class="SemanticString"> and/or </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">reduce</em></span><span class="SemanticString"> operators are </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">non-deterministic</span></span><span class="SemanticString">, we provide weaker but still reasonable semantics. In the presence of non-deterministic operators, the output of a particular reduce task </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_1"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">R_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString"> is equivalent to the output for </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_1"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">R_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString"> produced by a sequential execution of the non-deterministic program. However, the output for a different reduce task </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_2"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">R_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString"> may correspond to the output for </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_2"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">R_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString"> produced by a different sequential execution of the non-deterministic program.</span></span></p></div><div id="https://www.notion.so/45c75304a54e48f3a9f59dbe5207ab27" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">Consider map task </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="M"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>M</mi></mrow><annotation encoding="application/x-tex">M</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.10903em;">M</span></span></span></span></span></span><span class="SemanticString"> and reduce tasks </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_1"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">R_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString"> and </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_2"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">R_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString">. Let </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="e(R_i)"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>e</mi><mo stretchy="false">(</mo><msub><mi>R</mi><mi>i</mi></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">e(R_i)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault">e</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight">i</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span></span><span class="SemanticString"> be the execution of </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="R_i"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>R</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">R_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.83333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.31166399999999994em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathdefault mtight">i</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span></span><span class="SemanticString"> that committed (there is exactly one such execution). The weaker semantics arise because </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="e(R_1)"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>e</mi><mo stretchy="false">(</mo><msub><mi>R</mi><mn>1</mn></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">e(R_1)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault">e</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span></span><span class="SemanticString"> may have read the output produced by one execution of </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="M"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>M</mi></mrow><annotation encoding="application/x-tex">M</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.10903em;">M</span></span></span></span></span></span><span class="SemanticString"> and </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="e(R_2)"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>e</mi><mo stretchy="false">(</mo><msub><mi>R</mi><mn>2</mn></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">e(R_2)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault">e</span><span class="mopen">(</span><span class="mord"><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.30110799999999993em;"><span style="top:-2.5500000000000003em;margin-left:-0.00773em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span></span><span class="SemanticString"> may have read the output produced by a different execution of </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="M"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>M</mi></mrow><annotation encoding="application/x-tex">M</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.68333em;vertical-align:0em;"></span><span class="mord mathdefault" style="margin-right:0.10903em;">M</span></span></span></span></span></span><span class="SemanticString">.</span></span></p></div><h3 id="https://www.notion.so/699cd57da6c54f3595e5a3d9a75667eb" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/699cd57da6c54f3595e5a3d9a75667eb"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">3.4 Locality</span></span></h3><div id="https://www.notion.so/285a244b11944f0da752b3dc8f12d574" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up our cluster.</span></span></p></div><div id="https://www.notion.so/96ca9f12b0c944a9ad989ee074574583" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines. The MapReduce master takes the location information of the input files into account and </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Unknown">attempts to schedule a map task on a machine that contains a replica of the corresponding input data</span></span><span class="SemanticString">. Failing that, it attempts to schedule a map task near a replica of that task’s input data (e.g., on a worker machine that is on the same network switch as the machine containing the data).</span></span></p></div><h3 id="https://www.notion.so/9e74a32c4dfa4b47a873164bfc6cdfd8" class="ColorfulBlock ColorfulBlock--ColorDefault Heading Heading--3"><a class="Anchor" href="#https://www.notion.so/9e74a32c4dfa4b47a873164bfc6cdfd8"><svg width="16" height="16" viewBox="0 0 16 16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg></a><span class="SemanticStringArray"><span class="SemanticString">3.5 Task Granularity</span></span></h3><div id="https://www.notion.so/694b7bab45ae48b9a9612bdc0b1564fb" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">We subdivide the map phase into </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">M</em></span><span class="SemanticString"> pieces and the reduce phase into </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> pieces. Ideally, </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">M</em></span><span class="SemanticString"> and </span><span class="SemanticString"><em class="SemanticString__Fragment SemanticString__Fragment--Italic">R</em></span><span class="SemanticString"> should be much larger than the number of worker machines.</span></span></p></div><ul class="BulletedListWrapper"><li id="https://www.notion.so/34321eeda14c47f8a3cf7662ebdaa1dc" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">improve dynamic load balancing</span></span></li><li id="https://www.notion.so/42dcf272be7f470997be3fec62e6b48c" class="BulletedList"><span class="SemanticStringArray"><span class="SemanticString">speed up recovery when a worker fails: completed map tasks can be spread out across all the other worker machines</span></span></li></ul><div id="https://www.notion.so/8acfda21bbba4e42ad988dee5c046507" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">Practical bounds:</span></span></p></div><div id="https://www.notion.so/8edb4ca57ef44400b475db0758afe092" class="ColorfulBlock ColorfulBlock--ColorDefault Text"><p class="Text__Content"><span class="SemanticStringArray"><span class="SemanticString">The master must take </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="O(M+R)"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><mi>M</mi><mo>+</mo><mi>R</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(M+R)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord mathdefault" style="margin-right:0.10903em;">M</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="mclose">)</span></span></span></span></span></span><span class="SemanticString"> scheduling decisions and keeps </span><span class="SemanticString"><span class="SemanticString__Fragment SemanticString__Fragment--Math" data-latex="O(M*R)"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><mi>M</mi><mo>∗</mo><mi>R</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(M*R)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.02778em;">O</span><span class="mopen">(</span><span class="mord mathdefault" style="margin-right:0.10903em;">M</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222222222222222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathdefault" style="margin-right:0.00773em;">R</span><span class="mclose">)</span></span></span></span></span></span><span class="SemanticString"> state in memory.</span></span></p></div></article>
<footer class="Footer">
<div>© Patrick’s Blog 2026</div>
<div>·</div>
<div>Powered by <a href="https://github.com/dragonman225/notablog" target="_blank"
rel="noopener noreferrer">Notablog</a>.
</div>
</footer>
</body>
</html>