-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbig-picture.html
307 lines (296 loc) · 17 KB
/
big-picture.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>The big picture</title>
<link rel="stylesheet" href="/theme/css/main.css" />
<!--[if IE]>
<script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body id="index" class="home">
<header id="banner" class="body">
<h1><a href="/">And yet it moves! </a></h1>
<nav><ul>
<li class="active"><a href="/category/data-processing.html">Data processing</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header>
<h1 class="entry-title">
<a href="/big-picture.html" rel="bookmark"
title="Permalink to The big picture">The big picture</a></h1>
</header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2016-06-09T13:10:00+01:00">
Published: jeu. 09 juin 2016
</abbr>
<address class="vcard author">
By <a class="url fn" href="/author/gael.html">Gaël</a>
</address>
<p>In <a href="/category/data-processing.html">Data processing</a>.</p>
</footer><!-- /.post-info --> <p>This post is the sixth technical post of the <a href="{filename}01-index.md">series</a>.</p>
<p><em>We saw in the <a href="{filename}07-maximizing_reward.md">previous post</a> that the
multi-armed bandit strategy does not systematically beat
the A/B testing data-gathering process in gathering the most clicks overall.
The main thing that is really lacking at this stage is an estimate of when
one method yields more clicks that the other.</em></p>
<p><em>This is the reason why we are going to perform a few parametric studies. As
every variables are pretty decorrelated from each other, we are going to study
their effect on the total click-through rate one after the other. That said,
there is still a very important variable that should be represented in each
case, the sample size, because this is the main question we have to answer
before launching the campaign.</em></p>
<h1 id="sample-size-of-the-testing-stage">Sample size of the testing stage</h1>
<p>As we have already explained, an A/B testing campaign is usually divided into
two stages: <em>testing</em> (where all the variants are served, the data gathered
and then analysed) and <em>exploitation</em> (where the best variant is served). </p>
<p>We would like to know the effect of changing
the ratio of these two stages on the number of clicks (while keeping the size
of the population, i.e. the totality of our visitors, constant).</p>
<p>But the number of clicks is not that easy to reason about as the sample size
changes. That is why we are going to normalize them by the population size
to get the total click-through rate for each setup.</p>
<p>This is what we did to get the following figure:</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
" src="/images/clicks_vs_sample_size.svg"/>
</center></p>
<p>There is a <em>lot</em> to say about this figure:</p>
<ol>
<li>First we see that at the smallest sample sizes, the number of clicks
increases sharply until around <span class="math">\(500\)</span> samples and then starts to fall
linearly toward a given final value.
That final value is known for each method (given that the total sample size is
large enough):<ul>
<li>In A/B testing, this is the average of the <span class="math">\(P(O|X)\)</span> (here <span class="math">\((0.2+0.1)/2 =
0.15\)</span>).</li>
<li>In the multi-armed bandit method, it should be equal to
<div class="math">$$0.2\, (\varepsilon + [1-\varepsilon]/\text{number of events}) + 0.1\, ([1-\varepsilon]/\text{number of events}).$$</div>
</li>
</ul>
</li>
<li>The click-through rates at smaller sample sizes are smaller because
we tend to wrongly choose the worse variant more.</li>
<li>The optimum value is a trade-off between being wrong because of the small
sample size and spending too many trials in the testing stage at the
expense of the exploitation stage.</li>
<li>The final value obtained in the multi-armed bandit framework corresponds
to the <em>set-and-forget</em> setup which seems to perform worse that the
two-staged setup for adequate sample sizes.</li>
<li>The A/B testing method seems to indeed beat the multi-armed bandit strategy for
small sample sizes:</li>
</ol>
<p><center>
<img alt="Here should be a figure (try with another web browser).
" src="/images/closer_clicks_vs_sample_size.svg"/>
</center></p>
<p>This allows us to conclude that:</p>
<ul>
<li>The A/B testing method is indeed competitive and can provide similar or
higher numbers of clicks than the multi-armed bandit strategy.</li>
<li>It seems to do so at lower sample sizes (i.e. requires shorter testing stages).</li>
<li>The <em>set-and-forget</em> setup is not optimal in the sense that the two-staged
algorithm is more likely to give mores clicks.</li>
<li>Yet, the set-and-forget algorithm is much simpler to setup than the other
options (basically nothing to think about, to calculate or to design).</li>
<li>The two-staged setup of the multi-armed bandit strategy is interesting
because it provides a longer "sweet spot", i.e. the sample sizes yielding the
highest overall number of clicks spreads over thousands of trials while the
equivalent A/B testing setup spreads over hundreds of trials only.</li>
</ul>
<p>Still, we can hardly justify that these results can be generalized, that's
why we need to perform a few parametric studies.</p>
<h1 id="population-size">Population size</h1>
<p>The first parameter we want to alter is the population size
(total number of visitors, in the testing stage and afterwards):
what would happen to the total click-through rates vs. the (testing) sample size
if the (total) population size was different?</p>
<p>We want to represent 3 variables here:</p>
<ol>
<li>the total click-through rate depending on both</li>
<li>the sample size</li>
<li>and the population size.</li>
</ol>
<p>In order to make things clearer, we are going to represent the click-through
rate as an image using a color code: the blue patches represent areas where
the A/B testing provides a higher click-through rate overall, green patches
represent areas where the multi-armed bandit strategy provides the
highest click-through rates and the white patches represent the areas where
both methods provide the similar numbers of clicks.</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
" src="/images/diff_vs_sample_size_vs_tot.svg"/>
</center></p>
<p>The left-hand part of the figure is mainly blue: this means that the A/B
testing data-gathering method does provide the highest
click-through rates at lower sample sizes for any population sizes, just like
we have already observed above.</p>
<p>Yet at the smallest population sizes, that blue domain is much smaller and the
green patches are much more present and stronger; the domain where the A/B
testing data gathering method is more efficient is much smaller.</p>
<p>As we have already stated in the previous posts, there is a trade-off between
correctly choosing the most efficient variant and spending more time in the
testing stage:</p>
<ul>
<li>If the population size is small, the relative time spent in
the testing stage in large. In that case, the multi-armed bandit method
should provide more clicks: it was precisely designed to maximize the number
of clicks in the testing stage.</li>
<li>If the population size is large, the relative time spent in the testing
stage should quickly become negligible. In that case, the higher
correctness provided by the A/B testing data-gathering method should give
it the advantage.</li>
</ul>
<p>This is exactly what we observe here. So the larger the population, the wider
the range of superiority of A/B testing.</p>
<h1 id="epsilon">Epsilon</h1>
<p>The multi-armed bandit strategy is defined with one extra parameter:
<span class="math">\(\varepsilon\)</span> (epsilon). This is the parameter that defines the values
of <span class="math">\(P(\text{favored})\)</span> and <span class="math">\(P(\text{disfavored})\)</span>. Changing its value from
the default <span class="math">\(90\%\)</span> may also change the range of superiority of the A/B testing
data-gathering method. This is what we represented in the following figure:</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
" src="/images/epsilon_vs_sample_size_vs_tot.svg"/>
</center></p>
<p>We can see that the uppermost part of the figure
(for <span class="math">\(\varepsilon \approx 100\%\)</span> is entirely blue. This is caused by the
inability of the algorithm to explore other options than the default one.</p>
<p>On the other side (<span class="math">\(\varepsilon \approx 0\%\)</span>), the line is mostly white. This
was expected: in this case, there is no difference between the A/B testing data
gathering method and the multi-armed bandit one.</p>
<p>There seems to be a green (i.e. multi-armed bandit) sweet spot around
<span class="math">\(\varepsilon \approx 70-90\%\)</span>. It is not clear whether or not that spot
moves with other parameters.</p>
<p>As in the previous analysis, the leftmost part of the figure is mostly blue
while the rightmost part is mostly green. The same explanation applies.</p>
<p>So higher values of <span class="math">\(\varepsilon\)</span> counter-intuitively tend to favor A/B
testing, values around <span class="math">\(70-90\%\)</span> tend to favor the multi-armed bandit method
while lower values tend to make the differences disappear.</p>
<h1 id="difference-in-click-through-rates">Difference in click-through rates</h1>
<p>This time, let <span class="math">\(P(O|A) = 10\%\)</span> and define <span class="math">\(P(O|B) = P(O|A) + {\Delta}P(O|X)\)</span>.
<span class="math">\({\Delta}P(O|X)\)</span> represents the difference in click-through rate and we are
going to see how the methods behave by changing this:</p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
" src="/images/POXdiff_vs_sample_size_vs_tot.svg"/>
</center></p>
<p>As we can see, the bottom line is completely white: when there is no
difference, there is no reason that any of the methods would guess right more
often than not.</p>
<p>Then the blue patches prevail on the leftmost side of the figure and almost
disappear around <span class="math">\(30\%\)</span>, leaving more room for green patches.
As we said earlier, the A/B testing data-gathering method is better
at identifying the best variant at smaller sample sizes than the multi-armed
bandit strategy. However, when the variants yield completely different
click-through rates, that "correctness" advantage of the A/B testing
data-gathering method whithers and the multi-armed bandit method takes over.</p>
<p>That said, for differences that large, we do not expect the A/B testing
data-gathering method to become universally worse than the multi-armed bandit
strategy: A/B testing is supposed to remain superior (or as good) but for even
lower sample sizes (below, outside of the represented range).</p>
<p>So the low difference tend to increase the range of superiority of the A/B
testing data-gathering method while larger differences do reduce it.</p>
<h1 id="click-through-rate-base-value">Click-through rate base value</h1>
<p>As a last parameter, we would like to see the impact of offsetting the
click-through rates while keeping their difference constant. In the following
figure, the offset represents the position of the central value </p>
<p><center>
<img alt="Here should be a figure (try with another web browser).
" src="/images/POXorig_vs_sample_size_vs_tot.svg"/>
</center></p>
<p>As usual, blue patches on the left, green patches on the right. Same
explanation.</p>
<p>The largest superiority range of A/B testing is obtained for an offset
of <span class="math">\(30-35\%\)</span> which, in fact corresponds to <span class="math">\(P(O|A) = 40-45\%\)</span> and
<span class="math">\(P(O|B) = 55-60\%\)</span>, i.e. when the click-through rates are centered around
<span class="math">\(50\%\)</span>.</p>
<p><em>The purpose of this post was to determine when the A/B testing data-gathering
method was actually beaten by the multi-armed bandit strategy. As it turned
out, the multi-armed bandit strategy performs better when the sample size of
the testing stage is not negligible compared to the (total) population size.
We also demonstrated that it tends to perform better for epsilon values around
<span class="math">\(\varepsilon \approx 40-90\%\)</span>. Finally, it performs better for large <span class="math">\(P(O|X)\)</span>
differences, especially if at least one is close to <span class="math">\(0\%\)</span> or <span class="math">\(100\%\)</span>.</em></p>
<p><em>On the other hand, the A/B testing data-gathering method performed the best
at the smallest sample sizes in all these cases.</em></p>
<p><em>The original claim that the multi-armed bandit method yield a
higher number of clicks is therefore not true -- not even generally.
Yet the multi-armed bandit strategy does provide some advantages
like its ease of use (in the case of the set-and-forget setup) or its
capability to adapt to poorly designed testing campaigns while retaining
acceptable click-through rates.</em></p>
<p><em>In the <a href="/ab-mab-conclusion.html">next and last post</a>,
we are going to talk a little bit about
uncertainties and finally conclude.</em></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
</div><!-- /.entry-content -->
</article>
</section>
<section id="extras" class="body">
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://getpelican.com/">Pelican</a>, which takes great advantage of <a href="http://python.org">Python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
</body>
</html>