<!DOCTYPE HTML>
<!--
Miniport by HTML5 UP
html5up.net | @ajlkn
Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)
-->
<html>
<head>
<title>PH Twitter Fake News Analysis</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<link rel="stylesheet" href="assets/css/main.css" />
<style>
td {
border: 1px solid black;
color: black;
text-align: center;
vertical-align: middle;
}
th {
border: 1px solid black;
color: black;
text-align: center;
width: 200px;
font-weight: 700;
}
.center-inline-img{
max-height: 500px;
max-width: 1300px;
margin: 0 auto;
display: block;
}
</style>
</head>
<body class="is-preload">
<!-- Nav -->
<nav id="nav">
<ul class="container">
<li><a href="index.html">Home</a></li>
<li><a href="overview.html">Motivation</a></li>
<li><a href="exploration.html">Data Exploration</a></li>
<li><a href="modeling.html">Modeling</a></li>
<li><a href="communication.html">Communication</a></li>
</ul>
</nav>
<!-- Home -->
<article id="top" class="wrapper style1">
<div class="container">
<div class="row">
<div>
<header>
<h1>Statistical Modeling</h1>
</header>
<h2>Normality and Equal Variances Tests</h2>
<p>
Before proceeding with the statistical tests, the data was split into three categories: before, during, and after the campaign period.
<br><br>
The Anderson-Darling and Shapiro-Wilk Tests were first applied to check whether each of the split datasets is normally distributed. The Levene Test was then performed on the same datasets to check whether their variances are equal.
</p>
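<p>
As a sketch, the split can be reproduced on a date-indexed series of daily tweet counts. The column names, placeholder counts, and boundary dates below are illustrative assumptions, not the exact values used in this analysis.
</p>

```python
# Hypothetical split of daily tweet counts into pre-campaign, campaign,
# and post-campaign periods. Column names and dates are assumptions.
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2021-09-01", "2022-10-31", freq="D")})
df["tweet_count"] = 0  # placeholder daily counts

# Assumed campaign window, for illustration only.
campaign_start = pd.Timestamp("2022-02-08")
campaign_end = pd.Timestamp("2022-05-07")

pre = df[df["date"] < campaign_start]
during = df[(df["date"] >= campaign_start) & (df["date"] <= campaign_end)]
post = df[df["date"] > campaign_end]
```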
<p>
<strong>Anderson-Darling Test:</strong>
<br>
<br>
&nbsp;H<sub>o</sub>: The data is normally distributed.
<br>
&nbsp;H<sub>a</sub>: The data is not normally distributed.
</p>
<table>
<tr>
<th>
Dataset
</th>
<th>
A Test Statistic
</th>
<th>
Critical Values
</th>
<th>
Significance Level
</th>
</tr>
<tr>
<td rowspan="3">
Pre-campaign
</td>
<td rowspan="3">
2.419
</td>
<td>
0.705
</td>
<td>
0.050
</td>
</tr>
<tr>
<td>
0.822
</td>
<td>
0.025
</td>
</tr>
<tr>
<td>
0.978
</td>
<td>
0.010
</td>
</tr>
<tr>
<td rowspan="3">
Campaign
</td>
<td rowspan="3">
7.718
</td>
<td>
0.718
</td>
<td>
0.050
</td>
</tr>
<tr>
<td>
0.838
</td>
<td>
0.025
</td>
</tr>
<tr>
<td>
0.996
</td>
<td>
0.010
</td>
</tr>
<tr>
<td rowspan="3">
Post-campaign
</td>
<td rowspan="3">
4.636
</td>
<td>
0.703
</td>
<td>
0.050
</td>
</tr>
<tr>
<td>
0.820
</td>
<td>
0.025
</td>
</tr>
<tr>
<td>
0.975
</td>
<td>
0.010
</td>
</tr>
</table>
<p>
<strong>Shapiro-Wilk Test:</strong>
<br>
<br>
&nbsp;H<sub>o</sub>: The data is normally distributed.
<br>
&nbsp;H<sub>a</sub>: The data is not normally distributed.
</p>
<table>
<tr>
<th>
Dataset
</th>
<th>
P-value
</th>
</tr>
<tr>
<td>
Pre-campaign
</td>
<td>
5.070e-05
</td>
</tr>
<tr>
<td>
Campaign
</td>
<td>
5.648e-09
</td>
</tr>
<tr>
<td>
Post-campaign
</td>
<td>
9.822e-08
</td>
</tr>
</table>
<p>
Given that the test statistics for all datasets are substantially larger than their respective critical values for the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html">Anderson-Darling</a> Test, the null hypothesis is rejected and it can be concluded that the three datasets are not normally distributed. Similarly, since the p-values for the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html">Shapiro-Wilk</a> Test across all datasets are less than the 0.05 level of significance, the null hypothesis is rejected, further confirming that the datasets are not normal. Because the data is not normal, a non-parametric statistical test must be used.
</p>
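<p>
A minimal sketch of how both normality tests can be run with SciPy. The sample below is synthetic and only stands in for one of the split datasets; the decision rules mirror the interpretation above.
</p>

```python
# Normality checks with SciPy on a synthetic (clearly non-normal) sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=60)  # stand-in for daily tweet counts

# Anderson-Darling: compare the statistic against the 5% critical value.
ad = stats.anderson(data, dist="norm")
crit_5pct = ad.critical_values[list(ad.significance_level).index(5.0)]

# Shapiro-Wilk: compare the p-value against 0.05.
sw_stat, sw_p = stats.shapiro(data)

not_normal = (ad.statistic > crit_5pct) or (sw_p < 0.05)
```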
<p>
<strong>Levene Test:</strong>
<br>
<br>
&nbsp;H<sub>o</sub>: The datasets have equal variances.
<br>
&nbsp;H<sub>a</sub>: The datasets do not have equal variances.
</p>
<table>
<tr>
<th>
Datasets compared
</th>
<th>
P-value
</th>
</tr>
<tr>
<td>
Pre-campaign vs Campaign vs Post-campaign
</td>
<td>
0.048
</td>
</tr>
</table>
<p>
The p-value for the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html">Levene</a> Test (0.048) is less than the 0.05 level of significance, so the null hypothesis is rejected: the datasets do not have equal variances.
</p>
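<p>
A small sketch of the Levene Test with SciPy. The three groups below are synthetic stand-ins with deliberately unequal spread.
</p>

```python
# Levene's test for equal variances across three groups (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pre = rng.normal(0, 1, 60)
during = rng.normal(0, 1, 90)
post = rng.normal(0, 3, 60)  # noticeably larger variance

stat, p = stats.levene(pre, during, post)
equal_variances = p >= 0.05  # reject equal variances when p < 0.05
```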
<h2>Kruskal-Wallis Test</h2>
<p>
First, the <a href="https://stats.libretexts.org/Courses/Las_Positas_College/Math_40%3A_Statistics_and_Probability/12%3A_Nonparametric_Statistics/12.11%3A_KruskalWallis_Test">assumptions</a> for the Kruskal-Wallis Test are as follows:
<br>
<br>
 1. The data does not have to be normally distributed.
<br>
 2. The data must have equal variances.
<br>
<br>
From the section on Normality and Equal Variances Tests, Assumption 2 is violated by the current dataset. As a result, the medians or means cannot be compared, since even minor differences in the variances can result in higher error rates <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.3561">(Fagerland & Sandvik, 2009)</a>. Therefore, in this case the Kruskal-Wallis Test can only indicate whether there is a statistical difference between the groups, i.e., whether a sample from one group tends to have significantly different values than samples from the other groups.
<br>
<br>
The data in the three categories (pre-campaign, campaign, post-campaign) will be compared using the Kruskal-Wallis Test to answer the following hypotheses:
<br>
<br>
&nbsp;H<sub>o</sub>: There is no significant difference in the number of BBM credit-grabbing tweets before, during, and after the campaign period.
<br>
&nbsp;H<sub>a</sub>: There is a significant difference in the number of BBM credit-grabbing tweets before, during, and after the campaign period.
<br>
<br>
The table below presents the test statistic and p-value from performing the Kruskal-Wallis Test on the non-normal split datasets at the 0.05 level of significance using SciPy.
</p>
<table>
<tr>
<th>
Datasets compared
</th>
<th>
H Test statistic
</th>
<th>
&chi;<sup>2</sup> critical value (df=2)
</th>
<th>
P-value
</th>
<th>
Significance Level
</th>
</tr>
<tr>
<td>
Pre-campaign vs Campaign vs Post-campaign
</td>
<td>
5.423
</td>
<td>
5.991
</td>
<td>
0.066
</td>
<td>
0.05
</td>
</tr>
</table>
<p>
Observe that for the <a href="https://www.sciencedirect.com/topics/mathematics/kruskal-wallis-test">Kruskal-Wallis Test</a>, the test statistic is less than the critical value and the p-value is greater than the significance level; thus, we fail to reject the null hypothesis. In other words, there is no significant difference in the number of BBM credit-grabbing tweets before, during, and after the campaign period. To explore this finding further, a post hoc analysis was also performed using the Mann-Whitney U Test.
</p>
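<p>
The test and the &chi;<sup>2</sup> critical value from the table can be computed with SciPy as sketched below; the three samples are synthetic stand-ins for the split datasets.
</p>

```python
# Kruskal-Wallis H test with SciPy, plus the chi-squared critical value
# (df = 2 for three groups) used in the table above. Samples are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pre = rng.exponential(2.0, 60)
during = rng.exponential(2.0, 90)
post = rng.exponential(2.0, 60)

h, p = stats.kruskal(pre, during, post)
crit = stats.chi2.ppf(0.95, df=2)  # about 5.991

# Under the chi-squared approximation, both decision rules agree:
# reject when the statistic exceeds the critical value, i.e. when p < 0.05.
reject = h > crit
```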
<h2>Mann-Whitney U Test</h2>
<p>
First, it must be noted that the Kruskal-Wallis Test is equivalent to the Mann-Whitney U Test extended to more than two groups <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881615/">(du Prel et al., 2010)</a>; hence, the same assumptions apply. With the Mann-Whitney U Test, pairwise comparisons are made with the campaign period as the point of reference.
<br>
<br>
To further check for possible statistical significance, a one-tailed Mann-Whitney U Test was performed as a post hoc analysis of the pairwise groups with the following hypotheses:
<br>
<br>
&nbsp;H<sub>o</sub>: There is no significant difference in the number of BBM credit-grabbing tweets between the campaign period and the period being compared.
<br>
&nbsp;H<sub>a</sub>: BBM credit-grabbing tweets are significantly fewer during the campaign period than during the period being compared.
<br>
<br>
</p>
<table>
<tr>
<th>
Datasets compared
</th>
<th>
U Test statistic
</th>
<th>
Critical value
</th>
<th>
P-value
</th>
<th>
Significance Level
</th>
</tr>
<tr>
<td>
Campaign vs Pre-campaign
</td>
<td>
309
</td>
<td>
331
</td>
<td>
0.009
</td>
<td>
0.05
</td>
</tr>
<tr>
<td>
Campaign vs Post-campaign
</td>
<td>
382
</td>
<td>
317
</td>
<td>
0.201
</td>
<td>
0.05
</td>
</tr>
</table>
<p>
Contrary to the results of the Kruskal-Wallis Test, the one-tailed <a href="https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Mostly_Harmless_Statistics_(Webb)/13%3A_Nonparametric_Tests/13.05%3A__Mann-Whitney_U_Test">Mann-Whitney U Test</a> shows that for the campaign and pre-campaign datasets, the test statistic is less than the critical value and the p-value is less than the significance level; therefore, the null hypothesis is rejected. This implies that there were significantly fewer BBM credit-grabbing tweets during the campaign period than during the pre-campaign period. However, for the campaign vs post-campaign comparison, the test statistic and the p-value are greater than the critical value and the significance level, respectively. This implies that there is no significant difference between the number of BBM credit-grabbing tweets during and after the campaign period.
<br>
<br>
To analyze the possible causes of these results, a computational model was applied to the split datasets.
</p>
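<p>
A sketch of the one-tailed Mann-Whitney U Test with SciPy, mirroring the campaign vs pre-campaign comparison; the daily counts below are synthetic stand-ins.
</p>

```python
# One-tailed Mann-Whitney U test: are campaign counts stochastically smaller?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
campaign = rng.poisson(1, 30)      # fewer tweets per day (synthetic)
pre_campaign = rng.poisson(4, 30)  # more tweets per day (synthetic)

# alternative="less": H_a is that the first sample tends to be smaller.
u, p = stats.mannwhitneyu(campaign, pre_campaign, alternative="less")
reject = p < 0.05
```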
</div>
</div>
<div class="row">
<div>
<header>
<h1>Computational Modeling</h1>
</header>
<h2>Event Detection Modeling</h2>
<p>
An event detection model was selected as the computational model for analyzing the dataset. Event detection modeling identifies change points and peaks in the data. Using these outputs, the dates with the largest numbers of disinformation tweets and the dates indicating a change in pattern were used to find events that contributed to the dataset's pattern.
</p>
<h2>PELT Algorithm</h2>
<p>
The algorithm used in the event detection model was the Pruned Exact Linear Time (PELT) algorithm. PELT has one parameter, the penalty value, which determines the number of change points the model reports. Lowering the penalty value increases the sensitivity and yields more change points; hence, a higher penalty is desired to avoid including dates that may be irrelevant. After a few tweaks to this parameter, a penalty value of 4 was chosen, with the results shown below.
</p>
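<p>
The source does not name the implementation used, so the following is a compact, generic sketch of the PELT dynamic program with an L2 (mean-shift) cost. Penalty values are not directly comparable across cost functions, so the value 4 here is illustrative rather than equivalent to the one chosen above.
</p>

```python
# Generic sketch of PELT with an L2 (mean-shift) cost; not the exact
# implementation used in this analysis.
import numpy as np

def pelt_l2(x, pen):
    """Return segment-end indices (change points), including len(x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    csum = np.concatenate([[0.0], np.cumsum(x)])
    csum2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

    def cost(s, t):  # L2 cost of fitting a single mean to x[s:t]
        seg = csum[t] - csum[s]
        return (csum2[t] - csum2[s]) - seg ** 2 / (t - s)

    F = np.full(n + 1, np.inf)
    F[0] = -pen
    last = np.zeros(n + 1, dtype=int)
    candidates = [0]
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + pen for s in candidates]
        best = int(np.argmin(vals))
        F[t], last[t] = vals[best], candidates[best]
        # Pruning step: discard s that can never be optimal again.
        candidates = [s for s, v in zip(candidates, vals) if v - pen <= F[t]]
        candidates.append(t)

    cps, t = [], n
    while t > 0:  # backtrack the optimal segmentation
        cps.append(t)
        t = last[t]
    return sorted(cps)

# A clear mean shift at index 50 is recovered with penalty 4.
change_points = pelt_l2([0.0] * 50 + [5.0] * 50, pen=4)
```

<p>
In practice a maintained package such as ruptures provides PELT with several cost models; the sketch above only illustrates the prune-and-recurse structure that gives the algorithm its favorable running time.
</p>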
<img src="images/event-modeling-penalty-4.JPG" class="center-inline-img">
<p>
The image above shows the results of choosing a penalty value of 4. The same results are obtained when the penalty value is lowered to 3, which suggests that these dates are significant for our findings.
</p>
<img src="images/event-modeling-penalty-5.JPG" class="center-inline-img">
<p>
Changing the penalty value to 5 yields no change points, as shown above. Hence, to keep the sensitivity low while still detecting change points, the penalty value of 4 was chosen.
</p>
<img src="images/event-modeling-peaks.JPG" class="center-inline-img">
<p>
The relevant parameter for peak detection is the minimum height of a considered peak, which we set to 5. Lowering the minimum height to 4 adds two more peaks that might be irrelevant to our findings; moreover, a height of 4 would be just one tweet away from numerous points in the data.
</p>
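<p>
Peak picking with a minimum-height threshold can be sketched with scipy.signal.find_peaks (the exact routine used is not stated in the source); the series below is synthetic.
</p>

```python
# Peaks above a minimum height of 5 in a synthetic daily-count series.
import numpy as np
from scipy.signal import find_peaks

daily_counts = np.array([1, 2, 7, 1, 0, 3, 6, 2, 1, 9, 1])

# `peaks` holds the indices of local maxima whose count meets the threshold.
peaks, props = find_peaks(daily_counts, height=5)
```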
<h2>Interpreting and Explaining the Results via Real-life Events</h2>
<img src="images/event-modeling-labels.JPG" class="center-inline-img">
<p>
The dates above were gathered from the event detection modeling. The peak dates are October 20, 2021, January 30, 2022, and September 23, 2022. The change point dates are October 21, 2021, January 24, 2022, and May 29, 2022.
</p>
<img src="images/event-modeling-first-dates.JPG" class="center-inline-img">
<p>
The first dates obtained from the peaks and change points are October 20, 2021, and October 21, 2021. Multiple tweets can be seen throughout that month, during which BBM filed his Certificate of Candidacy (CoC) on October 6, 2021
<a href="https://www.pna.gov.ph/articles/1155753">(PNA, 2021)</a>
. The number of tweets dropped after this month.
</p>
<img src="images/event-modeling-second-dates.JPG" class="center-inline-img">
<p>
Tweet volume does not increase again until the change point on January 24, 2022. The second peak (January 30, 2022) also falls within this month. Twitter reported to AFP that accounts supporting BBM were suspended for violating its platform manipulation and spam policy
<a href="https://www.rappler.com/nation/elections/twitter-suspends-accounts-ferdinand-bongbong-marcos-jr-network-january-2022/">(Rappler, 2022)</a>
.
</p>
<img src="images/event-modeling-last-dates.JPG" class="center-inline-img">
<p>
The high number of tweets continued until the next change point on May 29, 2022 (the end of the election month). Little to no tweets were seen after this change point until September 23, 2022, when a large number of tweets appears (the last peak). This is the day after Joe Biden praised BBM for his ‘work’ on windmills at the UN General Assembly on September 22, 2022
<a href="https://www.philstar.com/pilipino-star-ngayon/bansa/2022/09/23/2211712/biden-impressed-sa-ilocos-norte-windmills-na-itinayo-naman-ng-private-sector">(Philstar, 2022)</a>
.
</p>
</div>
</div>
</div>
</article>
<!-- Scripts -->
<script src="assets/js/jquery.min.js"></script>
<script src="assets/js/jquery.scrolly.min.js"></script>
<script src="assets/js/browser.min.js"></script>
<script src="assets/js/breakpoints.min.js"></script>
<script src="assets/js/util.js"></script>
<script src="assets/js/main.js"></script>
</body>
</html>