<!DOCTYPE HTML>
<!--
Miniport by HTML5 UP
html5up.net | @ajlkn
Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)
-->
<html>
<head>
<title>PH Twitter Fake News Analysis</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<link rel="stylesheet" href="assets/css/main.css" />
<style>
.center-inline-img{
margin: 0 auto;
display: block;
}
p{
color: black;
text-align: justify;
}
</style>
</head>
<body class="is-preload">
<!-- Nav -->
<nav id="nav">
<ul class="container">
<li><a href="index.html">Home</a></li>
<li><a href="overview.html">Motivation</a></li>
<li><a href="exploration.html">Data Exploration</a></li>
<li><a href="modeling.html">Modeling</a></li>
<li><a href="communication.html">Communication</a></li>
</ul>
</nav>
<!-- Home -->
<article id="top" class="wrapper style1">
<div class="container">
<div class="row">
<div>
<header>
<h1>Data Exploration</h1>
</header>
<p>This section describes the preprocessing performed on the data set. The cleaned data is then used to create graphs that help identify trends and relationships in the data. The implementation of these graphs is also discussed in this section.</p>
</div>
</div>
<div id="preprocessing" class="row">
<div>
<header>
<h2>Preprocessing</h2>
<h3>Time Series Analysis</h3>
</header>
<p>
To prepare the data for the time series analysis, the dates were extracted from the original spreadsheet using the code below. Two data sets were derived from the original: one containing only the month and year, and one containing the entire date. Rows with null values were dropped.
<img src="images/time-series-preprocessing.JPG" class="center-inline-img">
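<br>
For readers who cannot view the screenshot, the extraction step might look roughly like the sketch below. It is not the project's actual code: the column name "Date posted" and the sample values are hypothetical, and pandas is assumed.

```python
import pandas as pd

# Hypothetical stand-in for the original spreadsheet; the column name
# "Date posted" is an assumption, not taken from the project's data.
raw = pd.DataFrame({
    "Date posted": ["2021-10-06 09:15", "2022-02-08 18:30", None, "2022-09-23 07:45"],
})

# Parse the timestamps and drop rows with null dates.
dates = pd.to_datetime(raw["Date posted"], errors="coerce").dropna()

# Derived data set 1: month and year only (one monthly period per tweet).
month_year = dates.dt.to_period("M")

# Derived data set 2: the entire calendar date (time of day removed).
full_date = dates.dt.normalize()

print(month_year.astype(str).tolist())
print(full_date.dt.strftime("%Y-%m-%d").tolist())
```

<br>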
As part of the preparation, we also created two separate DataFrames, one for each topic.
<img src="images/time-series-categ-df.JPG" class="center-inline-img">
<br>
The topics were identified by running the code below during the exploration stage, which inspects the categorical columns.
<img src="images/time-series-categ-code.JPG" class="center-inline-img">
<br>
The topic keywords that appear in the output are used to sort the data in later stages.
<img src="images/time-series-categ-output.JPG" class="center-inline-img">
<br>
Continuing with the preprocessing, the engagement type columns are listed and used to compute the average engagement count for each topic. This is used in a later plot on tweet engagement.
<img src="images/time-series-engagement-preprocess.JPG" class="center-inline-img">
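<br>
One way to compute a per-topic average over a list of engagement columns is sketched below. The column names ("Topic", "Likes", "Replies", "Retweets") and the numbers are hypothetical placeholders, not the project's data, and the code in the screenshot may be organized differently.

```python
import pandas as pd

# Hypothetical sample; the column names and values are illustrative only.
tweets = pd.DataFrame({
    "Topic": ["Windmill", "Windmill", "Nutribun", "Nutribun"],
    "Likes": [10, 4, 2, 0],
    "Replies": [1, 3, 0, 1],
    "Retweets": [5, 1, 1, 0],
})

# The engagement type columns, listed once and reused.
engagement_cols = ["Likes", "Replies", "Retweets"]

# Mean of each engagement type per topic.
avg_engagement = tweets.groupby("Topic")[engagement_cols].mean()
print(avg_engagement)
```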
<br>
Two DataFrames are again created to separate the topics, this time also counting the tweets in each time period. The code below also stores the size of each data set.
<img src="images/time-series-tweet-distribution-1.JPG" class="center-inline-img">
<br>
Afterwards, the difference in distribution across dates is taken into account using the code below, which combines the data from the two topics so that they can be shown in a single plot. This is used for a graph shown later.
<img src="images/time-series-tweet-distribution-2.JPG" class="center-inline-img">
<br>
</p>
</div>
</div>
<div id="visualization" class="row">
<div style="width: 100%;">
<header>
<h2>Visualization</h2>
</header>
<h3>Time Series by Topic</h3>
<p>
The code below plots the distribution of the credit-grabbing tweets over the given timeframe. Note that the graph in the following images uses a 2-month gap. This was done to make the graph more compact and to present the information more clearly.
<img src="images/time-series-tweet-distribution-3.JPG" class="center-inline-img">
<br>
Below is the output of the code. Most of the windmill-related tweets appear from around October 2021 to October 2022, while the Nutribun-related tweets are flatter overall and boomed from around June 2022 to October 2022.
<img src="images/time-series-tweet-distribution-graph.jpg" class="center-inline-img">
<br>
</p>
<h3>Time Series by Average Engagement</h3>
<p>
The code below uses the preprocessed engagement data to plot the average engagement count of the tweets of each topic (Nutribun and Windmill).
<img src="images/time-series-engagement-code.JPG" class="center-inline-img">
<br>
Shown below is the output of the code above. It shows the average amount of engagement each tweet gets, by topic. As we can see, an average windmill tweet gets more engagement than an average Nutribun tweet.
<img src="images/time-series-engagement-output.JPG" class="center-inline-img">
<br>
</p>
<h3>Time Series by Month</h3>
<p>
The code below shows how the time series graph <strong>divided by months</strong> was constructed. The frequency of tweets per month was counted using the value_counts function, and the bins were then sorted in chronological order. Note that some data were dropped because they fell outside the 1-year range, i.e., they belonged to the October 2022 bin.
<img src="images/time-series-plotting-1.JPG" class="center-inline-img">
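<br>
The monthly binning described above can be sketched as follows. This is a minimal illustration with hypothetical dates, assuming the 1-year window runs from October 2021 through September 2022; the project's own code may filter and count in a different order.

```python
import pandas as pd

# Hypothetical posting dates; the real project reads these from its data set.
dates = pd.to_datetime(pd.Series([
    "2021-10-06", "2021-10-20", "2022-01-30", "2022-09-23", "2022-10-02",
]))

# Keep only tweets inside the assumed one-year window
# (October 2021 through September 2022); the October 2022 tweet is dropped.
in_range = dates[dates.between("2021-10-01", "2022-09-30")]

# Count tweets per month. value_counts orders bins by frequency,
# so sort_index restores chronological order.
monthly = in_range.dt.to_period("M").value_counts().sort_index()
print(monthly)
```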
<br>
Executing the code yields the graph below. Most of the credit-grabbing tweets were posted during October 2021, January to May 2022, and September 2022. This may have been caused by Bongbong Marcos's filing of candidacy on October 6, 2021, the campaign period from February 8 to May 7, 2022, and Marcos's attendance at the United Nations General Assembly in September 2022.
<img src="images/time-series-graph-1.JPG" class="center-inline-img">
<br>
Finally, below is the code used to interpolate the data. Because the time series graph used only the total frequency per month, interpolation fills in the missing points between months. As the code shows, 365 points are used to span the entire year. Three interpolation techniques were applied: Cubic Spline, Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), and Akima 1D.
<img src="images/time-series-interpolation-1.JPG" class="center-inline-img">
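<br>
As a rough sketch of this step, the three interpolators are available in SciPy and can be evaluated on 365 points. The monthly totals below are made-up values, and using SciPy's scipy.interpolate classes is an assumption about how the project implemented it.

```python
import numpy as np
from scipy.interpolate import Akima1DInterpolator, CubicSpline, PchipInterpolator

# Hypothetical monthly totals for a 12-month window (illustrative values only).
monthly_counts = np.array([12, 5, 3, 9, 11, 14, 10, 8, 2, 1, 4, 13], dtype=float)
month_index = np.arange(len(monthly_counts))  # 0 .. 11

# 365 evenly spaced points spanning the same window.
x_dense = np.linspace(month_index[0], month_index[-1], 365)

# Each interpolator passes through the monthly totals exactly.
interpolators = {
    "Cubic Spline": CubicSpline(month_index, monthly_counts),
    "PCHIP": PchipInterpolator(month_index, monthly_counts),
    "Akima 1D": Akima1DInterpolator(month_index, monthly_counts),
}
curves = {name: f(x_dense) for name, f in interpolators.items()}

for name, y in curves.items():
    print(name, y.shape, round(float(y.min()), 2), round(float(y.max()), 2))
```

A cubic spline can overshoot between points and may dip below zero for count data, while PCHIP stays within the range of neighboring points, which is one reason to compare the three curves.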
<br>
Below is the interactive graph produced by the interpolation code. With Plotly's default legend behavior, clicking a line in the legend toggles that line, and double-clicking it hides the other lines plotted in the graph.
<div> <script type="text/javascript">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>
<script type="text/javascript" src="plotly-a.js"></script>
<div id="02e1c0af-0418-4424-a133-bd98692d80d9" class="plotly-graph-div" style="height:100%; width:100%;"></div>
<script type="text/javascript" src="plotly2-a.js"></script>
</div>
<br>
</p>
<h3>Time Series by Day</h3>
<p>
Because specific dates may have sparked the emergence of credit-grabbing tweets, a time series <strong>divided by days</strong> was also implemented. Recall from the preprocessing phase that the full date of each tweet was also extracted. Using this data, the number of tweets per day was counted. The process is similar to the monthly one: each day is treated as a bin, and the frequency of tweets per day is obtained with the value_counts function. Note that some data were omitted because they fall outside the 1-year range.
<img src="images/time-series-plotting-2.JPG" class="center-inline-img">
Executing the code above produces the graph below. As expected from the monthly time series, October 2021, January to May 2022, and September 2022 had the most credit-grabbing tweets, with September 23, 2022, January 30, 2022, and October 20, 2021 being the days with the most tweets.
<img src="images/time-series-graph-2.JPG" class="center-inline-img">
<br>
Lastly, the code below interpolates the daily data. As seen in the previous graph, there were many days on which no credit-grabbing tweets were made. To address this, the daily data were first interpolated using different interpolators: Cubic Spline, PCHIP, and Akima 1D. Any dates that still had no tweets after interpolation were imputed with the mean number of tweets. Resampling (upsampling from the monthly data) is recommended for future exploration.
<img src="images/time-series-interpolation-2.JPG" class="center-inline-img">
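<br>
One possible reading of the interpolate-then-impute procedure is sketched below. It assumes that days without tweets are treated as missing values, that interpolation only covers the span between the first and last observed day, and that the remaining days are filled with the mean. The dates and counts are hypothetical, and only PCHIP is shown; CubicSpline or Akima1DInterpolator could be used instead.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import PchipInterpolator

# Hypothetical daily counts; days absent from the index had no tweets.
observed = pd.Series(
    [3, 1, 4, 2],
    index=pd.to_datetime(["2021-10-03", "2021-10-06", "2021-10-10", "2021-10-12"]),
)

# Reindex onto every day in the window; days without tweets become NaN.
all_days = pd.date_range("2021-10-01", "2021-10-14", freq="D")
daily = observed.reindex(all_days).astype(float)

# Interpolate between observed days; with extrapolate=False, days outside
# the observed span are returned as NaN instead of being extrapolated.
x_obs = np.flatnonzero(daily.notna().to_numpy())
f = PchipInterpolator(x_obs, daily.dropna().to_numpy(), extrapolate=False)
filled = pd.Series(f(np.arange(len(daily))), index=all_days)

# Impute the days that are still missing with the mean of the interpolated values.
filled = filled.fillna(filled.mean())
print(filled.round(2))
```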
<br>
The interactive graph resulting from the interpolation of the daily data is presented below. With Plotly's default legend behavior, clicking a symbol in the legend toggles that plot, and double-clicking it hides the other plots in the graph.
<div> <script type="text/javascript">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>
<script type="text/javascript" src="plotly-b.js"></script>
<div id="1b1b27a2-01fa-47f8-b4ec-b83652fc1790" class="plotly-graph-div" style="height:100%; width:100%;"></div>
<script type="text/javascript" src="plotly2-b.js"></script>
</div>
</p>
</div>
</div>
</div>
</article>
<!-- Scripts -->
<script src="assets/js/jquery.min.js"></script>
<script src="assets/js/jquery.scrolly.min.js"></script>
<script src="assets/js/browser.min.js"></script>
<script src="assets/js/breakpoints.min.js"></script>
<script src="assets/js/util.js"></script>
<script src="assets/js/main.js"></script>
</body>
</html>