-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
627 lines (489 loc) · 31.9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
---
title: "Singapore HDB Resale Flat Prices"
output:
github_document:
toc: TRUE
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r packages, include = FALSE}
library(tidyverse)
library(readr)
library(lubridate)
library(jtools)
```
## Introduction
As a millenial growing up in Singapore and starting out in my first job, one of my life's goals is to own a house on this little red dot with my life partner (*read: wife*).
Here in [Singapore](https://en.wikipedia.org/wiki/Singapore), there are generally [two ways](https://www.propertyguru.com.sg/property-guides/the-different-types-of-housing-in-singapore-9916) to do this:
- [Public Housing](https://www.hdb.gov.sg/about-us/our-role/public-housing-a-singapore-icon)
- Brand new Built-to-Order (BTO) Flats (or apartments)
- Resale Flats
- Private Housing (e.g., condominiums, landed property, etc.)
Whilst BTOs tend to be the choice of most young Singaporean couples, my current and potential future situation may push my partner and I to look towards resale flats to meet our housing needs. As such, I set forth on this project that will focus on resale flats under the Housing & Development Board (HDB). I hope to uncover insights into the relationships between variables related to resale flat prices and whether the data can be fitted into a model that may give an idea of future prices.
This project therefore seeks to analyse the prices of Singapore's HDB Resale Flats for the years of 1990 - 2022. Data was obtained from Kaggle, [(1)](https://www.kaggle.com/datasets/syrahmadi/resale-hdb-flat-prices-2000-2022) and [(2)](https://www.kaggle.com/datasets/chngyuanlongrandy/hdb-prices-with-closest-mrt-distance), which were sourced from GovTech's open data portal, [data.gov.sg](https://data.gov.sg/dataset/resale-flat-prices).
```{r import, message = FALSE, echo = FALSE}
#import data as objects
nineties_data <- read_csv("~/Desktop/Data Stuff/singapore-hdb-resale-prices/Data/resale-flat-prices-based-on-approval-date-1990-1999.csv")
feb2012_data <- read_csv("~/Desktop/Data Stuff/singapore-hdb-resale-prices/Data/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv")
dec2014_data <- read_csv("~/Desktop/Data Stuff/singapore-hdb-resale-prices/Data/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv")
dec2016_data <- read_csv("~/Desktop/Data Stuff/singapore-hdb-resale-prices/Data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv")
latest_data <- read_csv("~/Desktop/Data Stuff/singapore-hdb-resale-prices/Data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
```
## Data Exploration
First, we will examine the data to understand its structure and contents.
```{r data exploration, echo = FALSE}
head(nineties_data)
head(feb2012_data)
head(dec2014_data)
head(dec2016_data)
head(latest_data)
```
From these, we see that the data structure is generally similar. The only difference being that the "Jan 2015 - Dec 2016" and "Jan 2017 onwards" data consists of an extra column `remaining_lease`, indicating the remaining years of lease for the houses in the datasets. For the purpose of this project then, we will be dropping this column to maintain data uniformity.
```{r drop excess columns, include = FALSE}
dec2016_data <- dec2016_data %>%
select(!remaining_lease)
head(dec2016_data)
latest_data <- latest_data %>%
select(!remaining_lease)
head(latest_data)
```
```{r union datasets, include = FALSE}
resale_data <- bind_rows(nineties_data,feb2012_data,dec2014_data,dec2016_data,latest_data)
head(resale_data)
```
Now that the data is similar, we can combine the datasets together. Looking into the dataset, we see that the `storey_range` variable has varied values, which we will be renaming into a new variable called `storeys`.
However, before we create the new variable, it would be helpful to understand the range of storey ranges among the resale flats and create a new variable to categorise them.
Even in 2022, very few resale flats have more than 12 to 15 floors, hence, everything above the 9th floor will be tagged as *"high"*.
```{r storey_ranges, warning = FALSE, message = FALSE, echo = FALSE, fig.align = "center"}
#examine the values in storey_range for each year period
resale_data %>%
group_by(storey_range) %>%
summarise(count = n()) %>%
ggplot(aes(x = storey_range, y = count)) +
geom_col() +
coord_flip() +
labs(y = "Count of flats",
x = "Storey Ranges",
title = "Number of flats per storey range in 1990 to 2022")
```
```{r relabel storey_ranges, include = FALSE}
#create new variable
resale_data <- resale_data %>%
mutate(storeys = factor(case_when(
storey_range == "01 TO 03" | storey_range == "01 TO 05" | storey_range == "04 TO 06" ~ "low",
storey_range == "06 TO 10" | storey_range == "07 TO 09" ~ "mid",
TRUE ~ "high"
),
levels = c("low", "mid", "high"))) %>%
relocate(storeys, .after = storey_range)
head(resale_data)
```
## Exploratory Data Analysis
Now that the data is cleaned, we can do some quick exploratory analyses.
First, we know that generally, HDB flat prices tend to vary by storey levels. We can visualise this layman assumption here:
```{r plot storeys and price, echo = FALSE, fig.align = "center"}
ggplot(resale_data, aes(x = storeys, y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = "Storeys",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by Storey Levels") +
scale_y_continuous(breaks = seq(0, 1500, by = 200))
```
Although the variation and range of resale flat prices are relatively the same, the median prices for each of the storey categories seem to fit this assumption. That is, HDB flats on the lower levels generally tend to cost less than those on the higher levels.
One possible explanation for near similar variations and ranges could be the scope of the dataset, which covers approximately 30 years' worth of data. Furthermore, the above chart does not take into account the HDB town, which, as most Singaporeans know, also influences the prices of flats.
Let's examine the relationship between HDB towns and prices next:
```{r plot towns and price, echo = FALSE, fig.align = "center"}
ggplot(resale_data, aes(x = reorder(town, resale_price, median), y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = "Towns",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by HDB Towns") +
scale_y_continuous(breaks = seq(0, 1500, by = 200))
```
It is therefore evident that HDB resale flat prices do vary by their towns.
At this point, as mentioned above, the range of data is still massive with about 871,671 observations. For ease of and more focused analyses, we will narrow down the dataset by three conditions:
- Apartments from 2010 onwards (because we want newer houses and buildings)
- Apartments of at least 4-room and not multi-generation (so that we have a flat that is big enough to house at least 2 children)
- Apartments located on "mid" or "high" storeys (as we simply prefer being higher up and for the higher resale value over time)
```{r create year variable, include = FALSE}
resale_data <- resale_data %>%
mutate(year = as.integer(str_sub(month, end = 4))) %>%
relocate(year, .after = month)
```
```{r clean flat_type data, include = FALSE}
resale_data %>%
group_by(flat_type) %>%
summarise(n())
resale_data <- resale_data %>%
mutate(flat_type2 = factor(case_when(
flat_type == "1 ROOM" ~ "1-room",
flat_type == "2 ROOM" ~ "2-room",
flat_type == "3 ROOM" ~ "3-room",
flat_type == "4 ROOM" ~ "4-room",
flat_type == "5 ROOM" ~ "5-room",
flat_type == "EXECUTIVE" ~ "Executive",
TRUE ~ "Others"
),
levels = c("1-room", "2-room", "3-room", "4-room", "5-room", "Executive", "Others"))) %>%
relocate(flat_type2, .after = flat_type)
```
```{r filter data, echo = FALSE}
filtered_data <- resale_data %>%
filter(year >= 2010,
flat_type2 == c("4-room", "5-room", "Executive"),
storeys != "low")
head(filtered_data)
```
This gives us a dataset with 38,695 observations - something much easier to work with and relatively more reflective of current trends, considering the recency of the data.
### Analyses for 2010 - 2022
Here, we repeat the analyses from the previous section using the filtered dataset.
```{r plot storeys/towns and price with filtered data, echo = FALSE, fig.show="hold", out.width="50%"}
ggplot(filtered_data, aes(x = storeys, y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = "Storeys",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by Storey Levels") +
scale_y_continuous(breaks = seq(0, 1500, by = 200))
ggplot(filtered_data, aes(x = reorder(town, resale_price, median), y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = "Towns",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by HDB Towns") +
scale_y_continuous(breaks = seq(0, 1500, by = 200))
```
```{r compare median prices of storeys, include = FALSE}
median_price_high <- filtered_data %>%
filter(storeys == "high") %>%
summarise(median(resale_price)) %>%
pull()
median_price_mid <- filtered_data %>%
filter(storeys == "mid") %>%
summarise(median(resale_price)) %>%
pull()
median_diff <- format((median_price_high - median_price_mid),
big.mark = ",", big.interval = 3L,
scientific = FALSE)
```
We also add another analysis:
```{r plot flat type and price with filtered data, echo = FALSE, fig.align = "center"}
ggplot(filtered_data, aes(x = flat_type2, y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = "Flat Type",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by Flat Types") +
scale_y_continuous(breaks = seq(0, 1500, by = 200))
```
We can make three observations here:
For HDB Resale Flats sold from 2010 onwards,
1. Apartments on higher floors (i.e., at least 11th floor) tend to be priced SGD `r median_diff` more than those on middle floors (i.e., 6th to 10th floors)
2. Apartments in the areas nearer to the Central Business District (CBD) area of Singapore tend to be priced more than those further away, with those in mature estates priced higher than those in non-mature estates (see [here](https://www.propertyguru.com.sg/property-guides/non-mature-vs-mature-bto-55760) for a list of such estates)
3. Apartments with larger flat types tend to be priced higher than relatively smaller ones
Another interesting observation is with the spread of resale prices by HDB towns.
Let's take a look at both the unfiltered (left) and filtered (right) data visualisations side by side:
```{r plot towns and price side, echo = FALSE, fig.show="hold", out.width="50%"}
ggplot(resale_data, aes(x = reorder(town, resale_price, median), y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = "Towns",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by HDB Towns for 1990 to 2022") +
scale_y_continuous(breaks = seq(0, 1500, by = 200)) +
theme(plot.title = element_text(size = 12))
ggplot(filtered_data, aes(x = reorder(town, resale_price, median), y = resale_price/1000)) +
geom_boxplot() +
coord_flip() +
labs(x = " ",
y = "Prices (thousands; SGD)",
title = "Spread of HDB Resale Flat Prices by HDB Towns for 2010 to 2022") +
scale_y_continuous(breaks = seq(0, 1500, by = 200)) +
theme(plot.title = element_text(size = 12))
```
We see that when comparing the range of HDB resale flat prices for 1990 to 2022 vs. for 2010 to 2022, the second observation above does not hold. The simple explanation here is that, many estates (e.g., Punggol, Sengkang, Central Area) did not have many HDB flats built prior to 2010 (or even 2000).
As Singapore developed further, with land scarcity increasing, more HDB flats needed to be built closer to the CBD to meet the demands of working professionals whose offices were located there. As such, even mature estates such as Queenstown and Bukit Merah, being some of the oldest estates in Singapore, with their newly developed HDB flats in the early 2000s, started seeing increased resale prices due to their proximity to the CBD. Furthermore, with the development of Sengkang and Punggol in line with more recent trends in HDB resale flat prices, these newer flats came out on top in terms of resale prices when being compared to flats that were built long before 2010, or even 2000.
This brings us to another variable that we have not investigated: the lease commencement date. The years variable used thus far were the years that the HDB flat was put on the resale market. As such, the lease commencement date would give an indicator of the age of the HDB flat.
Let's take a look at the visualisation here:
```{r plot lease date and price by towns, echo = FALSE, message = FALSE, fig.align = "center"}
filtered_data %>%
group_by(town, lease_commence_date) %>%
mutate(median_price_by_lease = median(resale_price)) %>%
ungroup() %>%
ggplot(aes(x = lease_commence_date, y = median_price_by_lease/1000)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, color = "red") +
labs(x = "Lease Commencement Year",
y = "Prices (thousands; SGD)",
title = "Median HDB Resale Flat Prices by Lease Date, grouped by HDB town") +
scale_y_continuous(breaks = seq(0, 1500, by = 200)) +
scale_x_continuous(breaks = seq(1960, 2020, by = 20)) +
theme(plot.title = element_text(size = 12)) +
facet_wrap(~town)
```
The plot above shows that there is a general upward trend for newer HDB flats, which by common logic, does make sense. The only "outlier" in this plot is that of HDB resale flats in Sembawang, which saw a downward trend followed by a spike in the late 2010s. This could have been due to recent urban developments in the Canberra area, located near Sembawang.
```{r spearman correlation for correlation between price and lease year, echo = FALSE}
filtered_data <- filtered_data %>%
group_by(town, lease_commence_date) %>%
mutate(median_price_by_lease = median(resale_price)) %>%
ungroup()
corr_median_price_lease <- format(round(cor(filtered_data$lease_commence_date,
filtered_data$median_price_by_lease,
method = "spearman"), 3), nsmall = 3)
```
Nevertheless, when looking at the data as a whole, it still shows that the resale prices increased for newer HDB resale flats (rho = `r corr_median_price_lease`, *p* < 0,05).
```{r spearman correlation test for correlation between price and lease year, echo = FALSE}
cor.test(filtered_data$lease_commence_date,
filtered_data$median_price_by_lease,
method = "spearman",
exact = FALSE)
```
```{r plot lease date and price, echo = FALSE, message = FALSE, fig.align = "center"}
filtered_data %>%
group_by(town, lease_commence_date) %>%
mutate(median_price_by_lease = median(resale_price)) %>%
ungroup() %>%
ggplot(aes(x = lease_commence_date, y = median_price_by_lease/1000)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, color = "red") +
labs(x = "Lease Commencement Year",
y = "Prices (thousands; SGD)",
title = "Median HDB Resale Flat Prices by Lease Date") +
scale_y_continuous(breaks = seq(0, 1500, by = 200)) +
scale_x_continuous(breaks = seq(1960, 2020, by = 20)) +
theme(plot.title = element_text(size = 12))
```
Finally, we can also examine whether the HDB resale prices follow a pattern or trend over time.
```{r create date and month variable and then plot time series, echo = FALSE, message = FALSE, fig.align = "center"}
filtered_data <- filtered_data %>%
mutate(month_year = ym(month)) %>%
relocate(month_year, .after = month)
#plot time series with month and year against resale value
filtered_data %>%
group_by(month_year, flat_type2) %>% #all flat sizes
summarise(mean_resale_price_hundredk = mean(resale_price)/1000) %>%
ggplot(aes(month_year, mean_resale_price_hundredk)) +
geom_line(color = "red3") +
labs(title = "Monthly Mean HDB Resale Prices over 2010 to 2022",
y = "HDB Resale Prices (SGD; thousands)",
x = "Month and Year") +
theme_bw(base_size = 15) +
theme(plot.title = element_text(size = 12),
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
axis.title.y = element_text(size = 13)) +
facet_grid(rows = vars(flat_type2)) +
scale_x_date(date_breaks = "1 year", date_minor_breaks = "1 month", date_labels = "%b %Y")
filtered_data %>%
filter(flat_type2 == "4-room") %>% #4-room flats
group_by(month_year, storeys) %>%
summarise(mean_resale_price_hundredk = mean(resale_price)/1000) %>%
ggplot(aes(month_year, mean_resale_price_hundredk)) +
geom_line(color = "red3") +
labs(title = "Monthly Mean HDB Resale Prices for 4-room apartments over 2010 to 2022",
y = "HDB Resale Prices (SGD; thousands)",
x = "Month and Year") +
theme_bw(base_size = 15) +
theme(plot.title = element_text(size = 12),
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
axis.title.y = element_text(size = 13)) +
facet_grid(rows = vars(storeys)) +
scale_x_date(date_breaks = "1 year", date_minor_breaks = "1 month", date_labels = "%b %Y")
filtered_data %>%
filter(flat_type2 == "5-room") %>% #5-room flats
group_by(month_year, storeys) %>%
summarise(mean_resale_price_hundredk = mean(resale_price)/1000) %>%
ggplot(aes(month_year, mean_resale_price_hundredk)) +
geom_line(color = "red3") +
labs(title = "Monthly Mean HDB Resale Prices for 5-room apartments over 2010 to 2022",
y = "HDB Resale Prices (SGD; thousands)",
x = "Month and Year") +
theme_bw(base_size = 15) +
theme(plot.title = element_text(size = 12),
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
axis.title.y = element_text(size = 13)) +
facet_grid(rows = vars(storeys)) +
scale_x_date(date_breaks = "1 year", date_minor_breaks = "1 month", date_labels = "%b %Y")
filtered_data %>%
filter(flat_type2 == "Executive") %>% #exec flats
group_by(month_year, storeys) %>%
summarise(mean_resale_price_hundredk = mean(resale_price)/1000) %>%
ggplot(aes(month_year, mean_resale_price_hundredk)) +
geom_line(color = "red3") +
labs(title = "Monthly Mean HDB Resale Prices for Executive apartments over 2010 to 2022",
y = "HDB Resale Prices (SGD; thousands)",
x = "Month and Year") +
theme_bw(base_size = 15) +
theme(plot.title = element_text(size = 12),
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
axis.title.y = element_text(size = 13)) +
facet_grid(rows = vars(storeys)) +
scale_x_date(date_breaks = "1 year", date_minor_breaks = "1 month", date_labels = "%b %Y")
```
#### Insights from EDA of resale apartments from 2010 onwards
1. Apartments on higher floors tend to be priced higher (at least SGD 40,000 more between high and middle floors)
2. Apartments nearer to the CBD tend to be priced higher than those futher away
3. Apartments in mature estates tend to be priced higher than those in non-mature estates
4. Apartments with larger flat types tend to be priced higher than relatively smaller ones
5. Newer apartments tend to be priced higher than relatively older ones (i.e., with shorter remaining lease years)
6. Resale prices generally dip at the end of the year and then dip again at around mid-year
7. Executive apartment resale prices are more volatile (variable) than 4-room and 5-room apartment resale prices
### Choosing feature variables to predict HDB resale prices
In predicting HDB resale prices, the exploratory analyses revealed some key variables:
- Flat-type (e.g., 4-room, 5-room, Executive)
- This variable is synonymous with the floor area of the house, which is represented by the `floor_area_sqm` variable
- Month of sale
- Year of lease commencement (i.e., indicating the remaining length of the house lease at point of sale)
- Storeys (e.g., Mid or High floors)
- Region (i.e., indicated by Town, which indicates proximity to the CBD)
Out of these variables, we will need to create or calculate new variables for the length of housing lease at point of sale, for the housing region and for the month of sale.
For the town groupings by region, we will use the information found [here](https://en.wikipedia.org/wiki/Regions_of_Singapore).
```{r create variable for remaining length of lease, echo = FALSE}
filtered_data <- filtered_data %>%
mutate(year = as.numeric(year)) %>%
mutate(remaining_lease_length = 99-(year-lease_commence_date)) %>%
relocate(remaining_lease_length, .after = lease_commence_date)
```
```{r create variable for region, echo = FALSE}
filtered_data <- filtered_data %>%
mutate(region = factor(case_when(
town %in% c("BISHAN", "BUKIT MERAH", "BUKIT TIMAH", "CENTRAL AREA", "GEYLANG", "KALLANG/WHAMPOA",
"MARINE PARADE", "QUEENSTOWN", "TOA PAYOH") ~ "central",
town %in% c("BEDOK", "PASIR RIS", "TAMPINES") ~ "east",
town %in% c("SEMBAWANG", "WOODLANDS", "YISHUN") ~ "north",
town %in% c("ANG MO KIO", "HOUGANG", "PUNGGOL", "SENGKANG", "SERANGOON") ~ "north-east",
town %in% c("BUKIT BATOK", "BUKIT PANJANG", "CHOA CHU KANG", "CLEMENTI", "JURONG EAST", "JURONG WEST") ~ "west"
),
levels = c("central", "east", "north", "north-east", "west"))) %>%
relocate(region, .after = town)
```
```{r create variable for month of sale, echo = FALSE}
filtered_data <- filtered_data %>%
mutate(month_2 = as.character(month_year, "%B")) %>%
relocate(month_2, .after = month_year) %>%
mutate(month_2 = factor(month_2,
levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")))
```
As such, the feature variables we will be using to predict HDB resale prices would be:
- Flat Type (`flat_type2`)
- Floor area in sqm (`floor_area_sqm`)
- Remaining lease length in years (`remaining_lease_length`)
- Region (`region`)
- Floor level (`storeys`)
- Month of sale (`month_2`)
Since flat type and floor area are synonymous, we can conduct an ANOVA test on both variables to test for feature dependence and consider whether we should use both variables as features in a model or only one.
For this ANOVA test, we will set our alpha level to 0.1 and the null hypothesis that there is no difference between the means of the floor area for the flat types.
```{r anova for flat type and floor area, echo = FALSE}
floor_area_vs_flat_type <- lm(floor_area_sqm ~ flat_type2, data = filtered_data) #define model for ANOVA
anova(floor_area_vs_flat_type)
```
The ANOVA test showed that the p-value was less than 0.001 and so we can reject the null hypothesis; concluding that there is a mean difference between the floor area of flat types.
To determine which pairs of flat types had differences, we can conduct a series pairwise t-test with Bonferroni correction. We will set alpha to 0.1 for this test.
```{r pairwise t-test, echo = FALSE}
pairwise.t.test(filtered_data$floor_area_sqm, filtered_data$flat_type2, p.adjust.method = "bonferroni")
```
The pairwise t-tests showed p-values of less than 0.001 for all possible pairs, which means that we can reject all null hypotheses that there were no differences in mean floor areas between flat types. As such, we will be keeping both variables for predicting HDB resale prices.
## Predicting HDB apartment resale prices for the years of 2010 - 2022
After exploring our data and gaining some insights into the trends and factors possibly influencing resale prices, we will be modelling the prices to predict them.
For these linear regression models, the alpha level will be set at 0.1.
Three models were fitted:
1. A model with all the feature variables listed above (aka the Full model)
2. A model without the flat type variable
3. A model without the month of year variable
```{r multiple linear regression for resale prices Full Model, echo = FALSE}
glm_resale_prices <- glm(resale_price ~ flat_type2 + floor_area_sqm + remaining_lease_length + region + storeys + month_2,
data = filtered_data,
family = gaussian()) #define linear regression model
summ(glm_resale_prices) #Review the results
```
```{r multiple linear regression for resale prices Model 2, echo = FALSE}
glm_resale_prices_2 <- glm(resale_price ~ floor_area_sqm + remaining_lease_length + region + storeys + month_2,
data = filtered_data,
family = gaussian()) #define linear regression model 2 without flat_type variable
summ(glm_resale_prices_2) #Review the results 2
```
```{r multiple linear regression for resale prices Model 3, echo = FALSE}
glm_resale_prices_3 <- glm(resale_price ~ flat_type2 + floor_area_sqm + remaining_lease_length + region + storeys,
data = filtered_data,
family = gaussian()) #define linear regression model 3 without month of year
summ(glm_resale_prices_3) #Review the results 3
```
The r^2^ values and the p-values for the coefficients of each model and for each full model were compared.
The results show that the full model was sufficient in explaining the HDB resale prices across 2010 and 2022, with r^2^ = 0.586, *F* = 2740, *p* < 0.001. When certain variables were removed as in Models 2 and 3, the r^2^ values would drop to 0.577 and 0.585 respectively, indicating a drop in the explainability of the model. The full model therefore showed a moderate variability (58.6%) in explaining HDB resale prices.
While dropping the month of year variable resulted in a change in r^2^ of 0.001, the coefficients for the month of year variable had several values that had *p*-values less than 0.1 thereby still being able to explain the HDB resale prices.
#### Visualising simple effects
Examining the Full Model further, we will visualise the simple effects to ensure our insights above are validated based on the data.
```{r effect plot, echo = FALSE, fig.align = "center"}
summ(glm_resale_prices)
effect_plot(glm_resale_prices, pred = flat_type2, interval = TRUE, plot.points = TRUE, jitter = 0.05) #plot flat type predictor
effect_plot(glm_resale_prices, pred = floor_area_sqm, interval = TRUE, plot.points = TRUE, jitter = 0.05) #plot floor area predictor
effect_plot(glm_resale_prices, pred = remaining_lease_length, interval = TRUE, plot.points = TRUE, jitter = 0.05) #plot remaining lease length predictor
effect_plot(glm_resale_prices, pred = region, interval = TRUE, plot.points = TRUE, jitter = 0.05) #plot region predictor
effect_plot(glm_resale_prices, pred = storeys, interval = TRUE, plot.points = TRUE, jitter = 0.05) #plot storeys predictor
effect_plot(glm_resale_prices, pred = month_2, interval = TRUE, plot.points = TRUE, jitter = 0.05) #plot month predictor
```
### Digging deeper into each feature variable
```{r review the linear regression, echo = FALSE}
summ(glm_resale_prices) #Review the results
```
Reviewing the linear regression coefficients, we see the following insights:
1. As predicted, as flat sizes increase, the resale prices would increase, all else being equal
2. As predicted, as remaining lease lengths increase, resale prices would increase, all else being equal
3. As predicted, for apartments on higher floors, resale prices would increase, all else being equal
Two feature variables deserve separate insights analyses:
- Region of apartment
- All else being equal, compared to apartments in the Central region, apartments in the North, East, North-East and West regions would predict lower resale prices.
- The order of resale prices by region from highest to lowest is as follows: Central > East > North-East > West > North
- Month of sale
- As predicted, resale prices would decrease in the middle of the year (May to July), increase towards the end of the year, sustain at the beginning of the year and then fall again in mid-year.
At this point, some of the more obvious predictors would be flat sizes, remaining lease lengths and apartment floor level, where it is clear that the higher they go, the higher prices would go.
It seems that key variables that are would affect my decision-making would be the region of the HDB flat and the time of year that I purchase a flat. In sum, I can expect to pay on average, the lowest prices when purchasing flats in the North and aim to purchase one in May to June to take advantage of a fall in market prices.
This is what we see when visualising mean HDB resale prices by region across 2010 to 2022:
```{r visualise prices by region over time, echo = FALSE, message = FALSE, fig.align = "center"}
filtered_data %>%
select(resale_price, region, month_year) %>%
group_by(region, month_year) %>%
summarise(mean_price = mean(resale_price), region, month_year) %>%
ggplot(aes(x = month_year, y = mean_price/1000)) +
geom_line(color = "red") +
labs(title = "Monthly Mean HDB Resale Prices over 2010 to 2022 by region",
y = "HDB Resale Prices (SGD; thousands)",
x = "Month and Year") +
theme_bw(base_size = 15) +
theme(plot.title = element_text(size = 12),
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
axis.title.y = element_text(size = 13)) +
facet_grid(rows = vars(region)) +
scale_x_date(date_breaks = "1 year", date_minor_breaks = "1 month", date_labels = "%b %Y")
```
It is evident that resale prices are on a general upward trend, with the prices in the Central regions seeing the most volatility compared to other regions. More importantly, the prices in the North region are relatively the lowest, followed by both the West and North-East, then by East.
When visualising the month of year that a flat gets sold:
```{r visualise prices by month of year, echo = FALSE, fig.align = "center"}
filtered_data %>%
select(resale_price, month_2) %>%
group_by(month_2) %>%
ggplot(aes(x = month_2, y = resale_price/1000, fill = month_2)) +
geom_boxplot() +
labs(title = "Monthly Mean HDB Resale Prices over 2010 to 2022 by Month of Sale",
y = "HDB Resale Prices (SGD; thousands)",
x = "Month of Sale") +
theme_bw(base_size = 15) +
theme(plot.title = element_text(size = 12),
axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1),
axis.title.y = element_text(size = 13)) +
scale_fill_brewer(palette = "Paired")
```
It is also evident that resale prices tend to be lowest in the mid-year months of May to June. As such, I could expect to find the lowest relative prices at this time of the year.
## Conclusions
I set out on this project with the aim of analysing HDB resale flat prices over time to uncover any insights or trends. As I had some layman intuition about the relationship between resale prices and several factors, the analyses here validated some of those assumptions and provided insight into other factors.
In short:
- Larger flat sizes, longer remaining lease lengths and units on higher floors command higher resale prices
- Different regions command different average prices, in order from highest to lowest: Central > East > North-East > West > North
- The time of the year when a unit is sold could determine its price relative to the market, with relatively lower prices consistently seen in May to June.
While there are many other factors affecting when a couple purchases a home (e.g., necessity and urgency of moving out), the variables analysed here were readily and publicly available and could serve to inform potential homeowners with the same criteria as me when studying HDB resale flat prices. With time in hand and adequate patience, one might be able to secure a flat when the market prices them at their lowest as compared to the rest of the year and also pick units in regions where prices are lower than others.
Thank you for looking through this humble project of mine. Please feel free to reach out to me for any feedback or enquiries through my social media or email as listed [here](https://github.com/iyliahutta).
The code for this project can be found in the .Rmd file of the repo or also [here as an R file](https://github.com/iyliahutta/singapore-hdb-resale-prices/blob/main/README.R).