- Allison Bertie Johnson (allisonbertie@tamu.edu)
- Arjun Ravikumar (arjun.cc@gmail.com)
- Christopher Han (christopherhan@stat.tamu.edu)
- John Schwenck (jschwenck@stat.tamu.edu)
For our 2020 TAMU Datathon submission, we have elected to solve the City Search challenge. Our model's unique spin offers a complete front-to-end implementation of a K-Means algorithm, which suggest the Top 12 best locations for a given user's preferences. We utilized Google tools such as Google Forms and Google Sheets where users submit a form with their preferences to a series of questions, which then get fed into a Google Sheet. Using R, we extract the information from the Google Sheet using the googledrive package and feed it into our K-Means algorithm.
For the 2020 TAMU Datathon, our team chose to collaborated on creating a city search tool. The city search tool matches users based on a multitude of criteria with an ideal city or area that best fits their specifications. Over 10 different data sets were utilized to collect the most current data, spanning from recent air quality reports to covid-19 cases reported in the past 14 days. After thoroughly cleaning the data sets and merging all desired variables, our team applyed clustering techniques to organize outcomes based on selectable criteria. The user can visit the city search dashboard that we have created which aesthetically formats the results of their criteria searches.
Devpost: https://devpost.com/software/tamu_datathon_2020
- Identify region of interest (U.S)
- Decide on the granularity (Metropolitan Area/ CBSA code)
- Identify variables of interest
- Obtain relevant dataset for each variable (API, government, other websites)
- Process each dataset to obtain relevant columns
- Map each dataset from zip code/city/county level to Metropolitan area through CBSA ID
- Join all the mapped datasets into one table containing CBSA ID, and all relevant columns
- Analyze dataset using hieratical clustering
- Utilize Power BI to create visuals
Source: https://www.zillow.com/research/data/
Raw Data: ZHVI_1.csv, ZHVI_2.csv, ZHVI_3.csv, ZHVI_4.csv, ZHVI_5.csv, ZHVI_condo.csv
Processed Data: housing.rda
Variable Descriptions:
- CBSA: Core-based statistical area code
- one_bed_avg: Average price in dollars of one bedroom house
- two_bed_avg: Average price in dollars of two bedroom house
- three_bed_avg: Average price in dollars of three bedroom house
- four_bed_avg: Average price in dollars of four bedroom house
- five_bed_avg: Average price in dollars of five bedroom house
- condo_avg: Average price in dollars of a condo
Source: https://www.huduser.gov/portal/datasets/fmr.html#2021_data
Raw Data: fy2021-safmrs.xlsx
Processed Data: rent_metro.rda
Variable Descriptions:
- CBSA: Core-based statistical area code
- one_bed_rent: Fair Market Rent for one bedroom apartment
- two_bed_rent: Average price in dollars of two bedroom apartment
- three_bed_rent: Average price in dollars of three bedroom apartment
- four_bed_rent: Average price in dollars of four bedroom apartment
- five_bed_rent: Average price in dollars of five bedroom apartment
Source: https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual
Raw Data: annual_aqi_by_cbsa_2019.csv
Processed Data: air_quality.rda
Variable Descriptions:
- City, State: City and State for each entry
- CBSA: Core-based statistical area code
- Median AQI: Median Air Quality Index
Source: https://api.census.gov/data/2018/acs/acs5/variables.html
Raw Data:
Census API key needed
pub_trans = B08006_008
Processed Data: pub_trans.rda
Variable Description:
- CBSA: Core-based statistical area code
- NAME: City, State Abbr. Area Type
- Avg_User: Average number of users
Source: https://wonder.cdc.gov/controller/datarequest/D60
Raw Data: max_avg_temps2011.txt and min_avg_temps2011.txt
Processed Data: avg_temps_msa.rda
Variable Description:
- CBSA: Core-based statistical area code
- mean(Avg.Min.Temp): mean average of minimum air temperature for each CBSA
- mean(Avg.Max.Temp): mean average of maximum air temperature for each CBSA
Source: https://www.bls.gov/oes/tables.html
Raw Data: B01003_001B01002_001, B19013_001
Processed Data: demographics.rda
Variable Descriptions:
- CBSA: Core-based statistical area code
- CBSA_name: Core-based statistical area name
- tot_emp_hc: Total employees in healthcare
- loc_quotient_hc: Location quotient for healthcare
- tot_emp_rest: Total employees in restaurants and food service
- loc_quotient_rest: Location quotient for restaurants and food service
- pop: Population
- pca_hc: Healthcare employees per capita
- rest_hc: Restaurant and food service workers per capita
Source: https://www.bls.gov/oes/tables.html
Raw Data: oesm19ma.zip; MSA_M2019_dl.xlsx
Processed Data: healthcare_restaurants_labor.rda
Variable Descriptions:
- CBSA: Core-based statistical area code
- CBSA_name: Core-based statistical area name
- tot_emp_hc: Total employees in healthcare
- loc_quotient_hc: Location quotient for healthcare
- tot_emp_rest: Total employees in restaurants and food service
- loc_quotient_rest: Location quotient for restaurants and food service
- pop: Population
- pca_hc: Healthcare employees per capita
- rest_hc: Restaurant and food service workers per capita
Source: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Raw Data: us_crime_counties.csv
Processed Data: covid_14dayavg.rda
Variable Descriptions:
- CBSA: Core-based statistical area code
- mean(cases): 14-day average of cases
- mean(deaths): 14-day average of deaths
Source: https://api.census.gov/data/2018/acs/acs5/variables.html
Raw Data: B08101_025, B08134_061, B08101_009, B08134_011, B08134_021, B08301_004, B08134_031, B08101_033, B08101_041, B08134_111, B08101_049, B08134_001
Processed Data: transportation.rda
Variable Description:
- CBSA: Core-based statistical area code
- public_trans_qty: Average number of people who took public transportation
- priv_trans_all_qty: Average number of people who took private transportation
- priv_trans_carpool_qty: Average number of people who took private transportation and carpooled
- walked_qty: Average number of people who walk
- bike_cab_other_qty: Average number of bike, cab and other transportation usage
- work_from_home_qty: Average number of people who work from home
- avg_time_to_work_all: Average number of seconds for a person to get to work
Source: https://www.thearda.com/Archive/ChCounty.asp
Raw Data: religion.csv
Processed Data: religion.rda
Variable Description:
- CBSA: Core-based statistical area code
- sum(TOTCNG): All denominations/groups - Total number of congregations
- sum(TOTADH): All denominations/groups - Total number of adherents
- sum(NONDCNG): Non-denominational - Total number of congregations
- sum(NONDADH): Non-denominational- Total number of adherents
- mean(TOTRATE): All denominations/groups - Rates of adherence per 1,000 population
- mean(NONDRATE): Non-denominational - Rates of adherence per 1,000 population
Source: https://libguides.princeton.edu/c.php?g=464855&p=3181354
Raw Data: B15003_017, B15003_018, B15003_022, B15003_023, B15003_024, B15003_025
Processed Data: education.rda
Variable Description:
- CBSA: Core-based statistical area code
- edu_hs_only: Total number of people who only have a high school diploma
- edu_ged_only: Total number of people who only have a GED
- edu_bs: Total number of people who have a Bachlor's degree
- edu_ms: Total number of people who hae a masters
- edu_prof: Total number of professional persons
- edu_phd: Total number of people with a PhD