This repository provides data and information that will enable you to transform, import and analyze the raw Stack Overflow Annual Developer Survey data with MariaDB ColumnStore.
Starting at Step 2, Parse and transform the data, this repository will walk you through the process of preparing, importing and, ultimately, being able to analyze the raw Stack Overflow Annual Developer Survey data for 2020 (which is included in the developer_survey_2020 folder).
Note: This repository will be updated to include the raw 2021 data once it becomes available here.
- Requirements
- Parse and transform the data
- Prepare the database
- Create the schema
- Import the data
- Analyze the data
- Support and contribution
- License
- MariaDB Client, used to connect to MariaDB database instances.
- Python v. 3+
In this sample you will use the Python script file, parse_and_transform.py, to parse and transform the Stack Overflow Annual Survey data for 2020.
Executing parse_and_transform.py will parse through the [survey results data] and split each row, which contains all the responses from an single survey respondent, into multiple rows, one per response. The result of turns ~66k rows of data into ~5 million rows.
To execute the parse_and_transform.py you need to perform the following steps.
- 
Open a new terminal window at this location. 
- 
Create a new Python virtual environment. 
$ python3 -m venv venv- Activate the virtual environment.
$ . venv/bin/activate- Install the pandas Python package, which will be use for data manipulation within parse_and_transform.py.
$ pip install pandas- And, finally, execute the parse_and_transform.py script!
$ python3 parse_and_transform.pyTo be able to store and analyze the survey data you're going to need a place to put it. MariaDB to the rescue! Below includes instructions on setting up a local database, using the official MariaDB Docker image, or MariaDB SkySQL, the ultimate MariaDB database in the cloud.
Running a single instance (container) of MariaDB ColumnStore is incredibly simple using the MariaDB Community Server ColumnStore image.
Check out the instructions here.
SkySQL is the first and only database-as-a-service (DBaaS) to bring the full power of MariaDB Platform to the cloud, including its support for transactional, analytical and hybrid workloads. Built on Kubernetes, and optimized for cloud infrastructure and services, SkySQL combines ease of use and self-service with enterprise reliability and world-class support – everything needed to safely run mission-critical databases in the cloud, and with enterprise governance.
IMPORTANT: Once you've registered for MariaDB SkySQL you will need to create a new analytics service so that you can take advantage of the MariaDB columnar storage engine, ColumnStore. For more information on how to do this check out this walk-through, or check out this short video on launching a new SkySQL service - don't worry it only takes a couple of minutes!
The survey result data contained in newly created answers.csv file will need to be imported to MariaDB. To accomodate that you will need to create a new database, survey_data, that contains a single table, answers.
To create the new database and table you can either copy and execute the following code within a database client of your choice.
DROP DATABASE IF EXISTS survey_data;
CREATE DATABASE survey_data;
CREATE TABLE answers (
    respondent_id INT unsigned NOT NULL, 
    question_id VARCHAR(25) NOT NULL,
    answer VARCHAR(65) NOT NULL
) ENGINE=ColumnStore DEFAULT CHARSET=utf8;or use the MariaDB Client to execute the schema.sql script contained within this repository.
For example:
Locally
$ mariadb --host 127.0.0.1 --user root -pPassword123! < schema.sqlSkySQL
mariadb --host analytics-1.mdb0001265.db.skysql.net --port 5001 --user DB00004537 -p --ssl-ca ~/Downloads/skysql_chain.pem < schema.sqlNote: Remember to update the command above with your database location, user and SSL information accordingly!
After you've created the new schema, you can import the answers.csv data using the MariaDB Client.
For example:
Locally
mariadb --host 127.0.0.1 --port 3306 --user root -pPassword123! -e "LOAD DATA LOCAL INFILE 'answers.csv' INTO TABLE answers FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" survey_dataSkySQL
mariadb --host analytics-1.mdb0001265.db.skysql.net --port 5001 --user DB00004537 -p --ssl-ca ~/Downloads/skysql_chain.pem -e "LOAD DATA LOCAL INFILE 'answers.csv' INTO TABLE answers FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'" survey_dataNote: Remember to update the command above with your database location, user and SSL information accordingly!
Once the data has been successfully imported into MariaDB there are many ways that you can use the data!
You can use a database client to execute SQL queries directly on the results data contained in the answers table.
For exmple, using the MariaDB Client:
- Start by connecting to your MariaDB database instance.
$ mariadb --host 127.0.0.1 --user root -pPassword123!- Execute the following query to SELECTthe top 10 programming langauges that have been used by respondents that have also used MariaDB.
SELECT
	answer, COUNT(answer) AS respondent_count
FROM
	survey_data.answers
WHERE 
	question_id = "LanguageWorkedWith" AND 
	respondent_id IN (SELECT respondent_id FROM answers WHERE question_id = "DatabaseWorkedWith" AND answer = "MariaDB")
GROUP BY
	answer
ORDER BY
	COUNT(answer) DESC
LIMIT 10;+-----------------------+------------------+
| answer                | respondent_count |
+-----------------------+------------------+
| JavaScript            |             6878 |
| HTML/CSS              |             6597 |
| SQL                   |             6239 |
| PHP                   |             5149 |
| Python                |             4204 |
| Java                  |             4028 |
| Bash/Shell/PowerShell |             3746 |
| TypeScript            |             2558 |
| C#                    |             2396 |
| C++                   |             2277 |
+-----------------------+------------------+
You can also use modern data analysis and visualization tools like Jupyter Lab, in combination with MariaDB Connector/Python and Python libraries like Plotly and Pandas.
For more information on how you can do this please check out the following resources:
- 
Using Data Analysis and Visualization with MariaDB Connector/Python (GitHub Repository) 
- 
Deep dive: Taking advantage of MariaDB Connector for Python (Webinar) 
Please feel free to submit PR's, issues or requests to this project project directly.
If you have any other questions, comments, or looking for more information on MariaDB please check out:
Or reach out to us diretly via:
