
CORE PROJECT ONE

A Data Science project to explore demographics of Barcelona city.


Project Task List

  • L1: Create an API with FastAPI
  • L1: Create a dashboard with Streamlit
  • L1: Database in MongoDB or PostgreSQL
  • L2: Use geospatial data and geoqueries in MongoDB or Postgres (using PostGIS)*
  • L2: Host the database in the cloud (there are free services such as MongoDB Atlas and Heroku Postgres, among others)
  • L2: Generate a PDF report of the data shown in Streamlit, downloadable via a button
  • L2: A multi-page dashboard in Streamlit
  • L3: Have the dashboard send the PDF report by e-mail
  • L3: Allow uploading new data to the database via the API (username and password as request headers)
  • L4: Allow updating the database via Streamlit (with username and password, on a separate page; the dashboard must make the previous request that adds data via the API)
  • L4: Create a Docker container and deploy the services in the cloud (Heroku; the two services must be deployed separately)
  • L5: Control the pipeline with Apache Airflow

Data Source

This project is based on a dataset of Barcelona city that contains information about demographics and population statistics.

For this project, we will be using the accidents dataset, which contains useful information about dates, places, and accidents.

Data Analysis with Jupyter Lab

To perform the Exploratory Data Analysis, I've used Jupyter Lab.

During the analysis, I've used several Python libraries (see the requirements.txt file).

For this stage of the project, the main concern has been to understand the data in order to be able to ask interesting questions about it (and build visualizations from them).

Furthermore, I've been extremely interested in reducing the dataset's weight, bringing it from a 6 MB file down to a 240 KB file. This has been achieved by dropping unnecessary columns, cleaning the data, and assigning the correct data type to each column.
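
As a rough illustration of that cleanup (a sketch only; the file name and column names below are hypothetical placeholders, not the actual dataset schema):

```python
import pandas as pd

# Load the raw accidents export (file name is a placeholder).
df = pd.read_csv("accidents_raw.csv")

# Drop columns that are not needed for the analysis (hypothetical names).
df = df.drop(columns=["unused_code", "redundant_description"])

# Assign cheaper, more precise dtypes: categories for repeated strings,
# datetimes for dates, downcast integers for counters.
df["district"] = df["district"].astype("category")
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["victims"] = pd.to_numeric(df["victims"], downcast="integer")

# Check the resulting in-memory size.
df.info(memory_usage="deep")
```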

Another interesting aspect is that requesting location information through the geopy API is very limited, so I've investigated Python's multithreading capabilities and implemented parallel requests using high-performance methods.
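
A minimal sketch of that idea, assuming Nominatim as the geopy geocoder and a thread pool for the parallel lookups (the coordinates and user agent are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

from geopy.geocoders import Nominatim

# Placeholder coordinates; in the notebook these come from the accidents DataFrame.
coordinates = [(41.3851, 2.1734), (41.4036, 2.1744), (41.3809, 2.1228)]

geolocator = Nominatim(user_agent="core-project-one-eda")

def reverse_geocode(point):
    """Resolve a (lat, lon) pair to an address, returning None on failure."""
    try:
        location = geolocator.reverse(point, exactly_one=True)
        return location.address if location else None
    except Exception:
        return None

# The lookups are I/O bound, so a small thread pool speeds them up considerably.
with ThreadPoolExecutor(max_workers=4) as pool:
    addresses = list(pool.map(reverse_geocode, coordinates))
```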

You can run the notebook in the data folder and repeat my steps; when you're done, export the DataFrame to a JSON file that will feed the local MongoDB database, or upload it to your own database by setting its URL in an .env file.
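
The export step can look roughly like this (the file name and orientation are assumptions, not necessarily what the notebook uses):

```python
# Export the cleaned DataFrame as JSON records so it can feed the MongoDB
# database; the output path is a placeholder.
df.to_json("data/accidents.json", orient="records", force_ascii=False)
```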

To get the basic environment ready, install the packages contained in the requirements.txt file.

FastAPI

This project uses FastAPI as its backend framework to build the API.

In this project I've leveraged the following features (a minimal sketch follows the list):

  • Asynchronous API
  • Asynchronous testing of the main route
  • Asynchronous MongoDB queries using motor
  • Routing with regex
  • OpenAPI documentation
  • Sentry integration
  • Python type hinting with pydantic
  • Type annotations with typing
  • Pydantic validation through models
  • Docker deployment using gunicorn with ASGI asynchronous workers
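
A sketch of how some of those pieces fit together; the collection name, model fields, and route below are illustrative assumptions, not the project's actual schema:

```python
import os
from typing import List

from fastapi import FastAPI
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel

app = FastAPI()
client = AsyncIOMotorClient(os.environ["DATABASE_URL"])
db = client[os.environ["DATABASE_NAME"]]

class Accident(BaseModel):
    # Illustrative fields only; the real models live in the project's code.
    district: str
    date: str
    victims: int

@app.get("/accidents/{district}", response_model=List[Accident])
async def accidents_by_district(district: str):
    """Query MongoDB asynchronously with motor and validate the response with pydantic."""
    cursor = db["accidents"].find({"district": district}, {"_id": 0})
    return await cursor.to_list(length=100)
```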

This project's API is deployed in a Docker container. To run it locally, you must provide an .env file with the following variables (a configuration sketch follows the list):

  • DATABASE_URL
  • DATABASE_NAME
  • DEBUG
  • ENVIRONMENT
  • SENTRY_DSN
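
One common way to load these in FastAPI is a pydantic settings class; this is a sketch under that assumption, not necessarily how this project reads them:

```python
from pydantic import BaseSettings

class Settings(BaseSettings):
    # Field names map to the environment variables listed above (case-insensitive).
    database_url: str
    database_name: str
    debug: bool = False
    environment: str = "development"
    sentry_dsn: str = ""

    class Config:
        env_file = ".env"  # values can also come from the .env file

settings = Settings()
```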

Data Visualization with Streamlit

To query FastAPI, I've used Streamlit as a frontend, wrapped with the Hydralit library to get some nice frontend features.

Within the frontend you can make a couple of queries to the API and see the results in a nice dashboard.
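
Roughly, a dashboard page queries the API over HTTP and renders the result; the endpoint path below is a placeholder, while the url key comes from the secrets file described later:

```python
import requests
import streamlit as st

api_url = st.secrets["url"]  # base URL of the FastAPI service

district = st.text_input("District", "Eixample")
response = requests.get(f"{api_url}/accidents/{district}", timeout=10)

if response.ok:
    st.dataframe(response.json())  # tabular view of the API result
else:
    st.error(f"API request failed: {response.status_code}")
```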

The dashboard also implements a few other features, such as a search bar, a map, a table, and a graph.

There's a user panel from which you can interact with the database, and you can also see the documentation of the API.

A user can also upload a new dataset to the database, and the dashboard will be updated with the new data. This is done through the API.
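
A hedged sketch of that upload flow, sending the credentials as request headers (the endpoint path and header names are assumptions):

```python
import requests
import streamlit as st

uploaded = st.file_uploader("New dataset (JSON)", type="json")

if uploaded is not None:
    response = requests.post(
        f"{st.secrets['url']}/data",  # endpoint path is a placeholder
        headers={
            "username": st.secrets["api_key"],    # header names are assumptions
            "password": st.secrets["api_secret"],
        },
        files={"file": uploaded.getvalue()},
        timeout=30,
    )
    st.write("Upload status:", response.status_code)
```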

Another cool feature is that you can download a report of the data, which I've built using Beautiful Soup.

This project uses Streamlit's native secrets management, so you must provide an app/dashboard/.streamlit/secrets.toml file with the following variables (a short sketch of reading them follows the list):

url
api_key
api_secret
sender_email
sender_name

[mongo]
host
port
username
password
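
Inside the dashboard these values are read through st.secrets; a sketch of how they can be accessed (not a guarantee of how each one is used):

```python
import streamlit as st

# Top-level secrets
api_url = st.secrets["url"]
sender = st.secrets["sender_email"]

# The [mongo] section is exposed as a nested mapping
mongo = st.secrets["mongo"]
mongo_uri = f"mongodb://{mongo['username']}:{mongo['password']}@{mongo['host']}:{mongo['port']}"
```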

MongoDB

When you export the DataFrame from Jupyter, the data is stored in a folder that also contains a MongoDB Dockerfile, which will be used to create a MongoDB container automatically.

Authentication is enabled by the entrypoint.sh script, which is executed when the container starts. It also creates a non-root user and grants it the readWrite role on a non-admin database.
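
The actual script runs in the Mongo shell at container start; as a rough Python/pymongo illustration of what it accomplishes (host and port are placeholders, and the values come from the environment variables listed below):

```python
import os

from pymongo import MongoClient

# Connect as the root user created by the official MongoDB image.
client = MongoClient(
    "mongodb://localhost:27017",
    username=os.environ["MONGO_INITDB_ROOT_USERNAME"],
    password=os.environ["MONGO_INITDB_ROOT_PASSWORD"],
)

# Create the non-root user with the readWrite role on the application database.
app_db = client[os.environ["MONGO_NON_ROOT_DB"]]
app_db.command(
    "createUser",
    os.environ["MONGO_NON_ROOT_USERNAME"],
    pwd=os.environ["MONGO_NON_ROOT_PASSWORD"],
    roles=[{"role": os.environ["MONGO_NON_ROOT_ROLE"], "db": os.environ["MONGO_NON_ROOT_DB"]}],
)
```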

In order to deploy the MongoDB Docker container, a database/.env file must be created with the following variables:

  • MONGO_INITDB_ROOT_USERNAME
  • MONGO_INITDB_ROOT_PASSWORD
  • MONGO_NON_ROOT_USERNAME
  • MONGO_NON_ROOT_PASSWORD
  • MONGO_NON_ROOT_ROLE
  • MONGO_INITDB_DATABASE
  • MONGO_NON_ROOT_DB

Running this project

This project uses a Makefile. The following targets are available:

  • help: Show the help
  • build-docker-api: Build the Docker image for the API
  • lint-docker-api: Lint the Docker image for the API
  • run-docker-api: Run the Docker image for the API
  • build-docker-db: Build the Docker image for MongoDB
  • lint-docker-db: Lint the Docker image for MongoDB
  • run-docker-db: Run the Docker image for MongoDB
  • build-docker-streamlit: Build the Docker image for Streamlit
  • lint-docker-streamlit: Lint the Docker image for Streamlit
  • run-docker-streamlit: Run the Docker image for Streamlit
  • run-app: Run the application using docker compose
  • rm-app: Remove the docker-compose stack

Pre-commit

This project uses pre-commit to test the repository files before making a commit.

You need to install it with pip install pre-commit within the repository.

Sentry

This project uses Sentry to report errors to the developers. It is configured with the SENTRY_DSN environment variable in FastAPI.
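
In FastAPI this typically amounts to initialising the SDK with that DSN before the app starts; a minimal sketch, assuming the sentry-sdk package:

```python
import os

import sentry_sdk

# Initialise Sentry before creating the FastAPI app; the DSN and environment
# come from the variables described in the FastAPI section above.
sentry_sdk.init(
    dsn=os.environ.get("SENTRY_DSN"),
    environment=os.environ.get("ENVIRONMENT", "development"),
)
```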

Contributing

Pull Requests are welcome! Feel free to contribute to the project.

Semantic release

This project uses Semantic Release and every push to the main branch will trigger a workflow that generates a CHANGELOG.md file.

Resources