jmdu99/README.md

Hi there 👋, I'm Jose

Freelance Data Engineer — Turning messy data into clarity for projects with real impact.
I work with purpose-driven teams to build data systems they can trust.

🛠 What I Do

  • 📊 Centralise scattered data into a single source of truth
  • ⚙️ Automate cleaning & validation for always-ready data
  • 🚀 Design efficient ETL/ELT pipelines (Airflow, dbt, Spark…)
  • 📈 Build solid foundations for BI, ML & GenAI
  • ⏱ Create real-time dataflows when speed matters

💻 Tech Stack

Core Skills & Tooling

Python, SQL, Bash, Git, GitHub, Poetry, Pylint, Pandas, NumPy

Ingestion, Orchestration & Processing

Apache Airflow, Cloud Composer (GCP), MWAA (AWS), dbt, Fivetran, Airbyte, Prefect, Apache Spark, PySpark, Apache Beam, Dataflow (GCP), Dataproc (GCP), Spark Structured Streaming, Apache Kafka, Google Pub/Sub, Apache NiFi, Web scraping

Data Platforms & Storage

Amazon S3, Google Cloud Storage, Parquet, BigQuery, Snowflake, Amazon Redshift, Amazon Athena, PostgreSQL, MongoDB, Cassandra, ClickHouse

Cloud & DevOps

Amazon EC2, Google Compute Engine, Terraform (IaC), Docker, Docker Compose, GitHub Actions (CI/CD), IAM / RBAC

ML, NLP & Knowledge Graphs

Generative AI, Large Language Models, OpenAI API, LangChain (RAG), Hugging Face Transformers, NLTK, spaCy, scikit-learn, PyTorch, TensorFlow, SPARQL, AWS SageMaker

Analytics & Visualization

Matplotlib, Seaborn, Plotly, Amazon QuickSight, Apache Superset

🎯 About Me

Since 2021, I've worked in data across tech, banking, and large-scale systems (Amazon, Slido/Cisco).
In 2025, I went freelance to focus on projects with real impact — from healthtech and edtech to any sector that values purpose as much as results.
I also donate 10% of my earnings to the GiveWell Top Charities Fund.

🗂 Portfolio & Contact

💼 Portfolio request | LinkedIn
📩 Let’s connect and discuss how to make your data work better.


Pinned

  1. Hybrid-Fitness-Data-Pipeline-Batch-Streaming (Public)

    This project demonstrates a full hybrid fitness data pipeline combining real-time streaming (Kafka + MongoDB) with scheduled batch enrichment and loading (Prefect + Redshift). Dashboards are built …

    Python

  2. Hybrid-Nutrition-Data-Pipeline-Batch-Streaming (Public)

    This project simulates a real-time and batch data pipeline for food item enrichment and nutritional analytics. It demonstrates a modern architecture that uses Kafka for streaming ingestion, Cassand…

    Python

  3. dbpedia/DBpedia-Spotlight-Dashboard (Public)

    An integrated statistical information tool from the Wikipedia dumps and the DBpedia Extraction Framework artifacts

    Python 1 1

  4. Data-Processes-assignment (Public)

    Survival analysis and prediction on a COVID-19 dataset using Python (sklearn, pandas, numpy, matplotlib, lifelines, mlxtend, joblib)

    Python 1

  5. Spark-Practical-Work (Public)

    Big Data: Spark Practical Work First Semester 2021/2022

    Scala

  6. Graph-Analysis-Social-Networks (Public)

    Assignments completed for the Graph Analysis and Social Networks course using Tweepy, NetworkX and NLTK.

    Jupyter Notebook 1