Topics: Data Engineering | Market Gap Analysis | Data Visualization | Blue Ocean Strategy
This project delivers a high-fidelity CPG Market Gap Analysis for Helix CPG Partners. By analyzing a massive sample from the Open Food Facts dataset, we've identified a "Blue Ocean" opportunity in the plant-based snack sector: High-Protein, Low-Sugar, and Low-Sodium alternatives.
- Technical Notebook: Deep-Dive Analysis
- Interactive Presentation: Live Strategy Deck
- Interactive Dashboard: Data Studio Reporting
- Executive Summary: Video Walkthrough
Traditional snacks are saturated with sugar (averaging 30 g per 100 g). Our analysis pinpointed a specific gap where consumer demand for healthier options is not met by current inventory.
| Red Ocean (High Competition) | Blue Ocean (The Gap) |
|---|---|
| High Sugar (>15g) | Low Sugar (< 4.6g) |
| Low/Medium Protein | High Protein (> 4.2g) |
| High Sodium Dependency | Low Sodium Integrity |
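
The sugar and protein cutoffs in the table can be applied as a simple boolean filter. The snippet below is a minimal sketch, not the notebook's actual code: the column names `sugars_100g` and `proteins_100g` are assumed, the thresholds are assumed to be grams per 100 g, and the sample rows are invented for illustration. Sodium is left out because the table gives no numeric cutoff for it.

```python
import pandas as pd

# Thresholds from the table above (assumed to be grams per 100 g).
MAX_SUGAR_G = 4.6
MIN_PROTEIN_G = 4.2


def flag_blue_ocean(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask for products in the low-sugar, high-protein gap.

    Column names are assumptions; rows with missing values compare as False.
    """
    return (df["sugars_100g"] < MAX_SUGAR_G) & (df["proteins_100g"] > MIN_PROTEIN_G)


# Invented rows, for illustration only.
products = pd.DataFrame({
    "product_name": ["candy bar", "roasted chickpeas", "rice cake"],
    "sugars_100g": [48.0, 2.1, 0.9],
    "proteins_100g": [3.5, 19.0, 2.0],
})
products["blue_ocean"] = flag_blue_ocean(products)
print(products)
```

Only the second row passes both conditions; the rice cake is low in sugar but falls short on protein.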
Hexagonal binning density analysis revealing the "Red Ocean" concentration vs. the "Blue Ocean" opportunity.
To handle the 12GB raw dataset on local hardware, I implemented a chunked streaming architecture:
- Stream Processing: Utilized `chunksize` to ingest 500,000 records in 50k batches.
- Memory Optimization: Immediate downcasting to `float32` and selective column loading reduced RAM overhead by ~70%.
- Statistical Sampling: Extracted a 200,000-row randomized sample to ensure 99% confidence.
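
The three steps above can be sketched with pandas as follows. This is an illustrative outline under stated assumptions, not the notebook's implementation: the function name `stream_sample`, the column list, and the seed are hypothetical, and the demo runs on a tiny in-memory CSV rather than the 12GB file.

```python
import io

import pandas as pd

# Assumed column names; only these are parsed (selective column loading).
USECOLS = ["product_name", "sugars_100g", "proteins_100g", "sodium_100g"]
NUMERIC = ["sugars_100g", "proteins_100g", "sodium_100g"]


def stream_sample(source, max_rows=500_000, chunksize=50_000,
                  sample_size=200_000, seed=42, sep=","):
    """Read up to max_rows in chunks, downcast to float32, then sample."""
    chunks = []
    rows_read = 0
    reader = pd.read_csv(source, usecols=USECOLS, chunksize=chunksize, sep=sep)
    for chunk in reader:
        remaining = max_rows - rows_read
        if remaining <= 0:
            break
        # Trim the last chunk to the row budget and downcast immediately.
        chunk = chunk.iloc[:remaining].astype({c: "float32" for c in NUMERIC})
        chunks.append(chunk)
        rows_read += len(chunk)
    data = pd.concat(chunks, ignore_index=True)
    # Random sample without replacement, capped at the rows actually read.
    return data.sample(n=min(sample_size, len(data)), random_state=seed)


# Tiny in-memory stand-in for the raw file (25 rows plus an unused column).
demo_csv = "product_name,sugars_100g,proteins_100g,sodium_100g,extra\n" + "".join(
    f"item{i},{i}.5,{i % 7}.0,0.1,ignored\n" for i in range(25)
)
sample = stream_sample(io.StringIO(demo_csv), max_rows=20, chunksize=5, sample_size=8)
print(sample.dtypes)
```

Downcasting each chunk before it is appended keeps peak memory close to the size of the reduced data rather than the raw file.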
- Nutrient Density Score: Calculated as $\frac{Proteins}{(Sugars + 1)}$ to quantify nutritional value per "sugar overhead."
- Ingredient Mining: Developed Regex-based NLP to parse `ingredients_text`.
- The "Salt Trap" Discovery: Natural language processing revealed that 78% of high-protein leaders rely on added sodium for flavor, defining our primary R&D differentiator: Flavor without Sodium.
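
A rough sketch of the score and the ingredient scan described above. The regex here is a deliberately simple stand-in, not the pattern used in the notebook, and the column names other than `ingredients_text` are assumptions. The sample rows are invented.

```python
import re

import pandas as pd

# Illustrative pattern only: matches "salt", "sea salt", or "sodium" as whole words.
SODIUM_PATTERN = re.compile(r"\b(?:sea\s+salt|salt|sodium)\b", re.IGNORECASE)


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the nutrient density score and a sodium-mention flag."""
    out = df.copy()
    # Proteins / (Sugars + 1): the +1 avoids division by zero for sugar-free items.
    out["nutrient_density"] = out["proteins_100g"] / (out["sugars_100g"] + 1)
    # Missing ingredient lists are treated as "no sodium mentioned".
    out["has_added_sodium"] = (
        out["ingredients_text"].fillna("").str.contains(SODIUM_PATTERN)
    )
    return out


# Invented rows, for illustration only.
snacks = pd.DataFrame({
    "proteins_100g": [20.0, 6.0, 12.0],
    "sugars_100g": [3.0, 29.0, 1.0],
    "ingredients_text": ["Pea protein, sea salt, paprika", "Sugar, cocoa butter", None],
})
features = add_features(snacks)
print(features[["nutrient_density", "has_added_sodium"]])
```

A keyword match like this flags any mention of sodium, including compounds such as sodium bicarbonate, so a real pipeline would likely need a more careful ingredient vocabulary.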
Correlation analysis across high-level business buckets: Snacks, Beverages, Dairy, Plant-based, and Cereals.
`pip install pandas numpy matplotlib seaborn`
- Clone the repository.
- Open `notebook_v2.ipynb` in Jupyter or VS Code.
- The notebook will automatically handle data ingestion (cached locally as `point5mil.csv`).
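
For readers curious how a local cache like this typically works, here is one possible pattern. It is an assumption about the general approach, not a copy of the notebook's logic; `load_cached` and the `build` callback are hypothetical names.

```python
from pathlib import Path

import pandas as pd


def load_cached(cache_path, build):
    """Load the cached CSV if it exists; otherwise build it once and save it.

    `build` is a hypothetical zero-argument function that returns a DataFrame,
    for example a wrapper around the chunked ingestion step.
    """
    path = Path(cache_path)
    if path.exists():
        return pd.read_csv(path)
    df = build()
    df.to_csv(path, index=False)
    return df
```

Called as `load_cached("point5mil.csv", build_fn)`, the expensive ingestion would run only on the first execution; later runs read the cached file.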
Distributed under the GNU General Public License v3.0. See LICENSE for more information.
Developed by Ryan Nii Akwei Brown