celsomsilva/thesis-data-science-usp

Walmart Sales Prediction – HLM3 (Hierarchical Models, Nonlinear)

“A data scientist is not a button pusher.” — Prof. Luiz Paulo Fávero - USP

Research Paper

The research presented in this repository is the basis for the following paper, which is still in preparation:

Nonlinear Multilevel Model for Weekly Sales Prediction

A short description of the forthcoming paper and its academic context is available in:

docs/en/README.md



Project Status

This project is under active maintenance, with ongoing improvements to documentation, figures, and model comparisons.

The repository is intentionally not in its final form yet, as the corresponding paper is still in preparation. Once the paper is completed, this and the related repositories will be finalized accordingly.



This repository contains my thesis project, where I analyze Walmart’s weekly sales data using 3-level hierarchical models (HLM3) and compare them with generalized linear models (GLMs) that ignore the data hierarchy. The goal was to understand what drives weekly sales and to show how nonlinear multilevel models can outperform traditional approaches when the data is naturally nested (Store -> Department -> Week).
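To make the nesting concrete, here is a minimal Python sketch of how weekly observations sit inside departments, which in turn sit inside stores. The records and their values are invented for illustration and are not taken from the Walmart dataset or the thesis code. A GLM that ignores the hierarchy only sees the grand mean, while a 3-level model lets each store and each department deviate from it.

```python
from collections import defaultdict
from statistics import mean

# Toy records: (store, department, week, weekly_sales).
# All values are made up for illustration only.
records = [
    (1, 1, 1, 100.0),
    (1, 1, 2, 120.0),
    (1, 2, 1, 50.0),
    (1, 2, 2, 70.0),
    (2, 1, 1, 200.0),
    (2, 1, 2, 220.0),
]

# Level 1 (weeks) nested in level 2 (departments) nested in level 3 (stores).
nested = defaultdict(lambda: defaultdict(list))
for store, dept, week, sales in records:
    nested[store][dept].append(sales)

# Mean weekly sales at each level of the hierarchy.
dept_means = {
    (store, dept): mean(values)
    for store, depts in nested.items()
    for dept, values in depts.items()
}
store_means = {
    store: mean(x for values in depts.values() for x in values)
    for store, depts in nested.items()
}
grand_mean = mean(r[3] for r in records)

print("grand mean:", round(grand_mean, 2))
print("store means:", store_means)
print("department means:", dept_means)
```

In this toy data the store and department means differ widely from the grand mean, which is the kind of between-group variation that random effects at the store and department levels are meant to capture.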

Evaluation

  • Grade: 9.0 out of 10 (University of São Paulo – USP, 2024)
  • Committee feedback: “Interesting and challenging.”

Academic Note

This project follows USP academic integrity and originality requirements.

USP uses official similarity-checking systems (Turnitin / SimilarityCheck). All thesis submissions are automatically verified against academic databases and public sources. Even AI-generated text becomes traceable once published online.

The evaluation considers writing consistency, methodology, and whether the code, analysis, and text form a coherent whole — meaning only genuinely original work passes.


Extended Versions

For a more applied, storytelling-oriented, or analytics-focused version of this project, see:

Analytics & Storytelling repo: https://github.com/celsomsilva/thesis-storytelling-usp

Quarto website version: https://github.com/celsomsilva/walmart-analytics-storytelling


Abstract

This thesis analyzes the factors that influence Walmart’s weekly sales using statistical machine learning techniques. A nonlinear 3-level HLM (HLM3) was built and compared with several GLM-based models, including versions with and without transformations (Box-Cox/Yeo-Johnson) and negative binomial alternatives.

The workflow included:

  • Null vs full models (GLM and multilevel)
  • Linear and nonlinear setups
  • Yeo–Johnson transformation for stabilizing variance
  • Negative binomial models for overdispersed data
  • Model comparison using log-likelihood (LogLik), AIC, and BIC
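
As a reminder of how the comparison criteria relate to the log-likelihood, the following Python sketch computes AIC and BIC from their standard definitions. The model names, log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not results from the thesis.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2*logLik (lower is better)."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k*ln(n) - 2*logLik (lower is better)."""
    return n_params * math.log(n_obs) - 2 * loglik

# Hypothetical (log-likelihood, number of parameters) pairs for illustration.
candidates = {
    "glm_null": (-1500.0, 2),
    "hlm3_full": (-1200.0, 8),
}
n_obs = 1000  # hypothetical number of observations

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs))
    for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.1f}, BIC={b:.1f}")
print("lowest AIC:", best_by_aic)
print("lowest BIC:", best_by_bic)
```

Both criteria penalize extra parameters, and BIC penalizes them more heavily as the sample grows, so a richer multilevel model only ranks first when its gain in log-likelihood outweighs the penalty.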

The results consistently show that:

  • Hierarchical models fit the data structure better
  • Predictions improve when store and department effects are modeled
  • Interpretation becomes clearer for decision-making

This contributes both to statistical understanding of HLMs and to practical retail insights.

Keywords: Multilevel models, Yeo–Johnson, Box-Cox, HLM3, hierarchical data, machine learning


Models Included

The initial HLM/OLS templates were provided in class by Prof. Fávero (HLM2/HLM3 exercises). This project expands those ideas by adding:

  • GLMs and GLMMs
  • Negative binomial and multilevel negative binomial models
  • Nonlinear transformations (Yeo–Johnson)
  • Full diagnostics and comparative modeling
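
For readers unfamiliar with the Yeo–Johnson transformation, here is a minimal pure-Python sketch of its standard definition. It is an illustration only, not the thesis implementation. The function name and the fixed lambda values are assumptions made for the example. In practice lambda is usually estimated from the data rather than fixed by hand.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value y with parameter lam."""
    if y >= 0:
        if abs(lam) < 1e-12:
            # Limit of the power form as lam approaches 0: log(y + 1).
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        # Limit of the power form as lam approaches 2: -log(-y + 1).
        return -math.log1p(-y)
    return -(((-y + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)

# Illustrative values, including zero and a negative number.
sample = [-3.0, 0.0, 3.0]
print([yeo_johnson(v, 1.0) for v in sample])  # lam = 1 leaves values unchanged
print([round(yeo_johnson(v, 0.5), 4) for v in sample])
```

Unlike Box-Cox, which requires strictly positive inputs, this transform is defined for zero and negative values, which matters whenever the response can be non-positive.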

Figures

Some reference figures come from class material; all extended visualizations (AIC/BIC charts, residual diagnostics, multilevel diagrams, etc.) were created for the thesis.


Project Structure

thesis-data-science-usp/
  src/
  
  data/

  docs/
    pt/ 	
    en/
    
  charts/
    pt/
    en/

  .gitignore
  README.md
  LICENSE               # MIT
  LICENSE-thesis.txt    # Creative Commons (thesis)

Data

Data is not included.

Use the public Walmart Weekly Sales dataset.


How to Reproduce

  1. Install the required R packages (lme4, car, ggplot2, etc.)
  2. Use the ready dataset (data/walmart_forecast.csv) or preprocess it using src/tratamento_de_dadosfinal.ipynb
  3. Run the full analysis in src/multilevel_retail.R

Documentation

Additional context about the research paper is available in:

docs/en/README.md

Figures used in the thesis are under charts/.


Author

This project was developed by an engineer and data scientist with the following background and interests:

  • Postgraduate degree in Data Science and Analytics (USP)
  • Bachelor of Science in Electrical and Computer Engineering (UERJ)
  • Special interest in statistical models, interpretability, and applied AI

Acknowledgments

  • Prof. Delmo Alves de Moura (UFABC) — for his guidance and support throughout my thesis
  • Prof. Luiz Paulo Fávero (USP) — for emphasizing deep statistical foundations

Contact

About

MBA thesis (grade 9/10, USP) applying nonlinear hierarchical statistical models to real-world Walmart sales data, with rigorous model comparison and interpretability focus.
