“A data scientist is not a button pusher.” — Prof. Luiz Paulo Fávero - USP
The research presented in this repository is described in the paper:
Nonlinear Multilevel Model for Weekly Sales Prediction
A short description of the future paper and its academic context is available in:
docs/en/README.md
This project is under active maintenance, with ongoing improvements to documentation, figures, and model comparisons.
The repository is intentionally not in its final form yet, as the corresponding paper is still in preparation. Once the paper is completed, this and the related repositories will be finalized accordingly.
This repository contains my thesis project, where I analyze Walmart’s weekly sales data using 3-level hierarchical models (HLM3) and compare them with generalized linear models (GLMs) that ignore the data hierarchy. The goal was to understand what drives weekly sales and to show how nonlinear multilevel models can outperform traditional approaches when the data is naturally nested (Store -> Department -> Week).
- Grade: 9.0 (University of São Paulo – USP, 2024)
- Committee feedback: “Interesting and challenging.”
This project follows USP academic integrity and originality requirements.
USP uses official similarity-checking systems (Turnitin / SimilarityCheck). All thesis submissions are automatically verified against academic databases and public sources. Even AI-generated text becomes traceable once published online.
The evaluation considers writing consistency, methodology, and whether the code, analysis, and text form a coherent whole — meaning only genuinely original work passes.
If you want a more applied, storytelling, or analytics-focused version of this project:
Analytics & Storytelling repo: https://github.com/celsomsilva/thesis-storytelling-usp
Quarto website version: https://github.com/celsomsilva/walmart-analytics-storytelling
This thesis analyzes the factors that influence Walmart’s weekly sales using statistical machine learning techniques. A nonlinear 3-level HLM (HLM3) was built and compared with several GLM-based models, including versions with and without transformations (Box-Cox/Yeo-Johnson) and negative binomial alternatives.
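For readers unfamiliar with the Yeo–Johnson transformation mentioned above, the following is a minimal sketch of its standard definition, not the thesis code. Unlike Box-Cox, it is defined for zero and negative inputs, which matters if the sales column contains such values. The function name `yeo_johnson` and the lambda values used are illustrative assumptions; in practice lambda is usually estimated from the data (for example, by maximum likelihood).

```python
import math


def yeo_johnson(y: float, lam: float) -> float:
    """Apply the Yeo-Johnson power transform to one value.

    `lam` is the transformation parameter (hypothetical here; normally
    estimated from the data rather than fixed by hand).
    """
    if y >= 0:
        if abs(lam) < 1e-12:
            # lambda == 0: logarithmic branch for non-negative values
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        # lambda == 2: logarithmic branch for negative values
        return -math.log1p(-y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)


# Illustrative values only (not taken from the Walmart data).
sample = [-2.0, 0.0, 3.0]
transformed = [yeo_johnson(v, 0.5) for v in sample]
```

With `lam = 1` the transform reduces to the identity, which is a convenient sanity check for any implementation.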
The workflow included:
- Null vs full models (GLM and multilevel)
- Linear and nonlinear setups
- Yeo–Johnson transformation for stabilizing variance
- Negative binomial models for overdispersed data
- Model comparison using LogLik, AIC, BIC
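The comparison step at the end of this workflow can be sketched as follows. This is a minimal illustration of the standard AIC and BIC formulas computed from a log-likelihood, assuming each fitted model reports its log-likelihood and number of estimated parameters. The model names and numbers below are hypothetical placeholders, not results from the thesis.

```python
import math


def aic(loglik: float, k: int) -> float:
    """Akaike information criterion: 2k - 2*logLik (lower is better)."""
    return 2 * k - 2 * loglik


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*logLik (lower is better)."""
    return k * math.log(n) - 2 * loglik


# Hypothetical fits: placeholder values chosen only to exercise the formulas.
n_obs = 500
candidates = {
    "model_a": {"loglik": -1050.0, "k": 3},
    "model_b": {"loglik": -1000.0, "k": 6},
}

scores = {
    name: {
        "AIC": aic(fit["loglik"], fit["k"]),
        "BIC": bic(fit["loglik"], fit["k"], n_obs),
    }
    for name, fit in candidates.items()
}

# Pick the candidate with the lowest BIC.
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])
```

BIC penalizes extra parameters more heavily than AIC once the sample has more than about eight observations, so the two criteria can disagree when a richer model gains only a small improvement in log-likelihood.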
The results consistently show that:
- Hierarchical models fit the data structure better
- Predictions improve when store and department effects are modeled
- Interpretation becomes clearer for decision-making
This contributes both to statistical understanding of HLMs and to practical retail insights.
Keywords: Multilevel models, Yeo–Johnson, Box-Cox, HLM3, hierarchical data, machine learning
The initial HLM/OLS templates were provided in class by Prof. Fávero (HLM2/HLM3 exercises). This project expands those ideas by adding:
- GLMs and GLMMs
- Negative binomial and multilevel negative binomial models
- Nonlinear transformations (Yeo–Johnson)
- Full diagnostics and comparative modeling
Some reference figures come from class material; all extended visualizations (AIC/BIC charts, residual diagnostics, multilevel diagrams, etc.) were created for the thesis.
thesis-data-science-usp/
    src/
    data/
    docs/
        pt/
        en/
    charts/
        pt/
        en/
    .gitignore
    README.md
    LICENSE               # MIT
    LICENSE-thesis.txt    # Creative Commons (thesis)
Data is not included.
Use the public Walmart Weekly Sales dataset.
- Install R packages (lme4, car, ggplot2, etc.)
- Use the ready dataset (data/walmart_forecast.csv) or preprocess it using src/tratamento_de_dadosfinal.ipynb
- Run the full analysis in src/multilevel_retail.R
- Research paper (English): Future paper
- Thesis (Portuguese): Original thesis
Additional context about the research paper is available in:
docs/en/README.md
Figures used in the thesis are under charts/.
This project was developed by an engineer and data scientist with the following background:
- Postgraduate degree in Data Science and Analytics (USP)
- Bachelor of Science in Electrical and Computer Engineering (UERJ)
- Special interest in statistical models, interpretability, and applied AI
- Prof. Delmo Alves de Moura (UFABC) — for his guidance and support throughout my thesis
- Prof. Luiz Paulo Fávero (USP) — for emphasizing deep statistical foundations