A Python implementation of submarket clustering algorithms for analyzing product substitution patterns.
SUBMARIT is a comprehensive toolkit for identifying and analyzing submarkets based on product substitution patterns. This Python implementation provides:
- Efficient clustering algorithms for submarket identification
- Statistical evaluation methods
- Validation techniques including k-fold cross-validation
- Support for large-scale data analysis
- MATLAB compatibility layer for seamless migration
git clone https://github.com/yourusername/submarit.git
cd submarit
pip install -e .
pip install -e ".[dev]"
pre-commit install
import submarit
import numpy as np
# Load substitution matrix
data = submarit.load_substitution_data("data.csv")
matrix = submarit.SubstitutionMatrix(data)
# Run clustering
clusterer = submarit.LocalSearch(n_clusters=5)
labels = clusterer.fit_predict(matrix)
# Evaluate results
evaluator = submarit.ClusterEvaluator()
metrics = evaluator.evaluate(matrix, labels)
print(f"Log-likelihood: {metrics.log_likelihood}")
print(f"Z-score: {metrics.z_score}")

- Core Algorithms
  - Local search optimization (quick approximation and direct log-likelihood)
  - Constrained clustering with fixed assignments
  - Multiple initialization strategies
- Evaluation Metrics
  - Log-likelihood calculations
  - Z-value computations
  - GAP statistic for optimal cluster selection
  - Entropy-based comparisons
- Validation
  - K-fold cross-validation
  - Empirical distribution generation
  - Rand index calculations
  - P-value computations
- Performance
  - Optimized NumPy operations
  - Optional Numba JIT compilation
  - Parallel processing support
  - Memory-efficient sparse matrix handling
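
The Rand index listed under Validation measures how often two clusterings agree on whether a pair of products belongs in the same submarket. The sketch below is a minimal, standard-library illustration of that textbook definition. The function name `rand_index` is hypothetical and is not claimed to match SUBMARIT's own API or the behaviour of the original RandIndex4.m.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    Hypothetical helper for illustration only; not part of the SUBMARIT API.
    A pair counts as an agreement when both clusterings put the two items
    together, or both put them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if len(labels_a) < 2:
        raise ValueError("at least two items are required")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Relabeling the clusters does not change the partition, so agreement is perfect.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

This pairwise loop is quadratic in the number of products, which is fine for illustration but would likely be replaced by a contingency-table computation on large data sets.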
Full documentation is available at https://submarit.readthedocs.io
- Installation Guide - Platform-specific installation instructions
- Quick Start Tutorial - Get started with SUBMARIT in minutes
- API Reference - Complete API documentation with examples
- Algorithm Theory - Mathematical foundations and implementation details
- Performance Guide - Optimization strategies and benchmarks
- FAQ - Frequently asked questions
- Migration Guide - Comprehensive guide for MATLAB users
- Function Mapping - 1-to-1 MATLAB to Python function reference
- Migration Examples - Jupyter notebook with practical examples
- Getting Started - Basic introduction to SUBMARIT
- Advanced Clustering - Advanced techniques and algorithms
- Performance Optimization - Tips for optimal performance
- Visualization Gallery - Beautiful visualizations
- MATLAB Migration - Examples for MATLAB users
- Test Suite Documentation - Guide to running tests
- Benchmarks - Performance benchmark results
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use SUBMARIT in your research, please cite:
@software{submarit,
  title = {SUBMARIT: SUBMARket Identification and Testing},
  year = {2024},
  url = {https://github.com/yourusername/submarit}
}
This is a Python implementation of the original MATLAB SUBMARIT package. The original MATLAB files are preserved in the matlab_original/ directory for reference and validation purposes.
The MATLAB implementation includes contributions from:
- Stephen France, Mississippi State University (RandIndex4.m, 2012)
- Additional contributors (names unknown)
The methodology is based on submarket identification research from the marketing science literature, including:
- Rand (1971) - Rand Index for clustering similarity
- Hubert and Arabie (1985) - Adjusted Rand Index
- Urban, Johnson, and Hauser - Z-value calculations
- Tibshirani, Walther, and Hastie (2001) - GAP statistic for optimal cluster selection
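
To make the Hubert and Arabie (1985) correction concrete, here is a minimal sketch of the adjusted Rand index computed from the contingency table of two labelings. It follows the standard published formula, uses only the standard library, and the function name `adjusted_rand_index` is a hypothetical one chosen for illustration; it is not taken from the SUBMARIT API.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) of two labelings.

    Hypothetical helper for illustration only; not part of the SUBMARIT API.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")

    # Pairs placed together in both clusterings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately (table margins).
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # chance-level agreement
    max_index = (sum_a + sum_b) / 2        # upper bound on agreement
    if max_index == expected:
        # Degenerate case: both partitions are trivial (all one cluster or
        # all singletons), so they are treated as identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions under different label names score exactly 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Unlike the plain Rand index, this value is centered so that random agreement scores near 0, and it can be negative when two clusterings agree less often than chance would predict.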