Curated datasets for machine learning tasks, organized by use case and adapted from a now-defunct article on Kaggle. Also check out this repo of winning solutions.
For each type of analysis think about:
- What problem does it solve, and for whom?
- How is it being solved today?
- What are the data inputs and where do they come from?
- What are the outputs and how are they consumed? Online models, static or dynamic reports?
- Is it a revenue leakage (“saves us money”) or a revenue growth (“makes us money”) problem?
Forecast volumes of sales, inventory needed, etc.
- Rossmann - Drugstore sales forecasting
- Online Product Sales - self-help product sales forecasting
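Before fitting a model on forecasting datasets like these, a seasonal-naive baseline is a useful yardstick. The sketch below is illustrative only: the function names, the weekly season length, and the toy sales numbers are assumptions and do not come from any dataset listed here.

```python
from typing import List


def seasonal_naive_forecast(history: List[float], season_length: int, horizon: int) -> List[float]:
    """Repeat the last observed season to forecast `horizon` steps ahead."""
    if season_length <= 0 or len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    # Step h (0-based) reuses the value seen at the same position in the last season.
    return [last_season[h % season_length] for h in range(horizon)]


def mean_absolute_error(actual: List[float], predicted: List[float]) -> float:
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy daily sales for two weeks (hypothetical numbers).
daily_sales = [10, 12, 11, 13, 20, 30, 25,
               11, 13, 12, 14, 22, 31, 27]
forecast = seasonal_naive_forecast(daily_sales, season_length=7, horizon=3)
```

Any model trained on the datasets above should beat this baseline's error on a held-out period to justify its extra complexity.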
Identify the most lucrative and loyal segments of your customers
Identify characteristics and timing of customer churn/upgrades in order to prevent/encourage them
Identify main customer clusters and their characteristics
- Instacart Market Basket Analysis
- Online Retail Dataset
- Loyal Customer Prediction - new customers from 11/11 event on Tmall
Group products together in the most reasonable category trees
Identify which products a customer is going to buy based on past purchases
- MovieLens - Movie recommendation dataset
- Jester - Joke recommendation dataset
- Book-Crossing - Book recommendation dataset
- HetRec - Music recommendation dataset
- Instacart Market Basket Analysis
- WikiLens - Wiki edits dataset
- OpenStreetMap - OpenStreetMap edits dataset
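One simple way to get started with purchase-based recommendation is item co-occurrence counting. This is a minimal sketch, assuming each past purchase is available as a set of item names; the helper names `build_cooccurrence` and `recommend` and the toy baskets are hypothetical, and none of the datasets above prescribes this method.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, Iterable, List, Set


def build_cooccurrence(baskets: Iterable[Set[str]]) -> Dict[str, Counter]:
    """Count how often each ordered pair of distinct items shares a basket."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(basket, 2):
            counts[a][b] += 1
    return counts


def recommend(purchased: Set[str], counts: Dict[str, Counter], k: int = 3) -> List[str]:
    """Rank items not yet purchased by total co-occurrence with purchased items."""
    scores: Counter = Counter()
    for item in purchased:
        for other, n in counts.get(item, {}).items():
            if other not in purchased:
                scores[other] += n
    # Sort by score descending, then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked][:k]


# Toy baskets (hypothetical data).
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
counts = build_cooccurrence(baskets)
suggestions = recommend({"milk"}, counts, k=2)
```

Raw counts favor popular items, so a real system would usually normalize them or move to a learned model.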
Allocate credit fairly across all ad channels and build a portfolio for your ad spending
- AnalyzeCore - Synthetic data and attribution models
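To make the attribution problem concrete, here is a minimal sketch comparing two simple heuristic rules, last-touch and linear attribution. It assumes each conversion is recorded as an ordered list of channel touches; the function names and the toy paths are hypothetical, and these heuristics are not necessarily the models used in the AnalyzeCore material.

```python
from collections import defaultdict
from typing import Dict, List


def last_touch(paths: List[List[str]]) -> Dict[str, float]:
    """Give each conversion's full credit to the final channel in its path."""
    credit: Dict[str, float] = defaultdict(float)
    for path in paths:
        if path:
            credit[path[-1]] += 1.0
    return dict(credit)


def linear(paths: List[List[str]]) -> Dict[str, float]:
    """Split each conversion's credit equally across every touch in its path."""
    credit: Dict[str, float] = defaultdict(float)
    for path in paths:
        for channel in path:
            credit[channel] += 1.0 / len(path)
    return dict(credit)


# Toy converting paths (hypothetical data), one list per conversion.
converting_paths = [
    ["search", "social", "email"],
    ["social", "email"],
    ["search"],
]
last_touch_credit = last_touch(converting_paths)
linear_credit = linear(converting_paths)
```

Both rules hand out exactly one unit of credit per conversion; they differ only in how that unit is divided, which is why the choice of rule can shift budget between channels.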
Predict and price impressions, clicks, conversions or any performance metrics for ads
- Avazu Click-Through Rate Prediction - Mobile ads click-through-rate prediction
- Avito Demand Prediction Challenge - Predict demand for an online classified ad
Detect ad click/install fraud
- TalkingData AdTracking Fraud Detection Challenge - Can you detect fraudulent click traffic for mobile app ads?
Optimal price for growth, profit, customer retention, etc.
Optimal store/website layout for growth, profit, customer retention, etc.
Text classification to determine customer feedback/sentiment about your products
- IMDb - Movie reviews
- Amazon Reviews
- Yelp Open Dataset - Yelp reviews
- Wongnai Challenge - Restaurant reviews
- OpinRank Review Dataset - TripAdvisor and Edmunds Reviews
Generate natural language answers based on given context and questions
- SQuAD - Stanford Question Answering Dataset
Predict wait time based on customer history, time of day, call volumes, products owned, churn risk, LTV, etc.
Score candidates based on resumes and internal records
Predict which employees are most likely to leave
- SAS Employee Turnover - Synthetic employee churn dataset
- IBM HR Employee Attrition and Performance - Synthetic employee churn dataset
- Employee Attrition - Synthetic employee churn dataset
Classify medical images according to conditions
- Grand Challenges - Collection of Biomedical Image Competitions
- MURA - Large Dataset for Abnormality Detection in Musculoskeletal Radiographs
- ISIC - International Skin Imaging Collaboration
- DermNet - Skin Disease Atlas
- TCIA - Cancer Imaging Archive
- OASIS - Longitudinal Neuroimaging Dataset
- DDSM - Digital Database for Screening Mammography
- Breast Histopathology Images
- NIH Chest X-rays
- HERLEV - Pap-smear Database
- Stanford Tissue Microarray Database
- CheXpert
- MIMIC-CXR
Predict risk of readmission based on patient attributes, medical history, diagnosis and treatment
Generate natural language reports based on tabular data
Classify patients according to their initial complaints
Optimize/predict operating theatre & bed occupancy based on initial patient visits
- Healthcare in Washington
- Mini Heritage Health Prize - Processed version of Heritage Health Prize dataset
Activity monitoring of patients
- OPPORTUNITY - Dataset for Human Activity Recognition from Wearable, Object, and Ambient Sensors
- PAMAP2 - Physical Activity Monitoring Data Set
Predict survival rates of patients
- Haberman's Survival Data Set - Survival of patients who had undergone surgery for breast cancer
Analyse effects of administering different types and dosages of medication for a disease
Generate short descriptions of news articles
Predict timing and size of claims
Outlier detection for insurance claim fraud
Predict type of insurance
Predict which customers are going to default
- Statlog (German Credit Data) Data Set
- Statlog (Australian Credit Approval) Data Set
- Home Credit Default Risk
- A Fin tech fraud transaction classification - default prediction with anonymized features
Optimize portfolio of assets according to risks and returns
- quantmod - library for financial modeling in R; APIs for downloading fundamental and technical data
- Stanford EE103 - Popular ETFs from 2006 to 2016
Trade financial assets using automated models
- quantmod - library for financial modeling in R; APIs for downloading fundamental and technical data
- Get Rich or Die Modelin' - Bitcoin trading signals
Identify fraudulent transactions and parties with outlier detection and network analysis
- Credit Card Fraud Detection - Anonymized features
- PaySim Synthetic Financial Datasets For Fraud Detection
- Bitcoin Transactions
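For the outlier-detection side of fraud screening, a robust first pass is the modified z-score based on the median absolute deviation (MAD). The sketch below is an assumption-laden illustration: the function name `mad_outliers`, the 3.5 cutoff (a commonly used default), and the toy amounts are not taken from any dataset above, and it does not cover the network-analysis side.

```python
from statistics import median
from typing import List


def mad_outliers(values: List[float], threshold: float = 3.5) -> List[int]:
    """Return indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    # when the data are roughly normal.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Toy transaction amounts (hypothetical data) with one extreme value.
amounts = [12.0, 15.5, 11.0, 14.2, 13.8, 950.0, 12.9]
flagged = mad_outliers(amounts)
```

Because the median and MAD are barely moved by a single extreme value, this check is less easily masked by the outliers it is looking for than a mean and standard deviation would be.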
Detect malfunctioning pieces with computer vision
Find bottlenecks in manufacturing processes
Predict your products' rate and timing of failures
Design new products
- Fashion MNIST - Labeled fashion images
Forecast agricultural yields
- Planet: Understanding the Amazon from Space
- SpaceNet - Annotated satellite images of buildings and roads
- Dstl Satellite Imagery Feature Detection
Classify wild animals
- North American Camera Trap Images (NACTI) - Camera trap images of wild animals
Predict real estate values based on their characteristics
Score essays based on past pieces
Optimize distribution networks of electricity, water, etc.