Home

Welcome to the stackoverflow-post-analysis wiki!

Preliminary Analysis

Create a web app that predicts the success of questions posted to Stack Overflow. Questions can be retrieved via an API call or entered directly into the app as post text. Success can be defined as:
- Whether or not an answer is accepted as correct.
- Number of page views.
- Number of answers.
- Number of upvotes.
- A composite score based on all of the above.
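One possible way to build the composite score is to standardize each metric and take a weighted sum. The sketch below assumes that approach; the metric names, the weights, and the sample posts are all hypothetical, since nothing above fixes how the metrics should be combined.

```python
from statistics import mean, pstdev

# Hypothetical metric names and weights; the plan does not specify either.
WEIGHTS = {"accepted": 0.4, "views": 0.2, "answers": 0.2, "upvotes": 0.2}

def zscores(values):
    """Standardize a list of numbers; a constant column maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def composite_scores(posts, weights=WEIGHTS):
    """Return one weighted sum of standardized metrics per post."""
    columns = {name: zscores([float(p[name]) for p in posts]) for name in weights}
    return [
        sum(weights[name] * columns[name][i] for name in weights)
        for i in range(len(posts))
    ]

# Made-up sample posts, for illustration only.
posts = [
    {"accepted": 1, "views": 1200, "answers": 3, "upvotes": 10},
    {"accepted": 0, "views": 150, "answers": 0, "upvotes": 0},
    {"accepted": 1, "views": 400, "answers": 1, "upvotes": 2},
]
scores = composite_scores(posts)
```

Standardizing first keeps page views, which can run into the thousands, from swamping the other metrics. Heavy-tailed counts such as views may also benefit from a log transform before standardizing, which this sketch leaves out.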
Create static visualizations of important post features.
Offer an interactive tool that shows how the success score changes as feature values change.

Download the dataset and load it into a SQL database.
Extract features from the dataset.
Model the desired target as a function of the extracted features.
- Validate the model on an independent test split.
Retrieve a prospective user post via an API call or direct entry into the web app.
Extract features from the post.
Feed the extracted features to the fitted model and generate predictions.
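The loading and feature-extraction steps above might look roughly like the following sketch. It uses an in-memory SQLite database as a stand-in for the SQL database, and it assumes post bodies are stored as HTML with `<code>` tags. The table layout, the `extract_features` helper, and the chosen features are hypothetical.

```python
import re
import sqlite3

def extract_features(title, body):
    """Compute a few simple surface features from a post's text."""
    # Note: this crude tokenizer also counts HTML tag names as words.
    words = re.findall(r"[A-Za-z']+", body)
    return {
        "title_length": len(title),
        "body_word_count": len(words),
        "has_code_block": int("<code>" in body),
        "question_marks": title.count("?") + body.count("?"),
    }

# In-memory database for illustration; a real run would point at a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "How do I reverse a list?", "I tried <code>lst.reverse()</code> but got None. Why?"),
        (2, "Help", "It does not work."),
    ],
)

# Features for posts already stored in the database.
features = {
    post_id: extract_features(title, body)
    for post_id, title, body in conn.execute("SELECT id, title, body FROM posts ORDER BY id")
}

# The same function handles a prospective post entered directly into the app.
new_post = extract_features("Why is my loop slow?", "See <code>for i in x</code>.")
conn.close()
```

Using one function for both stored and prospective posts keeps the features seen at prediction time consistent with those used for fitting.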

Questions have different subject matter.
- Solution: Use lots of data to isolate signals of interest.
Posts have been up for different amounts of time.
- Solution: Only analyze posts within a certain time window.
- Solution: Model the evolution of the success score as a function of time, then normalize the data to remove time effects.
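As a rough illustration of removing time effects, the simplest normalization divides a count by the post's age. This sketch assumes views accumulate at a constant rate, which is a strong simplification of the time model proposed above; the `views_per_day` helper and the dates are hypothetical.

```python
from datetime import datetime, timezone

def views_per_day(view_count, created_at, snapshot_at):
    """Average daily views between post creation and the data snapshot."""
    age_days = (snapshot_at - created_at).total_seconds() / 86400
    # Clamp to one day so very new posts do not produce inflated rates.
    return view_count / max(age_days, 1.0)

# Hypothetical snapshot date and posts, for illustration only.
snapshot = datetime(2015, 1, 1, tzinfo=timezone.utc)
old_post = views_per_day(3650, datetime(2014, 1, 1, tzinfo=timezone.utc), snapshot)
new_post = views_per_day(300, datetime(2014, 12, 2, tzinfo=timezone.utc), snapshot)
```

Here a year-old post with 3650 views and a 30-day-old post with 300 views both come out at 10 views per day, so they become directly comparable.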
The dataset is too large to fit in memory.
- Solution: Use scikit-learn's out-of-core learning algorithms.
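Out-of-core learning depends on feeding the model one batch at a time. The sketch below shows only that batching side, using the standard library. In scikit-learn, each batch would be passed to the `partial_fit` method of an estimator that supports incremental learning, such as `SGDClassifier`. The `iter_batches` helper and the table layout are hypothetical, and a running mean stands in for the model update.

```python
import sqlite3

def iter_batches(conn, query, batch_size):
    """Yield lists of rows so only one batch is held in memory at a time."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows

# Small in-memory table standing in for the full dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO posts (score) VALUES (?)", [(i,) for i in range(10)])

batch_sizes = []
total, count = 0, 0
for batch in iter_batches(conn, "SELECT id, score FROM posts", batch_size=4):
    batch_sizes.append(len(batch))
    # An incremental model update (e.g. partial_fit) would go here.
    total += sum(score for _, score in batch)
    count += len(batch)
running_mean = total / count
conn.close()
```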
Natural language features are complicated.
- Solution: The NLTK package may be able to extract the desired features accurately.