Home
David Staub edited this page Feb 17, 2015 · 11 revisions
Welcome to the stackoverflow-post-analysis wiki!
- View count as a function of title length and number of tags.
- Data sourced from here.
- Create a web app that predicts the success of questions posted to StackOverflow. Questions can be retrieved via an API call or entered directly into the app as post text. Success can be defined as:
- Whether or not an answer is accepted as correct.
- Number of page views.
- Number of answers.
- Number of upvotes.
- A composite score based on all of the above.
- Create static visualizations of important post features.
- Offer an interactive tool showing how the success score changes as feature values change.
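One way the composite score mentioned above could be built is a weighted sum of the four individual measures. The sketch below is only an illustration: the `PostOutcome` class, the weights, and the log scaling of counts are all assumptions, since this page does not say how the measures would be combined.

```python
import math
from dataclasses import dataclass


@dataclass
class PostOutcome:
    """Hypothetical container for the four success measures of one question."""
    has_accepted_answer: bool
    view_count: int
    answer_count: int
    upvotes: int


# Assumed weights, chosen only for illustration.
WEIGHTS = {"accepted": 1.0, "views": 0.5, "answers": 0.5, "upvotes": 1.0}


def composite_score(post: PostOutcome) -> float:
    """Weighted sum of the four measures; counts are log-scaled so that
    a few very popular posts do not dominate the score."""
    return (
        WEIGHTS["accepted"] * (1.0 if post.has_accepted_answer else 0.0)
        + WEIGHTS["views"] * math.log1p(post.view_count)
        + WEIGHTS["answers"] * math.log1p(post.answer_count)
        + WEIGHTS["upvotes"] * math.log1p(max(post.upvotes, 0))
    )


if __name__ == "__main__":
    quiet = PostOutcome(False, 40, 0, 0)
    popular = PostOutcome(True, 5000, 3, 12)
    print(composite_score(quiet), composite_score(popular))
```

In practice the weights would probably need tuning, or the composite could be replaced by a learned combination of the individual targets.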
- Download the data-set and read it into a SQL database.
- Extract features from data-set.
- Model desired target data as a function of extracted features.
- Validate model on independent test split.
- Retrieve prospective user post via API call or direct entry into web app.
- Extract post features.
- Feed extracted features to fitted model, generate predictions.
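The modeling steps above (load into SQL, extract a feature, fit a model, validate on a held-out split) can be sketched end to end with the standard library alone. Everything specific here is an assumption made for illustration: the `posts` table layout, the sample rows, and the choice of a one-variable least-squares line in place of the scikit-learn models the project intends to use.

```python
import random
import sqlite3

# Hypothetical sample rows: (title, tags, view_count).
ROWS = [
    ("How do I sort a list?", "<python><list>", 1200),
    ("Segfault", "<c>", 90),
    ("Why is my SQL join returning duplicate rows?", "<sql><join>", 800),
    ("Help", "<java>", 40),
    ("How to parse JSON in JavaScript?", "<javascript><json>", 1500),
    ("Difference between an abstract class and an interface", "<java><oop>", 2100),
    ("Error", "<php>", 30),
    ("How can I read a file line by line in Python?", "<python><file>", 1700),
]


def load_posts(conn, rows):
    """Create the (assumed) posts table and insert the rows."""
    conn.execute("CREATE TABLE posts (title TEXT, tags TEXT, view_count INTEGER)")
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)
    conn.commit()


def fetch_features(conn):
    """Return (title_length, view_count) pairs, computed in SQL."""
    return conn.execute("SELECT LENGTH(title), view_count FROM posts").fetchall()


def train_test_split(pairs, test_fraction=0.25, seed=0):
    """Shuffle deterministically and hold out a fraction for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx if sxx else 0.0
    return mean_y - b * mean_x, b


def mean_abs_error(model, pairs):
    """Mean absolute error of the fitted line on the given pairs."""
    a, b = model
    return sum(abs(a + b * x - y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_posts(conn, ROWS)
    train, test = train_test_split(fetch_features(conn))
    model = fit_line(train)
    print("held-out MAE:", mean_abs_error(model, test))
```

The prediction steps for a prospective post would reuse the same feature query logic and then evaluate `a + b * x` with the fitted coefficients.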
- post body length
- title length
- average sentence length
- code snippets
- word vectors from NLP
- number of tags
- use of vertical whitespace
- use of headings
- use of emphasis (bolding, italics, etc.)
- use of code snippets
- use of images
- use of tables
- proper capitalization and punctuation
- saying thank you
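Several of the features listed above can be approximated with regular expressions before reaching for a full NLP toolkit. The sketch below assumes that the post body is HTML with code blocks wrapped in `<pre><code>` and that tags arrive as a string such as `<python><list>`; both are assumptions about the data format, and `extract_features` is a hypothetical helper name.

```python
import re

CODE_BLOCK = re.compile(r"<pre><code>.*?</code></pre>", re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def extract_features(title: str, body_html: str, tags: str) -> dict:
    """Compute simple length, formatting, and politeness features for a post."""
    n_code_blocks = len(CODE_BLOCK.findall(body_html))
    # Remove code blocks and markup so that only prose is measured.
    prose_html = CODE_BLOCK.sub(" ", body_html)
    text = " ".join(ANY_TAG.sub(" ", prose_html).split())
    # Naive sentence split on ., ! and ?; NLTK would be more accurate.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [len(s.split()) for s in sentences]
    return {
        "title_length": len(title),
        "body_length": len(text),
        "avg_sentence_length": sum(words) / len(words) if words else 0.0,
        "n_code_blocks": n_code_blocks,
        "n_tags": len(re.findall(r"<[^<>]+>", tags)),
        "n_headings": len(re.findall(r"<h[1-6]\b", body_html)),
        "n_images": len(re.findall(r"<img\b", body_html)),
        "n_tables": len(re.findall(r"<table\b", body_html)),
        "has_emphasis": bool(re.search(r"<(strong|em|b|i)>", body_html)),
        "says_thanks": bool(re.search(r"\bthank(s| you)\b", text, re.IGNORECASE)),
    }


if __name__ == "__main__":
    features = extract_features(
        "How do I sort a list?",
        "<p>I have a list. How do I sort it?</p>"
        "<pre><code>xs = [3, 1, 2]</code></pre><p>Thanks!</p>",
        "<python><list>",
    )
    for name, value in features.items():
        print(name, value)
```

Features such as word vectors or a check for proper capitalization and punctuation are left out here because they need more than pattern matching.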
- scikit-learn, numpy, pandas, nltk, scipy
- SQL, Python
- SVM, random forests, linear models
- cross-validation, feature selection, train/test split
- NLP
- Questions have different subject matter.
- Solution: Use lots of data to isolate signals of interest.
- Posts have been up for different amounts of time.
- Solution: Only analyze posts within a certain time window.
- Solution: Model the evolution of the success score as a function of time, then normalize the data to remove time effects.
- Data-set too large to fit in memory.
- Solution: Use scikit-learn out-of-core modeling algorithms.
- Natural language features are complicated.
- Solution: The NLTK package may be able to extract the desired features accurately.
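The two time-related solutions above can be illustrated with a small sketch: keep only posts whose age falls inside a window, and divide the view count by the post's age. Dividing by age is the simplest possible normalization and is an assumption here; the page leaves open how the evolution of the score over time would actually be modeled. The function names and the 30 to 365 day window are likewise hypothetical.

```python
from datetime import datetime, timedelta


def within_window(created, snapshot, min_age_days=30, max_age_days=365):
    """True if the post's age at snapshot time lies inside the window."""
    age = snapshot - created
    return timedelta(days=min_age_days) <= age <= timedelta(days=max_age_days)


def views_per_day(view_count, created, snapshot):
    """View count divided by post age in days (age floored at one day)."""
    age_days = max((snapshot - created).total_seconds() / 86400.0, 1.0)
    return view_count / age_days


if __name__ == "__main__":
    snapshot = datetime(2015, 1, 1)  # assumed date of the data dump
    posts = [
        (datetime(2014, 6, 1), 2140),    # inside the window
        (datetime(2014, 12, 22), 100),   # too recent
        (datetime(2012, 1, 1), 90000),   # too old
    ]
    for created, views in posts:
        if within_window(created, snapshot):
            print(created.date(), views_per_day(views, created, snapshot))
```

A constant views-per-day rate ignores the fact that attention to a question usually changes over time, so a fitted growth curve may be a better normalizer once enough data is available.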