David Staub edited this page Feb 17, 2015 · 11 revisions

Welcome to the stackoverflow-post-analysis wiki!

Preliminary Analysis

Project Goals

  • Create a web app that predicts the success of questions posted to Stack Overflow. Questions can be retrieved via an API call or entered directly into the app as post text. Success can be defined as:
    • Whether or not an answer is accepted as correct.
    • Number of page views.
    • Number of answers.
    • Number of upvotes.
    • A composite score based on all of the above.
  • Create static visualizations of important post features.
  • Offer an interactive tool that shows how the success score changes as feature values change.
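
The composite score is not defined on this page. One way it might be built, as a minimal sketch: standardize each metric across posts (z-score) and take a weighted sum. The function name `composite_scores`, the field names, the equal weights, and the sample posts are all assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical weights; the wiki does not specify how the metrics are combined.
WEIGHTS = {"accepted": 1.0, "views": 1.0, "answers": 1.0, "upvotes": 1.0}

def composite_scores(posts, weights=WEIGHTS):
    """Return one score per post: a weighted sum of z-scored success metrics."""
    scores = [0.0] * len(posts)
    for name, weight in weights.items():
        values = [float(post[name]) for post in posts]
        mu, sigma = mean(values), pstdev(values)
        for i, value in enumerate(values):
            # A metric with no spread carries no information, so it contributes 0.
            z = (value - mu) / sigma if sigma > 0 else 0.0
            scores[i] += weight * z
    return scores

# Made-up posts, for illustration only.
posts = [
    {"accepted": 1, "views": 1200, "answers": 3, "upvotes": 10},
    {"accepted": 0, "views": 150, "answers": 0, "upvotes": 0},
    {"accepted": 1, "views": 400, "answers": 1, "upvotes": 2},
]
scores = composite_scores(posts)
```

Standardizing first keeps a large-valued metric such as page views from dominating the sum. The relative weights would still need to be chosen to match whatever definition of success the app settles on.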

Workflow

  1. Download the data set and load it into a SQL database.
  2. Extract features from the data set.
  3. Model the desired target as a function of the extracted features.
    • Validate the model on an independent test split.
  4. Retrieve a prospective user post via an API call or direct entry into the web app.
  5. Extract features from the post.
  6. Feed the extracted features to the fitted model and generate predictions.
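
Step 1 might look like the following sketch, which streams rows from a Stack Exchange style `Posts.xml` dump into SQLite. The `<row .../>` layout and attribute names are assumed from the public data dump format, which this page does not name. The table name `posts`, the function `load_posts`, and the inline sample data are illustrative only.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for the real dump file (assumed format: one <row/> per post).
SAMPLE_XML = b"""<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="3" Score="5" ViewCount="120"
       AnswerCount="2" Title="How do I sort a dict?" Body="&lt;p&gt;Example&lt;/p&gt;" />
  <row Id="2" PostTypeId="1" Score="0" ViewCount="15"
       AnswerCount="0" Title="help" Body="&lt;p&gt;it fails&lt;/p&gt;" />
  <row Id="3" PostTypeId="2" Score="4" Body="&lt;p&gt;Use sorted().&lt;/p&gt;" />
</posts>"""

COLUMNS = ["Id", "PostTypeId", "AcceptedAnswerId", "Score",
           "ViewCount", "AnswerCount", "Title", "Body"]

def load_posts(xml_file, conn):
    """Stream <row> elements into a `posts` table one element at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (Id INTEGER PRIMARY KEY, PostTypeId INTEGER, "
        "AcceptedAnswerId INTEGER, Score INTEGER, ViewCount INTEGER, "
        "AnswerCount INTEGER, Title TEXT, Body TEXT)"
    )
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            # Missing attributes come back as None and are stored as NULL.
            conn.execute(
                "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                [elem.get(col) for col in COLUMNS],
            )
            elem.clear()  # drop the element's attributes once they are stored
    conn.commit()

conn = sqlite3.connect(":memory:")
load_posts(io.BytesIO(SAMPLE_XML), conn)

# Questions have PostTypeId = 1 in the assumed dump format.
questions = conn.execute(
    "SELECT Id, Title, AcceptedAnswerId IS NOT NULL "
    "FROM posts WHERE PostTypeId = 1 ORDER BY Id"
).fetchall()
```

The final query returns one row per question along with a 0/1 flag for whether an answer was accepted, which is one of the candidate success targets listed under Project Goals.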

Feature Examples

  • post body length
  • title length
  • average sentence length
  • word vectors from NLP
  • number of tags
  • use of vertical whitespace
  • use of headings
  • use of emphasis (bolding, italics, etc.)
  • use of code snippets
  • use of images
  • use of tables
  • proper capitalization and punctuation
  • saying thank you
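
Several of these features can be computed with the standard library alone. The sketch below assumes the post body arrives as HTML and that the tags are already a list. The function `extract_features` and the feature names are hypothetical, and the regex-based HTML handling is a rough heuristic, not a real parser.

```python
import re

def extract_features(title, body_html, tags):
    """Compute a few simple surface features from one question."""
    # Remove code blocks first so code does not count as prose, then strip remaining tags.
    no_code = re.sub(r"<pre\b.*?</pre>", " ", body_html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", no_code)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sentence = [len(s.split()) for s in sentences]
    return {
        "body_length": len(text.split()),  # words of prose, code excluded
        "title_length": len(title),        # characters
        "avg_sentence_length": (
            sum(words_per_sentence) / len(sentences) if sentences else 0.0
        ),
        "num_code_blocks": len(re.findall(r"<pre\b", body_html, flags=re.IGNORECASE)),
        "num_images": len(re.findall(r"<img\b", body_html, flags=re.IGNORECASE)),
        "num_tags": len(tags),
        "has_emphasis": bool(re.search(r"<(strong|em|b|i)\b", body_html, re.IGNORECASE)),
        "says_thanks": bool(re.search(r"\bthank(s| you)\b", text, re.IGNORECASE)),
    }

features = extract_features(
    title="How do I sort a dict by value?",
    body_html=(
        "<p>I have a dict. I want it <strong>sorted</strong> by value.</p>"
        "<pre><code>d = {'a': 2, 'b': 1}</code></pre>"
        "<p>Thanks!</p>"
    ),
    tags=["python", "dictionary", "sorting"],
)
```

Features such as word vectors or proper capitalization would likely need a tokenizer or tagger (for example from NLTK) and are left out of this sketch.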

Tools and Techniques

  • scikit-learn, numpy, pandas, nltk, scipy
  • SQL, Python
  • SVM, random forests, linear models
  • cross-validation, feature selection, train/test split
  • NLP

Challenges

  • Questions have different subject matter.
    • Solution: Use lots of data to isolate signals of interest.
  • Posts have been up for different amounts of time.
    • Solution: Only analyze posts within a certain time window.
    • Solution: Model the evolution of the success score as a function of time, then normalize the data to remove time effects.
  • The data set is too large to fit in memory.
    • Solution: Use scikit-learn estimators that support out-of-core (incremental) learning via partial_fit.
  • Natural language features are complicated.
    • Solution: The NLTK package may be able to extract the desired features accurately.
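
For the memory constraint, scikit-learn estimators that implement `partial_fit` can be trained one batch at a time. The sketch below uses `SGDClassifier` on synthetic batches. The batch generator `iter_batches`, the three-feature layout, and the data itself are stand-ins for rows streamed from the SQL database, and the choice of a linear model is an assumption, not a recommendation from this page.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
TRUE_WEIGHTS = np.array([2.0, -1.0, 0.5])  # synthetic ground truth, illustration only

def iter_batches(n_batches, batch_size):
    """Yield (X, y) chunks; a real pipeline would read these from the database."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = (X @ TRUE_WEIGHTS > 0).astype(int)  # 1 stands for a "successful" post
        yield X, y

model = SGDClassifier(random_state=0)
for X_batch, y_batch in iter_batches(n_batches=20, batch_size=500):
    # `classes` must be supplied on the first call, because a single batch
    # may not contain every label; passing it on later calls is allowed.
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

# Validate on a batch the model never saw during training.
X_test, y_test = next(iter_batches(n_batches=1, batch_size=1000))
accuracy = model.score(X_test, y_test)
```

Only one batch is held in memory at a time, so the full data set never has to be loaded. Random forests in scikit-learn do not offer `partial_fit`, so this approach limits the choice of model to estimators that do.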