michaelcapizzi/TextComplexity

Requirements

Clone the repository

git clone https://github.com/michaelcapizzi/TextComplexity.git

How to Run

The system expects the text you want to classify as a .txt file, with paragraphs delimited by a blank line:

She moved away from the door, stepping as softly as if she were afraid of awakening some one.  She was glad that there 
was grass under her feet and that her steps made no sounds.  She walked under one of the fairy-like gray arches between 
the trees and looked up at the sprays and tendrils which formed them.  "I wonder if they are all quite dead," she said.  
"Is it all a quite dead garden? I wish it wasn't."

If she had been Ben Weatherstaff she could have told whether the wood was alive by looking at it, but she could only 
see that there were only gray or brown sprays and branches and none showed any signs of even a tiny leaf-bud anywhere.

Note: You can import a .txt file whose paragraphs are not delimited by a blank line, but computation time may grow sharply, because the feature extractor depends on a discourse parse that then handles the entire document as a single instance to parse.
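
To check how a file will be segmented before running the system, the blank-line convention described above can be sketched as follows. This is a minimal illustration only: `ParagraphSplit` and `splitParagraphs` are hypothetical names, not part of this repository, and the system's own reader may behave differently.

```scala
// Hypothetical helper, not part of the repository: splits raw text into
// paragraphs wherever one or more blank lines occur.
object ParagraphSplit {
  def splitParagraphs(text: String): Vector[String] =
    text
      .split("""\r?\n\s*\r?\n""") // a line break, optional whitespace, another line break
      .map(_.trim)
      .filter(_.nonEmpty)
      .toVector
}
```

A file that yields a single very long paragraph here is likely to hit the slow path described in the note.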

The system can handle two different classification structures:

Classify into 6 distinct classes:

K-1     2-3     4-5     6-8     9-10    11-12
"0001"  "0203"  "0405"  "0608"  "0910"  "1112"

or

Classify into 3 distinct classes:

K-5     6-8     9-12
"0005"  "0608"  "0912"

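The class labels appear to encode the lowest and highest grade of each band as two zero-padded numbers, with kindergarten as 0. The sketch below illustrates that reading; it assumes the 6-8 band is labelled "0608" in both structures, and `GradeBands` and its members are hypothetical names rather than code from this repository.

```scala
// Hypothetical illustration of the label naming pattern; kindergarten = grade 0.
object GradeBands {
  val sixClass: Vector[(Int, Int)] =
    Vector((0, 1), (2, 3), (4, 5), (6, 8), (9, 10), (11, 12))
  val threeClass: Vector[(Int, Int)] =
    Vector((0, 5), (6, 8), (9, 12))

  // Builds a label such as "0405" from the band's lower and upper grade.
  def label(lo: Int, hi: Int): String = f"$lo%02d$hi%02d"

  // Returns the label of the band containing `grade`, if any.
  def labelFor(grade: Int, bands: Vector[(Int, Int)]): Option[String] =
    bands.collectFirst { case (lo, hi) if lo <= grade && grade <= hi => label(lo, hi) }
}
```

For example, `GradeBands.labelFor(4, GradeBands.sixClass)` gives `Some("0405")`, while the same grade falls into `"0005"` under the 3-class structure.
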
Best Performing

If you'd like to simply run the best-performing model for each label structure*, run this command: run-main Complexity.Demo [file to analyze] [number of classes]. The output shows the generated feature values, the predicted grade-level band, and the confidence for each of the other classes for comparison.

run-main Complexity.Demo "document.txt" "3"

or

run-main Complexity.Demo "document.txt" "6"

*current, best-performing configurations:

  • 6-class structure: random forest classifier using lexical features only
  • 3-class structure: linear SVM classifier using lexical and paragraph features

Other Configurations

If you'd like to further investigate the predicted output generated by different configurations, you can run the Predict main class. This requires more arguments: run-main Complexity.Predict [file to analyze] [number of classes] [model to use] [full path to dataset to use] [feature types to include+].

The model choices are: randomForest, perceptron, logisticRegression, or svm.

The datasets can all be found in the /resources/savedFeatureMatrices folder of the repository. They are saved with a .svmLight extension, but they are just plain text. The number of classes and the feature sets used to generate each matrix should be easily identifiable from the file name.

Note: The choice of the dataset must match both the number of classes and the feature sets to use.
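
Because the feature matrices are plain text, they can be inspected directly. The sketch below assumes the files follow the conventional SVMlight line layout (`<label> <index>:<value> ...`, with an optional `#` comment at the end); that has not been verified against the files in this repository, and `SvmLightLine` is a hypothetical helper, not part of the system.

```scala
// Hypothetical reader for one line in the conventional SVMlight layout.
// Assumption: "<label> <index>:<value> ... # optional comment".
object SvmLightLine {
  final case class Example(label: String, features: Map[Int, Double])

  def parse(line: String): Example = {
    // Drop a trailing comment, then split on runs of whitespace.
    val tokens = line.takeWhile(_ != '#').trim.split("\\s+")
    val features = tokens.tail.map { token =>
      val parts = token.split(":", 2)
      parts(0).toInt -> parts(1).toDouble
    }.toMap
    // The label is kept as a string so that no numeric encoding is assumed.
    Example(tokens.head, features)
  }
}
```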

run-main Complexity.Predict "document.txt" "6" "randomForest" "/path/to/resources/savedFeatureMatrices/lex_par-6.svmLight" "lexical" "paragraph"

Each feature set to include should be its own argument, separated by a space. The choices are: lexical, syntactic, paragraph, or all. For example, lexical paragraph would use both lexical and paragraph features. Using all will utilize all three feature sets.
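
The argument rules above can be checked before launching a run. The following is a minimal sketch of such a check, using only the choices listed in this README; `PredictArgs` is a hypothetical helper and does not reflect how Complexity.Predict itself validates its input.

```scala
// Hypothetical pre-flight check for the Predict arguments described above.
object PredictArgs {
  val classCounts = Set("3", "6")
  val models = Set("randomForest", "perceptron", "logisticRegression", "svm")
  val featureTypes = Set("lexical", "syntactic", "paragraph", "all")

  // Expected order: file, number of classes, model, dataset path, feature types+
  def validate(args: Seq[String]): Either[String, Unit] =
    if (args.length < 5)
      Left("expected: [file] [number of classes] [model] [dataset path] [feature types+]")
    else if (!classCounts(args(1)))
      Left(s"number of classes must be 3 or 6, got: ${args(1)}")
    else if (!models(args(2)))
      Left(s"unknown model: ${args(2)}")
    else
      args.drop(4).find(f => !featureTypes(f)) match {
        case Some(bad) => Left(s"unknown feature type: $bad")
        case None      => Right(())
      }
}
```

This sketch does not check that the dataset file matches the chosen classes and feature sets; that still has to be confirmed from the file name.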

About

Implementation of a text complexity system in Scala
