Skip to content

Latest commit

 

History

History
204 lines (129 loc) · 10.5 KB

README.rst

File metadata and controls

204 lines (129 loc) · 10.5 KB

NLTK Trainer

NLTK Trainer exists to make training and evaluating NLTK objects as easy as possible.

Requirements

You must have Python 2.6 with argparse and NLTK 2.0 installed. NumPy, SciPy, and megam are recommended for training Maxent classifiers.

Training Classifiers

Example usage with the movie_reviews corpus can be found in Training Binary Text Classifiers with NLTK Trainer.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances::
python train_classifier.py --instances paras --classifier NaiveBayes movie_reviews
Include bigrams as features::
python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 movie_reviews
Minimum score threshold::
python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3 movie_reviews
Maximum number of features::
python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000 movie_reviews
Use the default Maxent algorithm::
python train_classifier.py --instances paras --classifier Maxent movie_reviews
Use the MEGAM Maxent algorithm::
python train_classifier.py --instances paras --classifier MEGAM movie_reviews
Train on files instead of paragraphs::
python train_classifier.py --instances files --classifier MEGAM movie_reviews
Train on sentences::
python train_classifier.py --instances sents --classifier MEGAM movie_reviews
Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaing 1/4, without pickling::
python train_classifier.py --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle movie_reviews
For a complete list of usage options::
python train_classifier.py --help

Using a Trained Classifier

You can use a trained classifier by loading the pickle file using nltk.data.load::
>>> import nltk.data
>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")
Or if your classifier pickle file is not in a nltk_data subdirectory, you can load it with pickle.load::
>>> import pickle
>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle"))

Either method will return an object that supports the ClassifierI interface.

Once you have a classifier object, you can use it to classify word features with the classifier.classify(feats) method, which returns a label::
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words])
>>> classifier.classify(feats)
If you used the --ngrams option with values greater than 1, you should include these ngrams in the dictionary using nltk.util.ngrams(words, n)::
>>> from nltk.util import ngrams
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words + ngrams(words, n)])
>>> classifier.classify(feats)

The list of words you use for creating the feature dictionary should be created by tokenizing the appropriate text instances: sentences, paragraphs, or files depending on the --instances option.

Training Part of Speech Taggers

The train_tagger.py script can use any corpus included with NLTK that implements a tagged_sents() method. It can also train on the timit corpus, which includes tagged sentences that are not available through the TimitCorpusReader.

Example usage can be found in Training Part of Speech Taggers with NLTK Trainer.

Train the default sequential backoff tagger on the treebank corpus::
python train_tagger.py treebank
To use a brill tagger with the default initial tagger::
python train_tagger.py treebank --brill
To train a NaiveBayes classifier based tagger, without a sequential backoff tagger::
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
To train a unigram tagger::
python train_tagger.py treebank --sequential u
To train on the switchboard corpus::
python train_tagger.py switchboard
To train on a custom corpus, whose fileids end in ".pos", using a TaggedCorpusReader::
python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.

You can also restrict the files used with the --fileids option::
python train_tagger.py conll2000 --fileids train.txt
For a complete list of usage options::
python train_tagger.py --help

Using a Trained Tagger

You can use a trained tagger by loading the pickle file using nltk.data.load::
>>> import nltk.data
>>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
Or if your tagger pickle file is not in a nltk_data subdirectory, you can load it with pickle.load::
>>> import pickle
>>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle"))

Either method will return an object that supports the TaggerI interface.

Once you have a tagger object, you can use it to tag sentences (or lists of words) with the tagger.tag(words) method::
>>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])

tagger.tag(words) will return a list of 2-tuples of the form [(word, tag)].

Analyzing Tagger Coverage

The analyze_tagger_coverage.py script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in Analyzing Tagged Corpora and NLTK Part of Speech Taggers.

Here's an example using the NLTK default tagger on the treebank corpus::
python analyze_tagger_coverage.py treebank

To get detailed metrics on each tag, you can use the --metrics option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See NLTK Default Tagger Treebank Tag Coverage and NLTK Default Tagger CoNLL2000 Tag Coverage for examples and statistics.

To analyze the coverage of a different tagger, use the --tagger option with a path to the pickled tagger::
python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle
To analyze coverage on a custom corpus, whose fileids end in ".pos", using a TaggedCorpusReader::
python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.

For a complete list of usage options::
python analyze_tagger_coverage.py --help

Analyzing a Tagged Corpus

The analyze_tagged_corpus.py script will show the following statistics about a tagged corpus:

  • total number of words
  • number of unique words
  • number of tags
  • the number of times each tag occurs

Example output can be found in Analyzing Tagged Corpora and NLTK Part of Speech Taggers.

To analyze the treebank corpus::
python analyze_tagged_corpus.py treebank
To sort the output by tag count from highest to lowest::
python analyze_tagged_corpus.py treebank --sort count --reverse
To see simplified tags, instead of standard tags::
python analyze_tagged_corpus.py treebank --simplify_tags
To analyze a custom corpus, whose fileids end in ".pos", using a TaggedCorpusReader::
python analyze_tagged_corpus.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.

For a complete list of usage options::
python analyze_tagged_corpus.py --help

Training IOB Chunkers

The train_chunker.py script can use any corpus included with NLTK that implements a chunked_sents() method.

Train the default sequential backoff tagger based chunker on the treebank_chunk corpus::
python train_chunker.py treebank_chunk
To train a NaiveBayes classifier based chunker::
python train_chunker.py treebank_chunk --classifier NaiveBayes
To train on the conll2000 corpus::
python train_chunker.py conll2000
To train on a custom corpus, whose fileids end in ".pos", using a ChunkedCorpusReader::
python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.

You can also restrict the files used with the --fileids option::
python train_chunker.py conll2000 --fileids train.txt
For a complete list of usage options::
python train_chunker.py --help