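An informational blog post showing how the Type-Token Ratio (TTR) measure can be used, as an introduction to Natural Language Processing (NLP). TTR is the number of distinct words (types) in a text divided by the total number of words (tokens).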
To run the finished script:
python TTR_Rec.py
STEP 1:
For this script, we use NLTK, a popular NLP library for Python.
To install NLTK, type the following in your terminal:
pip install nltk
Then import nltk and Python's regular-expression module, re:
import nltk as nlp
import re
STEP 2: Declare a string containing the text whose TTR we want to calculate.
document="""Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written -- referred to as natural language. It is a component of artificial intelligence!"""
STEP 3: Replace every special character (anything that is not a letter, digit, or underscore) with a space using this regex.
document= re.sub(r'[^\w]', ' ', document)
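To see what this substitution does, here is a small sketch on a made-up sample string. The string and the variable names are illustrative only and are not part of the script. Note that \w also matches digits and underscores, so those characters are kept.

```python
import re

# Hypothetical sample text, used only to illustrate the substitution.
sample = "Hello, world! (NLP_101)"

# Every character that is not a letter, digit, or underscore becomes a space.
cleaned = re.sub(r'[^\w]', ' ', sample)

# Splitting on whitespace shows the words that survive the cleanup.
words = cleaned.split()
print(words)  # ['Hello', 'world', 'NLP_101']
```

Because each removed character turns into a space rather than disappearing, words that were separated only by punctuation stay separate.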
STEP 4: Convert Document to Lower Case
document=document.lower()
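Then tokenize the document to generate a list of words. word_tokenize relies on NLTK's Punkt tokenizer data; if it raises a LookupError, download the resource named in the error message with nlp.download (for example 'punkt', or 'punkt_tab' on newer NLTK releases).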
tokens=nlp.word_tokenize(document)
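STEP 5: Count how many times each token occurs and store the counts in a dict-like Counter called types. Counter comes from Python's standard collections module.
from collections import Counter
types=Counter(tokens)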
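As a rough illustration of why the length of the counter gives the number of types, here is a minimal sketch with a hypothetical token list (sample_tokens and sample_types are made-up names standing in for the tokenizer output and the counter).

```python
from collections import Counter

# Hypothetical token list standing in for the output of the tokenizer.
sample_tokens = ["it", "is", "what", "it", "is"]

# Counter maps each distinct token to the number of times it occurs.
sample_types = Counter(sample_tokens)

print(sample_types["it"])   # 2 -> "it" occurs twice
print(len(sample_types))    # 3 -> distinct tokens (types): it, is, what
print(len(sample_tokens))   # 5 -> total tokens
```

Each distinct word appears exactly once as a key, so the number of keys is the type count no matter how often a word repeats.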
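Finally, compute the TTR by dividing the number of types (the length of the Counter) by the number of tokens (the length of the list). Multiplying by 100 expresses the result as a percentage.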
TTR= (len(types)/len(tokens))*100
print(TTR)
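The steps above can also be wrapped in one reusable function. The sketch below is one possible way to do it, not the script itself: the function name type_token_ratio is hypothetical, and it assumes a simple regex tokenizer (re.findall with \w+) in place of NLTK's word_tokenize so that it runs with the standard library alone. For plain text the two should give similar, though not necessarily identical, tokens.

```python
import re
from collections import Counter


def type_token_ratio(text):
    """Return the type-token ratio of text as a percentage.

    Assumption: tokens are runs of word characters, which only
    approximates what nltk.word_tokenize would produce.
    """
    # Lowercase first so "The" and "the" count as the same type.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        # Avoid dividing by zero for empty or punctuation-only input.
        return 0.0
    types = Counter(tokens)
    return 100 * len(types) / len(tokens)


# 5 tokens, 4 distinct ("the" repeats), so the ratio is 80 percent.
print(type_token_ratio("The cat saw the dog."))  # 80.0
```

Returning 0.0 for empty input is a design choice made here for convenience; raising an error would be an equally reasonable alternative.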