termExtraction function: Misleading documentation / bug on "remove.terms" argument #479

Open
kdmaclean opened this issue Jul 19, 2024 · 1 comment

Comments

@kdmaclean

Hi,

Not sure if this is intended behaviour or not. If it IS intended, I think the documentation is misleading.

The termExtraction function has a "remove.terms" argument with the following description:

#' @param remove.terms is a character vector. It contains a list of additional terms to delete from the documents before term extraction. The default is \code{remove.terms = NULL}.

However, this is not what actually happens. The terms are removed after term extraction, from the resulting list of terms. The distinction matters for bi-grams: if I want to remove the word "learning", the bi-gram "management learning" will still exist, because remove.terms is applied to the extracted list rather than to the documents beforehand, which would have prevented "management learning" from being formed in the first place.

The relevant part of the code is below in the extractNgrams function.

ngrams <- ngrams %>% 
   unite(ngram, paste("word",1:nword,sep=""), sep = " ") %>%
   dplyr::filter(!ngram %in% custom_stopngrams) %>%
   mutate(ngram = toupper(ngram))
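
To make the difference concrete, here is a minimal base-R sketch. It does not call bibliometrix: `make_bigrams` is a hypothetical helper written only for this illustration, and the sample words are made up.

```r
# Hypothetical helper (not part of bibliometrix): build bigrams from
# a vector of words by pairing each word with its successor.
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

words <- c("management", "learning", "in", "organizations")
remove_terms <- c("learning")

# Filtering AFTER extraction, as the quoted code does:
# only n-grams that exactly equal a removed term are dropped.
after <- make_bigrams(words)
after <- after[!after %in% remove_terms]

# Filtering BEFORE extraction, as the documentation describes:
# the word is dropped from the text, then bigrams are built.
before <- make_bigrams(words[!words %in% remove_terms])

print(after)   # still contains "management learning"
print(before)  # contains no bigram with "learning"
```

Note that in this sketch, removing the word before extraction also produces "management in", a pair of words that were not adjacent in the original text.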

@massimoaria
Owner

Hi,
thanks for your remarks.
The terms are removed after term extraction, and this is intended behaviour. This way, it is possible to create n-grams containing a certain word and then decide to remove only some of them, or to remove only the single word and not the n-grams that use it.
You're right, the documentation is misleading and we will correct it.
Thanks
Massimo
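
Given that removal happens on the extracted list, one possible way to drop every n-gram containing a given word is to filter that list yourself. The following is only a sketch under assumptions: `drop_ngrams_containing` is a hypothetical helper, not a bibliometrix function, and it assumes the n-grams are space-separated strings (upper-cased, as in the `toupper(ngram)` step quoted above).

```r
# Hypothetical helper (not part of bibliometrix): drop every n-gram
# that contains at least one of the unwanted words as a whole token.
drop_ngrams_containing <- function(ngrams, unwanted) {
  unwanted <- toupper(unwanted)
  has_unwanted <- vapply(
    strsplit(toupper(ngrams), " ", fixed = TRUE),
    function(tokens) any(tokens %in% unwanted),
    logical(1)
  )
  ngrams[!has_unwanted]
}

# Made-up example list of extracted bigrams.
ngrams <- c("MANAGEMENT LEARNING", "KNOWLEDGE MANAGEMENT", "MACHINE LEARNING")
kept <- drop_ngrams_containing(ngrams, "learning")
print(kept)
```

Matching on whole tokens rather than substrings means that removing "learning" would not affect a hypothetical term such as "E-LEARNING PLATFORM".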
