Information retrieval (IR) has always been an important and popular topic in data processing, analysis, and data science, and it is a backbone of today's most widely used AI models, large language models (LLMs). This project applies some concepts from textual information retrieval to extract relevant information and return the correct, or at least the most relevant, documents for specific queries, in this case medical queries.
For this project, I used a point-wise approach: a function f(query, document) takes a query and a document and outputs the relevance of the document with respect to the query. It estimates that relevance using term frequency-inverse document frequency (TF-IDF), a measure from IR and machine learning (ML) that quantifies how important a string representation (a word, phrase, lemma, etc.) is to a document within a collection of documents. It is widely used for ranking documents in search engines, identifying important words in documents, and other NLP tasks. There is plenty of documentation that explains TF-IDF in detail, so sadly I won't go into much detail about the math behind it.
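For readers who want a quick feel for TF-IDF anyway, here is a minimal sketch in plain Python. It uses the textbook definition (term frequency times log(N / document frequency)); the function name `tf_idf` and the three toy documents are made up for illustration and are not taken from the notebook. Libraries such as scikit-learn use a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # tf = share of the document taken by the term,
            # idf = log(N / df), which is 0 for terms found in every document.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, invented for this example.
docs = [
    "glucose in blood".split(),
    "glucose in urine".split(),
    "bilirubin in plasma".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "in" gets a weight of 0 because it appears everywhere, while "blood" (found in only one document) outweighs "glucose" (found in two). That is the whole idea: rare terms say more about a document than common ones.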
The dataset was extracted from a medical platform called "LOINC", and I intentionally added some irrelevant documents to each of the three queries for the ML model.
There are three queries: "Glucose in Blood", "Bilirubin in Plasma", and "White Blood Cells Count", each with its own list of documents.
I also added a new column, "presence", which is a loose label for term overlap between the query and a document: if a document has at least one word matching the query, it gets 1, otherwise 0. It also gets 1 if the words are related, for example "Blood" and "Plasma". This helps reduce noise for the ML model.
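The "presence" rule can be sketched roughly as follows. The labels in this project were assigned by hand, so this is only an illustration of the rule, not the actual labeling code. The names `presence`, `RELATED_TERMS`, and `IGNORED_WORDS` are hypothetical, and skipping function words such as "in" is an assumption of this sketch (otherwise almost every document would match).

```python
# Hypothetical pairs of terms treated as related; only the
# "Blood"/"Plasma" pair comes from the text above.
RELATED_TERMS = {frozenset({"blood", "plasma"})}

# Assumption: function words should not count as a match.
IGNORED_WORDS = {"in", "of", "the"}


def presence(query, document, related=RELATED_TERMS):
    """Return 1 if query and document share a word or a related pair, else 0."""
    q_words = set(query.lower().split()) - IGNORED_WORDS
    d_words = set(document.lower().split()) - IGNORED_WORDS
    # Direct match: at least one word in common.
    if q_words & d_words:
        return 1
    # Loose match: any query word related to any document word.
    for q in q_words:
        for d in d_words:
            if frozenset({q, d}) in related:
                return 1
    return 0


print(presence("Glucose in Blood", "Glucose in Serum"))       # shared word "glucose"
print(presence("Bilirubin in Plasma", "Hemoglobin in Blood")) # related pair
print(presence("Glucose in Blood", "Sodium in Urine"))        # no overlap
```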
Now that the dataset is manually labeled as relevant or non-relevant, I prepared the data by:
- Removing punctuation
- Tokenizing
- Removing short words (<3 characters)
- Removing stop words
- Stemming and lemmatizing
- Getting part-of-speech tags
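The first steps of this list can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's code: `STOP_WORDS` is a tiny illustrative list, `crude_stem` only strips a plural "s" where a real stemmer or lemmatizer (for example from NLTK) would do much more, and part-of-speech tagging is left out because it needs such a library.

```python
import string

# Tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"the", "and", "for", "with", "from", "that", "this"}


def crude_stem(token):
    # Crude stand-in for a real stemmer: strip a plural "s" only.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    # 1. Remove punctuation (replaced by spaces so "Mass/volume" splits in two).
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.lower().translate(table)
    # 2. Tokenize on whitespace.
    tokens = cleaned.split()
    # 3. Drop tokens shorter than three characters.
    tokens = [t for t in tokens if len(t) >= 3]
    # 4. Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Stem (crudely).
    return [crude_stem(t) for t in tokens]


print(preprocess("White Blood Cells Count"))
# ['white', 'blood', 'cell', 'count']
print(preprocess("Glucose [Mass/volume] in Blood."))
# ['glucose', 'mass', 'volume', 'blood']
```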
After preprocessing, we can apply TF-IDF. Since the text is now clean, we vectorize it with a TF-IDF vectorizer, which gives each term in each document a score for how important that term is to the document. Finally, I applied both cosine and Jaccard similarity to measure how similar each document is to the query.
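As a rough illustration of the two similarity measures, the sketch below computes cosine similarity on sparse {term: weight} vectors and Jaccard similarity on token sets. The function names and the example weights are made up for this example; the notebook may well rely on library implementations instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(a_tokens, b_tokens):
    """Shared tokens divided by all distinct tokens."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical TF-IDF weights for a query and one document.
query_vec = {"glucose": 0.5, "blood": 0.5}
doc_vec = {"glucose": 0.5, "serum": 0.5}

cos = cosine_similarity(query_vec, doc_vec)                   # about 0.5
jac = jaccard_similarity(query_vec.keys(), doc_vec.keys())    # 1/3
print(cos, jac)
```

Cosine similarity takes the weights into account, while Jaccard only looks at which terms are shared, so the two can disagree; that is one reason to feed both to the model.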
Before feeding the data to the model, we first drop some irrelevant columns that the model would not learn much from. Then I encoded the categorical variables with a label encoder. I chose this encoder because the dataset has a lot of unique values, so mapping each value to a unique number is a practical choice (one-hot encoding would add one column per value). However, this introduces an artificial sense of hierarchy into the dataset, because the numbers imply an order that the categories do not have. (There is no single magic spell that works for every dataset in the world of machine learning, and I think that's beautiful.)
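To make the hierarchy issue concrete, here is a minimal sketch of what label encoding does. The helper `label_encode` and the column values are hypothetical; it assigns codes in sorted order, which is also how scikit-learn's `LabelEncoder` behaves.

```python
def label_encode(values):
    """Map each distinct value to an integer, in sorted order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical values of a categorical column.
systems = ["Bld", "Ser/Plas", "Urine", "Bld"]
codes, mapping = label_encode(systems)
print(codes)    # [0, 1, 2, 0]
print(mapping)  # {'Bld': 0, 'Ser/Plas': 1, 'Urine': 2}
```

The codes are compact, but a linear model will read "Urine" (2) as twice "Ser/Plas" (1), which is the artificial ordering mentioned above.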
Finally, after all the talk, we can feed the dataset to the logistic regression model, which can then predict whether a document is relevant or not, based mainly on the TF-IDF weights and similarities.
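To show what such a model does under the hood, here is a small logistic regression trained with plain gradient descent. Everything in it is an assumption for illustration: the two features (a cosine and a Jaccard score per document), the six toy rows, and the hyperparameters. In practice a library implementation would be the usual choice.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # predicted probability minus truth
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Step against the average gradient.
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(features, weights, bias):
    """Return 1 (relevant) if the predicted probability is at least 0.5."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy rows: [cosine similarity, Jaccard similarity]; 1 = relevant.
X = [[0.9, 0.6], [0.8, 0.5], [0.7, 0.4], [0.1, 0.0], [0.2, 0.1], [0.0, 0.0]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
print(predict([0.9, 0.6], weights, bias), predict([0.0, 0.0], weights, bias))
```

The learned weights are positive for both features, so a document with higher similarity to the query gets a higher probability of being labeled relevant.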
I got good results for the Glucose and White Blood Cells queries but low performance for Bilirubin; my guess is that this is mainly due to data imbalance.
You can see how the whole process works in detail in the notebook attached here. The results are well documented, and I tried to make the code as clear and concise as possible.