Word Growth Rate Analyzer

The Goal of this Project

The objective of this project is to automatically analyse and store long- and short-term trends, and to send notifications about them. This is done by counting the occurrences of any given word and comparing them across a given timeline.

Project Architecture

Technical Risks

Fluctuation between peak hours

Since users are more active at certain times of the day and on certain days of the week, the raw growth rate of each word will fluctuate heavily. This is not a direct problem, since the growth rate of a word stays meaningful relative to other words, but it makes our dataset harder to read.

To resolve this problem, we calculate the growth rate of each word relative to the occurrences of the most common word, which in our case is "the". As an example, to calculate the growth rate of the word "hello", we need the following inputs.

Word	Time Frame	Occurrences
hello	15:00 - 16:00	4000
hello	16:00 - 17:00	6000
the	15:00 - 16:00	120,000
the	16:00 - 17:00	150,000

Given our example input, it may seem as if the word "hello" has a growth rate of 50% between our two time frames (6000 / 4000 = 1.5). However, we first need to take the word "the", our most frequent word, as a baseline. We divide the occurrences of "the" in the later time frame by its occurrences in the earlier time frame: 150,000 / 120,000 = 1.25. This is the growth in user activity between those time frames. Next, we multiply the occurrences of "hello" in the first time frame by this activity growth factor, which gives 4000 * 1.25 = 5000, the count we would expect if "hello" had only grown along with overall activity. We can now calculate the true growth rate by dividing the actual occurrences in the second time frame by this adjusted count: 6000 / 5000 = 1.2. The true growth rate is therefore 20%, not 50%.
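The calculation above can be sketched in a few lines of TypeScript. This is only an illustration of the arithmetic, not the project's actual code: the `FramePair` type and the function names `activityFactor` and `adjustedGrowthRate` are hypothetical and chosen for this example.

```typescript
// Hypothetical shape: one word's counts in two consecutive time frames.
interface FramePair {
  previous: number; // occurrences in the earlier time frame
  current: number;  // occurrences in the later time frame
}

// Growth in overall user activity, measured on the baseline word ("the").
function activityFactor(baseline: FramePair): number {
  if (baseline.previous <= 0) {
    throw new RangeError("baseline needs a positive count in the earlier frame");
  }
  return baseline.current / baseline.previous;
}

// Growth rate of a word after removing the change in overall activity.
// Returns a fraction, so 0.2 means 20%.
function adjustedGrowthRate(word: FramePair, baseline: FramePair): number {
  // Count we would expect if the word had only grown with overall activity.
  const expected = word.previous * activityFactor(baseline);
  if (expected <= 0) {
    throw new RangeError("word needs a positive count in the earlier frame");
  }
  return word.current / expected - 1;
}

// Values from the table above.
const helloCounts: FramePair = { previous: 4000, current: 6000 };
const theCounts: FramePair = { previous: 120000, current: 150000 };

// Roughly 0.2 (20%), up to floating-point rounding.
const helloRate = adjustedGrowthRate(helloCounts, theCounts);
```

The naive rate for the same input, `6000 / 4000 - 1`, is 0.5, which is the 50% figure the baseline adjustment corrects.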

Spam messages

If a user writes the word "foobar" hundreds of times in one comment, all occurrences are added to our database, and the resulting growth rate is enormous. There are two potential solutions to this problem.

We require a minimum threshold. If a word is under this threshold, say 50 occurrences, we don't calculate its growth rate, since the value is not significant enough.
Spam messages usually only occur once. If the occurrences of the word are back to normal after one time frame, we know that it was spam and can ignore it.
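
Both checks could look roughly like the following TypeScript sketch. It rests on assumptions the text above does not settle: the threshold is applied to the count in the earlier time frame, and "back to normal" is taken to mean that the count after the spike is at most a tolerance factor times the count before it. The function names and the `tolerance` default of 1.5 are hypothetical.

```typescript
// Example threshold taken from the text above ("say 50 occurrences").
const MIN_OCCURRENCES = 50;

// Assumption: the threshold applies to the earlier time frame, which is
// the denominator of the growth rate calculation.
function isSignificant(previousCount: number, minOccurrences: number = MIN_OCCURRENCES): boolean {
  return previousCount >= minOccurrences;
}

// Assumption: a spike counts as spam when the count jumps above
// `tolerance * before` for one frame and is back at or below that
// ceiling in the next frame.
function isOneFrameSpike(
  before: number,
  during: number,
  after: number,
  tolerance: number = 1.5,
): boolean {
  const ceiling = before * tolerance;
  return during > ceiling && after <= ceiling;
}
```

With these assumptions, `isOneFrameSpike(100, 900, 110)` is true (the count returns to normal), while `isOneFrameSpike(100, 900, 850)` is false (the word stays popular). The second check can only be evaluated one time frame after the spike, so any notification for that word would have to wait until then.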
