Apache Beam - TF-IDF analysis and keyword extraction

## Story:
- Apache Beam is an open-source unified stream and batch processing model and set of language-specific SDKs (Software Development Kits) for defining and executing data processing workflows.
- Beam supports both batch and streaming data processing.
- TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus).
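
To make the statistic above concrete, here is a minimal sketch that computes TF-IDF scores using only the Python standard library. The regex tokenizer, the smoothed IDF variant `log((1 + N) / (1 + df)) + 1`, and all function names are assumptions chosen for illustration; the task does not prescribe them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(corpus):
    """Count, for each term, how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def tf_idf(doc, df, n_docs):
    """Return {term: score} for one document against the corpus statistics."""
    tokens = tokenize(doc)
    if not tokens:
        return {}
    scores = {}
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)
        # Smoothed IDF (an assumption): stays finite for terms unseen in the corpus.
        idf = math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1
        scores[term] = tf * idf
    return scores


def top_keywords(doc, df, n_docs, k=3):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = tf_idf(doc, df, n_docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus standing in for the provided blog posts.
corpus = [
    "the pipeline reads the data",
    "the pipeline writes the data",
    "beam runs the pipeline",
]
df = document_frequencies(corpus)
keywords = top_keywords("beam streams the data with beam", df, len(corpus), k=2)
```

Terms that are frequent in the post but rare in the corpus rank highest, while a word such as "the", which appears in every corpus document, receives the lowest IDF.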

## Expectation:
- Build a data processing pipeline that uses TF-IDF analysis to extract keywords from a published blog post in real time (streaming).
- At the start, we will provide about 200 blog posts that you can use to build the corpus.
- Write a blog article about this project on any platform you prefer (e.g. Dev.To, Medium, or a personal blog).

## Required components:
- Apache Beam
- SQLite database to persist both the corpus and the blog posts with their corresponding keywords
- n8n instance to send blog post data from the blog post application to the Apache Beam application
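
For the persistence component, the following is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`corpus_terms`, `blogposts`, `keywords`) and the helper functions are hypothetical; the task only asks that the corpus, the blog posts, and their keywords be stored in SQLite.

```python
import sqlite3

# Assumed schema: one table per kind of data the task asks to persist.
SCHEMA = """
CREATE TABLE IF NOT EXISTS corpus_terms (
    term TEXT PRIMARY KEY,
    doc_count INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS blogposts (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS keywords (
    blogpost_id INTEGER NOT NULL REFERENCES blogposts(id),
    term TEXT NOT NULL,
    score REAL NOT NULL,
    PRIMARY KEY (blogpost_id, term)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_post(conn, post_id, title, body, scored_terms):
    """Store a post and its (term, score) pairs in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO blogposts (id, title, body) VALUES (?, ?, ?)",
            (post_id, title, body),
        )
        # Replace any keywords stored by an earlier run for this post.
        conn.execute("DELETE FROM keywords WHERE blogpost_id = ?", (post_id,))
        conn.executemany(
            "INSERT INTO keywords (blogpost_id, term, score) VALUES (?, ?, ?)",
            [(post_id, term, score) for term, score in scored_terms],
        )


def keywords_for(conn, post_id):
    """Return the stored keywords for a post, highest score first."""
    rows = conn.execute(
        "SELECT term FROM keywords WHERE blogpost_id = ? ORDER BY score DESC, term",
        (post_id,),
    )
    return [term for (term,) in rows]


conn = open_db()
save_post(conn, 1, "Hello Beam", "beam streams the data",
          [("streams", 0.40), ("beam", 0.56)])
```

Storing per-term document counts in `corpus_terms` would let the pipeline recompute IDF values as new posts arrive, without rescanning every stored post.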

## Payout
- $1000 in Codesphere credits
- $50 cash

## Deadline
4 weeks after assignment
