Apache Beam is an open-source unified stream and batch processing model and set of language-specific SDKs (Software Development Kits) for defining and executing data processing workflows.
The same pipeline definition can run in either batch or streaming mode.
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus).
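To make the statistic concrete, here is a minimal pure-Python sketch of one common TF-IDF variant (relative term frequency multiplied by a smoothed inverse document frequency). The function names `tokenize`, `tf_idf`, and `top_keywords` are illustrative assumptions, and the whitespace tokenizer is deliberately naive; the project may choose a different weighting or tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf(doc_tokens, corpus_tokens):
    """Score every term of one document against a corpus.

    doc_tokens: list of tokens for the document being scored.
    corpus_tokens: list of token lists, one per corpus document.
    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, so that terms
    missing from the corpus do not cause a division by zero.
    """
    n_docs = len(corpus_tokens)
    # Document frequency: in how many corpus documents each term appears.
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))

    term_counts = Counter(doc_tokens)
    total_terms = len(doc_tokens)
    scores = {}
    for term, count in term_counts.items():
        tf = count / total_terms
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf
    return scores


def top_keywords(doc_tokens, corpus_tokens, k=5):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = tf_idf(doc_tokens, corpus_tokens)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

A term that appears in every corpus document (such as "the") receives the lowest IDF, so rarer terms rank higher as keywords.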
Expectation:
Build a data processing pipeline that uses TF-IDF analysis to extract keywords from a published blog post in real time (streaming).
At the start, we provide about 200 blog posts that you can use to build the corpus.
Write a blog article about this project on any platform you prefer (e.g. Dev.To, Medium, or a personal blog).
Required components:
Apache Beam
SQLite database to persist the corpus, the blog posts, and the corresponding keywords
n8n instance to send blog post data from the blog post application to the Apache Beam application
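As a rough illustration of the persistence component, the sketch below uses Python's standard `sqlite3` module. The table names (`blogposts`, `document_frequency`, `keywords`), their columns, and the helpers `open_db` and `save_post` are assumptions made for this example, not a prescribed schema. The corpus is represented here only by per-term document counts, which is enough to compute IDF.

```python
import sqlite3

# Hypothetical schema: one table per item the task asks to persist.
SCHEMA = """
CREATE TABLE IF NOT EXISTS blogposts (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS document_frequency (
    term TEXT PRIMARY KEY,
    doc_count INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS keywords (
    blogpost_id INTEGER NOT NULL REFERENCES blogposts(id),
    term TEXT NOT NULL,
    score REAL NOT NULL,
    PRIMARY KEY (blogpost_id, term)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_post(conn, title, body, keyword_scores):
    """Store one blog post, update corpus counts, and store its keywords.

    keyword_scores: dict mapping term -> TF-IDF score for this post.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO blogposts (title, body) VALUES (?, ?)", (title, body)
        )
        post_id = cur.lastrowid
        # Each distinct term in the post raises its document count by one.
        for term in set(body.lower().split()):
            conn.execute(
                "INSERT OR IGNORE INTO document_frequency (term, doc_count) "
                "VALUES (?, 0)",
                (term,),
            )
            conn.execute(
                "UPDATE document_frequency SET doc_count = doc_count + 1 "
                "WHERE term = ?",
                (term,),
            )
        conn.executemany(
            "INSERT INTO keywords (blogpost_id, term, score) VALUES (?, ?, ?)",
            [(post_id, term, score) for term, score in keyword_scores.items()],
        )
    return post_id
```

In the full setup, a step like `save_post` would presumably be called from the Apache Beam application for each blog post that n8n delivers; how that hand-off works is left open here.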
Payout
Deadline
4 weeks after assignment