Apache Beam - TF-IDF analysis and keyword extraction #14

@Datata1

Story:

  • Apache Beam is an open-source unified stream and batch processing model and set of language-specific SDKs (Software Development Kits) for defining and executing data processing workflows.
  • A Beam pipeline can run in either batch or streaming mode.
  • TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus).
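
To make the TF-IDF definition above concrete, here is a minimal sketch in plain Python. It assumes one common variant of the statistic (term frequency as count divided by document length, inverse document frequency as the natural log of N / df); the issue does not prescribe a specific weighting, and the function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each term of doc_tokens against corpus (a list of token lists).

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in counts.items():
        if df[term] == 0:
            continue  # term unseen in the corpus; idf is undefined in this variant
        scores[term] = (count / total) * math.log(n_docs / df[term])
    return scores


# Toy corpus, invented for illustration.
corpus = [
    ["beam", "pipeline", "data"],
    ["beam", "streaming", "data"],
    ["sqlite", "database", "data"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus, "data" appears in every document and therefore scores zero, while "pipeline" (unique to the first document) outranks "beam" (shared by two documents).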

Expectation:

  • Build a data processing pipeline that extracts keywords from a published blog post using TF-IDF analysis in real time (streaming).
  • At the start, we provide about 200 blog posts that you can use to build the corpus.
  • Write a blog article about this project on any platform you prefer (e.g. Dev.To, Medium, or a personal blog).
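
One way to think about the streaming requirement above: the corpus statistics are seeded once from the provided posts, and each newly published post first updates them and is then scored. The sketch below shows only that per-post logic in plain Python; the Beam wiring is deliberately omitted, and `CorpusStats`, its methods, the tokenizer, and the sample posts are all hypothetical names and data, not part of the task description.

```python
import math
import re
from collections import Counter


class CorpusStats:
    """Running document-frequency counts for a growing corpus (hypothetical helper)."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    @staticmethod
    def tokenize(text):
        # Naive tokenizer: lowercase alphabetic words only.
        return re.findall(r"[a-z]+", text.lower())

    def add(self, text):
        """Register one document in the corpus."""
        self.n_docs += 1
        self.df.update(set(self.tokenize(text)))

    def keywords(self, text, k=3):
        """Add the new post to the corpus, then return its k best-scoring terms."""
        self.add(text)
        tokens = self.tokenize(text)
        counts = Counter(tokens)
        scored = {
            term: (count / len(tokens)) * math.log(self.n_docs / self.df[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        return sorted(scored, key=lambda t: (-scored[t], t))[:k]


# Seed the corpus (stand-in for the ~200 provided posts).
stats = CorpusStats()
for post in [
    "Deploy apps in the cloud",
    "Scale apps in the cloud",
    "Monitor apps in the cloud",
]:
    stats.add(post)

# A newly published post arrives and is scored against the updated corpus.
top = stats.keywords("Streaming pipelines in the cloud with Beam pipelines", k=2)
```

In a real pipeline, logic like `keywords` would run inside a Beam transform, and the document-frequency counts would need to live in shared state or in the database rather than in a local object; how to do that is a design decision left to the implementer.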

Required components:

  • Apache Beam
  • SQLite database to persist both the corpus and the blog posts with their corresponding keywords
  • n8n instance to send blog post data from the blog post application to the Apache Beam application
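
As a rough illustration of the SQLite component listed above, the following sketch uses Python's standard `sqlite3` module. The schema is an assumption: the table names (`posts`, `doc_freq`, `keywords`), their columns, and the `save_post` helper are invented for this example, and the issue does not specify a layout.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS doc_freq (
    term TEXT PRIMARY KEY,
    df   INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS keywords (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    term    TEXT NOT NULL,
    score   REAL NOT NULL,
    PRIMARY KEY (post_id, term)
);
"""


def save_post(conn, title, body, scored_keywords):
    """Store one post and its (term, score) pairs in a single transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO posts (title, body) VALUES (?, ?)", (title, body)
        )
        post_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO keywords (post_id, term, score) VALUES (?, ?, ?)",
            [(post_id, term, score) for term, score in scored_keywords],
        )
    return post_id


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.executescript(SCHEMA)
post_id = save_post(
    conn, "Hello Beam", "Sample body text", [("beam", 0.35), ("pipeline", 0.17)]
)
rows = conn.execute(
    "SELECT term, score FROM keywords WHERE post_id = ? ORDER BY score DESC",
    (post_id,),
).fetchall()
```

The `doc_freq` table is where per-term document counts for the corpus could be kept, so that corpus statistics survive a restart of the pipeline.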

Payout

  • $1000 in Codesphere credits
  • $50 in cash

Deadline

4 weeks after assignment
