Feature/article checker workers #443

Kseymur · 2024-05-24T20:03:46Z

This pull request resolves #425.
It involves three worker scripts designed to process GitHub webhook events, perform web searches, and interact with the OpenAI API for text analysis and fact-checking.
The scripts run on the Cloudflare Workers platform and fit within the time limits for the Paid plan.
Here is an example.

Improvements that can be made:

Add full article parsing by links in the RAG part.
Add a worker for plagiarism detection.

evgenydmitriev · 2024-05-27T11:15:08Z

What was the logic for splitting the tool in 3 workers connected with queues? Why not make it one worker? Or use HTTP for communication?

Kseymur · 2024-05-27T11:46:10Z

What was the logic for splitting the tool in 3 workers connected with queues? Why not make it one worker? Or use HTTP for communication?

I tried different options: a single-worker system, a chain of workers interacting via HTTP, and here are the conclusions I came to:

This RAG system utilizes multiple requests to the OpenAI API and the search engine API. Because of this, it exceeds the time limits within a single worker (even on the Paid plan).
Regarding HTTP communication, the issue again lies in the time limits. It appears that each worker has to wait for a response from the subsequent one. If the subsequent worker does not meet the time limits, the entire system fails.

evgenydmitriev · 2024-05-27T11:54:48Z

The only time limit you have is CPU time, 30 seconds. Waiting for fetch promises (from other workers or external APIs) doesn't count towards CPU time.

Kseymur · 2024-05-27T12:16:37Z

The only time limit you have is CPU time, 30 seconds. Waiting for fetch promises (from other workers or external APIs) doesn't count towards CPU time.

I thought the same, but based on my experience, I've noticed some practical limitations.
Also, this is what ChatGPT replies to me:
"Based on the information from Cloudflare's documentation and other sources, fetch requests themselves do not count towards the CPU time limit. The CPU time limit for Cloudflare Workers is 30 seconds, and this limit applies to the actual execution time of the code, not the time spent waiting for fetch requests to complete.

However, while the waiting time for fetch requests is not counted towards the CPU time, the total execution time, including the time spent waiting for these responses, can still lead to practical limitations. If a Worker makes multiple fetch requests sequentially, it may experience delays due to the overall duration required to handle the requests. This can cause the Worker to timeout if the responses are slow or if there are many sequential fetch operations, potentially leading to the Worker exceeding the overall allowed time for request handling."

I can also make a pr to show you the single-worker code that doesn't complete execution. Maybe I missed smth.

evgenydmitriev · 2024-05-27T12:52:09Z

We have workers that stay alive for more than a day waiting for different things. If you handle fetch promises correctly, your worker won't time out. Sequential fetch requests sound like a use case for one worker. You usually want to split things into multiple workers if you need extra isolation, parallelization, or caching. Queues introduce another level of complexity and should probably be avoided where possible.

Try asking Perplexity about all this - it's usually more helpful when you are not sure how to correctly formulate your questions. Claude Opus is also more helpful when advising on Cloudflare due to a more recent cutoff date.

NyanKir · 2024-05-30T21:10:59Z

@Kseymur Hi, there's an issue with the current implementation due to GitHub's 10-second timeout for webhooks. If we can't process the request within that timeframe, GitHub cancels it, which in turn crashes the worker. This is why a queue is needed here. With a queue, we can pass the task to a second worker and send a successful response to GitHub before the timeout occurs.

Kseymur · 2024-05-30T21:40:47Z

Hi, there's an issue with the current implementation due to GitHub's 10-second timeout for webhooks

Hi! Yes, I have already figured out the issue and I am currently considering the best implementation. Thank you for your help!

Kseymur added 12 commits May 23, 2024 13:49

add worker-article-reciever

a1a7a4c

add worker-article-rag

f1e1b32

add worker-article-answer

0fd4930

add test for worker-article-reciever

0671567

add test for worker-article-rag

fb1047c

add test for worker-article-answer

2a59026

fix: update dependencies worker-article-answer

1dcc8e7

fix: update dependencies worker-article-rag

12b32e2

fix: update dependencies worker-article-receiver

49dfb1a

fix: correct package.json worker-article-reciever

5542d95

fix: improve extractAdditionsFromDiff in worker-article-reciever

b2ef80e

add checkerPrompts

12efd3e

Kseymur requested a review from evgenydmitriev May 24, 2024 20:04

Kseymur closed this May 31, 2024

Kseymur deleted the feature/article-checker-workers branch May 31, 2024 19:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/article checker workers #443

Feature/article checker workers #443

Kseymur commented May 24, 2024

evgenydmitriev commented May 27, 2024

Kseymur commented May 27, 2024

evgenydmitriev commented May 27, 2024

Kseymur commented May 27, 2024

evgenydmitriev commented May 27, 2024

NyanKir commented May 30, 2024

Kseymur commented May 30, 2024

Feature/article checker workers #443

Feature/article checker workers #443

Conversation

Kseymur commented May 24, 2024

evgenydmitriev commented May 27, 2024

Kseymur commented May 27, 2024

evgenydmitriev commented May 27, 2024

Kseymur commented May 27, 2024

evgenydmitriev commented May 27, 2024

NyanKir commented May 30, 2024

Kseymur commented May 30, 2024