Make a script to get the docs we already know are >= 1000 pages long #4839
@ERosendo, @albertisfu and I talked extensively about this yesterday, and we finally decided we should probably divide the script into two stages: a first stage where we fetch all relevant docs from PACER, and a second stage where we process them. This helps with managing the last rounds of the round-robin process, because we checked, and the court with the highest number of big docs has over 270 more big docs than the next court, so that one court will be the only one being queried in the last ~270 rounds.

**First stage**

So far we were fetching and processing each document in a similar way to
**Second stage**

After we've fetched all relevant docs, we still need to process them. This could take a while, but there's no issue with adding as many workers as possible, since we're not interacting with PACER anymore. This means we just have to identify the docs that were successfully fetched from the list in cache, then add the tasks to process those docs and mark their status.

@mlissner what do you think?
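A minimal sketch of how that second stage could look, assuming stage one leaves a per-doc result list in cache. The task name, cache shape, and status value below are all hypothetical stand-ins for whatever the real script uses:

```python
from celery import Celery, group

app = Celery("big_docs")  # placeholder app; the real project has its own

@app.task
def process_big_doc(fq_id: int) -> None:
    # Placeholder: in the real script this would process the already
    # fetched document and mark its fetch queue entry as processed.
    ...

def run_second_stage(cached_results: list[dict]) -> None:
    # `cached_results` is the list stage one wrote to cache, assumed to
    # look like [{"fq_id": 123, "status": "success"}, ...].
    successful = [r for r in cached_results if r["status"] == "success"]
    # No PACER interaction here, so fan out to as many workers as the
    # Celery queue has available.
    group(process_big_doc.s(r["fq_id"]) for r in successful).apply_async()
```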
So the idea is, generally, not to create more PACERFetchQueue requests until the one before is complete for each court, right? Seems like a sensible way to go about throttling. Two thoughts so far. First, I think you forgot to explain one of the branches of step 5 (see below). My understanding is:
Second, rather than doing the
Right, and to make extra sure (because sometimes the fetch task can be completed pretty quickly), we check that it wasn't completed right before.
Yeah, I wasn't very clear, and the idea isn't quite what you described. We have two nested for-loops: one over rounds, and one over the courts in each round. So we check at the beginning of each iteration of the outer loop. It's very roughly something like:
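A rough sketch of that loop structure, in the spirit of the description above; the helper names and the `rounds` shape here are hypothetical stand-ins for the real PACERFetchQueue plumbing:

```python
import time

POLL_INTERVAL = 2  # seconds between completion checks; assumed value

def run_first_stage(rounds, enqueue_fetch, fetch_is_complete):
    # `rounds` is a list of rounds, each a list of (court_id, doc) pairs;
    # `enqueue_fetch` creates a fetch request and returns a handle, and
    # `fetch_is_complete` checks that handle.
    last_fetch = {}  # court_id -> handle of that court's latest fetch
    for round_items in rounds:
        first_court = round_items[0][0]
        # Check at the top of the outer loop: don't start a new round
        # until the previous fetch for its first court has completed.
        while first_court in last_fetch and not fetch_is_complete(
            last_fetch[first_court]
        ):
            time.sleep(POLL_INTERVAL)
        for court_id, doc in round_items:
            last_fetch[court_id] = enqueue_fetch(court_id, doc)
```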
But now I'm thinking: we're only checking the first court in the new round, which is fine for handling the last ~270 rounds of the round-robin in which there's a single court left (this was our main concern); but we also have ~20 rounds with only 2 courts, and ~80 with only 3, so maybe we need to check those too? Could 3-court rounds complete quickly enough that the same court is hit too often? How about those with 2? @ERosendo and @albertisfu I would also like to know what you think about this 👀
I think this depends on the number of workers in the Celery queue. If there are more workers than remaining courts, it's possible for the workers to pick up all the tasks in the queue simultaneously and complete them at almost the same time. If you then send more tasks while workers are available, they will be processed immediately.
Would a simple solution here be to enqueue one fetch per court, and to only do the next one when the one before has completed (and 2s have elapsed)? You can just loop through the courts, make a fetch for each one, then on the next pass, check each court and either make its next fetch (if the last one finished) or move on to the next court in the loop.
Perhaps I'm under-thinking it, but you don't even need celery for this because it'll process the fetches in the background.
So what you're saying is we shouldn't use

I had a feeling we were overthinking it a bit 😅
Sort of. I was thinking that if you're creating fetch queue objects, those will do celery work behind the scenes, so you can just rely on your loop to check that those are complete before doing another.
Oops, hit submit too soon. But what you want to make sure you do is handle all the courts at a time, not one at a time, which I think should be possible with the above approach. You just enqueue one fetch per court in your loop, then wait until each is finished.
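For what it's worth, a minimal sketch of that shape: one fetch per court per pass, with all courts in flight at once. Here `create_fetch` and `is_complete` are hypothetical stand-ins for the real fetch queue objects:

```python
import time

COURTESY_DELAY = 2  # seconds to wait between passes; assumed value

def fetch_all_rounds(court_queues, create_fetch, is_complete):
    # `court_queues` maps court_id -> list of docs still to fetch; the
    # two callables create a fetch queue object and check whether one
    # has completed.
    while any(court_queues.values()):
        # Enqueue one fetch for every court that still has docs left.
        in_flight = {
            court_id: create_fetch(court_id, docs.pop(0))
            for court_id, docs in court_queues.items()
            if docs
        }
        # Wait until every fetch in this pass has finished...
        while not all(is_complete(f) for f in in_flight.values()):
            time.sleep(1)
        # ...then give PACER a moment before starting the next pass.
        time.sleep(COURTESY_DELAY)
```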