Make a script to get the docs we already know are >= 1000 pages long #4839
@ERosendo, @albertisfu and I talked extensively about this yesterday, and we finally decided we should probably divide the script into two stages: a first stage where we fetch all relevant docs from PACER, and a second stage where we process them. This helps with managing the last rounds of the round-robin process, because we checked, and the court with the highest number of big docs has over 270 more big docs than the next court, so that one court will be the only one being queried in the last ~270 rounds.

**First stage**

So far we were fetching and processing each document in a similar way to
**Second stage**

After we've fetched all relevant docs, we still need to process them. This could take a while, but there's no issue with adding as many workers as possible, since we're not interacting with PACER anymore. This means we just have to identify the docs that were successfully fetched from the list in cache, then add the tasks to process those docs and mark their status.

@mlissner what do you think?
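A minimal sketch of how that second stage could look, assuming stage one leaves a per-doc result list in cache. The task name, cache shape, and status value below are all hypothetical stand-ins for whatever the real script uses:

```python
from celery import Celery, group

app = Celery("big_docs")  # placeholder app; the real project has its own

@app.task
def process_big_doc(fq_id: int) -> None:
    # Placeholder: in the real script this would process the already
    # fetched document and mark its fetch queue entry as processed.
    ...

def run_second_stage(cached_results: list[dict]) -> None:
    # `cached_results` is the list stage one wrote to cache, assumed to
    # look like [{"fq_id": 123, "status": "success"}, ...].
    successful = [r for r in cached_results if r["status"] == "success"]
    # No PACER interaction here, so fan out to as many workers as the
    # Celery queue has available.
    group(process_big_doc.s(r["fq_id"]) for r in successful).apply_async()
```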
So the idea is, generally, not to create more PACERFetchQueue requests until the one before is complete for each court, right? Seems like a sensible way to go about throttling. Two thoughts so far. First, I think you forgot to explain one of the branches of step 5 (see below). My understanding is:
Second, rather than doing the
Right, and to make extra sure (because sometimes the fetch task can be completed pretty quickly), we check that it wasn't completed right before.
Yeah, I wasn't very clear, and the idea isn't quite what you described. We have two nested for-loops: one over rounds, and one over the courts in each round. So we check at the beginning of each iteration of the outer loop. It's very roughly something like:
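A rough sketch of that loop structure, in the spirit of the description above; the helper names and the `rounds` shape here are hypothetical stand-ins for the real PACERFetchQueue plumbing:

```python
import time

POLL_INTERVAL = 2  # seconds between completion checks; assumed value

def run_first_stage(rounds, enqueue_fetch, fetch_is_complete):
    # `rounds` is a list of rounds, each a list of (court_id, doc) pairs;
    # `enqueue_fetch` creates a fetch request and returns a handle, and
    # `fetch_is_complete` checks that handle.
    last_fetch = {}  # court_id -> handle of that court's latest fetch
    for round_items in rounds:
        first_court = round_items[0][0]
        # Check at the top of the outer loop: don't start a new round
        # until the previous fetch for its first court has completed.
        while first_court in last_fetch and not fetch_is_complete(
            last_fetch[first_court]
        ):
            time.sleep(POLL_INTERVAL)
        for court_id, doc in round_items:
            last_fetch[court_id] = enqueue_fetch(court_id, doc)
```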
But now I'm thinking: we're only checking the first court in the new round, which is fine for handling the last ~270 rounds of the round-robin in which there's a single court left (this was our main concern); but we also have ~20 rounds with only 2 courts, and ~80 with only 3, so maybe we need to check those too? Could 3-court rounds complete quickly enough that the same court is hit too often? How about those with 2? @ERosendo and @albertisfu I would also like to know what you think about this 👀
I think this depends on the number of workers in the Celery queue. If there are more workers than remaining courts, it's possible for the workers to pick up all the tasks in the queue simultaneously and complete them at almost the same time. If you then send more tasks while workers are available, they will be processed immediately.
Would a simple solution here be to enqueue one fetch per court, and to only do the next one when the one before has completed (and 2s have elapsed)? You can just loop through the courts, make a fetch for each one, then on the next pass, check each court and either make its next fetch (if the last one finished) or move on to the next court in the loop.
Perhaps I'm under-thinking it, but you don't even need celery for this because it'll process the fetches in the background.
So what you're saying is we shouldn't use

I had a feeling we were overthinking it a bit 😅
Sort of. I was thinking that if you're creating fetch queue objects, those will do celery work behind the scenes, so you can just rely on your loop to check that those are complete before doing another.
Oops, hit submit too soon. But what you want to make sure you do is handle all the courts at a time, not one at a time, which I think should be possible with the above approach. You just enqueue one fetch per court in your loop, then wait until each is finished.
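For what it's worth, a minimal sketch of that shape: one fetch per court per pass, with all courts in flight at once. Here `create_fetch` and `is_complete` are hypothetical stand-ins for the real fetch queue objects:

```python
import time

COURTESY_DELAY = 2  # seconds to wait between passes; assumed value

def fetch_all_rounds(court_queues, create_fetch, is_complete):
    # `court_queues` maps court_id -> list of docs still to fetch; the
    # two callables create a fetch queue object and check whether one
    # has completed.
    while any(court_queues.values()):
        # Enqueue one fetch for every court that still has docs left.
        in_flight = {
            court_id: create_fetch(court_id, docs.pop(0))
            for court_id, docs in court_queues.items()
            if docs
        }
        # Wait until every fetch in this pass has finished...
        while not all(is_complete(f) for f in in_flight.values()):
            time.sleep(1)
        # ...then give PACER a moment before starting the next pass.
        time.sleep(COURTESY_DELAY)
```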