
Ifeval: Download punkt_tab on rank 0 #2267

Merged
merged 4 commits into from
Nov 9, 2024
Conversation

baberabb
Contributor

@baberabb baberabb commented Aug 30, 2024

closes #2266. Also removed the pkg_resources dependency, as that's deprecated.

closes #2471

@al093

al093 commented Aug 30, 2024

Sharing a slightly more verbose version.
I would do this:

import logging
import os

import nltk
import torch

logger = logging.getLogger(__name__)


def download_nltk_resources_guarded() -> None:
    """Download the 'punkt_tab' tokenizer on local rank 0 only.

    Guarding the download with a distributed barrier avoids a race
    condition when multiple processes try to download the same resource.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    if local_rank == 0:
        try:
            nltk.data.find("tokenizers/punkt_tab")
        except LookupError:
            logger.info(f"Local rank {local_rank}: Downloading NLTK 'punkt_tab' resource.")
            nltk.download("punkt_tab")
            logger.info(f"Local rank {local_rank}: Downloaded NLTK 'punkt_tab' resource.")

    if torch.distributed.is_initialized():
        torch.distributed.barrier()

    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        logger.error(
            f"Local rank {local_rank}: NLTK 'punkt_tab' resource not found. "
            f"This should have been downloaded by local rank 0."
        )
        raise

@al093

al093 commented Aug 30, 2024

suggestion: I would rethink the current download-on-import behaviour.

@baberabb
Contributor Author

@al093 Thanks very much for your suggestion. I'm hesitant to add a torch dependency here, as none of the tasks currently require it. I removed time.sleep() as you suggested, and we do have a barrier later in the evaluation loop. To add an extra layer we could add a condition to have the user manually download and cache the data: please run python -c "import nltk; nltk.download('punkt')". Thoughts @haileyschoelkopf ?

@al093
Copy link

al093 commented Sep 23, 2024

To add an extra layer we could add a condition to have the user manually download and cache the data: please run python -c "import nltk; nltk.download('punkt')"

Works for me. This is actually what I finally did to make it work in my case as well.

We cannot synchronise multiple workers/processes if we don't know which framework was used to launch them. So I think it has to be handled by user code or manual intervention, as you suggest.
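The "fail with instructions" fallback discussed above could look like the sketch below: if the resource check fails, stop with an actionable message instead of downloading concurrently. This is illustrative only; `check` stands in for nltk.data.find, and the function name is hypothetical.

```python
def require_resource(name: str, check) -> None:
    """Raise a helpful error if `name` is missing, instead of downloading it."""
    try:
        check(name)
    except LookupError:
        raise RuntimeError(
            f"NLTK resource {name!r} not found. Download it once before "
            f"launching multiple processes:\n"
            f"  python -c \"import nltk; nltk.download('punkt_tab')\""
        ) from None
```

Raising at startup keeps the library free of launcher-specific synchronisation while still giving the user a one-line fix.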

@baberabb baberabb merged commit bd80a6c into main Nov 9, 2024
8 checks passed
@baberabb baberabb deleted the ifeval_rank branch November 9, 2024 12:23