
Optimize tokens throws seg fault #454

Open
tclements-usgs opened this issue Jan 22, 2025 · 4 comments
Labels: bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@tclements-usgs

🐛 Bug

Tokenizing a dataset for LLM pre-training using optimize with more than one worker leads to a segmentation fault when the dataset is later loaded with a StreamingDataLoader and batch_size > 1. This may be a follow-on to #366, in that the StreamingDataLoader errors are linked to the inputs to optimize.

To Reproduce

Steps to reproduce the behavior:

  1. Run optimize with item_loader=TokensLoader() and num_workers>=2
  2. Use batch_size>1 with StreamingDataLoader
  3. StreamingDataLoader throws a segmentation fault

Interestingly, streaming tokens works if:

  • num_workers>1 in optimize and batch_size==1 when loading.
  • num_workers<2 in optimize and batch_size>=1 when loading.

MWE that reproduces the error below:

Code sample
import os
import tempfile

import torch
from tqdm import tqdm

from litdata import optimize, TokensLoader, StreamingDataset, StreamingDataLoader


def tokenize_fn(idx):
    # Stand-in for a real tokenizer: yield one block of random token ids.
    yield torch.randint(low=0, high=127, size=(8192,))


def main(output_dir, num_workers=0, batch_size=1):
    # Write the optimized dataset with the given number of workers.
    optimize(
        fn=tokenize_fn,
        inputs=list(range(1000)),
        output_dir=output_dir,
        chunk_size=(2049 * 8012),
        item_loader=TokensLoader(),
        num_workers=num_workers,
    )
    print(os.listdir(output_dir))

    # Stream the dataset back with the given batch size.
    dataset = StreamingDataset(
        input_dir=output_dir,
        item_loader=TokensLoader(block_size=2049),
        shuffle=True,
        drop_last=True,
    )
    dataloader = StreamingDataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0, drop_last=True)

    # load data
    for data in tqdm(dataloader):
        pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=0, batch_size=1)  # works

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=2, batch_size=1)  # works

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=0, batch_size=8)  # works

    with tempfile.TemporaryDirectory() as output_dir:
        main(output_dir, num_workers=2, batch_size=8)  # creates a seg fault

Here's the error I get after the segmentation fault:

UserWarning: resource_tracker: There appear to be 19 leaked semaphore objects to clean up at shutdown

which seems to be related to multiprocessing errors.
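
As a general debugging aid (not something from this thread), Python's built-in faulthandler module can print the Python-level traceback at the moment the interpreter receives the segfault signal, which helps narrow down the crashing frame:

import faulthandler

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL so the
# interpreter dumps the traceback of every thread if the process crashes.
faulthandler.enable()

Equivalently, run the reproduction script with python -X faulthandler <script>.py.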

Expected behavior

Multi-worker dataset creation should lead to smooth dataset streaming with all batch sizes.

Additional context

  • LitData Version: 0.2.36
  • PyTorch Version: 2.5.1
  • OS: reproduced on both macOS Sonoma and Ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.12
tclements-usgs added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Jan 22, 2025

Hi! Thanks for your contribution, great first issue!

@tchaton
Collaborator

tchaton commented Jan 22, 2025

Hey @tclements-usgs, if you want to try to debug it, it should be around here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/item_loader.py#L368. In theory, we should close the memmap; there seemed to be some more issues doing so, but that would be the right way to fix it.
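
For reference, a minimal sketch of what closing the memmap could look like; the class shape, the _mmaps cache, and the method names are illustrative assumptions, not litdata's actual internals:

import numpy as np

class TokensLoaderSketch:
    # Hypothetical stand-in for litdata's TokensLoader; the real class differs.

    def __init__(self):
        # Assumed cache of open chunk memmaps, keyed by chunk index.
        self._mmaps = {}

    def _load_chunk(self, chunk_index, chunk_path):
        # Open the chunk file as a read-only memory map of token ids.
        if chunk_index not in self._mmaps:
            self._mmaps[chunk_index] = np.memmap(chunk_path, mode="r", dtype=np.int64)
        return self._mmaps[chunk_index]

    def _close_chunk(self, chunk_index):
        # Drop our reference and close the underlying mapping; np.memmap keeps
        # the raw mmap in the private ._mmap attribute, so closing it here
        # releases the mapping deterministically instead of waiting for GC.
        memmap = self._mmaps.pop(chunk_index, None)
        if memmap is not None:
            memmap._mmap.close()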

@tclements-usgs
Author

Great, thanks - I'll have a look!

@tchaton
Collaborator

tchaton commented Jan 22, 2025

Feel free to make a PR if you fix it ;)
