figure out a way to remove delimiters from dask.bag #29

Open
wikfeldt opened this issue Apr 13, 2022 · 2 comments

@wikfeldt
Contributor

This pipeline:

sorted_counts = text.filter(lambda word: word not in DELIMITERS).str.lower().str.strip().str.split().flatten().frequencies().topk(10,key=1).compute()

does not do the same thing as the filter function.
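A plain-Python sketch of why the `filter` step above likely falls short: `db.read_text` yields one element per line, so the lambda named `word` actually receives whole lines, and the words produced later by `split()` are never checked. The `DELIMITERS` list and the sample lines below are hypothetical stand-ins, not the lesson's real data, and the list comprehensions only mimic what the bag methods do.

```python
# Hypothetical stand-in for the lesson's DELIMITERS; the real list may differ.
DELIMITERS = [".", ",", ";", ":", "?", "!", "(", ")", '"', "'"]

# Made-up sample input. Lines read from a file normally keep their newline.
lines = ["Hello, world!\n", ".\n", ".", "(see above)\n"]

# Equivalent of .filter(lambda word: word not in DELIMITERS):
# a line is dropped only if the *entire* line equals one delimiter.
kept = [line for line in lines if line not in DELIMITERS]

# Equivalent of .str.lower().str.strip().str.split().flatten()
words = [w for line in kept for w in line.lower().strip().split()]

print(words)
```

Only the bare `"."` element is dropped; punctuation attached to words, and even the `".\n"` line, passes straight through to the word counts.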

@ashwinvis self-assigned this Jan 21, 2025

@ashwinvis
Contributor

This was modified in the python-perf lesson:

https://enccs.github.io/python-perf/parallelize/#dask-bag

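import dask.bag as db

# `filename` and `DELIMITERS` are assumed to be defined earlier in the lesson.
text = db.read_text(filename, blocksize='1MiB')
filtered = (
    text
    .filter(lambda word: word not in DELIMITERS)
    .str.lower()
    .str.strip()
    .str.split()
    .flatten()
)
# Count the words with a Dask DataFrame and keep the ten most frequent.
ddf = filtered.to_dataframe(columns=['words'])
ddf['words'].value_counts().compute()[:10]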

I need to update it here.

@ashwinvis
Contributor

I remember now and understand what you meant. We need to use str.replace with a regex, or some kind of map + filter, so that the delimiters are actually removed from the words.
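One possible shape for that fix, as a sketch only and not necessarily what the lesson ended up using: as far as I know, the bag `.str` accessor forwards to plain `str` methods, so a regex substitution would go through `map` with `re.sub` rather than `.str.replace`. The `DELIMITERS` list and the helper names `clean_words` and `top_words` are hypothetical.

```python
import re

# Hypothetical stand-in for the lesson's DELIMITERS; the real list may differ.
DELIMITERS = [".", ",", ";", ":", "?", "!", "(", ")", '"', "'"]

# Character class that matches any single delimiter character.
DELIMITER_RE = re.compile("[" + re.escape("".join(DELIMITERS)) + "]")


def clean_words(line):
    """Lower-case one line, turn delimiters into spaces, split into words."""
    return DELIMITER_RE.sub(" ", line.lower()).split()


def top_words(filename, k=10):
    """Return the k most frequent words in a text file using dask.bag."""
    # Imported here so clean_words can be used and tested without dask.
    import dask.bag as db

    return (
        db.read_text(filename, blocksize="1MiB")
        .map(clean_words)      # one list of cleaned words per line
        .flatten()             # bag of individual words
        .frequencies()         # (word, count) pairs
        .topk(k, key=1)        # rank by count
        .compute()
    )
```

Replacing delimiters with a space means `don't` becomes two tokens; substituting an empty string instead would keep it as `dont`. A `map` + `filter` variant would instead strip delimiters from each word after `flatten()` and then filter out any empty strings.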
