Working with big SPSS files #79
Thanks for the report. Would you be able to produce some Python code that, using pyreadstat.write_sav, generates a large sample file that raises the error on your end? That would let us reproduce the issue without you having to transfer the file itself (just the code that produces it).
Thanks for the reply. I will try playing with write_sav and will see if I can produce such a file.
Here is an example of code that generates a file of about 84.6 MB that cannot be read back due to the same error.
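(The original snippet was not preserved in this thread; the following is a minimal sketch of how such a file can be generated with pyreadstat.write_sav. The column and row counts are illustrative assumptions, not the commenter's actual values.)

```python
import numpy as np
import pandas as pd
import pyreadstat

# Many columns with few rows: the data portion stays small, but the
# per-variable metadata block becomes very large.
n_cols = 200_000   # illustrative; the reporter's real file has ~1.3 million
n_rows = 10

df = pd.DataFrame(
    np.zeros((n_rows, n_cols)),
    columns=[f"var{i}" for i in range(n_cols)],
)
pyreadstat.write_sav(df, "wide_sample.sav")

# Reading it back (even metadata only) then fails with the same error:
# df2, meta = pyreadstat.read_sav("wide_sample.sav", metadataonly=True)
```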
Hi, ReadStat restricts individual memory allocations to 16 MB; this is to prevent denial-of-service scenarios with malformed data. With 1.3 million variables in your file, you are likely hitting that limit with the column metadata. Some options are: 1) increasing the limit, 2) adding an option to specify the limit, and 3) removing the limit altogether.
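(As a rough back-of-the-envelope illustration of why the column metadata alone can exceed the cap; the per-variable size below is a made-up assumption, not a value from ReadStat's source:)

```python
n_vars = 1_300_000                # variables in the reported file
bytes_per_var = 32                # hypothetical per-variable metadata size
limit = 16 * 1024 * 1024          # ReadStat's 16 MB allocation cap

total = n_vars * bytes_per_var    # 41,600,000 bytes, about 39.7 MiB
print(total > limit)              # True: one contiguous allocation exceeds the cap
```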
The second option would be the best one for me. Or something like:
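(The suggested snippet was not captured in this thread; a hypothetical shape for such an option, with made-up parameter names, might be:)

```python
import pyreadstat

# Hypothetical API sketch -- neither keyword exists in pyreadstat today:
# df, meta = pyreadstat.read_sav("big_file.sav", max_alloc_mb=256)
# df, meta = pyreadstat.read_sav("big_file.sav", disable_alloc_limit=True)
```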
That's a good suggestion. Given the experience with pyreadr, I think 1 is not good, because there will always be somebody with a larger file that will hit the new limit. I personally think removing it would be better, as was done in pyreadr. That would be less confusing for users, as they don't need to be aware of the extra flag to deactivate the limit.
You are right about the bigger files. Now that I think more about it, removing the limit also seems like a good solution :)
Hi @evanmiller, is this something coming in ReadStat version 1.1.5, or not yet? (Just for clarity.)
@ofajardo No solution yet |
@evanmiller Ok thanks! |
First, I want to say this library is great!
We have some raw SPSS files that are extremely large (about 6 GB, with 1.3 million variables). SPSS itself can work with those files; pyreadstat, however, cannot handle them, even with the option of reading only the metadata, and while there is still plenty of RAM left on the system (Python uses about 1.5 GB and the machine has 64 GB). The stack trace is as follows:
This happens on both Windows 10 (64-bit) and Linux (64-bit), with Python 3.8 (64-bit) and pyreadstat 1.0.2.
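(For reference, this is how a metadata-only read is invoked in pyreadstat; the file name below is a placeholder, not the reporter's actual path:)

```python
import pyreadstat

# metadataonly=True skips the data rows and reads only variable metadata;
# with ~1.3 million variables, this is where the allocation error occurs.
df, meta = pyreadstat.read_sav("big_file.sav", metadataonly=True)
print(meta.number_columns)
```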
Now I understand that SPSS is probably not the best file format for this data, but unfortunately, that is what we have.