Working with big SPSS files #79
Thanks for the report. Would you be able to produce some Python code that, using pyreadstat.write_sav, generates a large sample file that raises the error on your end? That would let us reproduce the issue without you having to transfer the file itself (just the code that produces it).
Thanks for the reply. I will try playing with write_sav and will see if I can produce such a file.
Here is an example of code that generates a file of about 84.6 MB that cannot be read back due to the same error.
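(The original snippet was not preserved in this thread; the following is a minimal sketch of how such a file can be generated with pyreadstat.write_sav. The column and row counts are illustrative assumptions, not the commenter's actual values.)

```python
import numpy as np
import pandas as pd
import pyreadstat

# Many columns with few rows: the data portion stays small, but the
# per-variable metadata block becomes very large.
n_cols = 200_000   # illustrative; the reporter's real file has ~1.3 million
n_rows = 10

df = pd.DataFrame(
    np.zeros((n_rows, n_cols)),
    columns=[f"var{i}" for i in range(n_cols)],
)
pyreadstat.write_sav(df, "wide_sample.sav")

# Reading it back (even metadata only) then fails with the same error:
# df2, meta = pyreadstat.read_sav("wide_sample.sav", metadataonly=True)
```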
Hi, ReadStat restricts individual memory allocations to 16 MB; this is to prevent denial-of-service scenarios with malformed data. With 1.3 million variables in your file, you are likely hitting that limit with the column metadata. Some options are: 1) increasing the limit, 2) adding an option to specify the limit, and 3) removing the limit altogether.
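(As a rough back-of-the-envelope illustration of why the column metadata alone can exceed the cap; the per-variable size below is a made-up assumption, not a value from ReadStat's source:)

```python
n_vars = 1_300_000                # variables in the reported file
bytes_per_var = 32                # hypothetical per-variable metadata size
limit = 16 * 1024 * 1024          # ReadStat's 16 MB allocation cap

total = n_vars * bytes_per_var    # 41,600,000 bytes, about 39.7 MiB
print(total > limit)              # True: one contiguous allocation exceeds the cap
```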
The second option would be the best one for me. Or something like:
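(The suggested snippet was not captured in this thread; a hypothetical shape for such an option, with made-up parameter names, might be:)

```python
import pyreadstat

# Hypothetical API sketch -- neither keyword exists in pyreadstat today:
# df, meta = pyreadstat.read_sav("big_file.sav", max_alloc_mb=256)
# df, meta = pyreadstat.read_sav("big_file.sav", disable_alloc_limit=True)
```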
That's a good suggestion. Given the experience with pyreadr, I think 1 is not good, because there will always be somebody with a larger file that will hit the new limit. I personally think removing it would be better, as was done in pyreadr. That would be less confusing for users, as they don't need to be aware of the extra flag to deactivate the limit.
You are right about the bigger files. Now that I think more about it, removing the limit also seems like a good solution :)
Hi @evanmiller, is this something coming in ReadStat version 1.1.5, or not yet? (Just for clarity.)
@ofajardo No solution yet |
@evanmiller Ok thanks! |
First, I want to say this library is great!
We have some raw SPSS files that are extremely large (about 6 GB, with 1.3 million variables). SPSS itself can work with those files; pyreadstat, however, cannot handle them, even with the option of reading only the metadata, and while there is still plenty of RAM left on the system (Python uses about 1.5 GB and the machine has 64 GB). The stack trace is as follows:
This happens on both Windows 10 (64-bit) and Linux (64-bit), with Python 3.8 (64-bit) and pyreadstat 1.0.2.
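(For reference, this is how a metadata-only read is invoked in pyreadstat; the file name below is a placeholder, not the reporter's actual path:)

```python
import pyreadstat

# metadataonly=True skips the data rows and reads only variable metadata;
# with ~1.3 million variables, this is where the allocation error occurs.
df, meta = pyreadstat.read_sav("big_file.sav", metadataonly=True)
print(meta.number_columns)
```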
Now I understand that SPSS is probably not the best file format for this data, but unfortunately, that is what we have.