pyreadstat not honoring format for strings longer than 255 bytes #267
Comments
Thanks for the report. The issue can also be seen if you read the resulting file with pyreadstat and look at meta.original_variable_types, which means the format is not being set correctly; there is probably a limit set at 255. It sounds like an issue in the underlying C library, ReadStat. We would need to file an issue over there and wait for it to be solved.
@ofajardo And then using it here:
I am afraid you are mixing things up. What you are looking at is the length of the string itself, which has to be calculated in order for the C library to be able to write it. I am sure that is working fine, because if you set the wrong length, the C library just crashes. The format is a different thing: it only controls the display in SPSS. For example, you could have a very long string but decide to display only the first 100 characters, or, as in your case, have a short string but force it to display 500 characters. That is the one that is not working for you. That one is set here. It just passes on whatever parameter you give as the format, with no calculation involved, and the C library is ignoring it. Theoretically you could also try setting variable_display_width for your variable, and it should have the effect you want, but in my hands that did not work either (you can try as well!)
Question: I guess that in SPSS you can force the display width of the short string to be 500, right?
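To make the distinction above concrete, here is a minimal sketch of passing both the format and the display width to `pyreadstat.write_sav` and reading them back. It is an illustration, not the reporter's original code: the helpers `width_kwargs` and `write_and_read_back` are hypothetical names, and per this thread neither parameter is expected to take effect for widths above 255.

```python
def width_kwargs(column, width):
    """Build write_sav keyword arguments for one string column (hypothetical helper)."""
    return {
        # Format string such as 'A500'; passed through unchanged
        "variable_format": {column: f"A{width}"},
        # Display width in SPSS, independent of the stored string length
        "variable_display_width": {column: width},
    }


def write_and_read_back(df, path, column, width):
    """Write df to an SPSS file, then return the metadata read back for column."""
    import pyreadstat  # third-party; imported lazily so the helper above stands alone

    pyreadstat.write_sav(df, path, **width_kwargs(column, width))
    _, meta = pyreadstat.read_sav(path)
    return {
        "format": meta.original_variable_types.get(column),
        "display_width": meta.variable_display_width.get(column),
        "storage_width": meta.variable_storage_width.get(column),
    }
```

Calling `write_and_read_back(df, "out.sav", "Text500", 500)` on a data frame with a short string column would show whether the returned format and widths match what was requested.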
Hi @ofajardo, my concern is that there is indeed a bug in the ReadStat library: it doesn't take the format into account for strings longer than 255 characters. And I'm talking about display_width here.
Maybe, but if the bug is in ReadStat (and I think it is), it should be fixed there and not worked around here in pyreadstat.
I am having this same issue. In my case, I have found that it is actually an issue within the pyreadstat.write_sav() function: when reading in the data and modifying it, the metadata is retained correctly, but it is altered after writing the sav file. I had the script print the metadata formats and storage widths before writing the sav, and cross-referenced them with the metadata formats and storage widths from the output sav: Metadata before saving: Metadata after saving: Furthermore, string variables with data of more than 255 bytes are altered to have a container width of what I presume to be the longest actual value in the dataset, i.e. 'QE4': 'A1000'. I am building a database with these files, so standardization of formats is vital in the merging stage for me. I have the full script here, but the part of the script where I believe the bug is happening is as follows:
I would appreciate any advice or fixes for this bug. Again, my main concern is that this needs to be run for many different survey datasets that will eventually be merged. As I understand it, the metadata for variables, in this case format and storage width (my next step is to rename the variables, so those will be standardized too), needs to be the same in order to merge two variables with the same column name. That cannot be achieved with the way the str variables are currently output, unless I modify them manually.
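The before/after comparison described in this comment can be sketched with plain dictionaries shaped like `meta.original_variable_types`. This is a minimal illustration only: `string_width` and `changed_formats` are hypothetical helpers, and the sample values are made up apart from `'QE4': 'A1000'`, which the comment reports as a format seen after saving.

```python
import re

# SPSS string formats look like 'A500': the letter A followed by a width
_STRING_FORMAT = re.compile(r"^A(\d+)$")


def string_width(fmt):
    """Return the width of a string format such as 'A500', or None for other formats."""
    match = _STRING_FORMAT.match(fmt)
    return int(match.group(1)) if match else None


def changed_formats(before, after):
    """Return {variable: (format_before, format_after)} for every format that differs."""
    return {
        name: (fmt, after.get(name))
        for name, fmt in before.items()
        if after.get(name) != fmt
    }


# Made-up example metadata; 'A2000' is an assumed "before" value for illustration
before = {"QE4": "A2000", "Q1": "F8.2"}
after = {"QE4": "A1000", "Q1": "F8.2"}

for name, (old, new) in changed_formats(before, after).items():
    print(f"{name}: {old} -> {new} (requested width {string_width(old)})")
```

Running a check like this across every dataset before merging would flag the variables whose formats need to be corrected by hand.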
Describe the issue
I'd like to set the data format for strings properly.
The problem occurs with the format 'A500' (where the number is bigger than 255).
However, the format is set properly if there is a value in the df that has the proper length.
To Reproduce
Result in SPSS.
File example
File created by above code
Expected behavior
Column Text500 should have width 500.
Setup Information:
How did you install pyreadstat? pip
Platform: macos
Python Version: 3.12
Python Distribution: brew
Using Virtualenv or condaenv? venv