pyreadstat not honoring format for strings longer than 255 bytes #267

Open
maver1ck opened this issue Aug 19, 2024 · 6 comments
Labels
bug (Something isn't working); requires changes in Readstat (waiting for changes in the C library Readstat); to be reported in Readstat

Comments

maver1ck commented Aug 19, 2024

Describe the issue

I'd like to set the data format for string variables explicitly.
This fails for formats such as 'A500', where the width is larger than 255.
The format is set correctly, however, if the column in df contains a value of that length.

To Reproduce

import pandas as pd
import pyreadstat

df = pd.DataFrame({
    'Text5': ['0123456789'],
    'Text50': ['0123456789'],
    'Text500': ['0123456789'],
    'Text500A': ['0' * 500],
})
variable_format = {'Text5': 'A5', 'Text50': 'A50', 'Text500': 'A500', 'Text500A': 'A500'}
pyreadstat.write_sav(df, 'bug.sav', variable_format=variable_format)

Result in SPSS: [screenshot]

File example

File created by the code above

Expected behavior

Column Text500 should have width 500.

Setup Information:

How did you install pyreadstat? pip
Platform: macos
Python Version: 3.12
Python Distribution: brew
Using Virtualenv or condaenv? venv

Collaborator

ofajardo commented Sep 2, 2024

Thanks for the report. The issue is also visible if you read the resulting file with pyreadstat and look at meta.original_variable_types, which means the format is not being set correctly. There is probably a limit set at 255.

It sounds like an issue in the underlying C library, Readstat. We would need to file an issue over there and wait for it to be fixed.

ofajardo added the labels bug, requires changes in Readstat, and to be reported in Readstat on Sep 2, 2024
@maver1ck
Author

@ofajardo
I'm not sure about it.
It looks like we're calculating this maximum string length here:
https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_writer.pyx#L145

And then using it here:
https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_writer.pyx#L698

Collaborator

ofajardo commented Sep 26, 2024

I am afraid you are mixing things up. What you are looking at is the length of the string itself, which has to be calculated in order for the C library to be able to write it. I am sure that is working fine, because if you set the wrong length, the C library just crashes. A different thing is the format, which is only the display in SPSS. For example you could have a very long string but decide to display only the first 100 characters, or in your case have a short string but force it to display 500 characters. That is the one that is not working for you.

That one is set here, and it just passes along whatever parameter you give as the format. No calculation happens there; the C library is ignoring it.
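To illustrate the distinction between the two, here is a minimal sketch in plain Python. It assumes the storage length is the longest value measured in UTF-8 bytes; `max_utf8_length` is a hypothetical helper written for this example and is not a pyreadstat function.

```python
def max_utf8_length(values):
    """Longest value in bytes once encoded as UTF-8 (0 for an empty column)."""
    return max((len(str(v).encode("utf-8")) for v in values), default=0)


# Column contents taken from the reproduction in this issue.
text500 = ["0123456789"]
text500a = ["0" * 500]

# The storage length is derived from the data itself ...
storage_lengths = {
    "Text500": max_utf8_length(text500),
    "Text500A": max_utf8_length(text500a),
}

# ... while the format is only what the caller requested for display.
requested_format = {"Text500": "A500", "Text500A": "A500"}

for name, length in storage_lengths.items():
    print(f"{name}: storage length {length}, requested format {requested_format[name]}")
```

For Text500 the two numbers disagree (10 bytes of data, a requested format of A500), which is exactly the case reported here.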

In theory, you could also try setting variable_display_width for your variable, which should have the effect you want, but in my hands that did not work either (you can try as well!):

variable_display_width = {'Text500': 500}

Question: I guess that in SPSS you can force the display width of the short string to be 500, right?

Author

maver1ck commented Sep 30, 2024

Hi @ofajardo
Let me explain one more time.
I understand the difference between readstat_variable_set_format and readstat_add_variable.

My concern is that there is indeed a bug in the Readstat library: it does not take the format into account for strings longer than 255 characters.
However, this bug could possibly be fixed here by changing this line:
https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_writer.pyx#L698C24-L698C131
into something like:
variable = readstat_add_variable(writer, variable_name.encode("utf-8"), pandas_to_readstat_types[curtype], max(max_length, length_from_format_string))

And I'm talking about display_width here.
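The idea can be sketched in plain Python, outside the Cython writer. `length_from_format_string` is only a name used in the line above; the implementation below is an assumption (it treats formats of the form `A<number>` as string formats and returns 0 for anything else), and `proposed_storage_width` is a hypothetical helper that is not part of pyreadstat.

```python
import re

# Matches SPSS string formats such as 'A5' or 'A500'.
_STRING_FORMAT = re.compile(r"^A(\d+)$", re.IGNORECASE)


def length_from_format_string(fmt):
    """Return the width of a string format such as 'A500', or 0 otherwise."""
    match = _STRING_FORMAT.match(fmt.strip()) if fmt else None
    return int(match.group(1)) if match else 0


def proposed_storage_width(max_length, fmt):
    """Width that would be passed on: the larger of data length and format width."""
    return max(max_length, length_from_format_string(fmt))


# Text500 from the reproduction: 10 bytes of data, requested format 'A500'.
print(proposed_storage_width(10, "A500"))
```

With this rule a column never gets narrower than its data, and a requested format wider than the data widens the variable.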

@ofajardo
Collaborator

Maybe, but if the bug is in Readstat (and I think it is), it should be fixed there and not worked around here in pyreadstat.

@NERNST02

I am having the same issue. In my case, I have found that it is actually an issue within the pyreadstat.write_sav() function: when I read in the data and modify it, the metadata is retained correctly, but it is altered once the sav file is written. I had the script print the metadata formats and storage widths before writing the sav file, and cross-referenced them with the metadata formats and storage widths from the output sav file:

Metadata before saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A16000', 'LocationLatitude': 'A16000', 'LocationLongitude': 'A16000', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12.0', 'ClubMainAnnualFee': 'DOLLAR12.0', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A16000', 'Q2_3': 'A16000', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....

Metadata after saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A255', 'LocationLatitude': 'A255', 'LocationLongitude': 'A255', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12', 'ClubMainAnnualFee': 'DOLLAR12', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A255', 'Q2_3': 'A255', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....
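A comparison like the one above can be automated with a small helper. This is a sketch only: `changed_formats` is a hypothetical function, and the two dictionaries are a subset of the formats listed above rather than the full metadata.

```python
def changed_formats(before, after):
    """Map each variable whose format changed to a (before, after) pair."""
    return {
        name: (fmt, after.get(name))
        for name, fmt in before.items()
        if after.get(name) != fmt
    }


# Subset of the formats printed before and after writing the sav file.
before = {
    "ResponseId": "A50",
    "IPAddress": "A16000",
    "ClubMainInitiationFee": "DOLLAR12.0",
    "Q2_2": "A16000",
}
after = {
    "ResponseId": "A50",
    "IPAddress": "A255",
    "ClubMainInitiationFee": "DOLLAR12",
    "Q2_2": "A255",
}

for name, (old, new) in changed_formats(before, after).items():
    print(f"{name}: {old} -> {new}")

# With pyreadstat installed, `after` could instead be read back from the file:
#   _, meta = pyreadstat.read_sav(output_file_path, metadataonly=True)
#   after = meta.original_variable_types
```

Besides the string formats capped at A255, the listing also shows DOLLAR12.0 coming back as DOLLAR12.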

Furthermore, string variables whose data exceeds 255 bytes are altered to have a container width of what I presume to be the longest actual value in the dataset, for example:

'QE4': 'A1000'
'QG5': 'A1002'
'QG4': 'A1014'

... and so on, despite my explicitly setting the metadata format and storage width to A16000 and 16000, respectively.

I am building a database with these files, so standardization of formats is vital in the merging stage for me. I have the full script here, but the part of the script where I believe the bug is happening is as follows:

output_file_path = os.path.join(BASE_PATH, OUTPUT_FILE)
print("Saving merged file with metadata...")
pyreadstat.write_sav(
    merged_data,
    output_file_path,
    variable_value_labels=merged_meta.variable_value_labels,
    column_labels=merged_meta.column_labels,
    variable_display_width=merged_meta.variable_display_width,
    variable_measure=merged_meta.variable_measure,
    variable_format=merged_meta.original_variable_types,
)

I would appreciate any advice or fixes for this bug. Again, my main concern is that this needs to run for many different survey datasets that will eventually be merged. As I understand it, the metadata for variables, in this case format and storage width (my next step is to rename the variables, so those will be standardized too), needs to be the same in order to merge two variables with the same column name. That cannot be achieved with the way string variables are currently written, unless I modify them manually.
