pyreadstat not honoring format for strings longer than 255 bytes #267

maver1ck · 2024-08-19T10:40:59Z

Describe the issue

I'd like to set the data format for string columns correctly.
The problem occurs with the format 'A500' (any width greater than 255).
However, the format is set correctly if the column contains a value that actually has that length.

To Reproduce

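import pandas as pd
import pyreadstat

# Every column holds a 10-character value except Text500A, which holds 500 characters.
df = pd.DataFrame({
    'Text5': ['0123456789'],
    'Text50': ['0123456789'],
    'Text500': ['0123456789'],
    'Text500A': ['0' * 500],
})
# Requested SPSS string formats; only the widths above 255 show the problem.
variable_format = {'Text5': 'A5', 'Text50': 'A50', 'Text500': 'A500', 'Text500A': 'A500'}
pyreadstat.write_sav(df, 'bug.sav', variable_format=variable_format)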

Result in SPSS.

File example

File created by above code

Expected behavior

Column Text500 should have width 500.

Setup Information:

How did you install pyreadstat? pip
Platform: macos
Python Version: 3.12
Python Distribution: brew
Using Virtualenv or condaenv? venv


ofajardo · 2024-09-02T15:06:48Z

Thanks for the report. The issue can also be seen if you read the resulting file with pyreadstat and look at meta.original_variable_types, which means the format is not being set correctly. There is probably a limit set at 255.

It sounds like an issue in the underlying C library Readstat. We would need to file an issue over there and wait for it to be solved.

maver1ck · 2024-09-25T20:53:58Z

@ofajardo
I'm not sure about it.
It looks like we're calculating this maximum string length here:
https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_writer.pyx#L145

And then using it here:
https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_writer.pyx#L698

ofajardo · 2024-09-26T11:49:09Z

I am afraid you are mixing things up. What you are looking at is the length of the string itself, which has to be calculated in order for the C library to be able to write it. I am sure that is working fine, because if you set the wrong length, the C library just crashes. The format is a different thing: it only controls the display in SPSS. For example, you could have a very long string but decide to display only the first 100 characters, or, in your case, have a short string but force it to display 500 characters. That is the one that is not working for you.

That one is set here, and the code just passes along whatever parameter you give as the format. There is no calculation there; the C library is ignoring it.

Theoretically, you could also try setting variable_display_width for your variable, which should have the effect you want, but in my hands that did not work either (you can try as well!)

variable_display_width = {'Text500': 500}

Question: I guess that in SPSS you can force the display width of the short string to be 500 right?
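
To make the distinction above concrete, here is a minimal sketch that compares the two quantities for the columns in the reproducer: the longest encoded value (what the writer has to store) and the width requested through the format. The helper names `max_byte_length` and `likely_affected` are hypothetical, and the rule in `likely_affected` is only an assumption drawn from the reports in this thread, not something taken from the pyreadstat or Readstat source.

```python
def max_byte_length(values, encoding="utf-8"):
    """Longest encoded length among the string values in a column."""
    return max(
        (len(v.encode(encoding)) for v in values if isinstance(v, str)),
        default=0,
    )


def likely_affected(data_bytes, requested_width, limit=255):
    """Assumed rule from the thread: the requested width exceeds 255
    and no value in the column is actually that long."""
    return requested_width > limit and data_bytes < requested_width


# Same data and widths as the reproducer in the first comment.
columns = {
    "Text5": ["0123456789"],
    "Text50": ["0123456789"],
    "Text500": ["0123456789"],
    "Text500A": ["0" * 500],
}
requested_width = {"Text5": 5, "Text50": 50, "Text500": 500, "Text500A": 500}

for name, values in columns.items():
    data_bytes = max_byte_length(values)
    flag = likely_affected(data_bytes, requested_width[name])
    print(name, data_bytes, requested_width[name], flag)
```

Under that assumption only Text500 is flagged, which is the column the report says ends up with the wrong width.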

maver1ck · 2024-09-30T11:12:32Z

Hi @ofajardo
Let me explain it one more time.
I understand the difference between readstat_variable_set_format and readstat_add_variable.

My point is that there is indeed a bug in the readstat library: it doesn't take the format into account for strings longer than 255 characters.
However, this bug could possibly be fixed here by changing this line:
https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_writer.pyx#L698C24-L698C131
into something like:
variable = readstat_add_variable(writer, variable_name.encode("utf-8"), pandas_to_readstat_types[curtype], max(max_length, length_from_format_string))

And I'm talking about display_width here.
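
The proposal above leaves `length_from_format_string` undefined. A minimal pure-Python sketch of what it could look like follows; the regular expression, the fallback to 0 for non-string formats, and the wrapper `storage_width` are all assumptions made for illustration, not code from pyreadstat.

```python
import re

# Matches SPSS string formats such as "A500"; numeric formats like "F8.2" do not match.
_STRING_FORMAT = re.compile(r"^A(\d+)$", re.IGNORECASE)


def length_from_format_string(fmt):
    """Return the width of an 'A<n>' format, or 0 if fmt is missing or not a string format."""
    match = _STRING_FORMAT.match(fmt.strip()) if fmt else None
    return int(match.group(1)) if match else 0


def storage_width(max_length, fmt):
    """Width that would be passed to readstat_add_variable under the proposal."""
    return max(max_length, length_from_format_string(fmt))


print(storage_width(10, "A500"))   # 500: short data, wide format
print(storage_width(500, "A500"))  # 500: data already fills the width
print(storage_width(10, "A5"))     # 10: never shrink below the data
print(storage_width(10, None))     # 10: no format given
```

Taking the maximum means the stored width never drops below the longest value, so the existing length calculation keeps its role and the format can only widen the variable.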

ofajardo · 2024-09-30T11:51:00Z

Maybe, but if the bug is in Readstat (and I think it is), it should be fixed there and not worked around here in pyreadstat.

NERNST02 · 2024-11-21T16:37:44Z

I am having the same issue. In my case, I have found that it is actually an issue within the pyreadstat.write_sav() function: when I read in the data and modify it, the metadata is retained correctly; however, it is altered after writing the sav file. I had the script print the metadata formats and storage widths before writing the sav file, and cross-referenced them with the metadata formats and storage widths from the output sav file:

Metadata before saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A16000', 'LocationLatitude': 'A16000', 'LocationLongitude': 'A16000', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12.0', 'ClubMainAnnualFee': 'DOLLAR12.0', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A16000', 'Q2_3': 'A16000', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....

Metadata after saving:
Variable Formats: {'ResponseId': 'A50', 'IPAddress': 'A255', 'LocationLatitude': 'A255', 'LocationLongitude': 'A255', 'ResponseDurationSeconds': 'F40.2', 'Progress': 'F40.2', 'Finished': 'F40.0', 'StartDate': 'DATETIME20', 'EndDate': 'DATETIME20', 'RecordedDate': 'DATETIME20', 'DueDate': 'DATETIME11', 'SurveyType': 'A200', 'RespondentType': 'A200', 'ClubIndex': 'A200', 'ClubName': 'A200', 'ClubCategory': 'A32', 'ClubTotalMemberCount': 'F8.0', 'ClubTotalSpouseCount': 'F8.0', 'ClubResponseMember': 'F8.0', 'ClubResponseSpouse': 'F8.0', 'TotalClubResponse': 'F8.0', 'ClubMainInitiationFee': 'DOLLAR12', 'ClubMainAnnualFee': 'DOLLAR12', 'ClubType': 'F8.0', 'ClubAddressRegion': 'F8.0', 'ClubAddressCity': 'A200', 'ClubAddressState': 'A200', 'ClubAddressZip': 'A200', 'Q2_2': 'A255', 'Q2_3': 'A255', 'QA1': 'F8.0', 'QA2': 'F8.0', 'QA3': 'F8.0', 'QA4': 'F8.0', ....
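
The cross-check described above can be automated. The following is a rough sketch, assuming the formats are available as plain dicts: `diff_formats` and `read_back_formats` are hypothetical helper names, while `pyreadstat.read_sav(..., metadataonly=True)` and `meta.original_variable_types` are the existing pyreadstat calls mentioned earlier in the thread. The sample values are a subset of the two dumps above.

```python
def diff_formats(requested, written):
    """Return {column: (requested, written)} for every format that changed."""
    return {
        name: (fmt, written.get(name))
        for name, fmt in requested.items()
        if written.get(name) != fmt
    }


def read_back_formats(path):
    """Read only the metadata of a .sav file and return its format strings."""
    import pyreadstat  # imported lazily so diff_formats works without it

    _, meta = pyreadstat.read_sav(path, metadataonly=True)
    return meta.original_variable_types


# A subset of the formats printed before and after saving.
before = {
    "ResponseId": "A50",
    "IPAddress": "A16000",
    "ClubMainInitiationFee": "DOLLAR12.0",
    "Q2_2": "A16000",
    "QA1": "F8.0",
}
after = {
    "ResponseId": "A50",
    "IPAddress": "A255",
    "ClubMainInitiationFee": "DOLLAR12",
    "Q2_2": "A255",
    "QA1": "F8.0",
}

for column, (wanted, got) in diff_formats(before, after).items():
    print(f"{column}: requested {wanted}, written {got}")
```

Besides the string columns, this comparison also picks up that DOLLAR12.0 comes back as DOLLAR12 in the dumps above.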

Furthermore, string variables with data more than 255 bytes are altered to have a container width of what I presume to be the longest actual value in the dataset, i.e.

'QE4': 'A1000'
'QG5': 'A1002'
'QG4': 'A1014' ... and so on, despite explicitly setting the metadata format and storage width to be A16000 and 16000 respectively

I am building a database with these files, so standardization of formats is vital for me in the merging stage. I have the full script here, but the part of the script where I believe the bug is happening is as follows:

output_file_path = os.path.join(BASE_PATH, OUTPUT_FILE)
print("Saving merged file with metadata...")
pyreadstat.write_sav(
    merged_data,
    output_file_path,
    variable_value_labels=merged_meta.variable_value_labels,
    column_labels=merged_meta.column_labels,
    variable_display_width=merged_meta.variable_display_width,
    variable_measure=merged_meta.variable_measure,
    variable_format=merged_meta.original_variable_types,
)

I would appreciate any advice or fixes for this bug. Again, my main concern is that this needs to run for many different survey datasets that will eventually be merged. As I understand it, the metadata for variables, in this case format and storage width (my next step is to rename the variables, so those will be standardized too), needs to be the same in order to merge two variables with the same column name. That cannot be achieved with the way the string variables are currently written, unless I modify them manually.
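
Until the underlying bug is resolved, one way to see which variables would block a merge is to compare the formats across the written files. This is only a sketch: `inconsistent_formats` is a hypothetical helper, the file names are made up, and the widths merely echo the kind of values listed above rather than coming from real files.

```python
from collections import defaultdict


def inconsistent_formats(formats_by_file):
    """Map column -> {format: [files]} for columns whose format differs between files."""
    seen = defaultdict(lambda: defaultdict(list))
    for file_name, formats in formats_by_file.items():
        for column, fmt in formats.items():
            seen[column][fmt].append(file_name)
    return {
        column: dict(by_format)
        for column, by_format in seen.items()
        if len(by_format) > 1
    }


# Hypothetical per-file formats, e.g. collected from meta.original_variable_types.
formats_by_file = {
    "survey_a.sav": {"QE4": "A1000", "QA1": "F8.0"},
    "survey_b.sav": {"QE4": "A1014", "QA1": "F8.0"},
}

print(inconsistent_formats(formats_by_file))
```

Columns that appear in the result are the ones whose widths would have to be aligned by hand before merging.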

ofajardo added the labels "bug" (Something isn't working), "requires changes in Readstat" (waiting for changes in the C library Readstat), and "to be reported in Readstat" on Sep 2, 2024

NERNST02 mentioned this issue Nov 21, 2024

Pyreadstat.write_sav() altering str formats and storage widths WizardMac/ReadStat#321

Open
