invalid parquet version error for parquet files generated via python script #144

@saritvakrat

Hi, I am trying to read Parquet files that are stored in S3 and were generated by a Python script.
I get the following error:
Error: thrown: "invalid parquet version"
When I read a similar file that was generated by Spark, the reader manages to digest the file and read it.

I am also able to parse the Python-generated file and open it in a Parquet viewer.

Any idea why? The file uses Parquet format version 2.
File metadata:
file written by pyarrow 11.0.0
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 6
num_rows: 42
num_row_groups: 1
format_version: 2.6
serialized_size: 3975
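
One quick way to rule out a truncated or corrupted download before blaming the reader is to check the file's framing. A Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes just before the trailing magic hold the footer length as a little-endian unsigned integer. The sketch below uses a hypothetical helper, `inspectParquetFraming`, that is not part of parquetjs. It only checks the framing and does not decode the Thrift footer, so it cannot report the footer's version field.

```typescript
// Hypothetical diagnostic helper; not part of parquetjs or pyarrow.
interface ParquetFraming {
  hasLeadingMagic: boolean;
  hasTrailingMagic: boolean;
  footerLength: number | null; // null when the trailing magic is missing
}

const PARQUET_MAGIC = 'PAR1';

export function inspectParquetFraming(data: Buffer): ParquetFraming {
  // Leading magic (4) + footer length (4) + trailing magic (4) is the minimum size.
  if (data.length < 12) {
    return { hasLeadingMagic: false, hasTrailingMagic: false, footerLength: null };
  }
  const hasLeadingMagic =
    data.subarray(0, 4).toString('latin1') === PARQUET_MAGIC;
  const hasTrailingMagic =
    data.subarray(data.length - 4).toString('latin1') === PARQUET_MAGIC;
  // The footer length sits in the 4 bytes before the trailing magic.
  const footerLength = hasTrailingMagic
    ? data.readUInt32LE(data.length - 8)
    : null;
  return { hasLeadingMagic, hasTrailingMagic, footerLength };
}
```

Passing the downloaded file's bytes (for example from `fs.readFileSync(tempFilePath)`) should give `true` for both magic checks if the object was copied from S3 intact.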

Full error:

(node:41711) V8: /Users/saritvakrat/Documents/automation/be_automation/node_modules/brotli/build/encode.js:34 Linking failure in asm.js: Unexpected stdlib member (Use `node --trace-warnings ...` to show where the warning was created)
console.error
Error parsing Parquet file: invalid parquet version

  39 |         return records;
  40 |     } catch (error) {
> 41 |         console.error('Error parsing Parquet file:', error);
     |                 ^
  42 |         throw error; // Rethrow the error to be handled by the caller
  43 |     }
  44 | }
  
Packages:
"parquetjs": "^0.11.2",
"@types/parquetjs": "^0.10.6",

My function:
```typescript
import { ParquetReader } from 'parquetjs';

export async function parseParquetFile(filePath: string): Promise<any[]> {
    try {
        // Open the Parquet file with a new ParquetReader
        const reader = await ParquetReader.openFile(filePath) as any;
        // Create a cursor over all rows
        const cursor = reader.getCursor();
        const records: any[] = [];
        // Read records one at a time until the cursor returns null
        let record = await cursor.next();
        while (record !== null) {
            records.push(record);
            record = await cursor.next();
        }
        await reader.close();
        return records;
    } catch (error) {
        console.error('Error parsing Parquet file:', error);
        throw error; // Rethrow the error to be handled by the caller
    }
}
```
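
In the function above, `reader.close()` is skipped whenever `cursor.next()` throws partway through the file. A minimal sketch of the same read loop with the close moved into a `finally` block follows. It is written against two hypothetical structural interfaces, `RecordCursor` and `ClosableReader`, which only mirror the `getCursor()`, `next()`, and `close()` calls used in the post. They are assumptions for illustration, not parquetjs's own types, and this change is unrelated to the version error itself.

```typescript
// Hypothetical interfaces mirroring only the calls used in the post.
interface RecordCursor<T> {
  next(): Promise<T | null>; // resolves to null once all rows are read
}

interface ClosableReader<T> {
  getCursor(): RecordCursor<T>;
  close(): Promise<void>;
}

// Drain every record from the reader and always close it afterwards,
// even if reading fails midway.
export async function readAllRecords<T>(reader: ClosableReader<T>): Promise<T[]> {
  const records: T[] = [];
  try {
    const cursor = reader.getCursor();
    let record = await cursor.next();
    while (record !== null) {
      records.push(record);
      record = await cursor.next();
    }
    return records;
  } finally {
    await reader.close();
  }
}
```

With this shape, `parseParquetFile` could open the reader, hand it to `readAllRecords`, and keep only the logging in its own `catch`.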

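```typescript
// Imports this method appears to rely on (assumed; not shown in the original post)
import { GetObjectCommand } from '@aws-sdk/client-s3';
import { createWriteStream } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

// Method of a class that holds an S3 client in this.s3Client
async parseSingleParquetFromS3(bucketName: string, key: string | null | undefined): Promise<any[]> {
    if (!bucketName || !key) {
        throw new Error('Bucket name or key is not provided');
    }

    const getObjectCommand = new GetObjectCommand({
        Bucket: bucketName,
        Key: key
    });

    let objectResponse;
    try {
        objectResponse = await this.s3Client.send(getObjectCommand);
    } catch (error) {
        console.error(`Error fetching object from S3: ${error}`);
        throw error;
    }

    const objectData = objectResponse.Body;
    if (!(objectData instanceof Readable)) {
        throw new Error('Object data is not a readable stream');
    }

    // Download the object to a temporary file, then parse it from disk
    const fileName = key.split('/').pop() || 'temp.parquet';
    const tempFilePath = join(tmpdir(), fileName);

    try {
        await pipeline(objectData, createWriteStream(tempFilePath));
        return await parseParquetFile(tempFilePath);
    } catch (error) {
        console.error(`Error downloading or parsing the Parquet file: ${error}`);
        throw error;
    }
}
```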
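
The method above writes to a fixed name directly under the system temp directory and never removes the file, so two downloads with the same file name can overwrite each other and files pile up over time. A minimal sketch of one way around that follows, using a hypothetical helper `withTempFile` (not part of the original code) that creates a unique directory with `mkdtemp` and deletes it once the callback finishes. It has no bearing on the version error and assumes a Node.js release that provides `fs/promises` `rm`.

```typescript
import { mkdtemp, rm } from 'fs/promises';
import { tmpdir } from 'os';
import { join } from 'path';

// Hypothetical helper: run `use` with a path inside a fresh temp directory,
// then remove the directory whether `use` succeeded or threw.
export async function withTempFile<T>(
  fileName: string,
  use: (filePath: string) => Promise<T>
): Promise<T> {
  // mkdtemp appends random characters to the prefix, giving a unique directory.
  const dir = await mkdtemp(join(tmpdir(), 'parquet-'));
  try {
    return await use(join(dir, fileName));
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

Inside `parseSingleParquetFromS3`, the `pipeline(...)` and `parseParquetFile(...)` calls could then run in the callback, with `tempFilePath` replaced by the path the helper passes in.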
