invalid parquet version error for parquet files generated via python script #144
Description
Hi, I am trying to read Parquet files that are stored in S3 and were generated by a Python script.
I get the following error:
Error: thrown: "invalid parquet version"
When I read a similar file that was generated by Spark, the reader parses it without any problem.
I can also parse the Python-generated file and open it in a Parquet viewer.
Any idea why? The file uses Parquet format version 2.
File metadata:
file written by pyarrow 11.0.0
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 6
num_rows: 42
num_row_groups: 1
format_version: 2.6
serialized_size: 3975
Full error:
(node:41711) V8: /Users/saritvakrat/Documents/automation/be_automation/node_modules/brotli/build/encode.js:34 Linking failure in asm.js: Unexpected stdlib member
(Use `node --trace-warnings ...` to show where the warning was created)
console.error
Error parsing Parquet file: invalid parquet version
39 | return records;
40 | } catch (error) {
> 41 | console.error('Error parsing Parquet file:', error);
| ^
42 | throw error; // Rethrow the error to be handled by the caller
43 | }
44 | }
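The `thrown: "invalid parquet version"` wording suggests the rejection value is a plain string rather than an `Error`, which would explain why no stack trace appears. A minimal sketch of normalizing such values in the `catch` block follows; `toError` is a hypothetical helper, not part of parquetjs.

```typescript
// Hypothetical helper (not from parquetjs): wraps any thrown value in an Error
// so that callers always get a message and a stack trace.
export function toError(thrown: unknown): Error {
  if (thrown instanceof Error) {
    return thrown; // already an Error, pass it through unchanged
  }
  // Strings, numbers, and other values are converted to text for the message.
  return new Error(`Non-Error value thrown: ${String(thrown)}`);
}
```

In the `catch` block this could be used as `throw toError(error);` so the caller receives a real `Error` with a stack.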
Packages:
"parquetjs": "^0.11.2",
"@types/parquetjs": "^0.10.6",
My functions:
```ts
import { ParquetReader } from 'parquetjs';

export async function parseParquetFile(filePath: string): Promise<any[]> {
  try {
    // Create a new ParquetReader for the local file
    const reader = await ParquetReader.openFile(filePath) as any;
    // Create a new cursor over all rows
    const cursor = reader.getCursor();
    const records = [];
    // Read all records from the file until the cursor returns null
    let record = await cursor.next();
    while (record !== null) {
      records.push(record);
      record = await cursor.next();
    }
    await reader.close();
    return records;
  } catch (error) {
    console.error('Error parsing Parquet file:', error);
    throw error; // Rethrow the error to be handled by the caller
  }
}
```
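```ts
// Imports this method appears to rely on (assumed; the original post omits them)
import { GetObjectCommand } from '@aws-sdk/client-s3';
import { createWriteStream } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

// Class method; `this.s3Client` is the S3 client held by the enclosing class
async parseSingleParquetFromS3(bucketName: string, key: string | null | undefined): Promise<any[]> {
  if (!bucketName || !key) {
    throw new Error('Bucket name or key is not provided');
  }
  const getObjectCommand = new GetObjectCommand({
    Bucket: bucketName,
    Key: key
  });
  let objectResponse;
  try {
    objectResponse = await this.s3Client.send(getObjectCommand);
  } catch (error) {
    console.error(`Error fetching object from S3: ${error}`);
    throw error;
  }
  const objectData = objectResponse.Body;
  if (!(objectData instanceof Readable)) {
    throw new Error('Object data is not a readable stream');
  }
  // Stream the object to a temporary file, then parse it from disk
  const fileName = key.split('/').pop() || 'temp.parquet';
  const tempFilePath = join(tmpdir(), fileName);
  try {
    await pipeline(objectData, createWriteStream(tempFilePath));
    return await parseParquetFile(tempFilePath);
  } catch (error) {
    console.error(`Error in streaming data to file: ${error}`);
    throw error;
  }
}
```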
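To rule out a truncated or corrupted download, the temporary file's framing can be checked before handing it to the reader. The Parquet format starts and ends a file with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The sketch below assumes the whole file fits in memory; `inspectParquetFraming` and `ParquetFraming` are hypothetical names, not part of parquetjs.

```typescript
// Hypothetical helper (not from parquetjs): checks the Parquet magic bytes
// at both ends of the file and reads the declared footer length.
export interface ParquetFraming {
  headerOk: boolean;    // file starts with "PAR1"
  footerOk: boolean;    // file ends with "PAR1"
  footerLength: number; // metadata length in bytes, as declared in the file
}

export function inspectParquetFraming(data: Uint8Array): ParquetFraming {
  const magic = [0x50, 0x41, 0x52, 0x31]; // ASCII "PAR1"
  const hasMagicAt = (offset: number): boolean =>
    magic.every((byte, i) => data[offset + i] === byte);

  // Smallest possible framing: 4 (magic) + 4 (footer length) + 4 (magic).
  if (data.length < 12) {
    return { headerOk: false, footerOk: false, footerLength: 0 };
  }

  // Respect byteOffset so that Node Buffers backed by a shared pool work too.
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  return {
    headerOk: hasMagicAt(0),
    footerOk: hasMagicAt(data.length - 4),
    footerLength: view.getUint32(data.length - 8, true), // little-endian
  };
}
```

For the code above, this could be called as `inspectParquetFraming(readFileSync(tempFilePath))` right after the `pipeline` call. If both markers are present, the download is probably intact, and the error more likely comes from how the reader handles the file's format version than from the S3 transfer.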