
Reader and Writer for the Spark Engine #266

Open
zerodarkzone opened this issue Jan 28, 2025 · 6 comments


@zerodarkzone
Contributor

Hi,

PySpark supports the following syntax for reading different kinds of sources (CSV, Parquet, Delta, etc.):

# assumes an existing SparkSession named `spark`
df = spark.sql("select * from parquet.`employee.parquet`")
df.show()

Wouldn't it be better to use that instead of the Pandas reader it uses right now?

It should also be possible to use the native Spark DataFrameWriter when using this backend.
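
For the write side, a rough sketch of what that could look like (the output path and mode are just examples, not anything SQLFrame does today):

```python
# Sketch: write through Spark's native DataFrameWriter instead of
# converting to Pandas first. Path and mode are placeholders.
df.write.mode("overwrite").parquet("employee_out.parquet")
```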

@eakmanrq
Owner

This would mean taking a dependency on PySpark, right? The user would then be using a Spark context for writing, which, if nothing else, is not very intuitive given the library is about removing PySpark from the processing pipeline.

That being said, I agree the current Pandas solution is sub-optimal. It was added as an initial way to support these methods.

Similar to the rest of the project, the end goal in my mind is to push as much as possible down to the engine. For example, if BigQuery has a way to natively read Parquet files and then process them, SQLFrame would produce the correct SQL to do that, and the user would have the full operation pushed down to the engine.
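
To make that concrete, this is roughly the kind of SQL that could be generated for BigQuery (the dataset, table, and bucket names are placeholders; external tables over Parquet are a real BigQuery feature, but the SQLFrame side of this is hypothetical):

```python
# Hypothetical: SQL SQLFrame could emit so BigQuery reads the Parquet file
# natively and the whole operation runs inside the engine.
# Dataset, table, and bucket names are placeholders.
bq_sql = """
CREATE EXTERNAL TABLE my_dataset.employee
OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/employee.parquet']);

SELECT * FROM my_dataset.employee;
"""
```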

@zerodarkzone
Contributor Author

Yes, but I'm thinking of doing it just for the Spark backend, which already uses PySpark. (The same could be used for Databricks.)

The Spark backend is using Pandas when it could just be using its native tools.

For now I would just replace the Pandas reader/writer for the Spark and maybe the Databricks backends, just like DuckDB works right now. (For Databricks I'd also keep the Pandas reader/writer so it can still read/write local files.)
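
To show the direction, a minimal sketch of what the Spark backend's read path could look like (the class and attribute names are made up; the point is delegating to the wrapped SparkSession instead of Pandas):

```python
class SparkDataFrameReader:
    """Sketch only: delegate reads to the underlying SparkSession.

    `_session` is a made-up attribute standing in for wherever SQLFrame
    keeps the real pyspark SparkSession.
    """

    def __init__(self, session):
        self._session = session

    def load(self, path, format="parquet"):
        # PySpark's native DataFrameReader already handles csv/parquet/delta/etc.
        return self._session.read.format(format).load(path)
```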

@eakmanrq
Owner

Ah OK, I see. That makes sense for Spark.

In terms of Databricks, we wouldn't want to use PySpark directly, right? One of the advantages I saw in using SQLFrame on Databricks was the ability to take PySpark DataFrame code and run it on a SQL warehouse. If we did this, the only way that would work is if you have databricks-connect, which is annoying and could possibly conflict with SQLFrame (not sure about that).

@zerodarkzone
Contributor Author

Yes, Databricks uses the same syntax as PySpark, so the reader could be the same or similar. The writer, I think, could also be similar (Databricks supports writing to a file using SQL, but I'm not sure Spark can do the same).
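
If Spark can, it would presumably be through something like INSERT OVERWRITE DIRECTORY, which is in the Spark SQL docs; a minimal sketch (the output path is a placeholder, and whether this covers every writer option is an open question):

```python
# Spark SQL can write query results straight to files via
# INSERT OVERWRITE DIRECTORY. The output path is a placeholder.
spark.sql("""
    INSERT OVERWRITE DIRECTORY '/tmp/employee_out'
    USING parquet
    SELECT * FROM parquet.`employee.parquet`
""")
```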

@eakmanrq
Owner

Yeah, this is an interesting idea. As I said, I used Pandas initially just to get something working, but making this more native to the engine being used, and ideally pushing more down to the engine and less onto the client, is where we would ultimately want to end up. I would need to think a bit more about the details for each engine, though. If you want to tackle this for a specific engine, I would suggest opening a draft PR that demonstrates the direction you want to go, and we can align on that before investing in a full implementation and tests.

@zerodarkzone
Contributor Author

Hi,
I'll spend some time working on the Spark reader/writer and upload a draft in the next couple of weeks.
