
Reader and Writer for the Spark Engine #266

Open
zerodarkzone opened this issue Jan 28, 2025 · 6 comments


@zerodarkzone
Contributor

Hi,

PySpark supports the following syntax for reading different kinds of sources (CSV, Parquet, Delta, etc.):

# assumes an existing SparkSession named `spark`
df = spark.sql("select * from parquet.`employee.parquet`")
df.show()

Wouldn't it be better to use that instead of the Pandas reader it uses right now?

It should also be possible to use the native Spark DataFrameWriter when using this backend.
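
For the write side, a rough sketch of what that could look like (the output path and mode are just examples, not anything SQLFrame does today):

```python
# Sketch: write through Spark's native DataFrameWriter instead of
# converting to Pandas first. Path and mode are placeholders.
df.write.mode("overwrite").parquet("employee_out.parquet")
```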

@eakmanrq
Owner

This would mean taking a dependency on PySpark, right? The user would then be using a Spark context for writing, which, if nothing else, is not very intuitive given the library is about removing PySpark from the processing pipeline.

That being said, I agree the current Pandas solution is sub-optimal. It was added as an initial way to support these methods.

Similar to the rest of the project, the end goal in my mind is to push as much as possible down to the engine. For example, if BigQuery has a way to natively read Parquet files and then process them, SQLFrame would produce the correct SQL to do that, and the user would have the full operation pushed down to the engine.
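
To make that concrete, this is roughly the kind of SQL that could be generated for BigQuery (the dataset, table, and bucket names are placeholders; external tables over Parquet are a real BigQuery feature, but the SQLFrame side of this is hypothetical):

```python
# Hypothetical: SQL SQLFrame could emit so BigQuery reads the Parquet file
# natively and the whole operation runs inside the engine.
# Dataset, table, and bucket names are placeholders.
bq_sql = """
CREATE EXTERNAL TABLE my_dataset.employee
OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/employee.parquet']);

SELECT * FROM my_dataset.employee;
"""
```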

@zerodarkzone
Contributor Author

Yes, but I'm thinking of doing it just for the Spark backend, which already uses PySpark. (The same could be used for Databricks.)

The Spark backend is using Pandas when it could just be using its native tools.

For now I would just replace the Pandas reader/writer for the Spark and maybe the Databricks backends, just like DuckDB works right now. (For Databricks I'd also keep the Pandas reader/writer so it can still read/write local files.)
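
To show the direction, a minimal sketch of what the Spark backend's read path could look like (the class and attribute names are made up; the point is delegating to the wrapped SparkSession instead of Pandas):

```python
class SparkDataFrameReader:
    """Sketch only: delegate reads to the underlying SparkSession.

    `_session` is a made-up attribute standing in for wherever SQLFrame
    keeps the real pyspark SparkSession.
    """

    def __init__(self, session):
        self._session = session

    def load(self, path, format="parquet"):
        # PySpark's native DataFrameReader already handles csv/parquet/delta/etc.
        return self._session.read.format(format).load(path)
```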

@eakmanrq
Owner

Ah OK, I see. That makes sense for Spark.

In terms of Databricks, we wouldn't want to use PySpark directly, right? One of the advantages I saw in using SQLFrame on Databricks was the ability to take PySpark DataFrame code and run it on a SQL warehouse. If we did this, the only way that would work is if you have databricks-connect, which is annoying and could possibly conflict with SQLFrame (not sure about that).

@zerodarkzone
Contributor Author

Yes, Databricks uses the same syntax as PySpark, so the reader could be the same or similar. The writer, I think, could also be similar (Databricks supports writing to a file using SQL, but I'm not sure Spark can do the same).
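
If Spark can, it would presumably be through something like INSERT OVERWRITE DIRECTORY, which is in the Spark SQL docs; a minimal sketch (the output path is a placeholder, and whether this covers every writer option is an open question):

```python
# Spark SQL can write query results straight to files via
# INSERT OVERWRITE DIRECTORY. The output path is a placeholder.
spark.sql("""
    INSERT OVERWRITE DIRECTORY '/tmp/employee_out'
    USING parquet
    SELECT * FROM parquet.`employee.parquet`
""")
```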

@eakmanrq
Owner

Yeah, this is an interesting idea. As I said, I used Pandas initially just to get something working, but making this more native to the engine being used, and ideally pushing more down to the engine and less onto the client, is where we would ultimately want to end up. I would need to think a bit more about the details for each engine, though. If you want to tackle this for a specific engine, I would suggest opening a draft PR that demonstrates the direction you want to go, and we can align on that before investing in a full implementation and tests.

@zerodarkzone
Contributor Author

Hi,
I'll spend some time working on the Spark reader/writer and upload a draft in the next couple of weeks.
