Reader and Writer for the Spark Engine #266
Comments
This would then mean you have a dependency on PySpark, right? The user would then be using a Spark context for writing, which, if nothing else, is not very intuitive given the library is about removing PySpark from the processing pipeline. That being said, I agree the current Pandas solution is sub-optimal. It was added as an initial way to support these methods. Similar to the rest of the project, the end goal in my mind is to push as much as possible down to the engine. For example, if BigQuery has a way to natively read parquet files and then do processing on them, SQLFrame would produce the correct SQL to do that and the user would have the full operation pushed down to the engine.
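For illustration, a minimal sketch of the kind of engine-native parquet read BigQuery already supports (using the google-cloud-bigquery client; the dataset, table, and bucket names are placeholders). This is the sort of SQL SQLFrame could emit instead of routing the read through Pandas on the client:

```python
# Sketch: BigQuery can read parquet natively via an external table,
# so the read (and any downstream processing) stays in the engine.
# Dataset/table/bucket names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

client.query(
    """
    CREATE OR REPLACE EXTERNAL TABLE my_dataset.events_ext
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*.parquet']
    )
    """
).result()

# All processing over the parquet files now runs inside BigQuery.
rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.events_ext").result()
print(list(rows))
```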
Ah ok, I see, that makes sense for Spark. In terms of Databricks, we wouldn't want to use PySpark directly, right? One of the advantages I was seeing for using SQLFrame on Databricks was the ability to take PySpark DataFrame code and run it on a SQL warehouse. If we did this, then the only way that would work is if you have databricks-connect, which is annoying and could possibly have conflicts with SQLFrame (not sure about that).
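To make the SQL-warehouse path concrete, a minimal sketch using the databricks-sql-connector package (the hostname, HTTP path, and token are placeholders), which talks to a warehouse without PySpark or databricks-connect:

```python
# Sketch: querying a Databricks SQL warehouse with only the
# databricks-sql-connector package -- no PySpark, no databricks-connect.
# Hostname, HTTP path, and token below are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",
    access_token="dapi...",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchall())
```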
Yes,
Yeah, this is an interesting idea. As I said, I used Pandas initially just to get something working, but improving this to be more native to the engine being used, and ideally pushing more down to the engine and less onto the client, is where we would ultimately want to end up. I would need to think a bit more on the details for each engine though. If you want to try to tackle this for a specific engine, I would suggest opening up a draft PR that demonstrates the direction you want to go, and we could align on that before investing in a full implementation/tests.
Hi,
PySpark supports the following syntax for reading different kinds of sources (CSV, Parquet, Delta, etc.):
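Something along these lines (a minimal sketch of the standard DataFrameReader API; the paths are placeholders):

```python
# Standard PySpark DataFrameReader syntax; the paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_csv = spark.read.format("csv").option("header", "true").load("/data/input.csv")
df_parquet = spark.read.parquet("/data/input.parquet")
df_delta = spark.read.format("delta").load("/data/events")  # requires delta-spark
```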
Wouldn't it be better to use that instead of the Pandas reader SQLFrame is using right now?
It should also be possible to use the native Spark DataFrameWriter when using this backend.
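For example, a minimal sketch (the input and output paths are placeholders):

```python
# Native PySpark DataFrameWriter instead of a Pandas-based write;
# the paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input.parquet")
df.write.format("parquet").mode("overwrite").save("/data/output")
```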