ENH: Rename ExcelWriter to make clear attributes are not public #43088
Comments
Given xlsxwriter documents the use of the |
As far as I understand, it is possible for users to interact with |
Yes, in fact I quite like working with |
I think the worry is to have code that does something like (taken from the xlsxwriter docs):
Doing something directly with the writer may lead to unexpected behavior on the call to It's been suggested (#42222 (comment)) that perhaps we should just deprecate ExcelWriter entirely. The one use case for ExcelWriter that I am aware of is when you want to write multiple DataFrames to the same excel workbook/sheet. I don't believe there is currently a way to do that using Assuming |
Agree with @rhshadrach, let's deprecate ExcelWriter |
Re deprecation of ExcelWriter, will to_excel with mode='append' allow writing multiple sheets to the excel file at once in some manner? Won't there be perf loss if you're writing a bunch of sheets to an Excel file and you have to close and reopen it each time, esp. if it's stored on a remote? e.g. currently this works:

    with pd.ExcelWriter('s3://my_bucket/my_obj.xlsx') as writer:
        for i in range(50):
            dfs[i].to_excel(writer, sheet_name=vars[i])

Just want to make sure there's no loss to the (I'd imagine substantially more common) case of writing a multiple-sheet workbook once, as the ability to modify/add sheets to existing workbooks is added. I'd also note that things like #42222 (writing multiple frames to one sheet) also seem more useful to me than modifying existing workbooks, FWIW.

I assume the supported means of using fancier formatting, nonstandard locations, or charts going forward would just be using openpyxl or xlsxwriter directly without any reference to pandas' to_excel function; which, you know, fair enough. It just seems like that was a useful, if unsupported, integration, so losing it is rather disappointing. |
Not sure it should be deprecated yet @jreback - but looking into the impacts it may have.
@Liam3851 are you able to benchmark this? You can compare the time to write using code you posted against just calling
I don't believe this is the case. For nonstandard locations, isn't this just using startrow/startcol? For fancier formatting, see #40231 (comment). Not sure about charts. |
I'd think the overhead would actually come from large excel files, if I'm understanding correctly (I may not be). My impression is that instead of ExcelWriter, we'd use df.to_excel with mode='a', repeatedly appending to the same file. In the remote case though, instead of committing the file as one operation via the end of the ExcelWriter, we would now write a file (sending it to S3), then open it for append, requiring reading it back (retrieving it from S3), close the file again (sending it back to S3), etc. You're basically looking at 2n-1 file transfers to make a workbook of n sheets, and need to transfer O(n^2) bytes of data back and forth. If I write a workbook of 15 sheets of 5 MB each I need to transfer 1.1 GB of data. Of course this is true without the remote too, but the perf cost is clearer if you picture the transfer going across a 100 Mbps WAN.
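The 2n-1 transfers and the roughly 1.1 GB total can be checked with a few lines of arithmetic. This is only a sketch: it assumes every append downloads the whole current file and uploads the whole new one, that each sheet adds exactly 5 MB, and that 1 MB is 10^6 bytes. `transfer_totals` is a hypothetical helper, not a pandas function.

```python
def transfer_totals(n_sheets, sheet_mb):
    """Count transfers and megabytes moved when a workbook is built by repeated appends."""
    # After writing sheet k the file holds k sheets, and the whole file is uploaded.
    uploads = [sheet_mb * k for k in range(1, n_sheets + 1)]
    # Before every append except the first, the current file is downloaded.
    downloads = [sheet_mb * k for k in range(1, n_sheets)]
    return len(uploads) + len(downloads), sum(uploads) + sum(downloads)


transfers, total_mb = transfer_totals(15, 5)
print(transfers, total_mb)  # 29 1125

# Time on a 100 Mbps link, ignoring latency and protocol overhead.
seconds = total_mb * 8 / 100
print(seconds)  # 90.0
```

Writing the same workbook once through a single writer would upload only the final 75 MB under the same assumptions.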
You're absolutely right about startrow/startcol and the addition of styler.to_excel obviating the need for much of the fancier formatting that used to exist. Conditional formatting in the cells is something I use sometimes in case a downstream recipient wants to add to it, but that is definitely a more niche use case and more suited to an external. |
Thanks for the comments on overhead @Liam3851 - agreed. For any of the charts/formatting examples coming from e.g. the xlsxwriter docs, I think this can still be accomplished in two steps - write the excel file and then open the excel file with xlsxwriter. Let me know if you think this is incorrect. This leaves (a) the overhead for writing multiple DataFrames and (b) perhaps slightly more convenient for users as the reasons for keeping ExcelWriter. I tracked down the origin of the comment about all attributes/methods being protected to #22359 (comment), but there is no reason given there. @jorisvandenbossche do you have any recollection as to why this was added? One last thing to note is that ExcelWriter itself is used for |
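One use case I had for fumbling with the

    import pandas as pd

    sheetname = "some_existing_sheet_in_file"
    # if_sheet_exists is only valid in append mode, so mode="a" is required
    with pd.ExcelWriter("file.xlsx", engine="openpyxl", mode="a", if_sheet_exists="overlay") as writer:
        ws = writer.book[sheetname]  # the existing openpyxl worksheet
        srow = ws.max_row + 1  # start below the existing data
        df.to_excel(writer, sheet_name=sheetname, startrow=srow)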
There is also the issue that data validation is not supported by openpyxl, so whenever appending to a workbook with data validation, this has to be re-added by openpyxl. Yes, you could re-open the book, but that could also take quite some (unneeded) time with large Excel files.
I also cannot think of something useful that would break everything. Regarding |
The concern is for users to do something like this:
This raises "ValueError: I/O operation on closed file", and other than all attributes being declared protected, there is nothing in the documentation informing a user what they can and cannot do. Certainly if we were to open up the attributes for user access, the close method should be protected from calls. Now this is straightforward, but with all the attributes on Looking at the code for Assuming this is the case for other engines, I am +1 on making all other attributes either read-only or protected, and allowing user read/write access to book. |
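The proposal in this comment (a public `book`, everything else read-only or protected) can be illustrated with a toy class. This is a hedged sketch using only the standard library: `ToyWriter` and its members are hypothetical and do not reproduce the pandas ExcelWriter implementation. It also shows how touching a protected handle too early leads to a ValueError about a closed file on the next write.

```python
import io


class ToyWriter:
    """Hypothetical stand-in for an Excel writer; not the pandas class."""

    def __init__(self):
        self._handle = io.BytesIO()  # protected: owned by the writer
        self.book = {}               # public: users may read and modify it

    @property
    def sheets(self):
        # Read-only view, rebuilt from book on each access.
        return dict(self.book)

    def write(self, sheet_name, payload):
        self.book[sheet_name] = payload
        self._handle.write(payload)

    def close(self):
        self._handle.close()


writer = ToyWriter()
writer.write("Sheet1", b"abc")

# A property without a setter rejects assignment.
try:
    writer.sheets = {}
    sheets_assignable = True
except AttributeError:
    sheets_assignable = False

# Reaching into the writer and closing its handle early breaks later writes.
writer._handle.close()
try:
    writer.write("Sheet2", b"def")
    message = ""
except ValueError as err:
    message = str(err)
print(message)
```

The leading underscore on `_handle` is only a convention; Python does not prevent the early close, which is why the naming and the documentation both matter.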
Thanks for the detailed comment @joeperdefloep - agree on the bit about
Typically workbooks have < 100 sheets, and as a result recreating |
For xlsxwriter, I'm seeing 0.1ms for 1000 sheets to create sheets:
|
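The cost of rebuilding a name-to-sheet mapping on every access can be estimated with a toy benchmark. This is only a sketch: `ToySheet` and `build_sheets_mapping` are hypothetical stand-ins, not xlsxwriter or pandas objects, so the number printed is not comparable with the 0.1 ms figure quoted above.

```python
import timeit


class ToySheet:
    """Hypothetical worksheet object that only carries a name."""

    def __init__(self, name):
        self.name = name


book_sheets = [ToySheet(f"Sheet{i}") for i in range(1000)]


def build_sheets_mapping(sheets):
    # Rebuild the {name: sheet} dictionary from scratch, as a derived property would.
    return {sheet.name: sheet for sheet in sheets}


# Best of 5 runs, 100 rebuilds per run, reported per rebuild.
runs = timeit.repeat(lambda: build_sheets_mapping(book_sheets), number=100, repeat=5)
per_call = min(runs) / 100
print(f"{per_call * 1e3:.3f} ms per rebuild of 1000 sheets")
```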
Hmm, I do think it is nice that sheets is a more general form. ODSwriter puts the tables ( |
@rhshadrach IIRC some of this got done, is this still active? |
Thanks @jbrockmendel - completed by #50093. |
Ref: #43068 (comment)
Copying from the comment above:
It seems this line in the documentation would be easy to miss, and it'd be better to make this clearer via the standard naming convention of a single leading underscore.
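One common way to carry out such a rename without breaking existing code at once is to keep the old public name as a property that warns and delegates to the underscore attribute. The sketch below assumes that approach purely for illustration; `ToyWriterWithShim`, `path` and `_path` are hypothetical names, and nothing here is taken from the pandas source.

```python
import warnings


class ToyWriterWithShim:
    """Hypothetical writer showing a rename from `path` to `_path`."""

    def __init__(self, path):
        self._path = path  # new, protected name

    @property
    def path(self):
        # Old public name: still works, but tells the caller it is going away.
        warnings.warn(
            "path is not part of the public API and will be removed",
            FutureWarning,
            stacklevel=2,
        )
        return self._path


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = ToyWriterWithShim("file.xlsx").path

print(value, caught[0].category.__name__)  # file.xlsx FutureWarning
```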