Commit 79067a7

ENH: add calamine excel reader (close #50395) (#54998)
1 parent 705d431 commit 79067a7

20 files changed: +290 -58 lines changed

ci/deps/actions-310.yaml

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ dependencies:
  - pymysql>=1.0.2
  - pyreadstat>=1.1.5
  - pytables>=3.7.0
+ - python-calamine>=0.1.6
  - pyxlsb>=1.0.9
  - s3fs>=2022.05.0
  - scipy>=1.8.1

ci/deps/actions-311-downstream_compat.yaml

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ dependencies:
  - pymysql>=1.0.2
  - pyreadstat>=1.1.5
  - pytables>=3.7.0
+ - python-calamine>=0.1.6
  - pyxlsb>=1.0.9
  - s3fs>=2022.05.0
  - scipy>=1.8.1

ci/deps/actions-311.yaml

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ dependencies:
  - pymysql>=1.0.2
  - pyreadstat>=1.1.5
  # - pytables>=3.7.0, 3.8.0 is first version that supports 3.11
+ - python-calamine>=0.1.6
  - pyxlsb>=1.0.9
  - s3fs>=2022.05.0
  - scipy>=1.8.1

ci/deps/actions-39-minimum_versions.yaml

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ dependencies:
  - pymysql=1.0.2
  - pyreadstat=1.1.5
  - pytables=3.7.0
+ - python-calamine=0.1.6
  - pyxlsb=1.0.9
  - s3fs=2022.05.0
  - scipy=1.8.1

ci/deps/actions-39.yaml

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ dependencies:
  - pymysql>=1.0.2
  - pyreadstat>=1.1.5
  - pytables>=3.7.0
+ - python-calamine>=0.1.6
  - pyxlsb>=1.0.9
  - s3fs>=2022.05.0
  - scipy>=1.8.1

ci/deps/circle-310-arm64.yaml

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ dependencies:
  - pymysql>=1.0.2
  # - pyreadstat>=1.1.5 not available on ARM
  - pytables>=3.7.0
+ - python-calamine>=0.1.6
  - pyxlsb>=1.0.9
  - s3fs>=2022.05.0
  - scipy>=1.8.1

doc/source/getting_started/install.rst

Lines changed: 1 addition & 0 deletions
@@ -281,6 +281,7 @@ xlrd 2.0.1 excel Reading Excel
  xlsxwriter                3.0.3              excel           Writing Excel
  openpyxl                  3.0.10             excel           Reading / writing for xlsx files
  pyxlsb                    1.0.9              excel           Reading for xlsb files
+ python-calamine           0.1.6              excel           Reading for xls/xlsx/xlsb/ods files
  ========================= ================== =============== =============================================================

  HTML

doc/source/user_guide/io.rst

Lines changed: 21 additions & 2 deletions
@@ -3453,7 +3453,8 @@ Excel files
  The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files
  using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files
  can be read using ``xlrd``. Binary Excel (``.xlsb``)
- files can be read using ``pyxlsb``.
+ files can be read using ``pyxlsb``. All formats can be read
+ using the :ref:`calamine<io.calamine>` engine.
  The :meth:`~DataFrame.to_excel` instance method is used for
  saving a ``DataFrame`` to Excel. Generally the semantics are
  similar to working with :ref:`csv<io.read_csv_table>` data.
@@ -3494,6 +3495,9 @@ using internally.

  * For the engine odf, pandas is using :func:`odf.opendocument.load` to read in (``.ods``) files.

+ * For the engine calamine, pandas is using :func:`python_calamine.load_workbook`
+   to read in (``.xlsx``), (``.xlsm``), (``.xls``), (``.xlsb``), (``.ods``) files.
+
  .. code-block:: python

     # Returns a DataFrame
@@ -3935,7 +3939,8 @@ The :func:`~pandas.read_excel` method can also read binary Excel files
  using the ``pyxlsb`` module. The semantics and features for reading
  binary Excel files mostly match what can be done for `Excel files`_ using
  ``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types
- in files and will return floats instead.
+ in files and will return floats instead (you can use :ref:`calamine<io.calamine>`
+ if you need to recognize datetime types).

  .. code-block:: python

@@ -3947,6 +3952,20 @@ in files and will return floats instead.
  Currently pandas only supports *reading* binary Excel files. Writing
  is not implemented.

+ .. _io.calamine:
+
+ Calamine (Excel and ODS files)
+ ------------------------------
+
+ The :func:`~pandas.read_excel` method can read Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``)
+ and OpenDocument spreadsheets (``.ods``) using the ``python-calamine`` module.
+ This module is a binding for the Rust library `calamine <https://crates.io/crates/calamine>`__
+ and is faster than other engines in most cases. The optional dependency ``python-calamine`` needs to be installed.
+
+ .. code-block:: python
+
+    # Returns a DataFrame
+    pd.read_excel("path_to_file.xlsb", engine="calamine")

  .. _io.clipboard:

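
Reviewer sketch, not part of the commit: one way calling code might pick the new engine and fall back to the per-format engines when ``python-calamine`` is not installed. The helper ``choose_excel_engine``, the ``FALLBACK_ENGINES`` table, and the ``calamine_available`` parameter are hypothetical names. The extension-to-engine mapping follows the user guide text in this diff, except ``.xlsm`` to ``openpyxl``, which is an assumption.

```python
from importlib.util import find_spec
from pathlib import Path

# Per-format engines named in the user guide text of this diff.
# Mapping ".xlsm" to "openpyxl" is an assumption of this sketch.
FALLBACK_ENGINES = {
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xls": "xlrd",
    ".xlsb": "pyxlsb",
    ".ods": "odf",
}


def choose_excel_engine(path, calamine_available=None):
    """Return an engine name suitable for ``pd.read_excel(..., engine=...)``.

    Hypothetical helper: prefers "calamine" when the ``python_calamine``
    module can be imported, otherwise falls back by file extension.
    """
    if calamine_available is None:
        # find_spec returns None when a top-level module is not installed.
        calamine_available = find_spec("python_calamine") is not None

    suffix = Path(path).suffix.lower()
    if suffix not in FALLBACK_ENGINES:
        raise ValueError(f"unsupported spreadsheet extension: {suffix!r}")

    return "calamine" if calamine_available else FALLBACK_ENGINES[suffix]
```

With a pandas build that includes this commit, the result could be passed straight through, for example ``pd.read_excel(path, engine=choose_excel_engine(path))``.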

doc/source/whatsnew/v2.2.0.rst

Lines changed: 20 additions & 3 deletions
@@ -14,10 +14,27 @@ including other versions of pandas.
  Enhancements
  ~~~~~~~~~~~~

- .. _whatsnew_220.enhancements.enhancement1:
+ .. _whatsnew_220.enhancements.calamine:

- enhancement1
- ^^^^^^^^^^^^
+ Calamine engine for :func:`read_excel`
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The ``calamine`` engine was added to :func:`read_excel`.
+ It uses ``python-calamine``, which provides Python bindings for the Rust library `calamine <https://crates.io/crates/calamine>`__.
+ This engine supports Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``) and OpenDocument spreadsheets (``.ods``) (:issue:`50395`).
+
+ This engine has two advantages:
+
+ 1. Calamine is often faster than other engines. Some benchmarks show it running up to 5x faster than 'openpyxl', 20x faster than 'odf', 4x faster than 'pyxlsb', and 1.5x faster than 'xlrd'.
+    However, 'openpyxl' and 'pyxlsb' are faster when reading only a few rows from large files because they iterate over rows lazily.
+ 2. Calamine recognizes datetime values in ``.xlsb`` files, unlike 'pyxlsb', which is the only other engine in pandas that can read ``.xlsb`` files.
+
+ .. code-block:: python
+
+    pd.read_excel("path_to_file.xlsb", engine="calamine")
+
+
+ For more, see :ref:`io.calamine` in the user guide on IO tools.

  .. _whatsnew_220.enhancements.enhancement2:

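
Reviewer sketch, not part of the commit: the second advantage matters because, per the user guide hunk in this diff, ``pyxlsb`` returns date cells as floats. The snippet shows the manual conversion that the calamine engine makes unnecessary. It assumes the workbook uses Excel's default 1900 date system, where serial values count days from 1899-12-30 (valid for dates from 1900-03-01 onward). ``excel_serial_to_datetime`` is a hypothetical helper, not a pandas or pyxlsb API.

```python
from datetime import datetime, timedelta

# Assumption: default 1900 date system. Day 0 is 1899-12-30, which
# absorbs Excel's fictitious 1900-02-29 for dates after February 1900.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial number (days, with fractional time) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


# A float such as 45000.5, as a float-returning engine might yield for a date cell.
print(excel_serial_to_datetime(45000.5))  # 2023-03-15 12:00:00
```

A column of such floats would need this conversion applied after reading with ``engine="pyxlsb"``, whereas the text above states that ``engine="calamine"`` recognizes datetimes directly.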

environment.yml

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ dependencies:
  - pymysql>=1.0.2
  - pyreadstat>=1.1.5
  - pytables>=3.7.0
+ - python-calamine>=0.1.6
  - pyxlsb>=1.0.9
  - s3fs>=2022.05.0
  - scipy>=1.8.1
