Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supports list-like Python objects for Series comparison. #2022

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Jan 27, 2021

Currently Series doesn't support the comparison to list-like Python objects such as list, tuple, dict, set.

>>> kser
0    1
1    2
2    3
dtype: int64

>>> kser == [3, 2, 1]
Traceback (most recent call last):
...
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o77.equalTo.
...

This PR proposes supporting them as well for Series comparison.

>>> kser
0    1
1    2
2    3
dtype: int64

>>> kser == [3, 2, 1]
0    False
1     True
2    False
dtype: bool

This should resolve #2018

@ueshin
Copy link
Collaborator

ueshin commented Jan 27, 2021

Found a bug:

>>> pser = pd.Series([1,2,3], index=[10,20,30])
>>> pser == [3, 2, 1]
10    False
20     True
30    False
dtype: bool

whereas:

>>> kser = ks.Series([1,2,3], index=[10,20,30])
>>> kser == [3, 2, 1]
0     False
1     False
10    False
2     False
30    False
20    False
dtype: bool

@codecov-io
Copy link

codecov-io commented Jan 27, 2021

Codecov Report

Merging #2022 (44e34f6) into master (901cea6) will decrease coverage by 1.52%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2022      +/-   ##
==========================================
- Coverage   94.70%   93.18%   -1.53%     
==========================================
  Files          54       54              
  Lines       11480    11393      -87     
==========================================
- Hits        10872    10616     -256     
- Misses        608      777     +169     
Impacted Files Coverage Δ
databricks/koalas/indexes/base.py 97.62% <100.00%> (+0.19%) ⬆️
databricks/koalas/series.py 96.72% <100.00%> (-0.07%) ⬇️
databricks/koalas/usage_logging/__init__.py 27.58% <0.00%> (-64.92%) ⬇️
databricks/koalas/usage_logging/usage_logger.py 47.82% <0.00%> (-52.18%) ⬇️
databricks/koalas/__init__.py 82.66% <0.00%> (-8.38%) ⬇️
databricks/koalas/accessors.py 86.43% <0.00%> (-7.04%) ⬇️
databricks/conftest.py 93.22% <0.00%> (-6.78%) ⬇️
databricks/koalas/namespace.py 79.91% <0.00%> (-4.50%) ⬇️
databricks/koalas/generic.py 90.57% <0.00%> (-2.68%) ⬇️
databricks/koalas/typedef/typehints.py 92.03% <0.00%> (-1.77%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 901cea6...44e34f6. Read the comment docs.

@ueshin
Copy link
Collaborator

ueshin commented Jan 27, 2021

Btw, we might also want to support binary operations with list-like Python objects? cc @HyukjinKwon

>>> pser + [3, 2, 1]
10    4
20    4
30    4
dtype: int64
>>> pser - [3, 2, 1]
10   -2
20    0
30    2
dtype: int64
>>> [3, 2, 1] + pser
10    4
20    4
30    4
dtype: int64

@itholic
Copy link
Contributor Author

itholic commented Feb 4, 2021

FYI: Seems like pandas has some inconsistent behavior as below.

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

>>> a.eq(b)
a     True
b    False
c    False
d    False
e    False
dtype: bool

>>> a == b
Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled Series objects

However, in their API doc for Series.eq, it says "Equivalent to series == other".

I posted question to pandas repo, and will share if they response.

@itholic
Copy link
Contributor Author

itholic commented Feb 4, 2021

Btw, we might also want to support binary operations with list-like Python objects? cc @HyukjinKwon

>>> pser + [3, 2, 1]
10    4
20    4
30    4
dtype: int64
>>> pser - [3, 2, 1]
10   -2
20    0
30    2
dtype: int64
>>> [3, 2, 1] + pser
10    4
20    4
30    4
dtype: int64

Let me do this in the separated PR since there may be inconsistent cases like eq.

@ueshin
Copy link
Collaborator

ueshin commented Feb 4, 2021

The eq case sounds different from the topic here which is binary operations between Series and list.

databricks/koalas/series.py Outdated Show resolved Hide resolved
def __eq__(self, other):
if isinstance(other, (list, tuple)):
other = ks.Index(other, name=self.name)
# pandas always returns False for all items with dict and set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why pandas behaves like this ..


equals = eq

def __eq__(self, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Index support this case too? it might be best to move to base.py.

@xinrong-meng
Copy link
Contributor

HyukjinKwon pushed a commit to apache/spark that referenced this pull request Oct 13, 2021
…parison

### What changes were proposed in this pull request?

This PR proposes to implement `Series` comparison with list-like Python objects.

Currently `Series` doesn't support the comparison to list-like Python objects such as `list`, `tuple`, `dict`, `set`.

**Before**

```python
>>> psser
0    1
1    2
2    3
dtype: int64

>>> psser == [3, 2, 1]
Traceback (most recent call last):
...
TypeError: The operation can not be applied to list.
...
```

**After**

```python
>>> psser
0    1
1    2
2    3
dtype: int64

>>> psser == [3, 2, 1]
0    False
1     True
2    False
dtype: bool
```

This was originally proposed in databricks/koalas#2022, and all reviews in origin PR has been resolved.

### Why are the changes needed?

To follow pandas' behavior.

### Does this PR introduce _any_ user-facing change?

Yes, the `Series` comparison with list-like Python objects now possible.

### How was this patch tested?

Unittests

Closes #34114 from itholic/SPARK-36438.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't use __eq__ with list
5 participants