Skip to content

Multiarray searchsorted fails #14833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Anaphory opened this issue Dec 8, 2016 · 15 comments · May be fixed by #61435
Open

Multiarray searchsorted fails #14833

Anaphory opened this issue Dec 8, 2016 · 15 comments · May be fixed by #61435
Assignees
Labels

Comments

@Anaphory
Copy link

Anaphory commented Dec 8, 2016

Code Sample, a copy-pastable example if possible

pandas.MultiIndex([[0],["a"]], [[0],[0]]).searchsorted((1,"b"))

Problem description

The entry (1,"b") should come after the existing (0,"a") in the MultiIndex. (Alternatively, MultiIndex could throw a clean error message.) Instead, an intransparent exception is raised:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/site-packages/pandas/core/base.py", line 1156, in searchsorted
    return self.values.searchsorted(key, side=side, sorter=sorter)
TypeError: unorderable types: tuple() > str()

This is because Index.searchsorted naïvely passes its arguments to numpy.searchsorted, which is unaware that its second argument is a sequence of tuples, not a plain array just of dimension one higher.

Expected Output

1

Output of pd.show_versions()

# Paste the output here pd.show_versions() here NSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.8.10-1-ARCH machine: x86_64 processor:  byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: en_GB.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: 1.0.0
xlwt: 0.8.0
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Dec 8, 2016

I suppose could just disable this. numpy doesn't undertsand object array searchsorted generally

In [4]: pandas.MultiIndex([[0],["a"]], [[0],[0]]).values.searchsorted((1,"b"))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b9352c7b8bea> in <module>()
----> 1 pandas.MultiIndex([[0],["a"]], [[0],[0]]).values.searchsorted((1,"b"))

TypeError: unorderable types: tuple() > str()

maybe just raise a NotImplementedError. This is pretty much a useless operation anyhow, you always to search by levels via indexing

@jreback jreback added Difficulty Novice Error Reporting Incorrect or improved errors from pandas MultiIndex labels Dec 8, 2016
@jreback jreback added this to the Next Major Release milestone Dec 8, 2016
@ritwickdsouza
Copy link

Is this issue still available to fix ?

@jreback
Copy link
Contributor

jreback commented Feb 1, 2017

yes,

note that we should simply define .searchsorted in pandas/indexes/multi.py and use the direct indexers, .get_indexer, which is way more efficient (as its hashtable based).

@ritwickdsouza
Copy link

@jreback I am new to pandas, could you throw some light on how i can use .get_indexer to implement .searchsorted ?

@jreback
Copy link
Contributor

jreback commented Feb 3, 2017

.get_indexer returns the indexer , IOW the location of the point. -1 marks not found items. This works on any multi-index. Note these don't even have to be sorted (but is more efficient if they are).

In [2]: i = pd.MultiIndex.from_tuples([(0, 'a'), (0, 'b'), (1, 'a')])

In [3]: i
Out[3]: 
MultiIndex(levels=[[0, 1], ['a', 'b']],
           labels=[[0, 0, 1], [0, 1, 0]])

In [4]: i.values
Out[4]: array([(0, 'a'), (0, 'b'), (1, 'a')], dtype=object)

In [5]: i.get_indexer([(0,'b'), (1, 'a'), (2, 'c')])
Out[5]: array([ 1,  2, -1])

Here's what searchsorted does; I am using an interger array because numpy doesn't play nice with tuples. It returns the indexer of the match (IOW where it is in the array). Note if something is not found it returns the last index before that (which is really unintuitve!)

In [6]: np.array([1, 2, 3]).searchsorted([2, 3])
Out[6]: array([1, 2])

In [7]: np.array([1, 2, 3]).searchsorted([2, 3, 5])
Out[7]: array([1, 2, 3])

@bhavybarca
Copy link

@jreback this issue still open ?

@jreback
Copy link
Contributor

jreback commented Mar 2, 2018

yes

@bhavybarca
Copy link

@TomAugspurger @jreback i am a little bit confused to what should i do exactly, i mean should i simply raise a NotImplementedError as said by @jreback ? or should i replace the searchsorted by get_indexer and somehow get a value for even tuples as said is issue example

@TomAugspurger
Copy link
Contributor

If supporting searchsorted is an option that makes sense. #14833 (comment) indicates that that can be done using get_indexer.

@btel
Copy link

btel commented Oct 16, 2018

Since there was no activity since March, I would be interested in working on this issue. It might be nice to implement searchsorted as suggested by @jreback, but the issue I see is that numpy's searchsorted can give the location of an element in a sorted array that would keep the sort order even if the element does not exist. Here is an example:

>>> np.searchsorted([1, 3, 5], [2])
array([1])

From what I understand .get_indexer will simply return -1 (element not found). Naive implementation of numpy's behaviour might use bisect module from Python's standard library for the not-found elements, but it would be rather inefficient.

Interestingly, numpy's searchsorted can also work with tuples if we define an appropriate dtype:

>>> dtype = [("int", 'i8'), ("str", "U1" )]
>>> arr = np.array([(0, 'a'), (0, 'b'), (1, 'c')], dtype=dtype)
>>> arr.searchsorted(np.array([(1, 'a')], dtype=dtype))
array([2])

A possible implementation of searchsorted would then coerce the multiindex to an ndarray with adapted dtype, and use the numpy's builtin searchsorted. What do you think?

@jreback
Copy link
Contributor

jreback commented Oct 16, 2018

so this might be easier now as the implementation of MI was recently refactored to directly keep the underlying codes in the cython table

@SaturnFromTitan
Copy link
Contributor

take

@Condielj
Copy link

take

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@GSAUC3
Copy link

GSAUC3 commented May 10, 2025

Hi, is anyone still working on this, or may I take it up?
if the answer is NO, i.e. no one is working on this, then i have a couple of question:

btel suggested in #14833 (comment)

This can be one way to handle it. but it assumes the input array to be of 2-dimensional.
Should the input array be restricted to 2 dimensional ?
May i go ahead with this implementation, or should i just simple raise NotImplementedError ?

@GSAUC3
Copy link

GSAUC3 commented May 11, 2025

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment