Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-130285: Fix handling of zero or empty counts in random.sample() #130291

Merged
merged 6 commits into from
Feb 21, 2025

Conversation

rhettinger
Copy link
Contributor

@rhettinger rhettinger commented Feb 19, 2025

First draft for discussion. Will add doc updates and misc/news entry in a bit.

@Dominik1123
Copy link
Contributor

Looks good. When updating the docs, one could also mention that negative counts may lead to invalid results. When considering the equivalence for repeated elements:

sample(['red', 'blue'], counts=[4, 2], k=5) is equivalent to sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

it may be tempting to extrapolate this to negative counts based on the observation that repeating (a sequence) a negative number of times always gives an empty sequence:

>>> [1]*-1
[]
>>> list(it.repeat(1, -1))
[]
>>> list(range(-1))
[]

Though that's probably a very rare edge case when considering random.sample.

@rhettinger
Copy link
Contributor Author

I'm having some misgivings about this. The docs only say, "Repeated elements can be specified one at a time or with the optional keyword-only counts parameter". That speaks to the case of one-or-more and makes no promises about a count total of zero.

If I understand your original application, a sample was chosen from a pool, the selections were removed from the pool, and the process was repeated. Presumably along the way k was being reduced as well to avoid a ValueError.

At first that seemed reasonable to me, but the loop would need a stopping condition. An empty pool or k==0seems like a reasonable way to do that.

Also if that was the application, even better approaches are possible with the current API. Shuffle the dataset and extract subgroups as needed. That samples without replacement until the pool is drained. Likewise, sample could be called just once and the subgroups extracted from the supersample. The docs speak directly to this use case, "The resulting list is in selection order so that all sub-slices will also be valid random samples."

So, I'm a little dubious that PR is needed at all, that sample('ab', counts=[0,0], k=0) or sample([], 0, counts=[]) is something we want to encourage or enable, or that the docs for counts implied anything beyond simplifying "repeated elements".

@Dominik1123
Copy link
Contributor

Dominik1123 commented Feb 20, 2025

I'm not sure if this PR is the right place to discuss the details of my application, but basically it uses multiple pools and a stopping condition that excludes k==0:

while more_items_need_to_be_selected:
    pool = ...  # some logic to select the pool
    k = ...  # some logic to choose k; guarantees k > 0
    try:
        items = random.sample(pool, k, counts=[weights[x] for x in pool])
    except ValueError:  # the pool doesn't have enough items
        ...
    else:
        ...

The purpose of the try/except is to handle the situation len(pool) < k and I expected 0 == len(pool) < k simply to be a special case of this. Especially since this part of the documentation

If the sample size is larger than the population size, a ValueError is raised.

comes after the explanation of the parameter counts. So, I didn't expect that specifying counts would make a difference. Also, the IndexError is not documented and the fact that it is raised stems from an implementation detail. For those reasons it feels like a bug to me.

I didn't encounter the other case (yet), where all counts are zero, but it seems reasonable to me. If the counts are used to control the specific population from which samples are chosen, some of the members may reach a count of zero, i.e., effectively being removed from the population. This is also a test case. So, if the counts are used to control the population, the only way to get to an empty population is when all counts are set to zero. In my opinion, this should be equivalent to sample([], k).

I don't think that the docs for counts imply anything beyond simplifying "repeated elements", as you wrote. But what exactly is "repeating elements" in Python? Above I gave three examples and they work with zero and even with negative counts:

>>> [...]*0, [...]*-1
([], [])
>>> list(it.repeat(..., 0)), list(it.repeat(..., -1))
([], [])
>>> list(range(0)), list(range(-1))
([], [])

So, repeating an element zero times seems reasonable to me, as it implies the absence of that element (not only in Python). Negative counts are questionable, though.

Whether it's really needed is a different question, though. From a practical point of view? Probably not. Someone encountering either of the two errors will not have a too hard time figuring out what went wrong and adjusting their code accordingly. I changed my code to

if len(pool) < k:
    ...

which is even more explicit, so it can go without a comment. But I can't use try/except (I wouldn't want to rely on an undocumented IndexError here).

Is it needed for the sake of correctness? I would say, yes. As I explained above, both scenarios appear reasonable to me and the behavior in the first one even feels like a bug.

@rhettinger
Copy link
Contributor Author

ISTM that an explicit k==0 stop condition is warranted when using sample in a loop that progressively reduces the population counts. Further, it seems that a much better design would be to shuffle the whole population and extract the subsamples as needed. Also, more is being read into the counts documentation than was intended. The phrase "repeated elements can be specified one at a time" meant one-or-more.

That said, I don't see any downside for supporting the more expansive reading as zero-or-more even though that can only succeed when k==0. So, I'll move this forward.

@rhettinger rhettinger marked this pull request as ready for review February 21, 2025 17:02
@rhettinger rhettinger added needs backport to 3.12 bug and security fixes needs backport to 3.13 bugs and security fixes and removed awaiting core review labels Feb 21, 2025
@rhettinger rhettinger merged commit 286c517 into python:main Feb 21, 2025
45 checks passed
@miss-islington-app
Copy link

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

@rhettinger rhettinger deleted the sample_zero_counts branch February 21, 2025 17:33
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 21, 2025
…e() (pythongh-130291)

(cherry picked from commit 286c517)

Co-authored-by: Raymond Hettinger <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Feb 21, 2025

GH-130416 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Feb 21, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 21, 2025
…e() (pythongh-130291)

(cherry picked from commit 286c517)

Co-authored-by: Raymond Hettinger <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Feb 21, 2025

GH-130417 is a backport of this pull request to the 3.12 branch.

@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 FreeBSD14 3.x has failed when building commit 286c517.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1232/builds/4734) and take a look at the build logs.
  4. Check if the failure is related to this commit (286c517) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1232/builds/4734

Failed tests:

  • test_interpreters

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "<frozen getpath>", line 483, in <module>
ValueError: embedded null byte
Warning -- Uncaught thread exception: InterpreterError
Exception in thread Thread-282 (run):
RuntimeError: error evaluating path


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/threading.py", line 1054, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/threading.py", line 996, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/test/test_interpreters/test_stress.py", line 30, in task
    interp = interpreters.create()
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/test/support/interpreters/__init__.py", line 76, in create
    id = _interpreters.create(reqrefs=True)
interpreters.InterpreterError: interpreter creation failed
k


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/threading.py", line 1054, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/threading.py", line 996, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/test/test_interpreters/test_stress.py", line 47, in run
    interp = interpreters.create()
  File "/home/buildbot/buildarea/3.x.opsec-fbsd14/build/Lib/test/support/interpreters/__init__.py", line 76, in create
    id = _interpreters.create(reqrefs=True)
interpreters.InterpreterError: interpreter creation failed
k


Traceback (most recent call last):
  File "<frozen getpath>", line 483, in <module>
ValueError: embedded null byte
Warning -- Uncaught thread exception: InterpreterError
Exception in thread Thread-216 (task):
RuntimeError: error evaluating path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants