Skip to content

Resolve intermittent capture_output deadlock #3601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
May 14, 2025

Conversation

jsiirola
Copy link
Member

@jsiirola jsiirola commented May 8, 2025

Fixes # .

Summary/Motivation:

We are continuing to see intermittent deadlocks or attempts to write to closed file descriptors in the HiGHS tests on main. I think I have (finally) tracked down the root cause to the use of capture_input() in the Gurobi interface __del__ method. It looks like the GC will (randomly) interrupt a HiGHS test to run the collector (which, if it is collecting a Gurobi interface will call capture_output. If this happens during another test's use of capture_output.__enter__ or .__exit__, then the Gurobi interface's use of capture could happen when the underlying output streams are in a transitory state (and could be closed).

This PR resolves that by:

  • rework the Gurobi interface to use it's own private Env (which suppresses the output to stdout and removes the need to use capture_output in __del__.
  • Add thread locks so that one call to capture_output will not interrupts another fall to __enter__ / exit`
  • improve some implementations within TeeStream / capture_output
  • add additional tests that catch some edge cases identified during development

Changes proposed in this PR:

  • (see above)

Legal Acknowledgement

By contributing to this software project, I have read the contribution guide and agree to the following terms and conditions for my contribution:

  1. I agree my contributions are submitted under the BSD license.
  2. I represent I am authorized to make the contributions and grant the license. If my employer has rights to intellectual property that includes these contributions, I represent that I have received permission to make contributions and grant the required license on behalf of that employer.

Copy link

codecov bot commented May 8, 2025

Codecov Report

Attention: Patch coverage is 94.89796% with 5 lines in your changes missing coverage. Please review.

Project coverage is 88.76%. Comparing base (457f436) to head (41b722a).

Files with missing lines Patch % Lines
pyomo/contrib/solver/solvers/gurobi_persistent.py 62.50% 3 Missing ⚠️
pyomo/contrib/solver/solvers/gurobi_direct.py 95.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3601   +/-   ##
=======================================
  Coverage   88.76%   88.76%           
=======================================
  Files         891      891           
  Lines      102297   102320   +23     
=======================================
+ Hits        90802    90826   +24     
+ Misses      11495    11494    -1     
Flag Coverage Δ
builders 26.64% <65.30%> (+0.02%) ⬆️
default 84.87% <88.77%> (?)
expensive 34.01% <72.44%> (?)
linux 86.55% <91.83%> (-1.95%) ⬇️
linux_other 86.55% <91.83%> (+<0.01%) ⬆️
osx 82.84% <88.77%> (+<0.01%) ⬆️
win 84.75% <91.83%> (+<0.01%) ⬆️
win_other 84.75% <91.83%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -559,6 +579,31 @@ def test_capture_output_stack_error(self):
sys.stdout, sys.stderr = old
logging.getLogger('pyomo.common.tee').handlers.clear()

def test_atomic_deadlock(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nifty test, I like it.

@@ -288,7 +288,7 @@ def set_instance(self, model):
f'({self.available()}).'
)

with TeeStream(*ostreams) as t, capture_output(t.STDOUT, capture_fd=True):
with capture_output(TeeStream(*ostreams), capture_fd=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between doing this and just doing with TeeStream(*ostreams) as OUT? Do we have a preference? (I ask because we don't seem to have decided on a "standard" and we probably should)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capture_output wraps the streams it is passed into a TeeStream in order to get handles it can give to stdout / stderr. So the former implementation actually had 2 nested TeeStream objects. If you just pass the TeeStream object to capture_output, it won't create a new TeeStream, so the new implementation only has a single TeeStream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, let me rephrase. In ipopt currently, we do this:

                with TeeStream(*ostreams) as t:
                    timer.start('subprocess')
                    process = subprocess.run(
                        cmd,
                        timeout=timeout,
                        env=env,
                        universal_newlines=True,
                        stdout=t.STDOUT,
                        stderr=t.STDERR,
                        check=False,
                    )
                    timer.stop('subprocess')
                    # This is the stuff we need to parse to get the iterations
                    # and time
                    (iters, ipopt_time_nofunc, ipopt_time_func, ipopt_total_time) = (
                        self._parse_ipopt_output(ostreams[0])
                    )

In GUROBI, we do:

            with capture_output(TeeStream(*ostreams), capture_fd=False):
                gurobi_model = gurobipy.Model()

                timer.start('transfer_model')
                x = gurobi_model.addMVar(
                    len(repn.columns),
                    lb=lb,
                    ub=ub,
                    obj=repn.c.todense()[0] if repn.c.shape[0] else 0,
                    vtype=vtype,
                )

Which one is preferable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are both correct for their respective use cases:

  • subprocess.run needs to be passed 2 TextIO objects, so it is appropriate (and necessary) to pass the raw TextIO objects from TeeStream.
  • capture_output needs a single stream to send all output to (we have no added support for separate our/err targets (so internally it creates a TeeStream so that it can get raw file descriptors if it needs them (for capture_fd=True). Here Passing a TeeStream directly avoids the need to create a second one that wraps the stream from the first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thank you!

@mrmundt
Copy link
Contributor

mrmundt commented May 14, 2025

Per @blnicho - we could merge this today if no one has time to do another review. Ergo, risking it for the biscuit and merging.

@mrmundt mrmundt merged commit f680e04 into Pyomo:main May 14, 2025
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants