ACA-Py Test Sets Failing to start...usually #896

swcurran · 2025-01-14T00:16:44Z

Recently (since Dec. 20) the ACA-Py tests are failing, because the agents are not starting. Running the tests locally work fine -- agents run and all ACA-Py tests pass.

When I first saw this, I thought it was the unpinned poetry update that broke ACA-Py briefly, but I see that is not the issue. When I look at the logs from the runs, I see that ngrok gives errors (see attached). However, I checked a successful run from a few days ago (ACA-Py aip10 tests) and the ngrok log looked the same.

docker-ngrok-tails-server-1.log

swcurran · 2025-01-14T00:17:27Z

@ianco -- could you look into this? This is pretty high priority -- not having the tests running at all is bad.

ianco · 2025-01-14T16:06:11Z

@ianco -- could you look into this? This is pretty high priority -- not having the tests running at all is bad.

on it

ianco · 2025-01-14T16:44:45Z

I think the ngrok error is a red herring. Ngrok changed their model awhile ago to require a "token" - which we don't have - we should probably remove ngrok out of the tails startup ...

swcurran · 2025-01-14T17:43:13Z

Agreed - since it gives the same error on success or failure, it's not like the issue. Looking forward to determining what it is!

swcurran · 2025-01-14T17:44:32Z

Re-running the test now. Started to yesterday, but stopped because of github issues. I'll see if anything pops up in that.

ianco · 2025-01-14T18:26:12Z

I'll update the manage script to dump out the aca-py logs if the agent fails to start. (I'm not having any luck reproducing the error)

ianco · 2025-01-14T19:09:23Z

I restarted a couple of the tests (aca-py aip10 and aip20) and they are both working :-S

I'll check tomorrow to see if any of the "auto" runs failed.

swcurran · 2025-01-14T19:27:20Z

I ran aip20 today, and it did run -- w00t. However, a number of the tests failed. I'm re-running locally, but pretty sure I ran them yesterday and all passed. Weird...

ianco · 2025-01-15T16:23:55Z

I'm seeing this error:

2025-01-15 00:26:11,751 acapy_agent.core.plugin_registry ERROR Module doesn't exist: redis_events.v1_0.redis_queue.events

ianco · 2025-01-15T16:49:41Z

... not sure why there is a redis error when aca-py is starting up (especially since it seems to be intermittent).

We're not actually using (or running) redis so maybe we should remove redis references from the aca-py backchannel?

@swcurran @jamshale any thoughts?

swcurran · 2025-01-15T16:55:20Z

I think that is fine — it’s not crucial. Are those errors showing up in the failed tests? It’s odd those are failing in the GHA runs, but working locally. I’m trying to understand why redis would be used in some tests in the runset, but not all or none.

jamshale · 2025-01-15T17:00:35Z

I don't have a lot of experience with this project but I remember it had something for testing the redis events plugin. I don't think this is the right place for that testing, and the plugin repo is also testing it. Perhaps that whole thing should be removed from here.

ianco · 2025-01-15T18:29:27Z

The redis updated didn't fix the issue! (Just a "redis" herring.) It looks like the agent is starting but then the backchannel either can't connect or times out. Back to the drawing board :-(

ianco · 2025-01-16T01:19:27Z

OK I managed to reproduce the issue locally (just by fluke I think). It looks like at some point on startup the aca-py agent just "hangs" and then the startup times out. Might be a threading issue where aca-py "yields" and then never gets control back?

ianco · 2025-01-16T18:46:49Z

OK the agent is just getting frozen when starting up, here is a log from my local this morning:

Starting aca-py agent ...
2025-01-16 18:26:37: starting backchannel process
Got extra args: {"wallet-type":"askar-anoncreds"}
Listening to backchannel on port 9020
Listening to web_hooks on port 9023
Got DID: CgESJTUURkSU9dXbCUpJuJ
Location of ACA-Py: ./bin/aca-py
2025-01-16 18:26:37: Starting agent sub-process ...
agent starting with params: 
['./bin/aca-py', 'start', '--label', 'aca-py.Acme', '...']
2025-01-16 18:26:38,261 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.protocols.routing
2025-01-16 18:26:38,262 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.protocols.did_rotate
...
2025-01-16 18:26:38,409 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.anoncreds.default.legacy_indy
2025-01-16 18:26:38,411 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.revocation_anoncreds
>>> 2025-01-16 18:27:38: Error starting agent on admin url http://host.docker.internal:9022

Notice the agent isn't doing anything from 18:26:38 to 18:27:38 ... maybe a threading issue?

ianco · 2025-01-20T18:23:27Z

This is still happening sporadically in the github test runs, however this morning on my local it's happening consistently and I can't run any tests :-(

I found this link: https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

(and a couple of stackoverflow comments that were similar)

Basically there's a known bug with subprocess.POpen where the process will "hang" if too much data gets stuffed into the stdout or stderr buffer. We have processes reading these two pipes, however there's a possible timing issue where the data is getting stuffed in faster than it's getting read out.

I'll take a look, since this is blocking me right now ...

ianco · 2025-01-20T19:23:25Z

Basically there's a known bug with subprocess.POpen where the process will "hang" if too much data gets stuffed into the stdout or stderr buffer

Using tempfile instead of pipes seems to work, however now I'm losing the output ...

ianco self-assigned this Jan 14, 2025

ianco mentioned this issue Jan 14, 2025

Dump out docker logs for agent that doesn't start #897

Merged

ianco mentioned this issue Jan 15, 2025

Comment out redis references (not used in OATH testing) #898

Merged

ianco mentioned this issue Jan 16, 2025

A few tweaks and some additional logging to debug startup issues #899

Merged

ianco mentioned this issue Jan 23, 2025

Dump agent stdout and stderr directly to file (don't pipe to console) #900

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACA-Py Test Sets Failing to start...usually #896

ACA-Py Test Sets Failing to start...usually #896

swcurran commented Jan 14, 2025

swcurran commented Jan 14, 2025

ianco commented Jan 14, 2025

ianco commented Jan 14, 2025

swcurran commented Jan 14, 2025

swcurran commented Jan 14, 2025

ianco commented Jan 14, 2025

ianco commented Jan 14, 2025

swcurran commented Jan 14, 2025

ianco commented Jan 15, 2025

ianco commented Jan 15, 2025

swcurran commented Jan 15, 2025

jamshale commented Jan 15, 2025 •

edited

Loading

ianco commented Jan 15, 2025

ianco commented Jan 16, 2025

ianco commented Jan 16, 2025

ianco commented Jan 20, 2025

ianco commented Jan 20, 2025

ACA-Py Test Sets Failing to start...usually #896

ACA-Py Test Sets Failing to start...usually #896

Comments

swcurran commented Jan 14, 2025

swcurran commented Jan 14, 2025

ianco commented Jan 14, 2025

ianco commented Jan 14, 2025

swcurran commented Jan 14, 2025

swcurran commented Jan 14, 2025

ianco commented Jan 14, 2025

ianco commented Jan 14, 2025

swcurran commented Jan 14, 2025

ianco commented Jan 15, 2025

ianco commented Jan 15, 2025

swcurran commented Jan 15, 2025

jamshale commented Jan 15, 2025 • edited Loading

ianco commented Jan 15, 2025

ianco commented Jan 16, 2025

ianco commented Jan 16, 2025

ianco commented Jan 20, 2025

ianco commented Jan 20, 2025

jamshale commented Jan 15, 2025 •

edited

Loading