Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACA-Py Test Sets Failing to start...usually #896

Open
swcurran opened this issue Jan 14, 2025 · 17 comments
Open

ACA-Py Test Sets Failing to start...usually #896

swcurran opened this issue Jan 14, 2025 · 17 comments
Assignees

Comments

@swcurran
Copy link
Contributor

Recently (since Dec. 20) the ACA-Py tests are failing, because the agents are not starting. Running the tests locally work fine -- agents run and all ACA-Py tests pass.

When I first saw this, I thought it was the unpinned poetry update that broke ACA-Py briefly, but I see that is not the issue. When I look at the logs from the runs, I see that ngrok gives errors (see attached). However, I checked a successful run from a few days ago (ACA-Py aip10 tests) and the ngrok log looked the same.

docker-ngrok-tails-server-1.log

@swcurran
Copy link
Contributor Author

@ianco -- could you look into this? This is pretty high priority -- not having the tests running at all is bad.

@ianco
Copy link
Contributor

ianco commented Jan 14, 2025

@ianco -- could you look into this? This is pretty high priority -- not having the tests running at all is bad.

on it

@ianco ianco self-assigned this Jan 14, 2025
@ianco
Copy link
Contributor

ianco commented Jan 14, 2025

I think the ngrok error is a red herring. Ngrok changed their model awhile ago to require a "token" - which we don't have - we should probably remove ngrok out of the tails startup ...

@swcurran
Copy link
Contributor Author

Agreed - since it gives the same error on success or failure, it's not like the issue. Looking forward to determining what it is!

@swcurran
Copy link
Contributor Author

Re-running the test now. Started to yesterday, but stopped because of github issues. I'll see if anything pops up in that.

@ianco
Copy link
Contributor

ianco commented Jan 14, 2025

I'll update the manage script to dump out the aca-py logs if the agent fails to start. (I'm not having any luck reproducing the error)

@ianco
Copy link
Contributor

ianco commented Jan 14, 2025

I restarted a couple of the tests (aca-py aip10 and aip20) and they are both working :-S

I'll check tomorrow to see if any of the "auto" runs failed.

@swcurran
Copy link
Contributor Author

I ran aip20 today, and it did run -- w00t. However, a number of the tests failed. I'm re-running locally, but pretty sure I ran them yesterday and all passed. Weird...

@ianco
Copy link
Contributor

ianco commented Jan 15, 2025

I'm seeing this error:

2025-01-15 00:26:11,751 acapy_agent.core.plugin_registry ERROR Module doesn't exist: redis_events.v1_0.redis_queue.events

@ianco
Copy link
Contributor

ianco commented Jan 15, 2025

... not sure why there is a redis error when aca-py is starting up (especially since it seems to be intermittent).

We're not actually using (or running) redis so maybe we should remove redis references from the aca-py backchannel?

@swcurran @jamshale any thoughts?

@swcurran
Copy link
Contributor Author

I think that is fine — it’s not crucial. Are those errors showing up in the failed tests? It’s odd those are failing in the GHA runs, but working locally. I’m trying to understand why redis would be used in some tests in the runset, but not all or none.

@jamshale
Copy link
Contributor

jamshale commented Jan 15, 2025

I don't have a lot of experience with this project but I remember it had something for testing the redis events plugin. I don't think this is the right place for that testing, and the plugin repo is also testing it. Perhaps that whole thing should be removed from here.

@ianco
Copy link
Contributor

ianco commented Jan 15, 2025

The redis updated didn't fix the issue! (Just a "redis" herring.) It looks like the agent is starting but then the backchannel either can't connect or times out. Back to the drawing board :-(

@ianco
Copy link
Contributor

ianco commented Jan 16, 2025

OK I managed to reproduce the issue locally (just by fluke I think). It looks like at some point on startup the aca-py agent just "hangs" and then the startup times out. Might be a threading issue where aca-py "yields" and then never gets control back?

@ianco
Copy link
Contributor

ianco commented Jan 16, 2025

OK the agent is just getting frozen when starting up, here is a log from my local this morning:

Starting aca-py agent ...
2025-01-16 18:26:37: starting backchannel process
Got extra args: {"wallet-type":"askar-anoncreds"}
Listening to backchannel on port 9020
Listening to web_hooks on port 9023
Got DID: CgESJTUURkSU9dXbCUpJuJ
Location of ACA-Py: ./bin/aca-py
2025-01-16 18:26:37: Starting agent sub-process ...
agent starting with params: 
['./bin/aca-py', 'start', '--label', 'aca-py.Acme', '...']
2025-01-16 18:26:38,261 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.protocols.routing
2025-01-16 18:26:38,262 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.protocols.did_rotate
...
2025-01-16 18:26:38,409 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.anoncreds.default.legacy_indy
2025-01-16 18:26:38,411 acapy_agent.core.plugin_registry DEBUG Loaded module: acapy_agent.revocation_anoncreds
>>> 2025-01-16 18:27:38: Error starting agent on admin url http://host.docker.internal:9022

Notice the agent isn't doing anything from 18:26:38 to 18:27:38 ... maybe a threading issue?

@ianco
Copy link
Contributor

ianco commented Jan 20, 2025

This is still happening sporadically in the github test runs, however this morning on my local it's happening consistently and I can't run any tests :-(

I found this link: https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

(and a couple of stackoverflow comments that were similar)

Basically there's a known bug with subprocess.POpen where the process will "hang" if too much data gets stuffed into the stdout or stderr buffer. We have processes reading these two pipes, however there's a possible timing issue where the data is getting stuffed in faster than it's getting read out.

I'll take a look, since this is blocking me right now ...

@ianco
Copy link
Contributor

ianco commented Jan 20, 2025

Basically there's a known bug with subprocess.POpen where the process will "hang" if too much data gets stuffed into the stdout or stderr buffer

Using tempfile instead of pipes seems to work, however now I'm losing the output ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants