make browsertrix-crawler runnable in serverless environments #448
Thanks for flagging this!
Hm, I believe that these errors are from the browser itself, not necessarily Puppeteer. From some quick looking around, it looks like Chromium/Chrome/Brave may need to be built in a slightly different way to be able to run on AWS Lambda. We could probably accomplish this by having a separate browser base for Lambda, or perhaps the changes necessary could just be folded into the main release.
Thanks, it makes sense that it's Chrome accessing those dirs. In that case, a separate base would be the ideal scenario. Is it possible that all the changes needed could be accommodated by Chrome flags that we can already configure? This (2-year-old) Medium post points to a set of flags needed for Chrome to run in Lambda:
I'm trying to gauge whether it's worth testing that or if it has no future.
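For context, the flag set commonly cited for running headless Chrome in Lambda-style sandboxes mostly redirects every writable path into /tmp and disables the multi-process sandbox. Below is a minimal sketch of what passing such flags through could look like; it assumes the crawler exposes some pass-through option for extra browser arguments (the `--extraBrowserArgs` name is an assumption, not a confirmed crawler option).

```python
import subprocess

# Chrome flags commonly cited for Lambda-style sandboxes: every writable
# location is redirected into /tmp and the multi-process sandbox is disabled.
LAMBDA_CHROME_FLAGS = [
    "--no-sandbox",
    "--no-zygote",
    "--single-process",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--user-data-dir=/tmp/chrome-user-data",
    "--data-path=/tmp/chrome-data",
    "--disk-cache-dir=/tmp/chrome-cache",
    "--homedir=/tmp",
]

# Hypothetical invocation: "--extraBrowserArgs" is a placeholder for whatever
# pass-through mechanism the crawler actually provides (not verified here).
subprocess.run(
    ["crawl", "--url", "https://example.com/", "--cwd", "/tmp",
     "--extraBrowserArgs", " ".join(LAMBDA_CHROME_FLAGS)],
    check=True,
)
```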
It is possible! Tbh I'd have to dig deeper into it myself to say either way. It's also worth noting that current releases of the crawler are built on Brave Browser (see #189 for rationale), though it's still possible to build the crawler on Chrome/Chromium via the older debs in the https://github.com/webrecorder/browsertrix-browser-base repo. If you're willing to put some time into investigating this I'd be happy to help/review a PR!
Hello @msramalho, have you been able to run this in Lambda? I'm considering a similar setup.
Hey @kema-dev, no updates from my side, but I'm still eager to see how this progresses. Several changes have been made to the project since then, and I wonder if any (changes to the browser base) can make this issue easier to solve.
Hey, I tried a bit and didn't achieve a reasonable result. I switched to ECS + Fargate + EFS and had no problems with that method.
Cool! Care to share any configurations or tips for replication?
Sure!
@kema-dev Thanks for sharing this! If there's a format that would make the most sense to specify this in (Terraform? An Ansible playbook?), or just as docs, I'd be happy to integrate this into the repo and/or our docs!
I personally use Pulumi, but it uses TF providers as backends anyway. Those resources are just AWS services that need to be provisioned; doing it via the Console, Ansible, TF, or Pulumi works the same way. I'm designing a complete solution with EventBridge as the scheduler plus the ECS stuff I described above. Anyway, the core of the solution is in my previous message!
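To make the setup above a bit more concrete, here is a minimal Pulumi (Python) sketch of the core resources described: an ECS cluster, an EFS file system, and a Fargate task definition for the crawler. Resource names, sizing, the IAM role ARN, the image tag, and the command are placeholders; mount targets, access points, and VPC networking are omitted; nothing here is taken from the actual stack described above.

```python
import json

import pulumi_aws as aws

# Persistent EFS file system so crawl output under /crawls survives task runs.
crawl_fs = aws.efs.FileSystem("crawl-data")

cluster = aws.ecs.Cluster("browsertrix")

# Placeholder: an existing ECS task execution role ARN.
execution_role_arn = "arn:aws:iam::123456789012:role/ecsTaskExecutionRole"

# Fargate task definition running the crawler with the EFS volume mounted at /crawls.
task = aws.ecs.TaskDefinition(
    "browsertrix-crawler",
    family="browsertrix-crawler",
    cpu="1024",
    memory="4096",
    network_mode="awsvpc",
    requires_compatibilities=["FARGATE"],
    execution_role_arn=execution_role_arn,
    container_definitions=json.dumps([{
        "name": "crawler",
        "image": "webrecorder/browsertrix-crawler:latest",
        "command": ["crawl", "--url", "https://example.com/", "--cwd", "/crawls"],
        "mountPoints": [{"sourceVolume": "crawls", "containerPath": "/crawls"}],
    }]),
    volumes=[aws.ecs.TaskDefinitionVolumeArgs(
        name="crawls",
        efs_volume_configuration=aws.ecs.TaskDefinitionVolumeEfsVolumeConfigurationArgs(
            file_system_id=crawl_fs.id,
        ),
    )],
)
```

An EventBridge rule, as mentioned above, would then run this task definition on a schedule.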
Hi all,
I've been experimenting with making an AWS Lambda function for browsertrix-crawler and I've gone some distance, but I've hit a snag that the maintainers are probably better equipped to help with.
The problem is: the AWS Lambda function environment (I'm guessing other serverless options are similar) is a controlled environment where the only write permission is to the `/tmp` directory and nowhere else. For browsertrix-crawler's outputs the `--cwd` option should solve it, but something is still trying to write to `.local` (maybe that's playwright/redis or some other dependency?). So the current error I get is:
And this is the version info:
I've put the `Dockerfile` and `lambda_function.py` in this gist; you can use it if you want to replicate the issue. For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html
And I'm using the API gateway to make testing quick:
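As a possible (unverified) workaround for the `.local` writes mentioned above, one idea is to point HOME and the XDG directories at /tmp before launching the crawler, so anything that resolves paths under the home directory lands somewhere writable. The following is only an illustrative sketch, not the `lambda_function.py` from the gist, and it assumes the `crawl` entrypoint is on the image's PATH.

```python
import os
import subprocess


def handler(event, context):
    # Lambda only allows writes under /tmp, so redirect everything that
    # resolves paths from the home directory (e.g. ~/.local, ~/.cache) there.
    env = dict(
        os.environ,
        HOME="/tmp",
        XDG_CACHE_HOME="/tmp/.cache",
        XDG_CONFIG_HOME="/tmp/.config",
        XDG_DATA_HOME="/tmp/.local/share",
    )

    url = event.get("url", "https://example.com/")
    result = subprocess.run(
        ["crawl", "--url", url, "--cwd", "/tmp", "--generateWACZ"],
        env=env,
        capture_output=True,
        text=True,
    )
    return {
        "statusCode": 200 if result.returncode == 0 else 500,
        "body": (result.stdout + result.stderr)[-4000:],
    }
```

Whether redirecting HOME is enough depends on which dependency is actually writing to `.local`, which is exactly the open question in this issue.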