
make browsertrix-crawler runnable in serverless environments #448

Open · msramalho opened this issue Dec 11, 2023 · 10 comments

@msramalho
msramalho commented Dec 11, 2023

Hi all,

I've been experimenting with making an AWS Lambda function for browsertrix-crawler and have gone some distance, but I've hit a snag that the maintainers are probably better equipped to help with.

The problem is: the AWS Lambda environment (I'm guessing other serverless options are similar) is locked down so that the only writable location is the /tmp directory, nothing else. For browsertrix-crawler's own outputs the --cwd option should solve it, but something is still trying to write to .local (maybe that's puppeteer/redis or some other dependency?).

So the current error I get is:

mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory
{
    "logLevel": "warn",
    "context": "redis",
    "message": "ioredis error",
    "details": {
        "error": "[ioredis] Unhandled error event:"
    }
}
{
    "logLevel": "warn",
    "context": "state",
    "message": "Waiting for redis at redis://localhost:6379/0",
    "details": {}
}
{
    "logLevel": "error",
    "context": "general",
    "message": "Crawl failed",
    "details": {
        "type": "exception",
        "message": "Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!",
        "stack": "TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!\n    at ChromeLauncher.launch (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:123:23)\n    at async Browser._init (file:///app/util/browser.js:236:20)\n    at async Browser.launch (file:///app/util/browser.js:61:5)\n    at async Crawler.crawl (file:///app/crawler.js:821:5)\n    at async Crawler.run (file:///app/crawler.js:311:7)"
    }
}

and this is the version info:

{
    "logLevel": "info",
    "context": "general",
    "message": "Browsertrix-Crawler 0.11.2 (with warcio.js 1.6.2 pywb 2.7.4)",
    "details": {}
}

I've put the Dockerfile and lambda_function.py in this gist; you can use it to replicate the issue.

For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html
And I'm using API Gateway to make testing quick:

[screenshot: API Gateway test setup]
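(As a possible workaround, sketched below: if the /.local writes come from HOME-relative XDG paths, which the mimeapps.list error suggests, a wrapper entrypoint could redirect them to /tmp before the crawler starts. The script name and the /tmp/crawls path are illustrative, not part of the crawler.)

#!/bin/bash
# run-crawl.sh (hypothetical wrapper): Lambda only permits writes under /tmp,
# so point HOME and the XDG directories there before launching the crawler.
export HOME=/tmp
export XDG_DATA_HOME=/tmp/.local/share
export XDG_CONFIG_HOME=/tmp/.config
export XDG_CACHE_HOME=/tmp/.cache
# /tmp starts empty on every cold start, so create the directories at run time
mkdir -p "$XDG_DATA_HOME/applications" "$XDG_CONFIG_HOME" "$XDG_CACHE_HOME" /tmp/crawls
# 'crawl' is the crawler entrypoint inside the browsertrix-crawler image
exec crawl --cwd /tmp/crawls "$@"

(This doesn't address the /dev/fd/63 errors, which look like a separate process-substitution failure in the google-chrome launcher script.)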

@tw4l
Member

tw4l commented Dec 11, 2023

Thanks for flagging this!

mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory

Hm, I believe that these errors are from the browser itself, not necessarily Puppeteer. From some quick looking around, it looks like Chromium/Chrome/Brave may need to be built in a slightly different way to be able to run on AWS Lambda. We could probably accomplish this by having a separate browser base for Lambda, or perhaps the changes necessary could just be folded into the main release.

@msramalho
Author

Thanks, it makes sense that it's Chrome accessing those dirs.

In that case, a separate base would be the ideal scenario.

Is it possible that all the needed changes can be accommodated by Chrome flags that we can already configure with CHROME_FLAGS as described in the README?

This (two-year-old) Medium post points to a set of flags needed for Chrome to run in Lambda:

const chromeFlags = ['--no-xshm', '--disable-dev-shm-usage', '--single-process',
  '--no-sandbox', '--no-first-run', `--load-extension=${extensionDir}`]

// and then actually just
'--no-first-run'

I'm trying to gauge whether it's worth testing that, or if it has no future.
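(For reference, a sketch of how those flags could be passed through, using the CHROME_FLAGS variable the README describes; the URL and output options here are illustrative:)

# sketch: forward the Lambda-oriented flags via CHROME_FLAGS
docker run -e CHROME_FLAGS="--no-xshm --disable-dev-shm-usage --single-process --no-sandbox --no-first-run" \
  -v $PWD/crawls:/crawls \
  webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ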

@tw4l
Member

tw4l commented Dec 13, 2023

Is it possible that all the needed changes can be accommodated by Chrome flags that we can already configure with CHROME_FLAGS as described in the README?

It is possible! Tbh I'd have to dig deeper into it myself to say either way. It's also worth noting that current releases of the crawler are built on Brave Browser (see #189 for rationale), though it's still possible to build the crawler on Chrome/Chromium via the older debs in the https://github.com/webrecorder/browsertrix-browser-base repo.

If you're willing to put some time into investigating this I'd be happy to help/review a PR!

@kema-dev

kema-dev commented Apr 8, 2024

Hello @msramalho, have you been able to run this in Lambda? I'm considering a similar setup.

@msramalho
Author

Hey @kema-dev, no updates from my side, but I'm still eager to see how this progresses. Several changes have been made to the project since then, and I wonder if any (e.g. changes to the browser base) make this issue easier to solve.

@kema-dev

Hey, I tried a bit and didn't achieve a reasonable result. I switched to ECS + Fargate + EFS and had no problems with that approach.

@msramalho
Author

Cool! Care to share any configurations or tips for replication?

@kema-dev

Sure!

  • ECS Cluster
  • Fargate Capacity Provider
  • VPC
    • 3 Public Subnets
    • 3 Private Subnets
  • EFS
    • Volume for profiles
    • Mount targets in each private subnet of VPC
  • Security Groups
    • Browsertrix Profile creation
      • Ingress
        • tcp/6080
        • tcp/9223
      • Egress
        • tcp/all
    • Browsertrix Crawling
      • Egress
        • tcp/all
    • EFS
      • Profile creation
        • Ingress
          • tcp/2049 from Browsertrix Profile creation SG
        • Egress
          • tcp/all
      • Crawling
        • Ingress
          • tcp/2049 from Browsertrix Crawling SG
        • Egress
          • tcp/all
  • S3 bucket for crawling results
  • IAM
    • Browsertrix crawling role
      • S3 bucket access
        • "Statement": [
            {
              "Effect": "Allow",
              "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
              "Resource": "<s3 bucket arn>/*",
            },
            {
              "Effect": "Allow",
              "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
              "Resource": "<s3 bucket arn>",
            },
          ],
    • IAM User access to the Role (as I couldn't get the ECS Task Role to be assumed by the ECS Task)
    • Access key for the user
  • Secrets Manager for Access Key Id and Access Key Secret
  • ECS Task role (actually 2 roles, one for Browsertrix Profile creation and one for Browsertrix Crawling)
    • Condition
      "Statement": [
          {
            "Effect": "Allow",
            "Principal": {
              "Service": ["ecs-tasks.amazonaws.com"],
            },
            "Action": "sts:AssumeRole",
            "Condition": {
              "ArnLike": {
                "aws:SourceArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:*",
              },
              "StringEquals": {
                "aws:SourceAccount": awsAccountId,
              },
            },
          },
        ],
    • Policy (for Browsertrix Crawling, remove Secrets Manager access for Browsertrix Profile creation Role)
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:CreateLogGroup",
          ],
          "Resource": "*", // Needs further restriction, suitable for development only
        },
        {
          "Effect": "Allow",
          "Action": ["secretsmanager:GetSecretValue"],
          "Resource": <keyId>,
        },
        {
          "Effect": "Allow",
          "Action": ["secretsmanager:GetSecretValue"],
          "Resource": <keySecret>,
        },
      ],
  • ECS Task Role (for ECS Exec)
    • Condition
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": ["ecs-tasks.amazonaws.com"],
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "ArnLike": {
              "aws:SourceArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:*",
            },
            "StringEquals": {
              "aws:SourceAccount": awsAccountId,
            },
          },
        },
      ],
    • Policy (for ECS Exec)
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ssmmessages:CreateControlChannel",
            "ssmmessages:CreateDataChannel",
            "ssmmessages:OpenControlChannel",
            "ssmmessages:OpenDataChannel",
          ],
          "Resource": "*",
        },
      ],
  • ECS Task
    • "container": {
        "name": "<as you wish>",
        "memory": 2048,
        "cpu": 1024,
        "entryPoint": ["<as you wish>"],
        "command": [
          "<as>",
          "<you>",
          "<wish>",
        ],
        "environment": [
          {
            "name": "STORE_ENDPOINT_URL",
            "value": "<s3 url>"
          },
          {
            "name": "STORE_FILENAME",
            "value": "<as you wish>",
          },
          {
            "name": "STORE_PATH",
            "value": "<as you wish>",
          },
        ],
        "secrets": [
          {
            "name": "STORE_ACCESS_KEY",
            "valueFrom": "<iam access key arn>",
          },
          {
            "name": "STORE_SECRET_KEY",
            "valueFrom": "<iam secret key arn>",
          },
        ],
        "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-create-group"":" "true",
            "awslogs-group"":" "<as you wish>",
            "awslogs-region"":" "<awsRegion>",
            "awslogs-stream-prefix"":" "ecs",
          },
        },
        "mountPoints": [
          {
            "containerPath": "/crawls/profiles",
            "sourceVolume": "<EFS volume name>",
            "readOnly": false,
          },
        ],
      },
      "volumes": [
        {
          "name": "<EFS volume name>",
          "efsVolumeConfiguration": {
            "fileSystemId": "<EFS file system id>",
            "transitEncryption": "ENABLED",
          },
        },
      ],
      "runtimePlatform": {
        "operatingSystemFamily": "LINUX",
        "cpuArchitecture": "ARM64", // FinOps
      },
      "skipDestroy": false,
      "executionRole": {
        "roleArn": "<ecsTaskRoleArn>",
      },
      "taskRole": {
        "roleArn": "<ecsTaskRoleArn>",
      },
      "logGroup": {
        "args": {
          "name": "<as you wish>",
          "retentionInDays": <as you wish>,
          "tags": {
            <as you wish>
          },
        },
      },
      "tags": {
        <as you wish>
      },
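(With a setup like this, a crawl task can be launched along the following lines; a sketch where every identifier is a placeholder:)

# sketch: start the crawling task on Fargate (all names/IDs are placeholders)
aws ecs run-task \
  --cluster <cluster name> \
  --launch-type FARGATE \
  --task-definition <crawling task definition> \
  --network-configuration 'awsvpcConfiguration={subnets=[<private subnet ids>],securityGroups=[<crawling SG id>],assignPublicIp=DISABLED}'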

@ikreymer
Member

@kema-dev Thanks for sharing this! If there's a format that would make the most sense to specify this in (Terraform? An Ansible playbook?), or just as docs, I'd be happy to integrate this into the repo and/or our docs!

@kema-dev

I personally use Pulumi, but it uses TF providers as backends anyway. Those resources are just AWS services that need to be provisioned; using the Console, Ansible, TF, or Pulumi works the same way.

I'm designing a complete solution with EventBridge as the scheduler plus the ECS setup I described above. Anyway, the core of the solution resides in my previous message!
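(A sketch of that scheduling piece, assuming an EventBridge Scheduler schedule with an ECS RunTask target; every ARN and name below is a placeholder:)

# sketch: run the crawl on a schedule via EventBridge Scheduler
aws scheduler create-schedule \
  --name browsertrix-crawl \
  --schedule-expression "rate(1 day)" \
  --flexible-time-window Mode=OFF \
  --target '{
    "Arn": "<ECS cluster ARN>",
    "RoleArn": "<scheduler role ARN allowed to call ecs:RunTask>",
    "EcsParameters": {
      "TaskDefinitionArn": "<crawling task definition ARN>",
      "LaunchType": "FARGATE",
      "NetworkConfiguration": {
        "awsvpcConfiguration": {
          "Subnets": ["<private subnet ids>"],
          "SecurityGroups": ["<crawling SG id>"]
        }
      }
    }
  }'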
