Course Crawler

AWS Device Farm Desktop Browser automation system for course crawling and video capture.

Overview

Course Crawler is a serverless application built on AWS that accepts JSON-formatted browser commands and executes them on AWS Device Farm Desktop Browser sessions. The system captures video recordings of browser interactions and returns session artifacts for debugging and verification.

Architecture

  • Monorepo Structure: TypeScript workspace with infrastructure and services
  • AWS CDK: Infrastructure as Code for serverless deployment
  • Lambda Functions: Event-driven compute for browser automation
  • Device Farm: Desktop browser sessions for web automation
  • S3: Storage for video artifacts and session data

Prerequisites

  • Node.js 20+
  • npm or pnpm
  • AWS CLI configured
  • AWS CDK CLI

Project Structure

course-crawler/
├── infra/                    # AWS CDK Infrastructure
│   ├── bin/app.ts           # CDK application entry point
│   ├── lib/                 # CDK stack definitions
│   └── package.json         # CDK dependencies
├── services/                # Lambda Functions
│   └── [future services]    # Browser automation services
├── tests/                   # Test suites
│   ├── contract/           # Schema validation tests
│   └── integration/        # End-to-end tests
├── package.json            # Root workspace configuration
├── .nvmrc                  # Node.js version specification
└── README.md               # This file

Development Setup

  1. Install Node.js (version 20+)

    nvm use  # Uses version from .nvmrc
  2. Install Dependencies

    npm install              # Root workspace
    cd infra && npm install  # CDK workspace
  3. Build and Test

    npm run build           # Compile TypeScript
    npm run lint            # Code quality checks
    npm run format          # Code formatting

Next Steps

This repository provides the foundation for AWS Device Farm browser automation. Future development will include:

  1. API Gateway: REST endpoints for browser command submission
  2. Step Functions: Orchestration of browser automation workflows
  3. Lambda Services: Browser command execution and artifact collection
  4. Video Processing: Session recording and storage management

Constitutional Principles

This project follows AWS-First architecture principles:

  • Serverless and event-driven design
  • Managed AWS services over custom implementations
  • Video recording for all browser sessions
  • JSON-based command DSL
  • Least-privilege security policies

License

MIT License - see LICENSE file for details.


High-level architecture (Future Implementation)

  • API Gateway (HTTP API): accepts a JSON "script" and an optional browser matrix.
  • Step Functions: orchestrates the run (get a Device Farm URL, drive the browser via Selenium, collect artifacts, and return links).
  • Two Lambdas (Node.js 20):
      1. SeleniumRunner: translates your JSON commands into Selenium calls against a signed TestGrid URL, then closes the session.
      2. ArtifactCollector: looks up the session, lists VIDEO artifacts, and copies them to S3 (then returns pre-signed links).
  • S3 (videos bucket): durable storage for video artifacts, with a lifecycle policy and pre-signed download links.
  • (Optional) VPC hook-up: if your web app is private, configure TestGrid VPC access for us-west-2 and peer as needed.

Why this shape:

  • Device Farm's TestGrid produces video recordings automatically for desktop browser sessions.
  • You create a short-lived Selenium endpoint with CreateTestGridUrl (valid for 60 to 86,400 seconds) and then drive it via RemoteWebDriver.
  • After the run, you fetch VIDEO artifacts via ListTestGridSessionArtifacts (videos can be split into parts).
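The TTL bounds for CreateTestGridUrl lend themselves to a small guard in the API layer. The following is a minimal sketch, not code from this repository: `clampSessionTtl` is a hypothetical helper, and the 900-second fallback is an arbitrary assumption.

```typescript
// Bounds documented for CreateTestGridUrl.expiresInSeconds.
const MIN_TTL_SECONDS = 60;
const MAX_TTL_SECONDS = 86400;

// Hypothetical helper: coerce a caller-supplied TTL into the accepted range.
// Non-numeric or missing values fall back to an assumed default of 900 seconds.
function clampSessionTtl(requested: unknown, fallback: number = 900): number {
  const seconds =
    typeof requested === "number" && Number.isFinite(requested)
      ? Math.floor(requested)
      : fallback;
  return Math.min(MAX_TTL_SECONDS, Math.max(MIN_TTL_SECONDS, seconds));
}
```

Clamping rather than rejecting keeps the API forgiving; a stricter service might return a 400 for out-of-range values instead.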

JSON command format (your mini-DSL)

Accept a payload like:

{
  "browsers": ["chrome", "firefox"],   // optional; default chrome
  "capabilities": { "screenResolution": "1920x1080" },
  "commands": [
    {"action": "goto", "url": "https://example.com"},
    {"action": "type", "selector": "#email", "text": "[email protected]"},
    {"action": "type", "selector": "#password", "text": "s3cret"},
    {"action": "click", "selector": "button[type=submit]"},
    {"action": "waitFor", "selector": ".dashboard", "timeoutMs": 10000},
    {"action": "sleep", "ms": 1000},
    {"action": "screenshot", "name": "post-login"}
  ],
  "sessionTTLSeconds": 900             // maps to CreateTestGridUrl.expiresInSeconds
}

Supported actions (expandable): goto, click, type, waitFor, sleep, screenshot, keys, evaluateJS.
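Since the payload is untrusted input, it is worth validating before a browser session is started. The sketch below is one possible shape for that check, not the project's contract tests. The required fields are inferred from the sample payload (and `sequence` for `keys` from the runner outline); `evaluateJS` is left out because its fields are not defined here. `REQUIRED_FIELDS` and `validateCommands` are hypothetical names.

```typescript
type FieldKind = "string" | "number" | "string[]";

// Required fields per action, inferred from the sample payload (an assumption).
const REQUIRED_FIELDS: Record<string, Record<string, FieldKind>> = {
  goto: { url: "string" },
  click: { selector: "string" },
  type: { selector: "string", text: "string" },
  waitFor: { selector: "string" },
  sleep: { ms: "number" },
  screenshot: {},
  keys: { sequence: "string[]" },
};

function matches(value: unknown, kind: FieldKind): boolean {
  if (kind === "string[]") {
    return Array.isArray(value) && value.every((v) => typeof v === "string");
  }
  return typeof value === kind;
}

// Returns human-readable problems; an empty list means the script passed.
function validateCommands(commands: unknown): string[] {
  if (!Array.isArray(commands)) return ["commands must be an array"];
  const errors: string[] = [];
  commands.forEach((cmd: unknown, i: number) => {
    const fields = (typeof cmd === "object" && cmd !== null ? cmd : {}) as Record<string, unknown>;
    const action = fields.action;
    if (
      typeof action !== "string" ||
      !Object.prototype.hasOwnProperty.call(REQUIRED_FIELDS, action)
    ) {
      errors.push(`commands[${i}]: unknown action ${JSON.stringify(action)}`);
      return;
    }
    for (const [field, kind] of Object.entries(REQUIRED_FIELDS[action])) {
      if (!matches(fields[field], kind)) {
        errors.push(`commands[${i}].${field} must be of type ${kind}`);
      }
    }
  });
  return errors;
}
```

Collecting every problem instead of failing on the first lets the API return one complete error report per request.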

Orchestration flow (Step Functions)

  1. CreateTestGridUrl (SDK integration) for each requested browser (use a Map state for parallelism).
  2. SeleniumRunner Lambda:
       • Start RemoteWebDriver with the browserName capability and the signed URL.
       • Execute the JSON commands.
       • On completion, call .quit() to close the session and finalize the video.
  3. GetTestGridSession: resolve the session ARN/ID (if the runner did not capture it, list sessions by creation time and ACTIVE/CLOSED status).
  4. ArtifactCollector Lambda:
       • ListTestGridSessionArtifacts(type="VIDEO"), then fetch each artifact URL.
       • Stream to S3 (videos/{executionId}/{browser}/part-N.mp4).
  5. Respond (API Gateway integration): pre-signed S3 URLs plus metadata (logs if you want).

Note: videos may come in multiple parts; you can return an ordered list or (optionally) concatenate server-side. 
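Returning an ordered list needs a numeric sort, because plain string ordering places part-10 before part-2. A minimal sketch of the key layout and ordering follows; the helper names are hypothetical.

```typescript
// Key layout from the flow above: videos/{executionId}/{browser}/part-N.mp4
function videoPartKey(executionId: string, browser: string, part: number): string {
  return `videos/${executionId}/${browser}/part-${part}.mp4`;
}

// Extract N from ".../part-N.mp4"; keys that do not match sort last.
function partNumber(key: string): number {
  const match = /part-(\d+)\.mp4$/.exec(key);
  return match ? Number(match[1]) : Number.MAX_SAFE_INTEGER;
}

// Sort numerically so part-10 follows part-9 rather than part-1.
function orderParts(keys: string[]): string[] {
  return [...keys].sort((a, b) => partNumber(a) - partNumber(b));
}
```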

CDK: key resources (TypeScript)

// bin/app.ts
import { App } from 'aws-cdk-lib';
import { DeviceFarmRunnerStack } from '../lib/stack';

const app = new App();
new DeviceFarmRunnerStack(app, 'DeviceFarmRunner', {
  env: { account: process.env.CDK_DEFAULT_ACCOUNT, region: 'us-west-2' }
});

// lib/stack.ts
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigw from 'aws-cdk-lib/aws-apigatewayv2';
import * as integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';

export class DeviceFarmRunnerStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

const bucket = new s3.Bucket(this, 'Videos', {
  encryption: s3.BucketEncryption.S3_MANAGED,
  lifecycleRules: [{ expiration: Duration.days(30) }],
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
});

const seleniumRunner = new lambda.Function(this, 'SeleniumRunner', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/selenium-runner'),
  timeout: Duration.minutes(10),
  memorySize: 1536,
  environment: { VIDEOS_BUCKET: bucket.bucketName }
});

const artifactCollector = new lambda.Function(this, 'ArtifactCollector', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/artifact-collector'),
  timeout: Duration.minutes(5),
  memorySize: 1024,
  environment: { VIDEOS_BUCKET: bucket.bucketName }
});

// IAM for Device Farm TestGrid + S3
[seleniumRunner, artifactCollector].forEach(fn => {
  fn.addToRolePolicy(new iam.PolicyStatement({
    actions: [
      'devicefarm:CreateTestGridUrl',
      'devicefarm:GetTestGridSession',
      'devicefarm:ListTestGridSessions',
      'devicefarm:ListTestGridSessionArtifacts'
    ],
    resources: ['*'] // tighten to specific project ARN in prod
  }));
  bucket.grantReadWrite(fn);
});

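// Step Functions: Map over browsers
// The execution input must carry projectArn, sessionTTLSeconds, browsers,
// commands, and capabilities (the API handler is expected to fill in defaults).
const createUrl = new tasks.CallAwsService(this, 'CreateTestGridUrl', {
  service: 'devicefarm',
  action: 'createTestGridUrl',
  // SDK service integrations expect PascalCase parameter names
  parameters: {
    ProjectArn: sfn.JsonPath.stringAt('$.projectArn'),
    ExpiresInSeconds: sfn.JsonPath.numberAt('$.sessionTTLSeconds')
  },
  iamResources: ['*'], // tighten to specific project ARN in prod
  // Keep the iteration input and expose the signed URL as $.CreateTestGridUrl.url
  resultSelector: { 'url.$': '$.Url' },
  resultPath: '$.CreateTestGridUrl'
});

const runSelenium = new tasks.LambdaInvoke(this, 'Run Commands', {
  lambdaFunction: seleniumRunner,
  payload: sfn.TaskInput.fromObject({
    testGridUrl: sfn.JsonPath.stringAt('$.CreateTestGridUrl.url'),
    browserName: sfn.JsonPath.stringAt('$.browser'),
    commands: sfn.JsonPath.objectAt('$.commands'),
    capabilities: sfn.JsonPath.objectAt('$.capabilities')
  }),
  // Store the handler's return value directly, without the Invoke envelope
  payloadResponseOnly: true,
  resultPath: '$.runner'
});

const collectArtifacts = new tasks.LambdaInvoke(this, 'Collect Video', {
  lambdaFunction: artifactCollector,
  payload: sfn.TaskInput.fromObject({
    projectArn: sfn.JsonPath.stringAt('$.projectArn'),
    sessionId: sfn.JsonPath.stringAt('$.runner.sessionId'),
    browser: sfn.JsonPath.stringAt('$.browser')
  }),
  payloadResponseOnly: true,
  resultPath: '$.artifacts'
});

const chain = createUrl.next(runSelenium).next(collectArtifacts);
const map = new sfn.Map(this, 'Per Browser', {
  itemsPath: sfn.JsonPath.stringAt('$.browsers'),
  // Each item is a bare browser name, so pass the shared fields into every iteration
  itemSelector: {
    browser: sfn.JsonPath.stringAt('$$.Map.Item.Value'),
    projectArn: sfn.JsonPath.stringAt('$.projectArn'),
    sessionTTLSeconds: sfn.JsonPath.numberAt('$.sessionTTLSeconds'),
    commands: sfn.JsonPath.objectAt('$.commands'),
    capabilities: sfn.JsonPath.objectAt('$.capabilities')
  }
});
map.itemProcessor(chain);

const stateMachine = new sfn.StateMachine(this, 'RunnerSm', {
  definitionBody: sfn.DefinitionBody.fromChainable(map),
  timeout: Duration.minutes(30)
});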

const apiHandler = new lambda.Function(this, 'ApiHandler', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/api'),
  environment: {
    SM_ARN: stateMachine.stateMachineArn,
    PROJECT_ARN: 'arn:aws:devicefarm:us-west-2:123456789012:testgrid-project:...'
  }
});

const api = new apigw.HttpApi(this, 'RunnerApi', {
  defaultIntegration: new integrations.HttpLambdaIntegration('StartExec', apiHandler)
});

// The API handler starts executions, so its role is the one that needs this grant
stateMachine.grantStartExecution(apiHandler);

  }
}

SeleniumRunner Lambda (Node.js outline)

// lambda/selenium-runner/index.js
const { Builder, By, until, Key } = require('selenium-webdriver');

exports.handler = async (event) => {
  const { testGridUrl, browserName = 'chrome', commands = [], capabilities = {} } = event;

  const driver = await new Builder()
    .usingServer(testGridUrl)
    .withCapabilities({ browserName, ...capabilities })
    .build();

  // Session ID is available on the driver; use alongside projectArn for GetTestGridSession
  const sessionId = (await driver.getSession()).getId();

  try {
    for (const cmd of commands) {
      switch (cmd.action) {
        case 'goto': await driver.get(cmd.url); break;
        case 'click': await driver.findElement(By.css(cmd.selector)).click(); break;
        case 'type': await driver.findElement(By.css(cmd.selector)).sendKeys(cmd.text); break;
        case 'waitFor': await driver.wait(until.elementLocated(By.css(cmd.selector)), cmd.timeoutMs || 10000); break;
        // sendKeys is variadic, so spread the mapped key sequence
        case 'keys': await driver.actions().sendKeys(...cmd.sequence.map((k) => Key[k] || k)).perform(); break;
        case 'sleep': await driver.sleep(cmd.ms); break;
        // Returns a base64 PNG; persisting it is left out of this outline
        case 'screenshot': await driver.takeScreenshot(); break;
        // evaluateJS is listed as supported but not implemented in this outline
        default: throw new Error(`Unsupported action: ${cmd.action}`);
      }
    }
    return { sessionId, ok: true };
  } catch (err) {
    // Report the failure but still return the session ID so the video can be collected
    return { sessionId, ok: false, error: String(err) };
  } finally {
    // Always close the session; this finalizes the video
    await driver.quit();
  }
};

(Using TestGrid's signed URL with RemoteWebDriver is the intended path.)
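Because the command loop is plain dispatch, it can be unit-tested without a Device Farm session by coding against a narrow interface. The following TypeScript sketch illustrates the idea with a subset of actions. `ScriptDriver`, `runScript`, and `fakeDriver` are hypothetical names, and a real adapter would wrap selenium-webdriver.

```typescript
// Hypothetical narrow interface; a production adapter would wrap selenium-webdriver.
interface ScriptDriver {
  goto(url: string): Promise<void>;
  click(selector: string): Promise<void>;
  type(selector: string, text: string): Promise<void>;
  quit(): Promise<void>;
}

type Step =
  | { action: "goto"; url: string }
  | { action: "click"; selector: string }
  | { action: "type"; selector: string; text: string };

interface RunResult {
  ok: boolean;
  stepsRun: number;
  error?: string;
}

// Runs steps in order, stops at the first failure, and always quits the driver.
async function runScript(driver: ScriptDriver, steps: Step[]): Promise<RunResult> {
  let stepsRun = 0;
  try {
    for (const step of steps) {
      switch (step.action) {
        case "goto": await driver.goto(step.url); break;
        case "click": await driver.click(step.selector); break;
        case "type": await driver.type(step.selector, step.text); break;
      }
      stepsRun++;
    }
    return { ok: true, stepsRun };
  } catch (err) {
    return { ok: false, stepsRun, error: String(err) };
  } finally {
    await driver.quit();
  }
}

// Fake driver for tests: records calls and fails when clicking a chosen selector.
function fakeDriver(failOn?: string): { driver: ScriptDriver; calls: string[] } {
  const calls: string[] = [];
  const driver: ScriptDriver = {
    async goto(url) { calls.push(`goto ${url}`); },
    async click(selector) {
      if (selector === failOn) throw new Error(`no such element: ${selector}`);
      calls.push(`click ${selector}`);
    },
    async type(selector, _text) { calls.push(`type ${selector}`); },
    async quit() { calls.push("quit"); },
  };
  return { driver, calls };
}
```

Keeping `quit()` in the `finally` block mirrors the runner: the session is closed, and the video finalized, whether or not a step throws.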

ArtifactCollector Lambda (Node.js outline)

// lambda/artifact-collector/index.js
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const {
  DeviceFarmClient,
  GetTestGridSessionCommand,
  ListTestGridSessionArtifactsCommand
} = require('@aws-sdk/client-device-farm');

const s3 = new S3Client();
const df = new DeviceFarmClient({ region: 'us-west-2' });

exports.handler = async ({ projectArn, sessionId, browser }) => {
  // Resolve ARN from projectArn + sessionId
  const { testGridSession } = await df.send(
    new GetTestGridSessionCommand({ projectArn, sessionId })
  );
  const sessionArn = testGridSession.arn;

  // List artifacts of type VIDEO
  const out = await df.send(
    new ListTestGridSessionArtifactsCommand({ sessionArn, type: 'VIDEO' })
  );

  const uploaded = [];
  let part = 0;
  for (const a of out.artifacts || []) {
    // Node.js 20 provides a global fetch, so node-fetch is not needed
    const res = await fetch(a.url);
    if (!res.ok) throw new Error(`Artifact download failed with HTTP ${res.status}`);
    const body = Buffer.from(await res.arrayBuffer());
    const key = `videos/${sessionId}/${browser}/part-${++part}.mp4`;
    await s3.send(new PutObjectCommand({
      Bucket: process.env.VIDEOS_BUCKET,
      Key: key,
      Body: body,
      ContentType: 'video/mp4'
    }));
    uploaded.push({ s3Key: key });
  }

  // Return pre-signed URLs in API layer (or sign here if you prefer)
  return { sessionArn, parts: uploaded };
};

(Listing VIDEO artifacts and downloading them is the supported flow; video may be split across multiple artifacts.)
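The outline reads a single page of results. ListTestGridSessionArtifacts is token-paginated, so a longer session could need a loop that follows `nextToken`. The sketch below shows that loop in TypeScript with the page fetcher injected, so it runs without AWS credentials. `Page` and `collectAll` are hypothetical names.

```typescript
interface Page<T> {
  items: T[];
  nextToken?: string;
}

// Follow nextToken until it is absent, accumulating items from every page.
async function collectAll<T>(
  fetchPage: (token?: string) => Promise<Page<T>>
): Promise<T[]> {
  const all: T[] = [];
  let token: string | undefined;
  do {
    const page = await fetchPage(token);
    all.push(...page.items);
    token = page.nextToken;
  } while (token);
  return all;
}

// In the collector, fetchPage would wrap the SDK call, roughly:
//   async (nextToken) => {
//     const out = await df.send(new ListTestGridSessionArtifactsCommand(
//       { sessionArn, type: 'VIDEO', nextToken }));
//     return { items: out.artifacts ?? [], nextToken: out.nextToken };
//   }
```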

API request/response

Request (POST /run): the JSON shown above. Response: array per browser with pre-signed S3 URLs (and optional logs).

{
  "runId": "abc123",
  "results": [
    {
      "browser": "chrome",
      "sessionArn": "arn:aws:devicefarm:...:testgrid-session:...",
      "video": [
        "https://s3.amazonaws.com/your-bucket/videos/abc123/chrome/part-1.mp4?X-Amz-Expires=600",
        "https://s3.amazonaws.com/your-bucket/videos/abc123/chrome/part-2.mp4?..."
      ]
    }
  ]
}
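One way to assemble this response in the API layer is sketched below. The `IterationOutput` shape is an assumption about what each Map iteration returns (the collector's `{ sessionArn, parts }` plus the browser name), and the signer is injected so the function stays testable. In a deployed handler, the signer would typically wrap `getSignedUrl` from `@aws-sdk/s3-request-presigner`.

```typescript
interface BrowserResult {
  browser: string;
  sessionArn: string;
  video: string[];
}

interface RunResponse {
  runId: string;
  results: BrowserResult[];
}

// Assumed per-browser output of the state machine (hypothetical shape).
interface IterationOutput {
  browser: string;
  artifacts: { sessionArn: string; parts: { s3Key: string }[] };
}

// Turns an S3 key into a downloadable URL; injected so tests can use a fake.
type Signer = (s3Key: string) => Promise<string>;

async function buildResponse(
  runId: string,
  outputs: IterationOutput[],
  sign: Signer
): Promise<RunResponse> {
  const results = await Promise.all(
    outputs.map(async (o) => ({
      browser: o.browser,
      sessionArn: o.artifacts.sessionArn,
      // Promise.all preserves input order, so parts stay in sequence
      video: await Promise.all(o.artifacts.parts.map((p) => sign(p.s3Key))),
    }))
  );
  return { runId, results };
}
```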

IAM & security essentials

  • Allow Lambdas: devicefarm:CreateTestGridUrl, devicefarm:GetTestGridSession, devicefarm:ListTestGridSessions, devicefarm:ListTestGridSessionArtifacts; S3 PutObject/GetObject.
  • Scope permissions to your project ARN (least privilege).
  • If you must hit private endpoints, configure TestGrid VPC (only in us-west-2) and peer to other regions as needed.

Operational notes & limits

  • URL TTL: 60–86,400 seconds; set sessionTTLSeconds accordingly.
  • Video availability: recording runs from the start to the end of the session; closing the driver finalizes artifacts.
  • Finding the session: use GetTestGridSession(projectArn, sessionId); if needed, use ListTestGridSessions with creation time and status filters.
  • Costs: billed per minute of browser time (a parallel Map finishes sooner but runs sessions concurrently). See Device Farm pricing.
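For the fallback lookup, the selection logic after ListTestGridSessions might look like the sketch below. `SessionSummary` is a trimmed, assumed subset of the session fields, and `pickSession` is a hypothetical helper.

```typescript
// Assumed subset of the fields returned for each TestGrid session.
interface SessionSummary {
  arn: string;
  status: "ACTIVE" | "CLOSED" | "ERRORED";
  created: Date;
}

// Choose the earliest non-errored session created at or after the run started.
function pickSession(
  sessions: SessionSummary[],
  runStartedAt: Date
): SessionSummary | undefined {
  return sessions
    .filter((s) => s.status !== "ERRORED" && s.created.getTime() >= runStartedAt.getTime())
    .sort((a, b) => a.created.getTime() - b.created.getTime())[0];
}
```

This heuristic is ambiguous when several browsers start sessions in the same window, so capturing the session ID in the runner remains the more reliable path.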
