Course Crawler

AWS Device Farm Desktop Browser automation system for course crawling and video capture.

Overview

Course Crawler is a serverless application built on AWS that accepts JSON-formatted browser commands and executes them on AWS Device Farm Desktop Browser sessions. The system captures video recordings of browser interactions and returns session artifacts for debugging and verification.

Architecture

Monorepo Structure: TypeScript workspace with infrastructure and services
AWS CDK: Infrastructure as Code for serverless deployment
Lambda Functions: Event-driven compute for browser automation
Device Farm: Desktop browser sessions for web automation
S3: Storage for video artifacts and session data

Prerequisites

Node.js 20+
npm or pnpm
AWS CLI configured
AWS CDK CLI

Project Structure

course-crawler/
├── infra/                    # AWS CDK Infrastructure
│   ├── bin/app.ts           # CDK application entry point
│   ├── lib/                 # CDK stack definitions
│   └── package.json         # CDK dependencies
├── services/                # Lambda Functions
│   └── [future services]    # Browser automation services
├── tests/                   # Test suites
│   ├── contract/           # Schema validation tests
│   └── integration/        # End-to-end tests
├── package.json            # Root workspace configuration
├── .nvmrc                  # Node.js version specification
└── README.md               # This file

Development Setup

Install Node.js (version 20+)
```
nvm use  # Uses version from .nvmrc
```

Install Dependencies

npm install              # Root workspace
cd infra && npm install  # CDK workspace

Build and Test

npm run build           # Compile TypeScript
npm run lint            # Code quality checks
npm run format          # Code formatting

Next Steps

This repository provides the foundation for AWS Device Farm browser automation. Future development will include:

API Gateway: REST endpoints for browser command submission
Step Functions: Orchestration of browser automation workflows
Lambda Services: Browser command execution and artifact collection
Video Processing: Session recording and storage management

Constitutional Principles

This project follows AWS-First architecture principles:

Serverless and event-driven design
Managed AWS services over custom implementations
Video recording for all browser sessions
JSON-based command DSL
Least-privilege security policies

License

MIT License - see LICENSE file for details.

High-level architecture (Future Implementation)

• API Gateway (HTTP API): accepts a JSON “script” and optional browser matrix.
• Step Functions: orchestrates the run (get a Device Farm URL, drive the browser via Selenium, collect artifacts, and return links).
• Two Lambdas (Node.js 20):
  • SeleniumRunner: translates your JSON commands → Selenium calls against a signed TestGrid URL, then closes the session.
  • ArtifactCollector: looks up the session, lists VIDEO artifacts, and copies them to S3 (then returns pre-signed links).
• S3 (videos bucket): durable storage for video artifacts (with lifecycle policy + pre-signed download links).
• (Optional) VPC hook-up: if your web app is private, configure TestGrid VPC access for us-west-2 and peer as needed.

Why this shape:
• Device Farm’s TestGrid produces video recordings automatically for desktop browser sessions.
• You create a short-lived Selenium endpoint with CreateTestGridUrl (valid 60–86,400s) and then drive it via RemoteWebDriver.
• After the run, you fetch VIDEO artifacts via ListTestGridSessionArtifacts (videos can be split into parts).

⸻

JSON command format (your mini-DSL)

Accept a payload like:

{
  "browsers": ["chrome", "firefox"],  // optional; default chrome
  "capabilities": { "screenResolution": "1920x1080" },
  "commands": [
    {"action": "goto", "url": "https://example.com"},
    {"action": "type", "selector": "#email", "text": "[email protected]"},
    {"action": "type", "selector": "#password", "text": "s3cret"},
    {"action": "click", "selector": "button[type=submit]"},
    {"action": "waitFor", "selector": ".dashboard", "timeoutMs": 10000},
    {"action": "sleep", "ms": 1000},
    {"action": "screenshot", "name": "post-login"}
  ],
  "sessionTTLSeconds": 900  // maps to CreateTestGridUrl.expiresInSeconds
}

Supported actions (expandable): goto, click, type, waitFor, sleep, screenshot, keys, evaluateJS.
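The payload shape above can be captured as TypeScript types with a small normalization step before an execution starts. This is a minimal sketch, not the project's actual schema: the field names follow the sample payload and the runner outline, while `script` (for evaluateJS), `normalizeRunRequest`, and the 900-second default TTL are assumptions made for illustration.

```typescript
// Command shapes taken from the sample payload; `script` is a hypothetical field name.
type Command =
  | { action: "goto"; url: string }
  | { action: "click"; selector: string }
  | { action: "type"; selector: string; text: string }
  | { action: "waitFor"; selector: string; timeoutMs?: number }
  | { action: "sleep"; ms: number }
  | { action: "screenshot"; name: string }
  | { action: "keys"; sequence: string[] }
  | { action: "evaluateJS"; script: string };

interface RunRequest {
  browsers?: string[];
  capabilities?: Record<string, string>;
  commands: Command[];
  sessionTTLSeconds?: number;
}

interface NormalizedRunRequest {
  browsers: string[];
  capabilities: Record<string, string>;
  commands: Command[];
  sessionTTLSeconds: number;
}

const SUPPORTED_ACTIONS: readonly string[] = [
  "goto", "click", "type", "waitFor", "sleep", "screenshot", "keys", "evaluateJS",
];

// Assumed default; the README does not state one.
const DEFAULT_TTL_SECONDS = 900;

// Rejects unknown actions and fills in the documented defaults (browser: chrome).
function normalizeRunRequest(input: unknown): NormalizedRunRequest {
  if (typeof input !== "object" || input === null) {
    throw new Error("payload must be a JSON object");
  }
  const req = input as Partial<RunRequest>;
  const commands = req.commands;
  if (!Array.isArray(commands) || commands.length === 0) {
    throw new Error("commands must be a non-empty array");
  }
  commands.forEach((cmd, i) => {
    const action = (cmd as { action?: unknown }).action;
    if (typeof action !== "string" || SUPPORTED_ACTIONS.indexOf(action) === -1) {
      throw new Error(`commands[${i}]: unsupported action ${String(action)}`);
    }
  });
  return {
    browsers: req.browsers && req.browsers.length > 0 ? req.browsers : ["chrome"],
    capabilities: req.capabilities ?? {},
    commands,
    sessionTTLSeconds: req.sessionTTLSeconds ?? DEFAULT_TTL_SECONDS,
  };
}
```

Normalizing up front means every field the state machine reads is always present, so JSONPath lookups such as `$.capabilities` do not fail on requests that omit optional fields.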

⸻

Orchestration flow (Step Functions)

1. CreateTestGridUrl (SDK integration) for each requested browser (use a Map state for parallelism).
2. SeleniumRunner Lambda:
   • Start RemoteWebDriver with browserName capability + signed URL.
   • Execute JSON commands.
   • On completion, .quit() to close and finalize video.
3. GetTestGridSession: resolve the session ARN/ID (if not captured by runner, list by creation time & ACTIVE/CLOSED).
4. ArtifactCollector Lambda:
   • ListTestGridSessionArtifacts(type="VIDEO") → fetch each artifact URL.
   • Stream to S3 (videos/{executionId}/{browser}/part-N.mp4).
5. Respond (API Gateway integration): pre-signed S3 URLs + metadata (logs if you want).

Note: videos may come in multiple parts; you can return an ordered list or (optionally) concatenate server-side.
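Returning the parts as an ordered list needs a numeric sort, because a plain string sort places part-10 before part-2. The sketch below assumes keys follow the `videos/{executionId}/{browser}/part-N.mp4` layout described above; `partIndex` and `orderVideoParts` are hypothetical helper names.

```typescript
// Extracts N from keys such as "videos/<executionId>/<browser>/part-12.mp4".
function partIndex(key: string): number {
  const match = /part-(\d+)\.mp4$/.exec(key);
  if (match === null) {
    throw new Error(`not a video part key: ${key}`);
  }
  return parseInt(match[1], 10);
}

// Returns a new array sorted by numeric part index; the input is left untouched.
function orderVideoParts(keys: string[]): string[] {
  return keys.slice().sort((a, b) => partIndex(a) - partIndex(b));
}
```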

⸻

CDK: key resources (TypeScript)

// bin/app.ts
new DeviceFarmRunnerStack(app, 'DeviceFarmRunner', { env: { account, region: 'us-west-2' } });

// lib/stack.ts
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigw from 'aws-cdk-lib/aws-apigatewayv2';
import * as integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';

export class DeviceFarmRunnerStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

const bucket = new s3.Bucket(this, 'Videos', {
  encryption: s3.BucketEncryption.S3_MANAGED,
  lifecycleRules: [{ expiration: Duration.days(30) }],
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
});

const seleniumRunner = new lambda.Function(this, 'SeleniumRunner', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/selenium-runner'),
  timeout: Duration.minutes(10),
  memorySize: 1536,
  environment: { VIDEOS_BUCKET: bucket.bucketName }
});

const artifactCollector = new lambda.Function(this, 'ArtifactCollector', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/artifact-collector'),
  timeout: Duration.minutes(5),
  memorySize: 1024,
  environment: { VIDEOS_BUCKET: bucket.bucketName }
});

// IAM for Device Farm TestGrid + S3
[seleniumRunner, artifactCollector].forEach(fn => {
  fn.addToRolePolicy(new iam.PolicyStatement({
    actions: [
      'devicefarm:CreateTestGridUrl',
      'devicefarm:GetTestGridSession',
      'devicefarm:ListTestGridSessions',
      'devicefarm:ListTestGridSessionArtifacts'
    ],
    resources: ['*'] // tighten to specific project ARN in prod
  }));
  bucket.grantReadWrite(fn);
});

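// Step Functions: Map over browsers
const createUrl = new tasks.CallAwsService(this, 'CreateTestGridUrl', {
  service: 'devicefarm',
  action: 'createTestGridUrl',
  // SDK service integrations expect PascalCase parameter names
  parameters: {
    ProjectArn: sfn.JsonPath.stringAt('$.projectArn'),
    ExpiresInSeconds: sfn.JsonPath.numberAt('$.sessionTTLSeconds')
  },
  iamResources: ['*'],
  // Keep the per-browser input and attach the response (Url, Expires) beside it
  resultPath: '$.CreateTestGridUrl'
});

const runSelenium = new tasks.LambdaInvoke(this, 'Run Commands', {
  lambdaFunction: seleniumRunner,
  payload: sfn.TaskInput.fromObject({
    testGridUrl: sfn.JsonPath.stringAt('$.CreateTestGridUrl.Url'),
    browserName: sfn.JsonPath.stringAt('$.browser'),
    // commands and capabilities are structured values, not strings
    commands: sfn.JsonPath.objectAt('$.commands'),
    capabilities: sfn.JsonPath.objectAt('$.capabilities')
  }),
  // Store the handler's return value directly rather than the Invoke envelope
  payloadResponseOnly: true,
  resultPath: '$.runner'
});

const collectArtifacts = new tasks.LambdaInvoke(this, 'Collect Video', {
  lambdaFunction: artifactCollector,
  payload: sfn.TaskInput.fromObject({
    projectArn: sfn.JsonPath.stringAt('$.projectArn'),
    sessionId: sfn.JsonPath.stringAt('$.runner.sessionId'),
    browser: sfn.JsonPath.stringAt('$.browser')
  }),
  payloadResponseOnly: true,
  resultPath: '$.artifacts'
});

const chain = createUrl.next(runSelenium).next(collectArtifacts);
const map = new sfn.Map(this, 'Per Browser', {
  itemsPath: '$.browsers',
  // Each iteration only sees its own item, so pass the shared fields explicitly
  itemSelector: {
    browser: sfn.JsonPath.stringAt('$$.Map.Item.Value'),
    projectArn: sfn.JsonPath.stringAt('$.projectArn'),
    sessionTTLSeconds: sfn.JsonPath.numberAt('$.sessionTTLSeconds'),
    commands: sfn.JsonPath.objectAt('$.commands'),
    capabilities: sfn.JsonPath.objectAt('$.capabilities')
  }
});
map.itemProcessor(chain);

const stateMachine = new sfn.StateMachine(this, 'RunnerSm', {
  definitionBody: sfn.DefinitionBody.fromChainable(map),
  timeout: Duration.minutes(30)
});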

const apiHandler = new lambda.Function(this, 'ApiHandler', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/api'),
  environment: {
    SM_ARN: stateMachine.stateMachineArn,
    PROJECT_ARN: 'arn:aws:devicefarm:us-west-2:123456789012:testgrid-project:...'
  }
});

// The handler starts executions, so the grant goes to the Lambda, not the API stage
stateMachine.grantStartExecution(apiHandler);

const api = new apigw.HttpApi(this, 'RunnerApi', {
  defaultIntegration: new integrations.HttpLambdaIntegration('StartExec', apiHandler)
});

} }

⸻

SeleniumRunner Lambda (Node.js outline)

// lambda/selenium-runner/index.js
const { Builder, By, until, Key } = require('selenium-webdriver');

exports.handler = async (event) => {
  const { testGridUrl, browserName = 'chrome', commands = [], capabilities = {} } = event;

  const driver = await new Builder()
    .usingServer(testGridUrl)
    .withCapabilities({ browserName, ...capabilities })
    .build();

  // Capture the session ID up front; use it alongside projectArn for GetTestGridSession
  const sessionId = (await driver.getSession()).getId();

  let error;
  try {
    for (const cmd of commands) {
      switch (cmd.action) {
        case 'goto': await driver.get(cmd.url); break;
        case 'click': await driver.findElement(By.css(cmd.selector)).click(); break;
        case 'type': await driver.findElement(By.css(cmd.selector)).sendKeys(cmd.text); break;
        case 'waitFor': await driver.wait(until.elementLocated(By.css(cmd.selector)), cmd.timeoutMs || 10000); break;
        case 'keys':
          // sendKeys is variadic, so spread the mapped sequence
          await driver.actions().sendKeys(...cmd.sequence.map(k => Key[k] || k)).perform();
          break;
        case 'sleep': await driver.sleep(cmd.ms); break;
        case 'screenshot':
          // Returns a base64-encoded PNG; this outline discards it
          await driver.takeScreenshot();
          break;
        // evaluateJS is listed as supported but not implemented in this outline
      }
    }
  } catch (err) {
    // Record the failure but still return the session ID so the video can be collected
    error = err instanceof Error ? err.message : String(err);
  } finally {
    // Closing the session finalizes the video, even when a command fails
    await driver.quit();
  }
  return { sessionId, ok: error === undefined, error };
};

(Using TestGrid’s signed URL with RemoteWebDriver is the intended path.)

⸻

ArtifactCollector Lambda (Node.js outline)

// lambda/artifact-collector/index.js
// Node.js 20 provides a global fetch, so node-fetch is not needed
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const { DeviceFarmClient, GetTestGridSessionCommand, ListTestGridSessionArtifactsCommand } = require('@aws-sdk/client-device-farm');

const s3 = new S3Client();
const df = new DeviceFarmClient();

exports.handler = async ({ projectArn, sessionId, browser }) => {
  // Resolve ARN from projectArn + sessionId
  const { testGridSession } = await df.send(new GetTestGridSessionCommand({ projectArn, sessionId }));
  const sessionArn = testGridSession.arn;

  // List artifacts of type VIDEO (this outline reads only the first page of results)
  const out = await df.send(new ListTestGridSessionArtifactsCommand({ sessionArn, type: 'VIDEO' }));
  const uploaded = [];
  let part = 0;
  for (const a of out.artifacts || []) {
    const res = await fetch(a.url);
    if (!res.ok) throw new Error(`Artifact download failed with HTTP ${res.status}`);
    const body = Buffer.from(await res.arrayBuffer());
    const key = `videos/${sessionId}/${browser}/part-${++part}.mp4`;
    await s3.send(new PutObjectCommand({
      Bucket: process.env.VIDEOS_BUCKET,
      Key: key,
      Body: body,
      ContentType: 'video/mp4'
    }));
    uploaded.push({ s3Key: key });
  }

  // Return pre-signed URLs in API layer (or sign here if you prefer)
  return { sessionArn, parts: uploaded };
};

(Listing VIDEO artifacts and downloading them is the supported flow; video may be split across multiple artifacts.)
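Because a session's video can span several artifacts, the list call may return more than one page. ListTestGridSessionArtifacts accepts a nextToken and returns one when more results remain. The helper below is a sketch of draining every page; `listAllArtifacts` and `ArtifactPage` are hypothetical names, and the page fetcher is injected so the loop can be exercised without AWS credentials.

```typescript
// Minimal shape of one page of results; mirrors the `artifacts` and `nextToken` response fields.
interface ArtifactPage<T> {
  artifacts?: T[];
  nextToken?: string;
}

// Calls fetchPage repeatedly, following nextToken until no further page is reported.
async function listAllArtifacts<T>(
  fetchPage: (nextToken?: string) => Promise<ArtifactPage<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let token: string | undefined = undefined;
  do {
    const page: ArtifactPage<T> = await fetchPage(token);
    all.push(...(page.artifacts ?? []));
    token = page.nextToken;
  } while (token !== undefined);
  return all;
}

// Possible wiring inside the collector (not executed here):
//   const artifacts = await listAllArtifacts((nextToken) =>
//     df.send(new ListTestGridSessionArtifactsCommand({ sessionArn, type: "VIDEO", nextToken })));
```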

⸻

API request/response

Request (POST /run): the JSON shown above. Response: array per browser with pre-signed S3 URLs (and optional logs).

{
  "runId": "abc123",
  "results": [
    {
      "browser": "chrome",
      "sessionArn": "arn:aws:devicefarm:...:testgrid-session:...",
      "video": [
        "https://s3.amazonaws.com/your-bucket/videos/abc123/chrome/part-1.mp4?X-Amz-Expires=600",
        "https://s3.amazonaws.com/your-bucket/videos/abc123/chrome/part-2.mp4?..."
      ]
    }
  ]
}
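One way to assemble this response in the API layer is to map each browser's collector output to signed links. This is a sketch under assumptions: `CollectorResult` is a hypothetical flattened shape (browser plus the collector's sessionArn and parts), and `signUrl` stands in for whatever signer you use, for example getSignedUrl from @aws-sdk/s3-request-presigner with a GetObjectCommand.

```typescript
// Hypothetical per-browser input: the collector's output plus the browser name.
interface CollectorResult {
  browser: string;
  sessionArn: string;
  parts: { s3Key: string }[];
}

interface RunResponse {
  runId: string;
  results: { browser: string; sessionArn: string; video: string[] }[];
}

// Signs every part in order and returns the response body shown above.
async function buildRunResponse(
  runId: string,
  perBrowser: CollectorResult[],
  signUrl: (s3Key: string) => Promise<string>,
): Promise<RunResponse> {
  const results = await Promise.all(
    perBrowser.map(async (r) => ({
      browser: r.browser,
      sessionArn: r.sessionArn,
      // Promise.all preserves input order, so parts stay in sequence
      video: await Promise.all(r.parts.map((p) => signUrl(p.s3Key))),
    })),
  );
  return { runId, results };
}
```

Injecting the signer keeps the shaping logic free of AWS dependencies, which makes it straightforward to cover in the contract tests.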

⸻

IAM & security essentials

• Allow Lambdas: devicefarm:CreateTestGridUrl, devicefarm:GetTestGridSession, devicefarm:ListTestGridSessions, devicefarm:ListTestGridSessionArtifacts; S3 PutObject/GetObject.
• Scope permissions to your project ARN (least privilege).
• If you must hit private endpoints, configure TestGrid VPC (only in us-west-2) and peer to other regions as needed.

⸻

Operational notes & limits

• URL TTL: 60–86,400 seconds; set sessionTTLSeconds accordingly.
• Video availability: recording runs start-to-end of session; closing the driver finalizes artifacts.
• Finding the session: use GetTestGridSession(projectArn, sessionId); if needed, ListTestGridSessions with creation time & status filters.
• Costs: billed per-minute of browser time (parallel Map = faster but more minutes). See Device Farm pricing.
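The TTL bounds above can be enforced before CreateTestGridUrl is called, so an out-of-range sessionTTLSeconds never reaches the API. A minimal sketch, assuming you prefer clamping to rejecting the request; `clampSessionTtl` is a hypothetical helper name.

```typescript
// Bounds for CreateTestGridUrl.expiresInSeconds, as stated in the notes above.
const MIN_TTL_SECONDS = 60;
const MAX_TTL_SECONDS = 86400;

// Rounds down to a whole number of seconds and clamps into the allowed range.
function clampSessionTtl(requestedSeconds: number): number {
  if (!isFinite(requestedSeconds)) {
    throw new Error("sessionTTLSeconds must be a finite number");
  }
  const whole = Math.floor(requestedSeconds);
  return Math.min(MAX_TTL_SECONDS, Math.max(MIN_TTL_SECONDS, whole));
}
```

Since cost scales with browser minutes, a short TTL that still covers the longest expected script also limits how long an abandoned session URL stays usable.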
