diff --git a/README.md b/README.md
index 5540e56..51a27ee 100644
--- a/README.md
+++ b/README.md
@@ -1,69 +1,65 @@
-# Tulsa Transcribe
+# `tgov-scraper-js`
 
-A system for scraping, processing, and serving Tulsa Government meeting videos and documents.
+Scrape and ingest recordings and documents from meetings of the City of Tulsa's municipal Agencies, Boards, and Commissions (ABCs).
 
 ## Architecture
 
-This application is structured as a set of microservices, each with its own responsibility:
+This application is structured as a set of microservices, each with its own responsibility (For more details, see the [architecture documentation](./docs/architecture.md)):
 
 ### 1. TGov Service
-- Scrapes Tulsa Government meeting information
-- Stores committee and meeting data
-- Extracts video URLs from viewer pages
 
 ### 2. Media Service
-- Downloads and processes videos
-- Extracts audio from videos
-- Manages batch processing of videos
 
 ### 3. Documents Service
-- Handles document storage and retrieval
-- Links documents to meeting records
 
 ### 4. Transcription Service
-- Converts audio files to text using the OpenAI Whisper API
-- Stores and retrieves transcriptions with time-aligned segments
-- Manages transcription jobs
-
-For more details, see the [architecture documentation](./docs/architecture.md).
 
 ## Getting Started
 
-### Prerequisites
-
-- Node.js LTS and npm
-- [Encore CLI](https://encore.dev/docs/install)
-- ffmpeg (for video processing)
-- OpenAI API key (for transcription)
-
 ### Setup
 
 1. Clone the repository:
+
 ```bash
-git clone
-cd tulsa-transcribe
+git clone https://github.com/codefortulsa/tgov-scraper-js.git
+cd tgov-scraper-js
 ```
 
-2. Install dependencies:
+2. Install `node` v22 and `npm` v11 using your favorite version manager. If you don't have one, we recommend [nvm](https://github.com/nvm-sh/nvm#installing-and-updating):
+
 ```bash
-npm install
+nvm install 22
+nvm use 22
+nvm install-latest-npm
 ```
 
-3. Run the setup script to configure your environment:
+3. [Install Docker Desktop](https://docs.docker.com/get-docker/)
+
+4. [Install `ffmpeg`](https://ffmpeg.org/download.html)
+
+5. [Install the Encore CLI](https://encore.dev/docs/ts/install#install-the-encore-cli)
+
+6. Install NPM dependencies:
+
 ```bash
-npx ts-node setup.ts
+npm install
 ```
 
-4. Update the `.env` file with your database credentials and API keys:
+7. Copy the example [local secret overrides file](https://encore.dev/docs/ts/primitives/secrets#overriding-local-secrets):
+
+```bash
+cp .secrets.local.cue.EXAMPLE .secrets.local.cue
 ```
-TGOV_DATABASE_URL="postgresql://username:password@localhost:5432/tgov?sslmode=disable"
-MEDIA_DATABASE_URL="postgresql://username:password@localhost:5432/media?sslmode=disable"
-DOCUMENTS_DATABASE_URL="postgresql://username:password@localhost:5432/documents?sslmode=disable"
-TRANSCRIPTION_DATABASE_URL="postgresql://username:password@localhost:5432/transcription?sslmode=disable"
-OPENAI_API_KEY="your-openai-api-key"
+
+8. Set your local secrets:
+
+```sh
+# path: ./.secrets.local.cue
+OPENAI_API_KEY: ""
 ```
 
-5. Run the application using Encore CLI:
+9. Run the application using Encore CLI:
+
 ```bash
 encore run
 ```
@@ -72,43 +68,43 @@ encore run
 
 ### TGov Service
 
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/scrape/tgov` | GET | Trigger a scrape of the TGov website |
-| `/tgov/meetings` | GET | List meetings with filtering options |
-| `/tgov/committees` | GET | List all committees |
-| `/tgov/extract-video-url` | POST | Extract a video URL from a viewer page |
+| Endpoint                  | Method | Description                            |
+| ------------------------- | ------ | -------------------------------------- |
+| `/scrape/tgov`            | GET    | Trigger a scrape of the TGov website   |
+| `/tgov/meetings`          | GET    | List meetings with filtering options   |
+| `/tgov/committees`        | GET    | List all committees                    |
+| `/tgov/extract-video-url` | POST   | Extract a video URL from a viewer page |
 
 ### Media Service
 
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/api/videos/download` | POST | Download videos from URLs |
-| `/api/media/:blobId/info` | GET | Get information about a media file |
-| `/api/videos` | GET | List all stored videos |
-| `/api/audio` | GET | List all stored audio files |
-| `/api/videos/batch/queue` | POST | Queue a batch of videos for processing |
-| `/api/videos/batch/:batchId` | GET | Get the status of a batch |
-| `/api/videos/batch/process` | POST | Process the next batch of videos |
+| Endpoint                     | Method | Description                            |
+| ---------------------------- | ------ | -------------------------------------- |
+| `/api/videos/download`       | POST   | Download videos from URLs              |
+| `/api/media/:blobId/info`    | GET    | Get information about a media file     |
+| `/api/videos`                | GET    | List all stored videos                 |
+| `/api/audio`                 | GET    | List all stored audio files            |
+| `/api/videos/batch/queue`    | POST   | Queue a batch of videos for processing |
+| `/api/videos/batch/:batchId` | GET    | Get the status of a batch              |
+| `/api/videos/batch/process`  | POST   | Process the next batch of videos       |
 
 ### Documents Service
 
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/api/documents/download` | POST | Download and store a document |
-| `/api/documents` | GET | List documents with filtering options |
-| `/api/documents/:id` | GET | Get a specific document |
-| `/api/documents/:id` | PATCH | Update document metadata |
-| `/api/meeting-documents` | POST | Download and link meeting agenda documents |
+| Endpoint                  | Method | Description                                |
+| ------------------------- | ------ | ------------------------------------------ |
+| `/api/documents/download` | POST   | Download and store a document              |
+| `/api/documents`          | GET    | List documents with filtering options      |
+| `/api/documents/:id`      | GET    | Get a specific document                    |
+| `/api/documents/:id`      | PATCH  | Update document metadata                   |
+| `/api/meeting-documents`  | POST   | Download and link meeting agenda documents |
 
 ### Transcription Service
 
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/transcribe` | POST | Request transcription for an audio file |
-| `/jobs/:jobId` | GET | Get the status of a transcription job |
-| `/transcriptions/:transcriptionId` | GET | Get a transcription by ID |
-| `/meetings/:meetingId/transcriptions` | GET | Get all transcriptions for a meeting |
+| Endpoint                              | Method | Description                             |
+| ------------------------------------- | ------ | --------------------------------------- |
+| `/transcribe`                         | POST   | Request transcription for an audio file |
+| `/jobs/:jobId`                        | GET    | Get the status of a transcription job   |
+| `/transcriptions/:transcriptionId`    | GET    | Get a transcription by ID               |
+| `/meetings/:meetingId/transcriptions` | GET    | Get all transcriptions for a meeting    |
 
 ## Cron Jobs