fix: update README.md #5

Open: wants to merge 1 commit into base: main

124 changes: 60 additions & 64 deletions README.md
@@ -1,69 +1,65 @@
# Tulsa Transcribe
# `tgov-scraper-js`

A system for scraping, processing, and serving Tulsa Government meeting videos and documents.
Scrape and ingest recordings and documents from meetings of the City of Tulsa's municipal Agencies, Boards, and Commissions (ABCs).

## Architecture

This application is structured as a set of microservices, each with its own responsibility:
This application is structured as a set of microservices, each with its own responsibility (for more details, see the [architecture documentation](./docs/architecture.md)):

### 1. TGov Service
- Scrapes Tulsa Government meeting information
- Stores committee and meeting data
- Extracts video URLs from viewer pages

### 2. Media Service
- Downloads and processes videos
- Extracts audio from videos
- Manages batch processing of videos

### 3. Documents Service
- Handles document storage and retrieval
- Links documents to meeting records

### 4. Transcription Service
- Converts audio files to text using the OpenAI Whisper API
- Stores and retrieves transcriptions with time-aligned segments
- Manages transcription jobs

For more details, see the [architecture documentation](./docs/architecture.md).

## Getting Started

### Prerequisites

- Node.js LTS and npm
- [Encore CLI](https://encore.dev/docs/install)
- ffmpeg (for video processing)
- OpenAI API key (for transcription)

### Setup

1. Clone the repository:

```bash
git clone <repository-url>
cd tulsa-transcribe
git clone https://github.com/codefortulsa/tgov-scraper-js.git
cd tgov-scraper-js
```

2. Install dependencies:
2. Install `node` v22 and `npm` v11 using your favorite version manager. If you don't have one, we recommend [nvm](https://github.com/nvm-sh/nvm#installing-and-updating):

```bash
npm install
nvm install 22
nvm use 22
nvm install-latest-npm
```

3. Run the setup script to configure your environment:
3. [Install Docker Desktop](https://docs.docker.com/get-docker/)

4. [Install `ffmpeg`](https://ffmpeg.org/download.html)

5. [Install the Encore CLI](https://encore.dev/docs/ts/install#install-the-encore-cli)

6. Install NPM dependencies:

```bash
npx ts-node setup.ts
npm install
```

4. Update the `.env` file with your database credentials and API keys:
7. Copy the example [local secret overrides file](https://encore.dev/docs/ts/primitives/secrets#overriding-local-secrets):

```bash
cp .secrets.local.cue.EXAMPLE .secrets.local.cue
```
TGOV_DATABASE_URL="postgresql://username:password@localhost:5432/tgov?sslmode=disable"
MEDIA_DATABASE_URL="postgresql://username:password@localhost:5432/media?sslmode=disable"
DOCUMENTS_DATABASE_URL="postgresql://username:password@localhost:5432/documents?sslmode=disable"
TRANSCRIPTION_DATABASE_URL="postgresql://username:password@localhost:5432/transcription?sslmode=disable"
OPENAI_API_KEY="your-openai-api-key"

8. Set your local secrets:

```sh
# path: ./.secrets.local.cue
OPENAI_API_KEY: "<your-openai-api-key>"
```

5. Run the application using Encore CLI:
9. Run the application using Encore CLI:

```bash
encore run
```
@@ -72,43 +68,43 @@ encore run

### TGov Service

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/scrape/tgov` | GET | Trigger a scrape of the TGov website |
| `/tgov/meetings` | GET | List meetings with filtering options |
| `/tgov/committees` | GET | List all committees |
| `/tgov/extract-video-url` | POST | Extract a video URL from a viewer page |
| Endpoint | Method | Description |
| ------------------------- | ------ | -------------------------------------- |
| `/scrape/tgov` | GET | Trigger a scrape of the TGov website |
| `/tgov/meetings` | GET | List meetings with filtering options |
| `/tgov/committees` | GET | List all committees |
| `/tgov/extract-video-url` | POST | Extract a video URL from a viewer page |

### Media Service

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/videos/download` | POST | Download videos from URLs |
| `/api/media/:blobId/info` | GET | Get information about a media file |
| `/api/videos` | GET | List all stored videos |
| `/api/audio` | GET | List all stored audio files |
| `/api/videos/batch/queue` | POST | Queue a batch of videos for processing |
| `/api/videos/batch/:batchId` | GET | Get the status of a batch |
| `/api/videos/batch/process` | POST | Process the next batch of videos |
| Endpoint | Method | Description |
| ---------------------------- | ------ | -------------------------------------- |
| `/api/videos/download` | POST | Download videos from URLs |
| `/api/media/:blobId/info` | GET | Get information about a media file |
| `/api/videos` | GET | List all stored videos |
| `/api/audio` | GET | List all stored audio files |
| `/api/videos/batch/queue` | POST | Queue a batch of videos for processing |
| `/api/videos/batch/:batchId` | GET | Get the status of a batch |
| `/api/videos/batch/process` | POST | Process the next batch of videos |

### Documents Service

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/documents/download` | POST | Download and store a document |
| `/api/documents` | GET | List documents with filtering options |
| `/api/documents/:id` | GET | Get a specific document |
| `/api/documents/:id` | PATCH | Update document metadata |
| `/api/meeting-documents` | POST | Download and link meeting agenda documents |
| Endpoint | Method | Description |
| ------------------------- | ------ | ------------------------------------------ |
| `/api/documents/download` | POST | Download and store a document |
| `/api/documents` | GET | List documents with filtering options |
| `/api/documents/:id` | GET | Get a specific document |
| `/api/documents/:id` | PATCH | Update document metadata |
| `/api/meeting-documents` | POST | Download and link meeting agenda documents |

### Transcription Service

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/transcribe` | POST | Request transcription for an audio file |
| `/jobs/:jobId` | GET | Get the status of a transcription job |
| `/transcriptions/:transcriptionId` | GET | Get a transcription by ID |
| `/meetings/:meetingId/transcriptions` | GET | Get all transcriptions for a meeting |
| Endpoint | Method | Description |
| ------------------------------------- | ------ | --------------------------------------- |
| `/transcribe` | POST | Request transcription for an audio file |
| `/jobs/:jobId` | GET | Get the status of a transcription job |
| `/transcriptions/:transcriptionId` | GET | Get a transcription by ID |
| `/meetings/:meetingId/transcriptions` | GET | Get all transcriptions for a meeting |
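
Several endpoints in the tables above take path parameters such as `:jobId` or `:meetingId`. As a minimal sketch of how a client might build request URLs for them, the helper below fills those placeholders in. `buildEndpointUrl` is a hypothetical helper and not part of this repository, the base URL `http://localhost:4000` is an assumed local address (use whatever `encore run` reports), and the IDs are made up.

```typescript
// Hypothetical helper, not part of the repository: fills `:param` placeholders
// in an endpoint path and joins the result to a base URL.
function buildEndpointUrl(
  baseUrl: string,
  path: string,
  params: Record<string, string> = {},
): string {
  const filled = path.replace(/:([A-Za-z]+)/g, (_match: string, name: string) => {
    const value = params[name];
    if (value === undefined) {
      throw new Error(`Missing value for path parameter "${name}"`);
    }
    // Encode the value so characters like spaces or slashes stay inside one segment.
    return encodeURIComponent(value);
  });
  // Drop trailing slashes from the base so the joined URL has exactly one.
  return baseUrl.replace(/\/+$/, "") + filled;
}

// Assumed local address; adjust to match your environment.
const BASE_URL = "http://localhost:4000";

// Made-up IDs, for illustration only.
const jobUrl = buildEndpointUrl(BASE_URL, "/jobs/:jobId", { jobId: "abc123" });
const meetingUrl = buildEndpointUrl(
  BASE_URL,
  "/meetings/:meetingId/transcriptions",
  { meetingId: "42" },
);

console.log(jobUrl);     // http://localhost:4000/jobs/abc123
console.log(meetingUrl); // http://localhost:4000/meetings/42/transcriptions
```

The resulting strings can be passed to `fetch` or `curl`. The helper throws when a placeholder has no value, so a missing ID fails before a request is sent rather than producing a URL that still contains `:jobId`.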

## Cron Jobs
