A comprehensive application for evaluating document compliance with Utah's General Records Schedule (GRS) retention policies using AWS Bedrock and generative AI.
- Overview
- Technology Stack
- Features
- Installation
- Configuration
- Usage
- Enhanced Knowledge Base Format
- Knowledge Base Configuration
- Large Document Handling
- Architecture
- Development
- Deployment
- Troubleshooting
- Contributing
- License
The Utah Government Document Compliance Evaluator is an AI-powered application that helps government agencies evaluate documents for compliance with Utah's General Records Schedule (GRS) retention policies. The application analyzes document content, determines the appropriate GRS category and item number, and provides compliance recommendations based on document date and retention period requirements.
- Python 3.9+: Primary programming language
- AWS Bedrock: Foundation model service for generative AI capabilities
- Streamlit: Web application framework for the user interface
- Boto3: AWS SDK for Python to interact with AWS services
- PyPDF2: PDF parsing and text extraction
- python-docx: DOCX parsing and text extraction
- Pandas: Data manipulation and analysis
- JSON/JSONL: Enhanced knowledge base format
- AWS Bedrock Agent Runtime: Powers the compliance agent with generative AI capabilities
- Amazon SageMaker: Hosts the application (optional deployment target)
- AWS IAM: Manages access permissions to AWS resources
- Upload and process PDF, DOCX, and TXT files
- Automatic document type detection
- Extraction of key metadata (dates, document type, etc.)
- Content analysis for better classification
- Identification of applicable GRS item number
- Determination of retention period requirements
- Compliance status assessment (Compliant, Non-Compliant, Needs Review)
- Detailed recommendations for document handling
- Clean, intuitive web interface
- Document preview and editing capabilities
- Chat interface for asking questions about compliance rules
- Visual indicators for compliance status
- Sample documents for demonstration purposes
- Enhanced JSONL format for better document classification
- Structured metadata for improved semantic matching
- Related items linking for comprehensive compliance understanding
- Document type and keyword extraction for better matching
- Automatic truncation of large documents
- PDF processing limited to first 10 pages
- Character limit of 50,000 for all document types
- Clear indication when documents are truncated
- Python 3.9 or higher
- AWS account with Bedrock access
- AWS CLI configured with appropriate credentials
- Clone the repository:
git clone https://github.com/your-organization/utah-compliance-evaluator.git
cd utah-compliance-evaluator
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure AWS credentials:
aws configure
- Create an AWS Bedrock agent:
  - Navigate to the AWS Bedrock console
  - Create a new agent with an appropriate knowledge base
  - Note the agent ID and agent alias ID
- Update the agent configuration in compliance-agent.py:
# Bedrock agent configuration
agent_id = "YOUR_AGENT_ID"  # Supervisor agent ID
agent_alias_id = "YOUR_AGENT_ALIAS_ID"  # Agent alias ID
- Configure AWS region:
client = boto3.client('bedrock-agent-runtime', region_name='YOUR_REGION')
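For reference, the sketch below shows one way these values can be used to call the agent through the bedrock-agent-runtime invoke_agent API. It is a minimal sketch, not the application's actual invocation code in compliance-agent.py:

```python
import uuid

import boto3

# Minimal sketch: send a prompt to the configured agent and collect the
# streamed completion. Replace the placeholders with your own values.
client = boto3.client("bedrock-agent-runtime", region_name="YOUR_REGION")
agent_id = "YOUR_AGENT_ID"
agent_alias_id = "YOUR_AGENT_ALIAS_ID"

def ask_agent(prompt: str) -> str:
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),  # one session per conversation
        inputText=prompt,
    )
    # The completion arrives as an event stream of byte chunks.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

print(ask_agent("What is the retention period for meeting minutes?"))
```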
- Prepare the GRS data in CSV format (ScheduleItems.csv)
- Convert to enhanced JSONL format:
python convert_to_enhanced_jsonl.py
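The conversion logic lives in convert_to_enhanced_jsonl.py. The sketch below shows the general shape of such a conversion; the CSV column names are assumptions and should be adjusted to match ScheduleItems.csv:

```python
import json
import re

import pandas as pd

# Illustrative sketch of the CSV -> enhanced JSONL conversion.
# Column names ("ItemNumber", "Title", ...) are assumptions.
STOPWORDS = {"the", "and", "for", "are", "this", "that", "with", "then"}

def extract_keywords(text: str) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if len(w) > 3 and w not in STOPWORDS})

df = pd.read_csv("ScheduleItems.csv")
with open("enhanced_compliance_records.jsonl", "w", encoding="utf-8") as out:
    for _, row in df.iterrows():
        record = {
            "grs_item_number": str(row["ItemNumber"]),
            "title": row["Title"],
            "description": row["Description"],
            "retention_period": row["Retention"],
            "category": row["Category"],
            "keywords": extract_keywords(f"{row['Title']} {row['Description']}"),
        }
        out.write(json.dumps(record) + "\n")
```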
- Start the Streamlit application:
streamlit run compliance-agent.py
- Access the application in your web browser at http://localhost:8501
- Upload Document: Use the file uploader to submit a PDF, DOCX, or TXT file
- Review Content: The document text will be displayed in the main area
- Evaluate: Click the "Evaluate Document" button to analyze for compliance
- Review Results: See the compliance status, GRS item, and recommendations
- Ask Questions: Use the chat interface for additional compliance inquiries (a simplified sketch of this flow follows below)
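The steps above correspond roughly to a standard Streamlit pattern. The sketch below is illustrative only, not the actual compliance-agent.py code; extract_text is a stub and ask_agent refers to the helper sketched in the Configuration section:

```python
import streamlit as st

st.title("Utah Government Document Compliance Evaluator")

def extract_text(file) -> str:
    # Placeholder: the application uses PyPDF2 / python-docx for PDF and DOCX.
    return file.getvalue().decode("utf-8", errors="ignore")

uploaded = st.file_uploader("Upload a document", type=["pdf", "docx", "txt"])
if uploaded is not None:
    text = extract_text(uploaded)
    st.text_area("Document content", text, height=300)

    if st.button("Evaluate Document"):
        # ask_agent: see the invoke_agent sketch in the Configuration section
        result = ask_agent(f"Evaluate this document for GRS compliance:\n{text}")
        st.markdown(result)

question = st.chat_input("Ask a compliance question")
if question:
    st.chat_message("user").write(question)
    st.chat_message("assistant").write(ask_agent(question))
```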
The application includes sample documents for demonstration:
- Monthly Report
- Budget Proposal
- Meeting Minutes
Select any sample from the sidebar to see how the compliance evaluation works.
The application uses an enhanced knowledge base format for better document classification and compliance determination:
{
"grs_item_number": "949",
"title": "Protest files",
"description": "These are written protests by owners of property to be assessed in a special improvement district. The governing body hears protests and approves changes or cancels districts.",
"retention_period": "Retain for 2 years after resolution of issue, and then destroy records.",
"category": "Special Assessment",
"status": "Current",
"approved_date": "1989-03-01",
"document_types": ["protest", "special assessment document", "file"],
"keywords": ["improvement", "changes", "files", "governing", "assessed", "body", "protest", "cancels", "written", "hears", "districts", "special", "assessment", "district", "property", "owners", "protests", "approves"],
"related_items": ["948", "950", "953"]
}
- Structured JSON Format
  - Clear field names for direct access to information
  - Consistent structure for all GRS items
- Enhanced Metadata
  - grs_item_number: Explicit field for the GRS item number
  - document_types: List of document types this GRS item applies to
  - keywords: Extracted meaningful terms for better semantic matching
  - related_items: Cross-references to related GRS items
- Benefits for AI Processing
  - More accurate document classification
  - Better matching of user queries to relevant GRS items (see the sketch below)
  - Improved compliance determination
  - Reduced need for excessive prompt engineering
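As a rough illustration of why the keywords field improves matching, a local scoring pass over the JSONL could look like the sketch below. In the application itself retrieval is handled by the Bedrock knowledge base, so this is purely illustrative:

```python
import json

# Rank GRS items by keyword overlap with terms extracted from a document.
# Illustrative only; the application relies on the Bedrock knowledge base.
with open("enhanced_compliance_records.jsonl", encoding="utf-8") as f:
    grs_items = [json.loads(line) for line in f]

def rank_items(document_terms: set[str], top_k: int = 3):
    scored = [
        (len(document_terms & set(item["keywords"])), item)
        for item in grs_items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        (score, item["grs_item_number"], item["title"])
        for score, item in scored[:top_k]
        if score > 0
    ]

print(rank_items({"protest", "special", "assessment", "district"}))
```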
The application uses AWS Bedrock's knowledge base capabilities with optimized configuration for better document matching:
We use Amazon Bedrock Data Automation as the parsing strategy to extract structured information from GRS documents, which helps identify key fields like GRS item numbers, document types, and retention periods.
We use Semantic Chunking with the following parameters:
- Max Buffer Size for Comparing Sentence Groups: 1
  - Allows for comparing adjacent sentence groups
  - Helps maintain logical connections between related content
- Max Token Size for a Chunk: 600
  - Ensures complete GRS items stay together in a single chunk
  - Prevents splitting related information across multiple chunks
- Breakpoint Threshold for Sentence Group Similarity: 85
  - Creates distinct chunks for different GRS items
  - Keeps related information together within each GRS item
  - Balances between too many small chunks and too few large chunks
This configuration provides:
- Better matching between documents and GRS items
- More accurate identification of document types
- Improved compliance determination
- Reduced errors in GRS item number assignment
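These chunking parameters are normally set in the Bedrock console when creating the knowledge base data source, but they can also be supplied through the bedrock-agent API. The sketch below shows roughly how; treat the exact field names (semanticChunkingConfiguration, breakpointPercentileThreshold, and so on) as assumptions to verify against the current boto3 documentation, and the bucket and knowledge base identifiers as placeholders:

```python
import boto3

# Sketch of supplying the semantic chunking parameters programmatically.
# Verify field names against the current bedrock-agent API documentation.
bedrock_agent = boto3.client("bedrock-agent", region_name="YOUR_REGION")

bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KNOWLEDGE_BASE_ID",
    name="grs-enhanced-jsonl",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::YOUR_BUCKET"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 600,                     # keep a full GRS item in one chunk
                "bufferSize": 1,                      # compare adjacent sentence groups
                "breakpointPercentileThreshold": 85,  # split between distinct GRS items
            },
        }
    },
)
```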
The application automatically handles large documents by:
- PDF Processing Limits
  - Processes only the first 10 pages of PDF documents
  - Extracts text up to 50,000 characters
  - Provides metadata about truncation
- DOCX Processing Limits
  - Processes paragraphs up to 50,000 characters
  - Tracks paragraph count and truncation information
- Text File Limits
  - Limits plain text files to 50,000 characters
  - Provides metadata about truncation
- User Interface Indicators
  - Clear visual indicators when documents are truncated
  - Information about how many pages/paragraphs were processed
  - Transparency about document processing limitations
- Prompt Engineering
  - Informs the AI agent when working with truncated documents
  - Focuses analysis on the most relevant portions of documents
This approach ensures that the compliance agent can efficiently process documents of any size while maintaining accuracy in classification and compliance determination.
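As an illustration, the PDF limits described above could be implemented with PyPDF2 along the following lines. This is a simplified sketch, not the exact code in compliance-agent.py:

```python
from PyPDF2 import PdfReader

MAX_PAGES = 10
MAX_CHARS = 50_000

def extract_pdf_text(file_obj):
    """Extract text from the first MAX_PAGES pages, capped at MAX_CHARS."""
    reader = PdfReader(file_obj)
    total_pages = len(reader.pages)
    pages_to_read = min(total_pages, MAX_PAGES)

    text = "\n".join(
        reader.pages[i].extract_text() or "" for i in range(pages_to_read)
    )
    truncated = total_pages > MAX_PAGES or len(text) > MAX_CHARS

    # The metadata is surfaced in the UI as a truncation notice.
    return text[:MAX_CHARS], {
        "pages_processed": pages_to_read,
        "total_pages": total_pages,
        "truncated": truncated,
    }
```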
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Web Interface  │────▶│    Document     │────▶│   AWS Bedrock   │
│   (Streamlit)   │     │    Processor    │     │  Agent Runtime  │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│   User Input    │     │    Document     │     │    Knowledge    │
│     Handler     │     │    Analyzer     │     │  Base (JSONL)   │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
- User uploads document through Streamlit interface
- Document processor extracts and limits text content
- Document analyzer extracts metadata and key information
- Enhanced prompt is created with document content and metadata (see the sketch after this list)
- AWS Bedrock agent processes the document and determines compliance
- Results are parsed and displayed to the user
- User can ask follow-up questions through the chat interface
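Step 4, building the enhanced prompt, might look roughly like the sketch below. The variable names and prompt wording are illustrative rather than the application's exact prompt:

```python
# Sketch of assembling the enhanced prompt from document text and metadata.
# The wording is illustrative, not the exact prompt used by the application.
def build_prompt(text: str, metadata: dict) -> str:
    truncation_note = (
        "Note: the document was truncated; analyze the available portion."
        if metadata.get("truncated")
        else ""
    )
    return (
        "Evaluate the following Utah government document against the "
        "General Records Schedule (GRS).\n"
        f"Detected document type: {metadata.get('document_type', 'unknown')}\n"
        f"Document date: {metadata.get('document_date', 'unknown')}\n"
        f"{truncation_note}\n"
        "Return the GRS item number, retention period, compliance status "
        "(Compliant, Non-Compliant, Needs Review), and recommendations.\n\n"
        f"Document content:\n{text}"
    )
```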
utah-compliance-evaluator/
├── compliance-agent.py # Main application file
├── convert_to_enhanced_jsonl.py # Conversion script
├── requirements.txt # Python dependencies
├── ScheduleItems.csv # Original GRS data
├── enhanced_compliance_records.jsonl # Enhanced knowledge base
├── README.md # This documentation
├── AmazonQ.md # Amazon Q integration guide
└── sm-monitoring/ # SageMaker monitoring components
- Local Development:
  - Make changes to the application code
  - Test locally using streamlit run compliance-agent.py
  - Verify functionality with sample documents
- Knowledge Base Updates:
  - Update the CSV data as needed
  - Run the conversion script to generate updated JSONL
  - Test with the updated knowledge base
- AWS Bedrock Agent Updates:
  - Make changes to the agent configuration in the AWS console
  - Update the agent IDs in the application code
  - Test the integration with the updated agent
To deploy to Streamlit Cloud:
- Push your code to a GitHub repository
- Connect your repository to Streamlit Cloud
- Configure the deployment settings
- Deploy the application

To deploy to Amazon SageMaker (optional):
- Package the application as a SageMaker-compatible Docker image
- Upload the image to Amazon ECR
- Create a SageMaker endpoint configuration
- Deploy the endpoint
- Configure access permissions
For production deployments, use environment variables for sensitive configuration:
import os
agent_id = os.environ.get("BEDROCK_AGENT_ID")
agent_alias_id = os.environ.get("BEDROCK_AGENT_ALIAS_ID")
region = os.environ.get("AWS_REGION", "us-west-2")
- AWS Authentication Errors:
  - Verify AWS credentials are configured correctly
  - Check IAM permissions for Bedrock access
  - Ensure the region is set correctly
- Document Processing Errors:
  - Check if the document is password-protected
  - Verify the document format is supported
  - Try with a smaller document if processing fails
- Agent Response Issues:
  - Check the agent configuration in AWS Bedrock
  - Verify the knowledge base is properly formatted and configured
  - Review the knowledge base chunking parameters
  - Ensure the document type is clearly identifiable in the document
If the agent is providing incorrect GRS item numbers:
- Check Knowledge Base Configuration:
  - Verify semantic chunking parameters are optimized
  - Ensure the max token size is sufficient for complete GRS items
  - Adjust the breakpoint threshold if related items are being split
- Document Type Clarity:
  - Ensure documents clearly state their type (e.g., "Health Inspection Report")
  - Include key identifying information in the first few paragraphs
  - For ambiguous documents, add more context about the document purpose
- Test with Simple Queries:
  - Try direct queries like "What is the GRS item for hotel inspection reports?" (see the example below)
  - Verify the agent can correctly retrieve specific GRS items
  - Use the results to diagnose knowledge base access issues
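For example, reusing the ask_agent helper sketched in the Configuration section (a hypothetical helper, not a function shipped with the application):

```python
# Direct retrieval check against the knowledge base through the agent.
print(ask_agent("What is the GRS item for hotel inspection reports?"))
```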
The application includes logging for troubleshooting:
import logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Example usage
logger.info("Processing document: %s", file_name)
logger.error("Error processing document: %s", str(e))
Contributions to the Utah Government Document Compliance Evaluator are welcome!
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Please ensure your code follows the project's style guidelines and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.