A comprehensive application for evaluating document compliance with Utah's General Records Schedule (GRS) retention policies using AWS Bedrock and generative AI.
- Overview
- Technology Stack
- Features
- Installation
- Configuration
- Usage
- Enhanced Knowledge Base Format
- Knowledge Base Configuration
- Large Document Handling
- Architecture
- Development
- Deployment
- Troubleshooting
- Contributing
- License
The Utah Government Document Compliance Evaluator is an AI-powered application that helps government agencies evaluate documents for compliance with Utah's General Records Schedule (GRS) retention policies. The application analyzes document content, determines the appropriate GRS category and item number, and provides compliance recommendations based on document date and retention period requirements.
- Python 3.9+: Primary programming language
- AWS Bedrock: Foundation model service for generative AI capabilities
- Streamlit: Web application framework for the user interface
- Boto3: AWS SDK for Python to interact with AWS services
- PyPDF2: PDF parsing and text extraction
- python-docx: DOCX parsing and text extraction
- Pandas: Data manipulation and analysis
- JSON/JSONL: Enhanced knowledge base format
- AWS Bedrock Agent Runtime: Powers the compliance agent with generative AI capabilities
- Amazon SageMaker: Hosts the application (optional deployment target)
- AWS IAM: Manages access permissions to AWS resources
- Upload and process PDF, DOCX, and TXT files
- Automatic document type detection
- Extraction of key metadata (dates, document type, etc.)
- Content analysis for better classification
- Identification of applicable GRS item number
- Determination of retention period requirements
- Compliance status assessment (Compliant, Non-Compliant, Needs Review)
- Detailed recommendations for document handling
- Clean, intuitive web interface
- Document preview and editing capabilities
- Chat interface for asking questions about compliance rules
- Visual indicators for compliance status
- Sample documents for demonstration purposes
- Enhanced JSONL format for better document classification
- Structured metadata for improved semantic matching
- Related items linking for comprehensive compliance understanding
- Document type and keyword extraction for better matching
- Automatic truncation of large documents
- PDF processing limited to first 10 pages
- Character limit of 50,000 for all document types
- Clear indication when documents are truncated
- Python 3.9 or higher
- AWS account with Bedrock access
- AWS CLI configured with appropriate credentials
- Clone the repository:
git clone https://github.com/your-organization/utah-compliance-evaluator.git
cd utah-compliance-evaluator
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure AWS credentials:
aws configure
- Create an AWS Bedrock agent:
  - Navigate to the AWS Bedrock console
  - Create a new agent with an appropriate knowledge base
  - Note the agent ID and agent alias ID
- Update the agent configuration in compliance-agent.py:
# Bedrock agent configuration
agent_id = "YOUR_AGENT_ID"  # Supervisor agent ID
agent_alias_id = "YOUR_AGENT_ALIAS_ID"  # Agent alias ID
- Configure AWS region:
client = boto3.client('bedrock-agent-runtime', region_name='YOUR_REGION')
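For reference, the sketch below shows one way these values can be used to call the agent through the bedrock-agent-runtime invoke_agent API. It is a minimal sketch, not the application's actual invocation code in compliance-agent.py:

```python
import uuid

import boto3

# Minimal sketch: send a prompt to the configured agent and collect the
# streamed completion. Replace the placeholders with your own values.
client = boto3.client("bedrock-agent-runtime", region_name="YOUR_REGION")
agent_id = "YOUR_AGENT_ID"
agent_alias_id = "YOUR_AGENT_ALIAS_ID"

def ask_agent(prompt: str) -> str:
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=str(uuid.uuid4()),  # one session per conversation
        inputText=prompt,
    )
    # The completion arrives as an event stream of byte chunks.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

print(ask_agent("What is the retention period for meeting minutes?"))
```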
- Prepare the GRS data in CSV format (ScheduleItems.csv)
- Convert to enhanced JSONL format:
python convert_to_enhanced_jsonl.py
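The conversion logic lives in convert_to_enhanced_jsonl.py. The sketch below shows the general shape of such a conversion; the CSV column names are assumptions and should be adjusted to match ScheduleItems.csv:

```python
import json
import re

import pandas as pd

# Illustrative sketch of the CSV -> enhanced JSONL conversion.
# Column names ("ItemNumber", "Title", ...) are assumptions.
STOPWORDS = {"the", "and", "for", "are", "this", "that", "with", "then"}

def extract_keywords(text: str) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if len(w) > 3 and w not in STOPWORDS})

df = pd.read_csv("ScheduleItems.csv")
with open("enhanced_compliance_records.jsonl", "w", encoding="utf-8") as out:
    for _, row in df.iterrows():
        record = {
            "grs_item_number": str(row["ItemNumber"]),
            "title": row["Title"],
            "description": row["Description"],
            "retention_period": row["Retention"],
            "category": row["Category"],
            "keywords": extract_keywords(f"{row['Title']} {row['Description']}"),
        }
        out.write(json.dumps(record) + "\n")
```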
- Start the Streamlit application:
streamlit run compliance-agent.py
- Access the application in your web browser at http://localhost:8501
- Upload Document: Use the file uploader to submit a PDF, DOCX, or TXT file
- Review Content: The document text will be displayed in the main area
- Evaluate: Click the "Evaluate Document" button to analyze for compliance
- Review Results: See the compliance status, GRS item, and recommendations
- Ask Questions: Use the chat interface for additional compliance inquiries (a simplified sketch of this flow follows below)
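The steps above correspond roughly to a standard Streamlit pattern. The sketch below is illustrative only, not the actual compliance-agent.py code; extract_text is a stub and ask_agent refers to the helper sketched in the Configuration section:

```python
import streamlit as st

st.title("Utah Government Document Compliance Evaluator")

def extract_text(file) -> str:
    # Placeholder: the application uses PyPDF2 / python-docx for PDF and DOCX.
    return file.getvalue().decode("utf-8", errors="ignore")

uploaded = st.file_uploader("Upload a document", type=["pdf", "docx", "txt"])
if uploaded is not None:
    text = extract_text(uploaded)
    st.text_area("Document content", text, height=300)

    if st.button("Evaluate Document"):
        # ask_agent: see the invoke_agent sketch in the Configuration section
        result = ask_agent(f"Evaluate this document for GRS compliance:\n{text}")
        st.markdown(result)

question = st.chat_input("Ask a compliance question")
if question:
    st.chat_message("user").write(question)
    st.chat_message("assistant").write(ask_agent(question))
```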
The application includes sample documents for demonstration:
- Monthly Report
- Budget Proposal
- Meeting Minutes
Select any sample from the sidebar to see how the compliance evaluation works.
The application uses an enhanced knowledge base format for better document classification and compliance determination:
{
"grs_item_number": "949",
"title": "Protest files",
"description": "These are written protests by owners of property to be assessed in a special improvement district. The governing body hears protests and approves changes or cancels districts.",
"retention_period": "Retain for 2 years after resolution of issue, and then destroy records.",
"category": "Special Assessment",
"status": "Current",
"approved_date": "1989-03-01",
"document_types": ["protest", "special assessment document", "file"],
"keywords": ["improvement", "changes", "files", "governing", "assessed", "body", "protest", "cancels", "written", "hears", "districts", "special", "assessment", "district", "property", "owners", "protests", "approves"],
"related_items": ["948", "950", "953"]
}
- Structured JSON Format
  - Clear field names for direct access to information
  - Consistent structure for all GRS items
- Enhanced Metadata
  - grs_item_number: Explicit field for the GRS item number
  - document_types: List of document types this GRS item applies to
  - keywords: Extracted meaningful terms for better semantic matching
  - related_items: Cross-references to related GRS items
- Benefits for AI Processing
  - More accurate document classification
  - Better matching of user queries to relevant GRS items (see the sketch below)
  - Improved compliance determination
  - Reduced need for excessive prompt engineering
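As a rough illustration of why the keywords field improves matching, a local scoring pass over the JSONL could look like the sketch below. In the application itself retrieval is handled by the Bedrock knowledge base, so this is purely illustrative:

```python
import json

# Rank GRS items by keyword overlap with terms extracted from a document.
# Illustrative only; the application relies on the Bedrock knowledge base.
with open("enhanced_compliance_records.jsonl", encoding="utf-8") as f:
    grs_items = [json.loads(line) for line in f]

def rank_items(document_terms: set[str], top_k: int = 3):
    scored = [
        (len(document_terms & set(item["keywords"])), item)
        for item in grs_items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        (score, item["grs_item_number"], item["title"])
        for score, item in scored[:top_k]
        if score > 0
    ]

print(rank_items({"protest", "special", "assessment", "district"}))
```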
The application uses AWS Bedrock's knowledge base capabilities with optimized configuration for better document matching:
We use Amazon Bedrock Data Automation as the parsing strategy to extract structured information from GRS documents, which helps identify key fields like GRS item numbers, document types, and retention periods.
We use Semantic Chunking with the following parameters:
- Max Buffer Size for Comparing Sentence Groups: 1
  - Allows for comparing adjacent sentence groups
  - Helps maintain logical connections between related content
- Max Token Size for a Chunk: 600
  - Ensures complete GRS items stay together in a single chunk
  - Prevents splitting related information across multiple chunks
- Breakpoint Threshold for Sentence Group Similarity: 85
  - Creates distinct chunks for different GRS items
  - Keeps related information together within each GRS item
  - Balances between too many small chunks and too few large chunks
This configuration provides:
- Better matching between documents and GRS items
- More accurate identification of document types
- Improved compliance determination
- Reduced errors in GRS item number assignment
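These chunking parameters are normally set in the Bedrock console when creating the knowledge base data source, but they can also be supplied through the bedrock-agent API. The sketch below shows roughly how; treat the exact field names (semanticChunkingConfiguration, breakpointPercentileThreshold, and so on) as assumptions to verify against the current boto3 documentation, and the bucket and knowledge base identifiers as placeholders:

```python
import boto3

# Sketch of supplying the semantic chunking parameters programmatically.
# Verify field names against the current bedrock-agent API documentation.
bedrock_agent = boto3.client("bedrock-agent", region_name="YOUR_REGION")

bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KNOWLEDGE_BASE_ID",
    name="grs-enhanced-jsonl",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::YOUR_BUCKET"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 600,                     # keep a full GRS item in one chunk
                "bufferSize": 1,                      # compare adjacent sentence groups
                "breakpointPercentileThreshold": 85,  # split between distinct GRS items
            },
        }
    },
)
```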
The application automatically handles large documents by:
- PDF Processing Limits
  - Processes only the first 10 pages of PDF documents
  - Extracts text up to 50,000 characters
  - Provides metadata about truncation
- DOCX Processing Limits
  - Processes paragraphs up to 50,000 characters
  - Tracks paragraph count and truncation information
- Text File Limits
  - Limits plain text files to 50,000 characters
  - Provides metadata about truncation
- User Interface Indicators
  - Clear visual indicators when documents are truncated
  - Information about how many pages/paragraphs were processed
  - Transparency about document processing limitations
- Prompt Engineering
  - Informs the AI agent when working with truncated documents
  - Focuses analysis on the most relevant portions of documents
This approach ensures that the compliance agent can efficiently process documents of any size while maintaining accuracy in classification and compliance determination.
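As an illustration, the PDF limits described above could be implemented with PyPDF2 along the following lines. This is a simplified sketch, not the exact code in compliance-agent.py:

```python
from PyPDF2 import PdfReader

MAX_PAGES = 10
MAX_CHARS = 50_000

def extract_pdf_text(file_obj):
    """Extract text from the first MAX_PAGES pages, capped at MAX_CHARS."""
    reader = PdfReader(file_obj)
    total_pages = len(reader.pages)
    pages_to_read = min(total_pages, MAX_PAGES)

    text = "\n".join(
        reader.pages[i].extract_text() or "" for i in range(pages_to_read)
    )
    truncated = total_pages > MAX_PAGES or len(text) > MAX_CHARS

    # The metadata is surfaced in the UI as a truncation notice.
    return text[:MAX_CHARS], {
        "pages_processed": pages_to_read,
        "total_pages": total_pages,
        "truncated": truncated,
    }
```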
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Web Interface  │────▶│    Document     │────▶│   AWS Bedrock   │
│   (Streamlit)   │     │    Processor    │     │  Agent Runtime  │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│   User Input    │     │    Document     │     │    Knowledge    │
│     Handler     │     │    Analyzer     │     │  Base (JSONL)   │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
- User uploads document through Streamlit interface
- Document processor extracts and limits text content
- Document analyzer extracts metadata and key information
- Enhanced prompt is created with document content and metadata (see the sketch after this list)
- AWS Bedrock agent processes the document and determines compliance
- Results are parsed and displayed to the user
- User can ask follow-up questions through the chat interface
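Step 4, building the enhanced prompt, might look roughly like the sketch below. The variable names and prompt wording are illustrative rather than the application's exact prompt:

```python
# Sketch of assembling the enhanced prompt from document text and metadata.
# The wording is illustrative, not the exact prompt used by the application.
def build_prompt(text: str, metadata: dict) -> str:
    truncation_note = (
        "Note: the document was truncated; analyze the available portion."
        if metadata.get("truncated")
        else ""
    )
    return (
        "Evaluate the following Utah government document against the "
        "General Records Schedule (GRS).\n"
        f"Detected document type: {metadata.get('document_type', 'unknown')}\n"
        f"Document date: {metadata.get('document_date', 'unknown')}\n"
        f"{truncation_note}\n"
        "Return the GRS item number, retention period, compliance status "
        "(Compliant, Non-Compliant, Needs Review), and recommendations.\n\n"
        f"Document content:\n{text}"
    )
```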
utah-compliance-evaluator/
├── compliance-agent.py # Main application file
├── convert_to_enhanced_jsonl.py # Conversion script
├── requirements.txt # Python dependencies
├── ScheduleItems.csv # Original GRS data
├── enhanced_compliance_records.jsonl # Enhanced knowledge base
├── README.md # This documentation
├── AmazonQ.md # Amazon Q integration guide
└── sm-monitoring/ # SageMaker monitoring components
- Local Development:
  - Make changes to the application code
  - Test locally using streamlit run compliance-agent.py
  - Verify functionality with sample documents
- Knowledge Base Updates:
  - Update the CSV data as needed
  - Run the conversion script to generate updated JSONL
  - Test with the updated knowledge base
- AWS Bedrock Agent Updates:
  - Make changes to the agent configuration in the AWS console
  - Update the agent IDs in the application code
  - Test the integration with the updated agent
To deploy to Streamlit Cloud:
- Push your code to a GitHub repository
- Connect your repository to Streamlit Cloud
- Configure the deployment settings
- Deploy the application

To deploy to Amazon SageMaker (optional):
- Package the application as a SageMaker-compatible Docker image
- Upload the image to Amazon ECR
- Create a SageMaker endpoint configuration
- Deploy the endpoint
- Configure access permissions
For production deployments, use environment variables for sensitive configuration:
import os
agent_id = os.environ.get("BEDROCK_AGENT_ID")
agent_alias_id = os.environ.get("BEDROCK_AGENT_ALIAS_ID")
region = os.environ.get("AWS_REGION", "us-west-2")
- AWS Authentication Errors:
  - Verify AWS credentials are configured correctly
  - Check IAM permissions for Bedrock access
  - Ensure the region is set correctly
- Document Processing Errors:
  - Check if the document is password-protected
  - Verify the document format is supported
  - Try with a smaller document if processing fails
- Agent Response Issues:
  - Check the agent configuration in AWS Bedrock
  - Verify the knowledge base is properly formatted and configured
  - Review the knowledge base chunking parameters
  - Ensure the document type is clearly identifiable in the document
If the agent is providing incorrect GRS item numbers:
- Check Knowledge Base Configuration:
  - Verify semantic chunking parameters are optimized
  - Ensure the max token size is sufficient for complete GRS items
  - Adjust the breakpoint threshold if related items are being split
- Document Type Clarity:
  - Ensure documents clearly state their type (e.g., "Health Inspection Report")
  - Include key identifying information in the first few paragraphs
  - For ambiguous documents, add more context about the document purpose
- Test with Simple Queries:
  - Try direct queries like "What is the GRS item for hotel inspection reports?" (see the example below)
  - Verify the agent can correctly retrieve specific GRS items
  - Use the results to diagnose knowledge base access issues
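For example, reusing the ask_agent helper sketched in the Configuration section (a hypothetical helper, not a function shipped with the application):

```python
# Direct retrieval check against the knowledge base through the agent.
print(ask_agent("What is the GRS item for hotel inspection reports?"))
```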
The application includes logging for troubleshooting:
import logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Example usage
logger.info("Processing document: %s", file_name)
logger.error("Error processing document: %s", str(e))
Contributions to the Utah Government Document Compliance Evaluator are welcome!
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Please ensure your code follows the project's style guidelines and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.