A Python tool to parse bid-information PDFs (e.g. “Delavan PL”) and populate a standard bid spreadsheet template.
- PDF/Text extraction using pdfplumber (with optional OCR for scanned files)
- Regex & NLP–based field parsing
- Pandas-driven template filling and Excel/CSV output
- Command-line interface for batch processing
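
As a rough illustration of the regex-based field parsing listed above, the sketch below turns raw text into a field dict. The field names, the label wording in the patterns, and the sample text are all assumptions made for the example; real bid PDFs will need their own patterns.

```python
import re

# Hypothetical label patterns. These are assumptions for illustration and are
# not taken from an actual bid PDF.
FIELD_PATTERNS = {
    "project_name": re.compile(
        r"^Project(?: Name)?[ \t]*:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE
    ),
    "bid_date": re.compile(
        r"^Bid Date[ \t]*:[ \t]*(\d{1,2}/\d{1,2}/\d{4})[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    ),
    "estimate": re.compile(
        r"^Estimate[ \t]*:[ \t]*\$?([\d,]+(?:\.\d{2})?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_fields(raw_text):
    """Return a dict of field name -> captured value (None when not found)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(raw_text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Made-up sample text standing in for the output of the PDF extraction step.
SAMPLE = "Project Name: Delavan PL\nBid Date: 3/14/2024\nOwner: City\n"
result = extract_fields(SAMPLE)
```

Returning `None` for a missing field, rather than raising, lets a later step decide whether to leave the template cell blank or flag the file for review.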

    bid-extractor/
    ├── src/                # Core modules
    │   ├── parser.py       # PDF → raw text
    │   ├── extractor.py    # raw text → field dict
    │   ├── templater.py    # dict → Excel/CSV
    │   └── cli.py          # entry-point script
    ├── tests/              # Unit tests (pytest)
    ├── data/               # Example PDFs & templates
    ├── .gitignore
    └── README.md

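
For the PDF → raw text step handled by `parser.py`, one possible shape is sketched below. The function names (`pages_to_text`, `ocr_page`, `pdf_to_text`) and the `use_ocr` flag are hypothetical, not the project's actual API. The sketch relies on pdfplumber's `open()`, `pages`, `extract_text()`, and `to_image()`, plus pytesseract's `image_to_string()`, and it assumes the Tesseract binary is installed when OCR is enabled.

```python
def pages_to_text(pages, ocr=None):
    """Join the text of each page, calling ocr(page) when a page has no text layer."""
    chunks = []
    for page in pages:
        # extract_text() may return None for pages without a text layer.
        text = page.extract_text() or ""
        if not text.strip() and ocr is not None:
            text = ocr(page)
        chunks.append(text)
    return "\n".join(chunks)


def ocr_page(page):
    """OCR a single pdfplumber page (assumes pytesseract and Tesseract are installed)."""
    import pytesseract  # imported lazily so OCR stays optional

    image = page.to_image(resolution=300).original  # PIL image of the page
    return pytesseract.image_to_string(image)


def pdf_to_text(path, use_ocr=False):
    """Open a PDF with pdfplumber and return its text, optionally with OCR fallback."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return pages_to_text(pdf.pages, ocr=ocr_page if use_ocr else None)
```

Keeping the page loop separate from the file handling makes the fallback logic easy to unit-test with stand-in page objects, without needing a real PDF.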
- Clone & activate

      git clone git@github.com:<your-org>/bid-extractor.git
      cd bid-extractor
      python3 -m venv venv && source venv/bin/activate
      pip install -r requirements.txt

- Install dependencies

      pip install pdfplumber pandas openpyxl pytesseract spacy
      python -m spacy download en_core_web_sm

- Run the CLI

      python -m src.cli --input data/DelavanPL.pdf \
        --template data/Bid\ information.xlsx \
        --output filled.xlsx
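
The flags used in the command above could be wired up with `argparse` roughly as follows. This is a sketch, not the actual contents of `cli.py`: the `build_parser` name and help strings are made up, and accepting several paths after `--input` is an assumption made to suit batch processing.

```python
import argparse


def build_parser():
    """Build an argument parser for the three flags shown in the README command."""
    parser = argparse.ArgumentParser(
        prog="python -m src.cli",
        description="Fill a bid spreadsheet template from bid-information PDFs.",
    )
    # nargs="+" is an assumption: it lets one run cover several PDFs.
    parser.add_argument("--input", nargs="+", required=True,
                        help="one or more bid-information PDFs")
    parser.add_argument("--template", required=True,
                        help="path to the bid spreadsheet template")
    parser.add_argument("--output", required=True,
                        help="path for the filled Excel/CSV file")
    return parser


# Parsing the same arguments as the shell command (the shell has already
# turned the escaped space in the template path into a plain space).
args = build_parser().parse_args([
    "--input", "data/DelavanPL.pdf",
    "--template", "data/Bid information.xlsx",
    "--output", "filled.xlsx",
])
```

With `nargs="+"`, `args.input` is always a list, so the processing loop is the same for one file or many.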