WEB_CRAWLING

First install the required packages:

pip install requests
pip install beautifulsoup4
pip install pymupdf
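Two of these packages are imported under a different name than the one passed to pip (`beautifulsoup4` is imported as `bs4`, and `pymupdf` as `fitz`). The snippet below is a small sketch, not part of this repository, for checking that all three are importable before running the scripts; the helper name `missing_packages` is made up for illustration.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "pymupdf": "fitz",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```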

Then run main.py with the following command:

python main.py  # runs all the code

To run the crawler and converter scripts separately, run the following commands in succession:

python crawler.py     # downloads the PDF files
python conversion.py  # converts all the PDF files to .txt and .xml
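The order matters because the conversion step works on the PDFs that the crawler downloads. The contents of main.py are not shown here, so the following is only a sketch of how a driver could run the two scripts in succession and stop if the first one fails; `build_commands` and `run_pipeline` are hypothetical names, not functions from this repository.

```python
import subprocess
import sys

# Script names taken from the commands above, in the order they must run.
STEPS = ["crawler.py", "conversion.py"]


def build_commands(steps=STEPS, python=sys.executable):
    """Build one argument list per script, e.g. ['python', 'crawler.py']."""
    return [[python, script] for script in steps]


def run_pipeline(steps=STEPS):
    """Run each script in order; stop at the first one that fails."""
    for cmd in build_commands(steps):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the conversion is skipped if the crawl did not succeed.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root would then be equivalent to typing the two commands one after the other.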

Check the folder for the generated files.
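As a quick way to do that check, the sketch below counts the files of each expected type in a folder. The output location is not stated in this README, so the default of the current directory is an assumption, and `count_outputs` is a hypothetical helper.

```python
from pathlib import Path


def count_outputs(folder=".", suffixes=(".pdf", ".txt", ".xml")):
    """Count files in `folder` (non-recursive) for each expected suffix."""
    counts = {suffix: 0 for suffix in suffixes}
    for path in Path(folder).iterdir():
        suffix = path.suffix.lower()
        if path.is_file() and suffix in counts:
            counts[suffix] += 1
    return counts
```

For example, `print(count_outputs("."))` shows how many .pdf, .txt and .xml files are present; pass a different path if the scripts write somewhere else.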