A simple Python script that will collect and analyze the robots.txt files of websites.
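The sketch below only illustrates the basic idea (fetch a site's robots.txt and pull out its Disallow rules); it is not robotSpider.py itself, and the example domain and output are placeholders.

```python
# Minimal illustration, not robotSpider.py itself: fetch one domain's
# robots.txt and print its Disallow rules, using only the standard library.
import urllib.request

def fetch_robots(domain):
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    for line in fetch_robots("example.com").splitlines():
        if line.lower().startswith("disallow"):
            print(line.strip())
```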
- Create a text file list of the domains you wish to search. `top-100.txt` is included as an example; larger lists can be found in the Top-Site-Lists repo.
- Run the command `./robotSpider.py -i YOURINPUTFILE -t NUMBEROFTHREADS`, for example `./robotSpider.py -i top-100.txt`. The number of threads defaults to 200.
- The files will be output into the current directory, prefixed with the name of your input file (a sketch of this workflow follows these steps).
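The following is a rough sketch of the workflow the steps above describe: read domains from the input file, fetch each robots.txt with a pool of worker threads, and write the rules to a `*_rules.csv` file prefixed with the input file's name. The column layout, error handling, and output naming here are assumptions based on those steps, not the actual internals of robotSpider.py.

```python
# Hedged sketch of the described workflow, not the script's real code.
import csv
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_rules(domain):
    # Fetch robots.txt and keep only Allow/Disallow lines; swallow errors.
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return domain, []
    rules = [line.strip() for line in body.splitlines()
             if line.lower().startswith(("allow", "disallow"))]
    return domain, rules

def main(infile, threads=200):
    with open(infile) as f:
        domains = [line.strip() for line in f if line.strip()]
    # Assumed naming convention: top-100.txt -> top-100_rules.csv
    outfile = infile.rsplit(".", 1)[0] + "_rules.csv"
    with ThreadPoolExecutor(max_workers=threads) as pool, \
            open(outfile, "w", newline="") as out:
        writer = csv.writer(out)
        for domain, rules in pool.map(fetch_rules, domains):
            for rule in rules:
                writer.writerow([domain, rule])

if __name__ == "__main__":
    main(sys.argv[1])
```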
## Interesting One Liners
- Find any rules with "test" in the rules file: `cat top-100_rules.csv | grep "test"`
- Find any rules with "beta" in the rules file: `cat top-100_rules.csv | grep "beta"`
- Find any rules with "admin" in the rules file: `cat top-100_rules.csv | grep "admin"`
- Find any rules with ".pdf", ".xls", or ".doc" in the rules file: `cat top-100_rules.csv | grep "\.pdf\|\.xls\|\.doc"`
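For more involved filtering, the same searches can be done in a single pass with a short Python snippet; the column layout of the rules CSV is assumed here, so adjust it to match the real output files.

```python
# Hedged sketch: scan the rules CSV for several interesting keywords at once.
# Assumes each row simply contains rule text; adapt to the real column layout.
import csv

KEYWORDS = ("test", "beta", "admin", ".pdf", ".xls", ".doc")

with open("top-100_rules.csv", newline="") as f:
    for row in csv.reader(f):
        line = ",".join(row)
        if any(keyword in line for keyword in KEYWORDS):
            print(line)
```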
