This project contains the code for a web crawler and a search engine that can scrape a corpus and search for pages using TfIdf indexing. Data can be stored in memory using maps, or in an SQL relational database. The crawler adheres to the robots.txt file of every domain it crawls, and it crawls concurrently with a configurable number of workers. The main.go file is an example using a crawl of the www.usfca.edu website.
Build the program if you haven't already:
go build -o ./demo

Run using the in-memory map database:
./demo

Run using the on-disk SQL database:
./demo "ondisk" "insert_database_name_here.db"

web_crawler.go:
- Extract - returns the words and hrefs in maps from given html data
- CleanHref - returns the absolute path for a given href
- Download - returns the html data from a given absolute url
- Crawl - crawls a site and collects all the word frequencies for each site it can reach
search_engine.go:
- NewLocalSearchEngine - creates a file server containing a locally stored corpus, then crawls it and starts a search page
- NewSearchEngine - creates a search engine that crawls a pre-existing corpus and does not automatically start a server for a search page
- StartServer - creates a server containing a search page that can display pages ranked by their TfIdf scores
- RankIndexes - creates a sorted slice of indexes for a given multi-term query
- Search - searches the data from a crawl to find the frequency of a word stem on each page of the crawl
- TfIdf - creates a TfIdf score in the form of an index for a document and search term
queue_channel.go:
- Acts as a channel with a variable length buffer for use in crawler concurrency
helper.go:
- Contains helper methods for tests
- Contains helper methods for web_crawler.go and search_engine.go
main.go:
- Contains a demo which searches a Top10 file server
Crawl start
- Find robots.txt for domain and update permissions and crawl delay
- Launch all worker goroutines
Data Flow
- cleanHrefRoutine ---[clean hrefs]--> downloadRoutine
- downloadRoutine ---[raw html data]--> extractRoutine
- extractRoutine ---[raw hrefs]--> cleanHrefRoutine
- extractRoutine ---[doc-word info]--> indexRoutine
Features
- The cap on the number of concurrent workers is configurable and can be removed entirely
- Channels are instances of a custom QueueChannel type with a variable buffer size
- Downloads share the same crawl delay, so increasing workers doesn't violate robots.txt
- Tracks the end condition with a wait group: the count is incremented every time an href is added to the download queue, and decremented when that href's download fails or its indexing finishes.