
Search Engine and Web Crawler Project

This project contains the code for a web crawler and a search engine that can scrape a corpus and search for pages based on TfIdf indexing. Crawl data can be stored in memory using maps or in an SQL relational database. The crawler adheres to the robots.txt file of every domain it crawls, and it crawls concurrently with a configurable number of workers. The main.go file is an example that crawls the www.usfca.edu website.

How to try out the demo

Build the program if you haven't already

go build -o ./demo

Run using in-memory map database

./demo

Run using on-disk SQL database

./demo "ondisk" "insert_database_name_here.db"
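The demo's argument handling can be sketched as follows; chooseBackend is a hypothetical helper (not the project's actual API) showing how the two invocations above might select a storage backend:

```go
package main

import (
	"fmt"
	"os"
)

// chooseBackend mirrors the flags described above: no arguments selects
// the in-memory map database, while "ondisk" plus a file name selects
// the SQL database. Name and return value are illustrative only.
func chooseBackend(args []string) string {
	if len(args) >= 2 && args[0] == "ondisk" {
		return fmt.Sprintf("sqlite file %q", args[1])
	}
	return "in-memory maps"
}

func main() {
	fmt.Println("storing crawl data in:", chooseBackend(os.Args[1:]))
}
```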

Organization

web_crawler.go:

  • Extract - returns the words and hrefs in maps from given html data
  • CleanHref - returns the absolute path for a given href
  • Download - returns the html data from a given absolute url
  • Crawl - crawls a site and collects all the word frequencies for each site it can reach
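As a hedged illustration of the CleanHref step, the standard library's net/url can resolve a relative href against its page's URL; the real CleanHref may normalize further:

```go
package main

import (
	"fmt"
	"net/url"
)

// cleanHref resolves a possibly relative href against the page's base
// URL, the job the list above assigns to CleanHref. This is a sketch
// using the standard library, not the project's actual implementation.
func cleanHref(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	abs, _ := cleanHref("https://www.usfca.edu/academics/", "../admissions")
	fmt.Println(abs) // https://www.usfca.edu/admissions
}
```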

search_engine.go:

  • NewLocalSearchEngine - creates a file server containing a locally stored corpus, then crawls it and starts a search page
  • NewSearchEngine - creates a search engine that crawls a pre-existing corpus and does not automatically start a server for a search page
  • StartServer - creates a server containing a search page that can display pages ranked by their TfIdf scores
  • RankIndexes - creates a sorted slice of indexes for a given multi-term query
  • Search - searches the data from a crawl to find the frequency of a word stem on each page of the crawl
  • TfIdf - creates a TfIdf score in the form of an index for a document and search term
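The README doesn't show the exact weighting, so the sketch below uses the textbook TF-IDF formula (term frequency times the log of inverse document frequency) as an assumed stand-in for the project's TfIdf:

```go
package main

import (
	"fmt"
	"math"
)

// tfIdf computes a classic TF-IDF score: how often the term appears in
// this document, scaled up when few documents in the corpus contain it.
// The project's actual weighting may differ.
func tfIdf(termCount, docLen, totalDocs, docsWithTerm int) float64 {
	tf := float64(termCount) / float64(docLen)
	idf := math.Log(float64(totalDocs) / float64(docsWithTerm))
	return tf * idf
}

func main() {
	// A term appearing 3 times in a 100-word page, present in 10 of 1000 docs.
	fmt.Printf("%.4f\n", tfIdf(3, 100, 1000, 10))
}
```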

queue_channel.go:

  • Acts as a channel with a variable length buffer for use in crawler concurrency
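One common way to build such a variable-buffer channel in Go is a goroutine that bridges an input and an output channel through a growable slice; this is a sketch of the idea, not the project's actual QueueChannel implementation:

```go
package main

import "fmt"

// QueueChannel exposes an input and an output channel bridged by a
// goroutine holding a growable slice, so senders never block on a
// fixed-size buffer.
type QueueChannel struct {
	In  chan<- string
	Out <-chan string
}

func NewQueueChannel() *QueueChannel {
	in := make(chan string)
	out := make(chan string)
	go func() {
		var buf []string
		for in != nil || len(buf) > 0 {
			// Only offer a send when the buffer has something queued.
			var send chan string
			var next string
			if len(buf) > 0 {
				send = out
				next = buf[0]
			}
			select {
			case v, ok := <-in:
				if !ok {
					in = nil // input closed; drain the buffer, then exit
					continue
				}
				buf = append(buf, v)
			case send <- next:
				buf = buf[1:]
			}
		}
		close(out)
	}()
	return &QueueChannel{In: in, Out: out}
}

func main() {
	q := NewQueueChannel()
	for _, u := range []string{"a", "b", "c"} {
		q.In <- u // buffered internally; no fixed capacity to fill
	}
	close(q.In)
	for v := range q.Out {
		fmt.Println(v)
	}
}
```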

helper.go:

  • Contains helper methods for tests
  • Contains helper methods for web_crawler.go and search_engine.go

main.go:

  • Contains a demo which searches a Top10 file server

Concurrency Design

Crawl start

  • Find robots.txt for domain and update permissions and crawl delay
  • Call all go routines

Data Flow

  • cleanHrefRoutine ---[cleaned hrefs]--> downloadRoutine
  • downloadRoutine ---[raw html data]--> extractRoutine
  • extractRoutine ---[raw hrefs]--> cleanHrefRoutine
  • extractRoutine ---[doc-word info]--> indexRoutine
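A toy, acyclic version of this flow, with stubbed download and extract stages and the href feedback loop omitted so the sketch terminates, might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// countWords wires three channels in the shape of the data flow above:
// cleaned hrefs -> stubbed download -> stubbed extract -> index. The
// stages here are stand-ins, not the project's real routines.
func countWords(cleanHrefs []string) map[string]int {
	hrefs := make(chan string)
	html := make(chan string)
	words := make(chan string)

	// cleanHrefRoutine ---[cleaned hrefs]--> downloadRoutine (stubbed)
	go func() {
		for h := range hrefs {
			html <- "<p>page at " + h + "</p>"
		}
		close(html)
	}()

	// downloadRoutine ---[raw html data]--> extractRoutine (stubbed)
	go func() {
		for doc := range html {
			body := strings.TrimSuffix(strings.TrimPrefix(doc, "<p>"), "</p>")
			for _, w := range strings.Fields(body) {
				words <- w
			}
		}
		close(words)
	}()

	// Feed the pipeline with already-cleaned hrefs.
	go func() {
		for _, h := range cleanHrefs {
			hrefs <- h
		}
		close(hrefs)
	}()

	// extractRoutine ---[doc-word info]--> indexRoutine
	freq := map[string]int{}
	for w := range words {
		freq[w]++
	}
	return freq
}

func main() {
	fmt.Println(countWords([]string{"/a", "/b"})["page"]) // each page contributes one "page"
}
```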

Features

  • Allows the cap on the number of concurrent workers to be removed
  • Channels are a custom QueueChannel class which has a variable buffer size
  • Downloads share the same crawl delay, so increasing workers doesn't violate robots.txt
  • Tracks the end condition with a wait group: the counter is incremented every time an href is added to the download queue, and decremented when the download fails or when indexing of that page finishes
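The end-condition bookkeeping in the last bullet can be sketched like this; crawlDemo and its hard-coded hrefs are illustrative only:

```go
package main

import (
	"fmt"
	"sync"
)

// crawlDemo sketches the wait-group end condition: the counter goes up
// once per href enqueued for download, and comes down when that href
// either fails to download or finishes indexing, so Wait() returns
// exactly when no work is in flight.
func crawlDemo(jobs map[string]bool) int {
	var pending sync.WaitGroup
	indexed := make(chan string, len(jobs))

	for href, ok := range jobs {
		pending.Add(1) // href entered the download queue
		go func(href string, ok bool) {
			defer pending.Done() // download failed OR indexing finished
			if ok {
				indexed <- href
			}
		}(href, ok)
	}

	pending.Wait() // crawl is over: every enqueued href was resolved
	close(indexed)
	return len(indexed)
}

func main() {
	n := crawlDemo(map[string]bool{"/a": true, "/broken": false, "/b": true})
	fmt.Println(n, "pages reached the index")
}
```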
