This project contains the code for a web crawler and a search engine that can scrape a corpus and search for pages using TfIdf indexing. Data can be stored in memory using maps, or in an SQL relational database. The crawler adheres to the robots.txt file of every domain it crawls, and it crawls concurrently with a configurable number of workers. The main.go file is an example using a crawl of the www.usfca.edu website.
Build the program if you haven't already:
go build -o ./demo

Run using the in-memory map database:
./demo

Run using the on-disk SQL database:
./demo "ondisk" "insert_database_name_here.db"

web_crawler.go:
- Extract - returns the words and hrefs in maps from given html data
- CleanHref - returns the absolute path for a given href
- Download - returns the html data from a given absolute url
- Crawl - crawls a site and collects all the word frequencies for each site it can reach
search_engine.go:
- NewLocalSearchEngine - creates a file server containing a locally stored corpus, then crawls it and starts a search page
- NewSearchEngine - creates a search engine that crawls a pre-existing corpus and does not automatically start a server for a search page
- StartServer - creates a server containing a search page that can display pages ranked by their TfIdf scores
- RankIndexes - creates a sorted slice of indexes for a given multi-term query
- Search - searches the data from a crawl to find the frequency of a word stem on each page of the crawl
- TfIdf - creates a TfIdf score in the form of an index for a document and search term
queue_channel.go:
- Acts as a channel with a variable length buffer for use in crawler concurrency
helper.go:
- Contains helper methods for tests
- Contains helper methods for web_crawler.go and search_engine.go
main.go:
- Contains a demo which searches a Top10 file server
Crawl start
- Find robots.txt for domain and update permissions and crawl delay
- Launch all worker goroutines
Data Flow
- cleanHrefRoutine ---[clean hrefs]--> downloadRoutine
- downloadRoutine ---[raw html data]--> extractRoutine
- extractRoutine ---[raw hrefs]--> cleanHrefRoutine
- extractRoutine ---[doc-word info]--> indexRoutine
Features
- The cap on the number of concurrent workers is configurable and can be removed entirely
- Channels are instances of a custom QueueChannel type with a variable buffer size
- Downloads share the same crawl delay, so increasing workers doesn't violate robots.txt
- Tracks the end condition with a wait group: the count is incremented every time an href is added to the download queue, and decremented when that href's download fails or its indexing finishes.