What is this?

This is the basic structure for a web crawler. It doesn't actually crawl, but everything needed to add that is in place: you just need to select new links to download from each page. I'm not interested in doing that right now, so it's not there.
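For anyone who wants to fill in that missing step, here is a minimal sketch of selecting new links from a downloaded page. It is not part of this repository: `LinkExtractor`, `extract_links` and the `seen` set are hypothetical names, and it assumes only the Python standard library (nothing from Twisted), so how it would plug into the crawler is left open.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html, seen):
    """Return the not-yet-seen absolute http(s) links in html, in page order.

    `seen` is a set of URLs already queued or downloaded; it is updated
    in place so that repeated calls never return the same URL twice.
    """
    parser = LinkExtractor()
    parser.feed(html)
    new_links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop "#fragment".
        url, _fragment = urldefrag(urljoin(base_url, href))
        # Skip mailto:, javascript: and other non-web schemes.
        if urlparse(url).scheme not in ("http", "https"):
            continue
        if url not in seen:
            seen.add(url)
            new_links.append(url)
    return new_links
```

Each URL returned would then be handed to whatever code downloads pages. Keeping `seen` outside the function is one simple way to avoid fetching the same page twice; a real crawler would probably also want per-site limits and robots.txt handling.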

What is there is a way to start a crawler instance and update the HTML page with new results dynamically (over a websocket). There is also code to perform a search with the Bing API.

This code uses the Twisted framework, an asynchronous network engine, which allows it to perform many requests in parallel, serve the website, and communicate over websockets. The websocket protocol implementation comes from Autobahn.

Why?

I mainly wanted to try Twisted; I'm not interested in the HTML handling and classification problems. However, I might add that later on if I feel like it.

remram44/crawler-structure

Folders and files

Latest commit

History

Repository files navigation

What is this?

Why?

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages