
github profile crawler

distributed web crawler for GitHub user profile pages

architecture

message queuing

message queuing in github profile crawler is implemented with MongoDB, which provides db.collection.findAndModify() to modify and return a single document atomically, so a worker can claim a job without racing the other workers.

all the operations on the database are wrapped in DatabaseAccessor.py. there are three queues/collections in the system:

  • queue_crawl - urls of pages that should be downloaded
  • queue_page - downloaded content and classification of the pages
  • profile - users' profiles parsed from the pages
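
as a hedged illustration, a dequeue built on that primitive might look like the pymongo sketch below; the status values and connection details are assumptions for the example, not necessarily what DatabaseAccessor.py actually does:

from pymongo import MongoClient, ReturnDocument

# connection details are placeholders -- see config.py for the real settings
client = MongoClient("127.0.0.1", 27017)
db = client["gitcrawl"]

def take_crawl_job():
    # find_one_and_update() is pymongo's wrapper around findAndModify:
    # it atomically claims one pending url, so two crawler workers can
    # never take the same job. the "status" field is an assumed
    # convention, not necessarily the project's schema.
    return db["queue_crawl"].find_one_and_update(
        {"status": "new"},
        {"$set": {"status": "processing"}},
        return_document=ReturnDocument.AFTER,
    )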

worker

there are four types of workers, and all of them can work independently on separate machines. they share the same take-process-store loop; a minimal sketch of it follows the list.

  • crawler
    • take crawling jobs from queue_crawl
    • download the pages and store them in queue_page
  • assigner
    • take newly downloaded pages from queue_page
    • classify and mark the pages in queue_page
  • parser_follow
    • take following/follower pages from queue_page
    • parse the urls of profile pages, and of the next following/follower pages, from each page
    • add all the parsed urls to queue_crawl as crawling jobs
  • parser_profile
    • take profile pages from queue_page
    • parse users' profiles from the pages
    • store the parsed profiles in profile
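
as a rough sketch of that shared loop, here is what the crawler's version could look like, reusing the hypothetical take_crawl_job() helper from the message queuing section and Python's standard urllib (the real crawler may use a different http client):

import time
from urllib.request import urlopen
from pymongo import MongoClient

db = MongoClient("127.0.0.1", 27017)["gitcrawl"]  # placeholders, see config.py

def crawler_loop():
    # take a crawling job, download the page, store the raw content
    # in queue_page for the assigner, and repeat
    while True:
        job = take_crawl_job()  # hypothetical dequeue sketched above
        if job is None:
            time.sleep(10)  # queue is drained, back off before retrying
            continue
        with urlopen(job["url"]) as response:
            content = response.read().decode("utf-8")
        db["queue_page"].insert_one(
            {"url": job["url"], "content": content, "status": "new"}
        )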

utility

  • watchdog
    • monitor and record the status of the database
    • draw charts based on recent status records
  • reporter
    • host a web page to render database status charts
  • exporter
    • dump all the profiles in json and csv formats (see the sketch after this list)
  • launcher
    • verify that the system works against a minimal set of targets
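
for a rough idea of what the exporter does, a minimal dump of the profile collection to both formats could look like the following; the field handling is illustrative, not the project's actual schema:

import csv
import json
from pymongo import MongoClient

db = MongoClient("127.0.0.1", 27017)["gitcrawl"]  # placeholders, see config.py

def export_profiles():
    # read every parsed profile, drop mongodb's internal _id,
    # and write the rest out as json and csv
    profiles = [
        {key: value for key, value in doc.items() if key != "_id"}
        for doc in db["profile"].find()
    ]
    with open("profiles.json", "w") as f:
        json.dump(profiles, f, indent=2, default=str)
    if profiles:
        with open("profiles.csv", "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=sorted(profiles[0]), extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(profiles)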

workflow

figure of workflow

usage

configuration

before you deploy, don't forget to change the database settings in config.py:

property        default        note
config_db_addr  127.0.0.1      ip of the database host
config_db_port  27017          port of the database host
config_db_name  gitcrawl       the database to authenticate
config_db_user  YOUR_USERNAME  the name of the user to authenticate
config_db_pass  YOUR_PASSWORD  the password of the user to authenticate
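
the corresponding block in config.py presumably looks something like this, with the defaults from the table above as placeholders:

# database settings -- replace the placeholders with your deployment's values
config_db_addr = "127.0.0.1"      # ip of the database host
config_db_port = 27017            # port of the database host
config_db_name = "gitcrawl"       # the database to authenticate
config_db_user = "YOUR_USERNAME"  # the name of the user to authenticate
config_db_pass = "YOUR_PASSWORD"  # the password of the user to authenticate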

deployment

install the project's dependencies with:

pip3 install -r requirements.txt

and then you may verify that everything works with python3 Launcher.py before launching all the workers with screen or tmux:

python3 Crawler.py
python3 Assigner.py
python3 ParserFollow.py
python3 ParserProfile.py

you can also launch the utilities to monitor the progress:

python3 WatchDog.py
python3 Reporter.py

after the crawl has stopped, export the profiles:

python3 Exporter.py

number of workers

worker          minimum  suggested
crawler         1        6
assigner        1        1
parser_follow   1        2
parser_profile  1        1
watchdog        0        1
reporter        0        1
