github-scraper

Language: Go (Golang)

A tool for scraping repositories from GitHub and extracting source code function information, written in Go and using the GitHub API v3. It assumes you're running Linux; if you're on OSX, use the appropriate dependencies. The tool relies on bash commands and so will not work on Windows.

Languages currently supported for function extraction: C, C++, C#, Erlang, Java, Javascript, Lisp, Lua, and Python

Setup / Dependencies

Install MongoDB

Install Exuberant Ctags:

Ubuntu:
sudo apt install exuberant-ctags

Install MongoDB driver for Go:

go get gopkg.in/mgo.v2

Refer to the mgo documentation for further and more up-to-date instructions.
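The scraper stores its results in MongoDB through mgo. As a quick smoke test that the driver and a local mongod are reachable, a minimal sketch like the following can be used (the "scrapertest"/"smoke" database and collection names are placeholders for this check only, not the names the scraper itself uses):

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/mgo.v2"
)

func main() {
	// Connect to a local MongoDB instance on the default port.
	session, err := mgo.Dial("localhost:27017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Insert and count a throwaway document to confirm the driver works.
	c := session.DB("scrapertest").C("smoke")
	if err := c.Insert(map[string]string{"hello": "world"}); err != nil {
		log.Fatal(err)
	}
	n, err := c.Count()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("documents in smoke collection:", n)
}
```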

GOPATH setup

Linux:
nano ~/.bashrc
export GOPATH=$HOME/<path to this repo>
source ~/.bashrc

OSX:
nano ~/.bash_profile
export GOPATH=$HOME/<path to this repo>
source ~/.bash_profile
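After reloading your shell configuration, you can confirm the change took effect with:

go env GOPATH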

If you get an error about GOPATH being unable to find parse/parseFuncHeader.go, build and install the parse package:

cd github-scraper/src/parse
go build
go install

Basic usage:

go run gitscrape.go -q <search terms> -language <programming language>
username:
password:

This will search for all repositories that match the specified search terms. You will be prompted for your username and password. GitHub's API has a rate limit that allows a public IP only 60 requests/hour if the requests are not authenticated with a GitHub account. With basic authentication or OAuth, a client can make up to 5,000 requests/hour. The per_page parameter is set to 100 items by default.

After authenticating, you will be prompted for the name of a directory to clone all the repos into.

directory to clone all repos to:
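For context on the rate limit described above, here is a rough sketch of what an authenticated search request looks like using Go's standard library. This is illustrative only, not the scraper's actual code; the credentials and query string are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Example search query; replace the credentials with your own.
	url := "https://api.github.com/search/repositories?q=language:java&per_page=100"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Basic authentication raises the limit from 60 to 5,000 requests/hour.
	req.SetBasicAuth("your-username", "your-password")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// GitHub reports the remaining quota in these response headers.
	fmt.Println("limit:    ", resp.Header.Get("X-RateLimit-Limit"))
	fmt.Println("remaining:", resp.Header.Get("X-RateLimit-Remaining"))
}
```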

GitHub search API arguments you can pass to the program. See GitHub's "Search code parameters" documentation for more details:

[-q, -i, -size, -forks, -forked, -created, -updated, -user, -repo, -lang, -stars, -sort, -order]

GitHub's API expects the flags to be given in the order listed above for correct results.
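For example, a hypothetical invocation with the flags in that order might look like the following (the exact value formats accepted by gitscrape.go may differ; quote values containing >= so the shell does not interpret them):

go run gitscrape.go -q web -created ">=2015-01-01T00:00:00Z" -lang java -stars ">=100" -sort stars -order desc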

Additional flags:

-all  Start the search from April 10th, 2008, using >=2008-04-10T00:00:00Z in place of the -created flag.
-mc   Mass-clone all repos returned by the search. This feature is not finished yet.

Indefinite usage:

go run gitscrape.go -q <search terms> -language <programming language> -all
authenticate [y/n]: y
directory to clone all repos to: saved_repos
use most recent checkpoint [y/n]: n

Note: The -all flag is meant to circumvent the rate limit by searching through time. GitHub's Search API returns at most 1,000 results per unique search, and each search must contain some search term specified with -q (a single whitespace counts). This means, for example, that if you search for all repositories written in Java, you will only get the first 1,000 even though there are 2,701,409 of them (at the time of writing). Specifying a reverse sort order yields another 1,000 repos from the same search, but that still falls far short of the total.

To get as many repos from a search as possible, -all starts from the launch date of GitHub and steps forward through repository creation dates using the -created flag, under the assumption that Java repos are created at a rate of about 1,000 per 6 hours (a rough estimate from observing the Search API for one minute). Comparing the total_count values from https://api.github.com/search/repositories?q=created:%3C=2016-04-21T00:00:00Z+language:java&per_page=100 and https://api.github.com/search/repositories?q=created:%3C=2016-04-21T06:00:00Z+language:java&per_page=100 (dates picked at random), I observed 1808480 - 1807687 = 793 Java repos created in those 6 hours, so querying over 6-hour intervals seems appropriate. To be a bit more conservative, I cut that in half and search in 3-hour intervals to capture as many repos as possible. You can adjust this assumption as needed.
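The windowed search described above boils down to stepping a created: range forward in fixed increments and issuing one search per window. Here is a minimal sketch of that idea; it assumes nothing about how gitscrape.go actually structures it, and pagination within a window and rate-limit handling are omitted:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// searchWindow issues one repository search restricted to repos created
// inside [start, end]. Parsing and paging the results is omitted here.
func searchWindow(start, end time.Time, lang string) (*http.Response, error) {
	q := fmt.Sprintf("created:%s..%s language:%s",
		start.Format(time.RFC3339), end.Format(time.RFC3339), lang)
	u := "https://api.github.com/search/repositories?per_page=100&q=" + url.QueryEscape(q)
	return http.Get(u)
}

func main() {
	// GitHub's launch date, the same starting point -all uses.
	start := time.Date(2008, time.April, 10, 0, 0, 0, 0, time.UTC)
	window := 3 * time.Hour // conservative: well under 1,000 repos expected per window

	for cur := start; cur.Before(time.Now()); cur = cur.Add(window) {
		resp, err := searchWindow(cur, cur.Add(window), "java")
		if err != nil {
			fmt.Println("search failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("searched window starting", cur.Format(time.RFC3339))
		// A real run would parse resp.Body, record or clone the repos,
		// and respect the rate limit before moving to the next window.
	}
}
```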
