Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 2 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,21 +60,8 @@ with some examples.
## Make logs available
You will need to run the script on a computer where the log files you're trying to process are available on the file system for the script to access.

## Download the free IP to geolocation database
The geo-ip uses GeoLite2 data created by MaxMind and is available from
<a href="https://web.archive.org/web/20191222130401/https://dev.maxmind.com/geoip/geoip2/geolite2/" target="_blank">
Internet Archive</a>
(you only need the country database in binary database format).

GeoLite2 is a free IP geolocation database that must be installed. You can download the
database above. Choose the GeoLite2 Country database (binary, gzipped) and extract it to
the maxmind_geoip directory inside the application to use with default configuration,
or put it elsewhere and configure the path as mentioned below.

Newer versions of the database cannot be used with the current version of the script since
additional licensing terms are required such as registering for accounts, having an auto-update
functionality and ensuring it runs regularly. The script has not been updated to take these
additional requirements into account.
## Download the free IP to geolocation database (Optional)
The geo-ip uses GeoLite2 data created by MaxMind to assign views and downloads to specific countries. It can be installed <a href="https://guides.dataverse.org/en/6.6/installation/prerequisites.html#installing-geolite-country-database" target="_blank">as discussed in the Dataverse Guides</a>

## Set up the configuration file
The script takes a number of different configuration parameters in order to run correctly. See **config/config.yaml** for an example. To change the configuration you may edit it at config/config.yaml or you can put it at a different location and then specify it with an environment variable when starting the script like the example below.
Expand Down
8 changes: 4 additions & 4 deletions config/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ def start_time(self):
def end_time(self):
return datetime.datetime.combine(self.end_date, datetime.datetime.min.time()) + datetime.timedelta(days=1)

# memorization of last day
# memoization of last day
def last_day(self):
"""The last day available in the period, either yesterday if in same month, or else last day of month if it has passed"""
if self.last_p_day is not None:
Expand All @@ -170,7 +170,7 @@ def last_day(self):
def month_complete(self):
return (self.run_date >= self.end_time())

# gets/memorizes the robots regexp
# gets/memoizes the robots regexp
def robots_regexp(self):
"""Get the list of robots/crawlers from a list that is one per line
from the URL and make a regular expression for the detection"""
Expand All @@ -184,7 +184,7 @@ def robots_regexp(self):
self.robots_reg = re.compile('|'.join(lines))
return self.robots_reg

# gets/memorizes the machines regexp
# gets/memoizes the machines regexp
def machines_regexp(self):
"""Get the list of machines from a list that is one per line
from the URL and make a regular expression for the detection"""
Expand All @@ -198,7 +198,7 @@ def machines_regexp(self):
self.machines_reg = re.compile('|'.join(lines))
return self.machines_reg

# gets/memorizes the hit-type regexp
# gets/memoizes the hit-type regexp
def hit_type_regexp(self):
"""Make hit type regular expressions for investigation vs request"""
if self.hit_type_reg is not None:
Expand Down
8 changes: 4 additions & 4 deletions config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -33,23 +33,23 @@ path_types:
# if something is a robot or machine user-agent. The text file has one regular expression per line
#robots_url: https://raw.githubusercontent.com/CDLUC3/Make-Data-Count/master/user-agents/lists/robot.txt
#machines_url: https://raw.githubusercontent.com/CDLUC3/Make-Data-Count/master/user-agents/lists/machine.txt
robots_url: https://raw.githubusercontent.com/IQSS/counter-processor/refs/heads/goto-gdcc/user-agents/lists/robots.txt
machines_url: https://raw.githubusercontent.com/IQSS/counter-processor/refs/heads/goto-gdcc/user-agents/lists/machine.txt
robots_url: https://raw.githubusercontent.com/IQSS/counter-processor/refs/heads/main/user-agents/lists/robots.txt
machines_url: https://raw.githubusercontent.com/IQSS/counter-processor/refs/heads/main/user-agents/lists/machine.txt

# the year and month for the report you are creating.
year_month: 2018-04

# Output formats are only json currently. tsv was planned but spec was never finalized.

# Don't put the filename extension, the code will tack on the tsv or json extension for you.
output_file: tmp/test_out
output_file: tmp/make-data-count-report
output_format: json

# Allows the report to have volume (size) info, which DataCite doesn't accept yet
output_volume: False

# the name of the platform that goes into your reports
platform: Harvard Dataverse
platform: Dataverse

# Don't put your api token in here if you're going to commit it, but put in separate secrets.yaml in same
# directory as the config or else set a environment variable when starting up in order to override the key.
Expand Down
2 changes: 1 addition & 1 deletion documentation/step-by-step-install.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
## Get the code

<pre>
git clone https://github.com/IQSS/counter-processor.git
git clone https://github.com/gdcc/counter-processor.git
cd counter-processor
git checkout <em>branch-or-tag</em>
</pre>
Expand Down
36 changes: 36 additions & 0 deletions user-agents/lists/machine.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# this file is generated from a master file, please modify it and regenerate it with the generate_lists.rb script
^ruby$
AddThis
aria2\/\d
CakePHP
ColdFusion
curl\/
^\%?default\%?$
Dispatch\/\d
EBSCO\sEJS\sContent\sServer
Fetch(\s|\+)API(\s|\+)Request
geturl
gvfs\/
HttpComponents\/1.1
http.?client
Indy Library
^java\/\d{1,2}.\d
libcurl
libhttp
libwww
lwp
Microsoft(\s|\+)URL(\s|\+)Control
Microsoft Office Existence Discovery
ng\/2\.
no_user_agent
pear.php.net
PHP\/
PycURL
python
rss
^undefined$
^unknown$
URL2File
urllib
Wget
wordpress
202 changes: 202 additions & 0 deletions user-agents/lists/robots.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,202 @@
# this file is generated from a master file, please modify it and regenerate it with the generate_lists.rb script
bot
spider
crawl
[^a]fish
^voyager\/
ADmantX
alexa
Alexandria(\s|\+)prototype(\s|\+)project
AllenTrack
almaden
appie
API[\+\s]scraper
Arachmo
architext
ArchiveTeam
arks
asterias
atomz
BDFetch
baidu
biglotron
BingPreview
binlar
Blackboard[\+\s]Safeassign
blaiz\-bee
bloglines
blogpulse
boitho\.com\-dc
bookmark\-manager
Brutus\/AET
BUbiNG
bwh3_user_agent
celestial
cfnetwork
checklink
checkprivacy
China\sLocal\sBrowse\s2\.6
cloakDetect
coccoc\/1\.0
collection@infegy.com
com\.plumanalytics
combine
contentmatch
ContentSmartz
convera
core
CoverScout
cursor
custo
DataCha0s\/2\.0
daumoa
DeuSu\/
Docoloc
docomo
DSurf
DTS Agent
easydl
EmailSiphon
EmailWolf
Embedly
EThOS\+\(British\+Library\)
facebookexternalhit\/
feedburner
FeedFetcher
feedreader
ferret
findlinks
Fulltext
Funnelback
G-i-g-a-b-o-t
Goldfire(\s|\+)Server
google
Grammarly
grub
gulliver
harvest
heritrix
holmes
htdig
htmlparser
HTTPFetcher
httrack
ia_archiver
ichiro
iktomi
ilse
^integrity\/\d
internetseer
intute
iSiloX
iskanie
jeeves
jobo
kyluka
larbin
lilina
link.?check
LinkLint-checkonly
^LinkParser\/
^LinkSaver\/
linkscan
LinkTiger
linkwalker
lipperhey
livejournal\.com
LOCKSS
ltx71
lycos[\_\+]
mail.ru
mediapartners\-google
megite
MetaURI[\+\s]API\/\d\.\d
mimas
mnogosearch
moget
motor
MuscatFerre
myweb
nagios
^NetAnts\/\d
netcraft
netluchs
Ning
nomad
nutch
^oaDOI$
ocelli
Offline(\s|\+)Navigator
onetszukaj
OurBrowser
panscient
parsijoo
EasyBib[\+\s]AutoCite[\+\s]
perman
pioneer
playmusic\.com
playstarmusic\.com
^Postgenomic(\s|\+)v2
powermarks
proximic
Qwantify
Readpaper
redalert
Riddler
robozilla
scan4mail
scientificcommons
scirus
scooter
Scrapy\/\d
^scrutiny\/\d
SearchBloxIntra
shoutcast
SkypeUriPreview
slurp
sogou
speedy
Strider
summify
sunrise
Sysomos
T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E
tailrank
Teleport(\s|\+)Pro
Teoma
titan
^Traackr\.com$
Trove
twiceler
ucsd
ultraseek
urlaliasbuilder
validator
virus.detector
voila
^voltron$
voyager\/
w3af.org
Wanadoo
Web(\s|\+)Downloader
WebCloner
webcollage
WebCopier
Webinator
weblayers
Webmetrics
webmirror
webmon
webreaper
WebStripper
WebZIP
worm
www.gnip.com
WWW\-Mechanize
xenu
y!j
yacy
yahoo
yandex
zeus
zyborg