Medium Crawler 🐜

This crawler was used to build the following dataset from Medium.com, which includes ground-truth correspondence between sentences and reviews.

Dataset Statistics

You may contact Charlie Wu at [email protected] to obtain the dataset.

Crawler environment requirements

Python 3.6
PostgreSQL environment on macOS/Linux

Define constant.py, then download the chromedriver build for your environment into the project directory.

To query data:

psql medium
\dt
select $field from $table where [conditions];
-- select numLikes of highlights whose article was published on or before 2017-12-01 00:00:00
SELECT highlight.numLikes, article.postTime FROM highlight LEFT JOIN article ON highlight.corrArticleID = article.articleID WHERE (article.postTime <= timestamp '2017-12-01 00:00:00' AND highlight.numLikes >= 0);
-- select all sentences of an article
SELECT * FROM stn WHERE corrArticleID = $articleID;
-- select the number of comments associated with a highlight
SELECT count(*) FROM comment WHERE corrHighlightID = $highlightID;
-- select a highlight that has more than one associated comment
SELECT * FROM highlight WHERE (SELECT count(*) FROM comment WHERE comment.corrHighlightID = highlight.highlightID) > 1 LIMIT 1;
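
The last query can also be issued from Python. The sketch below is an illustration only and is not part of the crawler. It uses the standard-library sqlite3 module with an in-memory database as a stand-in for PostgreSQL so that it runs without a server, creates only the columns it needs from the highlight and comment tables, and fills them with made-up rows. The helper name highlights_with_more_than is hypothetical. It expresses the correlated subquery as a JOIN with GROUP BY ... HAVING, which selects the same highlights and also reports their comment counts.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL "medium" database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE highlight (highlightID INTEGER PRIMARY KEY, content TEXT, numLikes INT);
    CREATE TABLE comment (commentID INTEGER PRIMARY KEY, corrHighlightID INT);
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO highlight VALUES (?, ?, ?)",
    [(1, "first highlight", 3), (2, "second highlight", 0), (3, "third highlight", 7)],
)
conn.executemany(
    "INSERT INTO comment (corrHighlightID) VALUES (?)",
    [(1,), (1,), (2,), (3,), (3,), (3,)],
)


def highlights_with_more_than(connection, min_comments):
    """Return (highlightID, comment count) for highlights with more than min_comments comments."""
    query = """
        SELECT h.highlightID, COUNT(c.commentID) AS numComments
        FROM highlight AS h
        JOIN comment AS c ON c.corrHighlightID = h.highlightID
        GROUP BY h.highlightID
        HAVING COUNT(c.commentID) > ?
        ORDER BY h.highlightID
    """
    return connection.execute(query, (min_comments,)).fetchall()


result = highlights_with_more_than(conn, 1)
print(result)
```

Against the real PostgreSQL database the same SQL would go through a driver such as psycopg2, which uses %s placeholders instead of ?.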

Database Table Structure:

article

Field	Type	Info
articleID	SERIAL PRIMARY KEY
mediumID	varchar(300)
title	text
recommends	int
tags	varchar(300)	list of tags
postTime	timestamp
numLikes	int
corrAuthorID	int	link to author

author

Field	Type	Info
authorID	SERIAL PRIMARY KEY
name	varchar(50)
mediumID	varchar(20)
username	varchar(50)
bio	text

topic

Field	Type	Info
topicID	SERIAL PRIMARY KEY
name	text
mediumID	varchar(20)
description	text

paragraph

Field	Type	Info
paragraphID	SERIAL PRIMARY KEY
mediumID	varchar(10)
content	text
corrArticleID	int	link to article

A paragraph's position within its article is given by the order of its paragraphID.

stn

Field	Type	Info
stnID	SERIAL PRIMARY KEY
paragraphID	int	link to paragraph
content	text
corrArticleID	int	link to article

A sentence's position within its paragraph is given by the order of its stnID.
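
Because positions are implied by ID order, the text of an article can be reassembled by sorting its sentences by paragraphID and then by stnID. The following is a minimal sketch under stated assumptions: it uses an in-memory sqlite3 database as a stand-in for PostgreSQL, the sample rows are made up, and the helper name article_paragraphs is hypothetical.

```python
import sqlite3
from itertools import groupby

# Assumption: in-memory SQLite stands in for the PostgreSQL "medium" database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stn (stnID INTEGER PRIMARY KEY, paragraphID INT, content TEXT, corrArticleID INT)"
)

# Made-up sample rows, deliberately inserted out of order.
conn.executemany(
    "INSERT INTO stn VALUES (?, ?, ?, ?)",
    [
        (3, 11, "Highlights can receive comments.", 1),
        (1, 10, "Medium hosts long-form posts.", 1),
        (4, 20, "This sentence belongs to another article.", 2),
        (2, 10, "Readers can highlight sentences.", 1),
    ],
)


def article_paragraphs(connection, article_id):
    """Return the paragraphs of one article, each rebuilt from its sentences in ID order."""
    rows = connection.execute(
        "SELECT paragraphID, content FROM stn "
        "WHERE corrArticleID = ? ORDER BY paragraphID, stnID",
        (article_id,),
    ).fetchall()
    # Rows arrive sorted, so consecutive rows with the same paragraphID form one paragraph.
    return [
        " ".join(content for _, content in group)
        for _, group in groupby(rows, key=lambda row: row[0])
    ]


paragraphs = article_paragraphs(conn, 1)
for text in paragraphs:
    print(text)
```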

highlight

Field	Type	Info
highlightID	SERIAL PRIMARY KEY
content	text
numLikes	int
startOffset	int
endOffset	int
corrParagraphID	int	link to paragraph
corrArticleID	int	link to article

comment

Field	Type	Info
commentID	SERIAL PRIMARY KEY
selfArticleID	int	link to article
corrArticleID	int	link to article
corrHighlightID	int	link to highlight

The detailed content of a comment is stored as an article record, referenced by the field selfArticleID, so comments form a tree structure.
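
One way to picture this tree is to link each comment's selfArticleID to the article it responds to. The sketch below is an interpretation, not the crawler's own code: it assumes that a reply to a comment has its corrArticleID set to the parent comment's selfArticleID, and that corrHighlightID may be empty for comments not tied to a highlight. The rows are made up and the helper name build_comment_tree is hypothetical.

```python
from collections import defaultdict

# Made-up rows: (commentID, selfArticleID, corrArticleID, corrHighlightID).
comments = [
    (1, 101, 100, 7),     # comment stored as article 101, responding to article 100
    (2, 102, 100, None),  # second comment on article 100 (no highlight: an assumption)
    (3, 103, 101, None),  # reply to the comment stored as article 101 (assumed linkage)
    (4, 104, 103, None),  # reply to the reply
]


def build_comment_tree(comment_rows, root_article_id):
    """Nest comments under the article (or comment) they respond to."""
    children = defaultdict(list)
    for _comment_id, self_article_id, corr_article_id, _highlight_id in comment_rows:
        children[corr_article_id].append(self_article_id)

    def subtree(article_id):
        # Each node is an article record; its children are the comments on it.
        return {
            "articleID": article_id,
            "comments": [subtree(child) for child in children.get(article_id, [])],
        }

    return subtree(root_article_id)


tree = build_comment_tree(comments, 100)
print(tree)
```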

Disclaimer: This project was developed for academic use only. The developer is not responsible for any consequences arising from the use of this program. If you use the dataset, an acknowledgement would be appreciated.

Charleo85/medium-crawler

Folders and files

Latest commit

History

Repository files navigation

Medium Crawler 🐜

Dataset Statistics

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages