Skip to content

Extract the editorial articles from newspaper from their RSS feed.

License

Notifications You must be signed in to change notification settings

Dr-N00B/Editorials

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Editorials

Extract Editorials from newspapers using the RSS feed provided by newspapers. Simply provide the tags which needs to be processed in a config file named "action_items.json".

Config file

Config file is a JSON file. Either pass using command line or use default file present in /config path.

Config parameters are:

"url_file"  : "path to the file where all tags related information is present",	"log_level":  "log level in python. Different log levels are DEBUG, INFO, WARNING, ERROR, CRITICAL",
"storage" : {
	"type" 		: "file or db, only one at a time",
	"path" 		: "path only applicable for file only",
	"host" 		: "hostname/ip address",
	"port" 		: "port number",
	"username" 	: "username, if username and password not given, it will try to connect without credential",
	"password" 	: "password, plaintext as of now",
	"database"	: "db name",
	"collection" 	: "collection name",
	"authdb"	: "authentication db, optional, default 'admin' will be used"
}

Action file

This file is a json file which will have all the information about the tags to process and the attributes of a tag to uniquely identify that tag. Action file will have below format.

"Newspaper title" : 
{	
	"url" : "url to the rss feed",
	"rss" : {
		"tags":["list of comma separated tags to process at rss page for each article"],
		"date_format" : "date format. Only month date and year format is to be supplied."
	},
	"page" : {
		"item" : {
			"tag" : "tag_name", 
			"attribute" : { "atttibute name" : "value" },
			"child_include" : ["comma separated tags to include within this tag to extract the content"],
			"child_exclude" : ["comma separated tags to exclude within this tag to extract the content"],
			"format" : "date format if the field is a date."
 		}
	}
}

All keys are compulsory except "child_include" , "child_exclude". and "format" has to be given if only the item is of type date.

.e.g

"The New York Time" : 
 {	
	"url" : "http://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/column/charles-m-blow/rss.xml",
	"rss" : {
		"tags":["description"],
		"date_format" : "%d %b %Y"
	},
	"page" : {
		"date" : {
			"tag" : "time", 
			"attribute" : {},
			"child_include" : ["none"],
			"format" : "%B %d %Y"
 		},
		"story" : {
			"tag" : "div",
			"attribute" : {"class":"css-53u6y8"},
			"child_include" : ["p"]
		}
	}
}

About

Extract the editorial articles from newspaper from their RSS feed.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages