A simple tool for parsing the contents of the 20_newsgroups.tar.gz archive. I built it because I couldn't find an equivalent for Java.
Consider an example document:
Newsgroups: sci.med
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!spdcc!dyer
From: [email protected] (Steve Dyer)
Subject: Re: food-related seizures?
Message-ID: <[email protected]>
Organization: S.P. Dyer Computer Consulting, Cambridge MA
References: <[email protected]> <[email protected]> <[email protected]>
Date: Mon, 19 Apr 1993 20:44:10 GMT
My comments about the Feingold Diet have no relevance to your
daughter's purported FrostedFlakes-related seizures. I can't imagine
why you included it.
--
Steve Dyer
[email protected] aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
To parse the archive, instantiate a NewsgroupParser, passing the path to the uncompressed tarball, and call the parse() method on the parser.
After the files have been parsed, you can access them via getArticles().
This returns a key-value collection with newsgroup labels as keys and lists of parsed articles as values.
Every article consists of its text and a key-value collection of headers.
NewsgroupParser parser = new NewsgroupParser("20_newsgroups");
parser.parse();
// Print each newsgroup label followed by the number of articles parsed for it.
parser.getArticles().forEach((key, articles) -> {
    System.out.println(key);
    System.out.println(articles.size());
});
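To illustrate what "the text and the key-value collection of headers" can look like for a single article, here is a minimal sketch of splitting a raw message into those two parts. It is not NewsgroupParser's actual implementation: the names ArticleSketch, ParsedArticle and split are hypothetical, and it assumes the usual Usenet layout in which "Name: value" header lines are separated from the body by the first blank line. Folded (multi-line) headers are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of NewsgroupParser.
final class ArticleSketch {

    // Hypothetical holder for one parsed article: headers plus body text.
    static final class ParsedArticle {
        final Map<String, String> headers;
        final String text;

        ParsedArticle(Map<String, String> headers, String text) {
            this.headers = headers;
            this.text = text;
        }
    }

    // Splits a raw message into headers and body.
    // Assumption: headers end at the first blank line (or at the first
    // line that does not look like "Name: value").
    static ParsedArticle split(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = raw.split("\r?\n", -1);

        int i = 0;
        for (; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                i++; // skip the blank separator line
                break;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                break; // not a header line: treat the rest as body
            }
            // Split on the first colon only, so "Re: ..." stays in the value.
            headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
        }

        // Everything after the headers is the article text.
        StringBuilder body = new StringBuilder();
        for (; i < lines.length; i++) {
            body.append(lines[i]);
            if (i < lines.length - 1) {
                body.append('\n');
            }
        }
        return new ParsedArticle(headers, body.toString());
    }
}
```

Calling ArticleSketch.split(rawMessage) on the example document above would put entries such as Newsgroups and Subject into the headers map and leave the message body in text, provided the blank separator line is present in the raw file.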