Open
152 commits
769ad82
Add additional heuristics to improve title extraction
andresp99999 Oct 12, 2015
57660e1
Add extraction test case
andresp99999 Oct 13, 2015
80ad961
Several improvements to date extraction, add additional supported dat…
andresp99999 Oct 14, 2015
a833138
Add new test cases
andresp99999 Oct 15, 2015
311d367
Include great-grand-children nodes in the weight calculation, refacto…
andresp99999 Oct 19, 2015
a7e9e3e
Add new test case
andresp99999 Oct 19, 2015
6791a27
Rename test file
andresp99999 Oct 19, 2015
0e753c2
Give more weight to grand and greatgrand children, add additional tes…
andresp99999 Oct 19, 2015
9e56da0
Remove comment
andresp99999 Oct 19, 2015
f0aeb55
Adapt title extraction of new tests
Aug 11, 2016
9d02aca
Add additional test cases for cnn.com
andresp99999 Oct 20, 2015
2f4420c
Add support to disable SSL verification
andresp99999 Oct 28, 2015
66468e3
Add new test case
andresp99999 Nov 5, 2015
f9e8291
Fix title extraction with HTML entities
Aug 11, 2016
2c4dc44
Add checks to support when dates are malformed
andresp99999 Dec 1, 2015
3ec9245
Add another case for date extraction
andresp99999 Dec 1, 2015
d1a9c35
Add support to detect partial dates in the URL
andresp99999 Dec 1, 2015
4f94e2a
Add more date patterns
andresp99999 Dec 1, 2015
9b175ae
Yet more date patterns
andresp99999 Dec 1, 2015
895c07f
Fix expected value in test
andresp99999 Dec 1, 2015
2ea6197
Add additional use cases for date extraction
andresp99999 Dec 2, 2015
15c0b77
Add additional use cases for date extraction
andresp99999 Dec 2, 2015
7167739
Add additional use cases for date extraction
andresp99999 Dec 2, 2015
d8e0dc0
Some fixes for date extraction
andresp99999 Dec 2, 2015
00edaee
Fix issue with some rules too generic and were matching random dates …
andresp99999 Dec 3, 2015
da6b726
Change rule to use less generic selectors
andresp99999 Dec 3, 2015
87c4bbd
More date extraction fixes
andresp99999 Dec 3, 2015
7e4bf90
Another tests fix as we extract title differently
Aug 11, 2016
5dfd227
Support month by names in URL date extractor
andresp99999 Dec 3, 2015
d2e2e75
Add another date pattern
andresp99999 Dec 3, 2015
3387734
Another date extraction case
andresp99999 Dec 3, 2015
070bebd
Fix for slashdot date and title extraction
andresp99999 Dec 3, 2015
6e420a6
Add another date extraction case
andresp99999 Dec 4, 2015
58765f2
Disable debug info
andresp99999 Dec 4, 2015
ed04781
Small fix to make sure all test pass
andresp99999 Dec 4, 2015
2a561f7
Add validation function to make sure the dates fit in an expected ran…
andresp99999 Dec 4, 2015
612ae5e
Add another date pattern
andresp99999 Dec 7, 2015
520a647
Add support to define domain specific rules to remove nodes from the …
andresp99999 Dec 7, 2015
befcaa3
Add another date pattern
andresp99999 Dec 8, 2015
2551d15
Handle case when the public suffix for the URL is not known
andresp99999 Dec 9, 2015
52b5e00
Add custom rule for cmo.com
andresp99999 Dec 10, 2015
265a9ee
Add custom extraction case
andresp99999 Dec 16, 2015
cc75a58
Add additional use case
andresp99999 Dec 17, 2015
55b3145
Support other date format
andresp99999 Jan 8, 2016
c4263a3
Throw a custom exception when the target page is not found
andresp99999 Jan 10, 2016
a35d340
Update unit tests
andresp99999 Jan 10, 2016
12f36d7
Add new test case, make some changes on heuristic for title
andresp99999 Jan 11, 2016
90bd547
Make sure of returning absolute canonical URLs
andresp99999 Jan 12, 2016
c3bfc84
Add custom extraction case
andresp99999 Jan 12, 2016
c159fde
Fix unit tests (different title extraction logic)
Aug 11, 2016
7006b1e
Add missing guava dependency
Aug 11, 2016
3085896
Disable old test, URL seems to be blocking requests now
andresp99999 Jun 8, 2016
287a26f
Add rules to remove unlikely nodes, modify getBestElement to return a…
andresp99999 Jan 18, 2016
e42b150
Add new extraction case
andresp99999 Jan 18, 2016
405eb45
Add another extraction case
andresp99999 Jan 21, 2016
888d53d
Add another test case, refactor SHelper a bit
andresp99999 Jan 21, 2016
ab62e1d
Add extraction case
andresp99999 Jan 22, 2016
49ccac7
Add missing test data file
andresp99999 Jan 22, 2016
265083e
Fix unit test - server response changed
andresp99999 Feb 4, 2016
4a733be
Add new date extraction case
andresp99999 Feb 4, 2016
05cb23c
Add custom site rule
andresp99999 Feb 4, 2016
9d08404
Add custom rule
andresp99999 Feb 5, 2016
30980aa
Add extraction case
andresp99999 Feb 22, 2016
2bec776
Another extraction case
andresp99999 Feb 22, 2016
9db3879
Add another extraction case
andresp99999 Feb 22, 2016
8b09318
Add new extraction case, upgrade Jsoup to avoid issue extracting cano…
andresp99999 Feb 26, 2016
1b721b9
Generalize title sanitization
Aug 12, 2016
627ace9
Change URL for test since old URL no longer returns a 404
andresp99999 Mar 1, 2016
6421f14
Add support to detect when the extracted content contains HTML, in th…
andresp99999 Mar 1, 2016
9d2c6cc
Add support to define domain specific rules to select the bestElement
andresp99999 Mar 2, 2016
9fae6a0
Add additional test to test for several cases to extract canonical URL
andresp99999 Mar 3, 2016
a7286ba
Add couple of new extraction cases
andresp99999 Mar 9, 2016
ba2daf2
Add additional test case
andresp99999 Mar 17, 2016
3a51fb7
src/test/resources/de/jetwick/snacktory/macnn.html
andresp99999 Mar 17, 2016
96ab9d5
Add another test case
andresp99999 Mar 17, 2016
4b1cf87
Make sure of not returning empty links
andresp99999 Mar 17, 2016
b255f1b
Add function to extract only canonical
andresp99999 Mar 21, 2016
a6992da
Add version of extractCanonical where the document is already present
andresp99999 Mar 21, 2016
732ef4e
Add another date extraction case
andresp99999 Mar 24, 2016
19d3da1
Add another date case
andresp99999 Mar 24, 2016
3a353c4
Add special case
andresp99999 Apr 1, 2016
fedd488
Add another extraction case
andresp99999 Apr 5, 2016
1930e38
Add support for not using canonical urls from a different domain
andresp99999 Apr 7, 2016
d122d35
if canonical is empty don't use it
andresp99999 Apr 7, 2016
b7c6f2c
Add elements common in carousels to remove list
andresp99999 Apr 7, 2016
2fb4a09
Add test data
andresp99999 Apr 7, 2016
bbc93ad
Add new date extraction case
andresp99999 Apr 14, 2016
bad9ff9
Fix issue in date extraction
andresp99999 Apr 15, 2016
3684135
Add date extraction case
andresp99999 Apr 18, 2016
bd4a3e9
Fixes for author_desc
andresp99999 Apr 22, 2016
406cf5d
Add date extraction case
andresp99999 Apr 22, 2016
b254ee2
Add date extraction case
andresp99999 Apr 22, 2016
e6c258d
Add new date extraction case
andresp99999 Apr 25, 2016
e47076c
Add another date extraction case
andresp99999 Apr 25, 2016
59a9e4e
Yet another date extraction case
andresp99999 Apr 26, 2016
e087f4b
Add extraction case
andresp99999 May 4, 2016
ac15f6d
Fix unit tests as we have different title extraction logic
Aug 12, 2016
819d91f
Other extraction case, ignore image copyright messages
andresp99999 May 4, 2016
7f01a76
Remove common credit class
andresp99999 May 4, 2016
0d072e7
Add another extraction pattern, change patterns to be non case sensitive
andresp99999 May 5, 2016
e5d328a
Remove some other image credit patterns
andresp99999 May 5, 2016
762fc25
Fix testLeFigaroSport
Aug 12, 2016
0eabcb8
Revert lib update, seems to cause high cpu usage
andresp99999 May 6, 2016
797cdc1
Add another extraction case
andresp99999 May 18, 2016
da076c6
Add another extraction case
andresp99999 May 18, 2016
bf25720
Yet another extraction case
andresp99999 May 18, 2016
1ab0d3c
Yet another extraction case
andresp99999 May 19, 2016
74d5fc8
Handle exception: Not under a public suffix
andresp99999 May 19, 2016
93ba715
Add another extraction case
andresp99999 May 20, 2016
67580ee
Add another extraction case
andresp99999 May 24, 2016
86cf6cd
Add support to include ul and li tags in output - WIP
andresp99999 May 18, 2016
f9f610c
Make sure all tests pass
andresp99999 May 24, 2016
15f08de
Add a couple more extraction cases
andresp99999 May 25, 2016
157c5cd
Another extraction case
andresp99999 May 27, 2016
6082b52
Add another date extraction case
andresp99999 May 27, 2016
6e4f117
Add another date extraction case
andresp99999 May 27, 2016
5cfe2b2
Refactor getBestElements, now it returns a TreeMap of nodes sorted by…
andresp99999 May 31, 2016
86fa252
Add support to define custom list of css selectors for the OutputForm…
andresp99999 Jun 2, 2016
be61496
Add date extraction case
andresp99999 Jun 3, 2016
b4c2774
Add another extraction case
andresp99999 Jun 3, 2016
13fefbf
Limit the size of a URL returned in the Links array
andresp99999 Jun 7, 2016
68322ae
Add another extraction case for author_desc; add better logging for a…
andresp99999 Jun 8, 2016
4a7b4de
Add another extraction case
andresp99999 Jun 8, 2016
d8ac97e
Add another extraction case
andresp99999 Jun 13, 2016
6cbc1cd
Add another extraction case
andresp99999 Jun 16, 2016
5665ef8
Add new date extraction case
andresp99999 Jun 16, 2016
ce8ca75
Add a couple date extraction cases
andresp99999 Jun 16, 2016
83606ee
yet another extraction case
andresp99999 Jun 16, 2016
9f6fef3
Fix tests for title extraction
Aug 12, 2016
c93253c
yet another extraction case
andresp99999 Jun 28, 2016
8330a33
yet another extraction case
andresp99999 Jun 28, 2016
b55de36
Add another extraction case
andresp99999 Jul 5, 2016
69a80aa
Fix unit test for title extraction
Aug 12, 2016
5436fc7
Add extraction case
andresp99999 Jul 6, 2016
114f0ed
Add custom extraction case for theverge
andresp99999 Jul 14, 2016
52160d5
Add custom rules for iheart.com
andresp99999 Jul 14, 2016
852f284
Another title extraction test fix
Aug 12, 2016
4b6bf66
If the author name is smaller than 8 characters don't try to search f…
andresp99999 Jul 14, 2016
c2bc38a
Fix testBizJournal test
Aug 12, 2016
7580c7a
Fix testLeFigaroSport unit test
Aug 12, 2016
d168751
Fix testCNBC2 unit test (get the latest HTML from the website)
Aug 12, 2016
bba5ca2
Bump dependencies
Aug 12, 2016
b742424
Fixes for HtmlFetcherIntegrationTest
Aug 12, 2016
efbf1cf
Fix testBBCNoCSS (get most recent content from website)
Aug 12, 2016
7ac58c8
Fix testData6 extracted content
Aug 12, 2016
7c0ce3f
Fix testKdwb
Aug 12, 2016
42d74d7
Fix remaining failing unit tests
Aug 12, 2016
7990909
Better usage of assertion framework inside tests
Aug 12, 2016
d327692
Quick code cleanup
Aug 12, 2016
09e8b5f
Bump maven configuration
Aug 12, 2016
70bae52
Huge changes, good opportunity to work towards a 1.4.0 release
Aug 12, 2016
7f97aac
Merge branch 'smallrivers' into skyshard-merge
Nov 15, 2016
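Several of the commits above deal with date extraction, in particular detecting partial dates in the URL (d1a9c35) and validating that dates fit an expected range (2a561f7). The sketch below illustrates that general idea only. It is not the code from this PR: the class name `UrlDateSketch`, the method `extractDate`, the regular expression, and the year bounds are all hypothetical, and it uses `java.time`, which assumes Java 8 or later even though the pom targets 1.6.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlDateSketch {
    // Hypothetical bounds; the range used by the PR's validation is not shown in this diff.
    private static final int MIN_YEAR = 1995;
    private static final int MAX_YEAR = 2100;

    // Matches /yyyy/mm with an optional /dd segment, followed by a slash or the end of the URL.
    private static final Pattern URL_DATE =
            Pattern.compile("/(\\d{4})/(\\d{1,2})(?:/(\\d{1,2}))?(?:/|$)");

    /** Returns an ISO date (yyyy-MM-dd), or null when no plausible date is found. */
    public static String extractDate(String url) {
        Matcher m = URL_DATE.matcher(url);
        while (m.find()) {
            int year = Integer.parseInt(m.group(1));
            int month = Integer.parseInt(m.group(2));
            // A partial date (year and month only) defaults to the first of the month.
            int day = m.group(3) != null ? Integer.parseInt(m.group(3)) : 1;
            if (year < MIN_YEAR || year > MAX_YEAR) {
                continue; // reject numbers that merely look like a year
            }
            try {
                return LocalDate.of(year, month, day).toString();
            } catch (DateTimeException e) {
                // Impossible calendar date (e.g. month 13); try the next match.
            }
        }
        return null;
    }
}
```

Validating the range before building the date keeps numeric path segments such as article IDs from being reported as dates, which is the kind of false positive the commit messages describe.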
30 changes: 26 additions & 4 deletions pom.xml
@@ -3,9 +3,13 @@
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>

+<prerequisites>
+<maven>3.0</maven>
+</prerequisites>
+
 <groupId>org.clojars.smallrivers</groupId>
 <artifactId>snacktory</artifactId>
-<version>1.3.6-SNAPSHOT</version>
+<version>1.4.0-SNAPSHOT</version>
 <packaging>jar</packaging>

 <name>Snacktory</name>
@@ -41,6 +45,24 @@
 <version>2.6</version>
 </dependency>

+<dependency>
+<groupId>org.apache.commons</groupId>
+<artifactId>commons-lang3</artifactId>
+<version>3.4</version>
+</dependency>
+
+<dependency>
+<groupId>com.google.guava</groupId>
+<artifactId>guava</artifactId>
+<version>19.0</version>
+</dependency>
+
+<dependency>
+<groupId>net.sourceforge.htmlcleaner</groupId>
+<artifactId>htmlcleaner</artifactId>
+<version>2.16</version>
+</dependency>
+
 <!-- only needed to make logging work during tests -->
 <dependency>
 <groupId>org.slf4j</groupId>
@@ -65,7 +87,7 @@
 <inherited>true</inherited>
 <groupId>org.apache.maven.plugins</groupId>
 <artifactId>maven-compiler-plugin</artifactId>
-<version>3.1</version>
+<version>3.5.1</version>
 <configuration>
 <source>1.6</source>
 <target>1.6</target>
@@ -74,15 +96,15 @@
 <plugin>
 <groupId>org.apache.maven.plugins</groupId>
 <artifactId>maven-resources-plugin</artifactId>
-<version>2.6</version>
+<version>2.7</version>
 <configuration>
 <encoding>UTF-8</encoding>
 </configuration>
 </plugin>
 <plugin>
 <groupId>org.apache.maven.plugins</groupId>
 <artifactId>maven-surefire-plugin</artifactId>
-<version>2.18.1</version>
+<version>2.19.1</version>
 <configuration>
 <systemProperties>
 <property>
7 changes: 5 additions & 2 deletions project/Build.scala
@@ -2,17 +2,20 @@ import sbt._
 import Keys._

 object Dependencies {
-val Jsoup = "org.jsoup" % "jsoup" % "1.7.2"
+val Jsoup = "org.jsoup" % "jsoup" % "1.8.3"
 val Slf4jApi = "org.slf4j" % "slf4j-api" % "1.6.6"
 val Slf4jLog4j12 = "org.slf4j" % "slf4j-log4j12" % "1.6.6"
 val CommonsLang = "commons-lang" % "commons-lang" % "2.6"
+val CommonsLang3 = "org.apache.commons" % "commons-lang3" % "3.4"
 val Log4j = "log4j" % "log4j" % "1.2.14"
+val Guava = "com.google.guava" % "guava" % "19.0"
+val HtmlCleaner = "net.sourceforge.htmlcleaner" % "htmlcleaner" % "2.16"
 }

 object SnacktoryBuild extends Build {
 import Dependencies._

 lazy val root = Project("snacktory", file("."),
 settings = Defaults.defaultSettings ++
-Seq(libraryDependencies ++= Seq(Jsoup, Slf4jApi, Slf4jLog4j12, CommonsLang, Log4j)))
+Seq(libraryDependencies ++= Seq(Jsoup, Slf4jApi, Slf4jLog4j12, CommonsLang, CommonsLang3, Log4j, Guava, HtmlCleaner)))
 }