Releases: HHN/crawler4j
v5.1.0
What's new?
This release introduces several key improvements and dependency upgrades, ensuring enhanced performance and compatibility. A major change is the switch from the heavier Apache Tika Standard Parser package to the lightweight HTML module. If your project requires parsing binary content, you will now need to manually add the Apache Tika Standard Parser dependency as follows:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard</artifactId>
  <version>3.0.0</version>
</dependency>
Make sure to add this alongside your existing Crawler4j dependencies to maintain binary content parsing capabilities.
Please also note that this release now uses `slf4j2` as its logging facade.
Auto Generated Changelog
- Bump flyway-core from 9.18.0 to 9.19.4 by @dependabot in #223
- Bump jackson-core from 2.15.1 to 2.15.2 by @dependabot in #220
- Bump license-maven-plugin from 2.0.1 to 2.1.0 by @dependabot in #222
- Bump hsqldb from 2.7.1 to 2.7.2 by @dependabot in #219
- Bump maven-surefire-plugin from 3.1.0 to 3.1.2 by @dependabot in #221
- Bump commons-io from 2.12.0 to 2.13.0 by @dependabot in #225
- Bump versions-maven-plugin from 2.15.0 to 2.16.0 by @dependabot in #224
- Bump httpcore5 from 5.2.1 to 5.2.2 by @dependabot in #226
- Bump httpcore5-h2 from 5.2.1 to 5.2.2 by @dependabot in #228
- Bump license-maven-plugin from 2.1.0 to 2.2.0 by @dependabot in #229
- Bump flyway-core from 9.19.4 to 9.20.0 by @dependabot in #227
- Bump org.flywaydb:flyway-core from 9.20.0 to 9.21.0 by @dependabot in #233
- Bump org.junit.jupiter:junit-jupiter from 5.9.3 to 5.10.0 by @dependabot in #231
- Bump org.flywaydb:flyway-core from 9.21.0 to 9.21.1 by @dependabot in #234
- Bump com.helger:ph-css from 7.0.0 to 7.0.1 by @dependabot in #235
- Bump org.flywaydb:flyway-core from 9.21.1 to 9.22.0 by @dependabot in #241
- Bump org.flywaydb:flyway-core from 9.22.0 to 9.22.1 by @dependabot in #244
- Bump com.github.tomakehurst:wiremock-jre8 from 2.35.0 to 2.35.1 by @dependabot in #243
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.3.0 to 3.4.1 by @dependabot in #242
- Bump apache.tika.version from 2.8.0 to 2.9.0 by @dependabot in #239
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.5.0 to 3.6.0 by @dependabot in #248
- Bump org.codehaus.mojo:versions-maven-plugin from 2.16.0 to 2.16.1 by @dependabot in #249
- Bump org.apache.httpcomponents.core5:httpcore5-h2 from 5.2.2 to 5.2.3 by @dependabot in #247
- Bump org.apache.httpcomponents.core5:httpcore5 from 5.2.2 to 5.2.3 by @dependabot in #246
- Bump commons-io:commons-io from 2.13.0 to 2.15.0 by @dependabot in #255
- Bump com.fasterxml.jackson.core:jackson-core from 2.15.2 to 2.16.0 by @dependabot in #258
- Bump org.flywaydb:flyway-core from 9.22.1 to 10.0.1 by @dependabot in #257
- Bump de.thetaphi:forbiddenapis from 3.5.1 to 3.6 by @dependabot in #251
- Bump apache.tika.version from 2.9.0 to 2.9.1 by @dependabot in #259
- Bump org.codehaus.mojo:versions-maven-plugin from 2.16.1 to 2.16.2 by @dependabot in #260
- Bump log4j.version from 2.20.0 to 2.22.0 by @dependabot in #261
- Bump org.junit.jupiter:junit-jupiter from 5.10.0 to 5.10.2 by @dependabot in #263
- Bump org.flywaydb:flyway-core from 10.0.1 to 10.8.1 by @dependabot in #268
- Bump org.apache.httpcomponents.core5:httpcore5-h2 from 5.2.3 to 5.2.4 by @dependabot in #267
- Bump com.fasterxml.jackson.core:jackson-core from 2.16.0 to 2.16.1 by @dependabot in #266
- Bump org.assertj:assertj-core from 3.24.1 to 3.25.3 by @dependabot in #264
- Bump log4j.version from 2.22.0 to 2.23.0 by @dependabot in #271
- Bump org.postgresql:postgresql from 42.6.0 to 42.7.2 by @dependabot in #269
- Bump org.apache.maven.plugins:maven-compiler-plugin from 3.11.0 to 3.12.1 by @dependabot in #270
- Bump slf4j.version from 1.7.36 to 2.0.12 by @dependabot in #274
- Bump commons-io:commons-io from 2.15.0 to 2.17.0 by @dependabot in #280
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.6.0 to 3.10.1 by @dependabot in #281
- Bump de.thetaphi:forbiddenapis from 3.6 to 3.8 by @dependabot in #284
- Bump com.fasterxml.jackson.core:jackson-core from 2.16.1 to 2.18.0 by @dependabot in #285
- Bump org.hsqldb:hsqldb from 2.7.2 to 2.7.3 by @dependabot in #283
- Bump org.apache.httpcomponents.core5:httpcore5 from 5.2.3 to 5.3 by @dependabot in #282
- Bump org.apache.httpcomponents.core5:httpcore5-h2 from 5.2.4 to 5.3 by @dependabot in #286
- Bump slf4j.version from 2.0.12 to 2.0.16 by @dependabot in #289
- Bump org.codehaus.mojo:versions-maven-plugin from 2.16.2 to 2.17.1 by @dependabot in #288
- Bump org.apache.maven.plugins:maven-surefire-plugin from 3.1.2 to 3.5.1 by @dependabot in #287
- Bump org.apache.maven.plugins:maven-archetype-plugin from 3.2.1 to 3.3.0 by @dependabot in #290
- Bump org.apache.maven.archetype:archetype-packaging from 3.2.1 to 3.3.0 by @dependabot in #292
- Bump log4j.version from 2.23.0 to 2.24.1 by @dependabot in #291
- Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.1 to 5.4 by @dependabot in #293
- Bump org.assertj:assertj-core from 3.25.3 to 3.26.3 by @dependabot in #294
- Bump com.helger:ph-css from 7.0.1 to 7.0.3 by @dependabot in #295
- Bump org.awaitility:awaitility from 4.2.0 to 4.2.2 by @dependabot in #297
- Bump org.apache.maven.plugins:maven-source-plugin from 3.3.0 to 3.3.1 by @dependabot in #296
- Bump org.postgresql:postgresql from 42.7.2 to 42.7.4 by @dependabot in #300
- Bump org.apache.maven.plugins:maven-gpg-plugin from 3.1.0 to 3.2.7 by @dependabot in #298
- Bump com.github.tomakehurst:wiremock-jre8 from 2.35.1 to 2.35.2 by @dependabot in #299
- Bump org.apache.maven.plugins:maven-jar-plugin from 3.3.0 to 3.4.2 by @dependabot in #301
- Bump org.flywaydb:flyway-core from 10.8.1 to 10.20.0 by @dependabot in #306
- Bump org.codehaus.mojo:license-maven-plugin from 2.2.0 to 2.4.0 by @dependabot in #303
- Bump com.github.crawler-commons:urlfrontier-API from 2.3.1 to 2.4 by @dependabot in #305
- Bump com.mchange:c3p0 from 0.9.5.5 to 0.10.1 by @dependabot in #304
- Bump org.junit.jupiter:junit-jupiter from 5.10.2 to 5.11.3 by @dependabot in #308
- Bump org.jacoco:jacoco-maven-plugin from 0.8.10 to 0.8.12 by @dependabot in #310
- Bump com.zaxxer:HikariCP from 5.0.1 to 6.0.0 by @dependabot in #309
- Bump com.helger:ph-css from 7.0.1 to 7.0.3 by @dependabot in #307
- Bump org.apache.maven.plugins:maven-compiler-plugin from 3.12.1 to 3.13.0 by @dependabot in #311
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.4.1 to 3.5.0 by @dependabot in #312
- Release 5.1.0 by @rzo1 in #313
Full Changelog: v5.0.2...v5.1.0
v5.0.2
What's Changed
- Bump flyway-core from 9.10.0 to 9.10.1 by @dependabot in #164
- Bump maven-resources-plugin from 3.2.0 to 3.3.0 by @dependabot in #167
- Bump versions-maven-plugin from 2.13.0 to 2.14.1 by @dependabot in #168
- Bump archetype-packaging from 2.4 to 3.2.1 by @dependabot in #165
- Bump flyway-core from 9.10.1 to 9.11.0 by @dependabot in #171
- Bump assertj-core from 3.23.1 to 3.24.1 by @dependabot in #172
- Bump versions-maven-plugin from 2.14.1 to 2.14.2 by @dependabot in #170
- Bump flyway-core from 9.11.0 to 9.12.0 by @dependabot in #177
- Bump junit-jupiter from 5.9.1 to 5.9.2 by @dependabot in #173
- Bump maven-surefire-plugin from 3.0.0-M7 to 3.0.0-M8 by @dependabot in #174
- Bump httpcore5 from 5.2 to 5.2.1 by @dependabot in #176
- Bump httpcore5-h2 from 5.2 to 5.2.1 by @dependabot in #175
- Bump maven-enforcer-plugin from 3.1.0 to 3.2.1 by @dependabot in #182
- Bump postgresql from 42.5.1 to 42.5.3 by @dependabot in #181
- Bump apache.tika.version from 2.6.0 to 2.7.0 by @dependabot in #180
- Bump flyway-core from 9.12.0 to 9.14.1 by @dependabot in #183
- Bump ph-css from 6.5.0 to 7.0.0 by @dependabot in #184
- Bump flyway-core from 9.14.1 to 9.15.0 by @dependabot in #185
- Bump maven-javadoc-plugin from 3.4.1 to 3.5.0 by @dependabot in #188
- Bump maven-surefire-plugin from 3.0.0-M8 to 3.0.0-M9 by @dependabot in #186
- Bump postgresql from 42.5.3 to 42.5.4 by @dependabot in #187
- Bump versions-maven-plugin from 2.14.2 to 2.15.0 by @dependabot in #192
- Bump maven-compiler-plugin from 3.10.1 to 3.11.0 by @dependabot in #189
- Bump flyway-core from 9.15.0 to 9.15.1 by @dependabot in #190
- Bump flyway-core from 9.15.1 to 9.16.1 by @dependabot in #198
- Bump maven-surefire-plugin from 3.0.0-M9 to 3.0.0 by @dependabot in #195
- Bump postgresql from 42.5.4 to 42.6.0 by @dependabot in #194
- Bump log4j.version from 2.19.0 to 2.20.0 by @dependabot in #191
- Bump maven-resources-plugin from 3.3.0 to 3.3.1 by @dependabot in #201
- Bump flyway-core from 9.16.1 to 9.16.3 by @dependabot in #203
- Bump forbiddenapis from 3.4 to 3.5.1 by @dependabot in #199
- Bump jacoco-maven-plugin from 0.8.8 to 0.8.9 by @dependabot in #200
- Bump maven-enforcer-plugin from 3.2.1 to 3.3.0 by @dependabot in #202
- Bump junit-jupiter from 5.9.2 to 5.9.3 by @dependabot in #207
- Bump jacoco-maven-plugin from 0.8.9 to 0.8.10 by @dependabot in #205
- Bump flyway-core from 9.16.3 to 9.17.0 by @dependabot in #206
- Bump jackson-core from 2.14.1 to 2.15.0 by @dependabot in #204
- Bump maven-gpg-plugin from 3.0.1 to 3.1.0 by @dependabot in #211
- Bump maven-surefire-plugin from 3.0.0 to 3.1.0 by @dependabot in #210
- Bump license-maven-plugin from 2.0.0 to 2.0.1 by @dependabot in #209
- Bump apache.tika.version from 2.7.0 to 2.8.0 by @dependabot in #212
- Bump flyway-core from 9.17.0 to 9.18.0 by @dependabot in #213
- Bump commons-io from 2.11.0 to 2.12.0 by @dependabot in #215
- Bump jackson-core from 2.15.0 to 2.15.1 by @dependabot in #216
- Bump maven-source-plugin from 3.2.1 to 3.3.0 by @dependabot in #214
Full Changelog: v5.0.1...v5.0.2
v5.0.1
What's Changed
- [#160] Create a Maven archetype to bootstrap a simple crawler setup by @rzo1 in #163
- Bump jackson-core from 2.13.3 to 2.13.4 by @dependabot in #124
- Bump log4j.version from 2.18.0 to 2.19.0 by @dependabot in #128
- Bump apache.tika.version from 2.4.1 to 2.5.0 by @dependabot in #137
- Bump hsqldb from 2.7.0 to 2.7.1 by @dependabot in #140
- Bump httpcore5 from 5.1.4 to 5.2 by @dependabot in #146
- Bump httpcore5-h2 from 5.1.4 to 5.2 by @dependabot in #143
- Bump jackson-core from 2.13.4 to 2.14.0 by @dependabot in #145
- Bump httpclient5 from 5.1.3 to 5.2 by @dependabot in #148
- Bump apache.tika.version from 2.5.0 to 2.6.0 by @dependabot in #149
- Bump urlfrontier-API from 2.2 to 2.3.1 by @dependabot in #157
- Bump postgresql to 42.5.1 by @dependabot in #159
- Bump jackson-core from 2.14.0 to 2.14.1 by @dependabot in #156
- Bump httpclient5 from 5.2 to 5.2.1 by @dependabot in #161
- Bump flyway-core to 9.10.0 by @dependabot in #162
Full Changelog: v5.0.0...v5.0.1
v5.0.0
What's Changed
- Documentation updates by @brbog in #111
- Bump flyway-core from 9.1.3 to 9.1.6 by @dependabot in #116
- Bump postgresql from 42.4.1 to 42.4.2 by @dependabot in #115
- JUnit5 migration by @brbog in #117
- Groovy expulsion by @brbog in #120
Full Changelog: v4.10.1...v5.0.0
v4.10.1
What's Changed
- Bump groovy-all from 3.0.11 to 3.0.12 by @dependabot in #98
- Bump junit-jupiter from 5.8.2 to 5.9.0 by @dependabot in #99
- Bump crawler-commons from 1.2 to 1.3 by @dependabot in #100
- Bump flyway-core from 8.5.13 to 9.0.4 by @dependabot in #101
- Bump hsqldb from 2.6.1 to 2.7.0 by @dependabot in #102
- Move authentication logic out of the PageFetcher constructor to allow… by @brbog in #104
- Bump flyway-core from 9.0.4 to 9.1.2 by @dependabot in #105
- Bump postgresql from 42.4.0 to 42.4.1 by @dependabot in #106
- perform exception handling outside Parser.parse method by @brbog in #107
- Ignore css parse errors and allow more versatile parsing implementations by @brbog in #108
- Bump flyway-core from 9.1.2 to 9.1.3 by @dependabot in #109
- Bump maven-javadoc-plugin from 3.4.0 to 3.4.1 by @dependabot in #110
Full Changelog: v4.10.0...v4.10.1
v4.10.0
What's Changed
- Bump maven-enforcer-plugin from 3.0.0 to 3.1.0 by @dependabot in #83
- Bump assertj-core from 3.22.0 to 3.23.1 by @dependabot in #82
- Bump postgresql from 42.3.6 to 42.4.0 by @dependabot in #84
- Bump apache.tika.version from 2.4.0 to 2.4.1 by @dependabot in #85
- Bump flyway-core from 8.5.12 to 8.5.13 by @dependabot in #86
- Bump protobuf-java-util from 3.21.1 to 3.21.2 by @dependabot in #87
- Bump log4j.version from 2.17.2 to 2.18.0 by @dependabot in #88
- allow overriding the used parsers for binary, css, plain text by @brbog in #90
- use css parser instead of broken regex to extract urls by @brbog in #92
- Http status code "303 See Other" can return a non-absolute url, which… by @brbog in #94
- Upgrade to URL Frontier 2.2 (was 2.1)
- Update HTTP Components to 5.1.4
Full Changelog: v4.9.1...v4.10.0
v4.9.1
What's Changed
- Bump postgresql from 42.3.4 to 42.3.5 by @dependabot in #64
- Bump versions-maven-plugin from 2.10.0 to 2.11.0 by @dependabot in #68
- Bump flyway-core from 8.5.10 to 8.5.11 by @dependabot in #67
- Bump jackson-core from 2.13.2 to 2.13.3 by @dependabot in #65
- Bump protobuf-java-util from 3.20.1 to 3.21.1 by @dependabot in #70
- Bump postgresql from 42.3.5 to 42.3.6 by @dependabot in #69
- Documentation updates by @brbog in #71
- Bump flyway-core from 8.5.11 to 8.5.12 by @dependabot in #76
- Bump groovy-all from 3.0.10 to 3.0.11 by @dependabot in #75
- bug: a detected url: "http://" by @brbog in #72
- tests proving incorrect matching of data urls inside a real life use case by @brbog in #78
- Bump urlfrontier-API from 2.0 to 2.1 by @dependabot in #66
- Issue parse css containing absolute urls by @brbog in #81
Full Changelog: v.4.9.0...v4.9.1
v4.9.0
What's Changed
Breaking Change
- Removal of IOException in CrawlController.addSeed(String): #61
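For callers, the practical effect of this change can be sketched as follows. This assumes the note means that `addSeed(String)` no longer declares `throws IOException`; `SeedRegistrySketch` is a hypothetical stand-in used only for illustration, not the real `CrawlController`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not the real CrawlController. */
final class SeedRegistrySketch {
    private final List<String> seeds = new ArrayList<>();

    // Assumed post-4.9.0 shape: no "throws IOException" clause.
    void addSeed(String url) {
        seeds.add(url);
    }

    // Returns an immutable snapshot of the seeds added so far.
    List<String> seeds() {
        return List.copyOf(seeds);
    }

    public static void main(String[] args) {
        SeedRegistrySketch controller = new SeedRegistrySketch();
        // Older calling code typically looked like:
        //   try { controller.addSeed(url); } catch (IOException e) { ... }
        // Once the method stops declaring IOException, javac rejects that
        // catch clause, because the try body can no longer throw it.
        controller.addSeed("https://www.example.com/");
        System.out.println(controller.seeds());
    }
}
```

When upgrading, removing the now-unreachable `catch (IOException e)` blocks around `addSeed` calls should be enough to restore compilation, provided nothing else in the same `try` body throws `IOException`.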
Dependency
- Bump tika from 2.3.0 to 2.4.0 by @rzo1
- Bump postgresql from 42.3.3 to 42.3.4 by @dependabot in #56
- Bump jacoco-maven-plugin from 0.8.7 to 0.8.8 by @dependabot in #54
- Bump flyway-core from 8.5.4 to 8.5.8 by @dependabot in #57
- Bump flyway-core from 8.5.8 to 8.5.9 by @dependabot in #58
- Bump maven-javadoc-plugin from 3.3.2 to 3.4.0 by @dependabot in #59
- Bump protobuf-java-util from 3.19.4 to 3.20.1 by @dependabot in #60
- Bump flyway-core from 8.5.9 to 8.5.10 by @dependabot in #63
- Bump urlfrontier-API from 1.2 to 2.0 by @dependabot in #62
Full Changelog: v4.8.3...v.4.9.0
v4.8.3
What's Changed
- Bump flyway-core from 8.5.0 to 8.5.1 by @dependabot in #43
- Bump maven-compiler-plugin from 3.10.0 to 3.10.1 by @dependabot in #48
- Bump flyway-core from 8.5.1 to 8.5.4 by @dependabot in #49
- Bump jackson-core from 2.13.1 to 2.13.2 by @dependabot in #44
- Bump groovy-all from 3.0.9 to 3.0.10 by @dependabot in #45
- Bump versions-maven-plugin from 2.9.0 to 2.10.0 by @dependabot in #50
- Bump urlfrontier-API from 1.0 to 1.2 by @dependabot in #47
- Bump forbiddenapis from 3.2 to 3.3 by @dependabot in #51
Full Changelog: v4.8.2...v4.8.3
v4.8.2
What's Changed
- Change of Default Behaviour: Politeness is now applied per host (rather than per request). To restore the "old" behaviour, you can pass a SimplePolitnessServer as a constructor parameter of PageFetcher.
- Bump postgresql from 42.3.2 to 42.3.3 by @dependabot in #41
- Bump flyway-core from 8.4.4 to 8.5.0 by @dependabot in #38
- Bump spock-core from 2.0-groovy-3.0 to 2.1-groovy-3.0 by @dependabot in #39
Full Changelog: v.4.8.1...v4.8.2
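To illustrate the difference that the v4.8.2 politeness change describes, here is a minimal sketch of per-host delay bookkeeping. It is not crawler4j's actual implementation: the class `PerHostPolitenessSketch`, its `reserve` method, and the 200 ms delay are hypothetical names and values chosen for this example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not crawler4j's politeness implementation. */
final class PerHostPolitenessSketch {
    private final long delayMillis;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowedByHost = new HashMap<>();

    PerHostPolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Returns how long the caller should wait before fetching the URL,
     * and reserves the following delay window for that URL's host.
     */
    synchronized long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        long earliest = nextAllowedByHost.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowedByHost.put(host, start + delayMillis);
        return start - nowMillis;
    }

    public static void main(String[] args) {
        PerHostPolitenessSketch politeness = new PerHostPolitenessSketch(200);
        System.out.println(politeness.reserve("https://a.example/1", 0));
        System.out.println(politeness.reserve("https://b.example/1", 0));
        System.out.println(politeness.reserve("https://a.example/2", 50));
    }
}
```

In this sketch, a request to a different host is not delayed by an earlier request to another host, whereas a single global timer (the per-request behaviour) would make every request wait regardless of its target.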