
When a read timeout occurs, crawler4j doesn't try to visit the webpage once again #99

Open · wojtuch opened this issue Nov 19, 2015 · 1 comment · May be fixed by #437

Comments


wojtuch commented Nov 19, 2015

Hello everyone,

first of all - thanks for this useful and amazing piece of software!

In my current project it is important to crawl the whole website, so URLs on which the crawler hits a read timeout should be rescheduled and visited again. Googling brought me to the project's old homepage (https://code.google.com/p/crawler4j/issues/detail?id=261), where I found out that crawler4j retries several times.

However, the URLs causing timeouts appear only once in my log files (which alone doesn't necessarily mean the crawler misbehaves -- they could just as well have been fetched successfully on the first retry). Unfortunately, those URLs can't be found in my database after the crawler terminates either, which convinces me that no retry took place.

Could you help me with that?
Best,
Wojciech

@jasonbronson

You could collect the URLs in a list, check each page's status code, and re-run the crawler on any URLs that never returned one. The relevant callback is:

public boolean shouldVisit(Page referringPage, WebURL url) { ... }
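
A minimal sketch of that idea, assuming crawler4j 4.x, where WebCrawler exposes the onContentFetchError and onUnhandledException hooks; the RetryAwareCrawler class name and the example.com seed filter are hypothetical:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class RetryAwareCrawler extends WebCrawler {

    // Shared across all crawler threads; collects URLs whose fetch failed.
    public static final Set<String> failedUrls = ConcurrentHashMap.newKeySet();

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Hypothetical domain filter; replace with your own rule.
        return url.getURL().startsWith("https://example.com/");
    }

    @Override
    protected void onContentFetchError(WebURL webUrl) {
        // The page content could not be fetched; remember the URL.
        failedUrls.add(webUrl.getURL());
    }

    @Override
    protected void onUnhandledException(WebURL webUrl, Throwable e) {
        // Read timeouts (SocketTimeoutException) may surface here instead.
        if (webUrl != null) {
            failedUrls.add(webUrl.getURL());
        }
    }

    @Override
    public void visit(Page page) {
        // Reaching visit() means the fetch succeeded, so drop the URL
        // in case an earlier attempt had recorded it as failed.
        failedUrls.remove(page.getWebURL().getURL());
        // ... process the page ...
    }
}

Since controller.start(RetryAwareCrawler.class, numberOfCrawlers) blocks until the crawl finishes, the collected failedUrls can afterwards be fed as seeds to a fresh CrawlController for a second pass, repeated until the set is empty or a retry budget runs out.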

@dgoiko dgoiko linked a pull request Feb 3, 2020 that will close this issue