
Make dead link detection more robust #1437

Open
vegetabill opened this issue Dec 31, 2020 · 12 comments
Labels
EASY Quick or simple task good-first-issue Simple issue for those new to the repo or open-source in general

Comments

@vegetabill
Collaborator

Recently we added a list of false positives:

https://github.com/Techtonica/curriculum/blob/main/meta/false-dead-links.md

I'm assuming these are caused by:

  • sites that block bots
  • since we have many links to the same sites, we might be getting rate-limited

Ideally when we run the report, it should be easy to see if we actually have dead links.
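For illustration, both suspected causes would show up as HTTP error statuses even though the pages exist, which is why a status-aware report makes it easier to see whether a link is actually dead. A hypothetical classifier (not part of the repo's checker) might look like:

```javascript
// Hypothetical helper: separates links that are truly dead from links that
// only *look* dead to an automated checker. The status-to-verdict mapping
// here is an illustrative assumption, not the repo's actual logic.
function classifyStatus(status) {
  if (status === 404 || status === 410) return 'dead'; // page really is gone
  if (status === 429) return 'rate-limited'; // too many requests from us
  if (status === 403 || status === 503) return 'maybe-bot-blocked'; // bot protection often answers 403/503
  if (status >= 200 && status < 400) return 'alive';
  return 'needs-manual-check';
}
```

A report grouped by these verdicts would let a human skim only the "dead" bucket instead of re-checking every false positive by hand.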

@vegetabill
Collaborator Author

@gsong any ideas on this?

@alodahl
Collaborator

alodahl commented Dec 31, 2020

You mean like running a diff automatically? That would be great - I haven't thought about a real script yet.

@alodahl
Collaborator

alodahl commented Dec 31, 2020

I did add a line to be aware of the false positives in CONTRIBUTING.md, at least.

@alodahl
Collaborator

alodahl commented Dec 31, 2020

Also @CoderCarrot, do you have more insight on Bill's first comment here?

@alodahl alodahl added good-first-issue Simple issue for those new to the repo or open-source in general EASY Quick or simple task labels Jan 1, 2021
@CoderCarrot
Contributor

Also @CoderCarrot, do you have more insight on Bill's first comment here?

I was thinking about this. I do not have any immediate or straight-forward ideas on this, but it's something I could work on. Someone more experienced may come up with a quicker, more elegant solution, but I would be happy to look into it when I have time!

@stale

stale bot commented Feb 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 19, 2021
@stale stale bot closed this as completed Feb 26, 2021
@vegetabill
Collaborator Author

I think this would be useful. I merged better StaleBot rules and am reopening this.

@vegetabill vegetabill reopened this Feb 27, 2021
@manufacturedba
Contributor

Hey @alodahl @vegetabill, I'd be happy to take a stab at this using some of the lint rule's available config options.

@manufacturedba
Contributor

sites that block bots

Majority of false positives seem to fall under this category. The PR I put up skips the 2 highest offenders (codepen/github) and localhost. They were also inducing a lot more timeouts when combined with other config changes I tested. Obviously skipping is not the ideal route, but it cuts down enough noise to be trusted.
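The thread doesn't name the checker, but if it were a tool like markdown-link-check (an assumption on my part), skipping those domains would look something like this in its JSON config file:

```json
{
  "ignorePatterns": [
    { "pattern": "^https?://(www\\.)?codepen\\.io" },
    { "pattern": "^https?://(www\\.)?github\\.com" },
    { "pattern": "^https?://localhost" }
  ]
}
```

Any URL matching one of these patterns would be skipped entirely rather than reported as dead, which is the trade-off described above: less noise, but those domains go unchecked.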

I think to get these domains back, they'll need to be checked by hand and/or checked far less often.

  1. Collect all links using remark
  2. Read existing link list
  3. Filter for links that either have no entry or have an expired timestamp
  4. Write back to link list with successful links and some future timestamp
  5. Output failed links

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.

@alodahl
Collaborator

alodahl commented Sep 13, 2021

Thanks for the insights, @manufacturedba. Would you be open to adding these ideas as notes to the last section of https://github.com/Techtonica/curriculum/blob/main/CONTRIBUTING.md#L58 as part of the PR, so the knowledge isn't lost?

@manufacturedba
Contributor

Yup. Do you have thoughts on whether the suggested manual steps are feasible for the team?

Mainly it's the following:

To stop the script from checking sites with bot protection, anyone can append a link themselves with whatever timestamp. Also this script takes an eternity as-is, so the side-effect should be a huge reduction in run-time.

I will include the steps with the PR that implements this. The current PR is only a stop-gap.

@alodahl
Collaborator

alodahl commented Oct 3, 2021

Sounds good to me!


7 participants