Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(academy): add advanced crawling section with sitemaps and search #1217

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

metalwarrior665
Copy link
Member

No description provided.

@metalwarrior665
Copy link
Member Author

metalwarrior665 commented Sep 17, 2024

Will process the lint issues soon

@fnesveda fnesveda added the t-academy Issues related to Web Scraping and Apify academies. label Sep 18, 2024
@metalwarrior665
Copy link
Member Author

@TC-MO If we change URL of an article, do I need to contact web team to set a hard redirect?

@TC-MO
Copy link
Contributor

TC-MO commented Sep 18, 2024

I think we do redirects in nginx.conf file not sure if there is any other way

@metalwarrior665
Copy link
Member Author

TODO redirect
..._web_scraping/scraping_paginated_sites.md → ...scraping/crawling/crawling-with-search.md

@honzajavorek
Copy link
Collaborator

Will review, but I think I will wait for @TC-MO's comments to be addressed first.

@metalwarrior665 metalwarrior665 requested a review from TC-MO February 1, 2025 11:44
@metalwarrior665
Copy link
Member Author

@TC-MO I addressed the remarks and rebased to master, pls take a look. I will see the tests

Copy link
Collaborator

@honzajavorek honzajavorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 I think the front matter contains double semicolons, which could be a syntax error or a silent error causing unexpected problems.

🟠 I made a few suggestions on improvements. My aim was to add to readability and flow of the sentences.

  • I steered the text towards being language-agnostic, but I didn't check the whole context of the course to see if it's necessary or not.
  • I suggested to remove certain emotional phrases, such as "fortunately".
  • Saying "next" together with "lessons" didn't feel right, so I suggest using "following lessons" and to keep "next" for just one lesson.
  • You mention framework several times, and while it's perfectly valid, my idea would be to keep that word only for Crawlee (or Scrapy and others), and find other words to call the rest, such as algorithm, tool, or approach. I think separating those two clearly will remove some amount of possible confusion.
  • When linking to content, I prefer to put actual text and names to the link text instead of "see this article", and I think it's also a web document accessibility rule.

⚪️ I could imagine a bit more polishing of the flow of the sentences, and sometimes they could be more brief, but I don't think that's something we should do in PR reviews, it's highly subjective and a lot of back and forth. For minimalistic Czenglish check I can highly recommend "Is this correct English?" prompt to ChatGPT, followed by a Markdown block of whatever you want to check. It keeps your writing as is, but provides basic grammar and flow fixes. The text then sounds like you, but is more smooth. I use it a lot.

Commit all suggestions from Honza

Co-authored-by: Honza Javorek <[email protected]>
@metalwarrior665
Copy link
Member Author

@TC-MO @honzajavorek Thanks for the suggestions, I hope I applied everything

Copy link
Contributor

@TC-MO TC-MO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-academy Issues related to Web Scraping and Apify academies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants