A collection of small programs that extract data from a website and package it in a useful form using BeautifulSoup, a Python package for parsing HTML and XML documents. Once you retrieve the raw HTML of a site, you can select and extract data with BeautifulSoup, which parses a raw HTML string and produces an object that mirrors the structure of the HTML document.
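
As a minimal sketch of that parsing step, the snippet below feeds a hard-coded HTML string to BeautifulSoup and pulls out a few fields. The HTML, the `product` and `price` class names, and the variable names are made up for illustration and do not come from any program in this collection. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page that has already been downloaded.
raw_html = """
<html>
  <head><title>Example Store</title></head>
  <body>
    <ul id="products">
      <li class="product"><a href="/items/1">Widget</a> <span class="price">4.50</span></li>
      <li class="product"><a href="/items/2">Gadget</a> <span class="price">12.00</span></li>
    </ul>
  </body>
</html>
"""

# Parse the raw string into a tree of tags that mirrors the document.
soup = BeautifulSoup(raw_html, "html.parser")

# Tags can be reached by name as attributes of the parsed tree.
page_title = soup.title.string

# Collect one dictionary per product entry.
products = []
for item in soup.find_all("li", class_="product"):
    link = item.find("a")
    price = item.find("span", class_="price")
    products.append({
        "name": link.get_text(strip=True),
        "url": link["href"],
        "price": float(price.get_text(strip=True)),
    })

print(page_title)
for product in products:
    print(product)
```

In a real scraper the string would come from an HTTP response rather than a literal, and the tag and class names would have to match the target site's markup.
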
- Check a website's Terms and Conditions before scraping it, and read the statements about legal use of the data.
- Do not request data from the website too aggressively, and ensure that your program behaves in a reasonable manner.
- Revisit the website periodically and update your code as needed, since the layout of the site may change.
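
One possible way to keep a scraper well behaved is to honor the site's `robots.txt` rules and to space out requests. The sketch below uses only the Python standard library. The `robots.txt` content, the `example-scraper` user agent, the URLs, and the `Throttle` class are all hypothetical and exist only for illustration. Checking `robots.txt` complements reading the Terms and Conditions; it does not replace it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-scraper"  # hypothetical user agent name

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

allowed_public = rules.can_fetch(USER_AGENT, "https://example.com/items/1")
allowed_private = rules.can_fetch(USER_AGENT, "https://example.com/private/data")
crawl_delay = rules.crawl_delay(USER_AGENT)  # None if the file sets no delay


class Throttle:
    """Hypothetical helper that enforces a minimum gap between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Use the site's requested delay when present, else fall back to one second.
throttle = Throttle(crawl_delay if crawl_delay is not None else 1.0)
```

Calling `throttle.wait()` before each request, and skipping any URL for which `can_fetch` returns `False`, keeps the request rate bounded. The one-second fallback is an arbitrary choice, not a recommendation from any particular site.
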