Web scraping vulnerability. A demo of potential threat to information security at SaaS.

ramonrails/web_scraping_vulnerability

Tops Online Scraper

This Python script scrapes product data from the Tops Online website (www dot tops dot co dot th) for specified categories. It uses Selenium to handle dynamic content rendered by Handlebars.

Disclaimer

Important: This script is designed for EDUCATIONAL PURPOSES ONLY. I DO NOT SUGGEST OR SUPPORT SCRAPING ANY WEBSITE OR ONLINE CONTENT AT ALL without explicit permission from the owner of that content. Even with the required permission, any such scraping must be limited to a very small data set and a specific educational purpose. Scraping a website can violate its terms of service. Always review a website's terms of service and robots.txt file before scraping. Use this script responsibly and ethically. Be mindful of website load and avoid making excessive requests. The website structure may change at any time, which can break this script.

Therefore, read the following VERY CAREFULLY before using this script:

This script is provided for educational and informational purposes only. The author makes no warranties, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the script or the information, products, services, or related graphics contained in the script for any purpose. Any reliance you place on such information is therefore strictly at your own risk.

The author does not endorse or encourage the use of this script for any illegal or unethical activities. It is the sole responsibility of the user to ensure that any use of this script complies with all applicable laws, regulations, and website terms of service. This includes, but is not limited to, laws relating to data privacy, copyright, intellectual property, and computer misuse.

The author shall not be liable for any loss or damage whatsoever arising from the use of this script, including, without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this script.

The user acknowledges and agrees that they are solely responsible for any consequences that may arise from the use of this script. The author shall not be held responsible for any legal action, financial penalties, or other repercussions that may result from the use of this script.

By using this script, you agree to these terms and conditions. If you do not agree to these terms and conditions, do not use this script. It is strongly recommended that you consult with a legal professional before using this script to ensure compliance with all applicable laws and regulations.

Approach

This scraper uses the following approach:

  1. Selenium for Dynamic Content: The Tops Online website uses Handlebars to render product data dynamically. This means the initial HTML source code doesn't contain all the product information. We use Selenium to launch a headless Chrome browser, navigate to the category page, wait for the Handlebars templates to render the content, and then extract the fully rendered HTML.

  2. Beautiful Soup for Parsing: Once Selenium retrieves the fully rendered HTML, we use Beautiful Soup to parse the HTML and extract the desired data elements (product name, price, etc.) using CSS selectors.

  3. Data Storage: The extracted data is structured into a Python dictionary and then saved as a JSON file for each category. The script also saves a copy of the rendered HTML to a file for debugging purposes.
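
The three steps above could be sketched roughly as follows. This is a minimal illustration, not the code in tops_online_scraper.py: the function names (render_page, parse_products, save_category) and the selector parameters are hypothetical, and the real selectors depend on the site's current markup. The {category: [products]} JSON layout is also an assumption. Selenium and Beautiful Soup are imported inside the functions that need them so the storage helper can be used on its own.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def render_page(url: str, wait_selector: str, timeout: int = 20) -> str:
    """Step 1: load the page in headless Chrome and return the rendered HTML."""
    # Imported here so the helpers below also work on machines without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one element matching wait_selector exists,
        # i.e. until the client-side templates have produced some output.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_products(html: str, card_selector: str,
                   name_selector: str, price_selector: str) -> List[Dict[str, Optional[str]]]:
    """Step 2: extract one dictionary per product card using CSS selectors."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select(card_selector):
        name = card.select_one(name_selector)
        price = card.select_one(price_selector)
        products.append({
            "Product Name": name.get_text(strip=True) if name else None,
            "Price": price.get_text(strip=True) if price else None,
        })
    return products


def save_category(category: str, html: str, products: list,
                  out_dir: str = "tops_co_th_data") -> None:
    """Step 3: write <category>.html and <category>.json into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{category}.html").write_text(html, encoding="utf-8")
    # Assumed layout: the category name maps to its list of products.
    payload = {category: products}
    (out / f"{category}.json").write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

For one category, a run would call render_page with a URL from the categories dictionary, pass the returned HTML to parse_products, and hand both results to save_category. Waiting on a selector rather than sleeping for a fixed time is one way to tell that the templates have rendered.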

Prerequisites

  • Python 3.6+
  • requests library (not used directly in the current implementation; kept for potential future use)
  • beautifulsoup4 library
  • selenium library
  • chromedriver - The ChromeDriver executable compatible with your Chrome browser version.

Installation

  1. Clone this repository:

    git clone [repository url, coming up...]
    cd tops_scraper
  2. Install the required Python packages:

    pip install requests beautifulsoup4 selenium
  3. Install ChromeDriver:

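    • Download the ChromeDriver executable that is compatible with your installed Chrome browser version from https://chromedriver.chromium.org/downloads. (For Chrome 115 and later, ChromeDriver builds are published through the Chrome for Testing availability dashboard instead.)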
    • Extract the downloaded file.
    • Place the chromedriver executable in a directory that is on your system's PATH, or specify the path to chromedriver directly in the tops_online_scraper.py script.
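
To confirm that the last step worked, a small check like the one below can report whether chromedriver is reachable through PATH. It is a sketch only and assumes a POSIX-style system: locate_chromedriver is a hypothetical helper that is not part of the script, and it relies solely on the standard library's shutil.which.

```python
import shutil
from typing import Optional


def locate_chromedriver(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the chromedriver executable, or None if not found.

    search_path is an optional os.pathsep-separated list of directories;
    when omitted, the PATH environment variable is searched.
    """
    return shutil.which("chromedriver", path=search_path)


if __name__ == "__main__":
    found = locate_chromedriver()
    if found:
        print("chromedriver found at:", found)
    else:
        print("chromedriver is not on PATH; set executable_path in the script instead")
```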

Usage

  1. Edit the script (Optional):

    • Open tops_online_scraper.py and modify the categories dictionary to include the category URLs you want to scrape.
    • Important: Verify the path to your chromedriver executable in the script. The line service = Service(executable_path='/opt/homebrew/bin/chromedriver') should point to the correct location.
  2. Run the script:

    python tops_online_scraper.py
  3. The script will create a directory named tops_co_th_data. Inside this directory, you'll find:

    • JSON files (e.g., OTOP.json, Only At Tops.json) containing the scraped product data for each category.
    • HTML files (e.g., OTOP.html, Only At Tops.html) containing the full HTML source code of the scraped pages, useful for debugging.
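
After a run, the JSON files can be inspected with a few lines of Python. The sketch below counts the products saved per category. count_products is a hypothetical helper, and because the exact JSON layout is an assumption, it accepts either a top-level list of products or a dictionary whose values are lists.

```python
import json
from pathlib import Path
from typing import Dict


def count_products(out_dir: str = "tops_co_th_data") -> Dict[str, int]:
    """Map each JSON file name (without extension) to its number of products."""
    counts = {}
    for json_file in sorted(Path(out_dir).glob("*.json")):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            # Assumed layout: {"<category>": [product, product, ...]}
            counts[json_file.stem] = sum(
                len(value) for value in data.values() if isinstance(value, list)
            )
        else:
            # Assumed layout: [product, product, ...]
            counts[json_file.stem] = len(data)
    return counts


if __name__ == "__main__":
    for category, total in count_products().items():
        print(f"{category}: {total} products")
```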

Important Notes

  • Website Structure: This script relies heavily on the specific HTML structure and JSON data format of the Tops Online website. Any changes to the website's layout will likely break the script. You will need to update the CSS selectors and data extraction logic accordingly. Examine the HTML files generated for debugging purposes.

  • Rate Limiting: This script does not implement explicit rate limiting (a time.sleep() call was available but left unused for the demonstration). Be respectful of the website's server and avoid triggering its rate limits: consider adding a time.sleep() call inside the loop to introduce a delay between requests, and increase the delay if you encounter issues.

  • Terms of Service: Always respect the website's terms of service and robots.txt file.

  • Selenium Configuration: The script uses headless Chrome to avoid opening a visible browser window. You can remove the --headless argument from chrome_options if you want to see the browser during scraping.

  • Error Handling: The script includes basic error handling to catch exceptions during the scraping process. You can enhance the error handling to log errors to a file or implement retry mechanisms.

  • JSON Structure: The extracted data is stored as JSON, where each category has a list of product dictionaries. Each product dictionary contains fields such as "Product Name", "Price", "Quantity", and "Image URL".

  • Sample Output Generation: Separate HTML and JSON output files are saved for each category in the tops_co_th_data folder. This structure improves readability and debuggability, making it easier to understand, analyse and modify the extracted data.
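
The rate limiting and error handling notes above could be combined along these lines. This is a sketch under stated assumptions, not the script's actual loop: scrape_politely and its parameters are hypothetical names, and scrape_one stands in for whatever function scrapes a single category URL. The sleep function is passed in as a parameter so the delay logic can be exercised without actually waiting.

```python
import logging
import time
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("tops_scraper")


def scrape_politely(categories: Dict[str, str],
                    scrape_one: Callable[[str], Any],
                    delay_seconds: float = 2.0,
                    max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Optional[Any]]:
    """Scrape each category with a pause between requests and simple retries."""
    results = {}
    for index, (name, url) in enumerate(categories.items()):
        if index > 0:
            sleep(delay_seconds)  # pause between categories
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = scrape_one(url)
                break
            except Exception as exc:
                logger.warning("Attempt %d/%d for %s failed: %s",
                               attempt, max_attempts, name, exc)
                if attempt < max_attempts:
                    # Wait a little longer after each failure before retrying.
                    sleep(delay_seconds * attempt)
        else:
            # The loop finished without a break: every attempt failed.
            results[name] = None
    return results
```

Calling logging.basicConfig(filename="scraper.log", level=logging.WARNING) before the loop sends these warnings to a file rather than the console. A category that fails on every attempt is recorded as None, so one broken page does not stop the rest of the run.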

Bot Protection Techniques

This script currently takes a basic approach: it relies only on a delay between requests (which can be added) to avoid overwhelming the server. Websites with bot protection may call for more robust techniques, such as:

  • User-Agent Rotation: Changing the User-Agent header in each request to mimic different browsers.
  • Proxy Rotation: Using a pool of proxy servers to rotate IP addresses.
  • CAPTCHA Solving: Integrating a CAPTCHA solving service to bypass CAPTCHAs.

Important: Implementing these techniques can be complex and may violate the website's terms of service. Use them responsibly and ethically.

License

This project is licensed under the MIT License. By using this code or any part of it, you hold yourself fully responsible for any consequences whatsoever.

Indemnity

You agree to indemnify and hold harmless the author of this script from and against any and all claims, liabilities, damages, losses, and expenses (including reasonable attorneys' fees) arising out of or in any way connected with your use of this script, including, but not limited to, any claims that the use of this script infringes any intellectual property rights, privacy rights, or other rights of any third party, or violates any applicable law or regulation. This indemnification obligation will survive the termination of your use of the script.

Responsible behavior

Please inform the author of this script immediately if you come across, notice, or become aware of any breach of the terms attached to this script.

Author

Ram
