Web scraping vulnerability: a demonstration of a potential threat to information security in SaaS.
This Python script scrapes product data from the Tops Online website (www dot tops dot co dot th) for specified categories. It uses Selenium to handle dynamic content rendered by Handlebars.
Important: This script is designed for EDUCATIONAL PURPOSES ONLY. I DO NOT SUGGEST OR SUPPORT SCRAPING ANY WEBSITE OR ONLINE CONTENT AT ALL without explicit permission from the owner of that content. Even with such permission, any scraping should be limited to a very small data set and a specific educational purpose. Scraping websites can be against their terms of service, so always review a website's terms of service and robots.txt file before scraping. Use this script responsibly and ethically, be mindful of website load, and avoid making excessive requests. The website structure may change at any time, which can break this script.
Therefore, read the following VERY CAREFULLY before using this script:
This script is provided for educational and informational purposes only. The author makes no warranties, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the script or the information, products, services, or related graphics contained in the script for any purpose. Any reliance you place on such information is therefore strictly at your own risk.
The author does not endorse or encourage the use of this script for any illegal or unethical activities. It is the sole responsibility of the user to ensure that any use of this script complies with all applicable laws, regulations, and website terms of service. This includes, but is not limited to, laws relating to data privacy, copyright, intellectual property, and computer misuse.
The author shall not be liable for any loss or damage whatsoever arising from the use of this script, including, without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this script.
The user acknowledges and agrees that they are solely responsible for any consequences that may arise from the use of this script. The author shall not be held responsible for any legal action, financial penalties, or other repercussions that may result from the use of this script.
By using this script, you agree to these terms and conditions. If you do not agree to these terms and conditions, do not use this script. It is strongly recommended that you consult with a legal professional before using this script to ensure compliance with all applicable laws and regulations.
This scraper uses the following approach:
- Selenium for Dynamic Content: The Tops Online website uses Handlebars to render product data dynamically. This means the initial HTML source code doesn't contain all the product information. We use Selenium to launch a headless Chrome browser, navigate to the category page, wait for the Handlebars templates to render the content, and then extract the fully rendered HTML.
- Beautiful Soup for Parsing: Once Selenium retrieves the fully rendered HTML, we use Beautiful Soup to parse the HTML and extract the desired data elements (product name, price, etc.) using CSS selectors.
- Data Storage: The extracted data is structured into a Python dictionary and then saved as a JSON file for each category. The script also saves a copy of the rendered HTML to a file for debugging purposes. (A sketch of this pipeline follows this list.)
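Conceptually, the pipeline described above can be sketched as follows. This is a minimal illustration, not the actual tops_online_scraper.py: the CSS selectors (div.product-item, .product-name, .product-price), the example category URL, and the fixed sleep are assumptions you would need to replace with values verified against the real page; only the chromedriver path is the one referenced later in this README.

```python
import json
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def fetch_rendered_html(url, driver, wait_seconds=10):
    """Load a category page and return the HTML after client-side rendering."""
    driver.get(url)
    # Crude wait for the Handlebars templates to finish rendering; an explicit
    # WebDriverWait on a known product selector would be more robust.
    time.sleep(wait_seconds)
    return driver.page_source


def parse_products(html):
    """Extract product fields with Beautiful Soup. The CSS selectors below are
    placeholders -- verify them against the saved debug HTML."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product-item"):        # hypothetical selector
        name = card.select_one(".product-name")         # hypothetical selector
        price = card.select_one(".product-price")       # hypothetical selector
        products.append({
            "Product Name": name.get_text(strip=True) if name else None,
            "Price": price.get_text(strip=True) if price else None,
        })
    return products


if __name__ == "__main__":
    chrome_options = Options()
    chrome_options.add_argument("--headless")            # run without a visible window
    service = Service(executable_path="/opt/homebrew/bin/chromedriver")
    driver = webdriver.Chrome(service=service, options=chrome_options)

    os.makedirs("tops_co_th_data", exist_ok=True)
    categories = {"OTOP": "https://www.tops.co.th/en/otop"}  # hypothetical example URL

    try:
        for category, url in categories.items():
            html = fetch_rendered_html(url, driver)
            # Keep the rendered HTML for debugging, then save the parsed data as JSON.
            with open(os.path.join("tops_co_th_data", f"{category}.html"), "w", encoding="utf-8") as f:
                f.write(html)
            with open(os.path.join("tops_co_th_data", f"{category}.json"), "w", encoding="utf-8") as f:
                json.dump(parse_products(html), f, ensure_ascii=False, indent=2)
    finally:
        driver.quit()
```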
This script requires:
- Python 3.6+
- requests library (not directly used in the current implementation, but kept for potential future use)
- beautifulsoup4 library
- selenium library
- chromedriver: the ChromeDriver executable compatible with your Chrome browser version.
To install and run the scraper:
- Clone this repository:
  git clone [repository url, coming up...]
  cd tops_scraper
- Install the required Python packages:
  pip install requests beautifulsoup4 selenium
- Install ChromeDriver:
  - Download the ChromeDriver executable that is compatible with your installed Chrome browser version from https://chromedriver.chromium.org/downloads.
  - Extract the downloaded file.
  - Place the chromedriver executable in a directory that is in your system's PATH, or specify the path to chromedriver directly in the tops_online_scraper.py script.
- Edit the script (Optional):
  - Open tops_online_scraper.py and modify the categories dictionary to include the category URLs you want to scrape (see the configuration sketch after this list).
  - Important: Verify the path to your chromedriver executable in the script. The line service = Service(executable_path='/opt/homebrew/bin/chromedriver') should point to the correct location.
- Run the script:
  python tops_online_scraper.py
- The script will create a directory named tops_co_th_data. Inside this directory, you'll find:
  - JSON files (e.g., OTOP.json, Only At Tops.json) containing the scraped product data for each category.
  - HTML files (e.g., OTOP.html, Only At Tops.html) containing the full HTML source code of the scraped pages, useful for debugging.
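As mentioned in the "Edit the script" step, the two pieces you normally edit in tops_online_scraper.py are the categories dictionary and the ChromeDriver path. The sketch below illustrates what they might look like; the category URLs are hypothetical placeholders, not values taken from the actual script, while the executable path matches the one referenced above.

```python
# In tops_online_scraper.py -- the two pieces you normally edit.
from selenium.webdriver.chrome.service import Service

# Category name -> category page URL. The URLs below are illustrative only;
# replace them with the pages you actually have permission to scrape.
categories = {
    "OTOP": "https://www.tops.co.th/en/otop",                   # hypothetical URL
    "Only At Tops": "https://www.tops.co.th/en/only-at-tops",   # hypothetical URL
}

# Must point at the ChromeDriver binary installed in the previous step.
service = Service(executable_path='/opt/homebrew/bin/chromedriver')
```

Based on the output files described above, the category names appear to become the output file names (e.g. tops_co_th_data/OTOP.json and tops_co_th_data/OTOP.html).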
Notes:
- Website Structure: This script relies heavily on the specific HTML structure and JSON data format of the Tops Online website. Any changes to the website's layout will likely break the script. You will need to update the CSS selectors and data extraction logic accordingly. Examine the HTML files generated for debugging purposes.
- Rate Limiting: This script does not implement explicit rate limiting (a time.sleep() was available but not used in the demonstration), but it's crucial to be respectful of the website's server and to avoid being rate limited. Consider adding a time.sleep() call inside the loop to introduce a delay between requests (see the sketch after this list); if you encounter issues, increase the delay.
- Terms of Service: Always respect the website's terms of service and robots.txt file.
- Selenium Configuration: The script uses headless Chrome to avoid opening a visible browser window. You can remove the --headless argument from chrome_options if you want to see the browser during scraping.
- Error Handling: The script includes basic error handling to catch exceptions during the scraping process. You can enhance the error handling to log errors to a file or implement retry mechanisms.
- JSON Structure: The extracted data is stored in JSON format, where each category has a list of product dictionaries. Each product dictionary contains fields like "Product Name", "Price", "Quantity", and "Image URL".
- Sample Output Generation: Separate HTML and JSON output files are saved for each category in the tops_co_th_data folder. This structure improves readability and debuggability, making it easier to understand, analyse, and modify the extracted data.
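To make the Rate Limiting and Error Handling notes concrete, here is one hedged way to combine a polite delay with a simple retry. The helper below is an illustration only; scrape_with_retry, REQUEST_DELAY_SECONDS, and MAX_RETRIES are names invented for this sketch, and the commented usage assumes the fetch_rendered_html/parse_products helpers from the earlier sketch.

```python
import time
from typing import Callable, Dict, List, Optional

REQUEST_DELAY_SECONDS = 5   # polite pause between requests; increase it if you hit problems
MAX_RETRIES = 3             # simple retry budget per category


def scrape_with_retry(scrape_once: Callable[[], List[Dict]],
                      retries: int = MAX_RETRIES,
                      delay: float = REQUEST_DELAY_SECONDS) -> List[Dict]:
    """Run one category scrape, retrying on failure with a growing back-off."""
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return scrape_once()
        except Exception as error:           # could be narrowed to selenium's WebDriverException
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            time.sleep(delay * attempt)      # back off a little more on each retry
    raise last_error

# Inside the main loop, pause between categories to reduce load on the server:
#   products = scrape_with_retry(lambda: parse_products(fetch_rendered_html(url, driver)))
#   time.sleep(REQUEST_DELAY_SECONDS)
```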
This script currently implements only a basic approach and relies on a delay (which can be added, as shown in the notes above) to avoid overwhelming the server. Dealing with more robust bot protection may require additional techniques, such as:
- User-Agent Rotation: Changing the User-Agent header in each request to mimic different browsers (a brief sketch follows this list).
- Proxy Rotation: Using a pool of proxy servers to rotate IP addresses.
- CAPTCHA Solving: Integrating a CAPTCHA solving service to bypass CAPTCHAs.
Important: Implementing these techniques can be complex and may violate the website's terms of service. Use them responsibly and ethically.
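Purely for illustration, and only for use where you have the explicit permission described above, the sketch below shows what User-Agent rotation could look like with Selenium. The User-Agent strings and the helper name are assumptions for this example; Chrome fixes its User-Agent at launch, so rotating it means creating a fresh driver per session.

```python
import random

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# A small pool of desktop User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]


def build_driver_with_random_user_agent(chromedriver_path):
    """Create a headless Chrome driver whose User-Agent is drawn from the pool."""
    options = Options()
    options.add_argument("--headless")
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)
```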
This project is licensed under the MIT License. By using this code or any part of it, you hold yourself fully responsible for any consequences whatsoever.
You agree to indemnify and hold harmless the author of this script from and against any and all claims, liabilities, damages, losses, and expenses (including reasonable attorneys' fees) arising out of or in any way connected with your use of this script, including, but not limited to, any claims that the use of this script infringes any intellectual property rights, privacy rights, or other rights of any third party, or violates any applicable law or regulation. This indemnification obligation will survive the termination of your use of the script.
Please immediately inform the author of this script if you come across, notice, or become aware of any breach of any term of the legal agreement attached to this script.
Ram