This project scrapes job-posting data from VietnamWorks, TopCV, and LinkedIn. The data is extracted using Scrapy and Selenium; for LinkedIn, the project leverages the ScrapingDog API to bypass scraping restrictions. After scraping, the data can be stored in a SQL Server database via pyodbc for further analysis.
- Scrape job postings and associated data from VietnamWorks, TopCV, and LinkedIn.
- Leverage ScrapingDog API to handle LinkedIn data with minimal restrictions.
- Store the scraped data in a SQL Server database for structured processing.
- Install dependencies from `requirements.txt`: `pip install -r requirements.txt`
- Create a free account on ScrapingDog and obtain an API key for LinkedIn data scraping.
- Ensure you have a valid SQL Server database set up with appropriate credentials if you wish to store scraped data (a quick connectivity check is sketched below).
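To confirm the database is reachable before wiring up the pipeline, a quick pyodbc check can look like the sketch below. The driver name, server, database, and credentials are placeholders you will need to adapt to your own setup:

```python
import pyodbc

# Placeholder connection details -- substitute your own SQL Server setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=job_data;"
    "UID=sa;"
    "PWD=<your_password>"
)
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")
print(cursor.fetchone()[0])  # prints the server version if the connection works
conn.close()
```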
- Clone the repository and navigate to the project directory. Create a new virtual environment, for example:

```bash
conda create -n <name_env> python=3.12.7
conda activate <name_env>
```
- Update the Scrapy settings in `settings.py` to enable the pipeline:

```python
ITEM_PIPELINES = {
    "data_job_vn__analyze.pipelines.DataJobVnAnalyzePipeline": 300,
}
```
Each website has its own spider. Use the following commands to run the crawlers:
- For VietnamWorks: `scrapy crawl vnworks -o data_output/vnworks.json`
- For TopCV: `scrapy crawl topcv -o data_output/topcv.json`
- For LinkedIn (using ScrapingDog): `scrapy crawl linkedin -o data_output/linkedin.json`
Note: The XPath expressions used to extract elements rely on the HTML class names, which are subject to change. Ensure you update these expressions in the spiders for accurate scraping.
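As an illustration of what these expressions look like, a parse method in a spider follows roughly the shape below. The class names and field names here are hypothetical, not the ones the spiders actually use:

```python
def parse_job(self, response):
    # Hypothetical class names for illustration only -- sites rename these
    # whenever their markup changes, so keep the real spiders' selectors in sync.
    yield {
        "title": response.xpath('//h1[@class="job-title"]/text()').get(),
        "company": response.xpath('//a[@class="company-name"]/text()').get(),
        "location": response.css("span.job-location::text").get(),
    }
```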
The `pipelines.py` file handles data insertion into the SQL Server database using pyodbc. Ensure the database connection details are correctly configured in `pipelines.py`.
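For orientation, a pyodbc-backed Scrapy pipeline typically follows the pattern below. This is a minimal sketch, not the project's actual implementation; the connection string, table name, and columns are placeholders:

```python
import pyodbc


class DataJobVnAnalyzePipeline:
    def open_spider(self, spider):
        # Placeholder connection string -- configure the real one in pipelines.py.
        self.conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=job_data;UID=sa;PWD=<your_password>"
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # "jobs" and its columns are a hypothetical schema for illustration.
        self.cursor.execute(
            "INSERT INTO jobs (title, company, url) VALUES (?, ?, ?)",
            item.get("title"), item.get("company"), item.get("url"),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```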
- Dynamic Elements: Many websites (e.g., VietnamWorks, TopCV) use dynamic content. If elements are not found during scraping, update the XPath or CSS selectors in the spiders.
- API Key for LinkedIn: Replace the placeholder API key in the LinkedIn spider with your ScrapingDog API key (see the sketch after this list).
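To show where the key goes, the sketch below routes a request through ScrapingDog's generic scrape endpoint, which takes the API key and a URL-encoded target as query parameters (confirm the exact parameters against ScrapingDog's current docs). The spider name, target URL, and selectors are illustrative, not the project's actual `linkedin` spider:

```python
from urllib.parse import quote_plus

import scrapy


class LinkedinExampleSpider(scrapy.Spider):
    # Illustrative only -- the real spider in spiders/ is run as `scrapy crawl linkedin`.
    name = "linkedin_example"
    api_key = "YOUR_SCRAPINGDOG_API_KEY"  # replace the placeholder with your key

    def start_requests(self):
        # Hypothetical search URL; the real spider defines its own targets.
        target = "https://www.linkedin.com/jobs/search/?keywords=data+analyst"
        proxied = (
            "https://api.scrapingdog.com/scrape"
            f"?api_key={self.api_key}&url={quote_plus(target)}"
        )
        yield scrapy.Request(proxied, callback=self.parse)

    def parse(self, response):
        # Placeholder selectors -- adjust them to the HTML ScrapingDog returns.
        for card in response.css("div.base-card"):
            yield {
                "title": card.css("h3::text").get(),
                "company": card.css("h4 a::text").get(),
            }
```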
- `spiders/`: Contains individual spiders for VietnamWorks, TopCV, and LinkedIn.
- `pipelines.py`: Handles data processing and storage in SQL Server.
- `settings.py`: Contains project settings. Ensure `ITEM_PIPELINES` is enabled for data processing.
- `requirements.txt`: List of dependencies.
- Extend support for more job platforms.
For a detailed explanation of how to use the crawled data for analysis, check out my blog post here.
Feel free to raise issues or contribute to this project. Your suggestions are always welcome!