🤖 General Web Crawler

Web crawler script for the project "ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data".

Getting Started with the Random Walk Web Crawler

Using a Conda Environment

  1. Set up the conda environment according to requirement.txt, installing any missing packages as needed.
  2. Download Chrome and Chrome Driver for your platform from the official website. Chrome is typically added to your system PATH during installation, but note the Chrome Driver path.
  3. Run: python url_2_data_process.py --driver_path {chrome driver path} --url_list_path {url path} --root_dir {result path}. See the file for additional parameter options. A full example follows this list.
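
A minimal end-to-end sketch is below. The environment name, Python version, and all paths are illustrative placeholders rather than values prescribed by this repository, and it assumes requirement.txt is pip-compatible:

    # Create the environment and install dependencies (name/version are assumptions)
    conda create -n scalecua python=3.10
    conda activate scalecua
    pip install -r requirement.txt

    # Launch the random-walk crawler (all paths are placeholders)
    python url_2_data_process.py \
        --driver_path /usr/local/bin/chromedriver \
        --url_list_path ./urls.txt \
        --root_dir ./crawl_results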

Using a Docker Environment

Docker setup is not currently recommended. If you must use Docker (for example, if the host cannot install Chrome Driver), follow these steps:

  1. Obtain a Chrome Driver image either from the official website or request one from the administrator.
  2. Set up the conda environment according to requirement.txt and install any missing packages.
  3. Activate the environment before running bash container/run_url2data_docker.sh. You can modify parameters in the script file as needed. Pay special attention to the network environment inside Docker: if external network access is unavailable, you will need to configure a proxy, as shown below.
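
If external network access is blocked inside Docker, one common approach is to export proxy variables before launching the script. The proxy address here is a placeholder, and whether the script forwards these variables into the container depends on its contents; you may instead need to edit the proxy settings inside the script itself:

    # Hypothetical proxy endpoint; replace with your own
    export HTTP_PROXY=http://proxy.example.com:8080
    export HTTPS_PROXY=http://proxy.example.com:8080
    bash container/run_url2data_docker.sh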

Getting Started with the Semi-Automated Web Crawler

Local Deployment (Using Docker)

  1. Clone this repository and ensure Docker is installed in your environment.
  2. If your network doesn't require a proxy, use script/run_navi_collection_wo_proxy.sh. If a proxy is needed, add your proxy settings to script/run_navi_collection_with_proxy.sh before using it; a sketch follows this list.
  3. Modify the root_dir property in the .sh startup file to point to the root directory where you want to save trajectory data.
  4. Run the bash script to start Gradio.
  5. Access the URL provided by Gradio to begin collection. Refer to the "Usage Methods" section within Gradio for detailed collection instructions.
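
A sketch of steps 2-4 for the proxy variant is shown here. The variable names and values are illustrative assumptions; check the actual script for the names it expects:

    # Edit script/run_navi_collection_with_proxy.sh first, e.g.:
    #   root_dir=/data/trajectories              (where trajectory data is saved)
    #   HTTP_PROXY=http://proxy.example.com:8080 (your proxy, if required)
    bash script/run_navi_collection_with_proxy.sh
    # Gradio then prints a local URL (for example http://127.0.0.1:7860); open it to begin collection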

Local Deployment (Without Docker)

Run: python web_semi_data_collection_async_gradio.py --root_dir {your_root_dir}

Note: To collect WebArena data without logging in (assuming your host can connect to node 49), set the environment variable USE_STORAGE=1, as in the example below.
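
For example, a WebArena collection run without login might look like this (the root directory is a placeholder):

    USE_STORAGE=1 python web_semi_data_collection_async_gradio.py --root_dir ./trajectories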
