🤖 General Web Crawler

Web crawler script for the project "ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data".

Getting Started with the Random Walk Web Crawler

Using a Conda Environment

  1. Set up the conda environment according to requirement.txt, installing any missing packages as needed.
  2. Download Chrome and Chrome Driver for your platform from the official website. Chrome is typically added to your system PATH during installation, but note the Chrome Driver path.
  3. Run: python url_2_data_process.py --driver_path {chrome driver path} --url_list_path {url path} --root_dir {result path}. See the file for additional parameter options. A full example follows this list.
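
A minimal end-to-end sketch is below. The environment name, Python version, and all paths are illustrative placeholders rather than values prescribed by this repository, and it assumes requirement.txt is pip-compatible:

    # Create the environment and install dependencies (name/version are assumptions)
    conda create -n scalecua python=3.10
    conda activate scalecua
    pip install -r requirement.txt

    # Launch the random-walk crawler (all paths are placeholders)
    python url_2_data_process.py \
        --driver_path /usr/local/bin/chromedriver \
        --url_list_path ./urls.txt \
        --root_dir ./crawl_results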

Using a Docker Environment

Docker setup is not currently recommended. If you must use Docker (for example, if the host cannot install Chrome Driver), follow these steps:

  1. Obtain a Chrome Driver image either from the official website or request one from the administrator.
  2. Set up the conda environment according to requirement.txt and install any missing packages.
  3. Activate the environment before running bash container/run_url2data_docker.sh. You can modify parameters in the script file as needed. Pay special attention to the network environment inside Docker: if external network access is unavailable, you will need to configure a proxy, as shown below.
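
If external network access is blocked inside Docker, one common approach is to export proxy variables before launching the script. The proxy address here is a placeholder, and whether the script forwards these variables into the container depends on its contents; you may instead need to edit the proxy settings inside the script itself:

    # Hypothetical proxy endpoint; replace with your own
    export HTTP_PROXY=http://proxy.example.com:8080
    export HTTPS_PROXY=http://proxy.example.com:8080
    bash container/run_url2data_docker.sh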

Getting Started with the Semi-Automated Web Crawler

Local Deployment (Using Docker)

  1. Clone this repository and ensure Docker is installed in your environment.
  2. If your network doesn't require a proxy, use script/run_navi_collection_wo_proxy.sh. If a proxy is needed, add your proxy settings to script/run_navi_collection_with_proxy.sh before using it; a sketch follows this list.
  3. Modify the root_dir property in the .sh startup file to point to the root directory where you want to save trajectory data.
  4. Run the bash script to start Gradio.
  5. Access the URL provided by Gradio to begin collection. Refer to the "Usage Methods" section within Gradio for detailed collection instructions.
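
A sketch of steps 2-4 for the proxy variant is shown here. The variable names and values are illustrative assumptions; check the actual script for the names it expects:

    # Edit script/run_navi_collection_with_proxy.sh first, e.g.:
    #   root_dir=/data/trajectories              (where trajectory data is saved)
    #   HTTP_PROXY=http://proxy.example.com:8080 (your proxy, if required)
    bash script/run_navi_collection_with_proxy.sh
    # Gradio then prints a local URL (for example http://127.0.0.1:7860); open it to begin collection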

Local Deployment (Without Docker)

Run: python web_semi_data_collection_async_gradio.py --root_dir {your_root_dir}

Note: To collect WebArena data without logging in (assuming your host can connect to node 49), set the environment variable USE_STORAGE=1, as in the example below.
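
For example, a WebArena collection run without login might look like this (the root directory is a placeholder):

    USE_STORAGE=1 python web_semi_data_collection_async_gradio.py --root_dir ./trajectories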
