- Set up the conda environment according to `requirement.txt`. Install any missing packages as needed.
- Download Chrome and Chrome Driver for your platform from the official website. Chrome is typically added to your system path during installation, but note the Chrome Driver path.
- Run: `python url_2_data_process.py --driver_path {chrome driver path} --url_list_path {url path} --root_dir {result path}`. See the file for additional parameter options.
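
If you prefer to launch the run from Python instead of typing the command by hand, the steps above can be sketched roughly as follows. The three flags come from the command shown above; the helper names `build_url2data_command` and `missing_inputs` are hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List


def build_url2data_command(driver_path: str, url_list_path: str, root_dir: str) -> List[str]:
    """Assemble the argument list for url_2_data_process.py (hypothetical helper)."""
    return [
        "python",
        "url_2_data_process.py",
        "--driver_path", driver_path,
        "--url_list_path", url_list_path,
        "--root_dir", root_dir,
    ]


def missing_inputs(driver_path: str, url_list_path: str) -> List[str]:
    """Return the input paths that do not exist on disk, in the order given."""
    return [p for p in (driver_path, url_list_path) if not Path(p).exists()]
```

A typical use would be to call `missing_inputs(...)` first and, if it returns an empty list, pass the result of `build_url2data_command(...)` to `subprocess.run(cmd, check=True)` from the repository root.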
Docker setup is not recommended at present. If you must use Docker (for example, because Chrome Driver cannot be installed on the host), follow these steps:
- Obtain a Chrome Driver image either from the official website or request one from the administrator.
- Set up the conda environment according to `requirement.txt` and install any missing packages.
- Activate the environment before running `bash container/run_url2data_docker.sh`. You can modify parameters in the script file as needed. Pay special attention to the network environment within Docker: if external network access is unavailable, you'll need to configure a proxy.
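
To find out whether the container actually has external network access before a long run, a small check like the one below may help. This is a minimal sketch using only the Python standard library; the function names `can_reach` and `proxy_variables` are hypothetical, and the list of proxy variable names is the common convention rather than something this repository defines.

```python
import socket
from typing import Dict, Mapping


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


def proxy_variables(env: Mapping[str, str]) -> Dict[str, str]:
    """Collect the conventional proxy variables that are set and non-empty."""
    names = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")
    return {name: env[name] for name in names if env.get(name)}
```

Inside the container, calling `can_reach` with a host and port you expect to be reachable, together with `proxy_variables(os.environ)`, gives a quick indication of whether a proxy still needs to be configured.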
- Clone this repository and ensure Docker is installed in your environment.
- If your network doesn't require a proxy, use the `script/run_navi_collection_wo_proxy.sh` file. If a proxy is needed, add your proxy settings to `script/run_navi_collection_with_proxy.sh` before using it.
- Modify the `root_dir` property in the .sh startup file to the root directory location where you want to save trajectory data.
- Run the bash script to start Gradio.
- Access the URL provided by Gradio to begin collection. Refer to the "Usage Methods" section within Gradio for detailed collection specifications.
`python web_semi_data_collection_async_gradio.py --root_dir {your_root_dir}`

Note: If you want to collect WebArena data without logging in (assuming your host can connect to node 49), set the environment variable `USE_STORAGE=1`.
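
As an illustration of how the command and the `USE_STORAGE` variable fit together, the sketch below prepares both for a launch from Python. The script name, the `--root_dir` flag, and the value `1` are taken from the text above; the helper `collection_launch_spec` is hypothetical, and how the script interprets `USE_STORAGE` internally is not shown here.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def collection_launch_spec(
    root_dir: str,
    use_storage: bool = False,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the command and environment for starting the Gradio collector."""
    # Copy the environment so the caller's process environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    if use_storage:
        # Assumption from the note above: USE_STORAGE=1 enables collection without login.
        env["USE_STORAGE"] = "1"
    cmd = [
        "python",
        "web_semi_data_collection_async_gradio.py",
        "--root_dir", root_dir,
    ]
    return cmd, env
```

The returned pair can be handed to `subprocess.run(cmd, env=env, check=True)`. In a shell, prefixing the command with `USE_STORAGE=1` has the same effect.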