Deploying Agent-S in OSWorld

Step 1: Environment Setup

Assuming you've followed the guide in the README.md, your repository structure should look like:

parent/
  └── Agent-S/

The next step is to follow the setup instructions for OSWorld: https://github.com/xlang-ai/OSWorld.git.

To easily run Agent-S on OSWorld locally, we recommend moving your local OSWorld repository into the parent directory of Agent-S:

parent/
  ├── Agent-S/
  └── OSWorld/

We suggest creating a separate conda environment for each repository to avoid dependency conflicts.

Step 2: Modifying OSWorld run.py

After completing the setup instructions, import GraphSearchAgent into OSWorld's run.py file. GraphSearchAgent is the parent agent used in the Agent S2 framework.

from gui_agents.s2.agents.grounding import OSWorldACI
from gui_agents.s2.agents.agent_s import GraphSearchAgent

Replace the PromptAgent on line 138 in the test() method with GraphSearchAgent. Specify the engine parameters and instantiate the agent as shown:

parser.add_argument("--vm_version", type=str, default="new")

...

# Pick the engine type from the model name prefix.
if args.model.startswith("claude"):
    engine_type = "anthropic"
elif args.model.startswith("gpt"):
    engine_type = "openai"
else:
    engine_type = "vllm"

# Parameters for the main generation model.
engine_params = {
    "engine_type": engine_type,
    "model": args.model,
}

# Parameters for the grounding model; replace <endpoint_url> with your own endpoint.
engine_params_for_grounding = {
    "engine_type": "huggingface",
    "endpoint_url": "<endpoint_url>/v1/",
}

current_platform = "ubuntu"

grounding_agent = OSWorldACI(
    platform=current_platform,
    engine_params_for_generation=engine_params,
    engine_params_for_grounding=engine_params_for_grounding,
)

agent = GraphSearchAgent(
    engine_params,
    grounding_agent,
    platform=current_platform,
    action_space="pyautogui",
    observation_type="mixed",
    search_engine="Perplexica",
)

We support all multimodal models from OpenAI, Anthropic, and vLLM. For more information, refer to models.md.
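
The prefix-based engine selection shown above can also be factored into a small helper, which makes it easy to check which engine a given model name will resolve to. This is only a sketch: `select_engine_type` and `build_engine_params` are hypothetical names, not part of Agent-S or OSWorld, and the mapping simply mirrors the if/elif/else chain in the snippet.

```python
def select_engine_type(model_name: str) -> str:
    """Map a model name to an engine type using the same prefix rules as run.py."""
    if model_name.startswith("claude"):
        return "anthropic"
    if model_name.startswith("gpt"):
        return "openai"
    # Any other model name falls through to vLLM, as in the snippet above.
    return "vllm"


def build_engine_params(model_name: str) -> dict:
    """Build the engine_params dictionary that is passed to the agent (hypothetical helper)."""
    return {
        "engine_type": select_engine_type(model_name),
        "model": model_name,
    }
```

Note that the fallback is silent: a misspelled model name would be routed to the vLLM engine rather than raising an error, so it may be worth printing the resolved engine type before a long run.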

The latest Agent S2 is configured to use the latest Ubuntu VM image from OSWorld. Our experiments, however, were run on the older version of the VM. To reproduce our results, set the vm_version argument to 'old' when instantiating the agent.
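
As a minimal sketch of how the vm_version argument behaves, the standalone parser below reuses the same `add_argument` call as the run.py snippet. `build_parser` is a hypothetical wrapper used only for illustration, and how run.py consumes the parsed value afterwards is not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with only the vm_version argument (illustrative wrapper)."""
    parser = argparse.ArgumentParser()
    # Same argument definition as in the run.py snippet; "new" is the default.
    parser.add_argument("--vm_version", type=str, default="new")
    return parser


# An explicit argument list stands in for the real command line.
args = build_parser().parse_args(["--vm_version", "old"])
print(args.vm_version)  # prints "old"
```

Omitting the flag leaves `args.vm_version` at its default of "new".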

Step 3: Best Practices

At this point, you will have set up the Agent-S and OSWorld environments and the VMware Workstation Pro application. Below are some best practices, along with common problems and their fixes.


from desktop_env.desktop_env import DesktopEnv

example = {
    "id": "94d95f96-9699-4208-98ba-3c3119edf9c2",
    "instruction": "I want to install Spotify on my current system. Could you please help me?",
    "config": [
        {
            "type": "execute",
            "parameters": {
                "command": [
                    "python",
                    "-c",
                    "import pyautogui; import time; pyautogui.click(960, 540); time.sleep(0.5);"
                ]
            }
        }
    ],
    "evaluator": {
        "func": "check_include_exclude",
        "result": {
            "type": "vm_command_line",
            "command": "which spotify"
        },
        "expected": {
            "type": "rule",
            "rules": {
                "include": ["spotify"],
                "exclude": ["not found"]
            }
        }
    }
}

env = DesktopEnv(action_space="pyautogui")

obs = env.reset(task_config=example)
obs, reward, done, info = env.step("pyautogui.rightClick()")

The code above boots up a VM and restarts it. If running this starter code hangs indefinitely, cancel out of the VM. You should then see:

parent/
  Agent-S/
  OSWorld/
    vmware_vm_data/
      Ubuntu0/
        *.lck
        *.vmem
        ...
      ...
      UbuntuX/

If there are any *.lck folders in your VM's folder, be sure to delete them. Every time you power on the VM by creating a new DesktopEnv instance, you need to delete the *.lck folders first. If your VM is already powered on and your session (in a Jupyter Notebook, for example) crashes, you can keep the *.lck folders and just re-instantiate the DesktopEnv instance. We also suggest using just a single VM, as each VM takes up a lot of disk space.
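
The cleanup described above can be scripted. The helper below is a sketch, not part of OSWorld: `remove_vmware_locks` is a hypothetical name, and it assumes the *.lck entries sit directly inside the VM's folder (for example `OSWorld/vmware_vm_data/Ubuntu0`). Run it only while the VM is powered off.

```python
import shutil
from pathlib import Path
from typing import List


def remove_vmware_locks(vm_dir: str) -> List[str]:
    """Delete every *.lck entry directly inside vm_dir and return the removed names."""
    removed = []
    for entry in sorted(Path(vm_dir).glob("*.lck")):
        if entry.is_dir():
            shutil.rmtree(entry)  # lock folders and everything inside them
        else:
            entry.unlink()  # stray lock files
        removed.append(entry.name)
    return removed
```

Calling `remove_vmware_locks("OSWorld/vmware_vm_data/Ubuntu0")` before creating a new DesktopEnv instance returns the names of whatever was deleted, which is an easy way to confirm that stale locks were actually present.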


If rerunning the code after deleting the *.lck folders still doesn't work, try passing path_to_vm explicitly to the DesktopEnv class.

env = DesktopEnv(action_space="pyautogui", headless=False, require_terminal=True, path_to_vm=<absolute_path>)

Pass the absolute path to your VM's (Ubuntu0) .vmx file. This file is located here:

parent/
  Agent-S/
  OSWorld/
    vmware_vm_data/
      Ubuntu0/
        *.lck
        *.vmem
        ...
        *.vmx
      ...
      UbuntuX/
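
To avoid typing the absolute path by hand, it can be resolved from the VM's folder. This is a sketch under the assumption that the folder contains exactly one .vmx file; `find_vmx` is a hypothetical helper, not an OSWorld function.

```python
from pathlib import Path


def find_vmx(vm_dir: str) -> str:
    """Return the absolute path of the single .vmx file inside vm_dir."""
    # "*.vmx" matches only names ending in .vmx, so "*.vmx.lck" entries are ignored.
    matches = sorted(Path(vm_dir).glob("*.vmx"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one .vmx file in {vm_dir}, found {len(matches)}"
        )
    return str(matches[0].resolve())
```

The returned string can then be supplied as the path_to_vm argument, for example `path_to_vm=find_vmx("OSWorld/vmware_vm_data/Ubuntu0")`.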

📌 Note: If you are testing on the os domain, there is an issue with pyautogui. A hacky workaround is to locate where the pyautogui module is installed inside the VM, open the __init__.py file in the pyautogui folder, and remove the "<" from the set(...) in the following function:

def isShiftCharacter(character):
    """
    Returns True if the ``character`` is a keyboard key that would require the shift key to be held down, such as
    uppercase letters or the symbols on the keyboard's number row.
    """
    # NOTE TODO - This will be different for non-qwerty keyboards.
    return character.isupper() or character in set('~!@#$%^&*()_+{}|:"<>?')
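
For reference, this is roughly what the function looks like after the edit described above. It is shown here as a standalone copy so the effect can be checked without touching the installed package; the only change is that "<" has been dropped from the character set.

```python
def isShiftCharacter(character):
    """
    Patched variant: "<" is no longer reported as a character that requires
    the shift key. All other characters behave as in the original function.
    """
    # Same set as the original, minus "<".
    return character.isupper() or character in set('~!@#$%^&*()_+{}|:">?')
```

With this change, `isShiftCharacter("<")` returns False while `isShiftCharacter(">")` still returns True. Inside the VM, `python -c "import pyautogui; print(pyautogui.__file__)"` prints the location of the __init__.py file to edit.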

📌 Note: If your VM reports the error "The root file system on requires a manual fsck", reset the VM to the previous snapshot.

With these changes, you should be able to get up and running with VMware, DesktopEnv, and OSWorld! 😊