This is a personal automation agent that lets you control a browser using natural language. You tell it what to do in plain English, and it figures out how to do that by parsing your request into a plan, executing it step-by-step in a real browser, and reporting back the results.
No special commands. Just type something like:
Search for "machine learning" on Wikipedia and take a screenshot
And it will:
- Open Wikipedia
- Fill in the search box
- Press Enter
- Scroll and wait if needed
- Capture the page
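Behind the scenes, the request is parsed into a small action plan before anything runs. As a rough sketch (the field names and selectors here are illustrative, not the agent's actual schema), the plan for the command above might look something like:

```python
# Hypothetical action plan for the Wikipedia command above.
# Action names like "goto" and "fill" match the agent's vocabulary;
# the exact fields and selectors are guesses for illustration.
plan = [
    {"action": "goto", "url": "https://www.wikipedia.org"},
    {"action": "fill", "selector": "input[name='search']", "value": "machine learning"},
    {"action": "press", "selector": "input[name='search']", "key": "Enter"},
    {"action": "screenshot", "path": "results.png"},
]
```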
Watch the full product demo here:
Step-by-step install guide: Installation on Loom
The app is made of two parts: a Python backend and a React frontend.

Backend:
- Built with FastAPI and Playwright
- Uses Ollama + Mistral to turn natural language into action plans (like fill, click, goto)
- Keeps a live browser session open using Chromium
- Parses the DOM into structured elements
- Can solve captchas (like Amazon image captchas) using Tesseract OCR
- Every action is executed one by one with clear logs returned
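To make the execution model concrete, here is a minimal sketch of how plan steps could be dispatched to Playwright one by one with logging. This is illustrative only: the real agent keeps a persistent Chromium session and has its own plan schema, so the function and field names below are assumptions.

```python
from playwright.sync_api import sync_playwright

def run_plan(plan):
    """Execute a list of action dicts step by step and collect log lines.

    Simplified sketch: the real backend reuses a live browser session
    instead of launching a fresh one per request.
    """
    logs = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        for step in plan:
            action = step["action"]
            if action == "goto":
                page.goto(step["url"])
            elif action == "fill":
                page.fill(step["selector"], step["value"])
            elif action == "click":
                page.click(step["selector"])
            elif action == "screenshot":
                page.screenshot(path=step["path"])
            else:
                logs.append(f"unknown action skipped: {action}")
                continue
            logs.append(f"done: {action}")
        browser.close()
    return logs
```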
Frontend:
- Built with React + TailwindCSS
- Simple text box to enter commands
- Scrollable log of everything the agent is doing
- Handles errors, failed clicks, selectors not found, etc.
This assumes you're running on macOS with Python 3.10+ and Node.js 18+ installed.
git clone https://github.com/aarshitaacharya/browser-agent.git
cd browser-agent

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install

Install Tesseract if you want CAPTCHA solving:

brew install tesseract

Start the backend:

PYTHONPATH=$(pwd) python -m uvicorn main:app --reload

In another terminal, start the frontend:

cd frontend
npm install
npm start

Make sure the frontend runs on port 3000.
Once both frontend and backend are running:
- Open http://localhost:3000
- Type a command like:
Log in to saucedemo.com with username standard_user and password secret_sauce
- Watch the steps show up in the log box as the agent acts
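You can also drive the backend directly without the UI. Assuming uvicorn is on its default port 8000 and the /interact route accepts a JSON body with the command (the payload shape below is a guess; check the request model in api/ for the real one):

```python
import requests

# Hypothetical payload; the actual request model lives in api/.
resp = requests.post(
    "http://localhost:8000/interact",
    json={"command": "Take a screenshot of the current page"},
)
print(resp.json())  # step-by-step logs returned by the agent
```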
You can:
- Take screenshots
- Solve captchas (Amazon image-based ones; see the OCR sketch after this list)
- Scroll, click, fill forms
- Use ordinal commands like "click the second product"
- Click links, images, buttons, etc.
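For the CAPTCHA path, the idea is to screenshot the captcha image and run it through Tesseract. A minimal sketch, assuming a Playwright page and a guessed selector for Amazon's captcha image (the real code may preprocess the image and use a different selector):

```python
from io import BytesIO

import pytesseract
from PIL import Image

def read_captcha(page):
    """Screenshot the captcha <img> and OCR the characters with Tesseract."""
    # Selector is a guess; adjust it to whatever the DOM parser finds.
    captcha = page.locator("img[src*='captcha']")
    image = Image.open(BytesIO(captcha.screenshot()))
    return pytesseract.image_to_string(image).strip()
```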
browser-agent/
├── actions/ # All atomic browser actions (click, fill, scroll...)
├── agents/ # Command parser + execution controller
├── api/ # FastAPI routes (like /interact, /extract)
├── browser_session.py # Browser session manager
├── frontend/ # React frontend (UI)
├── utils/ # Logger and helper functions
├── main.py # FastAPI app entry point
├── requirements.txt # Python dependencies
└── build.sh # Dev script to launch frontend + backend
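To show how the pieces in api/ and agents/ fit together, here is a minimal sketch of what an /interact route could look like. The helper functions and request model are placeholders, not the project's actual code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Command(BaseModel):
    command: str

def parse_command(text: str) -> list[dict]:
    # Placeholder for the LLM-backed parser (Ollama + Mistral) in agents/.
    return [{"action": "goto", "url": "https://example.com"}]

def execute_plan(plan: list[dict]) -> list[str]:
    # Placeholder for the Playwright executor in actions/.
    return [f"executed: {step['action']}" for step in plan]

@app.post("/interact")
def interact(cmd: Command):
    plan = parse_command(cmd.command)
    logs = execute_plan(plan)
    return {"logs": logs}
```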
This app is fun because it's not just a chatbot. It actually does stuff. It clicks. It scrolls. It fills. It sees a CAPTCHA and tries to solve it. It's a little browser assistant that doesn't need hand-holding.
The coolest part? You can teach it new behaviors by just improving the LLM prompt or extracting more structure from the page.
And yeah, it's still a work in progress. But it's real. And it works.

Planned next:
- Better CAPTCHA solving
- Memory of previous pages
- Keyboard shortcuts
- Screenshot gallery