This is a short field note on how we used Codex while building Pozify, our small-model workout form coach. Pozify takes a short exercise video and turns it into a structured form-review report: pose analysis, exercise routing, rep counting, issue markers, annotated clips, and a grounded coach summary.
That kind of product has a lot of moving pieces. It is not only a UI. It is not only a model. It has data preparation, computer vision, small-model inference, deterministic rules, safety wording, training scripts, deployment constraints, and docs. Codex helped because it could move across those layers with the repo open, while still letting us stay in control of product direction.
Codex did not replace engineering judgment. It made the loop between idea, implementation, review, and documentation much tighter.
The biggest advantage was context. A normal chatbot can answer a question, but Codex can inspect the
actual project: app.py, src/pozify/, web/, scripts/, configs/, tests/, and docs/.
That matters because the best answer for Pozify is usually not the most generic answer. It has to fit
the current pipeline.
For example, if we want to improve push-up feedback, the right starting point is not "write a new fitness AI feature." The right starting point is:
- read the existing push-up analyzer
- read the shared rep counter and issue-marker helpers
- check the tests that define current behavior
- understand the JSON contracts used by the UI and coach summary
- then propose the smallest useful change
Codex is good at that kind of grounded work. It can keep the current codebase in view while it brainstorms, implements, reviews, and documents.
Early in the project, many ideas started rough:
- Can the app explain bad reps better?
- Should unsupported exercises be rejected or forced into the closest label?
- How should we show confidence without making medical claims?
- What is the smallest useful coach summary model we can ship?
Codex was useful because we could ask it to brainstorm inside the project constraints. A good prompt was not just "give me ideas." It was closer to:
Read the current Pozify pipeline and brainstorm three ways to improve form feedback. Keep the ideas
compatible with the existing analyzer structure and avoid medical claims.
The result was more useful than a blank-page brainstorm. Codex could separate product ideas from engineering tasks, identify likely files, and call out risk. That helped us avoid turning every idea into a large rewrite.
The best brainstorming output usually had this shape:
- what the user problem is
- what the smallest version could be
- which files would change
- what tests would prove it works
- what wording needs human review
That is why Codex is good for early-stage product work: it can turn messy intent into a concrete engineering path without pretending the path is risk-free.
Pozify touches several systems that change over time: Hugging Face Spaces, Modal, MediaPipe, Gradio, small-model inference, and model publishing. We used Codex for deep research when we needed to understand a tool before changing code.
The useful pattern was to ask Codex to separate facts from recommendations:
Research the current deployment constraints that matter for this Gradio app. Prefer official
sources. Summarize the facts, explain what they mean for Pozify, then recommend changes only if they
are justified.
That made research actionable. We did not want a long pile of links. We wanted to know what affected the project:
- Does this runtime support the dependency we need?
- Should this model run through hosted inference or local inference?
- Does this training job belong locally, in CI, or on Modal?
- Which setting affects GPU time, startup time, or reliability?
Codex was good here because it could connect research back to the repo. It could say, "this affects the provider code," or "this belongs in the Modal script," or "this should be documented in the training report."
We still treated research as research. Codex could make recommendations, but we converted those recommendations into scoped implementation tasks before changing the project.
Codex was also useful as a collaboration tool. When multiple people touch a fast-moving project, the hard part is often not writing code. The hard part is remembering why a change exists, what is still untested, and what another teammate needs to know.
We used Codex to create handoff notes like:
Summarize the current branch for another contributor. Include what changed, why it changed, how to
test it, and what still needs review.
That was especially helpful around training and deployment work. A branch might include a script change, a config update, a docs update, and a model artifact note. Codex could inspect the diff and turn it into a readable handoff.
The collaboration rule we kept was simple: Codex can help explain and organize work, but it should
not overwrite another member's changes. Before editing, we ask it to inspect git status --short and
read relevant diffs. That keeps the workflow respectful of everyone else's worktree.
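As a rough illustration of that pre-edit check, the sketch below parses the output of git status --short into (status, path) pairs, so a wrapper script can see which files already carry uncommitted changes before anything is edited. The function names and the sample output are hypothetical and are not part of the Pozify repo. Quoted paths (which git emits for unusual filenames) are not handled.

```python
import subprocess


def parse_short_status(output: str) -> list[tuple[str, str]]:
    """Parse `git status --short` text into (status, path) pairs."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Short format: two status characters, one space, then the path.
        status, path = line[:2], line[3:]
        # Renames are printed as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        entries.append((status, path))
    return entries


def dirty_paths(repo_dir: str = ".") -> list[str]:
    """Return paths with uncommitted changes in repo_dir (requires git)."""
    result = subprocess.run(
        ["git", "status", "--short"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return [path for _, path in parse_short_status(result.stdout)]


# Made-up sample output: one modified file, one untracked file, one rename.
sample = " M src/pozify/contracts.py\n?? notes.md\nR  old.py -> new.py\n"
print(parse_short_status(sample))
```

If dirty_paths() returns files that the planned change would also touch, that is the cue to read the relevant diffs first.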
One of the best uses of Codex was code review: not a replacement for human review, but a fast second pass before asking someone else to look.
The review prompt we used most often was direct:
Review the current diff. Focus on correctness, regressions, missing tests, grounding, and user
safety. Put findings first with file and line references.
That framing matters. We did not ask Codex to nitpick style. We asked it to look for things that could break the product:
- a pipeline contract changed but the UI still expects the old field
- a fallback path no longer works when a model provider fails
- a test fixture covers only the happy path
- a coach summary can say more than the structured evidence supports
- a deployment setting works locally but not on Hugging Face Spaces
Codex is good at review because it can inspect related files quickly. If a change touches
src/pozify/steps/coach_summary.py, it can also check the verifier, fallback summary, provider
tests, and docs. That is the kind of cross-file attention that catches practical regressions.
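To make the grounding concern concrete, here is a minimal sketch of the kind of check a reviewer might look for. It flags issue labels that a coach summary mentions but the structured evidence does not contain. The issue labels, phrases, and function name are all hypothetical; Pozify's actual verifier and contracts are not shown in this note and may work quite differently.

```python
# Hypothetical issue vocabulary: label -> phrase a summary might use.
ISSUE_PHRASES = {
    "hip_sag": "hips sag",
    "shallow_depth": "shallow depth",
    "elbow_flare": "elbows flare",
}


def ungrounded_issues(summary: str, evidence_issues: set[str]) -> list[str]:
    """Return issue labels the summary mentions but the evidence lacks."""
    text = summary.lower()
    mentioned = [label for label, phrase in ISSUE_PHRASES.items() if phrase in text]
    return [label for label in mentioned if label not in evidence_issues]


summary = "Your hips sag on reps 3 and 4, and your elbows flare near the end."
evidence = {"hip_sag"}  # only one issue was detected by the analyzer
print(ungrounded_issues(summary, evidence))  # ['elbow_flare']
```

Simple phrase matching like this is easy to fool, so it is best read as an example of the property under review rather than a robust verifier.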
For implementation, Codex was most helpful when the task was scoped. "Improve the app" is too broad. "Update the result view to show summary provider metadata and add a focused test" is a good Codex task.
For Python pipeline work, we ask Codex to follow existing project structure:
- pipeline steps live under src/pozify/steps/
- exercise logic lives under src/pozify/exercises/
- shared contracts live in src/pozify/contracts.py
- training and publishing workflows live in scripts/ and configs/
- behavior should be covered in tests/
For UI work, we ask it to inspect both the Gradio entrypoint and static assets:
- app.py
- web/index.html
- web/app.js
- web/report.js
- web/styles.css
The strongest Codex implementation loop looks like this:
- Read the relevant files.
- Explain the smallest safe change.
- Make the edit.
- Add or update focused tests.
- Run the relevant checks.
- Summarize what changed and what remains uncertain.
That loop is where Codex feels different from autocomplete. It is not only producing lines of code. It is helping maintain the whole change: code, tests, docs, and verification.
One thing that made Codex more effective was using plugins for tasks that needed more than plain code editing. The value was not "more tools for the sake of tools." The value was staying in one development flow while Codex used the right capability at the right time.
For Pozify, the most useful plugin pattern was UI verification. When we changed the app interface, Codex could edit the frontend code, start the local app, open it in a browser, inspect the result, and then come back to the code with a concrete fix. That is much better than only reading CSS and guessing whether the page looks right.
Plugins also helped with artifact-heavy work. Pozify has reports, model-card style docs, demo notes, and training writeups. When the output is a document, presentation, spreadsheet, screenshot, or PDF, it is useful for Codex to work with the artifact directly instead of treating everything like raw text.
The practical lesson was simple: use a plugin when the task has a real environment or artifact to inspect.
- For UI work, use browser inspection instead of trusting code alone.
- For docs and reports, use document-aware workflows when layout or structure matters.
- For product design work, use design-oriented workflows before jumping into implementation.
- For generated artifacts, ask Codex to render or verify the result when possible.
That made Codex feel less like a detached assistant and more like a teammate sitting inside the same workspace.
Skills were useful for a different reason. A plugin gives Codex a capability. A skill gives Codex a way of working.
In this project, we used skills as repeatable playbooks for work that needed a consistent standard. For example, documentation should not be a random dump of notes. It should have a clear audience, scope, and structure. UI work should not only "compile"; it should be checked for layout, responsive behavior, and product fit. Code review should start with bugs and regressions, not style opinions.
Skills helped encode those expectations. Instead of re-explaining the standard every time, we could ask Codex to use the relevant skill and then let it follow that workflow:
- documentation skills for clear project docs, reports, and handoff notes
- frontend/design skills for UI changes that need visual quality and responsive behavior
- code review behavior for focused review comments and missing-test analysis
- product or research skills when we needed to compare options before implementation
The important habit was to invoke the skill before the work starts. That makes Codex read the right instructions first, then inspect the project, then act. The result is more consistent than asking for a one-off answer each time.
For a fast project like Pozify, that consistency mattered. We were moving between model training, UI, docs, deployment, and tests. Skills helped keep the quality bar stable while the task type kept changing.
Pozify has repeated workflows: running fast tests, preparing data, training routers, training coach summaries, publishing artifacts, and keeping docs in sync. Codex helped us turn some of those manual steps into explicit scripts and checklists.
Automation is a good Codex task because the desired behavior can be made concrete:
Add a script that runs the fast Pozify validation checks before a PR. Reuse existing commands, avoid
network-dependent steps, and document how to run it.
The important part is that automation should be boring. It should log clearly, fail clearly, and avoid surprising side effects. For this project, anything involving credentials, model uploads, dataset publishing, or GPU spend still needs human approval.
Codex is good at automation because it can inspect how the project already runs. It can reuse
uv run pytest, uv run ruff check ., Modal scripts, existing config files, and docs instead of
inventing a parallel workflow.
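A pre-PR check script in that spirit could look roughly like the sketch below. It reuses the two commands named above, logs each one before running it, and reports every failure at the end. The names FAST_CHECKS, run_checks, and main are assumptions for illustration; the real script in the repo may be organized differently.

```python
import subprocess
import sys

# Commands named in this note; adjust to whatever the repo actually uses.
FAST_CHECKS = [
    ["uv", "run", "ruff", "check", "."],
    ["uv", "run", "pytest"],
]


def run_checks(commands):
    """Run every command in order and return (label, exit_code) pairs."""
    results = []
    for cmd in commands:
        label = " ".join(cmd)
        print(f"[check] {label}", flush=True)
        try:
            code = subprocess.run(cmd).returncode
        except FileNotFoundError:
            code = 127  # conventional "command not found" exit code
        results.append((label, code))
    return results


def main(commands=FAST_CHECKS):
    """Hypothetical entry point: return 1 if any check failed, else 0."""
    failures = [label for label, code in run_checks(commands) if code != 0]
    for label in failures:
        print(f"[failed] {label}", file=sys.stderr)
    return 1 if failures else 0
```

A script file would end with sys.exit(main()). Running every check before reporting, instead of stopping at the first failure, is a design choice: one run then shows the full list of problems.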
We also used Codex to decide what should not be automated. Some actions are too expensive or risky to run silently: uploading a model, publishing a dataset, spending GPU time, changing public demo behavior, or rewriting safety wording. For those, the better automation is a checklist or a command with an explicit approval step.
That split made automation more useful:
- automate local checks that are cheap and repeatable
- script data and training setup when the inputs and outputs are clear
- document manual approval points for publishing and public claims
- use reminders or handoff notes for follow-up work that should not block a coding session
The best automations were small. A good script saved a few minutes every time and made failure obvious. A good checklist prevented a risky release mistake. Codex helped build both.
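For the risky actions, a command with an explicit approval step can be as small as the sketch below. The helper names and the "upload-model" action are hypothetical. The idea is only that the operator must type the action name before anything runs, and that a declined action does nothing.

```python
def confirm(action: str, ask=input) -> bool:
    """Return True only if the operator types the action name exactly."""
    reply = ask(f"Type '{action}' to approve, anything else to cancel: ")
    return reply.strip() == action


def run_with_approval(action: str, fn, ask=input):
    """Run fn() only after explicit approval; otherwise do nothing."""
    if not confirm(action, ask):
        print(f"Skipped: {action} was not approved.")
        return None
    return fn()


# Demo with a scripted reply instead of a real interactive prompt.
result = run_with_approval("upload-model", lambda: "uploaded", ask=lambda _: "no")
print(result)  # None
```

Passing the prompt function in as ask keeps the gate testable without a terminal, while the default still blocks on real keyboard input.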
The most effective Codex workflow combined all three.
Plugins gave Codex access to the working surface. Skills gave it the right operating style. Automation made the repeated parts cheap.
For example, a UI change could flow like this:
- Use a frontend or product-design skill to frame the change.
- Ask Codex to inspect app.py and web/.
- Implement the smallest UI update.
- Use the browser plugin to open the local app and check the rendered result.
- Run focused tests or linting.
- Update docs or write a handoff note.
A training workflow looked different:
- Use Codex to research or review the training goal.
- Inspect scripts/, configs/, and the relevant training docs.
- Update the script or config in a scoped way.
- Automate only the safe local checks.
- Keep model upload, dataset publishing, and GPU-heavy runs behind human approval.
- Record metrics and artifact paths in the docs.
That is where Codex became especially effective. It was not one magic prompt. It was a repeatable system: choose the right playbook, use the right tool, automate the boring part, and keep human judgment on the decisions that matter.
For this project, Codex was good for eight practical reasons.
First, it works with the real repo. It can read the current files, not just guess from a description. That makes its suggestions more grounded.
Second, it moves across layers. Pozify needs Python, web UI, ML scripts, configs, tests, and docs. Codex can connect those pieces in one task.
Third, it is good at turning ambiguity into a plan. When an idea is vague, Codex can propose options, tradeoffs, affected files, and a smallest useful version.
Fourth, it is good at review. It can look at a diff and check related files faster than a human can manually scan the whole repo.
Fifth, it helps preserve momentum. Instead of stopping to remember the exact test command, docs location, or helper API, we can ask Codex to inspect and continue.
Sixth, it improves documentation while the context is still fresh. After implementing a change, Codex can update the relevant docs and write a handoff note before details are forgotten.
Seventh, plugins let it inspect real outputs. That is important for UI, documents, generated artifacts, and local app behavior.
Eighth, skills and automation make the workflow repeatable. The team does not have to rebuild the same process from memory each time.
The biggest lesson is that Codex works best when responsibility stays clear.
Humans still own:
- product direction
- user safety language
- fitness and health-related claims
- dataset choices and licensing judgment
- public model and dataset publishing
- final code review and merge approval
- whether a feature is actually useful to users
Codex helps us move faster, but it does not decide what Pozify should be. That distinction matters a lot for a product that gives workout feedback. The app should be grounded in evidence, and the development process should be grounded too.
The workflow we settled into is simple:
- Use Codex to inspect the repo and understand the current shape.
- Brainstorm or research with project constraints in view.
- Pick a small, human-approved direction.
- Ask Codex to implement the scoped change.
- Ask Codex to review the diff.
- Run tests, linting, local app checks, or browser checks.
- Update docs and write a handoff note.
That workflow made Codex valuable throughout the build. It helped us think, research, collaborate, review, implement, and automate without turning the project into a black box.
The short version: Codex is good because it compresses the distance between intent and verified change. For Pozify, that meant more time spent on product judgment and less time lost to mechanical work, context switching, and stale documentation.