
Codebase Audits


An agent must build a working knowledge of a given codebase.

The knowledge will change over time.

We will call the one-time bulk collection of knowledge about a codebase an Audit.

Before an agent takes any action in a codebase it must perform an Audit.

Audits will adapt the concept of reflections from the Generative Agents paper to derive higher-level insights from observations.

For example, an agent may observe:

  • A file/folder hierarchy including potentially all file contents
  • A history of GitHub commits
  • A list of outstanding or solved GitHub issues & pull requests
  • (Eventually) out-of-repo knowledge like external documentation

Any and all such knowledge may be relevant to a given action taken by an agent, but most of it will not fit in the context window of a given LLM call.

To retrieve relevant material to feed into an LLM call's context, we will vector-embed all of the above, as well as the reflections upon those observations.
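As a rough illustration of that retrieval-by-embedding idea, here is a minimal Python sketch. The `embed` function is a toy word-count stand-in for whatever real embedding model we end up calling, the in-memory list stands in for a proper vector store, and the memory texts are illustrative only.

```python
from collections import Counter
import math

def embed(text: str) -> dict[str, float]:
    """Toy embedding: a word-count vector. Stand-in for a real embedding model."""
    return dict(Counter(text.lower().split()))

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Observations and reflections alike go into the same index.
memories = [
    "Ren opened a pull request refactoring the Nostr relay subscription code",
    "Erik commented that CI unit tests are failing on the feature branch",
    "Reflection: the team is struggling with flaky relay reconnection logic",
]
index = [(text, embed(text)) for text in memories]

query = "which files handle Nostr relay subscriptions?"
query_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
print(ranked[0][0])  # the most relevant memory to place in the LLM call's context
```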

Reflections may include statements describing a code author's inferred intent.

For example, it may become clear strictly from GitHub commit messages (say, "try to fix this fucking thing" and "try it another way"), paired with the diffs for those commits, what a user is struggling with and what would be a much-appreciated fix.

Zooming out: the cadence of features a user adds to a codebase, paired with knowledge about the overall intent of the project derived from, say, a README, may enable an agent to suggest highly specific & desired advancements in that codebase.

First we need to generate those higher-level reflections.

Let's revisit the Generative Agents paper (edited slightly to reflect our autodev use case instead of the original RPG example). First for memory and retrieval:

4.1 Memory and Retrieval

Challenge: Creating generative agents that can simulate human behavior requires reasoning about a set of experiences that is far larger than what should be described in a prompt, as the full memory stream can distract the model and does not even currently fit into the limited context window. Consider an autodev agent answering the question, "What are the most important lessons you learned?" First summarizing all of the agent's experiences to fit in the limited context window of the language model produces an uninformative response, where the agent discusses topics such as observations on particular coding files or its review of a human coder's work. Instead of summarizing, the memory stream described below surfaces relevant memories, resulting in a more informative and specific response that mentions the agent's lessons learned based on higher-level reflections.

Approach: The memory stream maintains a comprehensive record of the agent's experience. It is a list of memory objects, where each object contains a natural language description, a creation timestamp and a most recent access timestamp. The most basic element of the memory stream is an observation, which is an event directly perceived by an agent. Common observations include behaviors performed by the agent themselves, or behaviors that agents perceive being performed by other agents or non-agent objects or people. For instance, Faerie, who is the lead AI developer of OpenAgents, might accrue the following observations over time: 1) Ren submitted a GitHub pull request to fix bugs and introduce a new feature; 2) Erik commented on Ren's pull request; 3) the PR's automated CI/CD unit tests failed; 4) Erik and Ren discussed the PR in Discord; 5) one of the issues under discussion had a bug fix listed on StackOverflow; 6) that same issue was discussed on Twitter.
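A memory object might be as simple as the following Python sketch. The `kind` and `importance` fields are our own assumptions (importance is discussed below), not part of the minimal definition above.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryObject:
    description: str            # natural language description of the event
    created_at: datetime        # creation timestamp
    last_accessed: datetime     # most recent access timestamp
    kind: str = "observation"   # "observation" or "reflection" (our addition)
    importance: int = 1         # 1-10, scored by the LLM (see Importance below)

memory_stream: list[MemoryObject] = []

def observe(description: str) -> MemoryObject:
    """Append a directly perceived event to the memory stream."""
    now = datetime.now()
    mem = MemoryObject(description, created_at=now, last_accessed=now)
    memory_stream.append(mem)
    return mem

observe("Ren submitted a GitHub pull request to fix bugs and introduce a new feature")
observe("The PR's automated CI/CD unit tests failed")
```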

Our architecture implements a retrieval function that takes the agent's current situation as input and returns a subset of the memory stream to pass on to the language model. There are many possible implementations of a retrieval function, depending on what it is important that the agent consider when deciding how to act. In our context, we focus on three main components that together produce effective results.

Recency assigns a higher score to memory objects that were recently accessed, so that events from a moment ago or this morning are likely to remain in the agent's attentional sphere. In our implementation, we treat recency as an exponential decay function over the number of hours since the memory was last retrieved. Our decay factor is 0.99.

Importance distinguishes mundane from core memories, by assigning a higher score to those memory objects that the agent believes to be important. For instance, a mundane event such as observing an "Okay" message in Slack from one developer to another would yield a low importance score, whereas discovery of a catastrophic bug related to an issue the agent was tasked to fix would yield a high score. There are again many possible implementations of an importance score; we find that directly asking the language model to output an integer score is effective.
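A hedged sketch of that importance scoring, assuming a hypothetical `call_llm(prompt)` helper that returns the model's raw text reply; the 1-10 scale and the anchoring examples are taken from the description above.

```python
IMPORTANCE_PROMPT = """On a scale of 1 to 10, where 1 is purely mundane
(e.g. an "Okay" message in Slack from one developer to another) and 10 is
extremely important (e.g. discovery of a catastrophic bug related to an issue
the agent was tasked to fix), rate the likely importance of this memory.
Memory: {description}
Rating (integer only):"""

def score_importance(description: str, call_llm) -> int:
    reply = call_llm(IMPORTANCE_PROMPT.format(description=description))
    try:
        return max(1, min(10, int(reply.strip())))
    except ValueError:
        return 1  # treat an unparseable reply as mundane rather than failing
```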

Relevance assigns a higher score to memory objects that are related to the current situation. What is relevant depends on the answer to, "Relevant to what?", so we condition relevance on a query memory. If the query, for example, is that an agent is deciding what files to inspect to address a bug with Nostr relay subscriptions, memory objects about conversation between human developers should have low relevance, whereas memory objects about how Nostr relays work and which files in our codebase relate to that should have high relevance. In our implementation, we use the language model to generate an embedding vector of the text description of each memory. Then, we calculate relevance as the cosine similarity between the memory's embedding vector and the query memory's embedding vector.

To calculate the final retrieval score, we normalize the recency, relevance, and importance scores to the range of [0, 1] by min-max scaling. The retrieval function scores all memories as a weighted combination of the three elements. The top-ranked memories that fit in the language model's context window are then included in the prompt.
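Putting the three components together, a minimal scoring sketch might look like the following. It reuses `MemoryObject`, `embed`, and `cosine` from the earlier snippets; the equal weights and the top-k cutoff are assumptions, since the passage above only specifies a weighted combination trimmed to fit the context window.

```python
from datetime import datetime

DECAY = 0.99  # per hour since the memory was last retrieved

def recency_score(mem: MemoryObject, now: datetime) -> float:
    hours = (now - mem.last_accessed).total_seconds() / 3600.0
    return DECAY ** hours

def minmax(values: list[float]) -> list[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # degenerate case: all scores equal
    return [(v - lo) / (hi - lo) for v in values]

def retrieve(memories: list[MemoryObject], query_vec, now: datetime, k: int = 10,
             w_recency: float = 1.0, w_importance: float = 1.0, w_relevance: float = 1.0):
    rec = minmax([recency_score(m, now) for m in memories])
    imp = minmax([float(m.importance) for m in memories])
    rel = minmax([cosine(query_vec, embed(m.description)) for m in memories])
    scored = sorted(
        zip(memories, rec, imp, rel),
        key=lambda t: w_recency * t[1] + w_importance * t[2] + w_relevance * t[3],
        reverse=True,
    )
    top = [m for m, *_ in scored[:k]]
    for m in top:
        m.last_accessed = now   # retrieval counts as an access
    return top                  # trim further until it fits the context window
```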

The important part for us, building on memory and retrieval, is reflection:

4.2 Reflection

Challenge: Generative agents, when equipped with only raw observational memory, struggle to generalize or make inferences. Consider a scenario in which Klaus Mueller is asked by the user: “If you had to choose one person of those you know to spend an hour with, who would it be?" With access to only observational memory, the agent simply chooses the person with whom Klaus has had the most frequent interactions: Wolfgang, his college dorm neighbor. Unfortunately, Wolfgang and Klaus only ever see each other in passing, and do not have deep interactions. A more desirable response requires that the agent generalize from memories of Klaus spending hours on a research project to generate a higher-level reflection that Klaus is passionate about research, and likewise recognize Maria putting in effort into her own research (albeit in a different field), enabling a reflection that they share a common interest. With the approach below, when Klaus is asked who to spend time with, Klaus chooses Maria instead of Wolfgang.

Approach: We introduce a second type of memory, which we call a reflection. Reflections are higher-level, more abstract thoughts generated by the agent. Because they are a type of memory, they are included alongside other observations when retrieval occurs. Reflections are generated periodically; in our implementation, we generate reflections when the sum of the importance scores for the latest events perceived by the agents exceeds a threshold (150 in our implementation). In practice, our agents reflected roughly two or three times a day. The first step in reflection is for the agent to determine what to reflect on, by identifying questions that can be asked given the agent’s recent experiences. We query the large language model with the 100 most recent records in the agent’s memory stream (e.g., “Klaus Mueller is reading a book on gentrification”, “Klaus Mueller is conversing with a librarian about his research project”, “desk at the library is currently unoccupied”) and prompt the language model, “Given only the information above, what are 3 most salient high-level questions we can answer about the subjects in the statements?” The model’s response generates candidate questions: for example, What topic is Klaus Mueller passionate about? and What is the relationship between Klaus Mueller and Maria Lopez? We use these generated questions as queries for retrieval, and gather relevant memories (including other reflections) for each question. Then we prompt the language model to extract insights and cite the particular records that served as evidence for the insights. The full prompt is as follows:

Statements about Klaus Mueller

  1. Klaus Mueller is writing a research paper
  2. Klaus Mueller enjoys reading a book on gentrification
  3. Klaus Mueller is conversing with Ayesha Khan about exercising [...]

What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))

This process generates statements such as Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15). We parse and store the statement as a reflection in the memory stream, including pointers to the memory objects that were cited. Reflection explicitly allows the agents to reflect not only on their observations but also on other reflections: for example, the second statement about Klaus Mueller above is a reflection that Klaus previously had, not an observation from his environment. As a result, agents generate trees of reflections: the leaf nodes of the tree represent the base observations, and the non-leaf nodes represent thoughts that become more abstract and higher-level the higher up the tree they are.
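A hedged sketch of that reflection loop, again assuming a hypothetical `call_llm(prompt)` helper and reusing `observe`, `embed`, and `retrieve` from the earlier snippets. The threshold, record count, and prompts follow the description above; the line-by-line parsing and the omitted evidence pointers are simplifications.

```python
from datetime import datetime

REFLECTION_THRESHOLD = 150  # sum of importance scores that triggers a reflection pass

QUESTION_PROMPT = """{records}

Given only the information above, what are 3 most salient high-level questions
we can answer about the subjects in the statements?"""

INSIGHT_PROMPT = """Statements:
{numbered}

What 5 high-level insights can you infer from the above statements?
(example format: insight (because of 1, 5, 3))"""

def maybe_reflect(call_llm, now: datetime) -> list[MemoryObject]:
    recent = memory_stream[-100:]
    if sum(m.importance for m in recent) < REFLECTION_THRESHOLD:
        return []
    records = "\n".join(m.description for m in recent)
    questions = call_llm(QUESTION_PROMPT.format(records=records)).splitlines()
    reflections = []
    for question in filter(None, (q.strip() for q in questions)):
        evidence = retrieve(memory_stream, embed(question), now)
        numbered = "\n".join(f"{i + 1}. {m.description}" for i, m in enumerate(evidence))
        insights = call_llm(INSIGHT_PROMPT.format(numbered=numbered)).splitlines()
        for line in filter(None, (s.strip() for s in insights)):
            refl = observe(line)  # pointers to the cited evidence omitted for brevity
            refl.kind = "reflection"
            reflections.append(refl)
    return reflections
```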

Audit flow

Given a repo, we'll first pull the file/folder hierarchy of the root folder from the GitHub API.

Basic observed data will be logged as observations. Then we'll make reflections based on those observations. We can make further reflections by reading a given GitHub issue and assessing the relevance of any observed data to that issue. Then we can empower the agent to decide which folder or file to look at more deeply, gathering the information needed to solve the issue or complete a general audit.
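A minimal sketch of that first step, under the assumptions above: list the repo root via the GitHub contents API, log each entry as an observation, and hand off to the reflection loop. It reuses `observe` from the earlier snippets; authentication, pagination, and error handling are left out, and the follow-on "decide what to inspect next" step is only noted in a comment.

```python
import requests

def audit_root(owner: str, repo: str, token: str | None = None) -> None:
    """Log the repo's root file/folder hierarchy as observations."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/contents/",
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    for entry in resp.json():   # one dict per file or directory at the root
        observe(f"{owner}/{repo} contains a {entry['type']} named {entry['path']}")
    # Next: run the reflection loop over these observations, then let the agent
    # choose which folder or file to open for a deeper look.
```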
