[RFC][WIP] Common: Add an Initial Chat Memory Interface/Implementation #12698
base: master
Conversation
Force-pushed: 29a9f5f → 236241c
@ggerganov @ngxson If you would be willing, I'd like to hear any thoughts you have. I may dramatically change the backend memory implementation, but I want to make sure the way I'm interacting with main.cpp and server.cpp is reasonable.
Signed-off-by: Mark Nelson <[email protected]>
Force-pushed: 236241c → 02d643b
@markhpc I am not familiar with the "ChatGPT memories" feature and how it works. And after briefly looking at the implementation, I still don't know what it is (excuse me if it is something obvious). But I would go out on a limb and say that most likely this is something we don't want to implement in the …
Agree with @ggerganov. This feature is a cool UX but will be very difficult to maintain. I would categorize such features as "prompt engineering" and not actually an inference feature. Indeed, before ChatGPT even had the memory feature, I implemented this myself in my own private llama.cpp fork using both prompts and the llama_kv shifting API. It worked for a while, but it was very tricky and didn't work with all kinds of models. I think in the future, with the addition of MCP in the server web UI, this could be implemented in a more generic way. All the cool things people talk about, like MCP, agents, tool calling, and RAG, are just prompt engineering anyway; it's just a matter of how to organize the code.
@ggerganov @ngxson Thank you both for your quick feedback! FWIW, the goal here isn't to replicate ChatGPT's memory feature as a UX layer or only via prompting. My goal is to introduce an interface for interacting with inference directly at a deeper level. Right now that means providing access to structured, namespaced data storage (key/value in this case). The demo here is just a std::map, but it could easily be sqlite3, S3, or Ceph.

In the future I want to do more: I eventually want to enable mid-stream behavioral constraint. That's why I tried to keep the implementation (ChatMemorySimple) separated from the interface that enables it (which I should probably rename; it's really an inference hook). The long-term goal is to support external governance scaffolding: tools for hallucination recovery, telos tracking, violation logging, and long-term reasoning, in addition to storing user memories. I suspect that without these kinds of structures, persistent memory features will always be fragile unless reinforced through fine-tuning or runtime constraint. This is an attempt to prototype a real runtime cognition layer, not just simulate memory within the model's weights.

This is my first stab at moving some of this from model-level simulation into real code using real storage. If this is something you think might be interesting, I would love to figure out a lightweight way to tie into the inference loop. That's the key piece I believe I need, as I'm not sure I can do everything completely externally.
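To make the split described above concrete, here is a minimal sketch of the kind of interface/backend separation being proposed. The class names echo the PR, but the signatures are illustrative and not the actual diff:

```cpp
// Illustrative sketch only: a chat-memory interface that hides the storage
// backend from main/server. Names echo the PR but are not the actual code.
#include <map>
#include <optional>
#include <string>

// Callers see namespaced key/value storage and nothing else.
class ChatMemory {
public:
    virtual ~ChatMemory() = default;
    virtual void set(const std::string & ns, const std::string & key, const std::string & value) = 0;
    virtual std::optional<std::string> get(const std::string & ns, const std::string & key) const = 0;
};

// Demo backend: an in-memory std::map keyed by "namespace/key". The same
// interface could be backed by sqlite3, S3, or Ceph instead.
class ChatMemorySimple : public ChatMemory {
public:
    void set(const std::string & ns, const std::string & key, const std::string & value) override {
        store_[ns + "/" + key] = value;
    }
    std::optional<std::string> get(const std::string & ns, const std::string & key) const override {
        const auto it = store_.find(ns + "/" + key);
        if (it == store_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    std::map<std::string, std::string> store_;
};
```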
Aren't these just prompt engineering?
Which missing API calls from …
@ngxson Thank you! The basic idea is to eventually implement runtime analysis and feedback loops over inference: violation tracking, dynamic constraint enforcement, and telos-based reasoning adjustments. I.e., be able to shape future inference based on analysis of current and past inference rather than just injecting static prompts. That's why I started with the memory scaffolding in this PR. Longer term, though, I want this to be a route toward behavior auditing, hallucination reduction, and model drift control. My background is in storage and this is my first dive into the llama.cpp code, so I confess I'm still working to understand exactly what I need. I believe it might look something like this though:
Most of this is already there and being used in the ChatMemory interface in this PR. On reflection, it might be better to rename this to something like "InferenceHook". The core idea here is to see if this kind of inference-aware runtime behavior shaping could be an optional path forward. I 100% agree that it needs to be opt-in and lightweight, though. My hope is that this could allow a huge amount of flexibility for future developers: memory systems, constraint engines, even alignment logic, all without having to modify the core inference path. A rough sketch of what such a hook could look like follows below.
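For illustration only, one hypothetical shape an opt-in "InferenceHook" could take. None of these names exist in llama.cpp; this is just a sketch of the idea:

```cpp
// Purely hypothetical: an opt-in hook around the inference loop. None of these
// names exist in llama.cpp; the sketch only shows the shape of the idea.
#include <string>

struct InferenceHook {
    virtual ~InferenceHook() = default;

    // Called before the prompt is evaluated; may inject retrieved memories or
    // governance feedback from earlier turns into the prompt text.
    virtual std::string on_prompt(const std::string & session_id, const std::string & prompt) {
        (void) session_id;
        return prompt;
    }

    // Called once generation completes; may log the response, record
    // violations, and stash feedback to shape the next turn.
    virtual void on_response(const std::string & session_id, const std::string & response) {
        (void) session_id;
        (void) response;
    }
};
```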
Update: In parallel I'm working on using the same interface for a governance model where I re-inject feedback into the next prompt based on the previous response. This works, but at least with Gemma3 it doesn't consistently override undesirable behavior, so I'm now learning how logit biases work and where I could potentially modify them. My current goal is to create per-session, in-order tracking of tasks so I can then do things like compare responses, set up dynamic logit biases, look at drift, etc. I believe I can do this from within …
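As a rough starting point for the logit-bias experiment, the sampler-chain API already in llama.h can attach a bias to selected tokens. The sketch below assumes the biased token ids and values are computed elsewhere (per session), and it omits error handling:

```cpp
// Sketch: attach a logit bias to a sampler chain using the sampling API that
// already exists in llama.h. The token ids/bias values and where the chain is
// rebuilt per session are assumptions; error handling is omitted.
#include "llama.h"

#include <vector>

static llama_sampler * make_biased_sampler(int32_t n_vocab,
                                            const std::vector<llama_logit_bias> & biases) {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    // Shift the logits of the listed token ids before the final sampling step.
    llama_sampler_chain_add(chain,
        llama_sampler_init_logit_bias(n_vocab, (int32_t) biases.size(), biases.data()));

    // The rest of the usual chain (temperature, top-k, dist, ...) would follow;
    // greedy is used here only to keep the sketch short.
    llama_sampler_chain_add(chain, llama_sampler_init_greedy());

    return chain;
}
```

A strongly negative bias (e.g. -100.0f) effectively suppresses a token, so per-session feedback could in principle be turned into a hard constraint this way; whether that is the right mechanism for drift control is exactly the open question here.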
This is a rough proof-of-concept for implementing a chat memory interface inspired by ChatGPT's memories feature. It is separated into 3 parts:
A key goal for this POC was to minimize the changes to main/server and keep as much of the logic as possible in the chat-memory classes. One change that was necessary, for instance, was passing the conv_id from the webui back to the server so that each session has its own memory (see the sketch below). Per-user or per-group memories could potentially be implemented as well. A future goal for this project would be to allow integration with local databases, S3, and Ceph to store these memories persistently.
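A hypothetical sketch of how per-session memories keyed by conv_id might be organized; the names are illustrative and not taken from the PR:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Minimal stand-in for the memory interface sketched earlier.
class ChatMemory {
public:
    virtual ~ChatMemory() = default;
};

// Hypothetical per-session registry: the server looks up a memory instance by
// the conv_id passed back from the webui, so each conversation gets isolated
// storage. Keying by user or group id instead would give per-user/per-group
// memories, and the factory decides the backend (std::map, sqlite3, S3, ...).
class ChatMemoryRegistry {
public:
    using Factory = std::function<std::unique_ptr<ChatMemory>()>;

    explicit ChatMemoryRegistry(Factory factory) : factory_(std::move(factory)) {}

    ChatMemory & for_conversation(const std::string & conv_id) {
        auto it = sessions_.find(conv_id);
        if (it == sessions_.end()) {
            it = sessions_.emplace(conv_id, factory_()).first;
        }
        return *it->second;
    }

private:
    Factory factory_;
    std::map<std::string, std::unique_ptr<ChatMemory>> sessions_;
};
```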
The simple implementation has a lot of code dedicated to trying to keep the model from hallucinating about the state of the memory, with limited success. The model used for testing is Gemma 3 4B Q8, and it aggressively trusts its own training and makes up fake statistics. It's possible that larger or other models behave better; however, this will need active work and may require specialized training to work consistently.
In addition to the above issue (among others!), this POC has several deficiencies:
My goal before taking this any further is to solicit feedback from ggml and the greater community to see if this project merits continued development. While the vast majority of the code is in ChatMemorySimple, the more important pieces to focus on, IMHO, are the interface, the base class, and the modifications to the existing code.