🚀 The feature, motivation and pitch
Hello! Firstly, thank you for such a wonderful library.
I am currently developing a custom prompt auto-optimization tool that requires support for multimodal multi-turn conversations. As part of the optimization prompt for the teacher model, there will be a section containing examples of how the student model's requests were processed. The structure of these examples will look something like this:
```
## Example N ##
<message_ID>__<var_name> = <var_value>
...
<message_ID>__<var_name> = <var_value>
Predicted answer: <pred>
Ground truth answer: <gtruth>
Conclusion: <conclusion>
```
Here:
- Each message can be of any role (system, user, assistant)
- Variables can represent either text or image data
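For concreteness, one such example roughly maps onto a multi-turn request like the following sketch (the message contents, variable names, and image URL are made up purely for illustration):

```python
# Hypothetical request built from one example; all values are illustrative.
example_messages = [
    {"role": "system", "content": "You are the student model."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "msg_1__question = What is shown in this picture?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/img_1.png"}},
            {"type": "text", "text": "msg_1__hint = Focus on the foreground."},
        ],
    },
    {"role": "assistant", "content": "Predicted answer: a cat"},
]
```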
From my understanding of the code here, image tokens are currently appended to the beginning of the text prompt:
```python
def _get_full_multimodal_text_prompt(placeholder_counts: Dict[str, int],
                                     text_prompt: str) -> str:
    """Combine multimodal prompts for a multimodal language model."""
    ...
    ...
    return "\n".join(missing_placeholders + [text_prompt])
```
This approach stacks all image tokens together at the start of the prompt, which could potentially hurt the performance of the teacher model. Would it be possible to implement fully interleaved support for multimodal prompts?
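To make the difference concrete, here is a rough illustration for a user turn with two images interleaved with text (`"<image>"` stands in for whatever placeholder string the particular model actually uses):

```python
# Current behaviour: missing placeholders are prepended, so both image tokens
# end up stacked at the front of the turn.
current_prompt = ("<image>\n<image>\n"
                  "Compare the first picture\nwith the second picture.")

# Desired behaviour: each image token stays where the image part appeared.
interleaved_prompt = ("Compare the first picture\n<image>\n"
                      "with the second picture\n<image>")
```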
I have implemented some custom modifications to the code, but I am uncertain whether these changes are compatible with all VLM architectures or if they might affect subsequent processing stages:
```python
IMAGE_PLACEHOLDER = "<##IMAGE##>"

..............

def _get_full_multimodal_text_prompt(placeholder_counts: Dict[str, int],
                                     texts: List[str],
                                     interleave: bool) -> str:
    """Combine multimodal prompts for a multimodal language model."""
    return (_full_multimodal_text_prompt_interleave(placeholder_counts, texts)
            if interleave else _full_multimodal_text_prompt_simple(
                placeholder_counts, "\n".join(texts)))


def _full_multimodal_text_prompt_simple(placeholder_counts: Dict[str, int],
                                        text_prompt: str) -> str:
    # Look through the text prompt to check for missing placeholders
    missing_placeholders: List[str] = []
    for placeholder in placeholder_counts:
        # For any existing placeholder in the text prompt, we leave it as is
        placeholder_counts[placeholder] -= text_prompt.count(placeholder)
        if placeholder_counts[placeholder] < 0:
            raise ValueError(
                f"Found more '{placeholder}' placeholders in input prompt than "
                "actual multimodal data items.")
        missing_placeholders.extend([placeholder] *
                                    placeholder_counts[placeholder])

    # NOTE: For now we always add missing placeholders at the front of
    # the prompt. This may change to be customizable in the future.
    return "\n".join(missing_placeholders + [text_prompt])


def _list_replace(obj: list, old: Any, new: Any, n: int = 1) -> None:
    """Replace the first `n` occurrences of `old` in `obj` in place."""
    assert n > 0
    idx = 0
    while n != 0 and idx != len(obj):
        if obj[idx] == old:
            obj[idx] = new
            n -= 1
        idx += 1


def _full_multimodal_text_prompt_interleave(placeholder_counts: Dict[str, int],
                                            texts: List[str]) -> str:
    # Replace the interleaving sentinels with the model-specific placeholders
    for placeholder, n in placeholder_counts.items():
        _list_replace(texts, IMAGE_PLACEHOLDER, placeholder, n)
    return "\n".join(texts)

..............
```
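For reference, this is roughly what the new interleave path produces in isolation (again, `"<image>"` stands in for the model-specific placeholder string that `mm_placeholder_counts` would actually contain):

```python
# Quick check of the interleave path; "<image>" is illustrative only.
texts = ["Describe this:", "<##IMAGE##>", "and compare it with", "<##IMAGE##>"]
placeholder_counts = {"<image>": 2}

prompt = _get_full_multimodal_text_prompt(placeholder_counts, texts,
                                          interleave=True)
assert prompt == "Describe this:\n<image>\nand compare it with\n<image>"
```

The message-parsing side is changed accordingly: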
```python
def _parse_chat_message_content_parts(
    role: str,
    parts: Iterable[ChatCompletionContentPartParam],
    mm_tracker: BaseMultiModalItemTracker,
    *,
    wrap_dicts: bool,
    interleave: bool = True,
) -> List[ConversationMessage]:
    content = list[_ContentPart]()
    mm_parser = mm_tracker.create_parser()

    for part in parts:
        parse_res = _parse_chat_message_content_part(
            part,
            mm_parser,
            wrap_dicts=wrap_dicts,
            interleave=interleave,
        )
        if parse_res:
            content.append(parse_res)

    if wrap_dicts:
        # Parsing wraps images and texts as interleaved dictionaries
        return [ConversationMessage(role=role,
                                    content=content)]  # type: ignore

    texts = cast(List[str], content)
    mm_placeholder_counts = mm_parser.mm_placeholder_counts()
    if mm_placeholder_counts:
        text_prompt = _get_full_multimodal_text_prompt(mm_placeholder_counts,
                                                       texts,
                                                       interleave=interleave)
    else:
        text_prompt = "\n".join(texts)

    return [ConversationMessage(role=role, content=text_prompt)]


def _parse_chat_message_content_part(
    part: ChatCompletionContentPartParam,
    mm_parser: BaseMultiModalContentParser,
    *,
    wrap_dicts: bool,
    interleave: bool,
) -> Optional[_ContentPart]:
    ....
    if part_type == "image_url":
        str_content = cast(str, content)
        mm_parser.parse_image(str_content)
        # Keep the image's position in the turn by emitting a sentinel that is
        # later replaced with the model-specific placeholder
        return ({'type': 'image'} if wrap_dicts else
                (IMAGE_PLACEHOLDER if interleave else None))
    ....
```
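To summarize the intended flow for a single user turn with interleave=True, here is a simplified sketch (plain dicts instead of the actual vLLM part types): text and image parts each contribute one entry to the content list in their original order, and the sentinels are swapped for the real placeholders afterwards.

```python
# Simplified sketch; the real code iterates over ChatCompletionContentPartParam.
parts = [
    {"type": "text", "text": "Describe this:"},
    {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
    {"type": "text", "text": "and compare it with"},
    {"type": "image_url", "image_url": {"url": "https://example.com/b.png"}},
]

content = []
for part in parts:
    if part["type"] == "image_url":
        content.append("<##IMAGE##>")  # sentinel emitted by the patched parser
    else:
        content.append(part["text"])

assert content == ["Describe this:", "<##IMAGE##>",
                   "and compare it with", "<##IMAGE##>"]
# _get_full_multimodal_text_prompt then replaces the sentinels in place, so the
# final prompt keeps each image token exactly where its image appeared.
```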
If these modifications look OK, I'm ready to submit a properly prepared PR.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.