
[Feature]: Fully interleaved support for multimodal prompts #12885

Open
Dekakhrone opened this issue Feb 7, 2025 · 3 comments · May be fixed by #14047
Labels: feature request

Comments

@Dekakhrone

🚀 The feature, motivation and pitch

Hello! Firstly, thank you for such a wonderful library.

I am currently developing a custom prompt auto-optimization tool that requires support for multimodal, multi-turn conversations. The optimization prompt for the teacher model will include a section with examples of how the student model's requests were processed. The structure of these examples will look something like this:

## Example N ##
<message_ID>__<var_name> = <var_value>
...
<message_ID>__<var_name> = <var_value>

Predicted answer: <pred>
Ground truth answer: <gtruth>
Conclusion: <conclusion>

Here:

  • Each message can be of any role (system, user, assistant)
  • Variables can represent either text or image data
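
For concreteness, a filled-in example (the variable names and values here are hypothetical) might look like:

## Example 1 ##
msg_0__instruction = Describe the animal in the image.
msg_1__image = <image>

Predicted answer: cat
Ground truth answer: dog
Conclusion: The prediction does not match the ground truth.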

From my understanding of the code (the _get_full_multimodal_text_prompt helper below), image tokens are currently prepended to the beginning of the text prompt:

def _get_full_multimodal_text_prompt(placeholder_counts: Dict[str, int],
                                     text_prompt: str) -> str:
    """Combine multimodal prompts for a multimodal language model."""
    ...
    return "\n".join(missing_placeholders + [text_prompt])

This approach stacks all the image tokens together at the start of the prompt, which could hurt the performance of the teacher model, as the sketch below illustrates. Would it be possible to implement fully interleaved support for multimodal prompts?
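
To make the effect concrete, here is a minimal, self-contained sketch of the current prepending logic (the "<image>" placeholder string is hypothetical; the real token depends on the model):

from typing import Dict, List

def current_behavior(placeholder_counts: Dict[str, int],
                     text_prompt: str) -> str:
    # Simplified reproduction: placeholders not already present in the
    # text are collected and prepended, newline-separated.
    missing: List[str] = []
    for placeholder, count in placeholder_counts.items():
        missing.extend([placeholder] * (count - text_prompt.count(placeholder)))
    return "\n".join(missing + [text_prompt])

# Two images referenced in different messages end up stacked at the front:
print(current_behavior({"<image>": 2}, "msg_0__image =\nmsg_2__image ="))
# <image>
# <image>
# msg_0__image =
# msg_2__image =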

I have implemented some custom modifications to the code, but I am uncertain whether these changes are compatible with all VLM architectures or if they might affect subsequent processing stages:

IMAGE_PLACEHOLDER = "<##IMAGE##>"

..............

def _get_full_multimodal_text_prompt(placeholder_counts: Dict[str, int],
                                     texts: List[str],
                                     interleave: bool) -> str:
    """Combine multimodal prompts for a multimodal language model."""
    if interleave:
        return _full_multimodal_text_prompt_interleave(placeholder_counts,
                                                       texts)
    return _full_multimodal_text_prompt_simple(placeholder_counts,
                                               "\n".join(texts))


def _full_multimodal_text_prompt_simple(placeholder_counts: Dict[str, int],
                                        text_prompt: str) -> str:
    # Look through the text prompt to check for missing placeholders
    missing_placeholders: List[str] = []
    for placeholder in placeholder_counts:

        # For any existing placeholder in the text prompt, we leave it as is
        placeholder_counts[placeholder] -= text_prompt.count(placeholder)

        if placeholder_counts[placeholder] < 0:
            raise ValueError(
                f"Found more '{placeholder}' placeholders in input prompt than "
                "actual multimodal data items.")

        missing_placeholders.extend([placeholder] *
                                    placeholder_counts[placeholder])

    # NOTE: For now we always add missing placeholders at the front of
    # the prompt. This may change to be customizable in the future.
    return "\n".join(missing_placeholders + [text_prompt])


def _list_replace(obj: list, old: Any, new: Any, n: int = 1) -> None:
    """Replace up to n occurrences of old with new, mutating obj in place."""
    assert n > 0

    idx = 0
    while n != 0 and idx != len(obj):
        if obj[idx] == old:
            obj[idx] = new
            n -= 1
        idx += 1
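
# Example of the helper above (hypothetical values), mutating in place:
#   parts = ["## Example 1 ##", IMAGE_PLACEHOLDER, "Predicted answer: cat"]
#   _list_replace(parts, IMAGE_PLACEHOLDER, "<image>", n=1)
#   parts == ["## Example 1 ##", "<image>", "Predicted answer: cat"]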


def _full_multimodal_text_prompt_interleave(placeholder_counts: Dict[str, int],
                                            texts: List[str]) -> str:
    # Substitute each model-specific placeholder for the generic
    # IMAGE_PLACEHOLDER markers, preserving their original positions.
    for placeholder, n in placeholder_counts.items():
        _list_replace(texts, IMAGE_PLACEHOLDER, placeholder, n)

    return "\n".join(texts)

..............

def _parse_chat_message_content_parts(
    role: str,
    parts: Iterable[ChatCompletionContentPartParam],
    mm_tracker: BaseMultiModalItemTracker,
    *,
    wrap_dicts: bool,
    interleave: bool = True
) -> List[ConversationMessage]:
    content = list[_ContentPart]()

    mm_parser = mm_tracker.create_parser()

    for part in parts:
        parse_res = _parse_chat_message_content_part(
            part,
            mm_parser,
            wrap_dicts=wrap_dicts,
            interleave=interleave
        )
        if parse_res:
            content.append(parse_res)

    if wrap_dicts:
        # Parsing wraps images and texts as interleaved dictionaries
        return [ConversationMessage(role=role,
                                    content=content)]  # type: ignore
    texts = cast(List[str], content)
    
    mm_placeholder_counts = mm_parser.mm_placeholder_counts()
    if mm_placeholder_counts:
        text_prompt = _get_full_multimodal_text_prompt(mm_placeholder_counts,
                                                       texts,
                                                       interleave=interleave)
    else:
        text_prompt = "\n".join(texts)

    return [ConversationMessage(role=role, content=text_prompt)]


def _parse_chat_message_content_part(
    part: ChatCompletionContentPartParam,
    mm_parser: BaseMultiModalContentParser,
    *,
    wrap_dicts: bool,
    interleave: bool
) -> Optional[_ContentPart]:
    ...
    if part_type == "image_url":
        str_content = cast(str, content)
        mm_parser.parse_image(str_content)
        return {'type': 'image'} if wrap_dicts else (
            IMAGE_PLACEHOLDER if interleave else None)
    ...
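
With these pieces in place, a quick sanity check of the interleave path (hypothetical values; the actual "<image>" placeholder string depends on the model) behaves as follows:

texts = ["## Example 1 ##", IMAGE_PLACEHOLDER, "Predicted answer: cat"]
_get_full_multimodal_text_prompt({"<image>": 1}, texts, interleave=True)
# -> '## Example 1 ##\n<image>\nPredicted answer: cat'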

If these modifications look OK, I'm ready to submit a properly prepared PR.

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Dekakhrone added the feature request label on Feb 7, 2025
@Dekakhrone changed the title from "Fully interleaved support for multimodal prompts" to "[Feature]: Fully interleaved support for multimodal prompts" on Feb 7, 2025
@robertgshaw2-redhat
Collaborator

cc @ywang96 @DarkLight1337 FYI

@DarkLight1337
Member

OpenAI-format chat templates currently do support interleaved text and multimodal inputs, see: #12740
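
For reference, in the OpenAI chat format the content parts of a message keep their list order, so interleaving is expressed directly (a minimal sketch; the URL is illustrative):

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "## Example 1 ##"},
        {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
        {"type": "text", "text": "Predicted answer: cat"},
    ],
}]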

@DarkLight1337
Member

Nevertheless, it would be nice to add this support for string chat templates as well; feel free to open a PR!
