
[Feature]: Fully interleaved support for multimodal prompts #12885

Open
Dekakhrone opened this issue Feb 7, 2025 · 3 comments · May be fixed by #14047
Labels: feature request

Comments

@Dekakhrone

🚀 The feature, motivation and pitch

Hello! Firstly, thank you for such a wonderful library.

I am currently developing a custom prompt auto-optimization tool that requires support for multimodal, multi-turn conversations. The optimization prompt for the teacher model will include a section with examples of how the student model's requests were processed. The structure of these examples will look something like this:

## Example N ##
<message_ID>__<var_name> = <var_value>
...
<message_ID>__<var_name> = <var_value>

Predicted answer: <pred>
Ground truth answer: <gtruth>
Conclusion: <conclusion>

Here:

  • Each message can be of any role (system, user, assistant)
  • Variables can represent either text or image data
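
For concreteness, a filled-in example (the variable names and values here are hypothetical) might look like:

## Example 1 ##
msg_0__instruction = Describe the animal in the image.
msg_1__image = <image>

Predicted answer: cat
Ground truth answer: dog
Conclusion: The prediction does not match the ground truth.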

From my understanding of the code (the _get_full_multimodal_text_prompt helper below), image tokens are currently prepended to the beginning of the text prompt:

def _get_full_multimodal_text_prompt(placeholder_counts: Dict[str, int],
                                     text_prompt: str) -> str:
    """Combine multimodal prompts for a multimodal language model."""
    ...
    return "\n".join(missing_placeholders + [text_prompt])

This approach stacks all the image tokens together at the start of the prompt, which could hurt the performance of the teacher model, as the sketch below illustrates. Would it be possible to implement fully interleaved support for multimodal prompts?
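
To make the effect concrete, here is a minimal, self-contained sketch of the current prepending logic (the "<image>" placeholder string is hypothetical; the real token depends on the model):

from typing import Dict, List

def current_behavior(placeholder_counts: Dict[str, int],
                     text_prompt: str) -> str:
    # Simplified reproduction: placeholders not already present in the
    # text are collected and prepended, newline-separated.
    missing: List[str] = []
    for placeholder, count in placeholder_counts.items():
        missing.extend([placeholder] * (count - text_prompt.count(placeholder)))
    return "\n".join(missing + [text_prompt])

# Two images referenced in different messages end up stacked at the front:
print(current_behavior({"<image>": 2}, "msg_0__image =\nmsg_2__image ="))
# <image>
# <image>
# msg_0__image =
# msg_2__image =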

I have implemented some custom modifications to the code, but I am uncertain whether these changes are compatible with all VLM architectures or if they might affect subsequent processing stages:

IMAGE_PLACEHOLDER = "<##IMAGE##>"

..............

def _get_full_multimodal_text_prompt(placeholder_counts: Dict[str, int],
                                     texts: List[str],
                                     interleave: bool) -> str:
    """Combine multimodal prompts for a multimodal language model."""
    if interleave:
        return _full_multimodal_text_prompt_interleave(placeholder_counts,
                                                       texts)
    return _full_multimodal_text_prompt_simple(placeholder_counts,
                                               "\n".join(texts))


def _full_multimodal_text_prompt_simple(placeholder_counts: Dict[str, int],
                                        text_prompt: str) -> str:
    # Look through the text prompt to check for missing placeholders
    missing_placeholders: List[str] = []
    for placeholder in placeholder_counts:

        # For any existing placeholder in the text prompt, we leave it as is
        placeholder_counts[placeholder] -= text_prompt.count(placeholder)

        if placeholder_counts[placeholder] < 0:
            raise ValueError(
                f"Found more '{placeholder}' placeholders in input prompt than "
                "actual multimodal data items.")

        missing_placeholders.extend([placeholder] *
                                    placeholder_counts[placeholder])

    # NOTE: For now we always add missing placeholders at the front of
    # the prompt. This may change to be customizable in the future.
    return "\n".join(missing_placeholders + [text_prompt])


def _list_replace(obj: list, old: Any, new: Any, n: int = 1) -> None:
    """Replace up to n occurrences of old with new, mutating obj in place."""
    assert n > 0

    idx = 0
    while n != 0 and idx != len(obj):
        if obj[idx] == old:
            obj[idx] = new
            n -= 1
        idx += 1
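
# Example of the helper above (hypothetical values), mutating in place:
#   parts = ["## Example 1 ##", IMAGE_PLACEHOLDER, "Predicted answer: cat"]
#   _list_replace(parts, IMAGE_PLACEHOLDER, "<image>", n=1)
#   parts == ["## Example 1 ##", "<image>", "Predicted answer: cat"]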


def _full_multimodal_text_prompt_interleave(placeholder_counts: Dict[str, int],
                                            texts: List[str]) -> str:
    # Substitute each model-specific placeholder for the generic
    # IMAGE_PLACEHOLDER markers, preserving their original positions.
    for placeholder, n in placeholder_counts.items():
        _list_replace(texts, IMAGE_PLACEHOLDER, placeholder, n)

    return "\n".join(texts)

..............

def _parse_chat_message_content_parts(
    role: str,
    parts: Iterable[ChatCompletionContentPartParam],
    mm_tracker: BaseMultiModalItemTracker,
    *,
    wrap_dicts: bool,
    interleave: bool = True
) -> List[ConversationMessage]:
    content = list[_ContentPart]()

    mm_parser = mm_tracker.create_parser()

    for part in parts:
        parse_res = _parse_chat_message_content_part(
            part,
            mm_parser,
            wrap_dicts=wrap_dicts,
            interleave=interleave
        )
        if parse_res:
            content.append(parse_res)

    if wrap_dicts:
        # Parsing wraps images and texts as interleaved dictionaries
        return [ConversationMessage(role=role,
                                    content=content)]  # type: ignore
    texts = cast(List[str], content)
    
    mm_placeholder_counts = mm_parser.mm_placeholder_counts()
    if mm_placeholder_counts:
        text_prompt = _get_full_multimodal_text_prompt(mm_placeholder_counts,
                                                       texts,
                                                       interleave=interleave)
    else:
        text_prompt = "\n".join(texts)

    return [ConversationMessage(role=role, content=text_prompt)]


def _parse_chat_message_content_part(
    part: ChatCompletionContentPartParam,
    mm_parser: BaseMultiModalContentParser,
    *,
    wrap_dicts: bool,
    interleave: bool
) -> Optional[_ContentPart]:
    ...
    if part_type == "image_url":
        str_content = cast(str, content)
        mm_parser.parse_image(str_content)
        return {'type': 'image'} if wrap_dicts else (
            IMAGE_PLACEHOLDER if interleave else None)
    ...
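
With these pieces in place, a quick sanity check of the interleave path (hypothetical values; the actual "<image>" placeholder string depends on the model) behaves as follows:

texts = ["## Example 1 ##", IMAGE_PLACEHOLDER, "Predicted answer: cat"]
_get_full_multimodal_text_prompt({"<image>": 1}, texts, interleave=True)
# -> '## Example 1 ##\n<image>\nPredicted answer: cat'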

If these modifications look OK, I'm ready to submit a properly prepared PR.

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Dekakhrone added the feature request label on Feb 7, 2025
@Dekakhrone changed the title from "Fully interleaved support for multimodal prompts" to "[Feature]: Fully interleaved support for multimodal prompts" on Feb 7, 2025
@robertgshaw2-redhat
Collaborator

cc @ywang96 @DarkLight1337 FYI

@DarkLight1337
Member

OpenAI-format chat templates currently do support interleaved text and multimodal inputs, see: #12740
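
For reference, in the OpenAI chat format the content parts of a message keep their list order, so interleaving is expressed directly (a minimal sketch; the URL is illustrative):

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "## Example 1 ##"},
        {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
        {"type": "text", "text": "Predicted answer: cat"},
    ],
}]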

@DarkLight1337
Member

Nevertheless, it would be nice to add this support for string chat templates as well; feel free to open a PR!
