Merged
33 changes: 26 additions & 7 deletions internal/live/proxy.go
@@ -274,17 +274,36 @@ func (p *Proxy) handleServerContent(content *genai.LiveServerContent) {
p.sendBinary(part.InlineData.Data)
}
  if part.Text != "" && !part.Thought {
- // Capture non-thinking transcript for analyze_user context.
+ // Capture non-thinking transcript for tool context (analyze_user).
+ // Browser display uses OutputTranscription to avoid duplicates.
  p.toolHandler.AddTranscript("model", part.Text)
- // Forward transcript as JSON (skip model thinking/reasoning text).
- p.sendJSON(map[string]any{
- "type": "transcript",
- "role": "model",
- "text": part.Text,
- })
  }
}
}

// Forward input transcription (what the user said).
if content.InputTranscription != nil && content.InputTranscription.Text != "" {


P2: Forward empty finished transcription chunks to clients

handleServerContent drops transcription updates whenever Text is empty, so a Finished=true terminal chunk with no text is never forwarded. In that case the frontend never receives the finalize signal it needs to clear/commit the pending bubble (it already has explicit empty-finalize handling), and the backend also skips AddTranscript("user", ...) because that is only executed on finished events inside this same non-empty guard. This can leave stale/concatenated chat turns and lose user utterances from tool context for analyze_user.
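
The guard this comment asks for could look roughly like the sketch below. It is written in TypeScript to match the frontend code in this PR, not the Go proxy itself. The `TranscriptionChunk` shape and the `shouldForward` helper are hypothetical names used only for illustration; they are not part of the genai SDK or this codebase.

```typescript
// Hypothetical shape mirroring the Text/Finished fields discussed above.
type TranscriptionChunk = { text: string; finished: boolean };

// Forward a chunk when it carries text OR when it is the terminal chunk,
// so an empty finished=true update still reaches the client as a
// finalize signal.
function shouldForward(chunk: TranscriptionChunk | null): boolean {
  if (chunk === null) {
    return false;
  }
  return chunk.text !== '' || chunk.finished;
}
```

With this rule, only empty, unfinished chunks are dropped; every terminal chunk is delivered regardless of its text.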


// Only persist finalized user speech to tool context.
if content.InputTranscription.Finished {
p.toolHandler.AddTranscript("user", content.InputTranscription.Text)
}
p.sendJSON(map[string]any{
Comment on lines +285 to +290


P2: Persist user transcription only when chunk is finished

AddTranscript("user", ...) runs for every InputTranscription update, including partial chunks (Finished == false). Because analyze_user uses only a bounded recent transcript buffer, long in-progress utterances can fill the buffer with fragments and evict real prior turns, degrading analysis quality. Coalesce partial input transcription updates and persist only completed user turns to the transcript store.
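
A minimal sketch of the coalescing idea, in TypeScript for consistency with the frontend code in this PR. `TranscriptBuffer` and `UserTurnCoalescer` are hypothetical stand-ins for the tool handler's transcript store, and the sketch assumes each chunk carries only the new text since the previous chunk (the same assumption the frontend makes when it concatenates partials).

```typescript
type Entry = { role: 'user' | 'model'; text: string };

// Hypothetical bounded store: keeps only the most recent `capacity` entries.
class TranscriptBuffer {
  private entries: Entry[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  add(role: Entry['role'], text: string): void {
    this.entries.push({ role, text });
    // Evict the oldest entries once the bound is exceeded.
    while (this.entries.length > this.capacity) {
      this.entries.shift();
    }
  }

  snapshot(): Entry[] {
    return [...this.entries];
  }
}

// Accumulates partial chunks and writes one entry per completed user turn.
class UserTurnCoalescer {
  private pending = '';
  private buffer: TranscriptBuffer;

  constructor(buffer: TranscriptBuffer) {
    this.buffer = buffer;
  }

  onChunk(text: string, finished: boolean): void {
    this.pending += text;
    if (!finished) {
      return; // Partial chunk: hold it, do not touch the bounded buffer.
    }
    if (this.pending !== '') {
      this.buffer.add('user', this.pending);
    }
    this.pending = '';
  }
}
```

Because fragments never reach the buffer, a long utterance costs exactly one slot, so earlier turns are not evicted while the user is still speaking.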


"type": "transcript",
"role": "user",
"text": content.InputTranscription.Text,
"finished": content.InputTranscription.Finished,
})
}

// Forward output transcription (what the model said, as text).
if content.OutputTranscription != nil && content.OutputTranscription.Text != "" {
p.sendJSON(map[string]any{
"type": "transcript",
"role": "model",
"text": content.OutputTranscription.Text,
"finished": content.OutputTranscription.Finished,
Comment on lines +299 to +304


P1: Stop sending duplicate model transcripts to the client

This new branch emits a second transcript stream for role: "model" even though handleServerContent already emits model text from ModelTurn.Parts above. When OutputAudioTranscription is enabled, Live messages can include both sources for the same utterance, and web/app/page.tsx currently merges chunks by role into a single pending message, which leads to duplicated/garbled chat text in the HUD. Use one canonical model transcript source (or a distinct message type) so the frontend does not interleave two model streams.
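
One way to picture the "single canonical source" rule is the routing sketch below, again in TypeScript instead of the proxy's Go. The `ServerContent` shape and `routeModelText` are hypothetical simplifications of the Live message, not the SDK's types: non-thinking part text goes to tool context only, and the client receives model text solely from the output transcription.

```typescript
type Part = { text: string; thought: boolean };
type Transcription = { text: string; finished: boolean };

// Hypothetical, simplified view of one Live server message.
type ServerContent = {
  parts: Part[];
  outputTranscription: Transcription | null;
};

type ClientTranscript = {
  type: 'transcript';
  role: 'model';
  text: string;
  finished: boolean;
};

function routeModelText(content: ServerContent): {
  toolContext: string[];
  client: ClientTranscript[];
} {
  // Non-thinking part text feeds tool context and is never sent to the client.
  const toolContext = content.parts
    .filter((p) => p.text !== '' && !p.thought)
    .map((p) => p.text);

  // The client sees exactly one model stream: the output transcription.
  const client: ClientTranscript[] = [];
  const t = content.outputTranscription;
  if (t !== null && t.text !== '') {
    client.push({ type: 'transcript', role: 'model', text: t.text, finished: t.finished });
  }
  return { toolContext, client };
}
```

A message that carries both sources for the same utterance then yields one client message instead of two.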


})
}
Comment on lines +299 to +306


high

This new block for forwarding the model's output transcription appears to duplicate existing logic. The loop over content.ModelTurn.Parts on lines 271-286 already sends a transcript message with the model's text. This will likely result in duplicate chat messages being displayed in the UI. Since this new OutputTranscription path correctly provides the finished flag, which the new UI logic relies on, the sendJSON call within the ModelTurn loop should probably be removed to resolve the duplication.

Comment on lines +284 to +306


⚠️ Potential issue | 🟠 Major

Sending transcription events twice and as accumulated partials can cause duplicate chat messages and context pollution.

If the Line 276 path (part.Text) and the Line 301 path (OutputTranscription.Text) both emit model transcripts, the same utterance is rendered twice. Line 291 also passes partial chunks to AddTranscript before they are finished, which can bloat the tool context unnecessarily.

🧩 Suggested patch
-  // Forward input transcription (what the user said).
+  // Forward input transcription (what the user said).
   if content.InputTranscription != nil && content.InputTranscription.Text != "" {
-    p.toolHandler.AddTranscript("user", content.InputTranscription.Text)
+    if content.InputTranscription.Finished {
+      p.toolHandler.AddTranscript("user", content.InputTranscription.Text)
+    }
     p.sendJSON(map[string]any{
       "type":     "transcript",
       "role":     "user",
       "text":     content.InputTranscription.Text,
       "finished": content.InputTranscription.Finished,
     })
   }

-  // Forward output transcription (what the model said, as text).
-  if content.OutputTranscription != nil && content.OutputTranscription.Text != "" {
+  // Forward output transcription (what the model said, as text).
+  // Prefer a single model transcript source to avoid duplicates with ModelTurn.Part.Text.
+  if content.OutputTranscription != nil && content.OutputTranscription.Text != "" {
+    p.toolHandler.AddTranscript("model", content.OutputTranscription.Text)
     p.sendJSON(map[string]any{
       "type":     "transcript",
       "role":     "model",
       "text":     content.OutputTranscription.Text,
       "finished": content.OutputTranscription.Finished,
     })
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the current code first, followed by the proposed replacement)
// Forward input transcription (what the user said).
if content.InputTranscription != nil && content.InputTranscription.Text != "" {
p.toolHandler.AddTranscript("user", content.InputTranscription.Text)
p.sendJSON(map[string]any{
"type": "transcript",
"role": "user",
"text": content.InputTranscription.Text,
"finished": content.InputTranscription.Finished,
})
}
// Forward output transcription (what the model said, as text).
if content.OutputTranscription != nil && content.OutputTranscription.Text != "" {
p.sendJSON(map[string]any{
"type": "transcript",
"role": "model",
"text": content.OutputTranscription.Text,
"finished": content.OutputTranscription.Finished,
})
}
// Forward input transcription (what the user said).
if content.InputTranscription != nil && content.InputTranscription.Text != "" {
if content.InputTranscription.Finished {
p.toolHandler.AddTranscript("user", content.InputTranscription.Text)
}
p.sendJSON(map[string]any{
"type": "transcript",
"role": "user",
"text": content.InputTranscription.Text,
"finished": content.InputTranscription.Finished,
})
}
// Forward output transcription (what the model said, as text).
// Prefer a single model transcript source to avoid duplicates with ModelTurn.Part.Text.
if content.OutputTranscription != nil && content.OutputTranscription.Text != "" {
p.toolHandler.AddTranscript("model", content.OutputTranscription.Text)
p.sendJSON(map[string]any{
"type": "transcript",
"role": "model",
"text": content.OutputTranscription.Text,
"finished": content.OutputTranscription.Finished,
})
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/live/proxy.go` around lines 289 - 308, Avoid duplicating and
accumulating partial transcripts by only forwarding and adding to tool context
once the transcript is final: change the logic around AddTranscript and sendJSON
for content.InputTranscription/part.Text and content.OutputTranscription so that
partial (unfinished) segments are not passed to p.toolHandler.AddTranscript, and
ensure you do not emit the same model utterance twice when both part.Text and
content.OutputTranscription.Text are present (prefer the final
OutputTranscription when Finished==true or dedupe by skipping part.Text if
OutputTranscription exists). In short, gate AddTranscript and sendJSON on the
Finished flag and add a check so model transcripts from part.Text are suppressed
when content.OutputTranscription is present to prevent duplicate render/context
pollution.

}

// handleToolCall executes a tool and sends the response back to Live API.
6 changes: 5 additions & 1 deletion internal/session/manager.go
@@ -168,6 +168,8 @@ func (m *Manager) BuildOnboardingConfig() *genai.LiveConnectConfig {
},
},
},
InputAudioTranscription: &genai.AudioTranscriptionConfig{},
OutputAudioTranscription: &genai.AudioTranscriptionConfig{},
Proactivity: &genai.ProactivityConfig{
ProactiveAudio: &enableProactive,
},
@@ -231,7 +233,9 @@ func (m *Manager) BuildReunionConfig() *genai.LiveConnectConfig {
},
},
},
- EnableAffectiveDialog: &enableAffective,
+ InputAudioTranscription: &genai.AudioTranscriptionConfig{},
+ OutputAudioTranscription: &genai.AudioTranscriptionConfig{},
+ EnableAffectiveDialog: &enableAffective,
Proactivity: &genai.ProactivityConfig{
ProactiveAudio: &enableProactive,
},
16 changes: 16 additions & 0 deletions internal/session/manager_test.go
@@ -45,6 +45,14 @@ func TestManager_StartOnboarding_Config(t *testing.T) {
t.Fatalf("expected English greeting in system instruction")
}

// Must have audio transcription enabled.
if cfg.InputAudioTranscription == nil {
t.Fatal("expected InputAudioTranscription config")
}
if cfg.OutputAudioTranscription == nil {
t.Fatal("expected OutputAudioTranscription config")
}

// Must have tools declared.
if len(cfg.Tools) == 0 {
t.Fatal("expected tools")
@@ -150,6 +158,14 @@ func TestManager_BuildReunionConfig(t *testing.T) {
t.Fatalf("expected personality in system instruction")
}

// Must have audio transcription.
if cfg.InputAudioTranscription == nil {
t.Fatal("expected InputAudioTranscription in reunion config")
}
if cfg.OutputAudioTranscription == nil {
t.Fatal("expected OutputAudioTranscription in reunion config")
}

// Must have tools.
if len(cfg.Tools) == 0 || len(cfg.Tools[0].FunctionDeclarations) == 0 {
t.Fatal("expected reunion tools")
128 changes: 58 additions & 70 deletions web/app/page.tsx
@@ -9,24 +9,22 @@ import SceneDisplay from '../components/SceneDisplay';
import SessionTransition from '../components/SessionTransition';
import OnboardingFlow, { type OnboardingStage } from '../components/OnboardingFlow';
import BGMPlayer from '../components/BGMPlayer';
import ChatPanel, { type ChatMessage } from '../components/ChatPanel';
import StatusHUD from '../components/StatusHUD';
import ActionsHUD from '../components/ActionsHUD';
import type { YouTubeVideo } from '../components/YouTubeGrid';
import type { Highlight } from '../components/HighlightCard';

type TransitionPhase = 'idle' | 'transitioning' | 'ready';

- const CONNECTION_COLORS: Record<string, string> = {
- connected: '#4ade80',
- connecting: '#fbbf24',
- disconnected: '#ef4444',
- error: '#ef4444',
- };

export default function Home() {
const [started, setStarted] = useState(false);
const [previewSrc, setPreviewSrc] = useState<string | null>(null);
const [finalSrc, setFinalSrc] = useState<string | null>(null);
const [transition, setTransition] = useState<TransitionPhase>('idle');
- const [transcript, setTranscript] = useState<string>('');
+ const [chatMessages, setChatMessages] = useState<ChatMessage[]>([]);
+ const pendingMsgRef = useRef<{ model: string | null; user: string | null }>({ model: null, user: null });
+ const msgIdRef = useRef(0);
const readyTimerRef = useRef<ReturnType<typeof setTimeout> | null>(null);

// Onboarding state
@@ -38,7 +36,7 @@ export default function Home() {
const [analysisPercent, setAnalysisPercent] = useState(0);
const [bgmUrl, setBgmUrl] = useState<string | null>(null);

- const { initAudioContext, playPCM, cleanup: cleanupAudio } = useAudio();
+ const { initAudioContext, playPCM, isPlaying, cleanup: cleanupAudio } = useAudio();
const mic = useMicrophone();

const handleMessage = useCallback((msg: ServerMessage) => {
@@ -60,9 +58,48 @@ export default function Home() {
setTransition('ready');
setOnboardingStage('reunion');
break;
- case 'transcript':
- setTranscript(stripMarkdown(msg.text));
+ case 'transcript': {
const role = (msg as { role: string }).role as 'model' | 'user';
const text = stripMarkdown(msg.text);
const finished = (msg as { finished?: boolean }).finished ?? false;

if (finished) {
// Finalize: flush pending partial text into a completed message.
const pending = pendingMsgRef.current[role];
const finalText = pending ? pending + text : text;
pendingMsgRef.current[role] = null;
if (finalText) {
const id = String(msgIdRef.current++);
setChatMessages((prev) => {
// Remove the in-progress placeholder for this role if present.
const cleaned = prev.filter(
(m) => !(m.role === role && !m.finished),
);
return [...cleaned, { id, role, text: finalText, finished: true }];
});
} else {
// Empty finalize: just clean up the placeholder.
setChatMessages((prev) =>
prev.filter((m) => !(m.role === role && !m.finished)),
);
}
} else {
Comment on lines +66 to +86


high

There's a potential bug in how finished transcript messages are handled. If a finished: true message arrives and the resulting finalText is empty, the if (finalText) condition prevents setChatMessages from being called. This means an in-progress message for that role could get stuck on the screen, as it's never cleared. The logic should be restructured to ensure the pending message is always removed when a finished message is processed, regardless of whether the final text is empty.

        if (finished) {
          // Finalize: flush pending partial text into a completed message.
          const pending = pendingMsgRef.current[role];
          const finalText = pending ? pending + text : text;
          if (finalText) {
            const id = String(msgIdRef.current++);
            setChatMessages((prev) => {
              // Remove the in-progress placeholder for this role if present.
              const cleaned = prev.filter(
                (m) => !(m.role === role && !m.finished),
              );
              return [...cleaned, { id, role, text: finalText, finished: true }];
            });
          } else {
            // If final text is empty, just remove the pending message from the UI.
            setChatMessages((prev) =>
              prev.filter((m) => !(m.role === role && !m.finished)),
            );
          }
          pendingMsgRef.current[role] = null;
        }
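
The same finalize rule can also be written as a pure reducer, which makes the "always clear the placeholder" invariant easy to test in isolation. This is only a sketch under assumptions: `State`, `withPending`, and `applyTranscript` are hypothetical names, and the component in this PR keeps the equivalent data in `useState` and `useRef` rather than a single state object.

```typescript
type Role = 'model' | 'user';
type ChatMessage = { id: string; role: Role; text: string; finished: boolean };
type Pending = { model: string | null; user: string | null };
type State = { messages: ChatMessage[]; pending: Pending; nextId: number };

// Returns a copy of `pending` with one role's accumulated text replaced.
function withPending(pending: Pending, role: Role, value: string | null): Pending {
  return role === 'model'
    ? { model: value, user: pending.user }
    : { model: pending.model, user: value };
}

function applyTranscript(state: State, role: Role, text: string, finished: boolean): State {
  const accumulated = (state.pending[role] ?? '') + text;
  // Always drop the in-progress placeholder for this role first, so it can
  // never get stuck, whatever the final text turns out to be.
  const cleaned = state.messages.filter((m) => !(m.role === role && !m.finished));

  if (!finished) {
    // Streaming partial: accumulate and show a placeholder.
    const placeholder: ChatMessage = { id: `pending-${role}`, role, text: accumulated, finished: false };
    return {
      messages: [...cleaned, placeholder],
      pending: withPending(state.pending, role, accumulated),
      nextId: state.nextId,
    };
  }

  const pending = withPending(state.pending, role, null);
  if (accumulated === '') {
    // Empty finalize: nothing to commit, the placeholder is already removed.
    return { messages: cleaned, pending, nextId: state.nextId };
  }
  const done: ChatMessage = { id: String(state.nextId), role, text: accumulated, finished: true };
  return { messages: [...cleaned, done], pending, nextId: state.nextId + 1 };
}
```

Computing `cleaned` before branching is the design choice that matters here: both the empty and the non-empty finalize paths start from a list that no longer contains the placeholder.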

// Streaming partial: accumulate and show placeholder.
const accumulated = (pendingMsgRef.current[role] ?? '') + text;
pendingMsgRef.current[role] = accumulated;
const id = `pending-${role}`;
setChatMessages((prev) => {
const cleaned = prev.filter(
(m) => !(m.role === role && !m.finished),
);
return [
...cleaned,
{ id, role, text: accumulated, finished: false },
];
});
}
break;
}
case 'youtube_videos':
setVideos(msg.videos as YouTubeVideo[]);
setOnboardingStage('youtube_grid');
@@ -143,7 +180,8 @@ export default function Home() {
setPreviewSrc(null);
setFinalSrc(null);
setTransition('idle');
- setTranscript('');
+ setChatMessages([]);
+ pendingMsgRef.current = { model: null, user: null };
setOnboardingStage('welcome');
setVideos([]);
setPersonCrops([]);
@@ -371,64 +409,14 @@ export default function Home() {
onSelectPerson={handleSelectPerson}
/>

- {/* Connection indicator */}
- <div
- style={{
- position: 'absolute',
- top: '1rem',
- right: '1rem',
- display: 'flex',
- alignItems: 'center',
- gap: '0.5rem',
- zIndex: 10,
- }}
- >
- <div
- style={{
- width: 8,
- height: 8,
- borderRadius: '50%',
- background: CONNECTION_COLORS[state] ?? '#ef4444',
- }}
- />
- <span style={{ fontSize: '0.75rem', color: 'var(--color-muted)' }}>
- {state}
- </span>
- {mic.isRecording && (
- <div
- style={{
- width: 8,
- height: 8,
- borderRadius: '50%',
- background: '#ef4444',
- animation: 'pulse 1.5s infinite',
- }}
- title="Microphone active"
- />
- )}
- </div>
-
- {/* Transcript overlay */}
- {transcript && (
- <div
- style={{
- position: 'absolute',
- bottom: '6rem',
- left: '50%',
- transform: 'translateX(-50%)',
- maxWidth: '80%',
- padding: '0.75rem 1.5rem',
- background: 'rgba(0,0,0,0.6)',
- borderRadius: '1rem',
- color: 'var(--color-text)',
- fontSize: '1rem',
- textAlign: 'center',
- zIndex: 10,
- }}
- >
- {transcript}
- </div>
- )}
+ <StatusHUD
+ connection={state}
+ isRecording={mic.isRecording}
+ isPlaying={isPlaying}
+ sessionState={onboardingStage}
+ />
+ <ActionsHUD sessionState={onboardingStage} />
+ <ChatPanel messages={chatMessages} />

{/* Stop button */}
<button
58 changes: 58 additions & 0 deletions web/components/ActionsHUD.tsx
@@ -0,0 +1,58 @@
'use client';

type ActionsHUDProps = {
sessionState: string;
};

type ActionItem = {
label: string;
hint: string;
};

const ONBOARDING_ACTIONS: ActionItem[] = [
{ label: 'Talk', hint: 'Tell missless who you miss' },
{ label: 'Share Video', hint: 'Paste a YouTube link' },
];

const REUNION_ACTIONS: ActionItem[] = [
{ label: 'Talk', hint: 'Have a conversation' },
{ label: 'Scene', hint: '"Paint me a picture of..."' },
{ label: 'Music', hint: '"Play something peaceful"' },
{ label: 'Memory', hint: '"Remember when we..."' },
{ label: 'Album', hint: '"Save this moment"' },
];

export default function ActionsHUD({ sessionState }: ActionsHUDProps) {
const actions = sessionState === 'reunion' ? REUNION_ACTIONS : ONBOARDING_ACTIONS;

return (
<div
style={{
position: 'absolute',
top: '1rem',
right: '1rem',
background: 'rgba(0,0,0,0.5)',
backdropFilter: 'blur(12px)',
borderRadius: '0.75rem',
padding: '0.625rem 0.875rem',
display: 'flex',
flexDirection: 'column',
gap: '0.375rem',
fontSize: '0.75rem',
color: 'var(--color-text)',
zIndex: 20,
minWidth: '140px',
}}
>
<div style={{ fontWeight: 600, fontSize: '0.8125rem', marginBottom: '0.125rem' }}>
You can...
</div>
{actions.map((a) => (
<div key={a.label} style={{ display: 'flex', flexDirection: 'column', gap: '0.0625rem' }}>
<span style={{ fontWeight: 500 }}>{a.label}</span>
<span style={{ color: 'var(--color-muted)', fontSize: '0.6875rem' }}>{a.hint}</span>
</div>
))}
</div>
);
}