The backend has solid cost and operational observability:
- message_logs table - tokens in/out, model, cost, timestamp per call
- daily_costs table - aggregated spend with kill switch at $100/day
- logger.geminiCall() - logs slow requests (>1s) to console
- Sentry - error monitoring with 20% perf sample rate
The gap: none of this captures what Sofia actually said or why a response felt wrong. You can see a call cost $0.0004 and took 1.2s - but you can't see that she opened with "¡Hola! ¡Muy bien!" when the prompt explicitly forbids that.
The current prompt is in src/config/constants.ts → DEFAULT_SYSTEM_PROMPT. It has real specificity:
NEVER open with exclamatory praise like "¡Perfecto!", "¡Muy bien!" or greetings like "¡Hola!"
NEVER write "haha", "jaja", "lol"
No asterisks, markdown, or formatting. Plain text only.
These are clearly rules that got added because Sofia broke them in production. Without traces, the feedback loop is:
- User complains Sofia sounds weird
- You guess what happened
- You patch the prompt
- You wait for more complaints
With traces, it's:
- User complains
- You pull the exact conversation in Langfuse
- You see the rule that broke and when
- You fix it and A/B the new prompt version
Langfuse is purpose-built for this. Free tier is generous, self-hostable, and the integration is ~20 lines.
Resources:
- Langfuse homepage
- TypeScript SDK guide
- Prompt management - store/version DEFAULT_SYSTEM_PROMPT here instead of in code
- Scores/evals - add thumbs up/down in the app, surfaces in Langfuse UI
- Hosted cloud - free up to 50k observations/month
```shell
npm install langfuse
```

```
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com
```
The entire integration goes in src/services/geminiService.ts, wrapping the existing chat() method. No other files need to change.
```typescript
import { Langfuse } from 'langfuse';

// Add to constructor or as a module-level singleton
const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
});

// In chat() / chatStream() / chatWithAudio(), wrap the fetch call:
async chat(messages, systemPrompt, userId) {
  const trace = langfuse.trace({
    name: 'sofia-chat',
    userId,
    metadata: { model: GEMINI_CONFIG.model },
  });
  const generation = trace.generation({
    name: 'gemini-completion',
    model: GEMINI_CONFIG.model,
    input: messages,
    // Generations have no top-level systemPrompt field; metadata keeps it queryable.
    metadata: { systemPrompt },
  });
  try {
    // ... existing fetch logic ...
    generation.end({
      output: result.message,
      usage: {
        input: result.tokensInput,
        output: result.tokensOutput,
      },
    });
    return result;
  } catch (error) {
    generation.end({ level: 'ERROR', statusMessage: error.message });
    throw error;
  }
}
// In a serverless runtime, also call `await langfuse.flushAsync()` before the
// handler returns so queued events aren't dropped.
```

If the app gets a thumbs-down button, wire it to:
```typescript
langfuse.score({
  traceId, // pass this back in the API response
  name: 'user-feedback',
  value: -1, // -1 bad, 1 good
});
```

This lets you filter Langfuse to "only show bad responses" and see exactly what prompted them.
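The only plumbing the score call needs is the trace id making a round trip through the app. A minimal sketch, assuming api/chat.ts returns JSON and a hypothetical api/feedback.ts route receives the vote (the handler and payload names are illustrative, not existing code):

```typescript
// Hypothetical payload sent by the app's thumbs up/down buttons.
type Feedback = { traceId: string; good: boolean };

// Map a thumb to the score object Langfuse expects (-1 bad, 1 good).
export function toScore(fb: Feedback) {
  return { traceId: fb.traceId, name: 'user-feedback', value: fb.good ? 1 : -1 };
}

// api/chat.ts - include the trace id alongside the reply:
//   res.json({ message: result.message, traceId: trace.id });
//
// api/feedback.ts - record the vote, reusing the langfuse singleton:
//   langfuse.score(toScore(req.body));
//   await langfuse.flushAsync(); // flush before a serverless handler exits
```

Keeping the mapping in a small pure function makes it trivial to extend later (e.g. a 1-5 rating) without touching the route handlers.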
If adding Langfuse feels like too much right now, the smallest useful change is logging message content to a separate Supabase table:
```sql
create table conversation_samples (
  id uuid primary key default gen_random_uuid(),
  user_id text,
  session_id text,
  system_prompt text,
  user_message text,
  assistant_response text,
  tokens_input int,
  tokens_output int,
  created_at timestamptz default now()
);
```

Then in api/chat.ts after a successful response, insert a row. No PII beyond what's already in message_logs. This gives you a queryable history to grep when something breaks.
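The insert itself is a one-liner once the row is assembled. A sketch, assuming the Supabase JS client is already available in api/chat.ts (the argument names are illustrative; only the column names come from the table above):

```typescript
// Build a conversation_samples row from values the chat handler already has in scope.
export function toSampleRow(args: {
  userId: string; sessionId: string; systemPrompt: string;
  userMessage: string; assistantResponse: string;
  tokensInput: number; tokensOutput: number;
}) {
  return {
    user_id: args.userId,
    session_id: args.sessionId,
    system_prompt: args.systemPrompt,
    user_message: args.userMessage,
    assistant_response: args.assistantResponse,
    tokens_input: args.tokensInput,
    tokens_output: args.tokensOutput,
  };
}

// After a successful response in api/chat.ts:
//   await supabase.from('conversation_samples').insert(toSampleRow({ ... }));
// Consider fire-and-forget here: a failed sample insert shouldn't fail the chat.
```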
Once integrated, these views will be immediately useful:
- Traces → filter by proficiency level - does "beginner" vs "intermediate" change response quality?
- Generations → sort by latency - are there prompt/message combos that consistently run slow?
- Prompt versions - migrate DEFAULT_SYSTEM_PROMPT out of code and into Langfuse's prompt manager. Then you can iterate without deploys.
- Sessions - group by userId to see full conversation arcs, not just individual turns.
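For the prompt-versions migration, geminiService.ts can fetch the managed prompt at call time while keeping the in-code constant as a fallback, so a Langfuse outage can't take Sofia down with it. A sketch, assuming the prompt is registered under the name 'sofia-system' (that name and the surrounding wiring are assumptions; getPrompt is the SDK's prompt-management call and is cached client-side):

```typescript
// Prefer the remotely managed prompt; fall back to the in-code constant
// when the prompt manager returns nothing usable.
export function pickPrompt(remote: string | null, fallback: string): string {
  return remote && remote.trim().length > 0 ? remote : fallback;
}

// In geminiService.ts, reusing the langfuse singleton:
//   let remote: string | null = null;
//   try {
//     const managed = await langfuse.getPrompt('sofia-system');
//     remote = managed.prompt;
//   } catch { /* fall through to the constant */ }
//   const systemPrompt = pickPrompt(remote, DEFAULT_SYSTEM_PROMPT);
```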
chatWithAudio() in geminiService.ts has the most complex prompt - it injects an 8-point instruction list as the user's message to force verbatim transcription and pronunciation feedback. This is the most likely source of weird Sofia behavior. Traces here will show whether Gemini is actually following those instructions or hallucinating transcriptions.
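One last observation: the banned-opener rules quoted earlier are mechanically checkable, so every trace could be scored automatically instead of waiting for complaints. A sketch of that idea (the rule lists are abridged from DEFAULT_SYSTEM_PROMPT; the scoring call reuses the langfuse singleton):

```typescript
// Openers and tokens DEFAULT_SYSTEM_PROMPT forbids; extend as rules are added.
const BANNED_OPENERS = ['¡Perfecto!', '¡Muy bien!', '¡Hola!'];
const BANNED_ANYWHERE = ['haha', 'jaja', 'lol', '*', '#'];

// Return the first rule a response breaks, or null if it's clean.
export function findViolation(response: string): string | null {
  const trimmed = response.trimStart();
  for (const opener of BANNED_OPENERS) {
    if (trimmed.startsWith(opener)) return `banned opener: ${opener}`;
  }
  const lower = response.toLowerCase();
  for (const token of BANNED_ANYWHERE) {
    if (lower.includes(token)) return `banned token: ${token}`;
  }
  return null;
}

// After generation.end() in chat():
//   const violation = findViolation(result.message);
//   if (violation) {
//     langfuse.score({ traceId: trace.id, name: 'rule-violation',
//                      value: 1, comment: violation });
//   }
```

Filtering traces by a rule-violation score turns "Sofia sounds weird" reports into a precise count of which rule regressed and when.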