The backend has solid cost and operational observability:
- message_logs table - tokens in/out, model, cost, timestamp per call
- daily_costs table - aggregated spend with kill switch at $100/day
- logger.geminiCall() - logs slow requests (>1s) to console
- Sentry - error monitoring with 20% perf sample rate
The gap: none of this captures what Sofia actually said or why a response felt wrong. You can see a call cost $0.0004 and took 1.2s - but you can't see that she opened with "¡Hola! ¡Muy bien!" when the prompt explicitly forbids that.
The current prompt is in src/config/constants.ts → DEFAULT_SYSTEM_PROMPT. It has real specificity:
NEVER open with exclamatory praise like "¡Perfecto!", "¡Muy bien!" or greetings like "¡Hola!"
NEVER write "haha", "jaja", "lol"
No asterisks, markdown, or formatting. Plain text only.
These are clearly rules that got added because Sofia broke them in production. Without traces, the feedback loop is:
- User complains Sofia sounds weird
- You guess what happened
- You patch the prompt
- You wait for more complaints
With traces, it's:
- User complains
- You pull the exact conversation in Langfuse
- You see the rule that broke and when
- You fix it and A/B the new prompt version
Langfuse is purpose-built for this. Free tier is generous, self-hostable, and the integration is ~20 lines.
Resources:
- Langfuse homepage
- TypeScript SDK guide
- Prompt management - store/version DEFAULT_SYSTEM_PROMPT here instead of in code
- Scores/evals - add thumbs up/down in the app, surfaces in Langfuse UI
- Hosted cloud - free up to 50k observations/month
```shell
npm install langfuse
```

```
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com
```
The entire integration goes in src/services/geminiService.ts, wrapping the existing chat() method. No other files need to change.
```typescript
import { Langfuse } from 'langfuse';

// Add to constructor or as a module-level singleton
const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
});

// In chat() / chatStream() / chatWithAudio(), wrap the fetch call:
async chat(messages, systemPrompt, userId) {
  const trace = langfuse.trace({
    name: 'sofia-chat',
    userId,
    metadata: { model: GEMINI_CONFIG.model },
  });
  const generation = trace.generation({
    name: 'gemini-completion',
    model: GEMINI_CONFIG.model,
    input: messages,
    // Generations have no top-level systemPrompt field; metadata keeps it queryable.
    metadata: { systemPrompt },
  });
  try {
    // ... existing fetch logic ...
    generation.end({
      output: result.message,
      usage: {
        input: result.tokensInput,
        output: result.tokensOutput,
      },
    });
    return result;
  } catch (error) {
    generation.end({ level: 'ERROR', statusMessage: error.message });
    throw error;
  }
}
// In a serverless runtime, also call `await langfuse.flushAsync()` before the
// handler returns so queued events aren't dropped.
```

If the app gets a thumbs-down button, wire it to:
```typescript
langfuse.score({
  traceId, // pass this back in the API response
  name: 'user-feedback',
  value: -1, // -1 bad, 1 good
});
```

This lets you filter Langfuse to "only show bad responses" and see exactly what prompted them.
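The only plumbing the score call needs is the trace id making a round trip through the app. A minimal sketch, assuming api/chat.ts returns JSON and a hypothetical api/feedback.ts route receives the vote (the handler and payload names are illustrative, not existing code):

```typescript
// Hypothetical payload sent by the app's thumbs up/down buttons.
type Feedback = { traceId: string; good: boolean };

// Map a thumb to the score object Langfuse expects (-1 bad, 1 good).
export function toScore(fb: Feedback) {
  return { traceId: fb.traceId, name: 'user-feedback', value: fb.good ? 1 : -1 };
}

// api/chat.ts - include the trace id alongside the reply:
//   res.json({ message: result.message, traceId: trace.id });
//
// api/feedback.ts - record the vote, reusing the langfuse singleton:
//   langfuse.score(toScore(req.body));
//   await langfuse.flushAsync(); // flush before a serverless handler exits
```

Keeping the mapping in a small pure function makes it trivial to extend later (e.g. a 1-5 rating) without touching the route handlers.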
If adding Langfuse feels like too much right now, the smallest useful change is logging message content to a separate Supabase table:
```sql
create table conversation_samples (
  id uuid primary key default gen_random_uuid(),
  user_id text,
  session_id text,
  system_prompt text,
  user_message text,
  assistant_response text,
  tokens_input int,
  tokens_output int,
  created_at timestamptz default now()
);
```

Then in api/chat.ts after a successful response, insert a row. No PII beyond what's already in message_logs. This gives you a queryable history to grep when something breaks.
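The insert itself is a one-liner once the row is assembled. A sketch, assuming the Supabase JS client is already available in api/chat.ts (the argument names are illustrative; only the column names come from the table above):

```typescript
// Build a conversation_samples row from values the chat handler already has in scope.
export function toSampleRow(args: {
  userId: string; sessionId: string; systemPrompt: string;
  userMessage: string; assistantResponse: string;
  tokensInput: number; tokensOutput: number;
}) {
  return {
    user_id: args.userId,
    session_id: args.sessionId,
    system_prompt: args.systemPrompt,
    user_message: args.userMessage,
    assistant_response: args.assistantResponse,
    tokens_input: args.tokensInput,
    tokens_output: args.tokensOutput,
  };
}

// After a successful response in api/chat.ts:
//   await supabase.from('conversation_samples').insert(toSampleRow({ ... }));
// Consider fire-and-forget here: a failed sample insert shouldn't fail the chat.
```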
Once integrated, these views will be immediately useful:
- Traces → filter by proficiency level - does "beginner" vs "intermediate" change response quality?
- Generations → sort by latency - are there prompt/message combos that consistently run slow?
- Prompt versions - migrate DEFAULT_SYSTEM_PROMPT out of code and into Langfuse's prompt manager. Then you can iterate without deploys.
- Sessions - group by userId to see full conversation arcs, not just individual turns.
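For the prompt-versions migration, geminiService.ts can fetch the managed prompt at call time while keeping the in-code constant as a fallback, so a Langfuse outage can't take Sofia down with it. A sketch, assuming the prompt is registered under the name 'sofia-system' (that name and the surrounding wiring are assumptions; getPrompt is the SDK's prompt-management call and is cached client-side):

```typescript
// Prefer the remotely managed prompt; fall back to the in-code constant
// when the prompt manager returns nothing usable.
export function pickPrompt(remote: string | null, fallback: string): string {
  return remote && remote.trim().length > 0 ? remote : fallback;
}

// In geminiService.ts, reusing the langfuse singleton:
//   let remote: string | null = null;
//   try {
//     const managed = await langfuse.getPrompt('sofia-system');
//     remote = managed.prompt;
//   } catch { /* fall through to the constant */ }
//   const systemPrompt = pickPrompt(remote, DEFAULT_SYSTEM_PROMPT);
```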
chatWithAudio() in geminiService.ts has the most complex prompt - it injects an 8-point instruction list as the user's message to force verbatim transcription and pronunciation feedback. This is the most likely source of weird Sofia behavior. Traces here will show whether Gemini is actually following those instructions or hallucinating transcriptions.
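One last observation: the banned-opener rules quoted earlier are mechanically checkable, so every trace could be scored automatically instead of waiting for complaints. A sketch of that idea (the rule lists are abridged from DEFAULT_SYSTEM_PROMPT; the scoring call reuses the langfuse singleton):

```typescript
// Openers and tokens DEFAULT_SYSTEM_PROMPT forbids; extend as rules are added.
const BANNED_OPENERS = ['¡Perfecto!', '¡Muy bien!', '¡Hola!'];
const BANNED_ANYWHERE = ['haha', 'jaja', 'lol', '*', '#'];

// Return the first rule a response breaks, or null if it's clean.
export function findViolation(response: string): string | null {
  const trimmed = response.trimStart();
  for (const opener of BANNED_OPENERS) {
    if (trimmed.startsWith(opener)) return `banned opener: ${opener}`;
  }
  const lower = response.toLowerCase();
  for (const token of BANNED_ANYWHERE) {
    if (lower.includes(token)) return `banned token: ${token}`;
  }
  return null;
}

// After generation.end() in chat():
//   const violation = findViolation(result.message);
//   if (violation) {
//     langfuse.score({ traceId: trace.id, name: 'rule-violation',
//                      value: 1, comment: violation });
//   }
```

Filtering traces by a rule-violation score turns "Sofia sounds weird" reports into a precise count of which rule regressed and when.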