Blog

October 28, 2025
2 min read

The New Bottleneck in Agentic Systems: How to best capture domain knowledge

LLMs removed the label bottleneck but not the knowledge bottleneck for industry application
Annotation moved from “pre-model” (labels) to “post-model” (prompts, tools, guardrails, evals), but our tooling didn’t.
Hypothesis: If teams collect feedback for all agents in one place, they’ll improve them faster and scale to 10+ use-cases with the same resources.

flowchart LR
    classDef node fill:#111827,stroke:#00bcd4,color:#e5e7eb,rx:12,ry:12,border-width:1.5px;
    classDef edge color:#9ca3af;

    user[User / SME]:::node --> ui[Chat Interface]:::node

    ui -->|User Message + History| A1[Agent A]:::node
    ui --> A2[Agent B]:::node
    ui --> A3[Agent C]:::node

    A1 -->|Response + Feedback Schema|ui
    A2 --> ui
    A3 --> ui

    ui -->|structured general and custom feedback| evals[Evaluations & Telemetry]:::node
    evals --> update[Prompt / Tool Update]:::node
    update --> release[New Version Released]:::node
    release -->|Pilot cohort sees updated agents instantly| user

Why Now

LLMs shorten build cycles from months to days and relaxed our dependence on labeled data, but they didn’t eliminate the need of domain knowledge. That expertise still lives in people’s heads and scattered systems. To bridge the gap, developers often spin up one-off pilots (Streamlit, Gradio) to collect feedback. It’s tend to work fine for one use-case, but it falls apart at ten.

New bottleneck: With models easy to prototype, the constraint is knowledge capture and iteration speed, not training.
Shifted annotation: What used to happen pre-model (labels) now happens post-model (in prompts, tools, guardrails, and evals), but the tooling hasn’t caught up.
Feedback chaos: Input often comes from email/Zoom/DMs or bespoke UIs. There’s no standard path from feedback to prompt/tool change, to measurable lift.

Bottleneck

Knowledge is trapped in SMEs and pilot chats.
Feedback is fragmented across bespoke UIs and channels.
Nothing scales past a handful of use-cases: duplicated components, inconsistent telemetry, hard-to-compare experiments.
No closed loop: it’s unclear which prompt/tool change improved what metric for which cohort.

What good looks like?

Single conversation surface where users/SMEs can try any agent.
Structured feedback primitives (thumbs + reason, rubric scores, error tags, suggested responses, tool/run traces).
Assignment & cohorts (pilot groups, roles, data slices).
Experiment tracking (prompt/tool versioning tied to offline/online metrics).
Portable across use-cases (no per-app UI rebuild).

July 26, 2025
3 min read

The Missing 20%: Why Agentic Systems Need Built-In Control

Disclaimer: The views expressed here are my own and do not represent those of my employer.

The word agentic is everywhere, and for good reason — its potential to transform existing processes is unparalleled. However, what’s commonly thought of as agentic architecture typically gets you only 80% of the way there. In highly regulated industries, where risk tolerance is low and accuracy thresholds are closer to 95% or higher, that’s often not good enough. In this post, I want to highlight what I believe is missing to truly unlock the benefits of agentic systems in such environments.

First, it's worth level-setting on the definition of agentic. Different schools of thought exist on what agentic systems actually are — ranging from deterministic, tool-using workflows powered by LLMs to fully autonomous agents where the LLM plans and executes a series of actions on its own. Personally, I lean towards the autonomous spectum.

Now, while Anthropic suggests (1) customer chatbots can benefit from agentic systems by enabling more natural, flexible conversations compared to rigid workflows, in highly regulated industries like healthcare and finance, the risks outweigh the benefits.

Imagine interacting with an agentic medical chatbot that mistakenly decides to prescribe or send incorrect medication. That’s why controllability isn’t optional in these settings; it’s a prerequisite. Building agentic systems that are safe, predictable, and auditable is essential to meeting compliance standards and earning regulators' trust.

You might ask, “Why not just instruct the LLM not to take risky actions?” That assumes the LLM reliably follows instructions and is not susceptible to adversarial attacks — an assumption that doesn’t hold up in practice. If we were to take a look at situtations where LLM behaves against its built-in safety control, otherwise known as jailbreaking, GPT-4 is unable to block such attack 68.5% of the times (2). Its inability to follow instruction can also been seen in FaithEval paper (3) that shows that gpt-4o is only 47.5% accurate in answering questions with context that contradicts with the training dataset. This is often the case in industry settings where context comes from internal systems that differ from commonly held assumptions or publicly available knowledge. (More to come on this topic).

I have built a simple illustrating example of health assistant that can either tell you to consult a doctor or prescribe you medicine. Despite adding explicit instruction (twice!) to always consult the doctor first, it can be easily bypassed (and I am not a good hacker at all..):

(1) Safety Instructions Embedded in the Prompt

(2) Successfully Bypassed Safety Instruction

Instead of relying on the LLM to execute safety measures via instructions, why not bake them into the architecture? This means building controlled agents — ones that leverage LLMs for language understanding and small-scale planning, while introducing strong guarantees through mechanisms like human-in-the-loop, hard-coded constraints, and/or explicit routing. That said, there’s a fine balance to strike between explicitly defining every scenario and allowing the LLM to plan flexibly.

I hope to see better native support of building agentic systems with controllability in popular frameworks, such as google-adk (better than extending BaseAgent), making it easier to go from Figure (a) to (b) — the shift to controllable agents.