AI is everywhere right now — on roadmaps, in boardrooms, and across product conversations. But once you look past the headlines, the questions that matter are still very practical: What’s actually working in real deployments? Where is AI creating measurable value today? And what are teams getting wrong when they try to ship AI fast?
To get a grounded, experience-backed view — without the hype — I interviewed Tao An, Founder and CEO of Beijing Feimo Network Technology (FIM Tech). I’m Sanjay Singhania, a Project Manager at Capital Numbers, and this interview was conducted in a written Q&A format: I shared the questions first, and Tao provided detailed responses in writing.
In this conversation, we cover what “AI is changing everything” really means today, which trends are delivering real ROI, how companies can aim for impact in 3–6 months, and where key technical decisions often go wrong — like choosing larger models vs improving the system, picking RAG vs fine-tuning, and deciding whether AI agents are helpful or just unnecessary complexity. We also address how to explain hallucinations to non-technical stakeholders, where costs surprise teams, and what Tao expects to improve most over the next 12 months.
If you’re building AI products or exploring adoption in 2026, I think you’ll find this perspective refreshingly direct — and immediately useful.
(And if you’re looking for hands-on execution support, my team at Capital Numbers helps businesses build production-ready solutions through our AI development services — from LLM applications and RAG to evaluation and scalable engineering.)
About the Expert
Tao An is the founder and CEO of Beijing Feimo Network Technology (FIM Tech), an AI solutions company that’s been deploying systems for government and enterprise clients since 2021. He’s currently completing his MS in Artificial Intelligence at Hawaii Pacific University and has published research on LLM cognitive architectures. His company specializes in document intelligence, RAG systems, and AI-powered contract management for provincial government departments and state-owned enterprises.
Interview Questions & Responses
Sanjay: When you hear “AI is changing everything,” what’s the most accurate version of that statement today?
Tao: Honestly? “AI is very good at specific boring tasks that used to take humans a lot of time.”
The narrative about AI replacing everything is exhausting. After deploying these systems for three years across government departments and enterprises, here’s what I actually see: AI is really good at grinding through documents, extracting structured data, routing things to the right people. It’s terrible at making judgments when stakes are high.
Take our medical policy platform. It can process thousands of pages of health regulations in seconds. But do we let it make treatment recommendations? Hell no. A doctor still looks at the output and makes the call.
The change is real. We’re moving faster, getting more done. Think of it like going from bicycles to motorcycles. You’re still steering, you’re just covering way more ground.
But there’s one area that IS changing fast: coding. Tools like Claude Code and similar agent SDK implementations actually work. Not perfectly, but well enough to change how people code. You describe what you want, the agent writes it, tests it, debugs it. This is the first mainstream use case where true agentic behavior makes sense.
And the competition is heating up. Chinese model makers are going all-in on coding capabilities. GLM-4, DeepSeek V4, they’re all racing to build better code models. Why? Because unlike most “AI agent” hype, coding agents have a clear feedback loop: code either runs or it doesn’t. That makes them actually useful instead of just impressive demos.
This is what real AI progress looks like. Focused tools that solve specific problems well, with measurable outcomes.
Sanjay: Which AI trends are genuinely delivering value in real deployments?
Tao: I’ll tell you what’s actually working in production.
Document processing is the unsexy winner. We cut contract review time from 3 days to 4 hours for a provincial government department. The AI is maybe 80% accurate. That means human experts can focus on the 20% that actually requires expertise instead of reading boilerplate for hours.
RAG systems when done properly. And I stress “properly” because most RAG implementations I’ve seen are garbage. Our policy analysis system works because we built 13 specialized knowledge bases. We didn’t just throw documents into a vector database and call it a day. Government clients need to know exactly which document section informed each answer. Generic LLMs can’t do that.
Workflow automation with AI components. Not “agentic AI,” just good old deterministic workflows enhanced with LLMs for specific steps. Generating contract first drafts, routing approvals, checking compliance boxes. We deployed a seal management system. Sounds boring, right? Saved one client 15 hours a week. That’s the real ROI. Nobody writes articles about seal management, everyone wants to write about chatbots.
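The pattern Tao describes — a deterministic workflow with an LLM at exactly one step — can be sketched in a few lines. Everything here is illustrative: `generate_draft` stands in for a real LLM call, and the routing table is a made-up example, not FIM Tech's actual logic.

```python
# Minimal sketch of a deterministic workflow with an LLM at one step.
# The routing table is hardcoded, so every decision is auditable.

ROUTING = {  # hypothetical decision table: same input, same route, every time
    "procurement": "legal_team_a",
    "employment": "hr_legal",
}

def generate_draft(contract_type: str, details: dict) -> str:
    # The only non-deterministic step in a real system: an LLM call that
    # produces a first draft from a template plus the details.
    # Stubbed here so the workflow itself stays testable.
    return f"[DRAFT:{contract_type}] parties={details.get('parties')}"

def process_contract(contract_type: str, details: dict) -> dict:
    if contract_type not in ROUTING:          # predictable, known failure mode
        return {"status": "rejected", "reason": "unknown contract type"}
    draft = generate_draft(contract_type, details)
    return {
        "status": "drafted",
        "assignee": ROUTING[contract_type],   # traceable routing decision
        "draft": draft,
    }
```

The point of the structure: if a regulator asks why a contract went to a particular team, the answer is a dictionary lookup, not a model's reasoning trace.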
What’s NOT working? Most chatbot projects. Marketing hype about “autonomous AI agents” that are just workflows in disguise. Customer service bots that make people want to throw their phones.
Sanjay: If a company wants measurable impact in 3–6 months, where should they start?
Tao: Start with pain. Start with what makes people miserable.
Don’t walk into a meeting and say “what could we do with AI?” Ask “what process makes your team want to quit?” Find something that’s high-volume, repetitive, and drives people crazy.
Then pilot with ONE motivated team. One team that actually wants this problem solved. Pick the team that’s most frustrated with the current process.
Use existing tools. Claude 3.5 Sonnet, o1, DeepSeek V3, whatever. Don’t build custom models. Don’t spend six months on “AI strategy.” Just solve the damn problem.
For us, the contract generation system delivered ROI in two months. We started with lawyers spending 3 hours formatting documents. We automated the boring parts. The technology was straightforward. The value came from solving a real pain point.
Here’s what doesn’t work: “Let’s explore AI use cases” (translation: endless meetings with no outcome). Building infrastructure before you know what you’re building for. Waiting for the “perfect” solution.
Sanjay: What’s your decision rule for “use a bigger model” vs. “improve the system around it”?
Tao: Fix the plumbing before buying a bigger pump.
I’ve watched teams throw o1 at problems that Claude Haiku could handle with decent prompts. It’s like buying a Ferrari when you need to learn to drive first.
My hierarchy:
- Write better prompts (takes 2 hours, costs nothing)
- Fix your data and retrieval (takes 2 days, saves thousands)
- Add reasoning structures (takes 2 weeks, still cheaper than bigger models)
- Upgrade the model (last resort, highest cost)
Real example: We started building a document review system and almost fine-tuned a massive model. Then we stopped and rebuilt our retrieval architecture instead. Proper knowledge base segmentation, better chunking strategy, hybrid search. Same quality, 90% lower cost.
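The "hybrid search" Tao mentions usually means blending a keyword (lexical) score with a vector-similarity score. Here is a toy sketch of that blend, assuming 2-D stand-in embeddings; a real system would use an embedding model and a proper index.

```python
# Hedged sketch of hybrid retrieval scoring: combine a word-overlap score
# with cosine similarity over toy embedding vectors. alpha is an assumed
# tuning knob, not a recommended value.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, text: str) -> float:
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    # docs: list of (text, vector); alpha weights lexical vs vector score
    scored = [
        (alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, vec), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)]
```

Tuning this kind of scoring, plus better chunking, is the cheap work that often substitutes for a bigger model.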
Bigger models ARE appropriate when you’ve genuinely maxed out everything else. Or when you’re dealing with truly complex reasoning that smaller models can’t handle. But that’s maybe 10% of use cases.
Sanjay: When should teams choose retrieval (RAG) over training or fine-tuning?
Tao: RAG should be your default. Like, seriously, just start with RAG.
Fine-tuning makes sense for:
- Style/format consistency (you want outputs formatted exactly a certain way)
- Very specialized domain language where even Sonnet struggles
- Cases where you absolutely cannot have retrieval latency
Everything else? RAG.
Why we use RAG for government clients:
- Policies update constantly. Retraining models every week is insane
- They need audit trails. “Which document said this?” is a legal requirement
- We work across multiple domains: health policy, procurement rules, contract law
- We don’t have budgets to fine-tune large models every time regulations change
The big mental shift: RAG makes models accountable. When a government official asks “why did the system say this?” we can point to page 47 of document XYZ. Fine-tuned models just give you an answer and a shrug.
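The audit-trail idea can be made concrete: every answer carries the document and page it came from. This sketch uses a toy word-overlap retriever and invented document names; in production the retrieval step would run over a real index and an LLM would rephrase the retrieved text.

```python
# Sketch of RAG accountability: answers are returned with their source.
# KNOWLEDGE_BASE entries and retrieve() are illustrative stand-ins.

KNOWLEDGE_BASE = [
    {"doc": "Procurement Regulation XYZ", "page": 47,
     "text": "Contracts above the threshold require two approvals."},
    {"doc": "Health Policy Handbook", "page": 12,
     "text": "Reimbursement claims must be filed within 90 days."},
]

def _words(s: str) -> set:
    return {w.strip(".,?") for w in s.lower().split()}

def retrieve(question: str) -> dict:
    # Toy retrieval: pick the chunk sharing the most words with the question.
    q = _words(question)
    return max(KNOWLEDGE_BASE, key=lambda c: len(q & _words(c["text"])))

def answer_with_citation(question: str) -> dict:
    chunk = retrieve(question)
    return {
        "answer": chunk["text"],   # a real system would have the LLM rephrase
        "source": chunk["doc"],    # the citation is what makes the output
        "page": chunk["page"],     # auditable when an official asks "why?"
    }
```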
Sanjay: When do AI agents actually help, and when do they become unnecessary complexity?
Tao: First, let’s be clear about what we mean by “agent.” I’m talking about systems with autonomous planning capabilities, like Claude’s Agent SDK or similar frameworks that can break down goals, choose tools dynamically, and iterate based on intermediate results. That’s different from workflows.
Workflows vs Agents:
- Workflows: Predefined steps with conditional branching. Think “if contract type A, use template X, then route to department Y.” You can code this as a decision tree.
- True agents: Dynamic planning, tool selection, and strategy adjustment based on what they discover. They decompose goals into subgoals and iterate.
For government and enterprise deployments, workflows win almost every time:
- Auditable: You can trace every decision
- Deterministic: Same input, same process
- Predictable failures: You know where it breaks
- Compliance-friendly: Regulators understand decision trees
We’ve built dozens of systems for government clients. They’re all sophisticated workflows, not agents. Contract generation, document routing, compliance checking. These look “intelligent” to users but are hardcoded logic under the hood.
When true agents make sense:
- Coding tasks: Claude Code, Cursor, and similar tools work because code has immediate feedback. Write code → test → fix → iterate. The agent can self-correct.
- Exploratory research where you can’t predict the path
- Complex problem-solving requiring strategy adjustment
- Tasks where the “right approach” depends on what you find
When they don’t (most cases):
- Production systems requiring consistency
- Regulated industries needing audit trails
- Anywhere debugging agent reasoning exceeds the value
- Tasks without clear verification methods (unlike code that either runs or doesn’t)
I spent months researching cognitive architectures for LLMs. Memory management, reasoning loops, all of it. Published a paper on it. My conclusion? True agents are fascinating research. For production? Start with workflows. 90% of what people call “AI agents” should be workflows with good prompts.
The coding domain is the exception that proves the rule. It works because you have objective success criteria and fast feedback loops.
Sanjay: What’s the most honest way to explain hallucinations to a non-technical stakeholder?
Tao: “The AI is making educated guesses. Sometimes it guesses wrong and sounds right.”
I tell people: imagine someone who’s read every book in a library but took no notes and has a terrible memory. They can have intelligent conversations about almost anything. Sometimes they’re brilliant. Sometimes they confidently tell you that Stockholm is the capital of Norway because it sounds right.
That’s just how the technology works. We can’t fix that.

What we CAN do:
- Use RAG so answers come from actual documents
- Show confidence scores (though these are unreliable too)
- Force structured outputs instead of free-form text
- Have humans review anything high-stakes
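"Force structured outputs" from the list above can be sketched as a validation gate: the model's reply must be JSON with grounding fields, and anything else goes to human review instead of to users. The field names here are assumptions for illustration.

```python
# Sketch of a structured-output guard. Free-form text and ungrounded
# answers are rejected; only parseable, source-carrying replies pass.
import json

REQUIRED_FIELDS = {"answer", "source_doc", "confidence"}  # assumed schema

def validate_output(raw: str):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free-form text: route to human review, never to users
    if not isinstance(data, dict):
        return None
    if not REQUIRED_FIELDS <= data.keys():
        return None  # a missing source counts as a failure, not a warning
    return data
```

A confident-sounding sentence like "Stockholm is the capital of Norway" simply never clears the gate, because it isn't structured and cites nothing.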
For our government systems, we NEVER deploy LLMs without grounding in retrieved documents. The question is “how do we catch hallucinations before they reach users?”
Sanjay: If you could require every team to do one thing before launching AI, what would it be?
Tao: Test your failure modes. Seriously test them.
Everyone optimizes for the happy path when the AI works great. But AI fails in weird, unpredictable ways. It’ll confidently give you completely wrong legal advice. It’ll miss critical contract clauses. It’ll route urgent documents to the wrong department.
Before we launch anything:
- We deliberately feed it ambiguous inputs
- We give it contradictory information
- We test edge cases that shouldn’t happen but will
- We ask “what happens when this goes wrong?”
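The checklist above can be turned into an executable failure-mode suite: a table of deliberately nasty inputs paired with the safe behavior the system must produce. `classify_document` here is a hypothetical stand-in for the system under test; its logic is only there to make the harness runnable.

```python
# Sketch of pre-launch failure-mode testing: adversarial inputs plus the
# required safe behavior, run as one suite before anything ships.

def classify_document(text: str) -> str:
    # Toy system under test: refuses instead of guessing on empty or
    # contradictory input. The harness below is the point, not this logic.
    if not text.strip():
        return "needs_human_review"
    low = text.lower()
    if "urgent" in low and "not urgent" in low:
        return "needs_human_review"       # contradictory signals
    return "auto_routed"

FAILURE_CASES = [
    ("", "needs_human_review"),                               # empty input
    ("URGENT... actually not urgent", "needs_human_review"),  # contradiction
    ("Standard procurement form", "auto_routed"),             # happy path
]

def run_failure_suite() -> bool:
    return all(classify_document(inp) == expected
               for inp, expected in FAILURE_CASES)
```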
Most teams don’t do this. Then they’re shocked when production fails in ways they never imagined.
The most successful AI projects have the best error handling. They assume failure and plan for it.
Sanjay: Where do AI costs most commonly surprise teams?
Tao: Three places, consistently:
API costs scaling up. You prototype with $100/month. Great! Then you deploy to 1,000 users and suddenly it’s $10,000/month. Whoops. Solution: aggressive caching, use smaller models for simple tasks, set user quotas. We use Haiku for 70% of queries, only hitting Sonnet or o1 for complex ones.
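The mitigations Tao lists — caching plus routing simple queries to a cheap model — look roughly like this. The model names and the complexity heuristic are placeholder assumptions; a real router would call the provider's API and use a smarter classifier.

```python
# Sketch of cost control: cache repeated queries, send easy ones to a
# cheap model. Thresholds here are illustrative, not production values.
from functools import lru_cache

CHEAP_MODEL = "small-model"      # placeholder names, not real model IDs
EXPENSIVE_MODEL = "large-model"

def pick_model(query: str) -> str:
    # Crude complexity heuristic: long or multi-question prompts escalate;
    # everything else stays on the cheap model (Tao's "70% of queries").
    if len(query) > 500 or query.count("?") > 1:
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

@lru_cache(maxsize=10_000)
def answer(query: str) -> str:
    model = pick_model(query)
    # Real code would call the provider's API here; stubbed for the sketch.
    return f"[{model}] response to: {query[:40]}"
```

The `lru_cache` means a repeated query costs nothing the second time — which is where most of the 100x surprise gets clawed back.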
Human review overhead. AI shifts work around. You still need lawyers reviewing contracts, just reviewing AI outputs instead of blank pages. If you thought AI would cut headcount, you’re wrong. You still need the same people doing different work.
Everything around the model. The LLM is maybe 20% of your work. The other 80%: building data pipelines, setting up retrieval, designing UI, monitoring performance, maintaining the system. Teams see the API price and think that’s the cost. Wrong. That’s just the beginning.
Hidden cost: opportunity cost of failed experiments. Most AI projects don’t ship. Budget for failure. Kill projects quickly when they’re not working. Don’t let sunk cost fallacy keep bad projects alive.
Sanjay: What do you expect to improve the most in the next 12 months: models, tooling, data pipelines, or evaluation?
Tao: Evaluation and monitoring will make the biggest leap, because they’re the current bottleneck.
We have powerful models. We have decent tooling. But we’re still debugging AI systems like it’s 2010: manually reviewing outputs, assessing quality by vibes, and relying on gut feelings about whether performance is degrading.
This is unsustainable at scale.
What’s coming:
- Automated quality metrics that actually mean something (beyond perplexity scores)
- Real-time monitoring that catches problems before users do
- Systematic evaluation frameworks instead of cherry-picked examples
- Tools that help us understand WHY models fail
Models will incrementally improve. Opus 5, o3, whatever comes next. But evaluation infrastructure will unlock value we’re already sitting on.
You can’t optimize what you can’t measure. Right now, measurement is primitive. That’s changing fast.
For practitioners: invest in logging, monitoring, evaluation frameworks NOW. That infrastructure will matter more than waiting for the next model release.
Key Takeaways
- AI accelerates specific tasks but doesn’t replace judgment. Coding agents (Claude Code, Cursor) are the exception where true agentic behavior works due to immediate feedback loops.
- Three things actually deliver ROI: Document processing at 80% accuracy beats 100% manual work. Properly implemented RAG with specialized knowledge bases provides accountability. Workflow automation with AI components solves real pain points.
- Start with pain, not technology. Find one high-volume, frustrating process. Pilot with a motivated team. Use existing tools (Claude 3.5 Sonnet, o1, DeepSeek V3). Ship in 3–6 months.
- Optimize the system before upgrading the model. Hierarchy: better prompts → improved retrieval → reasoning structures → bigger models. Most problems don’t need Opus when Haiku with good architecture works.
- RAG is the default for enterprise. Fine-tuning makes sense for style consistency. RAG wins for accountability, frequent updates, and audit trails that government and enterprise require.
- Workflows beat agents in production. True agents (with dynamic planning and tool selection) work for coding. Everything else? Use deterministic workflows with LLMs at specific steps. 90% of “AI agents” should be workflows.
- Hallucinations are features of the technology, not bugs. Manage them with RAG grounding, confidence scores, structured outputs, and human review. Never deploy LLMs without document retrieval for high-stakes decisions.
- Test failure modes before launch. Feed ambiguous inputs, contradictory information, and edge cases. Success comes from error handling, not model quality.
- Hidden costs surprise teams: API costs scaling 100x, human review overhead that doesn’t disappear, and integration work comprising 80% of the effort. Most AI projects fail — budget for it.
- Evaluation infrastructure is the next breakthrough. We have powerful models but primitive measurement capabilities. Invest in monitoring, logging, and systematic evaluation now.
Final Thoughts from Tao
The AI hype cycle is exhausting. Every week there’s a new “breakthrough” that will “change everything.” Most of it is noise.
What actually works? Boring, practical applications. Document processing. Information extraction. Workflow automation. Systems that do one thing well.
The companies succeeding with AI solve real problems, measure results, and build sustainable systems. They have the best monitoring and error handling, even if their models are nothing special.
Start small. Solve real problems. Measure everything. Scale what works.
That’s the truth about AI right now.
A Great One-on-One Session with Tao
What I appreciated most about Tao’s answers is how consistently they point back to reality: in 2026, winning with AI is rarely about chasing the biggest model. It’s about making smart architecture choices, investing in evaluation, and building reliable systems that can operate under real-world constraints — cost, latency, data quality, and user expectations.
This interview also reinforced something I see often in delivery work: teams don’t fail because AI “doesn’t work.” They struggle because the problem isn’t clearly defined, success metrics aren’t measurable, or the system around the model isn’t designed for production. Whether it’s deciding between RAG and fine-tuning, knowing when agents truly help, or setting the right expectations around hallucinations — disciplined thinking beats excitement every time.
I’m grateful to Tao An for sharing such candid insights and for taking the time to answer in depth. I hope this interview helps product leaders, founders, and engineering teams cut through the noise and make decisions that lead to real outcomes.
