Start with the data question, not the model
Before I pick a model, I map the data. What information does this feature need access to? Where does it live? Who is allowed to see it? What’s the sensitivity class — public, internal, PII, PHI? Can it leave your environment, and under what terms? The answers here dictate everything downstream. A healthcare intake assistant and a public FAQ bot look similar on the surface; their data diagrams are completely different projects.
Choose models deliberately
Frontier models (Claude, GPT, Gemini) are the default for quality, but they send data to a vendor and carry per-token costs. Smaller open-source models (Llama, Mistral, and their specialized descendants) can be self-hosted for full data control, at the cost of more infrastructure work. The right choice depends on data sensitivity, latency requirements, volume economics, and whether a BAA is needed. I don’t have a favorite vendor — I have a decision framework, and I walk through it with you.
Ground everything in your sources
Letting a model respond from its training data is how you get confident-sounding fabrications. For production use, I prefer retrieval from your actual content and source references where possible, rather than letting a model answer from general training data alone. This is more work up front, but it makes review, correction, and user trust much easier.
Log inputs, outputs, and uncertainties
AI interactions should produce an audit trail appropriate to the context — without unnecessarily storing sensitive content in logs. For regulated workflows, logging and retention need to be designed with your compliance team. For lower-risk workflows, practical logging still makes debugging and review far easier.
Test for failure modes before shipping
An AI feature is not ready to ship the first time it produces a plausible answer. Before release I run it against adversarial inputs, out-of-distribution queries, prompt-injection attempts, ambiguous inputs where the right answer is “I don’t know,” and content that should trigger a refusal or escalation. The evaluation isn’t perfect — no AI evaluation is — but it’s designed to catch the failure modes that are embarrassing, harmful, or legally significant.
Keep humans in the loop where stakes are real
For clinical triage, benefits determinations, legal drafting, and similarly consequential workflows, the right architecture is usually AI-assisted rather than AI-automated. The model drafts; a qualified human approves. This is slower than full automation and dramatically safer — and it’s almost always the shape compliance, liability, and user trust require anyway.