Over the past year and a half I’ve been the sole engineer on the retrieval and evaluation stack of a chatbot deployed nationally for the Italian Public Administration: a system called Camilla that helps civil servants navigate public tenders.
It processes real queries from real users, in a regulated context, with no room to quietly paper over failures. That experience forced clarity on questions I don’t often see written up: How do you evaluate an agentic system in a way that actually catches what breaks in production? How do you build hybrid retrieval over a corpus that’s inconsistent by design? How do you ship an AI product to the Italian PA under EU AI Act disclosure requirements?
I wrote about the architecture and the governance decisions — the ones that turned out to matter more than the ML choices — in a piece for Agenda Digitale.