Building an internal agent: Iterative prompt and skill refinement
Some of our internal workflows are being used quite frequently, and usage reveals gaps in the current prompts, skills, and tools. Here is how we’re working to iterate on these internal workflows.
This is part of the Building an internal agent series.
Why does iterative refinement matter?
When companies push on AI-led automation, specifically LLM agent-driven automation, there are two major goals. The first is the short-term goal of increasing productivity. That’s a good goal. The second, and I think the more important one, is the long-term goal of helping their employees build a healthy intuition for how to use various kinds of agents to accomplish complex tasks.
If we see truly remarkable automation benefits from the LLM wave of technology, they’re not going to come from the first wave of specific tools we build, but from the output of a new class of LLM-informed users and developers. There is nowhere you can simply acquire that talent; it’s talent you have to develop in-house, and involving more folks in the iterative refinement of LLM-driven systems is the most effective approach I’ve encountered.
How are we enabling iterative refinement?
We’ve taken a handful of different approaches here, all of which are currently in use. From earliest to latest, our approaches have been:
Being responsive to feedback is our primary mechanism for solving issues. This means both responding quickly in our internal #aichannel and skimming through workflows each day to see humans interacting, for better and for worse, with the agents. This is the most valuable ongoing source of improvement.
Owner-led refinement has been our intended primary mechanism, although in practice it’s more of a secondary mechanism. We store our prompts in Notion documents, where they can be edited by their owners in real-time. Permissions vary on a per-document basis, but most prompts are editable by anyone at the company, as we try to facilitate rapid learning.
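To make the runtime side of that concrete, here’s a minimal sketch of loading a prompt from its Notion page each time a workflow runs, so owner edits take effect immediately. It assumes the notion-client Python SDK; the page ID and helper name are illustrative rather than our actual implementation.

```python
# Minimal sketch: load a workflow prompt from its Notion page at runtime so
# that owner edits take effect immediately. Assumes the notion-client SDK;
# the page ID and helper name are illustrative.
import os
from notion_client import Client

notion = Client(auth=os.environ["NOTION_TOKEN"])

def load_prompt(page_id: str) -> str:
    """Concatenate the plain text of every block on the prompt's Notion page."""
    blocks = notion.blocks.children.list(block_id=page_id)["results"]
    lines = []
    for block in blocks:
        rich_text = block.get(block["type"], {}).get("rich_text", [])
        lines.append("".join(t["plain_text"] for t in rich_text))
    return "\n".join(lines)
```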
Editable prompts alone aren’t enough; these prompts also need to be discoverable. To address that, whenever an action is driven by a workflow, we include a link to the prompt. For example, a Slack message sent by a chatbot will include a link to the prompt that generated it, as will a comment left in Jira.
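Here’s a rough sketch of what that looks like for Slack, assuming slack_sdk; the channel, message text, and prompt URL are placeholders rather than our real wiring.

```python
# Sketch: every agent-sent Slack message carries a link back to the prompt
# that generated it, so readers can edit the prompt directly. Assumes
# slack_sdk; arguments are placeholders.
import os
from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def post_with_prompt_link(channel: str, text: str, prompt_url: str) -> None:
    slack.chat_postMessage(
        channel=channel,
        text=text,  # fallback text for notifications
        blocks=[
            {"type": "section", "text": {"type": "mrkdwn", "text": text}},
            {
                "type": "context",
                "elements": [
                    {
                        "type": "mrkdwn",
                        "text": f"<{prompt_url}|Edit the prompt that generated this message>",
                    }
                ],
            },
        ],
    )
```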
Claude-enhanced, owner-led refinement, using the Datadog MCP server to pull logs into the repository where the skills live, has been fairly effective, although mostly as a technique used by the AI Engineering team rather than directly by owners. Skills are a bit of a platform, since they are used by many different workflows, so it may be inevitable that they are maintained by a central team rather than by workflow owners.
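As a standalone illustration of the log-pull step, here’s roughly what it looks like hitting the Datadog Logs Search API directly rather than going through the MCP server; the query and output path are illustrative, not our actual setup.

```python
# Sketch: pull recent error logs for a workflow and drop them next to the
# skills so Claude can review them while proposing prompt and skill edits.
# Uses the Datadog Logs Search API directly; query and paths are illustrative.
import json
import os
import pathlib

import requests

def pull_recent_errors(workflow: str, hours: int = 24) -> None:
    resp = requests.post(
        "https://api.datadoghq.com/api/v2/logs/events/search",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        json={
            "filter": {
                "query": f"service:{workflow} status:error",  # illustrative query
                "from": f"now-{hours}h",
                "to": "now",
            },
            "page": {"limit": 100},
        },
    )
    resp.raise_for_status()
    logs = [event["attributes"]["message"] for event in resp.json().get("data", [])]
    out = pathlib.Path("skills") / workflow / "recent-errors.json"  # illustrative path
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(logs, indent=2))
```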
Dashboard tracking shows how often each workflow runs and the errors associated with those runs. We also track how often each tool is used, including how frequently each skill is loaded.
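Under the hood this is just a handful of counters. A sketch of what we mean, assuming DogStatsD from the datadog Python package, with illustrative metric and tag names:

```python
# Sketch: the counters behind the dashboard. Metric and tag names are
# illustrative; assumes a DogStatsD agent is available locally.
from datadog import statsd

def record_workflow_run(workflow: str, succeeded: bool) -> None:
    statsd.increment("agent.workflow.runs", tags=[f"workflow:{workflow}"])
    if not succeeded:
        statsd.increment("agent.workflow.errors", tags=[f"workflow:{workflow}"])

def record_skill_load(skill: str, workflow: str) -> None:
    statsd.increment("agent.skill.loads", tags=[f"skill:{skill}", f"workflow:{workflow}"])
```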
My guess is that we will continue to add more refinement techniques as we go, without being able to retire any of the existing ones. That’s a little disappointing, since I’d love to get the same result with fewer techniques, but I think we’d be worse off if we cut any of them.
Next steps
What we don’t do yet, but what I see as the necessary next step to making this truly useful, is a subjective post-workflow eval that determines whether each run was effective. We already have evals that evaluate workflows in aggregate; this would use evals to grade individual workflow runs, which would give us a much more useful level of detail.
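A minimal sketch of what such a post-run eval could look like: an LLM judge grading a single run’s transcript against its goal. It assumes the anthropic Python SDK, and the rubric and model choice are illustrative rather than a settled design.

```python
# Sketch: grade one workflow run as effective or not using an LLM judge.
# Rubric and model are illustrative, not a settled design.
import os

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """You are reviewing one run of an internal agent workflow.
Given the goal and the transcript, answer with EFFECTIVE or INEFFECTIVE
on the first line, followed by one sentence of justification."""

def grade_run(goal: str, transcript: str) -> tuple[bool, str]:
    message = client.messages.create(
        model=os.environ.get("EVAL_MODEL", "claude-sonnet-4-20250514"),
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nGoal:\n{goal}\n\nTranscript:\n{transcript}",
        }],
    )
    verdict = message.content[0].text.strip()
    return verdict.upper().startswith("EFFECTIVE"), verdict
```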
How it’s going
In our experience thus far, there are roughly three workflow archetypes:
chatbots,
very well understood iterative workflows (e.g. applying the :merge: reacji to merged PRs, as discussed in code-driven workflows; see the sketch below),
and not-yet-well-understood workflows.
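As a sketch of how narrow these code-driven workflows are, here’s roughly what the reacji step looks like, assuming slack_sdk; finding the Slack message that announced the PR is omitted, and the function shape is illustrative.

```python
# Sketch: add the :merge: reacji to the Slack message that announced a PR
# once the PR merges. Assumes slack_sdk; the channel/timestamp come from
# however you track announcement messages, which is omitted here.
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def add_merge_reacji(channel: str, message_ts: str) -> None:
    try:
        slack.reactions_add(channel=channel, timestamp=message_ts, name="merge")
    except SlackApiError as e:
        if e.response["error"] != "already_reacted":  # duplicate reactions are fine
            raise
```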
Once we’ve built a code-driven workflow, it has always worked well for us, because by that point we’ve built a very focused, well-understood solution. Conversely, chatbots are an extremely broad, amorphous problem space, and I think post-run evals will provide a high-quality dataset for improving them iteratively, with a small amount of human-in-the-loop effort to nudge the evolution of their prompts and skills.
The open question, for us anyway, is how we do a better job of identifying and iterating on the not-yet-well-understood workflows, ideally without requiring a product engineer to understand and implement each of them individually. We haven’t scalably cracked this one yet, and I think scalably cracking it is the key to whether these internal agents end up somewhat useful (frequently performed tasks done by many people eventually get automated) or truly transformative (a significant percentage of tasks, even infrequent ones performed by a small number of people, get automated).