I’ve been running a chat assistant built on OpenAI for the past year. My biggest lessons have come from analyzing the assistant’s responses and finding ways to optimize them, for both cost and quality. As in most RAG applications, we attach source URLs to every chunk and instruct the LLM to include citations referencing the source link. Here’s a snippet of our answer-generation prompt:
> For each document, indicate which sources most support it via valid citation markers at the end of the sentence, in markdown format. Add a link to the source using markdown format. Also include the page number with the source.
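For illustration, here’s a minimal sketch of how chunk text and source metadata can be rendered into the prompt context. The `Chunk` fields and the formatting below are simplified placeholders, not our actual schema:

```python
# Minimal sketch: render retrieved chunks plus their source metadata into
# the prompt context. Field names (text, source_url, page) are simplified
# placeholders, not our production schema.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_url: str
    page: int

def build_context(chunks: list[Chunk]) -> str:
    blocks = []
    for c in chunks:
        # Each chunk carries its source link and page so the LLM can cite it.
        blocks.append(
            f"{c.text}\nSource: [{c.source_url}]({c.source_url}), page {c.page}"
        )
    return "\n\n".join(blocks)
```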
Our analysis revealed that over 60% of our answers contain more than five source links, and answers to list-style questions often exceed ten. These links inflate both input and output tokens.
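To put a rough number on that, a quick check with tiktoken shows the gap between a full markdown citation and a short numeric marker (the URL below is made up):

```python
# Rough token cost of a full markdown citation vs. a short numeric marker,
# counted with tiktoken's cl100k_base encoding. The URL is a made-up example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

full_link = "[source](https://docs.example.com/guides/retrieval/chunking#overlap), page 12"
marker = "[3]"

print(len(enc.encode(full_link)))  # typically 20+ tokens
print(len(enc.encode(marker)))     # about 3 tokens
```

Multiply a 20-token link by five to ten citations per answer, and the overhead adds up quickly on every request.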