I’ve been running a chat assistant application built on OpenAI for the past year. My biggest learning has come from analyzing our assistant’s responses and finding ways to optimize them for both cost and quality. Like most RAG applications, we attach source URLs to every chunk and instruct the LLM to include citations referencing the source link. Here’s a snippet of our answer generation prompt:
For each statement, indicate which sources most support it via valid citation markers at the end of the sentence, in markdown format. Add a link to the source using markdown format. Also include the page number with the source.
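To make the setup concrete, here is a minimal sketch of how an instruction like this might be assembled into the final prompt. The chunk format, function name, and example URL below are hypothetical illustrations, not our actual code:

```python
# Hypothetical sketch: combining retrieved chunks and the citation
# instruction into one answer-generation prompt.

CITATION_INSTRUCTION = (
    "For each statement, indicate which sources most support it via valid "
    "citation markers at the end of the sentence, in markdown format. "
    "Add a link to the source using markdown format. "
    "Also include the page number with the source."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    # Each chunk carries its text plus the source URL and page number,
    # so the model can emit [title](url) style citations.
    context = "\n\n".join(
        f"Source: {c['url']} (page {c['page']})\n{c['text']}"
        for c in chunks
    )
    return f"{CITATION_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    "What is our travel policy?",
    [{"url": "https://contoso.sharepoint.com/sites/hr/Travel%20Policy.docx",
      "page": 3,
      "text": "Employees may book economy class for all domestic travel..."}],
)
```

Because the source URL is embedded verbatim in the context, the model tends to reproduce it verbatim in its citations, which is exactly where the token cost comes from.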
Our analysis revealed that over 60% of our answers contain more than five source links, with listing questions exceeding ten links. These links inflate both input and output tokens.
Since our knowledge base resides on SharePoint, long SharePoint URLs go into the input, and the LLM then references the same lengthy URLs in the output. Our dataset analysis indicates that a single SharePoint URL consumes 60 to 100 tokens, and those tokens are counted in both the input and the output (though not every URL in the input ends up in the generated answer).
By now, you’ve likely realized the potential of URL shorteners to reduce costs. We implemented Bitly, and a Bitly URL consumes 10-15 tokens on average. For instance, using Bitly links in a response with six SharePoint URLs reduced output tokens from 962 to 586 – a roughly 40% reduction that translates into direct cost savings.
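We used Bitly in production, but the mechanism is easy to illustrate with a tiny in-process stand-in. Everything here (the short domain, the class, the slug scheme) is a hypothetical sketch of the shortening idea, not the Bitly API:

```python
import hashlib

# Minimal sketch of a URL shortener: map long SharePoint URLs to short,
# stable slugs before they enter the prompt, and keep a table to expand
# them again if needed. A real deployment would use Bitly or a hosted
# redirect service instead.
class UrlShortener:
    def __init__(self, base: str = "https://go.example.com/"):
        self.base = base                     # hypothetical short domain
        self.mapping: dict[str, str] = {}

    def shorten(self, long_url: str) -> str:
        # Derive a stable 7-character slug from the URL itself, so the
        # same document always gets the same short link.
        slug = hashlib.sha1(long_url.encode()).hexdigest()[:7]
        short = self.base + slug
        self.mapping[short] = long_url
        return short

    def expand(self, short_url: str) -> str:
        return self.mapping[short_url]

s = UrlShortener()
short = s.shorten(
    "https://contoso.sharepoint.com/sites/hr/Shared%20Documents/"
    "Policies/Travel%20and%20Expense%20Policy%20FY2024.docx"
)
```

The short link is what gets fed to (and echoed by) the LLM; the mapping table lets the application resolve clicks back to the original document.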
On average we see four to five links per answer, so across the dataset we observed roughly a 30% reduction in output tokens – and a corresponding 30% reduction in output-token cost.
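The arithmetic behind that figure is straightforward. The per-URL token counts below are assumed midpoints of the ranges mentioned earlier, not exact measurements:

```python
# Back-of-the-envelope savings per answer, using assumed midpoints
# of the ranges from the text.
long_url_tokens = 80    # a SharePoint URL: 60-100 tokens
short_url_tokens = 12   # a Bitly link: 10-15 tokens
links_per_answer = 4.5  # we typically see 4-5 links per answer

tokens_saved = links_per_answer * (long_url_tokens - short_url_tokens)
# ~306 output tokens saved per answer; against an answer in the
# ~960-token range, that works out to roughly a 30% reduction.
print(round(tokens_saved))
```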
URL shorteners offer an additional benefit beyond cost savings: gauging user engagement. Click-through rates on shortened URLs serve as an indicator of answer quality and usefulness.
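Bitly surfaces click counts in its dashboard; a self-hosted shortener could capture the same signal with a simple counter. The function names and slug below are hypothetical, just to sketch the idea:

```python
from collections import Counter

# Hypothetical click tracking: every redirect through the shortener
# increments a per-link counter, giving a crude engagement signal
# for the answers that cited that link.
clicks: Counter[str] = Counter()

def record_click(slug: str) -> None:
    clicks[slug] += 1

def click_through_rate(slug: str, answers_containing_link: int) -> float:
    # Fraction of answers containing this link where a user clicked it.
    if answers_containing_link == 0:
        return 0.0
    return clicks[slug] / answers_containing_link

record_click("abc1234")
record_click("abc1234")
rate = click_through_rate("abc1234", 10)  # 2 clicks over 10 answers -> 0.2
```

A consistently low click-through rate on a source can suggest the citation wasn’t needed, while high rates indicate the answer sent users to genuinely useful documents.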