LLM Knowledge agent with RAG Fusion on Cohere platform with LangChain and Streamlit
Feb 11, 2024
ThruThink is a business budgeting online app for business projection and forecasting. It is a product of literally decades of experience and careful thought, and thousands of calculations. It got its name from thru-hiking, or through-hiking: the act of hiking an established long-distance trail end-to-end continuously. When budgeting you need to carefully plan and account for everything, just like when planning a thru-hike.
ThruThink is a startup in the incubation phase, and there are no dedicated personnel for support chat agent roles. At one point we had a “classic” human chat agent integration; however, that is on hiatus now, and with the rise of large language models it would be great to supplement it with an AI chat agent.
For this variation of the chat agent, I continued from the QnA Boosted RAG concept of the Vectara hackathon.
Cohere is another company that offers Retrieval Augmented Generation services, but they also have their own state-of-the-art large language model called Command R (and lately Command R+). Like many companies, they are moving towards higher-level managed services in this space: their offering takes care of a large portion of the RAG pipeline, including chunking, re-ranking, and more.
The most prominent feature of the current set of experiments (this blog post) was RAG fusion. RAG fusion takes the original query, generates variations of it, and performs the RAG procedure on each of those variations. Then an extra reciprocal rank fusion step combines the retrievals and their ranks across the variations to produce the final ranking. The variations should stay close to the original query (or rather shouldn't stray too far from it); however, they preferably shouldn't be too similar either, so they don't simply retrieve the same chunks and add nothing new.
The conceptual idea is twofold:
- Sometimes the users themselves don't know the proper question to ask about their issue.
- In case of sparse data, some variations might match certain indexed chunks better and could potentially uncover more matching results.
These are hypotheses.
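To make the fusion step concrete, here is a minimal sketch of the flow. The `generate_variations` and `retrieve` callables are hypothetical stand-ins for the LLM call that rewrites the query and for whatever retrieval backend is used; `k=60` is the constant commonly used with reciprocal rank fusion.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked document-id lists into a single ranking."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranks.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rag_fusion(query, generate_variations, retrieve, top_k=5):
    """Retrieve for the original query and its variations, then fuse the ranks."""
    queries = [query] + generate_variations(query)  # hypothetical LLM rewrite step
    ranked_lists = [retrieve(q, top_k=top_k) for q in queries]
    return reciprocal_rank_fusion(ranked_lists)
```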
As for the technology stack, Cohere’s co.chat has two modes:
- Document mode, which essentially means RAG
- Connector mode, where it can perform a web search to supplement information for an answer
That would perfectly cover our goal for the agent: being knowledgeable about ThruThink specifics while also being versed in generic accounting and business projection concepts.
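As a rough illustration (not production code), the two modes can be invoked with the Cohere Python SDK roughly like this; the model name, the document fields, and the example questions are placeholders of mine, so check them against the current Cohere documentation:

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

# Document mode: pass the retrieved chunks along with the message and let
# Cohere ground the answer in them (the response also carries citations).
doc_response = co.chat(
    message="How do I set up a debt schedule in ThruThink?",
    model="command-r",
    documents=[
        {
            "title": "Debt schedule help topic",
            "snippet": "Placeholder text of an indexed ThruThink help chunk.",
        },
    ],
)

# Connector mode: let the managed web-search connector gather supporting
# context for generic accounting or business projection questions.
web_response = co.chat(
    message="What is straight-line depreciation?",
    model="command-r",
    connectors=[{"id": "web-search"}],
)

print(doc_response.text)
print(web_response.text)
```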
One of the most interesting technical challenges was a prompt engineering one: generating the variations is trivial at the beginning of a conversation. However, once we are in the middle of one, the conversation can contain several contextual references to previously mentioned information, and we wouldn't want to perturb the variation prompt with the whole history. Instead, I want to resolve any contextual references in the last user query. I had an extremely hard time achieving that originally; various large language models performed badly until I crafted a few-shot prompt. This was a very important first lesson about how much few-shot prompting and prompt engineering matter besides the other technical parts. That realization would strike me again at my first Kaggle hackathon.
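For illustration only, a few-shot rewriting prompt along these lines can be assembled with LangChain; the instruction wording and the example exchange below are made up, not the exact prompt I ended up with:

```python
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)

# Hypothetical few-shot examples: each shows a chat history plus a follow-up
# question, and the standalone rewrite we expect back.
examples = [
    {
        "history": "User: How do I enter a new loan?\n"
                   "Assistant: Use the Debt Entry screen and fill in the "
                   "principal, rate, and term.",
        "question": "Can I change its interest rate later?",
        "standalone": "Can I change the interest rate of a loan after it has "
                      "been entered on the Debt Entry screen?",
    },
    # ... more examples in the same shape
]

example_prompt = ChatPromptTemplate.from_messages([
    ("human", "Chat history:\n{history}\nFollow-up question: {question}"),
    ("ai", "{standalone}"),
])

few_shot = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Rewrite the follow-up question as a standalone question, resolving "
     "every pronoun and contextual reference from the chat history. "
     "Do not answer the question."),
    few_shot,
    ("human", "Chat history:\n{history}\nFollow-up question: {question}"),
])
```

The assembled `rewrite_prompt` is then sent to the chat model, so only the resolved, standalone query feeds the variation generation and the retrieval.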