Peter Zhang
Aug 23, 2025 19:34
ElevenLabs has enhanced its Retrieval-Augmented Generation system, cutting query generation latency by 50% and markedly improving the performance of its conversational AI agents.
ElevenLabs has unveiled a significant improvement in the performance of its Retrieval-Augmented Generation (RAG) system, achieving a 50% reduction in query generation latency. This advancement is aimed at enhancing the efficiency of conversational agents, according to ElevenLabs.
The Challenge: Context-Aware Query Generation
In conversational AI, RAG systems must convert conversation history into precise search queries that accurately reflect user intent. A key challenge is maintaining context across multiple turns: in a customer support scenario, for example, the system must resolve references to earlier messages in order to generate accurate responses. Previously, this process depended on a single large language model (LLM), which introduced both latency and availability issues.
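To make the challenge concrete, here is a minimal sketch of the prompt-assembly step such a system might use. The function name, prompt wording, and conversation format are all hypothetical, not ElevenLabs' actual implementation; the point is only that the latest user message cannot be used as a search query on its own, since references like "the second one" need prior turns to resolve.

```python
def build_query_prompt(history: list[dict], latest: str) -> str:
    """Assemble a prompt asking an LLM to rewrite the latest user
    message into a standalone search query, resolving references
    against earlier turns. (Illustrative format, not ElevenLabs'.)"""
    transcript = "\n".join(f"{turn['role']}: {turn['text']}" for turn in history)
    return (
        "Rewrite the final user message as a self-contained search query.\n"
        "Resolve pronouns and references using the conversation so far.\n\n"
        f"Conversation:\n{transcript}\n"
        f"Final user message: {latest}\n"
        "Search query:"
    )

history = [
    {"role": "user", "text": "What plans do you offer?"},
    {"role": "agent", "text": "We have Basic and Pro tiers."},
]
prompt = build_query_prompt(history, "How much does the second one cost?")
```

Without the transcript in the prompt, "the second one" is unresolvable, which is why the whole history must pass through an LLM before retrieval.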
The Solution: Parallel LLM Racing
To tackle these challenges, ElevenLabs designed a system that issues the query-generation request to multiple LLMs in parallel, racing them and using the first successful response. The mix includes models with different characteristics, such as Google’s Gemini models and self-hosted Qwen models, each offering distinct trade-offs in speed and reliability. By distributing the workload across different models, ElevenLabs stabilized response times even when individual models experienced fluctuations.
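The racing pattern can be sketched with Python's asyncio. This is a minimal illustration, not ElevenLabs' code: the helper names and the fake model calls are invented, and a real version would call actual provider SDKs. The core idea is to launch all calls concurrently, take the first one that succeeds, and cancel the rest.

```python
import asyncio

async def race_models(callers, timeout=1.0):
    """Run all model calls concurrently; return the first successful
    result, skipping models that error out. Returns None if nothing
    succeeds within the timeout. Stragglers are cancelled."""
    tasks = [asyncio.create_task(c()) for c in callers]
    try:
        for fut in asyncio.as_completed(tasks, timeout=timeout):
            try:
                return await fut        # first finisher that succeeds wins
            except asyncio.TimeoutError:
                break                   # overall deadline expired
            except Exception:
                continue                # this model failed; wait for the next
        return None
    finally:
        for t in tasks:
            t.cancel()                  # no-op for already-finished tasks

async def fake_model(name, delay, fail=False):
    """Stand-in for a real provider call (hypothetical)."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} unavailable")
    return f"query from {name}"

result = asyncio.run(race_models([
    lambda: fake_model("gemini-flash", 0.05, fail=True),
    lambda: fake_model("qwen-self-hosted", 0.1),
]))
# the fastest *succeeding* model wins, even though another finished first
```

Because failures are skipped rather than propagated, an outage at one provider only costs the time until the next model finishes, which is why the mix of hosted and self-hosted models stabilizes tail latency.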
Smart Timeout Handling
In scenarios where no model responds within the designated one-second timeout, a fallback strategy is employed, defaulting to the most recent user message for query generation. This approach ensures that conversations continue smoothly, prioritizing flow over perfect query formulation.
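The fallback described above can be sketched as a thin wrapper around the generation call. Again a hypothetical illustration under the article's stated design (one-second budget, fall back to the raw user message), not ElevenLabs' implementation:

```python
import asyncio

async def query_with_fallback(generate, last_user_message, timeout=1.0):
    """Wait up to `timeout` seconds for an LLM-rewritten query; on
    timeout or any error, fall back to the most recent user message
    so the conversation keeps flowing."""
    try:
        return await asyncio.wait_for(generate(), timeout=timeout)
    except Exception:                    # includes asyncio.TimeoutError
        return last_user_message

async def stalled_model():
    """Stand-in for a provider that never answers in time (hypothetical)."""
    await asyncio.sleep(5)
    return "rewritten query"

query = asyncio.run(query_with_fallback(
    stalled_model, "how much is the Pro plan?", timeout=0.1))
# falls back to the raw user message
```

The raw message is a worse retrieval query than a context-resolved rewrite, but it is always available instantly, which matches the article's stated priority of conversational flow over perfect query formulation.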
The Results
The optimization efforts led to substantial improvements in response times across percentiles: median latency fell from 326ms to 155ms, with comparable reductions at the 75th and 95th percentiles. The new architecture not only boosts speed but also improves reliability, as demonstrated during a recent Gemini outage, when the self-hosted models kept the system operating without interruption.
Future Prospects
ElevenLabs’ innovative architecture paves the way for real-time, context-aware AI applications, particularly in the field of voice AI. By achieving sub-200ms RAG, ElevenLabs is setting new standards for the development of responsive and efficient conversational agents.
Image source: Shutterstock
Source: https://blockchain.news/news/elevenlabs-optimizes-rag-system-faster-response