Evaluating Phi-3-Mini's Performance on RAG, Routing, and Agents

Exploring the model's capabilities in practical use cases, including simple RAG queries, complex query decomposition, and agent orchestration.

January 24, 2025

This blog post explores the capabilities of the Phi-3-Mini language model in practical use cases, including retrieval, query routing, and agent-based frameworks. The content provides a detailed analysis of the model's performance across various tasks, offering insights into its strengths and limitations. Readers will gain a better understanding of the model's suitability for real-world applications.

Simple Retrieval and RAG

The model performs reasonably well on simple retrieval tasks using the RAG (Retrieval-Augmented Generation) pipeline. When asked a simple query like "how do OpenAI and Meta differ on AI tools", the model is able to provide an accurate response by compacting the relevant chunks of text and generating a coherent summary.
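One plausible reading of "compacting" is that retrieved chunks are packed together so that fewer, fuller prompts are sent to the model. The sketch below illustrates that idea only; the function name `compact_chunks` and the character budget `max_chars` are assumptions made for illustration, not details taken from the tests.

```python
from typing import List


def compact_chunks(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily pack chunks into as few prompt-sized strings as possible."""
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        # Accept the chunk if it fits, or if it starts a new (empty) prompt.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


if __name__ == "__main__":
    for prompt in compact_chunks(["aaa", "bbb", "ccc"], max_chars=8):
        print(repr(prompt))
```

A real pipeline would budget in tokens rather than characters, but the packing logic has the same shape.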

However, when the queries become more complex, the model starts to exhibit some limitations. For example, when asked "what are the new features added by OpenAI to ChatGPT", the model incorrectly attributes some features introduced by Meta to OpenAI, showcasing a tendency to hallucinate or confuse information from different sources.

The model's performance improves when using the "tree summarize" mode, which recursively summarizes each chunk of text before generating the final response. This approach helps to mitigate the issue of conflicting information across different chunks.
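The recursive idea behind a tree-style summarization can be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: `summarize` is a hypothetical stand-in for a model call, and `fan_in` (how many summaries are merged per step) is an assumed parameter.

```python
from typing import Callable, List


def tree_summarize(chunks: List[str],
                   summarize: Callable[[str], str],
                   fan_in: int = 2) -> str:
    """Reduce chunks to one summary by summarizing groups level by level."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    # Leaf level: summarize every chunk on its own first.
    level = [summarize(c) for c in chunks]
    # Merge groups of `fan_in` summaries until a single one remains.
    while len(level) > 1:
        level = [
            summarize(" ".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def fake_llm_summarize(text: str) -> str:
    # Hypothetical stand-in for a model call: keep the first 60 characters.
    return text[:60]
```

Because each chunk is summarized in isolation before anything is merged, statements about different companies are less likely to be blended in a single prompt, which may be why this mode helped here.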

Overall, the model demonstrates a decent capability for simple retrieval tasks using RAG, but its performance starts to degrade when dealing with more complex queries that require a deeper understanding of the underlying information.

Complex Queries and RAG Limitations

The model's performance on complex queries reveals some limitations of the RAG approach. While it handles simple queries reasonably well, it struggles with queries whose relevant information is spread across document chunks that conflict with one another.

When asked about the new features introduced by OpenAI, the model incorrectly attributed some features that were actually introduced by Meta. This suggests that the model has difficulty reconciling and synthesizing information from multiple sources, especially when there are discrepancies or contradictions.

The model's query decomposition capabilities, however, seem more promising. When presented with a complex query, the model was able to break it down into relevant sub-questions and retrieve information accordingly. This suggests that the model has some understanding of the underlying structure of the query and can attempt to address it in a more systematic way.

In the context of agent orchestration, the model's performance was mixed. For simple queries, it was able to determine that no tool was necessary and generate a response on its own. However, for more complex queries, the model struggled to effectively utilize the available tools to provide a comprehensive answer.

Overall, the results indicate that while the model has some capabilities in handling RAG-based tasks, it still has limitations when it comes to complex queries and agent orchestration. Further improvements in the model's ability to reconcile conflicting information, synthesize knowledge, and effectively leverage external tools would be necessary to make it more robust for these types of applications.

Query Routing and Query Decomposition

The model's performance on query routing and query decomposition tasks was mixed.

For query routing, the model was able to effectively use the provided tool descriptions to determine which vector store to use for answering specific queries. When asked a question about information related to Meta, the model correctly identified the "Vector Tool" as the appropriate resource and provided a relevant response. Similarly, when asked a more specific question about the number of personality-driven chatbots introduced by Meta, the model again used the correct vector store to retrieve the accurate information.
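Single-choice routing can be pictured as matching the query against each tool's description and picking the best fit. In the tests described here the language model makes that choice; the sketch below substitutes a simple word-overlap score so that it runs without a model. The tool names and descriptions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Tool:
    name: str
    description: str


def _words(text: str) -> Set[str]:
    # Lowercase and strip basic punctuation so "Meta?" matches "Meta".
    return {w.strip(".,?!").lower() for w in text.split()}


def route(query: str, tools: List[Tool]) -> Tool:
    """Pick the one tool whose description shares the most words with the query."""
    q = _words(query)
    return max(tools, key=lambda t: len(q & _words(t.description)))


# Hypothetical tool descriptions, written for this example only.
tools = [
    Tool("vector_tool", "Answers specific questions about Meta and its chatbots"),
    Tool("summary_tool", "Produces a summary of the whole article"),
]

if __name__ == "__main__":
    print(route("How many chatbots did Meta introduce?", tools).name)
```

The quality of the descriptions matters in either version: the selector, whether a model or a heuristic, has nothing else to go on.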

However, when the model was allowed to select multiple tools, its performance declined. For a query asking about the main features introduced by OpenAI and other companies, the model brought in information about Tesla and Apple, neither of which was mentioned in the original document. This suggests that the model still struggles with complex query routing and may hallucinate information when attempting to combine multiple sources.

The model performed better on query decomposition tasks. When presented with a complex query about the differences between how Meta and OpenAI are discussed, the model was able to break it down into three sub-questions, retrieving relevant information for each and then synthesizing a final response. The sub-questions generated were logical and the overall answer provided a reasonable comparison between the two companies.
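The decompose, answer, and synthesize flow described above can be sketched in three steps. All three callables (`decompose`, `answer`, `synthesize`) are hypothetical stand-ins for model and retrieval calls, and the stub sub-questions are invented for illustration rather than taken from the model's output.

```python
from typing import Callable, List, Tuple


def answer_by_decomposition(
    query: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str], str],
    synthesize: Callable[[str, List[Tuple[str, str]]], str],
) -> str:
    """Split a query into sub-questions, answer each, then combine."""
    sub_questions = decompose(query)
    pairs = [(q, answer(q)) for q in sub_questions]
    return synthesize(query, pairs)


# Stubs so the sketch runs without a model or a vector store.
def stub_decompose(query: str) -> List[str]:
    return [
        "How is Meta discussed?",
        "How is OpenAI discussed?",
        "How do the two differ?",
    ]


def stub_answer(question: str) -> str:
    return f"[retrieved answer to: {question}]"


def stub_synthesize(query: str, pairs: List[Tuple[str, str]]) -> str:
    return "\n".join(f"{q} -> {a}" for q, a in pairs)


if __name__ == "__main__":
    print(answer_by_decomposition(
        "How are Meta and OpenAI discussed differently?",
        stub_decompose, stub_answer, stub_synthesize,
    ))
```

Each sub-question is answered against retrieval on its own, so a single retrieval call never has to cover both companies at once.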

In summary, the model shows promise in basic query routing capabilities, but its performance degrades for more complex queries that require combining information from multiple sources. The query decomposition abilities are more robust, indicating the model can effectively break down and address intricate questions. However, further refinement may be needed to fully harness the model's potential in practical use cases.

Agents and Mathematical Operations

The tests conducted on Phi-3-Mini reveal some interesting insights about its capabilities and limitations:

  1. Simple RAG Queries: The model performs reasonably well on simple RAG queries, providing accurate responses based on the information available in the document.

  2. Complex RAG Queries: When faced with more complex queries that involve conflicting information across different document chunks, the model struggles and tends to hallucinate or miscategorize the information.

  3. Query Routing: The model demonstrates the ability to perform query routing, where it can select the appropriate vector store to retrieve relevant information based on the query. This suggests the model can handle tasks that require understanding the metadata and capabilities of different information sources.

  4. Query Decomposition: The model is able to decompose complex queries into sub-questions and retrieve information to answer them individually, then combine the results. This shows promise for the model's ability to handle complex information needs.

  5. Agent Orchestration: When tested in an agent-based framework, the model exhibits limited capabilities. It struggles to effectively utilize the provided tools, especially for more complex tasks involving mathematical operations. The model seems to prefer performing computations on its own rather than leveraging the available tools.

  6. Mathematical Operations: Interestingly, the model appears to have a better grasp of performing simple mathematical operations on its own, without relying on the provided tools. This suggests the model may have some inherent mathematical reasoning capabilities.
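For context, tool-based arithmetic in an agent framework usually means the model emits a tool name plus arguments and the framework executes the matching function. The sketch below shows that dispatch step only; the tool names `add` and `multiply` and the dictionary call format are assumptions for illustration, not the setup used in the tests.

```python
from typing import Callable, Dict


def add(a: float, b: float) -> float:
    return a + b


def multiply(a: float, b: float) -> float:
    return a * b


# Registry of tools the agent is allowed to call (hypothetical names).
TOOLS: Dict[str, Callable[[float, float], float]] = {
    "add": add,
    "multiply": multiply,
}


def run_tool_call(call: dict) -> float:
    """Execute one tool call of the form {"tool": name, "args": [a, b]}."""
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*call["args"])


# (3 + 4) * 2 expressed as two chained tool calls.
step1 = run_tool_call({"tool": "add", "args": [3, 4]})
step2 = run_tool_call({"tool": "multiply", "args": [step1, 2]})

if __name__ == "__main__":
    print(step2)
```

A model that skips this path and computes the result in its own text output gives up the guarantee that the arithmetic is exact.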

Overall, the results indicate that Phi-3-Mini has potential for certain applications, such as simple information retrieval and query routing. However, its performance on more complex tasks, including agent orchestration and handling of conflicting information, is limited. Further advancements in the model's reasoning and integration with external tools may be necessary to fully leverage its capabilities in practical use cases.

Conclusion

Phi-3-Mini, while impressive on benchmarks, has some limitations when it comes to practical use cases.

For simple retrieval tasks, the model performs reasonably well, able to provide accurate responses by compacting relevant information from the document. However, when faced with more complex queries that involve conflicting information across different document chunks, the model struggles and tends to hallucinate, mixing up features introduced by different companies.

The model's query routing capabilities show promise, as it is able to select the appropriate vector store based on the provided descriptions. This suggests the model can understand the context and purpose of different information sources. However, when it is allowed to select multiple tools, its performance declines and it may hallucinate information. Query decomposition was more robust: the model broke a complex comparison into logical sub-questions and synthesized a reasonable answer.

When it comes to agent orchestration, the model exhibits mixed results. While it can handle simple queries without the need for tools, for more complex mathematical operations, it seems reluctant to leverage the provided tools and instead attempts to perform the computations itself, sometimes inaccurately.

Overall, Phi-3-Mini demonstrates capabilities in certain areas, but its limitations become apparent when dealing with complex, multi-faceted queries and tasks. Further advancements in areas like hallucination mitigation and robust reasoning will be necessary to unlock the model's full potential for practical applications.
