Agents Powered by Llama 3.1: Testing Function Calling Capabilities

Explore the capabilities of Llama 3.1 in function calling and tool usage. Learn how to leverage observability tools like LangTrace to monitor LLM performance. Discover the strengths and limitations of different Llama model sizes in complex task handling.

December 22, 2024


Unlock the power of Llama 3.1 with this comprehensive guide on its function calling capabilities. Discover how this cutting-edge language model can be leveraged as an intelligent agent, seamlessly integrating with APIs to tackle complex tasks. Explore the observability aspects and gain insights into the model's performance, empowering you to make informed decisions for your next project.

Capabilities of Llama 3.1 and Meta's Agentic System

One of the key capabilities of Llama 3.1 that Meta highlighted in the release is function calling or tool usage. The author wanted to put this capability to the test.

The author first set up the necessary tools and APIs, starting with the Groq API, one of the fastest hosted endpoints for Llama 3.1. They tested the 70B and 8B Llama 3.1 models, as well as a Groq-specific fine-tuned version of the 70B model.

The author started with a simple example of a single function call, then moved on to more complex scenarios involving parallel and nested function calls. They used the LangTrace observability platform to track the token usage and other metrics during the experiments.

The results showed that the 70 billion Llama 3.1 model performed very well, handling both parallel and nested function calls effectively. The 8 billion model struggled with more complex tasks, while the Groq-specific fine-tuned model had the most trouble, often requiring additional information or clarification from the user.

Overall, the author concluded that the 70 billion Llama 3.1 model is the best option for serious function calling or agentic use cases, demonstrating impressive capabilities in this area. The author also highlighted the usefulness of the LangTrace observability platform for tracking and understanding the behavior of large language models during these types of experiments.

Setting up LangTrace for Observability

In this section, we will set up LangTrace, an open-source, OpenTelemetry-based observability platform for LLM applications. LangTrace allows us to track the number of requests and tokens being communicated between our local environment and the LLM API.

First, we need to install the required packages, including the LangTrace Python SDK, the Groq Python SDK, and the OpenAI Python SDK (even though we're not using the OpenAI LLM, it's a dependency of the LangTrace SDK).

Next, we set up our API keys. For this experiment, we don't strictly need LangTrace, but it can provide valuable insights into the usage of our tokens. LangTrace is similar in functionality to LangSmith, an observability platform from LangChain, but it supports a wider range of vendors, including OpenAI, Groq, Cohere, and Perplexity.

We'll be using the cloud-hosted version of LangTrace, so we'll need to create a new account and project. Once we have our API key, we can add it as a secret in our notebook.
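A minimal sketch of that setup, assuming the standard package names and initialization calls of the LangTrace and Groq Python SDKs (check the current docs if these have changed):

```python
# Rough setup sketch -- package names are assumptions:
#   pip install langtrace-python-sdk groq openai
import os

from langtrace_python_sdk import langtrace  # import before creating the LLM client
from groq import Groq

# Both API keys are stored as notebook secrets / environment variables.
langtrace.init(api_key=os.environ["LANGTRACE_API_KEY"])

client = Groq(api_key=os.environ["GROQ_API_KEY"])
```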

With the setup complete, we can start using LangTrace to observe the token usage and other metrics during our function calling experiments with the Groq LLM API. LangTrace will give us detailed information about the number of tokens exchanged and the associated costs, where applicable (Groq doesn't charge for API usage).

By using LangTrace, we can gain valuable insights into the performance and efficiency of our LLM-powered applications, which can be especially useful when testing advanced capabilities like parallel and nested function calls.

Testing Function Calling with Llama 3.1 70B and 8B Models

The author starts by highlighting Meta's release of an agentic system built around function calling in Llama 3.1. Since the author has not set up that system locally, they use the Groq API instead, which offers one of the fastest hosted endpoints for Llama 3.1.

The author tests the 70B and 8B Llama 3.1 models, as well as a Groq-specific fine-tuned version of the 70B model. They use LangTrace, an open-source observability platform for LLM applications, to track the number of requests and tokens exchanged between the local environment and the LLM API.

The author begins with a simple example, where the model needs to use a "get game scores" function to determine the winner of an NBA game. The 70B model performs this task successfully, and the author examines the LangTrace data to understand the internal mechanism.
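For readers who want to reproduce this step, the pattern with the Groq Python SDK looks roughly like the snippet below; the `get_game_score` schema, the prompt, and the model id are illustrative assumptions, not the author's exact code:

```python
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Illustrative tool schema -- the notebook's real definition may differ.
tools = [{
    "type": "function",
    "function": {
        "name": "get_game_score",
        "description": "Get the result of a given NBA team's last game",
        "parameters": {
            "type": "object",
            "properties": {
                "team_name": {"type": "string",
                              "description": "NBA team name, e.g. 'Golden State Warriors'"},
            },
            "required": ["team_name"],
        },
    },
}]

messages = [{"role": "user", "content": "Who won the last Warriors game?"}]

response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # assumed Groq model id
    messages=messages,
    tools=tools,
    tool_choice="auto",               # let the model decide whether to call the tool
)

message = response.choices[0].message
if message.tool_calls:                # the model asked for the function to be run
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```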

Next, the author tests the models' ability to handle parallel function calls, where the user asks for information related to weather, flights, hotels, and attractions. The 70B model is able to decompose the initial prompt, make parallel function calls, and generate a comprehensive response. The 8B model, however, struggles with this task, hallucinating information and failing to provide complete responses.
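Mechanically, a parallel call just means the assistant's reply carries several entries in its `tool_calls` list. Below is a hedged sketch of dispatching them; `get_weather`, `find_flights`, and the other implementations are placeholders, and the corresponding schemas are assumed to be registered in the same style as the game-scores example:

```python
import json

# Placeholder implementations standing in for real weather/flight/hotel APIs.
def get_weather(city): return {"city": city, "forecast": "sunny, 22C"}
def find_flights(origin, destination): return {"flights": [f"{origin} -> {destination}, 09:00"]}
def find_hotels(city): return {"hotels": ["Example Hotel"]}
def find_attractions(city): return {"attractions": ["Museum", "Park"]}

available_functions = {
    "get_weather": get_weather,
    "find_flights": find_flights,
    "find_hotels": find_hotels,
    "find_attractions": find_attractions,
}

def run_tool_calls(message, messages):
    """Execute every tool call in an assistant message and append the results."""
    messages.append(message)  # keep the assistant's tool-call turn in the history
    for tool_call in message.tool_calls or []:
        fn = available_functions[tool_call.function.name]
        result = fn(**json.loads(tool_call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,   # ties the result back to the request
            "content": json.dumps(result),
        })
```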

The author then introduces a more complex scenario, where the user wants to plan a trip from New York to London and then to Tokyo, including weather, flights, hotels, and attractions. Again, the 70B model performs well, while the 8B model has difficulty.

Finally, the author tests the Groq-specific fine-tuned 70B model, which surprisingly struggles with even the simple "get game scores" task, repeatedly asking for more specific details instead of using the provided function.

In conclusion, the author finds that the 70B Llama 3.1 model is the best performer when it comes to function calling and tool usage, able to handle parallel and nested function calls. The 8B model, on the other hand, is not recommended for serious function calling tasks. The Groq-specific fine-tuned model also underperforms in the author's tests.

Parallel Function Calls and Nested Sequential Function Calls

The Llama 3.1 model, particularly the 70B version, demonstrated impressive capabilities in handling parallel function calls and nested sequential function calls.

When presented with a complex prompt to plan a trip from New York to Paris, including checking weather, finding flights, hotels, and attractions, the 70B model was able to decompose the task and make parallel function calls to gather the necessary information. It then combined the results from the various functions to provide a comprehensive summary of the trip details.
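Continuing the dispatch sketch from earlier, that comprehensive summary is simply one more completion over the accumulated message history; the variable names (`response`, `messages`, `tools`) follow the previous snippets and the model id remains an assumption:

```python
# After the parallel tool calls have been executed and their results appended,
# a second completion lets the model weave them into the final trip summary.
run_tool_calls(response.choices[0].message, messages)

final = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # assumed Groq model id
    messages=messages,
    tools=tools,                      # the trip-planning schemas registered earlier
)
print(final.choices[0].message.content)
```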

The model also showed its ability to handle nested function calls, where the output of one function was used as the input to another. In the movie recommendation scenario, the model was able to first select a movie based on the user's preference, and then recommend a suitable snack and streaming platform to watch the movie.
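A self-contained sketch of that nested, sequential pattern: each round's tool output is appended to the history and the model is queried again until it stops requesting tools. The tool names, schemas, and model id are illustrative stand-ins rather than the author's exact setup:

```python
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Illustrative implementations for the movie-night scenario.
def select_movie(genre): return {"movie": "Example Movie", "genre": genre}
def recommend_snack(movie): return {"snack": "popcorn"}
def pick_streaming_platform(movie): return {"platform": "ExampleFlix"}

local_functions = {
    "select_movie": select_movie,
    "recommend_snack": recommend_snack,
    "pick_streaming_platform": pick_streaming_platform,
}

def schema(name, description, params):
    """Minimal OpenAI/Groq-style function schema with string parameters."""
    return {"type": "function", "function": {
        "name": name, "description": description,
        "parameters": {"type": "object",
                       "properties": {p: {"type": "string"} for p in params},
                       "required": params}}}

movie_tools = [
    schema("select_movie", "Pick a movie for a given genre", ["genre"]),
    schema("recommend_snack", "Suggest a snack to pair with a movie", ["movie"]),
    schema("pick_streaming_platform", "Find where to stream a movie", ["movie"]),
]

messages = [{"role": "user",
             "content": "Plan a comedy movie night: pick the movie, a snack, and where to stream it."}]

# Keep querying until the model answers without requesting another tool,
# so the output of one call (the movie) can feed the next (snack, platform).
while True:
    reply = client.chat.completions.create(
        model="llama-3.1-70b-versatile",  # assumed Groq model id
        messages=messages,
        tools=movie_tools,
    ).choices[0].message
    if not reply.tool_calls:
        print(reply.content)              # final combined recommendation
        break
    messages.append(reply)
    for call in reply.tool_calls:
        result = local_functions[call.function.name](**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
```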

In contrast, the smaller 8B Llama 3.1 model struggled with these more advanced use cases. It was unable to consistently provide accurate and complete responses when asked to handle parallel function calls or nested function calls. The model often hallucinated information or failed to use the provided tools effectively.

Additionally, the specialized function calling model from Groq also underperformed in the tests, failing to properly utilize the provided functions and requiring more specific input from the user.

Overall, the 70B Llama 3.1 model demonstrated a strong capability in handling complex, multi-step tasks that involve parallel and nested function calls, showcasing its potential as a capable agent-like system. The smaller 8B model and the Groq function calling model, however, still have room for improvement in these areas.

Llama 3.1 8B Model's Struggles with Function Calling

The Llama 3.1 8B model struggled significantly with the more complex function calling tasks compared to the larger 70B model. Some key observations:

  • For the simple "get game scores" function, the 8B model was able to handle it without issues, similar to the 70B model.

  • However, when it came to parallel function calls for tasks like trip planning, the 8B model faltered. It was unable to provide comprehensive information on weather, flights, hotels, and attractions, often hallucinating details or failing to list available options.

  • With the expanded set of functions, the 8B model struggled even more, hallucinating information about events and weather details that were not requested.

  • The 8B model also had trouble with nested function calls for the movie recommendation task. It was unable to properly use the provided tools and resorted to directly suggesting movies instead.

  • The specialized function calling model from Groq also performed poorly in the tests, often requesting more specific details rather than utilizing the provided tools effectively.

In contrast, the Llama 3.1 70B model demonstrated much stronger capabilities in handling parallel and nested function calls, providing comprehensive and accurate responses. The 8B model simply does not seem ready for serious function calling or agent-like tasks, and the specialized Groq model also underperformed in these tests.

For observability and tracing of these LLM function calls, the open-source LangTrace platform proved to be a useful tool to monitor the token usage and API interactions.

Groq's Fine-Tuned Llama 3 Model for Function Calling

The Groq fine-tuned Llama 3 model for function calling struggled in the tests compared to the larger 70B Llama 3.1 model. Some key findings:

  • When asked to provide the score of a Warriors game, the model requested more specific details like the date or opponent team, rather than using the provided "get game scores" function.
  • For trip planning requests, the model repeatedly asked for more specific details like travel dates, rather than using the provided functions to generate a response.
  • For the movie night recommendation task, the model struggled to use the nested functions and often resorted to providing a direct movie recommendation instead.

Overall, the Groq fine-tuned Llama 3 model did not perform as well as the 70B Llama 3.1 model on the function calling and tool usage tests. The 70B model demonstrated strong capabilities in parallel and nested function calls, while the fine-tuned model seemed to have difficulty leveraging the provided tools and functions. Further optimization or fine-tuning may be required for the Groq model to match the performance of the larger Llama 3.1 version on these types of tasks.
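For reference, running the same tests against the fine-tuned variant only requires swapping the model id in the earlier snippets; the id below is the tool-use preview model Groq published at the time and is given as an assumption:

```python
# Same request as before; only the model id changes to Groq's tool-use fine-tune.
response = client.chat.completions.create(
    model="llama3-groq-70b-8192-tool-use-preview",  # assumed Groq model id
    messages=messages,
    tools=tools,
)
```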

Conclusion

The 70B Llama 3.1 model, served through the Groq API, performed exceptionally well in the function calling and tool usage tests. It handled parallel function calls as well as nested function calls with ease, demonstrating its strong capabilities as an agentic system.

In contrast, the 8B Llama 3.1 model struggled with these more complex tasks, highlighting the importance of using larger and more capable language models for such applications.

The specialized function calling model from Groq, however, did not perform as well as expected, struggling even on the simple game-scores example. This suggests that the fine-tuning process for this model may not have been as effective as hoped.

For observability and tracing purposes, the open-source LangTrace AI platform proved to be a valuable tool, providing detailed insights into the token usage and API calls made by the language models during the experiments.

Overall, the results demonstrate the potential of large language models like LLAMA 3.1 for function calling and tool usage, but also highlight the need for continued research and development to improve the capabilities of smaller and specialized models in this area.

FAQ