Discover Mistral's Powerful 'Mr Large 2' Model: Outperforming GPT-4 on Key Benchmarks

Discover Mistral's Powerful 'Mr Large 2' Model: Outperforming GPT-4 on Key Benchmarks. Mistral's new 123B-parameter model outshines GPT-4 in code generation, math, and multilingual capabilities. Explore its advanced function calling and retrieval skills.

January 13, 2025

party-gif

Unlock the power of a cutting-edge language model with Mistral's latest release, Mr. Large 2. This advanced AI system outperforms industry giants in code generation, mathematics, and multilingual capabilities, all while maintaining a smaller footprint. Discover how this versatile model can elevate your projects and streamline your workflows.

Capabilities of Mistral Large 2 Model

The Mistral Large 2 model, recently released by Mistral, is a powerful language model that outperforms the state-of-the-art 405B model on a number of important benchmarks. Despite being significantly smaller in size, with only 123 billion parameters compared to 405 billion, the Mistral Large 2 model demonstrates impressive capabilities.

One of the key strengths of the Mistral Large 2 model is its improved performance in code generation and mathematics/reasoning tasks. It also provides much stronger multilingual support, with the ability to handle up to 80 programming languages and support for languages such as French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean.

Another notable feature of the Mistral Large 2 model is its enhanced contact window of 128,000 tokens, allowing it to handle longer context compared to earlier releases. This makes it particularly well-suited for real-world applications that require handling of long, multi-turn conversations.

The model has also been trained with a focus on minimizing hallucination, a common issue with large language models. It is now better equipped to acknowledge when it lacks sufficient information to provide a confident answer, reducing the risk of generating plausible but incorrect or irrelevant information.

Additionally, the Mistral Large 2 model excels in instruction following and alignment, making it a strong candidate for applications that require precise task execution and handling of complex, multi-step instructions.

The model's capabilities in tool use and function calling are also noteworthy. It can perform parallel and sequential function calls, allowing for agent orchestration and enhanced retrieval skills, which are crucial for many business and enterprise applications.

Overall, the Mistral Large 2 model represents a significant advancement in the field of large language models, offering a compelling combination of performance, efficiency, and versatility. Its release further highlights the rapid progress in the open-source AI landscape, challenging the dominance of proprietary models and providing new opportunities for developers and researchers.

Benchmarks and Comparisons with Other Models

The Mr Large 2 model from Mistol is outperforming the 405b model, which was previously considered one of the state-of-the-art models, both for proprietary and open-source models. According to the blog post, Mr Large 2 is significantly more capable in code generation, mathematics, and reasoning. It also provides much stronger multilingual support and advanced function calling capabilities.

The model has a context window of 128,000 tokens, supporting a much larger context compared to some of the earlier releases. It is multilingual, with support for French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean. Additionally, it supports over 80 programming languages.

In terms of benchmarks, the Mr Large 2 model is on par with GPT-4 and outperforms the 405b model in most of the benchmarks, despite being only one-third the size of the 405b model (123 billion parameters compared to 405 billion).

One of the key focus areas during the training of Mr Large 2 was to minimize the model's tendency to hallucinate or generate plausible-sounding but factually incorrect or irrelevant information. This has been a significant issue with large language models, and it seems that Mistol has paid close attention to this problem, resulting in a model with reduced hallucination.

Another improvement is in instruction following and alignment. According to the blog post, this model is particularly better at following precise instructions and handling long, multi-turn conversations, which is crucial for real-world applications. Smaller models tend to suffer in performance when it comes to long, multi-turn conversations.

The model also has enhanced tool use and function calling capabilities, which are practical applications for businesses and enterprises. It can perform both parallel and sequential function calls, and on benchmarks dedicated to function calling, it even outperforms GPT-4 and Chinchilla 3.5, which is a significant achievement.

It's important to note that while the benchmarks are promising, it's always recommended to do your own evaluation and "vibe check" for your specific applications, as the performance of these models can vary depending on the prompts and data used for testing.

Improved Hallucination Reduction and Instruction Following

One of the key focus areas during the training of Mr. Large 2 was to minimize the model's tendency to hallucinate or generate plausible-sounding but factually incorrect or irrelevant information. This has been a significant issue with large language models, but the creators of Mr. Large 2 have paid close attention to it.

They have gathered training data where the model's hallucination has been reduced substantially. As a result, the new model is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer.

Another improvement in Mr. Large 2 is its instruction following and alignment capabilities. According to the creators, this model is particularly better at following precise instructions and handling long, multi-turn conversations. This is a significant enhancement, as smaller models tend to suffer in performance when it comes to long, multi-turn interactions.

The improved hallucination reduction and instruction following capabilities of Mr. Large 2 are expected to make it more suitable for real-world applications, where accurate and reliable responses are crucial.

Tool Use and Function Calling Capabilities

The Mr Large 2 model from Anthropic has enhanced capabilities when it comes to tool use and function calling. This allows the model to interact with external tools and functions to gather information and perform tasks, making it more practical for real-world applications.

The process works as follows:

  1. The LLM analyzes the user query and determines whether it needs to use a tool or not. If no tool is required, it will generate a direct response.
  2. If the LLM decides to use a tool, it will select an appropriate tool from a predefined list based on the task at hand.
  3. The LLM will then generate the necessary inputs for the selected tool.
  4. The user's code will need to execute the tool call or function and pass the response back to the LLM.
  5. The LLM will then use the tool's response to generate the final output for the user.

This functionality is enabled through the use of a JSON schema that describes the available tools, their names, descriptions, input parameters, and required outputs. The LLM can then reference this information to determine the appropriate tool to use and how to interact with it.

The Mr Large 2 model has demonstrated strong performance on benchmarks focused on function calling, even outperforming GPT-4 and Chinchilla 3.5 in some cases. This suggests that the model's tool use and function calling capabilities are a significant improvement over previous LLMs.

Overall, the enhanced tool use and function calling abilities of the Mr Large 2 model make it a more practical and versatile tool for real-world applications, where the ability to interact with external data and systems is crucial.

Pricing and Availability of Mistral Models

Mistral AI is making their models available through various API providers, including Google, Microsoft, Amazon, Bedrock, and IBM Watson. The pricing for using the Mr. Large 2 model through their platform seems to be similar to the pricing for the 405B model from other providers.

However, the output pricing from Mistral's platform appears to be a bit more expensive compared to Anthropic's Fireworks AI, which offers $3 per million tokens for both input and output for the 405B model.

It's important to note that the pricing may vary depending on the API provider and the specific usage requirements. Businesses and developers interested in using the Mr. Large 2 model will need to contact Mistral AI to obtain a commercial license, as the model is released under the Mistral Research License and is not freely available for commercial use.

Overall, the availability of the Mr. Large 2 model through multiple API providers gives users more options to choose from, but the pricing may need to be carefully evaluated based on the specific needs and usage patterns of the application.

Hands-on Example: Integrating Function Calling

To demonstrate the function calling capabilities of the Mr. Large 2 model, let's walk through a step-by-step example:

  1. Install the necessary dependencies:

    • Install the Mistral AI Python client: pip install mistral-ai
    • Import the required libraries:
      1import pandas as pd 2from functools import partial 3from mistral_ai.client import MistralClient
  2. Prepare the sample data:

    • Create a sample DataFrame with transaction data:
      1data = { 2 'customer_id': [1, 2, 3, 4, 5], 3 'transaction_id': ['tx1', 'tx2', 'tx3', 'tx4', 'tx5'], 4 'payment_amount': [100.0, 50.0, 75.0, 25.0, 150.0], 5 'payment_date': ['2023-04-01', '2023-04-02', '2023-04-03', '2023-04-04', '2023-04-05'], 6 'status': ['paid', 'pending', 'paid', 'refunded', 'paid'] 7} 8df = pd.DataFrame(data)
  3. Define the tool functions:

    • Create functions to retrieve payment status and payment date:
      1def retrieve_payment_status(data, transaction_id): 2 return {'status': data[data['transaction_id'] == transaction_id]['status'].values[0]} 3 4def retrieve_payment_date(data, transaction_id): 5 return {'date': data[data['transaction_id'] == transaction_id]['payment_date'].values[0]}
  4. Describe the tool usage:

    • Provide a JSON schema to describe the tool functions:
      1tools = [ 2 { 3 'type': 'function', 4 'name': 'retrieve_payment_status', 5 'description': 'Retrieves the payment status for a given transaction ID', 6 'parameters': [ 7 {'name': 'data', 'type': 'object', 'description': 'The transaction data'}, 8 {'name': 'transaction_id', 'type': 'string', 'required': True, 'description': 'The transaction ID'} 9 ], 10 'returns': {'type': 'object', 'description': 'The payment status'} 11 }, 12 { 13 'type': 'function', 14 'name': 'retrieve_payment_date', 15 'description': 'Retrieves the payment date for a given transaction ID', 16 'parameters': [ 17 {'name': 'data', 'type': 'object', 'description': 'The transaction data'}, 18 {'name': 'transaction_id', 'type': 'string', 'required': True, 'description': 'The transaction ID'} 19 ], 20 'returns': {'type': 'object', 'description': 'The payment date'} 21 } 22] 23 24tools_dict = { 25 'retrieve_payment_status': partial(retrieve_payment_status, df), 26 'retrieve_payment_date': partial(retrieve_payment_date, df) 27}
  5. Interact with the Mr. Large 2 model:

    • Set up the Mistral AI client and the model:
      1client = MistralClient(api_key='your_api_key') 2model = client.chat_model('mr-large-v2')
    • Initiate the conversation and let the model select the appropriate tool:
      1messages = [{'content': 'What is the status of my transaction tx3?', 'role': 'user'}] 2response = model.generate_response(messages, tools=tools) 3print(response)
    • The model will select the retrieve_payment_status tool, execute the function, and generate the final response:
      {'content': 'Your transaction tx3 is marked as paid.', 'role': 'assistant'}
      

This example demonstrates how the Mr. Large 2 model can integrate with external functions or tools to provide more comprehensive and accurate responses. The model analyzes the user's query, selects the appropriate tool, and then generates the final response by combining the tool's output with its own language generation capabilities.

You can further expand this example by adding more tools, handling nested tool calls, and exploring the model's other capabilities, such as multi-lingual support and advanced function calling.

Closing Thoughts on Model Size and Open-Source Developments

It seems like the size of these large language models (LLMs) does not matter as much as we are thinking. It's probably more about the quality of data and the compute scaling, because we have a relatively small model compared to GPT-4 and it's getting better results than that model. Even the Llama 3.17 billion parameter model is very close to GPT-4 now.

You do have to give credit to companies like OpenAI or Anthropic because they are able to release multi-model offerings or at least they are providing multi-model options through their API endpoints. When it comes to open-source, we haven't really seen a very capable multi-model model yet, but hopefully that will change pretty soon.

When it comes to the ecosystem, there is definitely space for both closed-weight and open-weight models. But it's really great to see that open-weight is not only catching up, but seems to be ahead, because we had two releases in the last two days that are state-of-the-art. It's hard to predict what the future will bring, but the rapid pace of progress is exciting.

Conclusion

The release of the MrLarge 2 model by Mistol is a significant development in the world of large language models. This 123 billion parameter model has outperformed the 405 billion parameter Llama 3.1 model on a number of important benchmarks, including code generation and mathematics/reasoning tasks.

One of the key highlights of the MrLarge 2 model is its focus on reducing the tendency to hallucinate or generate plausible but incorrect information. The model has been trained to acknowledge when it lacks sufficient information to provide a confident answer, which is a crucial improvement for real-world applications.

Another notable feature of the MrLarge 2 is its enhanced instruction following and alignment capabilities, as well as its improved tool use and function calling abilities. These capabilities make the model particularly well-suited for enterprise and business applications that require precise interactions and the ability to integrate with external tools and systems.

While the author cautions against blindly trusting benchmark results and recommends conducting one's own evaluations, the MrLarge 2 model appears to offer a compelling balance of performance and model size, allowing for deployment on a single NVIDIA A100 node compared to the multiple nodes required for the larger 405B model.

Overall, the release of the MrLarge 2 model is a significant step forward in the development of open-source large language models, and it will be interesting to see how it compares to proprietary models from companies like OpenAI and Anthropic in real-world applications.

FAQ