Leveraging Context Caching to Optimize Long-Form LLM Usage

Discover how to leverage Google's context caching feature in the Gemini API to optimize long-form LLM usage and reduce processing time and costs. Learn the implementation details and potential benefits for developers building on the Gemini API.

October 6, 2024


Unlock the power of long-context LLMs with Google's Gemini API and its new context caching feature. Discover how this innovative solution can significantly reduce processing time, latency, and costs, making it easier to leverage large datasets in your AI applications. Explore the practical implementation details and learn how to effectively apply this game-changing technology.

Understand Caching and Its Benefits

Google's recent addition of context caching to the Gemini API aims to address some of the major limitations of working with long contexts in large language models (LLMs). While long-context LLMs can hold a significant amount of information, they come with several issues:

  • Increased Processing Time: With every query, the entire context has to be sent to the LLM, so a large amount of data is reprocessed each time, increasing processing time.
  • High Latency: The large data transfers required for each query lead to high latency.
  • Higher Costs: Since API providers charge based on the number of tokens, the increased data transfer leads to higher costs.

Google's context caching feature is designed to mitigate these issues. Here's how it works (a minimal code sketch follows the list):

  1. Initialize the Cache: You provide a system instruction or a large context (e.g., documents, video files, audio files) that you want to cache.
  2. Cache Identification: Each cache has a unique identifier, which can be thought of as the name of the cache, and a "time to live" parameter to determine the cache's expiration.
  3. Cache Retrieval: When the Gemini API receives a user query, it analyzes the available cache datasets, retrieves the appropriate cache, and combines it with the user's query for processing.
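
As a minimal sketch of this flow, assuming the google-generativeai Python SDK, with the file name and document variable below serving only as placeholders for whatever context you want to cache:

    import datetime
    import google.generativeai as genai
    from google.generativeai import caching

    genai.configure(api_key="YOUR_API_KEY")

    # Placeholder: load whatever large context you want to cache (docs, transcripts, ...)
    large_document_text = open("your_large_document.txt").read()

    # 1. Initialize the cache with a large context, a name, and a time to live
    cache = caching.CachedContent.create(
        model="models/gemini-1.5-flash-001",
        display_name="my-docs-cache",       # identifier used to recognize this cache later
        contents=[large_document_text],
        ttl=datetime.timedelta(hours=1),
    )

    # 2-3. Subsequent queries are combined with the cached context on the server side
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content("Summarize the key points of the document.")
    print(response.text)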

This approach offers several benefits:

  • Reduced Processing Time: By reusing the cached data, the system only needs to process the user's query, reducing the overall processing time.
  • Lower Latency: Sending only the user's query, instead of the entire context, results in lower latency.
  • Cost Savings: Reducing the number of tokens sent with each query leads to lower costs.

Google claims that using caching for up to 2,128,000 tokens can cut costs to roughly a quarter of what it would cost to send the entire context with every query.

It's important to note that there are some limitations and considerations when using context caching:

  • Minimum Input Token Count: The minimum input token count for context caching is currently set at 32,000 tokens.
  • Maximum Token Count: The maximum number of tokens that can be cached is limited by the model's maximum context window, which is around 2 million tokens for Gemini 1.5 Pro and around 1 million tokens for Gemini 1.5 Flash.
  • Storage Cost: There is a storage cost associated with the cached content, which is $1 per million tokens per hour.

Overall, the context caching feature in Google's Gemini API is a valuable addition that can significantly improve the performance and cost-effectiveness of LLM-based applications, especially for those dealing with large amounts of context.

Explore the Caching Process

Google's recent addition of context caching to the Gemini API aims to address the limitations of long-context language models. By caching the context, the system can reduce processing time, latency, and cost associated with sending the entire context with every query.

The caching process works as follows:

  1. Initialize the Cache: You provide a system instruction or a large context (e.g., documents, video files, audio files) that you want to cache. Each cache has a unique identifier and a "time to live" parameter to determine the cache's expiration.

  2. Cache Storage: The Gemini API's internal storage system, optimized for quick retrieval, stores the cached data.

  3. Query Processing: When the Gemini API receives a user query, it analyzes the available cache datasets, identifies the appropriate cache based on the unique identifier, and verifies the cache's validity by checking the "time to live" parameter. The API then combines the cached data and the user's query as input for processing.

  4. Reduced Costs and Latency: By using cached data, the system reduces the number of tokens sent with each query, leading to lower processing time and cost. Google estimates that using caching for up to 2,128,000 tokens can cut costs to roughly a quarter of sending the entire context with every query.

  5. Storage Cost: The storage cost for the cached content is $1 per million tokens per hour. The total cost depends on factors like the cache token count and the "time to live" parameter.

  6. Supported Models: Context caching is currently supported by both the Gemini Pro and Gemini Flash models.

  7. Minimum and Maximum Tokens: The minimum input token count for context caching is 32,000 tokens, and the maximum is the model's maximum context window, around 2 million tokens for Gemini 1.5 Pro and around 1 million for Gemini 1.5 Flash.

By leveraging context caching, developers can optimize their usage of the Gemini API, reducing costs and improving performance, especially for applications that require frequent queries on large datasets.
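
To see the effect in practice, you can inspect the usage metadata returned with each response. The sketch below assumes a `model` created from cached content as in the earlier sketch; the field names follow the google-generativeai Python SDK and may differ in other client libraries.

    # Ask a question against the cached context and break down the token usage
    response = model.generate_content("What are the main topics covered in the document?")

    usage = response.usage_metadata
    print("Prompt tokens (includes the cached context):", usage.prompt_token_count)
    print("Tokens served from the cache:", usage.cached_content_token_count)
    print("Generated (candidate) tokens:", usage.candidates_token_count)
    print("Total tokens reported:", usage.total_token_count)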

Understand Token Limits and Costs

When using the context caching feature in the Gemini API, there are a few important considerations regarding token limits and costs:

  1. Minimum Input Token Count: The minimum input token count for context caching is 32,000 tokens. This means you'll need to cache at least 32,000 tokens for the feature to work.

  2. Maximum Token Count: The maximum number of tokens you can cache is the maximum context window of the given model, around 2 million tokens for Gemini 1.5 Pro and around 1 million for Gemini 1.5 Flash.

  3. Storage Cost: The storage cost for the cached content is $1 per million tokens per hour, charged in addition to the regular API usage fees (see the worked example after this list).

  4. Time to Live (TTL): When creating a cache, you can specify a "time to live" parameter to determine how long the cache should be kept. If you don't provide a value, the default is 1 hour. The minimum TTL is 60 seconds.

  5. Token Counts: When using the cached content, the total token count includes both the cached tokens and the new input tokens from the user query. This combined token count is what is used for billing purposes.

  6. Caching Availability: Context caching is currently supported by both the Gemini Pro and Flash models.
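
As a back-of-the-envelope illustration of the storage cost, the helper below is plain arithmetic (no API calls) using the $1 per million tokens per hour rate quoted above; the function name and example numbers are purely illustrative.

    def estimate_cache_storage_cost(token_count: int, ttl_hours: float,
                                    rate_per_million_per_hour: float = 1.0) -> float:
        # Storage charge (USD) for keeping token_count tokens cached for ttl_hours
        return (token_count / 1_000_000) * ttl_hours * rate_per_million_per_hour

    # Example: a 320,000-token cache kept alive for 2 hours costs about $0.64 in storage
    print(estimate_cache_storage_cost(320_000, ttl_hours=2))  # 0.64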

By understanding these token limits and cost considerations, you can effectively leverage the context caching feature to reduce processing time and costs when using the Gemini API.

Implement Caching with Code Examples

To implement caching with the Gemini API from Python, we'll use the google-generativeai SDK (with pdfplumber for loading a PDF) and follow these steps:

  1. Install the required packages:

    !pip install -q -U google-generativeai pdfplumber
  2. Import the necessary modules:

    import datetime

    import google.generativeai as genai
    from google.generativeai import caching
    from IPython.display import Markdown, display
    import pdfplumber
  3. Set up the Gemini API client with your Google API key:

    api_key = "your_google_api_key"
    genai.configure(api_key=api_key)
  4. Load the PDF content and convert it to a single string:

    with pdfplumber.open("path/to/your/pdf/file.pdf") as pdf:
        pages = [page.extract_text() or "" for page in pdf.pages]
    content = "\n".join(pages)

    print(f"Total pages: {len(pages)}")
    print(f"Total words: {len(content.split())}")
    # The token count is what matters for the 32,000-token caching minimum
    print(f"Total tokens: {genai.GenerativeModel('models/gemini-1.5-flash-001').count_tokens(content).total_tokens}")
  5. Create a cached content object and store it:

    cache = caching.CachedContent.create(
        model="models/gemini-1.5-flash-001",
        display_name="my-rag-system-cache",
        system_instruction="You are an expert on RAG systems. Answer questions based on the provided text.",
        contents=[content],
        ttl=datetime.timedelta(hours=2),  # 2 hours
    )
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
  6. Use the cached model to answer questions:

    queries = [
        "What is RAG? What are the main components?",
        "How to use HuggingFace in LangChain? Provide code examples.",
        "How does routing work in LangChain?",
    ]

    for query in queries:
        response = model.generate_content(query)
        display(Markdown(response.text))
        print(f"Total tokens used: {response.usage_metadata.total_token_count}")
  7. Manage the cache:

    # Update the cache time to live to 4 hours
    cache.update(ttl=datetime.timedelta(hours=4))

    # List all cached contents in the project
    for c in caching.CachedContent.list():
        print(f"Name: {c.display_name}")
        print(f"Created: {c.create_time}")
        print(f"Expires: {c.expire_time}")

    # Delete the cached content before it expires
    cache.delete()

This code demonstrates how to implement caching with the Gemini API, including loading content from a PDF file, creating a cached content object, using the cached model to answer questions, and managing the cache (updating the time-to-live and deleting cached content).

Manage Cache Updates and Expiration

Managing cache updates and expiration comes down to a few key operations (a short code sketch follows this list):

  1. Updating Cache Time-to-Live (TTL): You can update the TTL of a cache by calling the update() function on the cache object and providing a new TTL value. This allows you to extend the expiration time of the cached content.

  2. Viewing Cache Metadata: While you cannot directly retrieve or view the cached content, you can access metadata about the cache, such as the total number of tokens used, the number of tokens from the prompt, and the number of candidate tokens generated by the model.

  3. Listing Cached Content: You can list all the cached content you have created (for example, with CachedContent.list() in the Python SDK). The listing shows the name, creation time, and expiration time of each cache.

  4. Deleting Caches: If you want to delete a cache before its expiration time, you can call the delete() function on the cache object. This will remove the cached content from the system.
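
Tying these operations together, here is a small maintenance sketch assuming the google-generativeai Python SDK, a `cache` object created as in the implementation section, and a hypothetical `still_needed` flag supplied by your application:

    import datetime
    from google.generativeai import caching

    # Assumes genai.configure(api_key=...) has already been called
    def refresh_or_drop(cache: caching.CachedContent, still_needed: bool) -> None:
        # Extend a cache that is still in use; otherwise delete it to stop storage charges
        if still_needed:
            cache.update(ttl=datetime.timedelta(hours=4))  # push the expiration 4 hours out
        else:
            cache.delete()

    # Review every cache and its expiration time before deciding what to keep
    for c in caching.CachedContent.list():
        print(c.display_name, "expires at", c.expire_time)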

By managing the cache updates and expiration, you can optimize the usage of the caching feature to reduce costs and improve performance when working with the Gemini API.

Conclusion

The introduction of context caching in Google's Gemini API is a significant development that addresses some of the major limitations of long-context language models. By allowing users to cache large amounts of context data, the API can reduce processing time, lower latency, and decrease costs associated with token-based billing.

The caching system works by allowing users to initialize a cache with a large dataset, such as documents, video files, or audio files. Each cache has a unique identifier and a time-to-live parameter, which determines how long the cache will be stored. When a user query is received, the API analyzes the available caches, combines the cached data with the user's query, and processes the input to generate a response.

The benefits of this approach are twofold. First, by only sending the user's query instead of the entire context, the amount of data transferred is significantly reduced, leading to lower latency. Second, the cost savings can be substantial: the API charges per token, and reusing cached tokens instead of resending the full context can bring per-query costs down to roughly a quarter.

While the current implementation of context caching in the Gemini API has some limitations, such as a minimum input token count of 32,000, it is a promising step forward. As the technology matures, we can expect to see further improvements, such as lower minimum token requirements and potentially even reduced latency.

Overall, the introduction of context caching in the Gemini API is a welcome development that can greatly benefit developers working with large-scale language models. By addressing some of the key challenges, this feature has the potential to unlock new use cases and drive further innovation in the field of natural language processing.

FAQ