Are you excited about Generative AI and Large Language Models (LLMs, for short)? If so, you're about to discover something game-changing: Intelligent Prompt Routing and Prompt Caching with Amazon Bedrock. These features can help you cut inference costs, speed up responses, and still keep the quality of answers top-notch.
Quick context: Amazon Bedrock can now route simpler requests to smaller, cheaper foundation models and save the more powerful models for when they're actually needed. Plus, you can cache often-used prompt content so you don't pay to reprocess it over and over.
Imagine you have two AI models: Model A (bigger, pricier, super capable) and Model B (smaller, cheaper, faster). If your user’s question is straightforward, you want to use Model B to save money and get a quick response. But if the user asks a tricky question, you want Model A for a detailed answer. Amazon Bedrock’s Intelligent Prompt Routing does this automatically. It checks how complex each request is and routes it to the best model.
It’s like having two gears in your AI engine: one for short or basic queries, and another for advanced or bigger tasks. No more guesswork on which model to pick each time. You can cut costs by up to 30% without hurting performance.
Here’s a simple AWS CLI snippet showing how you might call a “Prompt Router” instead of a single foundation model. We’re calling a made-up ARN (Amazon Resource Name) for the router:
aws bedrock-runtime converse \
    --model-id arn:aws:bedrock:us-east-1:111122223333:default-prompt-router/smart-router-demo:1 \
    --messages '[
        {
            "role": "user",
            "content": [
                { "text": "How do I sum two numbers in Python?" }
            ]
        }
    ]'
In this request, the router automatically decides if your prompt is “easy” or “hard” and picks the right model. You don’t have to do anything special—just ask your question!
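If you'd rather call the router from Python, a minimal boto3 sketch looks like the one below (same made-up router ARN as above). One nice bonus: when a prompt router handles your request, the response typically carries trace metadata telling you which underlying model was actually invoked. The exact shape of that trace field is an assumption here, so we read it defensively:

import boto3

# Minimal boto3 sketch: call the prompt router exactly like a regular model ID.
# The router ARN below is made up for illustration -- substitute your own.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="arn:aws:bedrock:us-east-1:111122223333:default-prompt-router/smart-router-demo:1",
    messages=[
        {"role": "user", "content": [{"text": "How do I sum two numbers in Python?"}]}
    ],
)

# Print the answer text.
for block in response["output"]["message"]["content"]:
    print(block.get("text", ""))

# If routing trace info is present, show which model actually served the request.
invoked = response.get("trace", {}).get("promptRouter", {}).get("invokedModelId")
if invoked:
    print("Routed to:", invoked)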
Prompt caching is the other superpower. Sometimes we ask the AI similar questions over and over, or we give it the same context repeatedly. With caching, you can store those repetitive parts and reuse them, which can reduce costs (sometimes by up to 90%) and speed up response time (by up to 85%).
For instance, if your conversation always starts with a large chunk of text (like a long product spec), the model normally has to process it on every request, racking up input-token fees. Prompt caching remembers that text for a short window (around 5 minutes). If you reuse it unchanged within that window, you aren't charged the full input price for those tokens all over again. Game changer, right?
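To get a feel for the savings, here's a quick back-of-the-envelope sketch. All of the numbers are hypothetical, and the 90% figure is the best-case discount mentioned above, not a quote from the pricing page:

# Back-of-the-envelope: how much of your input-token bill is repeated context?
# All numbers here are hypothetical, for illustration only.
context_tokens = 2_000        # the long product spec you resend every turn
turns = 100                   # number of requests that reuse that context
cached_read_discount = 0.90   # "up to 90%" cheaper for cached reads (best case)

without_cache = context_tokens * turns
with_cache = context_tokens + context_tokens * (turns - 1) * (1 - cached_read_discount)

print(f"Context tokens billed without caching: {without_cache:,}")
print(f"Context tokens billed with caching (best case): {with_cache:,.0f}")
# -> 200,000 vs. roughly 21,800 token-equivalents for the repeated context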
Let’s say you have a list of topics you keep referencing while chatting with your AI. You want to cache them so you’re not paying the full price each time. Here’s a very short Python snippet to show how it might look in practice. (Note: This is a hypothetical example to illustrate how caching could be used.)
import boto3

def chat_with_bedrock(message: str, enable_caching: bool = False):
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    # Static context that we reuse on every call -- this is what we want cached.
    prompt_content = [
        {"text": "Here is some context about my app..."},
    ]

    if enable_caching:
        # Mark everything up to this point as a cache checkpoint.
        prompt_content.append({"cachePoint": {"type": "default"}})

    # Add the actual user message at the end.
    prompt_content.append({"text": message})

    response = client.converse(
        modelId="arn:aws:bedrock:us-east-1:111122223333:foundation-model/prompt-caching-demo",
        messages=[
            {"role": "user", "content": prompt_content}
        ]
    )
    return response.get("output", {}).get("message", {}).get("content", [])

# Example usage
if __name__ == "__main__":
    # First call: the context before the cache checkpoint gets written to the cache.
    response1 = chat_with_bedrock("What is the best way to scale this app?", enable_caching=True)
    print("Response #1:", response1)

    # Second call: we send the same cached prefix, so it can be read from the cache
    # instead of being fully reprocessed.
    response2 = chat_with_bedrock("Okay, now how do I monitor my app once it scales?", enable_caching=True)
    print("Response #2:", response2)
In this example, we add a cache checkpoint to store the context about our app. The first request may use a few more tokens (and cost a bit more) because it writes that content to the cache. The second request sends the same cached prefix, so the model reads it from the cache instead of fully reprocessing it. The result? Lower cost and faster answers, especially when you keep reusing large blocks of text.
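Want to verify the cache is actually doing its job? The full converse() response includes a usage block with token counts, and with prompt caching it should also report cache write and cache read counters. Here's a tiny sketch, assuming response is the raw dict from client.converse(...) rather than the trimmed content our helper above returns; we read the fields defensively in case your SDK version names them differently:

# Inspect token usage to confirm cache writes (first call) and reads (later calls).
# `response` is the full dict returned by client.converse(...).
usage = response.get("usage", {})
print("Input tokens:      ", usage.get("inputTokens"))
print("Cache write tokens:", usage.get("cacheWriteInputTokens", 0))  # populated when new content is cached
print("Cache read tokens: ", usage.get("cacheReadInputTokens", 0))   # populated when cached content is reused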
By combining Intelligent Prompt Routing with Prompt Caching, you get a one-two punch of cost efficiency and speed. You can route simple questions to cheaper models, let the big guns handle tricky ones, and cache all that repeated content for later. That’s how you build AI-driven solutions that scale without breaking the bank.
This is all about giving you more control over your AI usage. It’s kind of like using “fast lanes” for easy tasks and “super lanes” for complex ones, plus a memory system to avoid paying twice for the same stuff. We think it’s going to be a big deal for anyone building AI apps on AWS.
Thanks for reading! If you’re curious to try these features, reach out to AWS for preview access or keep an eye on new announcements. We hope this short blog sparks some ideas on how to make your Generative AI apps both smart and cost-friendly. Go forth and build!