Are you excited about Generative AI and Large Language Models (LLMs, for short)? If so, you're about to discover something game-changing: Intelligent Prompt Routing and Prompt Caching with Amazon Bedrock. These features can help you cut inference costs, speed up responses, and still keep the quality of answers top-notch.
Quick context: Amazon Bedrock can now route simpler requests to smaller, cheaper foundation models and save the more powerful models for when they're actually needed. Plus, you can cache often-used prompt content so you don't pay to reprocess it over and over.
Imagine you have two AI models: Model A (bigger, pricier, super capable) and Model B (smaller, cheaper, faster). If your user’s question is straightforward, you want to use Model B to save money and get a quick response. But if the user asks a tricky question, you want Model A for a detailed answer. Amazon Bedrock’s Intelligent Prompt Routing does this automatically. It checks how complex each request is and routes it to the best model.
It’s like having two gears in your AI engine: one for short or basic queries, and another for advanced or bigger tasks. No more guesswork on which model to pick each time. You can cut costs by up to 30% without hurting performance.
Here’s a simple AWS CLI snippet showing how you might call a “Prompt Router” instead of a single foundation model. We’re calling a made-up ARN (Amazon Resource Name) for the router:
aws bedrock-runtime converse \
    --model-id arn:aws:bedrock:us-east-1:111122223333:default-prompt-router/smart-router-demo:1 \
    --messages '[
        {
            "role": "user",
            "content": [
                { "text": "How do I sum two numbers in Python?" }
            ]
        }
    ]'
In this request, the router automatically decides if your prompt is “easy” or “hard” and picks the right model. You don’t have to do anything special—just ask your question!
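If you'd rather call the router from Python, a minimal boto3 sketch looks like the one below (same made-up router ARN as above). One nice bonus: when a prompt router handles your request, the response typically carries trace metadata telling you which underlying model was actually invoked. The exact shape of that trace field is an assumption here, so we read it defensively:

import boto3

# Minimal boto3 sketch: call the prompt router exactly like a regular model ID.
# The router ARN below is made up for illustration -- substitute your own.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="arn:aws:bedrock:us-east-1:111122223333:default-prompt-router/smart-router-demo:1",
    messages=[
        {"role": "user", "content": [{"text": "How do I sum two numbers in Python?"}]}
    ],
)

# Print the answer text.
for block in response["output"]["message"]["content"]:
    print(block.get("text", ""))

# If routing trace info is present, show which model actually served the request.
invoked = response.get("trace", {}).get("promptRouter", {}).get("invokedModelId")
if invoked:
    print("Routed to:", invoked)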
Prompt caching is the other superpower. Sometimes we ask the AI similar questions over and over, or we give it the same context repeatedly. With caching, you can store those repetitive parts and reuse them, which can reduce costs (sometimes by up to 90%) and speed up response time (by up to 85%).
For instance, if your conversation always starts with a large chunk of text (like a long product spec), the model normally has to process it on every request, racking up input-token fees. Prompt caching remembers that text for a short window (around 5 minutes). If you reuse it unchanged within that window, you aren't charged the full input price for those tokens all over again. Game changer, right?
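To get a feel for the savings, here's a quick back-of-the-envelope sketch. All of the numbers are hypothetical, and the 90% figure is the best-case discount mentioned above, not a quote from the pricing page:

# Back-of-the-envelope: how much of your input-token bill is repeated context?
# All numbers here are hypothetical, for illustration only.
context_tokens = 2_000        # the long product spec you resend every turn
turns = 100                   # number of requests that reuse that context
cached_read_discount = 0.90   # "up to 90%" cheaper for cached reads (best case)

without_cache = context_tokens * turns
with_cache = context_tokens + context_tokens * (turns - 1) * (1 - cached_read_discount)

print(f"Context tokens billed without caching: {without_cache:,}")
print(f"Context tokens billed with caching (best case): {with_cache:,.0f}")
# -> 200,000 vs. roughly 21,800 token-equivalents for the repeated context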
Let’s say you have a list of topics you keep referencing while chatting with your AI. You want to cache them so you’re not paying the full price each time. Here’s a very short Python snippet to show how it might look in practice. (Note: This is a hypothetical example to illustrate how caching could be used.)
import boto3

def chat_with_bedrock(message: str, enable_caching: bool = False):
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    # Static context that we reuse on every call -- this is what we want cached.
    prompt_content = [
        {"text": "Here is some context about my app..."},
    ]

    if enable_caching:
        # Mark everything up to this point as a cache checkpoint.
        prompt_content.append({"cachePoint": {"type": "default"}})

    # Add the actual user message at the end.
    prompt_content.append({"text": message})

    response = client.converse(
        modelId="arn:aws:bedrock:us-east-1:111122223333:foundation-model/prompt-caching-demo",
        messages=[
            {"role": "user", "content": prompt_content}
        ]
    )
    return response.get("output", {}).get("message", {}).get("content", [])

# Example usage
if __name__ == "__main__":
    # First call: the context before the cache checkpoint gets written to the cache.
    response1 = chat_with_bedrock("What is the best way to scale this app?", enable_caching=True)
    print("Response #1:", response1)

    # Second call: we send the same cached prefix, so it can be read from the cache
    # instead of being fully reprocessed.
    response2 = chat_with_bedrock("Okay, now how do I monitor my app once it scales?", enable_caching=True)
    print("Response #2:", response2)
In this example, we add a cache checkpoint to store the context about our app. The first request may use a few more tokens (and cost a bit more) because it writes that content to the cache. The second request sends the same cached prefix, so the model reads it from the cache instead of fully reprocessing it. The result? Lower cost and faster answers, especially when you keep reusing large blocks of text.
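Want to verify the cache is actually doing its job? The full converse() response includes a usage block with token counts, and with prompt caching it should also report cache write and cache read counters. Here's a tiny sketch, assuming response is the raw dict from client.converse(...) rather than the trimmed content our helper above returns; we read the fields defensively in case your SDK version names them differently:

# Inspect token usage to confirm cache writes (first call) and reads (later calls).
# `response` is the full dict returned by client.converse(...).
usage = response.get("usage", {})
print("Input tokens:      ", usage.get("inputTokens"))
print("Cache write tokens:", usage.get("cacheWriteInputTokens", 0))  # populated when new content is cached
print("Cache read tokens: ", usage.get("cacheReadInputTokens", 0))   # populated when cached content is reused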
By combining Intelligent Prompt Routing with Prompt Caching, you get a one-two punch of cost efficiency and speed. You can route simple questions to cheaper models, let the big guns handle tricky ones, and cache all that repeated content for later. That’s how you build AI-driven solutions that scale without breaking the bank.
This is all about giving you more control over your AI usage. It’s kind of like using “fast lanes” for easy tasks and “super lanes” for complex ones, plus a memory system to avoid paying twice for the same stuff. We think it’s going to be a big deal for anyone building AI apps on AWS.
Thanks for reading! If you’re curious to try these features, reach out to AWS for preview access or keep an eye on new announcements. We hope this short blog sparks some ideas on how to make your Generative AI apps both smart and cost-friendly. Go forth and build!