ElastiCache: Making 15 API Calls Feel Like One to the Shopper
By a Senior AWS Solutions Architect | #ComposableCommerce #ElastiCache #Performance #Redis
Here's the uncomfortable truth about composable commerce performance: a headless storefront rendering a product detail page might make 8–15 API calls to different PBCs. Product data from the Catalogue PBC. Pricing from the Pricing PBC. Inventory status from the Inventory PBC. Recommendations from the ML PBC. Reviews from the Review PBC. Personalisation signals from the Customer Data Platform.
Done naively, that's 8–15 database queries per page render. At 50,000 concurrent users, that's potentially 750,000 database queries per second across your backend PBCs. Most database clusters will either refuse that load or respond too slowly to be usable.
ElastiCache is how you make that problem manageable — and how you turn the multi-PBC architecture from a potential performance liability into a genuine competitive advantage.
Cache-Aside: The Universal Pattern for Composable PBCs
Every PBC that serves read-heavy, cacheable data should implement the cache-aside (lazy loading) pattern. The logic is simple and worth stating explicitly:
// Product Catalogue PBC — cache-aside with versioned keys
const CACHE_VERSION = "v3"; // Increment on schema change to bust cache
async function getProduct(sku) {
const cacheKey = `${CACHE_VERSION}:product:${sku}`;
// Step 1: check cache first
const cached = await redis.get(cacheKey);
if (cached) {
metrics.increment('cache.hit', { pbc: 'catalogue', type: 'product' });
return JSON.parse(cached); // Sub-millisecond, no database involved
}
metrics.increment('cache.miss', { pbc: 'catalogue', type: 'product' });
// Step 2: cache miss — go to database
const product = await db.query(
`SELECT p.*, c.name as category_name, b.name as brand_name
FROM products p
JOIN categories c ON p.category_id = c.id
JOIN brands b ON p.brand_id = b.id
WHERE p.sku = ? AND p.active = 1`,
[sku]
);
if (!product) return null;
// Step 3: populate cache with appropriate TTL
await redis.setex(cacheKey, 300, JSON.stringify(product)); // 5 minute TTL
return product;
}
With this pattern, the first request for a product goes to the database. Every subsequent request over the next 5 minutes is served from Redis in under 1ms. For a popular SKU during a Black Friday sale — viewed 50,000 times per minute — the database serves 1 request. Redis serves 49,999.
Versioned cache keys (v3:product:sku) are the clean solution to cache invalidation. When the product schema changes (adding a new field, restructuring the response), increment the version prefix and all caches are naturally invalidated as their old keys go unqueried and expire. No explicit purge operation needed.
TTL as a Business Decision, Not a Technical One
Every PBC has data with different staleness tolerance. The TTL configuration should be driven by the business impact of stale data, not a uniform "we cache everything for 5 minutes."
| Data | PBC | TTL | Staleness Impact |
|---|---|---|---|
| Product name/description | Catalogue | 600s (10 min) | Brand team updates once a week — very low |
| Product price | Pricing | 30s | Dynamic pricing, promotion eligibility — high |
| Inventory count (display) | Inventory | 10s | Customers see "3 left" — tolerance for brief lag |
| Inventory (checkout confirm) | Inventory | 0 — always live | Must be accurate at point of sale |
| Recommendation list | ML PBC | 300s (5 min) | Personalisation can lag slightly |
| Customer loyalty points | Loyalty | 60s | Points earned, customer checks balance |
| Category/navigation tree | Catalogue | 3600s (1 hour) | Changes only with editorial releases |
| Session data | Auth | 1800s (30 min) | User's active session — sliding TTL |
The critical design decision: inventory at checkout is always a live database call. Cache the display number (show "3 left in stock"), but when a customer actually clicks "Place Order," always verify inventory synchronously against the database. Overselling because you trusted a 10-second-old cache is a customer service problem, a logistics problem, and a brand problem simultaneously.
Memcached vs. Redis: Choose Redis for Composable Commerce
The default answer for composable commerce is Redis. Here's why:
Memcached is a simple, multithreaded key-value store. It's fast for simple get/set operations, horizontally scalable, but has no persistence, no replication, and no data structures beyond strings. If the Memcached cluster loses a node, all cached data on that node is gone — your application sees a thundering herd of cache misses hitting the database simultaneously.
Redis is a data structure server. For composable commerce it provides:
-
Persistence — Redis can snapshot its data to disk (RDB) and/or write every operation to an append-only log (AOF). After a restart or failover, the cache isn't cold.
-
Replication + Multi-AZ failover — Primary in AZ-a, replica in AZ-b. If the primary fails, the replica promotes automatically in under 60 seconds. Sessions, inventory counters, and leaderboards survive an AZ failure.
-
Sorted sets — ordered data structures maintained automatically. Product leaderboards, top-selling items, customer spend rankings.
-
Atomic operations —
INCR,DECR,HINCRBYare atomic. No two Redis operations on the same key can interleave. Critical for inventory counters and rate limiting. -
Pub/Sub — PBCs can publish events to Redis channels and other PBCs subscribe. Lightweight event bus for low-latency use cases.
The only case where Memcached is preferable: simple, horizontally scalable caching where you need to add many cache nodes dynamically and ElastiCache Auto Discovery is important. Even then, Redis Cluster mode now provides comparable horizontal scaling.
Redis Data Structures in Practice for Composable Commerce
Strings: The Universal Cache Entry
// Cache any serialisable object as a JSON string
await redis.setex(`product:${sku}`, 300, JSON.stringify(productData));
const data = JSON.parse(await redis.get(`product:${sku}`));
Hashes: Session Storage Across PBCs
Sessions stored as Redis hashes let any PBC read or update specific session fields without serialising/deserialising the entire session object:
// Auth PBC writes session on login
await redis.hset(`session:${sessionId}`, {
customer_id: customerId,
email: customer.email,
cart_id: cartId,
currency: 'EUR',
locale: 'de-DE',
ab_variant_checkout: 'B',
loyalty_tier: 'gold'
});
await redis.expire(`session:${sessionId}`, 3600);
// Checkout PBC reads only what it needs — no full deserialisation
const [customerId, currency, loyaltyTier] = await redis.hmget(
`session:${sessionId}`,
'customer_id', 'currency', 'loyalty_tier'
);
// Recommendation PBC updates preference signal without reading whole session
await redis.hset(`session:${sessionId}`, 'last_category_viewed', categoryId);
await redis.expire(`session:${sessionId}`, 3600); // Reset TTL on activity
Atomic Counters: Flash Sale Inventory
Redis's atomic operations prevent the race conditions that lose you money in flash sales:
// Inventory PBC: atomic decrement for flash sale
async function reserveInventory(sku, quantity) {
// DECRBY is atomic — no two calls interleave
const remaining = await redis.decrby(`flash-inventory:${sku}`, quantity);
if (remaining < 0) {
// Oversold — restore and reject
await redis.incrby(`flash-inventory:${sku}`, quantity);
throw new InsufficientInventoryError(`SKU ${sku} is sold out`);
}
// Confirmed reservation — write to DB asynchronously
await sqs.sendMessage({
QueueUrl: inventoryUpdateQueue,
MessageBody: JSON.stringify({ sku, delta: -quantity, reservationId: uuid() })
});
return remaining; // e.g., "47 remaining after your reservation"
}
This pattern handles 10,000 concurrent checkout attempts for the same limited-inventory item. Redis's single-threaded command processing means DECRBY operations execute serially, one at a time, in microseconds. No locking required. No lost updates. No overselling.
Sorted Sets: Real-Time Commerce Rankings
// Loyalty PBC: update leaderboard on every purchase
async function recordPurchaseForLeaderboard(customerId, orderAmount, campaignId) {
const key = `leaderboard:campaign:${campaignId}`;
// ZINCRBY atomically adds score to member's existing score
const newTotal = await redis.zincrby(key, orderAmount, customerId);
await redis.expireat(key, campaignEndTimestamp);
return newTotal;
}
// Storefront widget: get top 10 with spend amounts
async function getLeaderboard(campaignId, n = 10) {
const key = `leaderboard:campaign:${campaignId}`;
// ZREVRANGE with WITHSCORES: highest scores first
const entries = await redis.zrevrange(key, 0, n - 1, 'WITHSCORES');
// Returns: [customerId1, score1, customerId2, score2, ...]
// Get customer display names in bulk (one Redis call)
const customerIds = entries.filter((_, i) => i % 2 === 0);
const displayNames = await redis.mget(customerIds.map(id => `customer:${id}:display`));
return customerIds.map((id, i) => ({
rank: i + 1,
displayName: displayNames[i] || 'Anonymous',
totalSpend: parseFloat(entries[i * 2 + 1])
}));
}
Redis sorted sets maintain the ranking automatically on every insert. A leaderboard query returning the top 10 from a set of 500,000 participants completes in O(log N) time — under 1ms regardless of participant count. This is simply not achievable with SQL at this latency.
Rate Limiting: Protecting PBC APIs
// Applied as middleware on any PBC API — prevents abuse and controls costs
async function rateLimitMiddleware(req, res, next) {
const customerId = req.session.customerId || req.ip;
const endpoint = req.path.split('/')[2]; // e.g., 'search', 'recommendations'
const key = `ratelimit:${endpoint}:${customerId}`;
const limit = 60; // 60 requests per minute
const window = 60;
const current = await redis.incr(key);
if (current === 1) await redis.expire(key, window); // Set TTL on first request
if (current > limit) {
res.set('Retry-After', await redis.ttl(key));
return res.status(429).json({
error: 'Rate limit exceeded',
retryAfter: await redis.ttl(key)
});
}
res.set('X-RateLimit-Remaining', limit - current);
next();
}
One Redis INCR with TTL. Applied across all instances of the PBC (they share the Redis cluster). A customer making 200 search requests per minute is rate limited consistently regardless of which PBC instance serves each request.
Cache Warming: Solving the Cold Start Problem
Every composable PBC deployment starts with a cold cache. The first wave of requests after a deployment all hit the database — temporarily spiking database CPU and response times.
Three approaches I use depending on the PBC's criticality:
1. Pre-warm on deployment (for high-traffic PBCs):
# Deployment pipeline step after new PBC instances pass health checks
# but before they receive production traffic
aws ecs run-task --task-definition cache-warmer \
--overrides '{"containerOverrides":[{"name":"warmer","environment":[
{"name":"TARGET_ENV","value":"production"},
{"name":"WARM_TOP_N_PRODUCTS","value":"10000"}
]}]}'
# Warmer fetches top 10,000 products and populates Redis
# New PBC instances join traffic with warm cache
2. Canary traffic warming: Route 5% of production traffic to new instances before shifting full traffic (weighted routing in ALB). The small traffic percentage warms the cache against real usage patterns. At full traffic shift, cache hit rates are already above 80%.
3. Rolling deployments with overlap: During a rolling deployment, old instances (with warm cache) continue serving traffic while new instances are added. New instances see traffic immediately but share the Redis cluster — their cache is populated by serving real requests. The cold start impact is gradual rather than sudden.
Multi-AZ Configuration: Sessions Must Survive an AZ Failure
For a composable platform where sessions, carts, and inventory counters all live in Redis, a Redis cluster failure is a revenue event. Multi-AZ with automatic failover is not optional.
flowchart TD
subgraph CLUSTER["ElastiCache Redis — Multi-AZ Cluster"]
subgraph AZA["Availability Zone A (us-east-1a)"]
PRI["🔵 Primary Node\nhandles all writes"]
end
subgraph AZB["Availability Zone B (us-east-1b)"]
REP1["🟢 Replica Node\nauto-promoted on failover"]
end
subgraph AZC["Availability Zone C (us-east-1c)"]
REP2["🟢 Replica Node\nread scaling"]
end
end
APP["⚙️ PBC Instances\n(connect via cluster endpoint)"]
S3_SNAP["☁️ S3 Snapshot\n(daily backup · restore in minutes)"]
ENDPOINT["🔗 Cluster Endpoint DNS\n(auto-updates on failover)"]
APP --> ENDPOINT --> PRI
PRI -->|"async replication\n< 1s lag"| REP1 & REP2
PRI -->|"snapshot"| S3_SNAP
FAILOVER["⚡ Failover: ~60s\nReplica promoted\nEndpoint DNS updated\nPBCs reconnect automatically"]
PRI -. "primary fails" .-> FAILOVER
FAILOVER --> REP1
style PRI fill:#1a3a5a,color:#fff
style FAILOVER fill:#3a1a1a,color:#fff
The application connects to the cluster endpoint (not individual node endpoints). When failover occurs, the cluster endpoint DNS record updates to point to the promoted replica. Your PBCs reconnect automatically after the brief failover period.
The Practical Outcome
A composable commerce platform without a properly designed caching layer is an expensive distributed system that's slower than the monolith it replaced. A composable platform with ElastiCache properly integrated — cache-aside on every read-heavy PBC, Redis data structures for sessions and counters and leaderboards, Multi-AZ for resilience — is a platform that genuinely delivers on the performance promise of the MACH architecture.
The goal is that the 15 API calls your storefront makes per page render feel, to the user, like one fast response. ElastiCache is what makes that possible.
Next: CloudFront, Kinesis, CloudFormation and the additional services that complete the composable commerce infrastructure picture.
💬 What's your Redis TTL strategy for product pricing data on a platform with dynamic pricing? I'd love to hear how teams are balancing freshness against database load.
#ElastiCache #Redis #AWS #ComposableCommerce #Performance #Caching #MACH #SolutionsArchitect #BackendEngineering