LLM Offer Copy Generation: Prompt Registry, Dual Backends, and Brand Image Generation
Versioned YAML prompts, OpenAI and HuggingFace backends, JSON retry logic, and FLUX.1 brand image generation — LoyaltyLens Module 4
Series position: Article 4 of 8
Module 4 takes the offer retrieved in Module 3 and generates personalized copy for it. The module has three layers:
- Copy generation: a versioned YAML prompt registry, two LLM backends (OpenAI and HuggingFace/Mistral-7B), and a structured OfferCopyGenerator with JSON parse retry logic
- Brand alignment: FLUX.1-schnell image generation from the offer copy via the HuggingFace Inference API
- Prompt governance: machine-readable eval_criteria embedded in each prompt YAML, consumed by the eval harness in Module 5
Treating prompts as production artifacts — versioned, testable, and auditable — is the central design principle of this module.
The Three Layers of Offer Copy Generation
The end-to-end offer copy pipeline has three layers:
- Retrieval (Module 3): Which offer is most relevant for this customer?
- Copy generation (this module): What message best presents that offer to this customer?
- Brand alignment (this module): Does the generated copy look and feel like an on-brand loyalty message?
Most tutorials cover only the middle layer. The brand alignment layer, which generates a FLUX.1 image from the offer copy itself, is what sets this project apart from a generic LLM demo.
The Prompt Registry
Every prompt lives in a YAML file under version control. The registry is built before any inference code:
# llm_generator/prompts/system_v1.yaml
version: 1
created_at: "2024-01-15"
author: "Pushparajan Ramar"
system: |
  You are a loyalty program offer copywriter with deep knowledge of
  the brand voice: warm, personal, community-focused, and
  never pushy. You write copy that feels like a message from a friend,
  not an advertisement.
  Always output valid JSON and nothing else. No preamble, no explanation.
user_template: |
  Write personalized offer copy for this customer:
  - Loyalty tier: {tier}
  - Engagement score: {engagement_score:.2f} (0=disengaged, 1=highly engaged)
  - Preferred channel: {channel}
  - Days since last visit: {recency_days}
  Offer to promote:
  - Title: {offer_title}
  - Description: {offer_description}
  - Category: {offer_category}
  Output JSON with exactly these fields:
  {{
    "headline": "max 8 words, creates urgency or delight",
    "body": "max 40 words, warm and personal, mention the customer benefit",
    "cta": "max 5 words, action-oriented",
    "tone": "friendly | urgent | exclusive"
  }}
eval_criteria:
  - headline_max_words: 8
  - body_max_words: 40
  - cta_max_words: 5
  - required_fields: [headline, body, cta, tone]
  - valid_tones: [friendly, urgent, exclusive]
The eval_criteria block is not documentation — it's machine-readable. The evaluation harness in Module 5 reads it to validate every generated output automatically.
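To make "machine-readable" concrete, here is a minimal sketch of how a harness might consume the block. The names `flatten_criteria`, `check_copy`, and `V1_CRITERIA` are illustrative assumptions, not the Module 5 implementation. The `V1_CRITERIA` literal mirrors what `yaml.safe_load` would return for the `eval_criteria` list above, so the sketch runs without PyYAML.

```python
def flatten_criteria(criteria: list[dict]) -> dict:
    """Merge the YAML list of single-key mappings into one dict."""
    merged: dict = {}
    for item in criteria:
        merged.update(item)
    return merged


def check_copy(copy: dict, criteria: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the copy passes."""
    rules = flatten_criteria(criteria)
    violations = []

    # Report missing fields first so later checks can assume nothing.
    for field in rules.get("required_fields", []):
        if field not in copy:
            violations.append(f"missing field: {field}")

    # Word limits follow the "<field>_max_words" naming used in the YAML.
    for key, limit in rules.items():
        if key.endswith("_max_words"):
            field = key.removesuffix("_max_words")
            words = len(str(copy.get(field, "")).split())
            if words > limit:
                violations.append(f"{field} has {words} words (max {limit})")

    # Only judge the tone when one was actually supplied.
    valid_tones = rules.get("valid_tones")
    if valid_tones and "tone" in copy and copy["tone"] not in valid_tones:
        violations.append(f"invalid tone: {copy['tone']!r}")
    return violations


# Assumed shape: what yaml.safe_load yields for eval_criteria in system_v1.yaml.
V1_CRITERIA = [
    {"headline_max_words": 8},
    {"body_max_words": 40},
    {"cta_max_words": 5},
    {"required_fields": ["headline", "body", "cta", "tone"]},
    {"valid_tones": ["friendly", "urgent", "exclusive"]},
]
```

Because the limits live in the YAML rather than in the checker, a new prompt version can tighten or relax them without any code change.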
Version 2 tightened the tone:
# llm_generator/prompts/system_v2.yaml
version: 2
# ...changes from v1:
# - Added instruction to reference specific product category in body
# - Changed tone options to [warm | celebratory | exclusive]
# - Added constraint: never use the word "deal"
That last constraint — "never use the word 'deal'" — is a brand guideline: the word signals transactional intent rather than community membership. Capturing brand constraints as versioned, auditable prompt rules is what makes generative AI deployable at enterprise scale.
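A prompt rule like this can also be backed by a post-generation check, in case the model ignores the instruction. The sketch below is hypothetical: the v2 file is not shown in full, so `BANNED_WORDS` and `find_banned_words` are assumed names, and the list contains only the one word the article mentions.

```python
import re

# Hypothetical list; the article names only "deal" as banned in v2.
BANNED_WORDS: tuple[str, ...] = ("deal",)


def find_banned_words(text: str, banned: tuple[str, ...] = BANNED_WORDS) -> list[str]:
    """Return banned words that occur in text as whole words, ignoring case."""
    hits = []
    for word in banned:
        # \b keeps "deal" from matching inside "ideal" or "dealer".
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            hits.append(word)
    return hits
```

Whole-word matching avoids false positives on words like "ideal", but it also means plurals such as "deals" pass unless they are listed explicitly.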
The Generator Implementation
# llm_generator/generator.py
import json
import time
from dataclasses import dataclass
from pathlib import Path

import yaml

from llm_generator.backends import LLMBackend


class CopyGenerationError(Exception):
    """Raised when model output cannot be parsed as JSON. Defined here so the excerpt stands alone."""


@dataclass
class OfferCopy:
    headline: str
    body: str
    cta: str
    tone: str
    model_version: str
    prompt_version: int
    latency_ms: float
    token_count: int | None = None


class OfferCopyGenerator:
    def __init__(self, backend: LLMBackend, prompt_version: int = 1):
        prompt_path = Path(f"llm_generator/prompts/system_v{prompt_version}.yaml")
        self.prompt_cfg = yaml.safe_load(prompt_path.read_text())
        self.backend = backend
        self.prompt_version = prompt_version

    def generate(self, customer_context: dict, offer: dict) -> OfferCopy:
        user_message = self.prompt_cfg["user_template"].format(
            tier=customer_context.get("tier", "Member"),
            engagement_score=customer_context["engagement_score"],
            channel=customer_context["channel_preference"],
            recency_days=int(customer_context["recency_days"]),
            offer_title=offer["title"],
            offer_description=offer["description"],
            offer_category=offer["category"],
        )
        start = time.perf_counter()
        raw = self.backend.generate(
            system=self.prompt_cfg["system"],
            user=user_message,
        )
        # Latency covers the first model call only, not a retry.
        latency_ms = (time.perf_counter() - start) * 1000
        copy = self._parse_with_retry(raw, max_retries=1)
        self._validate(copy)  # method not shown in this excerpt
        return OfferCopy(
            **copy,
            model_version=self.backend.version,
            prompt_version=self.prompt_version,
            latency_ms=round(latency_ms, 2),
        )

    def _parse_with_retry(self, raw: str, max_retries: int) -> dict:
        for attempt in range(max_retries + 1):
            try:
                # Strip markdown code fences (```json or bare ```) if present
                clean = (
                    raw.strip()
                    .removeprefix("```json")
                    .removeprefix("```")
                    .removesuffix("```")
                    .strip()
                )
                return json.loads(clean)
            except json.JSONDecodeError:
                if attempt == max_retries:
                    raise CopyGenerationError(f"JSON parse failed after {max_retries+1} attempts")
                # Retry with explicit JSON instruction
                raw = self.backend.generate(
                    system=self.prompt_cfg["system"],
                    user=f"Your previous response was not valid JSON. Output ONLY the JSON object, no other text.\n\n{raw}",
                )
The retry logic is not optional. Even well-prompted frontier models produce non-JSON output in roughly 2–4% of calls. At a million daily offer generations, 2% is 20,000 failed renders — a visible degradation in customer experience.
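The arithmetic behind that claim, and the effect of a single retry, can be sketched as follows. The function name `expected_failures` is illustrative, and the independence assumption is a simplification: a prompt that trips the model once may be more likely to trip it again, so treat the retry figures as a lower bound on residual failures.

```python
def expected_failures(daily_calls: int, failure_rate: float, retries: int) -> float:
    """Expected calls per day that still fail after `retries` extra attempts.

    Assumes every attempt fails independently with the same probability,
    so a call fails overall only if all (retries + 1) attempts fail.
    """
    return daily_calls * failure_rate ** (retries + 1)


if __name__ == "__main__":
    for rate in (0.02, 0.04):
        no_retry = expected_failures(1_000_000, rate, retries=0)
        one_retry = expected_failures(1_000_000, rate, retries=1)
        print(f"rate={rate:.0%}: {no_retry:,.0f} without retry, {one_retry:,.0f} with one retry")
```

Under the independence assumption, one retry takes the 2% case from 20,000 failures a day to about 400, and the 4% case from 40,000 to about 1,600.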
The Two LLM Backends
LoyaltyLens supports two backends with a common interface:
# llm_generator/backends.py
from abc import ABC, abstractmethod

import openai

# `config` is the project's settings module; its import is not shown in the source.


class LLMBackend(ABC):
    version: str  # set by each backend; read by OfferCopyGenerator

    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class OpenAIBackend(LLMBackend):
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = openai.OpenAI(api_key=config.OPENAI_API_KEY)
        self.version = model

    def generate(self, system: str, user: str) -> str:
        response = self.client.chat.completions.create(
            model=self.version,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            temperature=0.7,
            max_tokens=300,
        )
        return response.choices[0].message.content


class HuggingFaceBackend(LLMBackend):
    def __init__(self, model_id: str = "mistralai/Mistral-7B-Instruct-v0.2"):
        from transformers import pipeline  # lazy import: only this backend needs it

        self.pipe = pipeline(
            "text-generation",
            model=model_id,
            max_new_tokens=300,
            do_sample=True,  # temperature only takes effect when sampling is enabled
            temperature=0.7,
            device_map="auto",
        )
        self.version = model_id.split("/")[-1]

    def generate(self, system: str, user: str) -> str:
        prompt = f"[INST] {system}\n\n{user} [/INST]"
        output = self.pipe(prompt)[0]["generated_text"]
        # Strip the input prompt from the output
        return output[len(prompt):].strip()
The abstract base class is what makes this testable. In unit tests, inject a MockBackend that returns predetermined JSON — no API calls, no GPU, deterministic results. If an ML component can't be tested without a live model call, the CI pipeline will be slow, flaky, and expensive.
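A minimal sketch of such a mock follows. `MockBackend` is not shown in the article, so its shape here is an assumption: it replays canned responses in order and records each call. The `LLMBackend` class is repeated only so the snippet runs on its own.

```python
import json
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    """Same interface as llm_generator/backends.py, repeated for a standalone sketch."""

    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class MockBackend(LLMBackend):
    """Hypothetical test double: returns queued responses and records calls."""

    version = "mock"

    def __init__(self, responses: list[str]):
        self._responses = list(responses)
        self.calls: list[tuple[str, str]] = []

    def generate(self, system: str, user: str) -> str:
        self.calls.append((system, user))
        # Raises IndexError if the test makes more calls than it queued.
        return self._responses.pop(0)


# A canned, valid response matching the four fields the v1 prompt asks for.
CANNED = json.dumps(
    {
        "headline": "Welcome back",
        "body": "Thanks for being here.",
        "cta": "See offer",
        "tone": "friendly",
    }
)
```

Queuing an invalid response ahead of a valid one, for example `MockBackend(["not json", CANNED])`, lets a unit test exercise the retry path deterministically and then assert on `backend.calls` to confirm that exactly two model calls were made.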
The FLUX.1 Brand Image Generator
Once the LLM produces offer copy, the same copy serves as a text prompt to generate a matching campaign image. The /generate endpoint accepts an optional generate_image: true flag; when set, it feeds the headline and body into FLUX.1-schnell via the HuggingFace Inference API and saves the result alongside the copy.
# llm_generator/multimodal.py
from pathlib import Path

from huggingface_hub import InferenceClient
from PIL import Image

# get_settings() is the project's settings loader; its import is not shown in the source.


class BrandImageGenerator:
    """Generate brand images from offer copy using FLUX.1-schnell (free HF Inference API)."""

    def __init__(self) -> None:
        settings = get_settings()
        self._client = InferenceClient(
            model="black-forest-labs/FLUX.1-schnell",
            token=settings.hf_token or None,  # optional; a token raises rate limits
        )

    def generate(self, copy_text: str) -> Image.Image:
        return self._client.text_to_image(copy_text)

    def generate_to_path(self, copy_text: str, output_path: str | Path) -> Path:
        path = Path(output_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        self.generate(copy_text).save(path)
        return path
FLUX.1-schnell is a distilled diffusion model that produces an image in about four steps without classifier-free guidance, which keeps generation under two seconds on the HuggingFace serverless GPU tier. The interface is designed so that a fine-tuned brand model can be swapped in for production without changing calling code.
When generate_image=true, the endpoint concatenates headline and body into a single prompt and calls the generator:
# llm_generator/api.py (relevant excerpt)
@app.post("/generate")
async def generate(req: GenerateRequest) -> GenerateResponse:
    copy = _get_generator(req.prompt_version).generate(customer_context, offer)
    image_path = None
    if req.generate_image:
        gen = BrandImageGenerator()
        prompt = f"{copy.headline}. {copy.body}"
        image_path = str(
            gen.generate_to_path(prompt, BRAND_IMAGES_DIR / f"{req.offer_id}_generated.png")
        )
    return GenerateResponse(..., generated_image_path=image_path)
In production you would replace the free Inference API with a fine-tuned Flux variant conditioned on brand-approved visual training data. The interface stays identical — swap the model ID and token. That's the point of the abstraction.
Sample Output
Full generation pipeline output for a high-engagement mobile-first customer with a double-points offer:
{
"headline": "Your reward moment is here",
"body": "You've earned this: 2x points on every purchase in your top category this week. A small thank-you for showing up consistently.",
"cta": "Redeem in-app now",
"tone": "friendly",
"model_version": "gpt-4o-mini",
"prompt_version": 2,
"latency_ms": 847.3,
"generated_image_path": "data/brand_images/O042_generated.png"
}
The body reflects the customer's high engagement implicitly ("a small thank-you for showing up consistently"), which personalizes the message without exposing the underlying score. This is what separates prompt v2 from v1: consistent nuance, not occasional nuance.
What the Latency Numbers Look Like
| Backend | Median latency | p95 latency | API cost per call |
|---|---|---|---|
| OpenAI gpt-4o-mini | 847 ms | 1,240 ms | ~$0.0002 |
| Mistral-7B local (GPU) | 1,100 ms | 1,850 ms | $0 (infrastructure cost only) |
| Mistral-7B local (CPU) | 4,200 ms | 6,100 ms | $0 (infrastructure cost only) |
For a real-time offer push use case at scale, these latencies are acceptable because offer generation happens asynchronously, outside the request path. For a live in-store interaction that needs a rendered offer in under 2 seconds, you'd pre-generate offers for high-propensity customers and cache them.
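The pre-generate-and-cache idea could look something like the sketch below. Everything here is an assumption for illustration: the `OfferCopyCache` name, the `(customer_id, offer_id)` key, and the in-memory store, which a production system would likely replace with a shared cache.

```python
import time
from typing import Callable, Optional


class OfferCopyCache:
    """In-memory TTL cache for pre-generated copy, keyed by (customer_id, offer_id)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: dict = {}

    def put(self, customer_id: str, offer_id: str, copy: dict) -> None:
        """Store copy together with its expiry timestamp."""
        self._store[(customer_id, offer_id)] = (self._clock() + self._ttl, copy)

    def get(self, customer_id: str, offer_id: str) -> Optional[dict]:
        """Return cached copy, or None if it is missing or expired."""
        entry = self._store.get((customer_id, offer_id))
        if entry is None:
            return None
        expires_at, copy = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller regenerates it.
            del self._store[(customer_id, offer_id)]
            return None
        return copy
```

A batch job would call `put` for each high-propensity customer ahead of time, and the in-store path would call `get` and fall back to live generation only on a miss. The TTL bounds how stale a cached offer can be if the customer's context changes after pre-generation.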
Next: Module 5 — LLMOps Pipeline
Module 4 generates copy. Module 5 monitors whether it's getting better or worse over time: a prompt versioning CLI, an LLM-as-judge evaluation harness with a hard quality gate, a PSI-based propensity drift monitor, and a GitHub Actions CI/CD pipeline that ties all of it together.
→ Read Module 5: Building an LLMOps Pipeline with Prompt Versioning, Drift Monitoring, and CI/CD