Signed Words Only
A Practical Playbook for Trust-Bound AI (Keys, Policies, and the End of “Just Trust the Model”)
At a certain point you stop arguing with jailbreaks and start asking better questions: Who signed this instruction? Why does this agent believe that paragraph has authority? If your system can’t answer, you don’t have “AI safety.” You have improv theater with production access.
This is not another sermon about prompt injection. You already know the perimeter is gone. This is the practical JUNIOR playbook for turning language into a governed surface, where only signed, policy-authorized words can make machines act.
First Principles (write these on the wall)
Provenance over vibes. If a sentence can move money or data, it must arrive with a verifiable origin and an integrity check.
Least authority for language. Unverified text is reference, not control. Treat it as inert unless elevated by policy.
Authorization before execution. Tool calls are security events; pass them through policy, not optimism.
Degrade gracefully. As confidence falls (more unverified context), the agent’s powers shrink, read-only first, then human-in-the-loop, then refuse.
The Stack You Actually Build
1) Provenance Layer — “Prompt Fencing”
What it is: A signing service that stamps each content fragment with
{origin, trust_tier, type, timestamp, content_hash, signature}.How to do it: HSM-backed keys; sign at ingest (KBs, RAG docs, API responses). Store signatures and hashes alongside embeddings; never assemble a context window from unsigned pieces.
Outcome: The model can see which blocks are trusted instructions vs. untrusted content, and downstream policy engines can prove it.
(This follows Thoughtworks’ fencing pattern, cryptographic chain-of-custody for text.)
2) Policy-Aware Assembly — “Respect the Fence”
What it is: A composer that lays out input as explicit blocks (trusted policy → verified internal → unverified external → user).
Model behavior: Fine-tune and reinforce on patterns where the right move is to ignore instructions inside untrusted blocks. Reward fence-compliant refusals; penalize obedient mistakes.
Why it works: You’re not detecting clever phrasing; you’re enforcing role of text. The model’s job isn’t to be moral, it’s to obey the map.
3) Execution Mediation — “No Signature, No Action”
Gate every tool call through a policy engine that checks:
Syntax: function exists, types sane, ranges constrained.
Semantics: terms like “primary account” resolve against ground truth, not conversational drift.
Authorization + provenance: high-impact calls require (trusted instruction) or (step-up human confirmation via a second channel).
Audit by default: bind the call to the content hash, signer, policy version, and attested agent identity. If something goes sideways, you can reconstruct why.
Migration Plan (90 days that actually ship)
Days 0–15 | Inventory & blast radius
Map which agents can act (email, tickets, payments, data exports).
Tag “high-impact tools” (money, PII, privileges). Default them to human-in-the-loop.
Days 16–30 | Sign the supply chain
Stand up the signing service; wrap your knowledge bases and internal docs on ingest.
Flip RAG ingestion to sign+hash. Unsigned = quarantined.
Days 31–45 | Assemble fenced prompts
Introduce block-structured assembly. Log block composition per request.
Add a “confidence budget”: when unverified dominates, auto-downgrade capabilities.
Days 46–60 | Train for compliance
Few-shot + fine-tune fence-respect examples; RLHF on “refuse when untrusted instructs.”
Bake in unit tests: indirect injections in alt-text, redefinitions mid-thread, multilingual detours.
Days 61–75 | Policy gateway for tools
Interpose the execution gateway. Encode rules like:
transfer_funds: trusted or step-up auth; daily limits enforced.read_customer_data: role + ticket context + subject match.bulk_export: trustee signature + managerial approval.
Days 76–90 | Prove it under fire
Red-team the pipeline (SEO-poison retriever, KB payloads, HTML comments).
Track five metrics (below). Tune policies, not just weights.
Metrics That Matter (and the thresholds)
ASR (Attack Success Rate) | % of seeded attacks that trigger policy-forbidden behavior.
Target: <1% for read-only; 0% for money/PII tools.
FPR (False Positive Rate) | legit requests blocked.
Target: <2% after week 2 of tuning; show users clear reasons.
Utility Preservation | task success vs. pre-governance baseline.
Target: ≥95% on core use cases; drop only when confidence is low by design.
Latency Overhead (p95) | added by signatures + policy.
Target: <50ms p95 for verify; <100ms with gateway checks.
Provenance Coverage | % of context tokens carrying valid signatures.
Target: >90% in enterprise flows; >99% for high-impact actions.
Failure Scenarios (design for the bad day)
Unsigned instruction sneaks in: Gateway blocks action; user sees “Instruction missing trusted signature. Confirm to proceed.”
Semantic hijack (“primary” redefined): Tool resolver ties terms to ground truth records; conversational redefinitions are ignored.
Retriever poisoned: Assembly shows unverified dominance → capability downgraded to read-only summary; no writes, no tools.
Key compromise: Rotate signer keys; replay provenance logs to quarantine affected content; rebuild embeddings from clean-signed sources.
Cost Model (the honest one)
Upfront: signer infra, policy gateway, assembly refactor, fine-tuning.
Steady-state: low-signature checks and rules are cheap; model inference dominates.
Avoided OpEx: endless guardrail retraining, manual false-positive triage, incident response.
Executive line: detection is a treadmill; governance is a fence. Treadmills rent safety. Fences own it.
Compliance & Proof
Artifacts you can hand an auditor:
Provenance logs (signer, hash, timestamp, origin).
Policy decisions per tool call (inputs, rule, outcome).
Chain-of-custody from content to action.
Standards outlook: plan for a portable, signed “prompt-provenance” envelope so trust survives hops (LLM → tool → partner system). It will become table stakes.
Culture: What You Tell the Team
We’re not building moral machines. We’re building honest infrastructure: systems that admit which words have authority, and prove it. Every other approach is performance art with root access.
When someone pitches “next-gen guardrails,” ask the only question that matters: Who signed the words that made the agent act?
If the answer is “no one,” the meeting is over.
Want this running in 90 days?
If you’re an enterprise with agents, tools, and regulators breathing down your neck, we can blueprint and implement this stack, provenance, policy-aware assembly, and tool mediation, on your infrastructure. Book a call and we’ll walk you through a tailored migration plan, the controls to meet your risk profile, and the dashboards your board actually understands.
Reply here or contact Mjolnir Security to schedule a working session.



