How to Structure Data for AI Without Creating Security Nightmares
Balance AI context with security: structured data, sanitization, RAG, and least-privilege. Practical patterns for safe AI without data exfiltration risks.
AI systems live in a constant tug-of-war: they need rich context to give useful answers, but every extra scrap of context widens the attack surface. Starve them of data and they’re useless. Overfeed them and you’ve built a data exfiltration pipeline.
Encryption-at-rest and access control still matter, but they’re not enough on their own. Probabilistic systems that chew through natural language at scale demand their own set of security patterns.
What follows are practical approaches to structuring data so AI models can do their job without blowing a hole in your security posture.
## The AI Context Problem

### Why Context Matters
AI models are stateless — they carry nothing between requests. Every fact the model needs has to ride along in the prompt itself. That creates constant pressure to pack in more data:
Minimal context example:

```python
prompt = "How do I reset my password?"
```

AI response: Generic instructions applicable to any system.
Rich context example:

```python
prompt = f"""
User Question: How do I reset my password?

User Context:
- Name: {customer.name}
- Email: {customer.email}
- Account ID: {customer.account_id}
- Account Status: {customer.status}
- Last Login: {customer.last_login}
- Recent Orders: {customer.recent_orders}
- Support History: {customer.support_tickets}
"""
```

AI response: Personalized, specific, genuinely helpful.
But now the prompt contains:
- ❌ Personally Identifiable Information (PII)
- ❌ Account details that could enable account takeover
- ❌ Purchase history exposing customer behavior
- ❌ Support history potentially containing sensitive issues
If prompt injection succeeds, all this data leaks.
### The Attack Surface Expansion
Each piece of data sitting in a prompt is a piece of data that can be exposed:
- Prompt injection can trick the model into spitting out embedded data
- Model providers see all prompt content (unless you’re running local models)
- Logging systems may capture prompts that contain sensitive information
- Debugging and monitoring tools expose prompts to engineers
- Prompt caching may retain sensitive data longer than you intended
More context means more ways for an attacker to win.
## Principle 1: Make Structure Explicit
AI models parse structured data far more reliably than freeform text. Explicit structure also hands you precise control over what gets included and what doesn’t.
### Bad: Unstructured Data Dump
```python
# Don't do this
prompt = "Here's all the user info: " + json.dumps(user_record)
```
Problems:
- Model must parse arbitrary JSON
- You’ve included everything, not just what’s needed
- No clear boundaries between data types
- Hard to audit what was sent
### Good: Hierarchical Markdown Structure
```python
# Do this
prompt = f"""
## User Information
- ID: {user.id}
- Role: {user.role}
- Department: {user.department}

## Request
{sanitize(user_request)}

## Context
- Timestamp: {timestamp}
- Session ID: {session.id}
- Previous Action: {previous_action}
"""
```
Benefits:
- Clear hierarchy from general to specific
- Explicit sections you can selectively include/exclude
- Easy to redact specific sections (e.g., remove PII for certain tasks)
- Audit trail shows exactly what was sent
- Model can focus on relevant sections
### Semantic Tagging
Tags clarify the purpose of each data element:
```
<SYSTEM_PROMPT>
You are a secure code review assistant.
Never execute code. Only analyze for vulnerabilities.
</SYSTEM_PROMPT>

<USER_REQUEST>
Review this authentication function for security issues.
</USER_REQUEST>

<CODE classification="internal">
def login(username, password):
    query = f"SELECT * FROM users WHERE name='{username}'"
    cursor.execute(query)
    return cursor.fetchone()
</CODE>

<SECURITY_CONTEXT>
This code handles user authentication for internal admin portal.
OWASP Top 10 compliance required.
</SECURITY_CONTEXT>
```
Benefits:
- Clear separation between instructions and user input (helps with prompt injection defense)
- Classification labels enable access control checks before sending
- Purpose tags help model understand context type
- Easier to parse and validate programmatically
Tags alone won’t prevent prompt injection, but they raise the bar for attackers and give your security controls something solid to hook into.
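Because the tags are ordinary markup, a pre-flight check can parse them and refuse to send anything whose classification exceeds the destination model's clearance. A minimal sketch (the tag attribute and level names mirror the example above; they are illustrative, not a standard):

```python
import re

# Clearance order, lowest to highest (labels assumed to match the tags above)
LEVELS = ["public", "internal", "confidential", "secret"]

def max_classification(prompt):
    """Return the highest classification attribute found in the prompt's tags."""
    labels = re.findall(r'classification="([a-z-]+)"', prompt)
    if not labels:
        return "public"
    return max(labels, key=LEVELS.index)

def safe_to_send(prompt, model_clearance):
    """Allow the prompt only if every tagged element is within the model's clearance."""
    return LEVELS.index(max_classification(prompt)) <= LEVELS.index(model_clearance)

tagged = '<CODE classification="internal">def login(): ...</CODE>'
print(safe_to_send(tagged, "internal"))  # True
print(safe_to_send(tagged, "public"))    # False
```

A check like this is cheap to run before every model call and turns the classification labels into an enforced gate rather than documentation.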
## Principle 2: Least Context Necessary
Only send what the AI actually needs for the task at hand. Think of it as least-privilege, applied to prompt context.
### Bad: Over-Sharing
```python
# Sends entire customer database to AI
customers = database.query("SELECT * FROM customers")
prompt = f"Summarize feedback from these customers: {customers}"
```
Problems:
- Massive privacy violation
- Huge token cost
- Model overwhelmed by irrelevant data
- Catastrophic if prompt leaks
### Good: Minimal Scope
```python
# Sends only relevant subset
feedback = database.query("""
    SELECT feedback_text, category, date
    FROM customer_feedback
    WHERE category = 'product' AND date > DATE_SUB(NOW(), INTERVAL 30 DAY)
    LIMIT 100
""")

prompt = f"Summarize recent product feedback: {sanitize_feedback(feedback)}"
```
Benefits:
- Only necessary data exposed
- Reduced token cost
- Model focuses on relevant information
- Limited blast radius if compromised
## Principle 3: Redaction and Sanitization
Strip sensitive data before it ever touches the model.
### Practical Sanitization Example
Original email:

```
From: john.doe@email.com
Subject: Billing Issue - Order #A-98765

Hi Support,

My credit card ending in 4567 was charged $299.99 twice for
order #A-98765. My customer ID is C-123456.

Please refund the duplicate charge to card ending 4567.

Thanks,
John Doe
SSN: 123-45-6789 (for verification)
```
Sanitized for AI:

```
From: [USER_EMAIL]
Subject: Billing Issue - Order #[ORDER_ID]

Hi Support,

My payment method ending in [REDACTED] was charged [AMOUNT]
twice for order #[ORDER_ID]. My customer ID is [CUSTOMER_ID].

Please refund the duplicate charge.

Thanks,
[USER_NAME]
```
Implementation:

```python
import re

def sanitize_for_ai(text):
    """Remove sensitive data before sending to AI"""
    # Redact email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[USER_EMAIL]', text)
    # Redact SSNs first, so the generic 4-digit rule below
    # can't consume an SSN's final group before this pattern runs
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]', text)
    # Redact standalone 4-digit runs (card last-4; deliberately broad,
    # so it will also catch years and PINs)
    text = re.sub(r'\b\d{4}\b', '[REDACTED]', text)
    # Redact dollar amounts (preserve for context but anonymize scale)
    text = re.sub(r'\$[\d,]+\.?\d*', '[AMOUNT]', text)
    # Replace actual names (requires NER or lookup)
    text = replace_known_names(text, '[USER_NAME]')
    # Replace order IDs with generic tokens
    text = re.sub(r'#[A-Z]-\d+', '#[ORDER_ID]', text)
    # Replace customer IDs (e.g. C-123456)
    text = re.sub(r'\b[A-Z]-\d{6}\b', '[CUSTOMER_ID]', text)
    return text
```
AI can still:
- Understand the issue (duplicate charge)
- Generate appropriate response
- Route to correct team (billing)
- Provide helpful troubleshooting
AI cannot:
- See real email address
- Access credit card details
- View customer name
- Extract order numbers for unauthorized lookups
Result: Privacy risk drops substantially while the model stays fully functional for the task.
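As a quick smoke test, a trimmed, self-contained variant of the sanitizer (name lookup omitted, sample text invented for illustration) can be run end to end. Note that the SSN rule fires before the generic 4-digit rule; in the other order, the SSN's trailing `6789` would be consumed first and the SSN pattern would never match:

```python
import re

def sanitize_for_ai(text):
    """Minimal PII redaction sketch (order matters: SSNs before 4-digit runs)."""
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]', text)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[USER_EMAIL]', text)
    text = re.sub(r'\$[\d,]+(?:\.\d+)?', '[AMOUNT]', text)
    text = re.sub(r'#[A-Z]-\d+', '#[ORDER_ID]', text)
    text = re.sub(r'\b\d{4}\b', '[REDACTED]', text)  # card last-4; deliberately broad
    return text

sample = ("My card ending in 4567 was charged $299.99 twice for order #A-98765. "
          "SSN: 123-45-6789. Reach me at jane@example.com.")
print(sanitize_for_ai(sample))
```

After the run, no raw card digits, SSN, or email address survive, while the issue (a duplicate charge on an order) remains fully legible to the model.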
## Principle 4: Data Classification and Routing
Match data sensitivity to the right model tier.
| Classification | Model Type | Rationale |
|---|---|---|
| Public | Cloud AI (Claude, GPT-4) | Best performance, no privacy concern |
| Internal | Cloud AI with DPA/BAA | Data Processing Agreement / Business Associate Agreement provides contractual safeguards |
| Confidential | Local/on-prem model | Data never leaves your control |
| Secret/Regulated | No AI processing | Human-only, too sensitive for AI |
### Implementation Example
```python
def route_code_review(code_file):
    """Route code review to appropriate model based on classification"""
    classification = classify_file(code_file)  # Use data classification system

    if classification == "public":
        # Open-source code, use best available model
        return claude_api.review(code_file)
    elif classification == "internal":
        # Proprietary but not critical, cloud with Data Processing Agreement
        if has_signed_dpa("openai"):
            return openai_api.review(code_file)
        else:
            return local_model.review(code_file)
    elif classification == "confidential":
        # Trade secrets, critical IP - local model only
        return local_model.review(code_file)
    elif classification in ["secret", "top-secret"]:
        # Classified information, government secrets
        return {
            "status": "rejected",
            "reason": "Classification too high for AI processing",
            "recommendation": "Manual review required"
        }
    else:
        raise ValueError(f"Unknown classification: {classification}")
```
Security through routing: Sensitive data never touches cloud providers.
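Routing logic like this is worth unit-testing before any real provider is wired in. A sketch with stub providers (the classes here are placeholders, not real SDKs, and the DPA branch is collapsed to local for brevity):

```python
class StubProvider:
    def __init__(self, name):
        self.name = name
    def review(self, code_file):
        # A real provider would call its API here
        return {"model": self.name, "file": code_file, "status": "reviewed"}

claude_api = StubProvider("cloud")
local_model = StubProvider("local")

def route_code_review(code_file, classification):
    """Route by classification; sensitive tiers never reach the cloud stub."""
    if classification == "public":
        return claude_api.review(code_file)
    if classification in ("internal", "confidential"):
        return local_model.review(code_file)
    if classification in ("secret", "top-secret"):
        return {"status": "rejected",
                "reason": "Classification too high for AI processing"}
    raise ValueError(f"Unknown classification: {classification}")

print(route_code_review("auth.py", "public")["model"])   # cloud
print(route_code_review("auth.py", "secret")["status"])  # rejected
```

Testing with stubs makes the security property explicit: an assertion that confidential input never reaches the cloud provider can run in CI on every change to the router.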
Take a look at your current AI prompts: how much sensitive data are you sending that the model doesn’t actually need for the task?
## Retrieval-Augmented Generation (RAG): Security-First Pattern
RAG tackles the context problem head-on by fetching only the documents that matter instead of cramming everything into the prompt.
### Traditional Approach (Insecure)
```python
# BAD: Load all documents
all_company_docs = load_all_documents()  # Thousands of documents

prompt = f"""
Question: {user_question}

Company Knowledge Base:
{all_company_docs}
"""

response = ai.query(prompt)
```
Problems:
- User might not have access to all documents
- Massive token cost
- Prompt contains far more than needed
- Privacy violations if docs contain PII
- Slow, expensive, insecure
### RAG Approach (Secure)
```python
# GOOD: Retrieve only relevant, authorized documents
relevant_docs = vector_db.search(
    query=user_question,
    filters={
        "accessible_by": user.id,  # Access control filter
        "classification": ["public", "internal"]  # Exclude confidential
    },
    limit=5  # Only top 5 most relevant
)

# Sanitize retrieved documents
sanitized_docs = [sanitize_document(doc) for doc in relevant_docs]

prompt = f"""
Question: {user_question}

Relevant Context (user-authorized):
{sanitized_docs}
"""

response = ai.query(prompt)

# Log retrieval for audit
log_document_access(user.id, [doc.id for doc in relevant_docs])
```
Security benefits:
- Least privilege: AI only sees documents user can access
- Access control: Search respects user permissions automatically
- Audit trail: Logged which documents were accessed
- Minimal context: Only relevant documents included
- Cost-efficient: Dramatically fewer tokens
### RAG Best Practices
#### 1. Chunk Documents Appropriately
- Too large: the model gets irrelevant context and wastes tokens
- Too small: the model lacks the context to answer effectively
- Sweet spot: 500-1000 tokens per chunk
```python
def chunk_document(document, chunk_size=800, overlap=100):
    """
    Split document into overlapping chunks for RAG
    """
    chunks = []
    current_chunk = []
    current_size = 0

    for paragraph in document.paragraphs:
        para_tokens = count_tokens(paragraph)
        if current_size + para_tokens > chunk_size:
            # Save current chunk
            chunks.append({
                "text": "\n\n".join(current_chunk),
                "metadata": extract_metadata(current_chunk)
            })
            # Start new chunk with overlap from previous
            overlap_text = current_chunk[-1] if current_chunk else ""
            current_chunk = [overlap_text, paragraph]
            current_size = count_tokens(overlap_text) + para_tokens
        else:
            current_chunk.append(paragraph)
            current_size += para_tokens

    # Add final chunk
    if current_chunk:
        chunks.append({
            "text": "\n\n".join(current_chunk),
            "metadata": extract_metadata(current_chunk)
        })

    return chunks
```
Overlap is critical: It keeps concepts that span chunk boundaries from getting lost.
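The overlap behavior is easy to verify with a toy tokenizer. In this sketch, whitespace word count stands in for real token counting, and the chunk size is tiny so the split is visible:

```python
def count_tokens(text):
    # Whitespace word count stands in for a real tokenizer in this sketch
    return len(text.split())

def chunk_paragraphs(paragraphs, chunk_size=8):
    """Greedy paragraph packing with a one-paragraph overlap between chunks."""
    chunks, current, size = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and size + n > chunk_size:
            chunks.append("\n\n".join(current))
            current = [current[-1], para]  # carry the last paragraph forward
            size = count_tokens(current[0]) + n
        else:
            current.append(para)
            size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

paras = ["one two three four", "five six seven", "eight nine ten eleven twelve"]
chunks = chunk_paragraphs(paras, chunk_size=8)
print(chunks[0])  # ends with "five six seven"
print(chunks[1])  # starts with "five six seven" (the overlap)
```

The middle paragraph appears in both chunks, so a question whose answer spans that boundary can still be retrieved from either side.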
#### 2. Embed Access Control in Metadata
```python
# When indexing documents
chunk_metadata = {
    "chunk_id": "doc-547_chunk-12",
    "source_document": "security_policy.pdf",
    "classification": "internal",
    "accessible_by_roles": ["security_team", "management", "employees"],
    "accessible_by_users": ["user123", "user456"],  # Specific user grants
    "created_date": "2024-11-01",
    "last_modified": "2025-01-15",
    "content_type": "policy_document"
}

vector_db.insert(
    vector=chunk_embedding,
    metadata=chunk_metadata
)
```
At query time:
```python
# Retrieve only authorized chunks
results = vector_db.search(
    query_vector=embed(user_question),
    filters={
        "$or": [
            {"accessible_by_roles": {"$in": user.roles}},
            {"accessible_by_users": user.id}
        ],
        "classification": {"$in": ["public", "internal"]}  # User's clearance
    },
    top_k=5
)
```
Result: Users only retrieve chunks they’re authorized to see. Access control lives in the data layer, not just the application layer.
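Filter syntax differs across vector stores, but the check itself reduces to a small predicate that can be unit-tested without any database. A sketch mirroring the metadata shape above:

```python
def chunk_visible_to(metadata, user_id, user_roles, clearance):
    """Mirror of the query-time filter: role grant OR user grant, AND clearance."""
    role_ok = any(r in metadata.get("accessible_by_roles", []) for r in user_roles)
    user_ok = user_id in metadata.get("accessible_by_users", [])
    class_ok = metadata.get("classification") in clearance
    return (role_ok or user_ok) and class_ok

meta = {
    "classification": "internal",
    "accessible_by_roles": ["security_team", "employees"],
    "accessible_by_users": ["user123"],
}
print(chunk_visible_to(meta, "user999", ["employees"], {"public", "internal"}))    # True
print(chunk_visible_to(meta, "user999", ["contractors"], {"public", "internal"}))  # False
print(chunk_visible_to(meta, "user123", [], {"public"}))                           # False
```

Keeping a plain-Python mirror of the store's filter lets you assert, in CI, that your translation to the vector DB's filter language matches the intended policy.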
#### 3. Sanitize Retrieved Content
Even when a user is authorized, sanitize before sending to the model:
```python
def sanitize_chunk(chunk, requesting_user):
    """
    Sanitize chunk even after authorization check
    """
    text = chunk["text"]

    # Redact PII even from authorized documents
    text = redact_ssn(text)
    text = redact_credit_cards(text)
    text = redact_api_keys(text)

    # Add provenance watermark
    text += f"\n\n[Source: {chunk['metadata']['source_document']}]"
    text += f"\n[Retrieved for user {requesting_user.id} at {datetime.now()}]"

    # Check for additional redaction rules
    if chunk["metadata"].get("contains_financial_data"):
        text = redact_financial_details(text)

    return {
        "text": text,
        "source": chunk["metadata"]["source_document"],
        "classification": chunk["metadata"]["classification"]
    }
```
Defense in depth: Authorization alone isn’t enough. Always sanitize.
## Practical Example: Secure AI Code Review System
Here’s what all these principles look like working together:
````python
def secure_code_review_pipeline(code_file, requesting_user):
    """
    Complete secure pipeline for AI code review
    """
    # Step 1: Classify the code
    classification = classify_code(code_file)

    # Step 2: Check user authorization
    if not user_authorized_for_classification(requesting_user, classification):
        raise PermissionError(f"User not authorized for {classification} code")

    # Step 3: Route to appropriate model
    if classification in ["secret", "top-secret"]:
        return {"error": "Classification too high for AI", "manual_review": True}
    elif classification == "confidential":
        model = local_model_provider
    else:
        model = cloud_model_provider

    # Step 4: Structure the code for AI
    structured_prompt = f"""
# Code Review Request

## Metadata
- File: {code_file.path}
- Classification: {classification}
- Language: {code_file.language}
- Author: {code_file.author}
- Date: {code_file.date}

## Review Criteria
- Security vulnerabilities (SQL injection, XSS, command injection)
- Authentication/authorization flaws
- Sensitive data exposure
- Cryptographic issues

## Code
```{code_file.language}
{sanitize_code(code_file.content)}
```

## Instructions
Provide a security-focused review. Flag vulnerabilities as HIGH/MEDIUM/LOW.
Include OWASP references where applicable.
"""

    # Step 5: Query AI model
    response = model.query(structured_prompt)

    # Step 6: Filter AI output for any leaked sensitive data
    filtered_response = filter_sensitive_output(response)

    # Step 7: Log interaction for audit
    log_ai_interaction({
        "user": requesting_user.id,
        "file": code_file.path,
        "classification": classification,
        "model": model.name,
        "timestamp": datetime.now(),
        "tokens_used": count_tokens(structured_prompt) + count_tokens(response)
    })

    return filtered_response
````
**Security layers:**
1. ✅ Classification-based routing
2. ✅ User authorization check
3. ✅ Structured prompt with clear sections
4. ✅ Code sanitization before sending
5. ✅ Output filtering to catch leaks
6. ✅ Comprehensive audit logging
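Step 6's `filter_sensitive_output` isn't spelled out above; one minimal way to sketch it is the redaction idea run in reverse, scanning the model's response for patterns that should never reach a caller (the pattern list here is illustrative, not exhaustive):

```python
import re

SENSITIVE_PATTERNS = [
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]'),              # SSNs
    (r'\b(?:AKIA|ASIA)[A-Z0-9]{16}\b', '[AWS_KEY_REDACTED]'),  # AWS access key IDs
    (r'(?i)api[_-]?key\s*[:=]\s*\S+', '[API_KEY_REDACTED]'),   # key assignments
]

def filter_sensitive_output(response):
    """Redact sensitive patterns from model output before returning it."""
    for pattern, replacement in SENSITIVE_PATTERNS:
        response = re.sub(pattern, replacement, response)
    return response

leaky = "Found hardcoded api_key=sk-abc123 and SSN 123-45-6789 in the diff."
print(filter_sensitive_output(leaky))
```

Output filtering is the last line of defense: even if prompt injection tricks the model into echoing embedded data, the pattern scan catches the most recognizable secrets on the way out.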
## Key Takeaways
1. **Structure data explicitly** - Markdown, tags, and hierarchy give you clarity and fine-grained control
2. **Least context necessary** - Send only what the model needs for the specific task
3. **Classify and route** - Sensitive data stays on local models or stays away from AI entirely
4. **Use RAG for large datasets** - Retrieve only relevant, authorized documents instead of dumping everything into context
5. **Sanitize aggressively** - Redact PII and sensitive data before anything reaches the model
6. **Embed access control** - Enforce authorization at the data layer, not just the application layer
7. **Log everything** - Maintain an audit trail for compliance and forensic analysis
8. **Defense in depth** - Layer authorization, sanitization, output filtering, and monitoring together
**Make data AI-readable AND secure.** It takes deliberate design, but these patterns make it straightforward.
The alternative -- hobbling AI with too little context or flooding it with sensitive data -- doesn't work. These patterns carve out the middle ground where AI systems are both useful and safe.
---
## What Data Challenges Are You Facing?
How are you balancing context richness with data security in your AI systems? Have you implemented RAG with access control, or are you using a different approach to limit what the model sees? If you've built sanitization pipelines or data classification routing, I'd like to hear what worked and what didn't. The practical details matter more than theory here -- share your experience.