Abderrahmane Smimite · 10 min read
AI-Assisted mapping generation
Summary of our ongoing research on accelerating the creation of consistent framework mappings
Introduction: Building on Our Framework Mapping Foundation
Last year, we introduced the capability to map your compliance data from one framework to another in CISO Assistant. This feature allowed organizations to reuse their compliance work across multiple frameworks—applying a control implemented for ISO 27001 to satisfy SOC 2 requirements, for example. We also shipped a toolbox to customize and create your own mappings, enabling teams to cover custom frameworks or refine existing crosswalks to match their specific organizational needs.
While these capabilities were powerful, they still required significant time and effort from compliance experts. Creating a comprehensive mapping between two frameworks meant manually analyzing hundreds or thousands of requirement pairs, understanding their semantic relationships, and documenting the rationale. For organizations working with multiple frameworks, or for us maintaining mappings across the 100+ frameworks in our library, this manual approach simply didn’t scale.
This led us to ask: Could AI help us accelerate the framework mapping creation process while maintaining quality? This article chronicles our journey exploring different AI approaches to automate mapping generation—from LLM-based semantic analysis to embedding-based similarity matching—and the practical tooling we built to make AI-assisted framework mapping a reality.
The Problem: The Framework Mapping Challenge
In the world of cybersecurity governance, risk, and compliance (GRC), organizations rarely operate under a single framework. A typical enterprise might need to demonstrate compliance with ISO 27001 for international credibility, SOC 2 for customer assurance, NIST CSF for cybersecurity maturity, and various industry-specific frameworks depending on their sector.
This creates an immediate challenge: how do requirements across these frameworks relate to each other?
Understanding these relationships—called “framework mapping”—is crucial for:
- Efficiency: Implementing one control that satisfies multiple framework requirements
- Coverage Analysis: Identifying gaps where one framework requires something another doesn’t
- Compliance Reuse: Leveraging existing compliance work when adopting new frameworks
- Strategic Planning: Understanding the overlap and delta when moving between frameworks
The Manual Mapping Problem
Traditionally, framework mapping is done manually by compliance experts who:
- Read each requirement in the source framework
- Understand its intent and scope
- Search through the target framework for related requirements
- Determine the type and strength of relationship
- Document the mapping with justification
For two frameworks with 100 requirements each, this means 10,000 pairwise comparisons. At 2-3 minutes per comparison, we’re looking at 300-500 hours of expert time—and that’s for just one framework pair. With 100+ frameworks in the CISO Assistant library, comprehensive mapping would require tens of thousands of expert-hours.
Beyond time, manual mapping suffers from:
- Inconsistency: Different experts interpret relationships differently
- Incompleteness: Easy to miss non-obvious connections
- Non-reproducibility: Hard to validate or update as frameworks evolve
- Cognitive fatigue: Quality degrades over long sessions
The question became: Can we use AI to accelerate this process while maintaining quality?
Our Journey: Exploring Different Approaches
We explored three main approaches to automated framework mapping, each with distinct trade-offs.
Approach 1: LLM-Based Semantic Mapping
The Hypothesis: Large language models have been trained on vast amounts of text, including security and compliance documentation. They should be able to understand semantic relationships between requirements and provide human-like reasoning about their connections.
Implementation: We built semantic_mapper.py using local Ollama models to do the following (a sketch of the per-pair classification step appears after the list):
- Parse frameworks from YAML files, extracting all assessable requirements
- Build rich context by combining each requirement with its parent/grandparent context (e.g., “Identify > Asset Management > Hardware Assets” provides better context than just “Hardware Assets”)
- Perform semantic comparison by prompting the LLM to compare each source requirement against all target requirements
- Classify relationships into three types:
  - equal: Same topic/requirement with equivalent scope
  - intersect: Related but not entirely equivalent
  - no_relationship: Different topics with no overlap
- Generate explanations for each relationship with confidence scores
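To make the classification step concrete, here is a minimal sketch of one source-target comparison against a local Ollama server on its default port. The prompt wording, the `qwen2.5` model name, and the example requirement texts are illustrative assumptions; the real semantic_mapper.py adds context building, checkpointing, and retries.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

PROMPT_TEMPLATE = """You are mapping cybersecurity framework requirements.
Source requirement: {source}
Target requirement: {target}
Classify the relationship as one of: equal, intersect, no_relationship.
Respond with JSON only: {{"relationship": "...", "confidence": 0.0, "explanation": "..."}}"""

def classify_pair(source: str, target: str, model: str = "qwen2.5") -> dict:
    """Ask a local LLM to classify the relationship between two requirements."""
    payload = {
        "model": model,  # illustrative choice; any Ollama model tag works here
        "prompt": PROMPT_TEMPLATE.format(source=source, target=target),
        "stream": False,
        # Low temperature and a short output cap keep answers terse and mostly stable.
        "options": {"temperature": 0.1, "num_predict": 200},
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    raw = response.json()["response"]
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # LLMs occasionally wrap the JSON in prose; flag the pair for retry or review.
        return {"relationship": "parse_error", "confidence": 0.0, "explanation": raw}

result = classify_pair(
    "Maintain an inventory of hardware assets",
    "Inventory of information and other associated assets",
)
print(result)
```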
Key Features:
- Checkpoint/resume: Long-running jobs can be interrupted and resumed (sketched after this list)
- Top-N matching: Find multiple related requirements, not just the best one
- Threshold filtering: Only keep high-confidence matches
- Model flexibility: Compare results from different LLMs (mistral, llama3, phi3, etc.)
- Connection pooling and model keep-alive: Optimized for faster inference
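Checkpointing matters when a single run can last days. Below is a minimal sketch of the pattern, assuming results are flushed to a JSON file keyed by source requirement ID and that a classification callable like the one above is passed in; the file name and data layout are illustrative.

```python
import json
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("mapping_checkpoint.json")  # hypothetical checkpoint location

def run_with_checkpoints(
    source_reqs: dict[str, str],
    target_reqs: dict[str, str],
    classify: Callable[[str, str], dict],
    interval: int = 5,
) -> dict:
    """Process source requirements, persisting progress so an interrupted run can resume."""
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for count, (req_id, text) in enumerate(source_reqs.items(), start=1):
        if req_id in done:
            continue  # finished in a previous run
        done[req_id] = [classify(text, target) for target in target_reqs.values()]
        if count % interval == 0:  # configurable interval keeps checkpoint I/O low
            CHECKPOINT.write_text(json.dumps(done, indent=2))
    CHECKPOINT.write_text(json.dumps(done, indent=2))
    return done
```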
Strengths:
- ✅ Natural language explanations that humans can review
- ✅ Nuanced understanding of context and intent
- ✅ Flexibility in interpretation (can understand implied relationships)
- ✅ Can compare results from multiple models for validation
Weaknesses:
- ❌ Extremely slow: Performance varies drastically based on model size and hardware
- Even on high-end hardware (M4 Mac), processing took ~30 hours for real-world framework pairs
- Some benchmark runs took 3+ days to complete
- For large frameworks (500+ requirements), multiply these times by 5-10x
- ❌ Quality-speed paradox: Longer runtime doesn’t guarantee better results—some multi-day runs produced poor mappings
- ❌ Non-deterministic (same input may produce slightly different outputs)
- ❌ Requires local Ollama server and model downloads (multi-GB)
- ❌ Occasional JSON parsing errors from LLM responses
- ❌ Quality varies significantly by model choice—extensive testing required to find good model-task fit
Performance Reality Check:
Our extensive benchmarks revealed that LLM-based mapping is impractical for production use:
- M4 Mac (state-of-the-art hardware): ~30 hours for typical framework pairs
- Older hardware: 72+ hours not uncommon
- Best-performing models: Gemma and Qwen showed the best overall quality-to-speed ratio
- The model lottery: Most models required extensive testing to validate quality, with many producing poor results despite long runtimes
This extreme slowness made checkpoint/resume functionality absolutely essential—losing days of progress to interruption was unacceptable.
Optimization Journey: Despite implementing several performance improvements (the pooling and keep-alive pieces are sketched after this list), the fundamental speed limitation remained:
- HTTP connection pooling (saved ~50-100ms per request)
- Model keep-alive to prevent costly reloads (30 minutes)
- Pre-warming models at startup
- Configurable checkpoint intervals to reduce I/O overhead
- Optimized LLM parameters (temperature=0.1, num_predict=200)
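Most of these optimizations are plumbing around the Ollama HTTP API. Here is a sketch of the pooling, keep-alive, and pre-warming parts, using the same local endpoint as above; the model name and timeout values are assumptions.

```python
import requests

session = requests.Session()  # reuses TCP connections across thousands of requests

def warm_up(model: str = "qwen2.5") -> None:
    """Send a throwaway prompt so the model is loaded before the real run starts."""
    session.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "ping",
            "stream": False,
            # keep_alive asks Ollama to keep the model in memory between calls,
            # avoiding a costly reload on every request.
            "keep_alive": "30m",
        },
        timeout=600,
    )
```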
Even with these optimizations, the 30+ hour runtime on top-tier hardware made LLM mapping viable only for small-scale validation, not production use. This reality drove us to prioritize SBERT.
Approach 2: SBERT-Based Embedding Mapping
The Hypothesis: If we treat framework mapping as a pure semantic similarity problem, sentence transformers (SBERT) can compute embeddings and cosine similarity 10-100x faster than LLM inference, with deterministic results.
Implementation: We built sbert_mapper.py using Sentence-BERT models to do the following (a minimal sketch appears after the model options below):
- Parse frameworks identically to the LLM approach
- Generate embeddings for all requirements in both frameworks using pre-trained sentence transformers
- Compute similarity matrix using cosine similarity between all source-target pairs
- Classify relationships based on similarity thresholds:
  - equal: similarity ≥ 0.85
  - intersect: 0.50 ≤ similarity < 0.85
  - no_relationship: similarity < 0.50
- Return top matches with similarity scores
Model Options:
- all-MiniLM-L6-v2: Fast (1000 sentences/sec), lightweight (384 dims)
- all-mpnet-base-v2: High quality (768 dims), our recommendation
- paraphrase-multilingual: For non-English frameworks
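Here is a minimal sketch of the embedding step with the sentence-transformers library, applying the thresholds above. The requirement texts are illustrative; sbert_mapper.py adds framework parsing, top-N selection, and CSV export.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # our recommended default

def classify(score: float) -> str:
    """Map a cosine similarity score to a relationship type using the thresholds above."""
    if score >= 0.85:
        return "equal"
    if score >= 0.50:
        return "intersect"
    return "no_relationship"

# Hypothetical requirement texts; the real tool builds these from the YAML frameworks.
source_reqs = ["Maintain an inventory of hardware assets"]
target_reqs = [
    "Inventory of information and other associated assets",
    "Cryptographic key management",
]

source_emb = model.encode(source_reqs, convert_to_tensor=True, normalize_embeddings=True)
target_emb = model.encode(target_reqs, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(source_emb, target_emb)  # full source x target matrix

for i, src in enumerate(source_reqs):
    for j, tgt in enumerate(target_reqs):
        score = similarity[i][j].item()
        print(f"{src!r} -> {tgt!r}: {score:.2f} ({classify(score)})")
```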
Strengths:
- ✅ Blazingly fast: 2-30 seconds for framework pairs where the LLM approach needed hours or days
- ✅ Deterministic: Same input always produces identical output
- ✅ No external services: Runs completely offline
- ✅ GPU acceleration: Even faster on CUDA/MPS devices
- ✅ No parsing issues: Direct numerical computation
- ✅ Mathematically grounded: Cosine similarity is well-understood
Weaknesses:
- ❌ No natural language explanations
- ❌ Less nuanced understanding of context
- ❌ Purely similarity-based (can’t reason about relationship types)
- ❌ Threshold tuning is somewhat arbitrary
Performance: Significantly faster than LLM-based approaches—typically completing in seconds to minutes what would take LLMs hours or days. Performance scales with framework size and hardware capabilities, with GPU acceleration providing additional speedup.
Approach 3: Hybrid LLM + Embeddings
The Hypothesis: What if we give LLMs the benefit of embedding similarity as additional context? They could use it to validate their reasoning while still providing explanations.
Implementation: We enhanced semantic_mapper.py with an optional --embedding-model parameter that does the following (the embedding side is sketched after the list):
- Generates embeddings using Ollama’s embedding models (nomic-embed-text, mxbai-embed-large)
- Computes cosine similarity for each source-target pair
- Passes similarity score to the LLM as context: “The embedding similarity is X. Does this match your semantic analysis?”
- Outputs both similarity and LLM reasoning
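A sketch of the embedding side of the hybrid mode, assuming Ollama's embeddings endpoint with nomic-embed-text; the resulting score is then injected into the classification prompt as extra context. The example texts and prompt wording are illustrative.

```python
import numpy as np
import requests

def embed(text: str, model: str = "nomic-embed-text") -> np.ndarray:
    """Get an embedding vector from a local Ollama embedding model."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=120,
    )
    response.raise_for_status()
    return np.array(response.json()["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(
    embed("Maintain an inventory of hardware assets"),
    embed("Inventory of information and other associated assets"),
)

# The similarity score becomes extra context in the LLM prompt (see classify_pair above).
hint = (
    f"The embedding similarity between these requirements is {sim:.2f}. "
    "Does this match your semantic analysis?"
)
```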
Outcome:
- Provides numerical grounding for LLM decisions
- Allows comparison between embedding and LLM scores
- Helps identify cases where they disagree (often interesting edge cases)
- Still slow (LLM speed remains the bottleneck)
Multi-Model Comparison
We also built compare_models.py to systematically compare different LLMs on the same mapping task (a sketch of the agreement analysis follows the list below). Through extensive benchmarking across multiple models, we discovered:
- Agreement patterns: Where models agree usually indicates high-confidence mappings
- Disagreement analysis: Where models disagree often highlights nuanced requirements needing human judgment
- Model performance hierarchy:
- Gemma & Qwen: Emerged as top performers with the best balance of quality and speed for framework mapping
- Mistral/Llama: Reasonable quality but significantly slower
- Smaller models (phi3, deepseek-1.5b, smollm2): Much faster but inconsistent quality—sometimes surprisingly good, often poor
- The “bigger is better” myth: We found model architecture and training matter more than size—some 3-4B models (Gemma, Qwen) outperformed 7B+ models
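As a rough illustration of the agreement analysis, here is a sketch assuming each model's output has been exported to a CSV with source_node_id, target_node_id, and relationship columns; the file names and column names are hypothetical.

```python
import pandas as pd

def load(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    return df.set_index(["source_node_id", "target_node_id"])

a = load("mapping_gemma.csv")  # hypothetical output files, one per model
b = load("mapping_qwen.csv")

# Align on requirement pairs that both models scored.
joined = a.join(b, lsuffix="_gemma", rsuffix="_qwen", how="inner")

agree = joined["relationship_gemma"] == joined["relationship_qwen"]
print(f"Agreement: {agree.mean():.1%} of {len(joined)} shared pairs")

# Disagreements are the pairs most worth sending to a human reviewer.
print(joined[~agree].head())
```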
The Complete Toolchain
Beyond the core mapping engines, we built supporting tools:
1. Visualization: heatmap_builder.py
Generates heatmap visualizations of the score matrix between frameworks, making patterns visible at a glance (a rendering sketch follows the list):
- Clusters of strong relationships
- Gaps in coverage
- Asymmetric mappings (A→B ≠ B→A)
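A sketch of the rendering step with matplotlib, assuming the similarity matrix has already been computed (heatmap_builder.py reads it from the mapper's CSV output; the random matrix here is just a stand-in).

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in similarity matrix: rows = source requirements, columns = target requirements.
scores = np.random.rand(40, 60)

fig, ax = plt.subplots(figsize=(12, 8))
im = ax.imshow(scores, cmap="viridis", aspect="auto", vmin=0.0, vmax=1.0)
ax.set_xlabel("Target framework requirements")
ax.set_ylabel("Source framework requirements")
ax.set_title("Requirement similarity heatmap")
fig.colorbar(im, ax=ax, label="Cosine similarity")
fig.savefig("heatmap.png", dpi=150, bbox_inches="tight")
```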
2. Interactive Exploration: heatmap_builder_notebook.py
Marimo notebook for interactive parameter tuning:
- Drag-and-drop CSV loading
- Real-time threshold adjustment
- Colormap selection
- Statistics dashboard
3. Simplification: simplify_mapping.py
Converts detailed mapping CSVs to the standard 4-column format for easy import into graph databases (Neo4j) or other systems (a conversion sketch follows):
source_node_id,target_node_id,relationship,strength_of_relationship
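A sketch of that conversion with pandas; the input file name and the "similarity" source column are assumptions, as the detailed CSV layout may differ.

```python
import pandas as pd

detailed = pd.read_csv("detailed_mapping.csv")  # hypothetical input path

# Assume the detailed file carries the score in a "similarity" column.
simplified = detailed.rename(columns={"similarity": "strength_of_relationship"})[
    ["source_node_id", "target_node_id", "relationship", "strength_of_relationship"]
]
simplified.to_csv("simplified_mapping.csv", index=False)
```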
4. Human Review: prepare_review.py
Generates Excel and HTML review files (an export sketch follows the list) with:
- Side-by-side requirement descriptions
- Filtering options (favor “equal” relationships)
- Human-friendly formatting for validation
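A sketch of the review export using pandas (the Excel writer needs the openpyxl package); the real tool adds side-by-side descriptions, filtering, and formatting, and the file names here are illustrative.

```python
import pandas as pd

mapping = pd.read_csv("simplified_mapping.csv")

# Put the strongest candidates on top so reviewers see likely "equal" matches first.
review = mapping.sort_values("strength_of_relationship", ascending=False)

review.to_excel("mapping_review.xlsx", index=False)  # requires the openpyxl package
review.to_html("mapping_review.html", index=False)
```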
From these tools, the following workflow emerged:
1. Run SBERT mapping (fast exploration)
2. Review heatmap to understand patterns
3. Run LLM mapping on interesting subsets (deep analysis)
4. Compare multi-model results for validation
5. Generate Excel review for expert validation
6. Simplify to standard format for integration
Key Insights and Best Practices
1. Context is Critical
Our early attempts used only requirement descriptions. Adding parent/grandparent context improved matching significantly:
Without context: “Maintain an inventory” → ambiguous
With context: “Identify > Asset Management > Maintain an inventory of hardware assets” → clear
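Building that context string is a simple walk up the framework tree. A sketch, assuming each requirement node keeps a reference to its parent with a name attribute (the real parser works on the hierarchy defined in the YAML frameworks):

```python
def build_context(node, max_depth: int = 2) -> str:
    """Prefix a requirement with its parent/grandparent names, e.g.
    'Identify > Asset Management > Maintain an inventory of hardware assets'."""
    parts = [node.name]
    current = node.parent  # hypothetical node structure with .name and .parent
    depth = 0
    while current is not None and depth < max_depth:
        parts.append(current.name)
        current = current.parent
        depth += 1
    return " > ".join(reversed(parts))
```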
2. The Speed-Quality Trade-off
- SBERT: Use for rapid exploration, large-scale mapping, deterministic results
- LLM: Use for detailed analysis, explanation generation, nuanced judgment
- Best practice: SBERT first for candidates, LLM second for validation
3. Multiple Matches Matter
Framework relationships are rarely 1:1. Using --top-n 3-5 reveals:
- Primary mappings (strongest match)
- Alternative implementations
- Partial overlaps
- Related requirements for cross-checking
4. Threshold Tuning by Use Case
Different thresholds serve different needs:
- High threshold (0.7-0.9): Compliance mapping (must be certain)
- Medium threshold (0.5-0.7): Coverage analysis (interesting relationships)
- Low threshold (0.3-0.5): Gap discovery (tangential connections)
5. Human-in-the-Loop is Essential
AI accelerates but doesn’t replace expertise. The tools:
- Eliminate 90% of obvious “no relationship” cases
- Surface likely matches for expert review
- Provide explanations as starting points for validation
- Catch non-obvious connections humans might miss
6. Model Selection Matters
Different models showed different strengths in our extensive testing:
For LLM-based mapping:
- Gemma & Qwen: Best overall performers for framework mapping—good quality results with relatively better speed (though still 30+ hours)
- Mistral: Decent quality but slower than Gemma/Qwen for this task
- Llama 3.1: More conservative but extremely slow
- DeepSeek/SmolLM: Tested but did not perform as well as Gemma/Qwen
For SBERT-based mapping:
- MPNet (all-mpnet-base-v2): Best quality-to-speed ratio, our recommended default
- MiniLM (all-MiniLM-L6-v2): Fastest option, good for exploration
7. The Value of Determinism
For production systems, SBERT’s determinism is invaluable:
- Testable and debuggable
- Version-controllable
- Reproducible audits
- No “why did it change?” questions
Results and Impact
These tools have enabled us to:
- Accelerate mapping: Tasks taking weeks now take hours
- Scale coverage: Map more framework pairs than humanly feasible manually
- Improve consistency: Algorithmic approach reduces human bias
- Enable discovery: Find non-obvious relationships experts missed
- Facilitate updates: Re-run mappings when frameworks evolve
- Support experimentation: Try different models/parameters quickly
Future Directions
Several areas for continued exploration:
1. Fine-tuned Models
Training SBERT models specifically on security/compliance text could improve accuracy.
2. Graph-Based Analysis
Treating frameworks as knowledge graphs and using graph algorithms (PageRank, community detection) could reveal structural patterns.
3. Bidirectional Mapping
Currently we map source→target. Comparing with target→source could identify asymmetries and improve confidence.
4. Active Learning
Using human corrections to iteratively improve model parameters or fine-tune embeddings.
5. Multi-Framework Transitive Mapping
If A→B and B→C are known, can we infer A→C? Could this improve coverage and catch errors?
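As a thought experiment, transitive inference could start as a composition of two mapping tables that downgrades the relationship whenever either hop is only partial. A sketch under that assumption, with a hypothetical in-memory structure:

```python
def compose(ab: dict, bc: dict) -> dict:
    """Infer candidate A->C links from A->B and B->C mappings.
    Each mapping is {source_id: [(target_id, relationship), ...]} (hypothetical structure)."""
    ac = {}
    for a_id, b_links in ab.items():
        for b_id, rel_ab in b_links:
            for c_id, rel_bc in bc.get(b_id, []):
                # Two "equal" hops stay "equal"; anything weaker degrades to "intersect".
                rel = "equal" if rel_ab == rel_bc == "equal" else "intersect"
                ac.setdefault(a_id, []).append((c_id, rel))
    return ac
```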
6. Confidence Calibration
Better understanding when the model is uncertain and should defer to humans.
Conclusion
Framework mapping is a perfect example of a task that AI can accelerate but not fully automate. The combination of:
- SBERT for speed and determinism
- LLMs for reasoning and explanation
- Visualization for pattern discovery
- Human expertise for validation
…creates a powerful workflow that multiplies human capability.
The tools we’ve built reduce what was a months-long expert task to a days-long collaborative process between AI and humans. More importantly, they make comprehensive framework mapping feasible at the scale of 100+ frameworks—something that was simply not practical before.
The code is open source in the CISO Assistant repository under tools/mapping_builder/. We hope it accelerates the work of compliance teams everywhere and serves as a template for applying AI to other document-heavy expert tasks.
For technical details and usage instructions, see README.md in the mapping_builder directory.

