Abderrahmane Smimite · 10 min read
AI-Assisted mapping generation
Summary of our ongoing research on accelerating the creation of consistent framework mappings
Introduction: Building on Our Framework Mapping Foundation
Last year, we introduced the capability to map your compliance data from one framework to another in CISO Assistant. This feature allowed organizations to reuse their compliance work across multiple frameworks—applying a control implemented for ISO 27001 to satisfy SOC 2 requirements, for example. We also shipped a toolbox to customize and create your own mappings, enabling teams to cover custom frameworks or refine existing crosswalks to match their specific organizational needs.
While these capabilities were powerful, they still required significant time and effort from compliance experts. Creating a comprehensive mapping between two frameworks meant manually analyzing hundreds or thousands of requirement pairs, understanding their semantic relationships, and documenting the rationale. For organizations working with multiple frameworks, or for us maintaining mappings across the 100+ frameworks in our library, this manual approach simply didn’t scale.
This led us to ask: Could AI help us accelerate the framework mapping creation process while maintaining quality? This article chronicles our journey exploring different AI approaches to automate mapping generation—from LLM-based semantic analysis to embedding-based similarity matching—and the practical tooling we built to make AI-assisted framework mapping a reality.
The Problem: The Framework Mapping Challenge
In the world of cybersecurity governance, risk, and compliance (GRC), organizations rarely operate under a single framework. A typical enterprise might need to demonstrate compliance with ISO 27001 for international credibility, SOC 2 for customer assurance, NIST CSF for cybersecurity maturity, and various industry-specific frameworks depending on their sector.
This creates an immediate challenge: how do requirements across these frameworks relate to each other?
Understanding these relationships—called “framework mapping”—is crucial for:
- Efficiency: Implementing one control that satisfies multiple framework requirements
- Coverage Analysis: Identifying gaps where one framework requires something another doesn’t
- Compliance Reuse: Leveraging existing compliance work when adopting new frameworks
- Strategic Planning: Understanding the overlap and delta when moving between frameworks
The Manual Mapping Problem
Traditionally, framework mapping is done manually by compliance experts who:
- Read each requirement in the source framework
- Understand its intent and scope
- Search through the target framework for related requirements
- Determine the type and strength of relationship
- Document the mapping with justification
For two frameworks with 100 requirements each, this means 10,000 pairwise comparisons. At 2-3 minutes per comparison, we’re looking at 300-500 hours of expert time—and that’s for just one framework pair. With 100+ frameworks in the CISO Assistant library, comprehensive mapping would require tens of thousands of expert-hours.
Beyond time, manual mapping suffers from:
- Inconsistency: Different experts interpret relationships differently
- Incompleteness: Easy to miss non-obvious connections
- Non-reproducibility: Hard to validate or update as frameworks evolve
- Cognitive fatigue: Quality degrades over long sessions
The question became: Can we use AI to accelerate this process while maintaining quality?
Our Journey: Exploring Different Approaches
We explored three main approaches to automated framework mapping, each with distinct trade-offs.
Approach 1: LLM-Based Semantic Mapping
The Hypothesis: Large language models have been trained on vast amounts of text, including security and compliance documentation. They should be able to understand semantic relationships between requirements and provide human-like reasoning about their connections.
Implementation: We built semantic_mapper.py using local Ollama models to do the following (a sketch of the per-pair classification step appears after the list):
- Parse frameworks from YAML files, extracting all assessable requirements
- Build rich context by combining each requirement with its parent/grandparent context (e.g., “Identify > Asset Management > Hardware Assets” provides better context than just “Hardware Assets”)
- Perform semantic comparison by prompting the LLM to compare each source requirement against all target requirements
- Classify relationships into three types:
  - equal: Same topic/requirement with equivalent scope
  - intersect: Related but not entirely equivalent
  - no_relationship: Different topics with no overlap
- Generate explanations for each relationship with confidence scores
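To make the classification step concrete, here is a minimal sketch of one source-target comparison against a local Ollama server on its default port. The prompt wording, the `qwen2.5` model name, and the example requirement texts are illustrative assumptions; the real semantic_mapper.py adds context building, checkpointing, and retries.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

PROMPT_TEMPLATE = """You are mapping cybersecurity framework requirements.
Source requirement: {source}
Target requirement: {target}
Classify the relationship as one of: equal, intersect, no_relationship.
Respond with JSON only: {{"relationship": "...", "confidence": 0.0, "explanation": "..."}}"""

def classify_pair(source: str, target: str, model: str = "qwen2.5") -> dict:
    """Ask a local LLM to classify the relationship between two requirements."""
    payload = {
        "model": model,  # illustrative choice; any Ollama model tag works here
        "prompt": PROMPT_TEMPLATE.format(source=source, target=target),
        "stream": False,
        # Low temperature and a short output cap keep answers terse and mostly stable.
        "options": {"temperature": 0.1, "num_predict": 200},
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    raw = response.json()["response"]
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # LLMs occasionally wrap the JSON in prose; flag the pair for retry or review.
        return {"relationship": "parse_error", "confidence": 0.0, "explanation": raw}

result = classify_pair(
    "Maintain an inventory of hardware assets",
    "Inventory of information and other associated assets",
)
print(result)
```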
Key Features:
- Checkpoint/resume: Long-running jobs can be interrupted and resumed (sketched after this list)
- Top-N matching: Find multiple related requirements, not just the best one
- Threshold filtering: Only keep high-confidence matches
- Model flexibility: Compare results from different LLMs (mistral, llama3, phi3, etc.)
- Connection pooling and model keep-alive: Optimized for faster inference
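Checkpointing matters when a single run can last days. Below is a minimal sketch of the pattern, assuming results are flushed to a JSON file keyed by source requirement ID and that a classification callable like the one above is passed in; the file name and data layout are illustrative.

```python
import json
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("mapping_checkpoint.json")  # hypothetical checkpoint location

def run_with_checkpoints(
    source_reqs: dict[str, str],
    target_reqs: dict[str, str],
    classify: Callable[[str, str], dict],
    interval: int = 5,
) -> dict:
    """Process source requirements, persisting progress so an interrupted run can resume."""
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}
    for count, (req_id, text) in enumerate(source_reqs.items(), start=1):
        if req_id in done:
            continue  # finished in a previous run
        done[req_id] = [classify(text, target) for target in target_reqs.values()]
        if count % interval == 0:  # configurable interval keeps checkpoint I/O low
            CHECKPOINT.write_text(json.dumps(done, indent=2))
    CHECKPOINT.write_text(json.dumps(done, indent=2))
    return done
```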
Strengths:
- ✅ Natural language explanations that humans can review
- ✅ Nuanced understanding of context and intent
- ✅ Flexibility in interpretation (can understand implied relationships)
- ✅ Can compare results from multiple models for validation
Weaknesses:
- ❌ Extremely slow: Performance varies drastically based on model size and hardware
- Even on high-end hardware (M4 Mac), processing took ~30 hours for real-world framework pairs
- Some benchmark runs took 3+ days to complete
- For large frameworks (500+ requirements), multiply these times by 5-10x
- ❌ Quality-speed paradox: Longer runtime doesn’t guarantee better results—some multi-day runs produced poor mappings
- ❌ Non-deterministic (same input may produce slightly different outputs)
- ❌ Requires local Ollama server and model downloads (multi-GB)
- ❌ Occasional JSON parsing errors from LLM responses
- ❌ Quality varies significantly by model choice—extensive testing required to find good model-task fit
Performance Reality Check:
Our extensive benchmarks revealed that LLM-based mapping is impractical for production use:
- M4 Mac (state-of-the-art hardware): ~30 hours for typical framework pairs
- Older hardware: 72+ hours not uncommon
- Best-performing models: Gemma and Qwen showed the best overall quality-to-speed ratio
- The model lottery: Most models required extensive testing to validate quality, with many producing poor results despite long runtimes
This extreme slowness made checkpoint/resume functionality absolutely essential—losing days of progress to interruption was unacceptable.
Optimization Journey: Despite implementing several performance improvements (the pooling and keep-alive pieces are sketched after this list), the fundamental speed limitation remained:
- HTTP connection pooling (saved ~50-100ms per request)
- Model keep-alive to prevent costly reloads (30 minutes)
- Pre-warming models at startup
- Configurable checkpoint intervals to reduce I/O overhead
- Optimized LLM parameters (temperature=0.1, num_predict=200)
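Most of these optimizations are plumbing around the Ollama HTTP API. Here is a sketch of the pooling, keep-alive, and pre-warming parts, using the same local endpoint as above; the model name and timeout values are assumptions.

```python
import requests

session = requests.Session()  # reuses TCP connections across thousands of requests

def warm_up(model: str = "qwen2.5") -> None:
    """Send a throwaway prompt so the model is loaded before the real run starts."""
    session.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "ping",
            "stream": False,
            # keep_alive asks Ollama to keep the model in memory between calls,
            # avoiding a costly reload on every request.
            "keep_alive": "30m",
        },
        timeout=600,
    )
```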
Even with these optimizations, the 30+ hour runtime on top-tier hardware made LLM mapping viable only for small-scale validation, not production use. This reality drove us to prioritize SBERT.
Approach 2: SBERT-Based Embedding Mapping
The Hypothesis: If we treat framework mapping as a pure semantic similarity problem, sentence transformers (SBERT) can compute embeddings and cosine similarity 10-100x faster than LLM inference, with deterministic results.
Implementation: We built sbert_mapper.py using Sentence-BERT models to do the following (a minimal sketch appears after the model options below):
- Parse frameworks identically to the LLM approach
- Generate embeddings for all requirements in both frameworks using pre-trained sentence transformers
- Compute similarity matrix using cosine similarity between all source-target pairs
- Classify relationships based on similarity thresholds:
  - equal: similarity ≥ 0.85
  - intersect: 0.50 ≤ similarity < 0.85
  - no_relationship: similarity < 0.50
- Return top matches with similarity scores
Model Options:
- all-MiniLM-L6-v2: Fast (1000 sentences/sec), lightweight (384 dims)
- all-mpnet-base-v2: High quality (768 dims), our recommendation
- paraphrase-multilingual: For non-English frameworks
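Here is a minimal sketch of the embedding step with the sentence-transformers library, applying the thresholds above. The requirement texts are illustrative; sbert_mapper.py adds framework parsing, top-N selection, and CSV export.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # our recommended default

def classify(score: float) -> str:
    """Map a cosine similarity score to a relationship type using the thresholds above."""
    if score >= 0.85:
        return "equal"
    if score >= 0.50:
        return "intersect"
    return "no_relationship"

# Hypothetical requirement texts; the real tool builds these from the YAML frameworks.
source_reqs = ["Maintain an inventory of hardware assets"]
target_reqs = [
    "Inventory of information and other associated assets",
    "Cryptographic key management",
]

source_emb = model.encode(source_reqs, convert_to_tensor=True, normalize_embeddings=True)
target_emb = model.encode(target_reqs, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(source_emb, target_emb)  # full source x target matrix

for i, src in enumerate(source_reqs):
    for j, tgt in enumerate(target_reqs):
        score = similarity[i][j].item()
        print(f"{src!r} -> {tgt!r}: {score:.2f} ({classify(score)})")
```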
Strengths:
- ✅ Blazingly fast: 2-30 seconds for framework pairs where the LLM approach needed hours or days
- ✅ Deterministic: Same input always produces identical output
- ✅ No external services: Runs completely offline
- ✅ GPU acceleration: Even faster on CUDA/MPS devices
- ✅ No parsing issues: Direct numerical computation
- ✅ Mathematically grounded: Cosine similarity is well-understood
Weaknesses:
- ❌ No natural language explanations
- ❌ Less nuanced understanding of context
- ❌ Purely similarity-based (can’t reason about relationship types)
- ❌ Threshold tuning is somewhat arbitrary
Performance: Significantly faster than LLM-based approaches—typically completing in seconds to minutes what would take LLMs hours or days. Performance scales with framework size and hardware capabilities, with GPU acceleration providing additional speedup.
Approach 3: Hybrid LLM + Embeddings
The Hypothesis: What if we give LLMs the benefit of embedding similarity as additional context? They could use it to validate their reasoning while still providing explanations.
Implementation: We enhanced semantic_mapper.py with an optional --embedding-model parameter that does the following (the embedding side is sketched after the list):
- Generates embeddings using Ollama’s embedding models (nomic-embed-text, mxbai-embed-large)
- Computes cosine similarity for each source-target pair
- Passes similarity score to the LLM as context: “The embedding similarity is X. Does this match your semantic analysis?”
- Outputs both similarity and LLM reasoning
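A sketch of the embedding side of the hybrid mode, assuming Ollama's embeddings endpoint with nomic-embed-text; the resulting score is then injected into the classification prompt as extra context. The example texts and prompt wording are illustrative.

```python
import numpy as np
import requests

def embed(text: str, model: str = "nomic-embed-text") -> np.ndarray:
    """Get an embedding vector from a local Ollama embedding model."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=120,
    )
    response.raise_for_status()
    return np.array(response.json()["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(
    embed("Maintain an inventory of hardware assets"),
    embed("Inventory of information and other associated assets"),
)

# The similarity score becomes extra context in the LLM prompt (see classify_pair above).
hint = (
    f"The embedding similarity between these requirements is {sim:.2f}. "
    "Does this match your semantic analysis?"
)
```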
Outcome:
- Provides numerical grounding for LLM decisions
- Allows comparison between embedding and LLM scores
- Helps identify cases where they disagree (often interesting edge cases)
- Still slow (LLM speed remains the bottleneck)
Multi-Model Comparison
We also built compare_models.py to systematically compare different LLMs on the same mapping task (a sketch of the agreement analysis follows the list below). Through extensive benchmarking across multiple models, we discovered:
- Agreement patterns: Where models agree usually indicates high-confidence mappings
- Disagreement analysis: Where models disagree often highlights nuanced requirements needing human judgment
- Model performance hierarchy:
- Gemma & Qwen: Emerged as top performers with the best balance of quality and speed for framework mapping
- Mistral/Llama: Reasonable quality but significantly slower
- Smaller models (phi3, deepseek-1.5b, smollm2): Much faster but inconsistent quality—sometimes surprisingly good, often poor
- The “bigger is better” myth: We found model architecture and training matter more than size—some 3-4B models (Gemma, Qwen) outperformed 7B+ models
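As a rough illustration of the agreement analysis, here is a sketch assuming each model's output has been exported to a CSV with source_node_id, target_node_id, and relationship columns; the file names and column names are hypothetical.

```python
import pandas as pd

def load(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    return df.set_index(["source_node_id", "target_node_id"])

a = load("mapping_gemma.csv")  # hypothetical output files, one per model
b = load("mapping_qwen.csv")

# Align on requirement pairs that both models scored.
joined = a.join(b, lsuffix="_gemma", rsuffix="_qwen", how="inner")

agree = joined["relationship_gemma"] == joined["relationship_qwen"]
print(f"Agreement: {agree.mean():.1%} of {len(joined)} shared pairs")

# Disagreements are the pairs most worth sending to a human reviewer.
print(joined[~agree].head())
```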
The Complete Toolchain
Beyond the core mapping engines, we built supporting tools:
1. Visualization: heatmap_builder.py
Generates heatmap visualizations of the score matrix between frameworks, making patterns visible at a glance (a rendering sketch follows the list):
- Clusters of strong relationships
- Gaps in coverage
- Asymmetric mappings (A→B ≠ B→A)
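A sketch of the rendering step with matplotlib, assuming the similarity matrix has already been computed (heatmap_builder.py reads it from the mapper's CSV output; the random matrix here is just a stand-in).

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in similarity matrix: rows = source requirements, columns = target requirements.
scores = np.random.rand(40, 60)

fig, ax = plt.subplots(figsize=(12, 8))
im = ax.imshow(scores, cmap="viridis", aspect="auto", vmin=0.0, vmax=1.0)
ax.set_xlabel("Target framework requirements")
ax.set_ylabel("Source framework requirements")
ax.set_title("Requirement similarity heatmap")
fig.colorbar(im, ax=ax, label="Cosine similarity")
fig.savefig("heatmap.png", dpi=150, bbox_inches="tight")
```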
2. Interactive Exploration: heatmap_builder_notebook.py
Marimo notebook for interactive parameter tuning:
- Drag-and-drop CSV loading
- Real-time threshold adjustment
- Colormap selection
- Statistics dashboard
3. Simplification: simplify_mapping.py
Converts detailed mapping CSVs to the standard 4-column format for easy import into graph databases (Neo4j) or other systems (a conversion sketch follows):
source_node_id,target_node_id,relationship,strength_of_relationship
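A sketch of that conversion with pandas; the input file name and the "similarity" source column are assumptions, as the detailed CSV layout may differ.

```python
import pandas as pd

detailed = pd.read_csv("detailed_mapping.csv")  # hypothetical input path

# Assume the detailed file carries the score in a "similarity" column.
simplified = detailed.rename(columns={"similarity": "strength_of_relationship"})[
    ["source_node_id", "target_node_id", "relationship", "strength_of_relationship"]
]
simplified.to_csv("simplified_mapping.csv", index=False)
```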
4. Human Review: prepare_review.py
Generates Excel and HTML review files (an export sketch follows the list) with:
- Side-by-side requirement descriptions
- Filtering options (favor “equal” relationships)
- Human-friendly formatting for validation
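A sketch of the review export using pandas (the Excel writer needs the openpyxl package); the real tool adds side-by-side descriptions, filtering, and formatting, and the file names here are illustrative.

```python
import pandas as pd

mapping = pd.read_csv("simplified_mapping.csv")

# Put the strongest candidates on top so reviewers see likely "equal" matches first.
review = mapping.sort_values("strength_of_relationship", ascending=False)

review.to_excel("mapping_review.xlsx", index=False)  # requires the openpyxl package
review.to_html("mapping_review.html", index=False)
```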
From these tools, the following workflow emerged:
1. Run SBERT mapping (fast exploration)
2. Review heatmap to understand patterns
3. Run LLM mapping on interesting subsets (deep analysis)
4. Compare multi-model results for validation
5. Generate Excel review for expert validation
6. Simplify to standard format for integration
Key Insights and Best Practices
1. Context is Critical
Our early attempts used only requirement descriptions. Adding parent/grandparent context improved matching significantly:
Without context: “Maintain an inventory” → ambiguous
With context: “Identify > Asset Management > Maintain an inventory of hardware assets” → clear
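Building that context string is a simple walk up the framework tree. A sketch, assuming each requirement node keeps a reference to its parent with a name attribute (the real parser works on the hierarchy defined in the YAML frameworks):

```python
def build_context(node, max_depth: int = 2) -> str:
    """Prefix a requirement with its parent/grandparent names, e.g.
    'Identify > Asset Management > Maintain an inventory of hardware assets'."""
    parts = [node.name]
    current = node.parent  # hypothetical node structure with .name and .parent
    depth = 0
    while current is not None and depth < max_depth:
        parts.append(current.name)
        current = current.parent
        depth += 1
    return " > ".join(reversed(parts))
```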
2. The Speed-Quality Trade-off
- SBERT: Use for rapid exploration, large-scale mapping, deterministic results
- LLM: Use for detailed analysis, explanation generation, nuanced judgment
- Best practice: SBERT first for candidates, LLM second for validation
3. Multiple Matches Matter
Framework relationships are rarely 1:1. Using --top-n 3-5 reveals:
- Primary mappings (strongest match)
- Alternative implementations
- Partial overlaps
- Related requirements for cross-checking
4. Threshold Tuning by Use Case
Different thresholds serve different needs:
- High threshold (0.7-0.9): Compliance mapping (must be certain)
- Medium threshold (0.5-0.7): Coverage analysis (interesting relationships)
- Low threshold (0.3-0.5): Gap discovery (tangential connections)
5. Human-in-the-Loop is Essential
AI accelerates but doesn’t replace expertise. The tools:
- Eliminate 90% of obvious “no relationship” cases
- Surface likely matches for expert review
- Provide explanations as starting points for validation
- Catch non-obvious connections humans might miss
6. Model Selection Matters
Different models showed different strengths in our extensive testing:
For LLM-based mapping:
- Gemma & Qwen: Best overall performers for framework mapping—good quality results with relatively better speed (though still 30+ hours)
- Mistral: Decent quality but slower than Gemma/Qwen for this task
- Llama 3.1: More conservative but extremely slow
- DeepSeek/SmolLM: Tested but did not perform as well as Gemma/Qwen
For SBERT-based mapping:
- MPNet (all-mpnet-base-v2): Best quality-to-speed ratio, our recommended default
- MiniLM (all-MiniLM-L6-v2): Fastest option, good for exploration
7. The Value of Determinism
For production systems, SBERT’s determinism is invaluable:
- Testable and debuggable
- Version-controllable
- Reproducible audits
- No “why did it change?” questions
Results and Impact
These tools have enabled us to:
- Accelerate mapping: Tasks taking weeks now take hours
- Scale coverage: Map more framework pairs than humanly feasible manually
- Improve consistency: Algorithmic approach reduces human bias
- Enable discovery: Find non-obvious relationships experts missed
- Facilitate updates: Re-run mappings when frameworks evolve
- Support experimentation: Try different models/parameters quickly
Future Directions
Several areas for continued exploration:
1. Fine-tuned Models
Training SBERT models specifically on security/compliance text could improve accuracy.
2. Graph-Based Analysis
Treating frameworks as knowledge graphs and using graph algorithms (PageRank, community detection) could reveal structural patterns.
3. Bidirectional Mapping
Currently we map source→target. Comparing with target→source could identify asymmetries and improve confidence.
4. Active Learning
Using human corrections to iteratively improve model parameters or fine-tune embeddings.
5. Multi-Framework Transitive Mapping
If A→B and B→C are known, can we infer A→C? Could this improve coverage and catch errors?
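As a thought experiment, transitive inference could start as a composition of two mapping tables that downgrades the relationship whenever either hop is only partial. A sketch under that assumption, with a hypothetical in-memory structure:

```python
def compose(ab: dict, bc: dict) -> dict:
    """Infer candidate A->C links from A->B and B->C mappings.
    Each mapping is {source_id: [(target_id, relationship), ...]} (hypothetical structure)."""
    ac = {}
    for a_id, b_links in ab.items():
        for b_id, rel_ab in b_links:
            for c_id, rel_bc in bc.get(b_id, []):
                # Two "equal" hops stay "equal"; anything weaker degrades to "intersect".
                rel = "equal" if rel_ab == rel_bc == "equal" else "intersect"
                ac.setdefault(a_id, []).append((c_id, rel))
    return ac
```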
6. Confidence Calibration
Better understanding when the model is uncertain and should defer to humans.
Conclusion
Framework mapping is a perfect example of a task that AI can accelerate but not fully automate. The combination of:
- SBERT for speed and determinism
- LLMs for reasoning and explanation
- Visualization for pattern discovery
- Human expertise for validation
…creates a powerful workflow that multiplies human capability.
The tools we’ve built reduce what was a months-long expert task to a days-long collaborative process between AI and humans. More importantly, they make comprehensive framework mapping feasible at the scale of 100+ frameworks—something that was simply not practical before.
The code is open source in the CISO Assistant repository under tools/mapping_builder/. We hope it accelerates the work of compliance teams everywhere and serves as a template for applying AI to other document-heavy expert tasks.
For technical details and usage instructions, see README.md in the mapping_builder directory.

