# Processing Methodology
This document describes the methodology used in TI Mindmap HUB for processing threat intelligence using Generative AI.
## Overview
TI Mindmap HUB employs a multi-stage pipeline to transform unstructured threat intelligence reports into structured, actionable data. The system leverages Large Language Models (LLMs) for natural language understanding and information extraction, combined with pattern matching and validation layers.
For the visual pipeline diagrams, see Concepts.
## Processing Pipeline

### Stage 1: Content Acquisition
Process:
- A curated list of OSINT threat intelligence sources is monitored continuously
- Analysts can submit URLs via the web interface or the MCP `submit_article` tool
- Submissions are validated, deduplicated, and assigned a tracking identifier
- Raw content is stored for reference and reproducibility
- The article is queued for automated analysis
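The submission steps above can be sketched as follows. This is a minimal illustration of validate → deduplicate → assign tracking ID, not the actual TI Mindmap HUB implementation; the function name `submit_article` mirrors the MCP tool, but its signature and internals here are assumptions.

```python
import hashlib
from urllib.parse import urlparse

_seen: dict[str, str] = {}  # normalized URL -> tracking identifier

def submit_article(url: str) -> str:
    """Validate a URL, deduplicate it, and return a tracking identifier."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid URL: {url!r}")
    # Normalize so trivial variants (case, trailing slash) dedupe together
    normalized = f"{parsed.scheme}://{parsed.netloc.lower()}{parsed.path.rstrip('/')}"
    if normalized in _seen:
        return _seen[normalized]  # duplicate submission: reuse existing ID
    tracking_id = hashlib.sha256(normalized.encode()).hexdigest()[:12]
    _seen[normalized] = tracking_id
    # In the real pipeline, raw content would be fetched, stored for
    # reproducibility, and the article queued for automated analysis here.
    return tracking_id
```

Hashing the normalized URL gives a stable identifier, so resubmitting the same report returns the same tracking ID instead of creating a duplicate entry.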
Sources include:
- Security vendor blogs (e.g., Mandiant, CrowdStrike, Recorded Future)
- Government advisories (e.g., CISA, NCSC)
- Security research publications
- Industry reports
- Analyst-submitted URLs
### Stage 2: Normalization and Processing
Process:
- HTML is stripped and content is converted to clean text
- Metadata (source, publication date, URL) is preserved
- LLMs and pattern matchers identify candidate entities across the text
- Extracted entities are validated, deduplicated, and assigned confidence levels (high, medium, low)
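A standard-library sketch of the first two normalization steps (HTML stripping plus metadata preservation); the `Article` container and field names are illustrative, since the production pipeline's internals are not public.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []
        self._skip = 0  # nesting depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

@dataclass
class Article:
    source: str
    url: str
    published: str
    text: str

def normalize(html: str, source: str, url: str, published: str) -> Article:
    parser = _TextExtractor()
    parser.feed(html)
    return Article(source=source, url=url, published=published,
                   text=" ".join(parser.chunks))
```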
### Stage 3: Threat Analysis Layer
Five extraction branches run in parallel:
#### IOC Detection and Enrichment
IOC Types Extracted:
| Type | Method | Validation |
|---|---|---|
| IPv4/IPv6 | Regex + LLM | Format validation, private range exclusion |
| Domains | Regex + LLM | TLD validation, whitelist filtering |
| URLs | Regex + LLM | Format validation |
| File Hashes (MD5, SHA-1, SHA-256) | Regex | Length and character validation |
| Email Addresses | Regex + LLM | Format validation |
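Illustrative regex patterns for the regex half of the table above. These are simplified: the real extractor also handles defanged forms (`hxxp://`, `[.]`) and combines the regex layer with an LLM pass, and candidates still pass through validation and whitelisting downstream.

```python
import re

IOC_PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha1":   re.compile(r"\b[a-fA-F0-9]{40}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "domain": re.compile(r"\b[a-z0-9][a-z0-9-]*(?:\.[a-z0-9-]+)+\b", re.I),
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def extract_ioc_candidates(text: str) -> dict[str, set[str]]:
    """Collect raw candidates by type; validation and whitelisting run later."""
    return {name: set(p.findall(text)) for name, p in IOC_PATTERNS.items()}
```

Note that the word-boundary anchors keep a SHA-256 from also matching as an MD5 or SHA-1 substring, since there is no `\b` boundary inside a 64-character hex run.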
Whitelisting:
- Common benign domains are excluded (e.g., google.com, microsoft.com)
- Cloud provider infrastructure ranges are filtered
- RFC 5737 documentation IP ranges are excluded
- Known false-positive patterns are maintained
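The filtering rules above can be expressed with the standard `ipaddress` module. The benign-domain set here is a two-entry illustrative sample, not the real list; private and loopback ranges are excluded alongside the RFC 5737 documentation ranges.

```python
import ipaddress

BENIGN_DOMAINS = {"google.com", "microsoft.com"}  # illustrative sample only

# RFC 5737 documentation ranges
DOC_RANGES = [ipaddress.ip_network(n) for n in
              ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]

def keep_ip(candidate: str) -> bool:
    """True if the IP survives filtering (not private/loopback/documentation)."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        return False  # malformed candidates are dropped, not kept
    if ip.is_private or ip.is_loopback:
        return False
    return not any(ip in net for net in DOC_RANGES)

def keep_domain(candidate: str) -> bool:
    """True unless the domain is a benign domain or one of its subdomains."""
    d = candidate.lower()
    return d not in BENIGN_DOMAINS and not any(
        d.endswith("." + b) for b in BENIGN_DOMAINS)
```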
#### TTP Extraction and ATT&CK Mapping
Process:
- LLM analyzes content for described attack behaviors
- Behaviors are mapped to specific ATT&CK techniques based on behavioral analysis (not keyword matching)
- Technique IDs are validated against the ATT&CK database
- Tactics are inferred from technique associations
Output:
- Technique ID (e.g., T1566.001)
- Technique name
- Associated tactic(s)
- Confidence level (when determinable)
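A hedged sketch of the validation step: check the `T####(.###)` ID format, then look the ID up in a local ATT&CK index to recover the technique name and associated tactics. The two-entry index here is a stand-in; the real system validates against the full ATT&CK database.

```python
import re

TECHNIQUE_ID = re.compile(r"^T\d{4}(?:\.\d{3})?$")

# Illustrative subset: technique ID -> (name, associated tactics)
ATTACK_INDEX = {
    "T1566.001": ("Spearphishing Attachment", ["initial-access"]),
    "T1059.001": ("PowerShell", ["execution"]),
}

def validate_technique(tid: str):
    """Return (name, tactics) for a well-formed, known ID, else None."""
    if not TECHNIQUE_ID.match(tid):
        return None  # malformed ID (LLM output is never trusted blindly)
    return ATTACK_INDEX.get(tid)  # None for well-formed but unknown IDs
```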
#### CVE Extraction with Risk Context
Process:
- CVE identifiers are extracted via pattern matching (`CVE-YYYY-NNNN`, where the sequence number is four or more digits)
- Each CVE is enriched with CVSS severity, EPSS score, and exploit status
- Patch availability and Proof-of-Concept status are correlated
- Affected products and correlated references are linked
#### Threat Actor and Malware Extraction
Process:
- LLM identifies named threat groups, malware families, and tools
- Contextual relationships between actors and their tools/techniques are preserved
- Attribution is extracted without fabrication — only explicitly stated attributions are captured
#### Summary and Mindmap Synthesis
Outputs generated:
- Technical summary
- Visual mindmap (Mermaid format) connecting actors, campaigns, malware, TTPs, IOCs, and targets
- "Five Whats" structured root-cause analysis (Who, What, When, Where, Why)
- Probable attack execution sequence
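To illustrate the Mermaid mindmap output named above, here is a minimal serializer from extracted entities to Mermaid's `mindmap` syntax. The grouping into `{category: [entities]}` is an assumption about the intermediate representation, not the actual data model.

```python
def to_mermaid_mindmap(root: str, relations: dict[str, list[str]]) -> str:
    """Render {category: [entities]} as a Mermaid mindmap rooted at `root`."""
    lines = ["mindmap", f"  root(({root}))"]
    for category, items in relations.items():
        lines.append(f"    {category}")      # branch node per category
        for item in items:
            lines.append(f"      {item}")    # leaf node per entity
    return "\n".join(lines)
```

For example, `to_mermaid_mindmap("APT28", {"Malware": ["X-Agent"], "TTPs": ["T1566.001"]})` yields a mindmap with the actor as the root and malware and techniques as branches.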
Prompt Engineering:
- Each output type uses specialized prompts
- Prompts are iteratively refined based on output quality
- Temperature and other parameters are tuned per task
### Stage 4: STIX 2.1 Structuring
The backend assembles all extraction outputs into a unified STIX 2.1 bundle:
Objects Generated:
- `report` — Container for the intelligence
- `threat-actor` — When identified in the source
- `malware` — Malware families mentioned
- `indicator` — IOCs with STIX patterns
- `attack-pattern` — MITRE ATT&CK techniques
- `vulnerability` — CVE identifiers with risk context
- `relationship` — Connections between objects
Relationship Types:
- `indicates` — Indicator → Malware/Threat-Actor
- `uses` — Threat-Actor → Malware/Attack-Pattern
- `attributed-to` — Malware → Threat-Actor
- `exploits` — Malware/Threat-Actor → Vulnerability
Validation:
- STIX 2.1 JSON Schema compliance
- Object reference integrity
- Required field presence
- Pattern syntax validation (for indicators)
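The object-reference-integrity check listed above can be sketched as follows: every `*_ref` property in the bundle must point at an object that exists in the bundle. Plain dicts stand in for a STIX library to keep the example self-contained (embedded `*_refs` list properties are omitted for brevity).

```python
def check_reference_integrity(bundle: dict) -> list[str]:
    """Return dangling references; an empty list means the bundle is consistent."""
    ids = {obj["id"] for obj in bundle.get("objects", [])}
    dangling = []
    for obj in bundle.get("objects", []):
        for key, value in obj.items():
            # STIX reference properties end in "_ref" (source_ref, target_ref, ...)
            if key.endswith("_ref") and value not in ids:
                dangling.append(f"{obj['id']}.{key} -> {value}")
    return dangling
```

A relationship object whose `source_ref` names a threat actor absent from the bundle would be flagged here before the bundle is published.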
See STIX 2.1 Data Model for detailed STIX generation documentation.
### Stage 5: Per-Article Frontend Delivery
Each processed article is presented to the analyst through a tabbed interface with:
- Header Metadata — Title, source, publication date, link to original report, bookmark, and PDF export
- Intel Graph — Interactive STIX graph with graph/JSON views, object count, and bundle download
- Diamond Model — Adversary, capability, infrastructure, and victim mapping
- AI Summary — AI-generated technical summary
- TI Mindmap — Interactive visual threat model
- IOCs — High and medium confidence indicators with JSON export (low-confidence IOCs available in downloadable file)
- CVEs — CVSS severity, exploited/patch/PoC status, affected products, and references
- TTP Catalog — Full MITRE ATT&CK technique catalog
- Attack Flow — Reconstructed attack execution sequence
- 5W Context — Structured root-cause analysis
- ATT&CK Heatmap — Visual technique heatmap across tactics
- Source Report — Original content for verification
### Stage 6: Weekly Briefing Generation
Multi-Agent System (Autogen-based):
The weekly briefing uses a specialized multi-agent architecture that processes 50–60 reports per week:
- Collector Agent — Aggregates all reports from the past week
- Analyst Agents — Each analyzes a subset of reports
- Trend Agent — Identifies patterns across analyses
- Synthesis Agent — Produces final briefing
- Editor Agent — Reviews and refines output
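The agent roles above form a fan-out-then-sequential hand-off, which this sketch models with plain callables rather than actual Autogen agents (whose prompts and configuration are not public); the Collector Agent's output is represented by the `reports` list.

```python
from typing import Callable

def run_briefing_pipeline(reports: list[str],
                          analyze: Callable[[str], str],
                          find_trends: Callable[[list[str]], str],
                          synthesize: Callable[[str], str],
                          edit: Callable[[str], str]) -> str:
    analyses = [analyze(r) for r in reports]  # Analyst Agents (fan-out over reports)
    trends = find_trends(analyses)            # Trend Agent: patterns across analyses
    draft = synthesize(trends)                # Synthesis Agent: draft briefing
    return edit(draft)                        # Editor Agent: review and refine
```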
Briefing Sections:
- Executive summary
- Top TTPs observed
- Most targeted sectors
- Emerging threats
- Notable campaigns (deep dives)
## Quality Assurance

### Automated Validation
- IOC format validation
- STIX schema compliance
- MITRE ATT&CK ID verification
- Duplicate detection
- Confidence scoring
### Human Review
- Periodic sampling of outputs
- Feedback incorporation
- Prompt refinement based on errors
### Metrics Tracked
- IOC extraction precision/recall (sampled)
- TTP mapping accuracy (sampled)
- Processing success rate
- User feedback scores
## Known Limitations

See Known Limitations for detailed information on system limitations.
## Technology Stack
| Component | Technology |
|---|---|
| LLM Provider | OpenAI GPT-4 / Azure OpenAI |
| Backend | Python, Azure Functions |
| Database | Azure Cosmos DB |
| Frontend | React, TypeScript, Material-UI |
| Authentication | Azure AD B2C |
| Hosting | Azure Static Web Apps, Azure Container Apps |
## Reproducibility
While the core application code is private, we aim to provide:
- Detailed methodology documentation (this document)
- Example outputs and STIX bundles
- Evaluation metrics when available
- Research publications describing specific components
## Future Research Directions
- Knowledge Graph Construction — Building a graph database of threat relationships for longitudinal analysis
- Confidence Scoring — Developing reliable confidence metrics for AI outputs
- Cross-Source Correlation — Linking intelligence across multiple sources
- Evaluation Framework — Systematic evaluation of extraction accuracy
- Expanded OSINT Coverage — Broader source monitoring and deeper cross-report correlation