Company Updates
Introducing AI-Enhanced DLP capabilities in Versa Universal SASE Platform
Summary
Versa has enhanced its AI DLP capabilities with comprehensive multi-format file inspection, granular metadata analysis, and intelligent redaction/tokenization features. These enhancements enable organizations to detect and prevent sensitive data leakage with greater accuracy while minimizing false positives through contextual understanding. In this walkthrough, we’ll examine each of these enhancements to show how Versa’s AI DLP elevates data protection from detection to intelligent prevention.
AI DLP – Enhancements to Data Discovery based on ‘ML Analysis’
Versa’s new AI-powered DLP capabilities take data protection beyond contextual DLP by embedding advanced machine learning directly into the SASE fabric. Deployed as containerized microservices, the new ML analysis engine runs locally on SASE gateways, ensuring sensitive data never leaves your infrastructure while delivering intelligent, adaptive protection that builds on Versa’s existing contextual ML capabilities for even deeper data understanding.
Key Capabilities:
- On-Premises ML Processing: Docker-containerized ML models execute data discovery and classification locally, maintaining complete data sovereignty while scanning file repositories, SaaS applications, and inline traffic flows
- Discovery-First Approach: ML analysis mode enables organizations to discover what sensitive data exists across their environment before enforcement, automatically baselining data flows and user behaviors to recommend tailored DLP policies—eliminating guesswork in rule creation
- Policy-Free Data Classification: Unlike traditional DLP requiring hundreds of prescriptive rules, ML models learn from organizational data in monitoring mode to automatically classify sensitive content and suggest context-aware policies based on actual usage patterns, not hypothetical scenarios
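As a rough illustration of the deployment model described above, the sketch below uses the open-source Docker SDK for Python to launch a local analysis container; the image name, port, mount path, and environment variable are hypothetical stand-ins, not Versa artifacts.

```python
# A minimal sketch of launching a local, containerized ML-analysis engine.
# The image name, port, paths, and MODE variable are hypothetical; actual
# Versa container images and settings will differ.
import docker

client = docker.from_env()

# Run the analysis engine on the gateway host so scanned content never
# leaves local infrastructure.
container = client.containers.run(
    "example/ml-dlp-analysis:latest",   # hypothetical image name
    detach=True,
    ports={"8080/tcp": 8080},           # hypothetical inference port
    volumes={"/var/lib/dlp/models": {"bind": "/models", "mode": "ro"}},
    environment={"MODE": "discovery"},  # discovery-first: observe before enforcing
)
print(container.status)
```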
Why it Matters:
- Enhanced Contextual Intelligence: New ML layer augments Versa’s existing contextual analysis with deeper semantic understanding—learning organizational data patterns, user behavior baselines, and business workflows to distinguish legitimate data usage from policy violations
- Intelligent Data Discovery: NLP-powered models automatically identify PII, PHI, PCI, IP, and proprietary content across structured and unstructured data—working alongside existing context analysis, EDM, and document fingerprinting methods
- Adaptive Classification: Builds on current ML capabilities to achieve 60-70% reduction in false positives through multi-layered analysis, continuously improving accuracy via analyst feedback and historical violation patterns
- Real-Time Performance: Containerized architecture delivers sub-100ms inspection latency for inline DLP enforcement while enabling horizontal scaling based on traffic volume and computational needs
- Self-Learning Discovery: Automatically identifies new sensitive data types as your data estate evolves, complementing existing detection methods without requiring manual policy updates
- Audit-Ready Governance: Explainable AI framework provides transparent justifications for every classification decision, meeting compliance requirements for regulated industries
Enhancing Current Capabilities of Versa DLP
Versa DLP uses advanced transformer models and fine-tuned Large Language Models (LLMs) to detect sensitive information across diverse document types and formats. Unlike traditional pattern matching, Versa ONE applies contextual and behavioral analysis using spatial and temporal signals to differentiate legitimate data use from potential breaches. This context-aware approach reduces false positives and dynamically applies predefined, policy-driven protections without disrupting normal business workflows.
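To make the contrast with pure pattern matching concrete, here is a deliberately simplified sketch of context-aware scoring. Versa’s engine uses transformer models and fine-tuned LLMs rather than keyword lists, so this only illustrates the principle of weighing a raw pattern hit against its surroundings.

```python
# A simplified, hypothetical illustration of context-aware DLP scoring:
# a raw pattern hit is kept or suppressed based on nearby context words.
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SENSITIVE_CONTEXT = {"ssn", "social security", "employee", "payroll"}
BENIGN_CONTEXT = {"order", "tracking", "invoice number", "part number"}

def score_match(text: str, match: re.Match, window: int = 60) -> float:
    """Score a pattern hit using the words around it."""
    ctx = text[max(0, match.start() - window): match.end() + window].lower()
    score = 0.5  # base confidence for a bare pattern match
    score += 0.4 * any(k in ctx for k in SENSITIVE_CONTEXT)
    score -= 0.4 * any(k in ctx for k in BENIGN_CONTEXT)
    return score

text = "Employee SSN on file: 123-45-6789 for payroll processing."
for m in SSN_RE.finditer(text):
    if score_match(text, m) >= 0.7:
        print("likely violation:", m.group())
```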
Key Capabilities
Dynamic Adaptation with Real-Time Protection
- Monitors data in motion, at rest, and in use with real-time threat assessment
- Provides immediate response to potential data leaks or unauthorized access attempts
Multi-Modal Document Intelligence
- Supports comprehensive DLP for text across plain-text files, PDF, DOC/DOCX, XLS/XLSX, PPT/PPTX, and image files (JPG, PNG)
- Extracts and analyzes multi-modal content—files combining text, images, tables, charts, and embedded objects—using transformer-based models
- Performs intelligent pre-processing of both images and text for optimal detection
Source Code Detection
- Automatically identifies and classifies source code across multiple programming languages:
- C (.c files), C++ (.cpp files), PHP (.php files), Python (.py files)
- Prevents inadvertent exposure of proprietary code and intellectual property
- Detects code patterns even when embedded in other document types
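For intuition, the sketch below shows a heuristic version of source-code detection inside arbitrary text. Versa’s classifier is ML-based; these hand-written language signatures are only illustrative.

```python
# A minimal sketch of heuristic source-code detection inside arbitrary text.
# These signatures only illustrate the kind of patterns that reveal code
# embedded in other document types.
import re

LANGUAGE_SIGNATURES = {
    "python": [r"^\s*def \w+\(", r"^\s*import \w+"],
    "c/c++":  [r"#include\s*<\w+(\.h)?>", r"\bint main\s*\("],
    "php":    [r"<\?php", r"\$\w+\s*="],
}

def detect_code(text: str) -> set[str]:
    """Return the set of languages whose signatures appear in the text."""
    hits = set()
    for lang, patterns in LANGUAGE_SIGNATURES.items():
        if any(re.search(p, text, re.MULTILINE) for p in patterns):
            hits.add(lang)
    return hits

doc = "Meeting notes...\n#include <stdio.h>\nint main(void) { return 0; }"
print(detect_code(doc))  # {'c/c++'}
```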
How It Works
- ETL Steps (non-security infrastructure):
- Pre-Processing Pipeline: Advanced image and text pre-processing optimizes content for both training and inference
- Document Classification: Transformer models analyze and classify document types (credit cards, passports, source code, PII, etc.)
- Content Extraction: Multi-modal extraction captures text, metadata, embedded objects, and visual elements
- LLM-Powered Analysis: Fine-tuned LLMs perform classification and detection of sensitive/PII data with contextual awareness
- Real-Time Decision: System applies configured actions (alert, block, redact, etc.) based on detection results
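The five-stage flow above can be summarized as a skeletal pipeline. All function names below are hypothetical stand-ins with stubbed bodies, not Versa APIs.

```python
# A skeletal, hypothetical rendering of the five-stage flow above; the
# function bodies are stubs, and none of the names are Versa APIs.
from enum import Enum

class Action(Enum):
    ALERT = "alert"
    BLOCK = "block"
    REDACT = "redact"
    ALLOW = "allow"

def preprocess(document: bytes) -> str:
    """1. Pre-processing: normalize text, prepare images for analysis."""
    return document.decode(errors="ignore")

def classify_document(content: str) -> str:
    """2. Document classification: transformer model picks a type (stubbed)."""
    return "generic"

def extract_content(content: str) -> list[str]:
    """3. Multi-modal extraction: text, metadata, embedded objects (stubbed)."""
    return [content]

def llm_detect_sensitive(doc_type: str, parts: list[str]) -> list[str]:
    """4. LLM-powered analysis: flag sensitive/PII spans (stubbed keyword check)."""
    return [p for p in parts if "ssn" in p.lower()]

def decide(findings: list[str]) -> Action:
    """5. Real-time decision: map findings to the configured action."""
    return Action.BLOCK if findings else Action.ALLOW

doc = b"SSN: 123-45-6789"
print(decide(llm_detect_sensitive("generic", extract_content(preprocess(doc)))))
```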
Compliance & Use Cases
- PCI-DSS: Detect credit card numbers and payment information
- HIPAA: Identify protected health information (PHI)
- US PII: Flag personally identifiable information including Social Security numbers and driver’s licenses
- India PII: Flag Aadhaar numbers and PAN cards
- Source Code Protection: Prevent intellectual property leakage through code repositories
- Financial Data: Identify sensitive financial documents, including fingerprinted IC circuit designs and proprietary designs
- Japan PII: Flag personally identifiable information including Japanese My Number and driver’s licenses written in Japanese script
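As a concrete example of one detector from the list above, PCI-DSS card discovery typically pairs a pattern match with a Luhn checksum to discard random digit strings. This is a standard technique, not Versa’s exact implementation.

```python
# PCI-DSS credit card discovery: a candidate pattern match is confirmed
# with the Luhn checksum, cutting false positives from arbitrary digits.
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Validate a candidate card number with the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

text = "Card on file: 4111 1111 1111 1111, ref 1234 5678 9012 3456"
for m in CARD_RE.finditer(text):
    verdict = "valid card" if luhn_valid(m.group()) else "not a card"
    print(m.group().strip(), "->", verdict)
```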
Granular File Inspection (Including Metadata)
What It Does
Goes beyond basic content scanning by performing deep inspection of file metadata, embedded objects, and structural elements within documents. The platform analyzes core file content as well as non-obvious areas such as comments and embedded images in PowerPoint files. It also inspects Excel formulas to detect sensitive data that traditional DLP solutions often miss.
Key Capabilities
Optical Character Recognition (OCR) for Embedded Images
OCR converts images to searchable, analyzable text. When a document containing images passes through the DLP engine, the system identifies embedded images, extracts visual text using OCR technology, and applies DLP pattern matching to the extracted content. This ensures sensitive information hidden in screenshots, scanned documents, or visual elements doesn’t bypass security controls.
Supported scenarios:
- Extracts text from images embedded in PDF documents, PowerPoint presentations (.ppt, .pptx), Excel spreadsheets (.xls, .xlsx), and Word documents (.doc, .docx)
- Applies DLP policies to OCR-extracted text (supports PNG, JPEG, BMP formats)
- Detects Social Security Numbers, Aadhaar numbers, PAN cards, driver’s licenses, credit cards, and other PII in image attachments
- OCR on ZIP File Contents: Recursively scans compressed archives, performs OCR on images found within ZIP files, and applies comprehensive DLP analysis
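For illustration, the sketch below reproduces two of the scenarios above (image OCR and recursive ZIP scanning) using the open-source pytesseract and Pillow libraries. Versa’s OCR engine is its own, and the Aadhaar pattern shown is simplified.

```python
# A minimal sketch of OCR-based DLP: OCR standalone images and recurse
# into ZIP archives, applying a pattern check to the extracted text.
import io
import re
import zipfile

from PIL import Image
import pytesseract

AADHAAR_RE = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")  # illustrative pattern

def scan_image(data: bytes) -> list[str]:
    """OCR an image and return pattern matches in the extracted text."""
    text = pytesseract.image_to_string(Image.open(io.BytesIO(data)))
    return AADHAAR_RE.findall(text)

def scan_zip(stream) -> list[str]:
    """Recursively scan a ZIP archive, OCR-ing any images found inside."""
    hits = []
    with zipfile.ZipFile(stream) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            lower = name.lower()
            if lower.endswith((".png", ".jpg", ".jpeg", ".bmp")):
                hits.extend(scan_image(data))
            elif lower.endswith(".zip"):  # nested archive: recurse
                hits.extend(scan_zip(io.BytesIO(data)))
    return hits

# Usage: matches = scan_zip("attachments.zip")  # accepts a path or file-like object
```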
Comprehensive Metadata Analysis
- Comments & Notes: Scans comments in Word/Excel documents and speaker notes in PowerPoint presentations and takes action based on the policy such as “redact/obfuscate” or “block”
- Headers & Footers: Analyzes header and footer content across all document types
- Works with both Content Analysis and EDM (Exact Data Match) rule types
- Supports compliance frameworks: US_PII, PCI_DSS, HIPAA
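To show where these fields live in each format, the sketch below reads the same “non-obvious” areas with common open-source parsers (openpyxl, python-pptx, python-docx). Versa’s parsers are internal, so this is illustrative only.

```python
# Reading the metadata areas named above with open-source parsers; each
# returned string would then go through the same DLP pattern matching
# as body text, with redact/obfuscate or block applied on a match.
from openpyxl import load_workbook        # Excel cell comments
from pptx import Presentation             # PowerPoint speaker notes
from docx import Document                 # Word headers/footers

def excel_comments(path: str) -> list[str]:
    wb = load_workbook(path)
    return [cell.comment.text
            for ws in wb.worksheets
            for row in ws.iter_rows()
            for cell in row
            if cell.comment]

def pptx_speaker_notes(path: str) -> list[str]:
    prs = Presentation(path)
    return [slide.notes_slide.notes_text_frame.text
            for slide in prs.slides
            if slide.has_notes_slide]

def docx_headers_footers(path: str) -> list[str]:
    doc = Document(path)
    texts = []
    for section in doc.sections:
        texts.extend(p.text for p in section.header.paragraphs)
        texts.extend(p.text for p in section.footer.paragraphs)
    return texts
```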
How It Works
OCR Processing Pipeline:
- Image Detection: System identifies embedded images in supported file formats
- Text Extraction: OCR engine converts visual text in images to machine-readable format
- Content Analysis: Extracted text undergoes the same DLP pattern matching as native text
- Policy Application: Configured DLP rules (Content Analysis, EDM, or OCR rule types) apply to all discovered content
- Action Enforcement: System executes configured action (alert, block, redact, tokenize) when matches are found
Redaction/Tokenization
What It Does
Automatically obscures or replaces sensitive data in documents while maintaining file usability and format integrity. This allows organizations to share documents externally or internally while protecting specific sensitive elements.
Key Capabilities
Text-Based Redaction/Tokenization for EDM
Supports Exact Data Match (EDM) redaction and tokenization across Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), PDF documents, PowerPoint presentations (.ppt, .pptx), and plain text files (.txt, .xml, .sh, .html, .c, .php).
Redaction: When a DLP rule detects a match in an editable, text-based file, the system replaces the matched content with random characters, making the original data unrecoverable.
Tokenization (EDM only): Replaces sensitive data with a token value that retains the same format (character set and length) as the original while changing the actual values to protect the data. This allows documents to remain functional for workflows requiring format consistency.
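The difference between the two actions can be shown in a few lines. The random substitution below is a minimal sketch; a production tokenization scheme could be keyed or deterministic rather than random.

```python
# Contrasting the two actions: redaction destroys the value, while
# tokenization keeps each character's class (digit/letter) and the overall
# length so downstream format checks still pass.
import random
import string

def redact(match: str) -> str:
    """Replace the match with random characters; original is unrecoverable."""
    return "".join(random.choice(string.ascii_letters + string.digits)
                   for _ in match)

def tokenize(match: str) -> str:
    """Replace each character with one of the same class, preserving format."""
    out = []
    for ch in match:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_uppercase if ch.isupper()
                                     else string.ascii_lowercase))
        else:
            out.append(ch)          # keep separators such as '-' or ' '
    return "".join(out)

ssn = "123-45-6789"
print(redact(ssn))    # e.g. 'qZ81bk0Rywf' -> format destroyed
print(tokenize(ssn))  # e.g. '904-17-3382' -> format preserved
```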
Image Redaction/Tokenization for EDM
Redacts or tokenizes sensitive content detected in embedded images within PDF documents, Excel spreadsheets, PowerPoint presentations, and Word documents.
Content Analysis & OCR Redaction
- Applies redaction/tokenization to content discovered through Content Analysis rules
- Redacts text extracted via OCR from images
- Maintains document structure and formatting while protecting sensitive data
Processing Flow
- DLP engine detects sensitive pattern match (via EDM, Content Analysis, or OCR)
- System evaluates configured action (redaction vs. tokenization)
- Matched content is replaced according to policy
- Modified document is delivered to destination
- Action is logged to Versa Analytics for audit trail
Use Cases
- Email Attachments: Automatically redact Social Security numbers from documents before external transmission
- Document Sharing: Tokenize credit card numbers in financial reports shared with third parties
- OCR-Based Protection: Redact sensitive information detected in scanned images within PDFs
- Multi-Format Protection: Apply consistent redaction policies across Word, Excel, PowerPoint, and PDF documents
Technical Implementation
Supported File Types
- Office Documents: .doc, .docx, .xls, .xlsx, .ppt, .pptx
- PDFs: Standard and scanned PDFs with OCR
- Images: .png, .jpeg, .bmp, .gif, .tif
- Code Files: .c, .cpp, .php, .py, .pl, .sh
- Archives: .zip, .tar, .gzip, .xz
- Other: .txt, .xml, .html, .csv, .rtf
Integration with Existing DLP Framework
These features integrate seamlessly with Versa’s existing DLP architecture:
- DLP Rules: Create Content Analysis, EDM, File DLP, Document Fingerprinting, and OCR rule types
- Data Protection Profiles: Define patterns and Boolean operations for matching
- DLP Profiles: Combine multiple rules with application groups and processing order
- Policy Association: Apply DLP profiles to SASE Internet Protection Rules and Private Application Protection Rules
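To visualize how these pieces compose, here is a purely hypothetical rendering of the rule-to-profile-to-policy hierarchy. The field names are invented for clarity and do not reflect Versa’s actual configuration schema.

```python
# Hypothetical, illustrative structure only; field names are invented
# and are not Versa's configuration schema.
dlp_profile = {
    "name": "pci-outbound",
    "rules": [
        {"type": "content_analysis", "profile": "PCI_DSS", "action": "block"},
        {"type": "edm", "dataset": "customer-cards", "action": "tokenize"},
        {"type": "ocr", "profile": "US_PII", "action": "redact"},
    ],
    "application_groups": ["email", "file-sharing"],
}

# The profile would then be attached to a SASE Internet Protection Rule
# or Private Application Protection Rule for enforcement.
```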
Analytics & Reporting
All AI/ML detections, granular inspections, and redaction/tokenization actions are logged to Versa Analytics, providing comprehensive audit trails for compliance, detailed reports on data usage patterns, threat severity classification (Critical, Major, Normal), and threat type categorization (ML document fingerprint, ML image classification, ML source code, OCR match, etc.).
Summary and Next Steps
Versa Universal SASE 23.1.1 extends existing DLP capabilities with an ML‑driven discovery and classification layer that runs inline as containerized services on SASE gateways. This approach enables organizations to identify and baseline sensitive data across files, SaaS applications, and live traffic before enforcement, while keeping data local and avoiding external processing. Combined with Versa’s established contextual DLP, granular file and metadata inspection, and in‑place redaction or tokenization, these enhancements provide a consistent enforcement path from discovery to real‑time prevention with measurable reductions in false positives.
To evaluate these capabilities in your environment, contact a Versa field subject matter expert or request a demo to see how AI‑assisted discovery and enforcement can be operationalized within your existing DLP framework.