Company Updates
Introducing AI-Enhanced DLP capabilities in Versa Universal SASE Platform
Summary
Versa has enhanced its AI DLP capabilities with comprehensive multi-format file inspection, granular metadata analysis, and intelligent redaction/tokenization features. These enhancements enable organizations to detect and prevent sensitive data leakage with greater accuracy while minimizing false positives through contextual understanding. In this walkthrough, we’ll examine each of these enhancements to show how Versa’s AI DLP elevates data protection from detection to intelligent prevention.
AI DLP – Enhancements to Data Discovery based on ‘ML Analysis’
Versa’s new AI-powered DLP capabilities take data protection beyond contextual DLP by embedding advanced machine learning directly into the SASE fabric. Deployed as containerized microservices, the new ML analysis engine runs locally on SASE gateways, ensuring sensitive data never leaves your infrastructure while delivering intelligent, adaptive protection that builds on Versa’s existing contextual ML capabilities for even deeper data understanding.
Key Capabilities:
- On-Premises ML Processing: Docker-containerized ML models execute data discovery and classification locally, maintaining complete data sovereignty while scanning file repositories, SaaS applications, and inline traffic flows
- Discovery-First Approach: ML analysis mode enables organizations to discover what sensitive data exists across their environment before enforcement, automatically baselining data flows and user behaviors to recommend tailored DLP policies—eliminating guesswork in rule creation
- Policy-Free Data Classification: Unlike traditional DLP requiring hundreds of prescriptive rules, ML models learn from organizational data in monitoring mode to automatically classify sensitive content and suggest context-aware policies based on actual usage patterns, not hypothetical scenarios
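As a rough illustration of the deployment model described above, the sketch below uses the open-source Docker SDK for Python to launch a local analysis container; the image name, port, mount path, and environment variable are hypothetical stand-ins, not Versa artifacts.

```python
# A minimal sketch of launching a local, containerized ML-analysis engine.
# The image name, port, paths, and MODE variable are hypothetical; actual
# Versa container images and settings will differ.
import docker

client = docker.from_env()

# Run the analysis engine on the gateway host so scanned content never
# leaves local infrastructure.
container = client.containers.run(
    "example/ml-dlp-analysis:latest",   # hypothetical image name
    detach=True,
    ports={"8080/tcp": 8080},           # hypothetical inference port
    volumes={"/var/lib/dlp/models": {"bind": "/models", "mode": "ro"}},
    environment={"MODE": "discovery"},  # discovery-first: observe before enforcing
)
print(container.status)
```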
Why it Matters:
- Enhanced Contextual Intelligence: New ML layer augments Versa’s existing contextual analysis with deeper semantic understanding—learning organizational data patterns, user behavior baselines, and business workflows to distinguish legitimate data usage from policy violations
- Intelligent Data Discovery: NLP-powered models automatically identify PII, PHI, PCI, IP, and proprietary content across structured and unstructured data—working alongside existing context analysis, EDM, and document fingerprinting methods
- Adaptive Classification: Builds on current ML capabilities to achieve 60-70% reduction in false positives through multi-layered analysis, continuously improving accuracy via analyst feedback and historical violation patterns
- Real-Time Performance: Containerized architecture delivers sub-100ms inspection latency for inline DLP enforcement while enabling horizontal scaling based on traffic volume and computational needs
- Self-Learning Discovery: Automatically identifies new sensitive data types as your data estate evolves, complementing existing detection methods without requiring manual policy updates
- Audit-Ready Governance: Explainable AI framework provides transparent justifications for every classification decision, meeting compliance requirements for regulated industries
Enhancing Current Capabilities of Versa DLP
Versa DLP uses advanced transformer models and fine-tuned Large Language Models (LLMs) to detect sensitive information across diverse document types and formats. Unlike traditional pattern matching, Versa ONE applies contextual and behavioral analysis using spatial and temporal signals to differentiate legitimate data use from potential breaches. This context-aware approach reduces false positives and dynamically applies predefined, policy-driven protections without disrupting normal business workflows.
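To make the contrast with pure pattern matching concrete, here is a deliberately simplified sketch of context-aware scoring. Versa’s engine uses transformer models and fine-tuned LLMs rather than keyword lists, so this only illustrates the principle of weighing a raw pattern hit against its surroundings.

```python
# A simplified, hypothetical illustration of context-aware DLP scoring:
# a raw pattern hit is kept or suppressed based on nearby context words.
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SENSITIVE_CONTEXT = {"ssn", "social security", "employee", "payroll"}
BENIGN_CONTEXT = {"order", "tracking", "invoice number", "part number"}

def score_match(text: str, match: re.Match, window: int = 60) -> float:
    """Score a pattern hit using the words around it."""
    ctx = text[max(0, match.start() - window): match.end() + window].lower()
    score = 0.5  # base confidence for a bare pattern match
    score += 0.4 * any(k in ctx for k in SENSITIVE_CONTEXT)
    score -= 0.4 * any(k in ctx for k in BENIGN_CONTEXT)
    return score

text = "Employee SSN on file: 123-45-6789 for payroll processing."
for m in SSN_RE.finditer(text):
    if score_match(text, m) >= 0.7:
        print("likely violation:", m.group())
```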
Key Capabilities
Dynamic Adaptation with Real-Time Protection
- Monitors data in motion, at rest, and in use with real-time threat assessment
- Provides immediate response to potential data leaks or unauthorized access attempts
Multi-Modal Document Intelligence
- Supports comprehensive DLP for text across plain-text files, PDF, DOC/DOCX, XLS/XLSX, PPT/PPTX, and image files (JPG, PNG)
- Extracts and analyzes multi-modal content—files combining text, images, tables, charts, and embedded objects—using transformer-based models
- Performs intelligent pre-processing of both images and text for optimal detection
Source Code Detection
- Automatically identifies and classifies source code across multiple programming languages:
- C (.c files), C++ (.cpp files), PHP (.php files), Python (.py files)
- Prevents inadvertent exposure of proprietary code and intellectual property
- Detects code patterns even when embedded in other document types
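For intuition, the sketch below shows a heuristic version of source-code detection inside arbitrary text. Versa’s classifier is ML-based; these hand-written language signatures are only illustrative.

```python
# A minimal sketch of heuristic source-code detection inside arbitrary text.
# These signatures only illustrate the kind of patterns that reveal code
# embedded in other document types.
import re

LANGUAGE_SIGNATURES = {
    "python": [r"^\s*def \w+\(", r"^\s*import \w+"],
    "c/c++":  [r"#include\s*<\w+(\.h)?>", r"\bint main\s*\("],
    "php":    [r"<\?php", r"\$\w+\s*="],
}

def detect_code(text: str) -> set[str]:
    """Return the set of languages whose signatures appear in the text."""
    hits = set()
    for lang, patterns in LANGUAGE_SIGNATURES.items():
        if any(re.search(p, text, re.MULTILINE) for p in patterns):
            hits.add(lang)
    return hits

doc = "Meeting notes...\n#include <stdio.h>\nint main(void) { return 0; }"
print(detect_code(doc))  # {'c/c++'}
```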
How It Works
- ETL Steps (non-security infrastructure):
- Pre-Processing Pipeline: Advanced image and text pre-processing optimizes content for both training and inference
- Document Classification: Transformer models analyze and classify document types (credit cards, passports, source code, PII, etc.)
- Content Extraction: Multi-modal extraction captures text, metadata, embedded objects, and visual elements
- LLM-Powered Analysis: Fine-tuned LLMs perform classification and detection of sensitive/PII data with contextual awareness
- Real-Time Decision: System applies configured actions (alert, block, redact, etc.) based on detection results
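The five-stage flow above can be summarized as a skeletal pipeline. All function names below are hypothetical stand-ins with stubbed bodies, not Versa APIs.

```python
# A skeletal, hypothetical rendering of the five-stage flow above; the
# function bodies are stubs, and none of the names are Versa APIs.
from enum import Enum

class Action(Enum):
    ALERT = "alert"
    BLOCK = "block"
    REDACT = "redact"
    ALLOW = "allow"

def preprocess(document: bytes) -> str:
    """1. Pre-processing: normalize text, prepare images for analysis."""
    return document.decode(errors="ignore")

def classify_document(content: str) -> str:
    """2. Document classification: transformer model picks a type (stubbed)."""
    return "generic"

def extract_content(content: str) -> list[str]:
    """3. Multi-modal extraction: text, metadata, embedded objects (stubbed)."""
    return [content]

def llm_detect_sensitive(doc_type: str, parts: list[str]) -> list[str]:
    """4. LLM-powered analysis: flag sensitive/PII spans (stubbed keyword check)."""
    return [p for p in parts if "ssn" in p.lower()]

def decide(findings: list[str]) -> Action:
    """5. Real-time decision: map findings to the configured action."""
    return Action.BLOCK if findings else Action.ALLOW

doc = b"SSN: 123-45-6789"
print(decide(llm_detect_sensitive("generic", extract_content(preprocess(doc)))))
```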
Compliance & Use Cases
- PCI-DSS: Detect credit card numbers and payment information
- HIPAA: Identify protected health information (PHI)
- US PII: Flag personally identifiable information including Social Security numbers and driver’s licenses
- India PII: Flag Aadhaar numbers and PAN cards
- Source Code Protection: Prevent intellectual property leakage through code repositories
- Financial Data: Identify sensitive financial documents, including fingerprinted IC circuit designs and proprietary designs
- Japan PII: Flag personally identifiable information including Japanese My Number and driver’s licenses written in Japanese script
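As a concrete example of one detector from the list above, PCI-DSS card discovery typically pairs a pattern match with a Luhn checksum to discard random digit strings. This is a standard technique, not Versa’s exact implementation.

```python
# PCI-DSS credit card discovery: a candidate pattern match is confirmed
# with the Luhn checksum, cutting false positives from arbitrary digits.
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Validate a candidate card number with the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

text = "Card on file: 4111 1111 1111 1111, ref 1234 5678 9012 3456"
for m in CARD_RE.finditer(text):
    verdict = "valid card" if luhn_valid(m.group()) else "not a card"
    print(m.group().strip(), "->", verdict)
```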
Granular File Inspection (Including Metadata)
What It Does
Goes beyond basic content scanning by performing deep inspection of file metadata, embedded objects, and structural elements within documents. The platform analyzes core file content as well as non-obvious areas such as comments and embedded images in PowerPoint files. It also inspects Excel formulas to detect sensitive data that traditional DLP solutions often miss.
Key Capabilities
Optical Character Recognition (OCR) for Embedded Images
OCR converts images to searchable, analyzable text. When a document containing images passes through the DLP engine, the system identifies embedded images, extracts visual text using OCR technology, and applies DLP pattern matching to the extracted content. This ensures sensitive information hidden in screenshots, scanned documents, or visual elements doesn’t bypass security controls.
Supported scenarios:
- Extracts text from images embedded in PDF documents, PowerPoint presentations (.ppt, .pptx), Excel spreadsheets (.xls, .xlsx), and Word documents (.doc, .docx)
- Applies DLP policies to OCR-extracted text (supports PNG, JPEG, BMP formats)
- Detects Social Security Numbers, Aadhaar numbers, PAN cards, driver’s licenses, credit cards, and other PII in image attachments
- OCR on ZIP File Contents: Recursively scans compressed archives, performs OCR on images found within ZIP files, and applies comprehensive DLP analysis
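For illustration, the sketch below reproduces two of the scenarios above (image OCR and recursive ZIP scanning) using the open-source pytesseract and Pillow libraries. Versa’s OCR engine is its own, and the Aadhaar pattern shown is simplified.

```python
# A minimal sketch of OCR-based DLP: OCR standalone images and recurse
# into ZIP archives, applying a pattern check to the extracted text.
import io
import re
import zipfile

from PIL import Image
import pytesseract

AADHAAR_RE = re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")  # illustrative pattern

def scan_image(data: bytes) -> list[str]:
    """OCR an image and return pattern matches in the extracted text."""
    text = pytesseract.image_to_string(Image.open(io.BytesIO(data)))
    return AADHAAR_RE.findall(text)

def scan_zip(stream) -> list[str]:
    """Recursively scan a ZIP archive, OCR-ing any images found inside."""
    hits = []
    with zipfile.ZipFile(stream) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            lower = name.lower()
            if lower.endswith((".png", ".jpg", ".jpeg", ".bmp")):
                hits.extend(scan_image(data))
            elif lower.endswith(".zip"):  # nested archive: recurse
                hits.extend(scan_zip(io.BytesIO(data)))
    return hits

# Usage: matches = scan_zip("attachments.zip")  # accepts a path or file-like object
```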
Comprehensive Metadata Analysis
- Comments & Notes: Scans comments in Word/Excel documents and speaker notes in PowerPoint presentations and takes action based on the policy such as “redact/obfuscate” or “block”
- Headers & Footers: Analyzes header and footer content across all document types
- Works with both Content Analysis and EDM (Exact Data Match) rule types
- Supports compliance frameworks: US_PII, PCI_DSS, HIPAA
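To show where these fields live in each format, the sketch below reads the same “non-obvious” areas with common open-source parsers (openpyxl, python-pptx, python-docx). Versa’s parsers are internal, so this is illustrative only.

```python
# Reading the metadata areas named above with open-source parsers; each
# returned string would then go through the same DLP pattern matching
# as body text, with redact/obfuscate or block applied on a match.
from openpyxl import load_workbook        # Excel cell comments
from pptx import Presentation             # PowerPoint speaker notes
from docx import Document                 # Word headers/footers

def excel_comments(path: str) -> list[str]:
    wb = load_workbook(path)
    return [cell.comment.text
            for ws in wb.worksheets
            for row in ws.iter_rows()
            for cell in row
            if cell.comment]

def pptx_speaker_notes(path: str) -> list[str]:
    prs = Presentation(path)
    return [slide.notes_slide.notes_text_frame.text
            for slide in prs.slides
            if slide.has_notes_slide]

def docx_headers_footers(path: str) -> list[str]:
    doc = Document(path)
    texts = []
    for section in doc.sections:
        texts.extend(p.text for p in section.header.paragraphs)
        texts.extend(p.text for p in section.footer.paragraphs)
    return texts
```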
How It Works
OCR Processing Pipeline:
- Image Detection: System identifies embedded images in supported file formats
- Text Extraction: OCR engine converts visual text in images to machine-readable format
- Content Analysis: Extracted text undergoes the same DLP pattern matching as native text
- Policy Application: Configured DLP rules (Content Analysis, EDM, or OCR rule types) apply to all discovered content
- Action Enforcement: System executes configured action (alert, block, redact, tokenize) when matches are found
Redaction/Tokenization
What It Does
Automatically obscures or replaces sensitive data in documents while maintaining file usability and format integrity. This allows organizations to share documents externally or internally while protecting specific sensitive elements.
Key Capabilities
Text-Based Redaction/Tokenization for EDM
Supports Exact Data Match (EDM) redaction and tokenization across Word documents (.doc, .docx), Excel spreadsheets (.xls, .xlsx), PDF documents, PowerPoint presentations (.ppt, .pptx), and plain text files (.txt, .xml, .sh, .html, .c, .php).
Redaction: When a DLP rule detects a match in an editable, text-based file, the system replaces the matched content with random characters, making the original data unrecoverable.
Tokenization (EDM only): Replaces sensitive data with a token value that retains the same format (character set and length) as the original while changing the actual values to protect the data. This allows documents to remain functional for workflows requiring format consistency.
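The difference between the two actions can be shown in a few lines. The random substitution below is a minimal sketch; a production tokenization scheme could be keyed or deterministic rather than random.

```python
# Contrasting the two actions: redaction destroys the value, while
# tokenization keeps each character's class (digit/letter) and the overall
# length so downstream format checks still pass.
import random
import string

def redact(match: str) -> str:
    """Replace the match with random characters; original is unrecoverable."""
    return "".join(random.choice(string.ascii_letters + string.digits)
                   for _ in match)

def tokenize(match: str) -> str:
    """Replace each character with one of the same class, preserving format."""
    out = []
    for ch in match:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_uppercase if ch.isupper()
                                     else string.ascii_lowercase))
        else:
            out.append(ch)          # keep separators such as '-' or ' '
    return "".join(out)

ssn = "123-45-6789"
print(redact(ssn))    # e.g. 'qZ81bk0Rywf' -> format destroyed
print(tokenize(ssn))  # e.g. '904-17-3382' -> format preserved
```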
Image Redaction/Tokenization for EDM
Redacts or tokenizes sensitive content detected in embedded images within PDF documents, Excel spreadsheets, PowerPoint presentations, and Word documents.
Content Analysis & OCR Redaction
- Applies redaction/tokenization to content discovered through Content Analysis rules
- Redacts text extracted via OCR from images
- Maintains document structure and formatting while protecting sensitive data
Processing Flow
- DLP engine detects sensitive pattern match (via EDM, Content Analysis, or OCR)
- System evaluates configured action (redaction vs. tokenization)
- Matched content is replaced according to policy
- Modified document is delivered to destination
- Action is logged to Versa Analytics for audit trail
Use Cases
- Email Attachments: Automatically redact Social Security numbers from documents before external transmission
- Document Sharing: Tokenize credit card numbers in financial reports shared with third parties
- OCR-Based Protection: Redact sensitive information detected in scanned images within PDFs
- Multi-Format Protection: Apply consistent redaction policies across Word, Excel, PowerPoint, and PDF documents
Technical Implementation
Supported File Types
- Office Documents: .doc, .docx, .xls, .xlsx, .ppt, .pptx
- PDFs: Standard and scanned PDFs with OCR
- Images: .png, .jpeg, .bmp, .gif, .tif
- Code Files: .c, .cpp, .php, .py, .pl, .sh
- Archives: .zip, .tar, .gzip, .xz
- Other: .txt, .xml, .html, .csv, .rtf
Integration with Existing DLP Framework
These features integrate seamlessly with Versa’s existing DLP architecture:
- DLP Rules: Create Content Analysis, EDM, File DLP, Document Fingerprinting, and OCR rule types
- Data Protection Profiles: Define patterns and Boolean operations for matching
- DLP Profiles: Combine multiple rules with application groups and processing order
- Policy Association: Apply DLP profiles to SASE Internet Protection Rules and Private Application Protection Rules
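To visualize how these pieces compose, here is a purely hypothetical rendering of the rule-to-profile-to-policy hierarchy. The field names are invented for clarity and do not reflect Versa’s actual configuration schema.

```python
# Hypothetical, illustrative structure only; field names are invented
# and are not Versa's configuration schema.
dlp_profile = {
    "name": "pci-outbound",
    "rules": [
        {"type": "content_analysis", "profile": "PCI_DSS", "action": "block"},
        {"type": "edm", "dataset": "customer-cards", "action": "tokenize"},
        {"type": "ocr", "profile": "US_PII", "action": "redact"},
    ],
    "application_groups": ["email", "file-sharing"],
}

# The profile would then be attached to a SASE Internet Protection Rule
# or Private Application Protection Rule for enforcement.
```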
Analytics & Reporting
All AI/ML detections, granular inspections, and redaction/tokenization actions are logged to Versa Analytics, providing comprehensive audit trails for compliance, detailed reports on data usage patterns, threat severity classification (Critical, Major, Normal), and threat type categorization (ML document fingerprint, ML image classification, ML source code, OCR match, etc.).
Summary and Next Steps
Versa Universal SASE 23.1.1 extends existing DLP capabilities with an ML‑driven discovery and classification layer that runs inline as containerized services on SASE gateways. This approach enables organizations to identify and baseline sensitive data across files, SaaS applications, and live traffic before enforcement, while keeping data local and avoiding external processing. Combined with Versa’s established contextual DLP, granular file and metadata inspection, and in‑place redaction or tokenization, these enhancements provide a consistent enforcement path from discovery to real‑time prevention with measurable reductions in false positives.
To evaluate these capabilities in your environment, contact a Versa field subject matter expert or request a demo to see how AI‑assisted discovery and enforcement can be operationalized within your existing DLP framework.