Critical RCE Vulnerabilities in Major AI Inference Engines
Unprecedented Remote Code Execution Flaws Expose Global AI Infrastructure to Full System Compromise
The Discovery: November 14, 2025
On the evening of November 14, 2025, a coordinated disclosure by independent security researchers unveiled a cluster of critical remote code execution (RCE) vulnerabilities affecting the core inference engines powering modern artificial intelligence deployments worldwide.
Unlike previous AI security issues focused on prompt injection or model theft, these flaws reside in the runtime execution layer — the software responsible for loading models, processing inference requests, and managing GPU/TPU resources. The vulnerabilities enable unauthenticated attackers to execute arbitrary operating system commands with the full privileges of the AI service process.
The disclosure was not the result of a single bug bounty or internal audit, but rather a convergence of findings from multiple research teams analyzing production-grade AI serving stacks over the past six months.
Affected Platforms & Software
The vulnerabilities impact a wide spectrum of AI inference frameworks — from proprietary enterprise solutions to open-source projects used by millions. Below is a comprehensive list:
Meta Llama Inference Stack
All versions of Llama.cpp and Meta's internal serving runtime prior to emergency patches.
CriticalNvidia Triton Inference Server
Versions 2.38 through 2.51 in default configurations with model repositories enabled.
CriticalMicrosoft Azure ML Inference Endpoints
On-premise and hybrid deployments using KServe-compatible runtimes.
HighvLLM (UC Berkeley)
All releases before v0.5.2 when using continuous batching and PagedAttention.
CriticalPyTorch Serve & TorchServe
Default management API exposed on port 8081 with model archive loading.
HighSGLang & LightLLM
Runtime deserialization flaws in frontend request handlers.
Medium| Platform | Vulnerable Component | CVSS v3.1 Score | Exploit Requires Auth? | Patch Status |
|---|---|---|---|---|
| Llama.cpp | GGUF model loader | 9.8 | No | Emergency patch released |
| Nvidia Triton | Model repository HTTP endpoint | 9.8 | No | Hotfix in progress |
| vLLM | Continuous batching scheduler | 9.1 | No | v0.5.2 released |
| PyTorch Serve | Management API | 8.7 | Optional | Patch pending |
Technical Deep Dive
Root Cause: Unsafe Deserialization in Model Pipelines
The primary vulnerability class stems from insecure deserialization during model loading and inference request preprocessing. Many AI serving frameworks accept serialized tensors, metadata, or configuration objects directly from HTTP APIs to optimize performance.
POST /v2/models/llama-70b/infer HTTP/1.1
Content-Type: application/octet-stream
[Serialized PyTorch Tensor + Malicious Pickle Payload]
└── __reduce__() → os.system('rm -rf / --no-preserve-root')
Secondary Issue: Buffer Overflow in GPU Kernels
In Nvidia Triton and vLLM, oversized attention key-value caches can trigger heap-based buffer overflows when processing batched requests with extreme sequence lengths.
{
"inputs": [
{
"name": "input_ids",
"shape": [1, 2147483647],
"datatype": "INT32",
"data": [49407, 32000, ...]
}
]
}
# Triggers integer overflow in KV cache allocation
Privilege Escalation Path
Most AI inference services run as root or with CAP_SYS_ADMIN to access GPU drivers and hugepages. A successful RCE immediately grants full host compromise.
Real-World Attack Scenarios
api.company.com/v1/chat) are being actively scanned by automated bots within hours of disclosure. Shodan and Censys show over 180,000 exposed inference endpoints.
Scenario 1: Cryptojacking via GPU
Attacker deploys XMRig miner directly on compromised A100/H100 GPUs, generating $500–$2,000/day per server.
Scenario 2: Proprietary Model Theft
Exfiltrates 70B+ parameter models (worth millions in training costs) via DNS tunneling or chunked HTTP responses.
Scenario 3: Supply Chain Backdoor
Injects malicious weights into downstream fine-tuned models, creating persistent backdoors in customer deployments.
Scenario 4: Ransomware + Extortion
Encrypts model weights and training datasets, demands payment in BTC with proof-of-encryption screenshots.
Enterprise & National Security Impact
- Financial Services: AI-driven fraud detection systems compromised → regulatory fines under RBI, SEBI.
- Healthcare: Diagnostic AI models tampered → patient safety risks.
- Defense Contractors: Classified reasoning engines exposed.
- Cloud Providers: Multi-tenant GPU clusters at risk of cross-tenant breaches.
In India alone, over 4,200 organizations run vulnerable vLLM or Triton instances, including major banks, e-commerce platforms, and government AI labs.
Enterprise Mitigation Playbook
Immediate Actions (0–4 Hours)
- Block all inbound traffic to ports
8000,8080,8081,9000at network edge. - Disable model loading from untrusted sources in all serving configs.
- Rotate all API keys and JWT secrets used by AI frontends.
Short-Term (4–24 Hours)
- Upgrade to patched versions:
vLLM >= 0.5.2Triton >= 2.52.0-hotfixLlama.cpp >= Nov 15 emergency build
- Enable
--safe-deserializationflags where available. - Deploy WAF rules to block oversized
input_idsand pickle payloads.
Long-Term Hardening
- Run inference in firecracker microVMs or gVisor sandboxes.
- Implement model signing with TEE attestation (Nvidia Confidential Computing).
- Monitor GPU memory patterns with Falco or eBPF rules.
# Example Nginx WAF Rule
SecRule ARGS "@detectSQLi" \
"id:1001,phase:2,deny,status:403,msg:'Blocked malicious inference request'"
SecRule REQUEST_HEADERS:Content-Type "@contains" "pickle" \
"id:1002,phase:1,deny,status:403"
The Road Ahead: Securing AI Infrastructure
This incident marks a watershed moment in AI security. The era of "move fast and deploy models" is over. Expect:
- Mandatory SBOMs for all published AI models
- Runtime attestation for inference servers
- Standardized secure serving APIs (like OpenAI’s but open-source)
- Government regulations for critical AI systems (India’s DPDP Act expansion?)
The community is already rallying: AI Security Alliance announced an emergency working group on November 15 to draft Secure AI Serving Guidelines v1.0 by Q1 2026.