Critical RCE Vulnerabilities in Major AI Inference Engines

By Ashish S

Unprecedented Remote Code Execution Flaws Expose Global AI Infrastructure to Full System Compromise

Published: November 15, 2025

The Discovery: November 14, 2025

On the evening of November 14, 2025, a coordinated disclosure by independent security researchers unveiled a cluster of critical remote code execution (RCE) vulnerabilities affecting the core inference engines powering modern artificial intelligence deployments worldwide.

Unlike previous AI security issues focused on prompt injection or model theft, these flaws reside in the runtime execution layer — the software responsible for loading models, processing inference requests, and managing GPU/TPU resources. The vulnerabilities enable unauthenticated attackers to execute arbitrary operating system commands with the full privileges of the AI service process.

The disclosure was not the result of a single bug bounty or internal audit, but rather a convergence of findings from multiple research teams analyzing production-grade AI serving stacks over the past six months.

Affected Platforms & Software

The vulnerabilities impact a wide spectrum of AI inference frameworks — from proprietary enterprise solutions to open-source projects used by millions. Below is a comprehensive list:

Meta Llama Inference Stack

All versions of Llama.cpp and Meta's internal serving runtime prior to emergency patches.

Severity: Critical

Nvidia Triton Inference Server

Versions 2.38 through 2.51 in default configurations with model repositories enabled.

Severity: Critical

Microsoft Azure ML Inference Endpoints

On-premise and hybrid deployments using KServe-compatible runtimes.

Severity: High

vLLM (UC Berkeley)

All releases before v0.5.2 when using continuous batching and PagedAttention.

Severity: Critical

TorchServe (PyTorch Serve)

Default management API exposed on port 8081 with model archive loading.

Severity: High

SGLang & LightLLM

Runtime deserialization flaws in frontend request handlers.

Severity: Medium

| Platform | Vulnerable Component | CVSS v3.1 Score | Exploit Requires Auth? | Patch Status |
|---|---|---|---|---|
| Llama.cpp | GGUF model loader | 9.8 | No | Emergency patch released |
| Nvidia Triton | Model repository HTTP endpoint | 9.8 | No | Hotfix in progress |
| vLLM | Continuous batching scheduler | 9.1 | No | v0.5.2 released |
| PyTorch Serve | Management API | 8.7 | Optional | Patch pending |

Technical Deep Dive

Root Cause: Unsafe Deserialization in Model Pipelines

The primary vulnerability class stems from insecure deserialization during model loading and inference request preprocessing. Many AI serving frameworks accept serialized tensors, metadata, or configuration objects directly from HTTP APIs to optimize performance.

POST /v2/models/llama-70b/infer HTTP/1.1
Content-Type: application/octet-stream

[Serialized PyTorch Tensor + Malicious Pickle Payload]
└── __reduce__() → os.system('rm -rf / --no-preserve-root')
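
To make the mechanism concrete, here is a minimal Python sketch of why unpickling untrusted bytes is dangerous, together with the allowlist-style `Unpickler` pattern described in the Python documentation. The `Payload` class and `safe_loads` helper are hypothetical names used for illustration only; the payload calls the harmless `print` where a real attacker would call something like `os.system`. It is not taken from any of the affected frameworks.

```python
import io
import pickle


class Payload:
    """Stand-in for an attacker-controlled object (harmless here)."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments" and
        # executes that call while loading. An attacker would substitute
        # a dangerous callable for print.
        return (print, ("code ran during unpickling",))


class RestrictedUnpickler(pickle.Unpickler):
    """Refuses every global lookup, so only plain data types can load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def safe_loads(data: bytes):
    """Hypothetical helper: unpickle with all callables and classes blocked."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())
pickle.loads(blob)  # the callable chosen by the payload runs at load time

try:
    safe_loads(blob)
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

Blocking every global is deliberately strict. A serving stack that must accept tensors would more likely move to a data-only format than try to maintain an allowlist of safe classes.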

Secondary Issue: Buffer Overflow in GPU Kernels

In Nvidia Triton and vLLM, batched requests with extreme sequence lengths can overflow the integer used to size the attention key-value (KV) cache. The resulting undersized allocation is then overrun, producing a heap-based buffer overflow.

{
  "inputs": [
    {
      "name": "input_ids",
      "shape": [1, 2147483647],
      "datatype": "INT32",
      "data": [49407, 32000, ...]
    }
  ]
}
# Triggers integer overflow in KV cache allocation
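
A request like the one above can be rejected before any allocation happens. The following Python sketch shows one way to validate a declared tensor shape. `MAX_ELEMENTS`, `validate_shape`, and `wrapped_u32` are hypothetical names and limits chosen for illustration, not part of Triton or vLLM.

```python
from math import prod
from typing import Sequence

MAX_ELEMENTS = 1 << 20  # hypothetical per-request cap; tune to the model's context length
BYTES_PER_ELEMENT = 4   # INT32, as in the request above


def wrapped_u32(n: int) -> int:
    """What an unchecked 32-bit unsigned size computation would yield."""
    return n & 0xFFFFFFFF


def validate_shape(shape: Sequence[int]) -> int:
    """Return the element count, or raise ValueError if the shape is unsafe."""
    if not shape or any(not isinstance(d, int) or d <= 0 for d in shape):
        raise ValueError(f"invalid shape: {list(shape)}")
    count = prod(shape)  # Python ints do not overflow, so this is the true count
    if count > MAX_ELEMENTS:
        raise ValueError(f"{count} elements exceeds limit of {MAX_ELEMENTS}")
    return count


shape = [1, 2147483647]
true_bytes = prod(shape) * BYTES_PER_ELEMENT
print("true size:", true_bytes, "32-bit wrapped size:", wrapped_u32(true_bytes))
try:
    validate_shape(shape)
except ValueError as exc:
    print("rejected:", exc)
```

The check runs on the declared shape, so it costs nothing even when the body is enormous. A production check would also confirm that the length of `data` matches the declared shape.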

Privilege Escalation Path

Many AI inference services run as root or with CAP_SYS_ADMIN to access GPU drivers and hugepages. In those deployments, a successful RCE immediately grants full host compromise.
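
Whether a given service is exposed to this escalation path can be checked from inside the process. The sketch below is a Linux-only illustration with hypothetical function names. It reads the effective capability mask from `/proc/self/status` and tests bit 21, which is the value of CAP_SYS_ADMIN in the kernel headers.

```python
import os
from typing import List, Optional

CAP_SYS_ADMIN_BIT = 21  # CAP_SYS_ADMIN in linux/capability.h


def has_cap_sys_admin(cap_eff_hex: str) -> bool:
    """Test the CapEff bitmask (hex string from /proc/<pid>/status)."""
    return bool((int(cap_eff_hex, 16) >> CAP_SYS_ADMIN_BIT) & 1)


def read_cap_eff(status_path: str = "/proc/self/status") -> Optional[str]:
    """Return the CapEff field, or None when it is unavailable (non-Linux)."""
    try:
        with open(status_path) as fh:
            for line in fh:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        pass
    return None


def audit_privileges() -> List[str]:
    """List the risky privileges held by the current process."""
    findings = []
    if hasattr(os, "geteuid") and os.geteuid() == 0:
        findings.append("process runs as root")
    cap_eff = read_cap_eff()
    if cap_eff is not None and has_cap_sys_admin(cap_eff):
        findings.append("process holds CAP_SYS_ADMIN")
    return findings


print(audit_privileges() or "no elevated privileges detected")
```

Running this at service startup and refusing to start (or at least logging loudly) when the list is non-empty is one possible policy; whether that is practical depends on how the GPU driver is exposed to the container.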

Real-World Attack Scenarios

Immediate Threat: Public-facing AI APIs (e.g., api.company.com/v1/chat) were being actively scanned by automated bots within hours of disclosure. Shodan and Censys show over 180,000 exposed inference endpoints.

Scenario 1: Cryptojacking via GPU

Attacker deploys XMRig miner directly on compromised A100/H100 GPUs, generating $500–$2,000/day per server.

Scenario 2: Proprietary Model Theft

Attacker exfiltrates 70B+ parameter models (worth millions in training costs) via DNS tunneling or chunked HTTP responses.

Scenario 3: Supply Chain Backdoor

Attacker injects malicious weights into downstream fine-tuned models, creating persistent backdoors in customer deployments.

Scenario 4: Ransomware + Extortion

Attacker encrypts model weights and training datasets, then demands payment in BTC with proof-of-encryption screenshots.

Enterprise & National Security Impact

  • Financial Services: AI-driven fraud detection systems compromised → regulatory penalties from RBI and SEBI.
  • Healthcare: Diagnostic AI models tampered → patient safety risks.
  • Defense Contractors: Classified reasoning engines exposed.
  • Cloud Providers: Multi-tenant GPU clusters at risk of cross-tenant breaches.

In India alone, over 4,200 organizations run vulnerable vLLM or Triton instances, including major banks, e-commerce platforms, and government AI labs.

Enterprise Mitigation Playbook

Immediate Actions (0–4 Hours)

  1. Block all inbound traffic to ports 8000, 8080, 8081, and 9000 at the network edge.
  2. Disable model loading from untrusted sources in all serving configs.
  3. Rotate all API keys and JWT secrets used by AI frontends.
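
After the edge rules are in place, it is worth confirming from an external vantage point that the ports are really closed. The following is a minimal sketch using only the standard library; `is_port_open` and `exposed_ports` are hypothetical helper names, and the host in the usage line is a placeholder.

```python
import socket
from typing import Iterable, List

INFERENCE_PORTS = (8000, 8080, 8081, 9000)  # ports named in step 1 above


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def exposed_ports(host: str, ports: Iterable[int] = INFERENCE_PORTS) -> List[int]:
    """Return the subset of ports that still accept connections."""
    return [p for p in ports if is_port_open(host, p)]


if __name__ == "__main__":
    # Placeholder host; run this from outside the network perimeter.
    print(exposed_ports("127.0.0.1"))
```

An empty list only shows that the ports are unreachable from where the script ran, so it should be run from outside the perimeter and only against hosts you are authorized to test.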

Short-Term (4–24 Hours)

  1. Upgrade to patched versions:
    • vLLM >= 0.5.2
    • Triton >= 2.52.0-hotfix
    • Llama.cpp >= Nov 15 emergency build
  2. Enable --safe-deserialization flags where available.
  3. Deploy WAF rules to block oversized input_ids and pickle payloads.
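
The upgrade step can be verified automatically across a fleet. This sketch compares an installed package version against the minimum named above using only the standard library. `parse_version`, `is_patched`, and `check_installed` are hypothetical helpers, and the parser is intentionally simple: it ignores pre-release and post-release suffixes, so a release candidate of 0.5.2 would be treated as patched.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_VLLM = (0, 5, 2)  # patched vLLM release named in step 1 above


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric segment, e.g. '0.5.2.post1' -> (0, 5, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip().lstrip("v"))
    if match is None:
        raise ValueError(f"unparseable version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_patched(installed: str, minimum: Tuple[int, ...] = MIN_VLLM) -> bool:
    """Compare versions numerically, so 0.10.0 correctly sorts after 0.5.2."""
    return parse_version(installed) >= minimum


def check_installed(package: str = "vllm") -> Optional[bool]:
    """Return True/False for an installed package, or None if it is absent."""
    try:
        return is_patched(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


print("vllm patched:", check_installed())
```

Comparing integer tuples instead of raw strings matters here: a plain string comparison would rank "0.10.0" below "0.5.2".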

Long-Term Hardening

  • Run inference in Firecracker microVMs or gVisor sandboxes.
  • Implement model signing with TEE attestation (Nvidia Confidential Computing).
  • Monitor GPU memory patterns with Falco or eBPF rules.
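
# Example ModSecurity rules (usable with Nginx through the ModSecurity connector)
# Generic injection check on request arguments; it does not cover oversized input_ids
SecRule ARGS "@detectSQLi" \
    "id:1001,phase:2,deny,status:403,msg:'Blocked malicious inference request'"

# Reject requests whose Content-Type header mentions pickle
SecRule REQUEST_HEADERS:Content-Type "@contains pickle" \
    "id:1002,phase:1,deny,status:403"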
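
Full model signing with TEE attestation needs dedicated infrastructure, but a simpler integrity check already blocks tampered weights from being loaded. The sketch below pins a model file to a known SHA-256 digest. This is digest pinning, not cryptographic signing, and `sha256_file` and `verify_model` are hypothetical helper names; the expected digest is assumed to come from a trusted, separately stored manifest.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: str, expected_sha256: str) -> bool:
    """Return True only if the file matches the pinned digest."""
    actual = sha256_file(path)
    # compare_digest avoids leaking the match position through timing
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A loader would call `verify_model` before handing the path to the framework and refuse to start on a mismatch.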

The Road Ahead: Securing AI Infrastructure

This incident marks a watershed moment in AI security. The era of "move fast and deploy models" is over. Expect:

  • Mandatory SBOMs for all published AI models
  • Runtime attestation for inference servers
  • Standardized secure serving APIs (like OpenAI’s but open-source)
  • Government regulations for critical AI systems (India’s DPDP Act expansion?)

The community is already rallying: the AI Security Alliance announced an emergency working group on November 15 to draft Secure AI Serving Guidelines v1.0 by Q1 2026.

Ashish S
Ashish is a cybersecurity student with over 2 years of experience in cybersecurity research, bug bounty hunting, and programming.