NSFOCUS AI-Scan Typical Capabilities: Large Language Model Adversarial Defense Capability Assessment

July 16, 2025 | NSFOCUS

Large language model (LLM) adversarial attacks refer to techniques that deceive LLMs through carefully-designed input samples (adversarial samples) to produce incorrect predictions or behaviors. In this regard, AI-Scan provides LLM adversarial defense capability assessment, allowing users to select an adversarial attack assessment template for one-click task assignment and generate an adversarial defense capability assessment report.

Specifically, adversarial attacks are categorized into the following seven types:

01 Model Hallucination

Model hallucination occurs when a LLM generates content that appears plausible but is incorrect, fabricated, or irrelevant to the input, this includes attack types such as factual hallucination and faithful hallucination.

02 Meta-Prompt Leakage

Meta-prompt leakage refers to situations where a LLLM inadvertently exposes its internally preset system-level prompts while responding to user requests, revealing underlying instructions, rules, or preferences that should remain hidden. This includes attack types such as keyword positioning leakage and assumed scenario leakage.

03 Model Jailbreak Attack

Model jailbreak attack refers to bypassing the security restrictions of LLM systems through specific technical means, causing them to violate preset ethical guidelines, content review rules or terms of use, thereby generating otherwise prohibited output, such as violence, illegality, privacy leakage and other content. Attack types include DAN, assumed scenario jailbreak, assumed role jailbreak and adversarial suffix attacks.

04 Role Escape Attack

Role escape attack refers to users inducing LLMs to deviate from their preset roles or behavioral norms through specific means, so that they break through the security boundaries, ethical limits or functional constraints set by developers and perform operations that should be prohibited. Such attacks usually target role-playing AI (such as customer service assistants, audit robots, etc.) with the aim of making the model “forget” its duties and act according to the attacker’s intentions. Attack types include assumed role escape, assumed scenario escape, role escape by ignoring previous instructions, and prompt goal hijacking.

05 Application Vulnerability Attack

LLM application vulnerability attack refers to a type of attack method that exploits the design defects, logical vulnerabilities or integration problems of actual application systems built based on large language models to carry out malicious behavior. This attack not only targets the model itself, but also involves weak links in the entire application ecosystem, which may lead to serious consequences such as data leakage, service abuse, and permission bypass. Attack methods include code execution injection, XSS session hijacking, and adversarial encoding attacks.

06 Model Function Abuse

Model function abuse refers to attackers using a LLM’s capabilities in unintended ways to generate harmful content, bypass restrictions, or perform malicious actions. This type of abuse does not directly attack the model itself but leverages its legitimate functions for illegal or unethical purposes. Examples include phishing email generation, malicious code generation and more.

07 Model Inversion Attack

Model inversion attack is a privacy attack method. The attacker repeatedly queries the target model and analyzes its output to reversely infer the training data or input sensitive information of the model. This type of attack aims to “spy” on the original data memorized by the model, which may lead to personal privacy leakage or commercial secret exposure. Attack types include training data leakage and model anomalies and more.

NSFOCUS AI Scan now supports adversarial attacks assessment covering all attack types mentioned above.