NSFOCUS Security Lab recently conducted hands-on tests to evaluate DeepSeek-R1's performance in security alarm analysis. DeepSeek-R1 showed higher alarm coverage than NSFGPT, NSFOCUS's self-developed SecLLM, but it also suffers from a high false alarm rate and heavy computational overhead. Nonetheless, its potential is noteworthy. This post focuses on the application evaluation of DeepSeek-R1 in security alert analysis and discusses its practical value and room for optimization.
I. How does DeepSeek-R1 perform?
According to the test results, the performance differences between DeepSeek-R1 and NSFGPT in security alarm analysis are as follows:
Indicator | DeepSeek-R1 Highest Single-round Accuracy | DeepSeek-R1 Lowest Single-round Accuracy | DeepSeek-R1 Overall Accuracy | NSFGPT Overall Accuracy |
---|---|---|---|---|
Whether the Alarm Indicates a Real Attack | 90% | 82% | 86.33% | 98.33% |
Whether the Alarm Indicates a Real and Successful Attack | 82% | 68% | 76.33% | 91.67% |
Table 1: Statistics of Alarm Analysis Test Results
Across the three rounds of tests there were 225 real attack samples, and DeepSeek-R1 reported every one of them, showing very strong detection ability. However, its false positive problem is serious: 41 of the 75 non-attack samples were misclassified as real attacks, an overall false positive rate of 54.67%. Because of the alarm fatigue this would cause, it is unacceptable in practical applications.
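The headline figures follow directly from these raw counts; a minimal Python check:

```python
# Reconstructing the headline metrics from the counts above:
# 3 rounds x (75 real attacks + 25 non-attacks) = 300 alarms total.
tp = 225          # real attacks, all correctly reported
fp = 41           # non-attacks misclassified as real attacks
fn = 0            # no real attacks were missed
tn = 75 - fp      # remaining non-attack samples, correctly dismissed

accuracy = (tp + tn) / (tp + tn + fp + fn)      # 259 / 300
false_positive_rate = fp / (fp + tn)            # 41 / 75

print(f"overall accuracy:    {accuracy:.2%}")             # 86.33%
print(f"false positive rate: {false_positive_rate:.2%}")  # 54.67%
```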
In addition, generic LLMs have a pervasive tendency to overestimate alarm hazards, which has also been observed in tests of models such as OpenAI GPT-4.
Annotation Source | No Harm | Slight Harm | Moderate Harm | Severe Harm | Undetermined | Total |
---|---|---|---|---|---|---|
Manual Classification | 82 | 326 | 87 | 69 | 13 | 577 |
OpenAI API Classification | 2 | 13 | 367 | 183 | 12 | 577 |
Table 2: Quantitative Analysis of the Generic LLM “Overestimated Alert Hazard” Problem (not related to DeepSeek, for reference only)
Note: The "Undetermined" entries stem from insufficient information in manual classification, and from overly long token sequences and other causes in machine classification.
This problem can be mitigated by fine-tuning the model. For example, it has been verified that after proper fine-tuning, the false positive rate of NSFGPT can be reduced to below 1%, depending on the complexity of the environment.
Test method: This test uses the original DeepSeek-R1 model and three rounds of sampling. Each round contains 75 real attack alarms and 25 non-attack (false) alarms, and evaluates both whether an alarm indicates a real attack and whether the attack succeeded. The data comes from multiple probe devices in a real network environment and is annotated by experts. The alarm information fed to the model is preprocessed by protocol field parsing, decoding, and decompression to ensure that it is suitable for analysis.
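As an illustration of that preprocessing, here is a minimal sketch of the decode/decompress steps (the actual protocol field parsing is probe-specific and omitted; the function name is hypothetical):

```python
import gzip
from urllib.parse import unquote

def preprocess_alarm(raw_payload: bytes) -> str:
    """Normalize a raw alarm payload before handing it to the model.

    A sketch of the decode/decompress steps described above; real
    probes would first extract the payload from parsed protocol fields.
    """
    data = raw_payload
    # Undo gzip compression if the payload starts with the gzip magic bytes.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    text = data.decode("utf-8", errors="replace")
    # Undo URL percent-encoding, common in HTTP attack payloads.
    return unquote(text)
```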
II. Performance bottleneck: 32 seconds to analyze each alarm, a computational overhead that cannot be ignored
DeepSeek-R1's high accuracy comes at a steep runtime cost. In this evaluation, the average time to analyze a single alarm was 32.03 seconds, far longer than any other LLM tested; under the same conditions, NSFGPT took less than one-tenth as long on average.
Beyond the sheer size of the model, another key reason is DeepSeek-R1's built-in chain of thought (CoT): its inference output always contains a long "thought process". Even though the test prompt required the model to include step-wise reasoning in its valid output, the CoT still accounted for 67.71% of the total output, more than twice the length of the valid output.
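Since latency and billing scale with output length, the CoT share is easy to monitor. A minimal sketch, assuming the reasoning is wrapped in `<think>...</think>` tags as in DeepSeek-R1's raw output format (verify for your serving stack), with character counts as a rough proxy for tokens:

```python
def cot_share(model_output: str) -> float:
    """Fraction of an R1 reply consumed by the chain of thought.

    Assumes the reasoning is delimited by <think>...</think> tags,
    as in DeepSeek-R1's raw output; characters stand in for tokens.
    """
    start = model_output.find("<think>")
    end = model_output.find("</think>")
    if start < 0 or end < 0:
        return 0.0
    cot_len = end - (start + len("<think>"))
    total_len = len(model_output) - len("<think>") - len("</think>")
    return cot_len / max(total_len, 1)
```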
In some extreme cases the complete result is already given in the CoT, yet in the valid-output part the model repeats the same content, incurring unnecessary computation. Relying solely on DeepSeek-R1 to process all alarms would therefore be very expensive. A more comprehensive approach that incorporates small models or traditional rules, rather than a single LLM, is needed, as the sketch below illustrates.
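A minimal sketch of that tiered idea (the rules, thresholds, and function names are illustrative, not NSFOCUS's actual architecture): cheap rules and a small classifier settle the easy alarms, and only the ambiguous middle band pays the roughly 32-second LLM cost.

```python
from typing import Callable

# Illustrative benign indicators; a real deployment would use a curated rule set.
KNOWN_BENIGN_SIGNS = ("healthcheck", "/favicon.ico")

def triage(alarm_text: str,
           small_model_score: Callable[[str], float],
           llm_analyze: Callable[[str], str]) -> str:
    # Tier 1: hard rules knock out obvious false alarms for free.
    if any(sign in alarm_text for sign in KNOWN_BENIGN_SIGNS):
        return "false_alarm"
    # Tier 2: a small, fast classifier handles confident cases.
    score = small_model_score(alarm_text)  # P(real attack), in [0, 1]
    if score >= 0.95:
        return "real_attack"
    if score <= 0.05:
        return "false_alarm"
    # Tier 3: only the ambiguous middle band is escalated to the large LLM.
    return llm_analyze(alarm_text)
```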
III. The false alarm problem: high false positive rate and hallucinations
Although DeepSeek-R1 has demonstrated strong attack detection capability, its false positive rate is as high as 54.67%. That is, DeepSeek-R1 often misidentifies normal business behavior as an attack, and this kind of hallucination is pervasive among LLMs. In some cases the model fabricates attack signatures out of thin air or classifies benign requests as real attacks. Effectively suppressing such hallucinations is therefore a key problem DeepSeek-R1 must solve before practical deployment.
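One possible mitigation, not part of the original test, is evidence grounding: prompt the model to quote the exact substring of the alarm that supports its verdict, and downgrade any "real attack" verdict whose cited evidence does not actually occur in the input. A minimal sketch, assuming an illustrative JSON reply schema:

```python
import json

def grounded_verdict(alarm_text: str, model_reply: str) -> dict:
    """Reject verdicts whose cited evidence is absent from the alarm.

    Assumes the model was prompted to reply with JSON of the form
    {"verdict": "real_attack" | "false_alarm", "evidence": "<substring>"}
    (an illustrative schema, not from the original test).
    """
    try:
        reply = json.loads(model_reply)
    except json.JSONDecodeError:
        return {"verdict": "undetermined", "reason": "unparseable reply"}
    evidence = reply.get("evidence", "")
    # If the quoted "attack signature" never occurs in the raw alarm,
    # treat it as a hallucination and refuse the real-attack verdict.
    if reply.get("verdict") == "real_attack" and evidence not in alarm_text:
        return {"verdict": "undetermined", "reason": "evidence not grounded"}
    return reply
```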
(Data source: Vectara [1])
IV. In-depth analysis: The CoT mechanism enhances attack payload parsing
DeepSeek-R1's chain-of-thought (CoT) mechanism plays an important role in analyzing complex attack payloads. For example, when parsing an obfuscated WebShell file-upload attack, DeepSeek-R1 was able to identify and deeply analyze the true intent of the payload. Across ten repeated tests, it fully de-obfuscated the payload 6 times but failed to parse it completely in the remaining 4. Despite these deviations, DeepSeek-R1 demonstrated, without any fine-tuning, parsing ability beyond that of traditional models.
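For reproduction, a query of this kind can be issued through DeepSeek's OpenAI-compatible API. The endpoint, model name (`deepseek-reasoner`), and `reasoning_content` field below reflect DeepSeek's public API at the time of writing; treat them as assumptions and check the current documentation.

```python
from openai import OpenAI

# Assumed endpoint and model name; verify against DeepSeek's current docs.
client = OpenAI(base_url="https://api.deepseek.com", api_key="<YOUR_KEY>")

payload = "<obfuscated WebShell upload payload captured by the probe>"
resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{
        "role": "user",
        "content": "The following HTTP upload payload may be an obfuscated "
                   "WebShell. De-obfuscate it and explain its true intent:\n"
                   + payload,
    }],
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the (long) chain of thought
print(msg.content)                              # the final analysis
```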
V. Summary: Balancing high accuracy against the computational bottleneck
Overall, DeepSeek-R1 shows great potential in security alert analysis, especially in attack detection and in-depth analysis. However, its high false alarm rate and computational overhead remain the main challenges. To improve its practical applicability, future work can optimize along the following directions:
- Lightweight models: reduce DeepSeek-R1's computational overhead through knowledge distillation or model pruning.
- Comprehensive analysis framework: combine small models and rule engines to filter alarms first (see the sketch in Section II), reducing dependence on the large model and improving overall efficiency.
- Hallucination suppression: strengthen fine-tuning for complex scenarios and pair the model with external rule bases to curb misjudgments.
With further optimization, DeepSeek-R1 could become a valuable component of intelligent security operations.
References
[1] Forrest Bao, Chenyu Xu, Ofer Mendelevitch. DeepSeek-R1 hallucinates more than DeepSeek-V3. 2025. https://www.vectara.com/blog/deepseek-r1-hallucinates-more-than-deepseek-v3