Overview
In recent years, with the wide adoption of open-source LLMs and deployment tools such as DeepSeek and Ollama, enterprises around the world have been accelerating the private deployment of large models. This wave improves enterprise efficiency, but it also increases the risk of data leakage. According to NSFOCUS Xingyun Lab, in January and February 2025 alone, five major LLM-related data breaches broke out globally, leaking a large amount of sensitive data, including model chat histories, API keys, and credentials. These incidents expose enterprises' blind spots in applying AI technology and sound an alarm about "AI-driven risks". This post focuses on these five incidents and reveals the security blind spots in AI technology applications through timeline backtracking, statistics on the scale of each leak, breakdowns of the attack techniques, and mapping to the MITRE ATT&CK framework.
Event 1: The ClickHouse database used by DeepSeek was misconfigured, resulting in a serious chat data leak
Event time: January 29, 2025
Leakage scale: Over a million lines of log streams, including sensitive information such as chat history and keys
Event Review:
On January 29, 2025, the Wiz security research team discovered an exposed ClickHouse service on the Internet and attributed it to the Chinese AI startup DeepSeek. Because the ClickHouse service allowed access to data in the underlying database, Wiz researchers were able to view about a million lines of DeepSeek's log streams containing chat history, keys, and other sensitive information.
The Wiz team immediately notified DeepSeek of the problem, and the exposed ClickHouse service was promptly secured.
Event Analysis:
ClickHouse is an open-source columnar database management system (DBMS) designed for online analytical processing (OLAP). It can efficiently process large-scale data, supports real-time query and analysis, and is well suited to scenarios such as log analysis and user-behavior analysis. A ClickHouse service deployed without any access-control mechanism is effectively open to unauthorized access: any user can execute SQL statements through its exposed HTTP API.
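As a minimal illustration of this risk (the hostname below is a placeholder, not DeepSeek's actual endpoint), querying an unauthenticated ClickHouse HTTP interface can be as simple as:

```python
import requests

# Hypothetical exposed ClickHouse HTTP endpoint; 8123 is ClickHouse's default HTTP port.
# With no access-control mechanism configured, the default user can run SQL statements directly.
CLICKHOUSE_URL = "http://ch.example.com:8123/"

def run_query(sql: str) -> str:
    """Send a SQL statement to the ClickHouse HTTP interface and return the raw response."""
    resp = requests.get(CLICKHOUSE_URL, params={"query": sql}, timeout=10)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    # Enumerate databases and tables, mirroring the kind of queries described in this incident.
    print(run_query("SHOW DATABASES"))
    print(run_query("SHOW TABLES FROM default"))
```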
In this event, the Wiz security research team probed ports 80 and 443 on about 30 Internet-facing DeepSeek subdomains. Most of the exposed services host resources such as chatbot interfaces, status pages, and API documentation, with no apparent security risk. To explore DeepSeek's exposure further, the team expanded the probing to non-standard ports beyond 80 and 443, such as 8123 and 9000, and eventually found services exposed on these ports under multiple subdomains (a minimal probing sketch follows the list), for example:
- http://dev.deepseek.com:8123
- http://oauth2callback.deepseek.com:9000
- http://dev.deepseek.com:9000
- http://oauth2callback.deepseek.com:8123
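A minimal sketch of this kind of non-standard-port probing, assuming placeholder hostnames rather than DeepSeek's real subdomains:

```python
import socket

# Probe a set of subdomains on both standard and non-standard ports.
# Hostnames and ports below are placeholders for illustration only.
TARGETS = ["dev.example.com", "oauth2callback.example.com"]
PORTS = [80, 443, 8123, 9000]

def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host in TARGETS:
        for port in PORTS:
            if is_open(host, port):
                print(f"open: {host}:{port}")
```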
After confirming that these exposed services were ClickHouse, the Wiz security research team ran query tests against the underlying database through the ClickHouse HTTP API, including listing the databases and the tables within them, as shown in the following figures:
Figure 1. Suspected ClickHouse leaked data (1)
Figure 2. Suspected ClickHouse leaked data (2)
The logs at risk of disclosure date back to January 6, 2025 and include call logs and text logs for various internal DeepSeek API endpoints, containing chat history, API keys, back-end details, and operational metadata.
Verizon Event Classification: Miscellaneous Errors
MITRE ATT&CK techniques used:

| Technique | Sub-technique | How it was used |
| --- | --- | --- |
| T1590 Gather Victim Network Information | .002 DNS | Attackers may enumerate subdomains of the target's primary domain. |
| T1046 Network Service Discovery | N/A | The attacker determines which ports and services are open on the target domains. |
| T1106 Native API | N/A | An attacker may use the ClickHouse API to interact with the database. |
| T1567 Exfiltration Over Web Service | N/A | An attacker may exploit the ClickHouse API to steal data. |
Reference link: https://www.wiz.io/blog/wiz-research-uncovers-exposed-deepseek-database-leak
Event 2: Attackers uploaded malicious DeepSeek dependency packages to PyPI, triggering a supply chain attack and leaking user credentials
Event time: February 3, 2025
Leakage scale: Some user credentials and environment variables
Event Review: On January 19, 2025, a malicious user named bvk uploaded two malicious Python packages (deepseek and deepseekai) to PyPI.
Figure 3. Malicious package listing uploaded to PyPI
That same day, the threat intelligence team of the Positive Technologies Expert Security Center (PT ESC) detected the anomalous activity through its monitoring systems, analyzed and verified the packages, and notified the PyPI administrators.
Also on the same day, the PyPI administrators removed the two malicious Python packages and confirmed this to PT ESC. According to statistics, during the short time the packages were available they were downloaded more than 200 times across 17 countries through various channels. Both malicious packages have since been quarantined.
Figure 4. Quarantined software packages
Event Analysis: According to PT ESC's analysis, the malicious packages primarily collect information and steal environment variables. The stolen data includes database credentials, API keys, S3 object-storage access credentials, and more. The malicious payload executes when a user runs the deepseek or deepseekai command from the command line.
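As background, console commands like these are registered through a standard setuptools entry point: installing the package creates a command-line executable, and whatever function it points to runs as soon as the user invokes the command. A minimal, benign sketch of the mechanism (package, module, and function names are placeholders, not the actual malicious contents):

```python
# setup.py -- illustrative only; all names are placeholders.
from setuptools import setup, find_packages

setup(
    name="examplepkg",
    version="0.0.1",
    packages=find_packages(),
    entry_points={
        # Installing this package creates an `examplepkg` console command. Whatever code sits
        # behind `examplepkg.cli:main` runs as soon as the user types the command -- this is the
        # execution point that the malicious deepseek/deepseekai packages abused for their payload.
        "console_scripts": [
            "examplepkg=examplepkg.cli:main",
        ],
    },
)
```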
Figure 5. Registering the deepseekai console command in the package of the same name
Figure 6. Payload in the malicious package
From the payload shown above, it is clear that the attacker used Pipedream as the command-and-control endpoint for exfiltrated data. Looking back on the incident, its causes can be summarized as follows:
- Dependency confusion attack: exploiting the resolution-priority difference between an enterprise's private package and a public-repository package with the same name.
- Package name squatting: imitating the brand name of DeepSeek, a well-known AI company.
- Weakness in PyPI's registration mechanism: developer identity and package-name legitimacy are not effectively verified.
- Insufficient developer security awareness: malicious packages with similar names were installed by mistake.
Verizon Event Classification: Social Engineering
MITRE ATT&CK techniques used:

| Technique | Sub-technique | How it was used |
| --- | --- | --- |
| T1593 Search Open Websites/Domains | .003 Code Repositories | Search publicly available Python dependency repositories and identify PyPI. |
| T1195 Supply Chain Compromise | .002 Compromise Software Supply Chain | Package malware as a Python dependency and upload it to the PyPI repository, enabling a supply chain attack. |
| T1059 Command and Scripting Interpreter | .006 Python | Malicious code implanted in the package runs on execution, collecting environment variables and credentials and exfiltrating them through the attacker's Pipedream endpoint. |
| T1041 Exfiltration Over C2 Channel | N/A | Sensitive environment variables and credentials are exfiltrated through the C2 channel. |
Reference link: https://global.ptsecurity.com/analytics/pt-esc-threat-intelligence/malicious-packages-deepseek-and-deepseekai-published-in-python-package-index
Event 3: A large number of user credentials were stolen, and LLM hijacking attacks shifted their target to DeepSeek
Event time: February 7, 2025
Leakage scale: About 2 billion model tokens have been illegally used
Event Review:
In May 2024, the Sysdig threat research team discovered a new type of attack against LLMs: LLMjacking, also known as an LLM hijacking attack.
In September 2024, the Sysdig threat research team reported that LLM hijacking attacks were growing in frequency and prevalence, and that DeepSeek was increasingly being targeted.
On December 26, 2024, DeepSeek released its advanced model DeepSeek-V3. A few days later, the Sysdig threat research team found that DeepSeek-V3 had already been implemented in an OpenAI reverse proxy (ORP) project hosted on Hugging Face.
On January 20, 2025, DeepSeek released the reasoning model DeepSeek-R1. The very next day, an ORP supporting DeepSeek-R1 had emerged, multiple ORPs had been populated with DeepSeek API keys, and attackers had begun exploiting them.
Sysdig's threat research team found that the total number of large-model tokens illegally consumed through ORPs has exceeded 2 billion.
Event Analysis:
An LLM hijacking attack is one in which an attacker uses stolen cloud credentials to abuse a cloud-hosted LLM service. The attacker uses an OAI reverse proxy together with the stolen credentials to sell access to the cloud-hosted LLM services the victim subscribes to, which can leave the victim with significant cloud service costs.
An OAI reverse proxy is a reverse proxy placed in front of LLM services that lets attackers centrally manage access to multiple LLM accounts without exposing the underlying credential pool. With it, attackers can run expensive LLMs (such as DeepSeek) without paying: they access the models through the proxy, actually executing tasks and consuming computing resources while bypassing legitimate service charges. The proxy acts as an intermediary that redirects requests and hides the attackers' identity, allowing them to misuse cloud computing resources without being detected.
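Conceptually, such a proxy works like the simplified sketch below (an illustration only, not the actual ORP code; the upstream URL and key are placeholders): the operator keeps the stolen credential server-side and simply forwards client requests, so users of the proxy consume the victim's quota without ever seeing the key.

```python
# Minimal sketch of a credential-hiding reverse proxy (assumes Flask and requests are installed).
from flask import Flask, request, Response
import requests

app = Flask(__name__)

UPSTREAM = "https://api.example-llm.com/v1/chat/completions"  # hypothetical LLM endpoint
SERVER_SIDE_KEY = "sk-REDACTED"  # credential held by the proxy operator, never shown to clients

@app.route("/v1/chat/completions", methods=["POST"])
def proxy():
    # Forward the client's request body upstream, injecting the server-held API key.
    upstream_resp = requests.post(
        UPSTREAM,
        json=request.get_json(),
        headers={"Authorization": f"Bearer {SERVER_SIDE_KEY}"},
        timeout=60,
    )
    # Relay the upstream response back to the client unchanged.
    return Response(
        upstream_resp.content,
        status=upstream_resp.status_code,
        content_type=upstream_resp.headers.get("Content-Type", "application/json"),
    )

if __name__ == "__main__":
    app.run(port=8080)
```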
Figure 7. Event attack path
An OAI reverse proxy is a prerequisite for an LLM hijacking attack, but the key step is stealing the credentials and keys of the LLM services that legitimate users have purchased. Attackers often steal credentials through traditional web application vulnerabilities and misconfigurations (such as CVE-2021-3129 in the Laravel framework). Once these credentials are obtained, attackers can access cloud-hosted LLM services such as Amazon Bedrock and Google Cloud Vertex AI.
Figure 8. Laravel Vulnerability Exploitation Process
Research by the Sysdig threat research team has shown that attackers can drive a victim's consumption costs to tens of thousands of dollars within hours, and in some cases up to $100,000 per day. The goal of this kind of attack is not only to obtain data, but also to profit by selling access.
Verizon Event Classification: Basic Web Application Attacks
MITRE ATT&CK techniques used:

| Technique | Sub-technique | How it was used |
| --- | --- | --- |
| T1593 Search Open Websites/Domains | .002 Search Engines | Attackers used OSINT methods to collect information about exposed Internet-facing services. |
| T1133 External Remote Services | N/A | The attackers identified a vulnerability in an exposed service. |
| T1586 Compromise Accounts | .003 Cloud Accounts | Attackers exploited vulnerabilities to steal LLM service or cloud service credentials. |
| T1588 Obtain Capabilities | .002 Tool | The attackers deployed an open-source OAI reverse proxy tool. |
| T1090 Proxy | .002 External Proxy | Attackers used the OAI reverse proxy to centrally manage access to multiple LLM accounts. |
| T1496 Resource Hijacking | N/A | Attackers hijacked LLM resources through the stolen access (LLMjacking). |
Reference link: https://sysdig.com/blog/llmjacking-targets-deepseek/
Event 4: OmniGPT data breach: data of more than 30,000 users publicly sold on the dark web
Event time: February 12, 2025
Leakage scale: Personal information of more than 30,000 users, including emails, phone numbers, API keys, encryption keys, credentials, and billing information
Event Review:
On February 12, 2025, a user named SyntheticEmotions posted on BreachForums (Figure 9), claiming to have stolen sensitive data from the OmniGPT platform and offering it for sale. According to the post, the leaked data included emails, phone numbers, API keys, encryption keys, credentials, and billing information of more than 30,000 OmniGPT users, along with all of their conversations with its chatbots (over 34 million lines). In addition, links to files uploaded to the platform were compromised, some of which also contain sensitive information such as credentials and billing data.
Figure 9. OmniGPT data sold on BreachForums
Event Analysis:
Although the exact attack path was not disclosed, judging from the type and scope of the leaked data, the attacker may have gained access to the back-end database through SQL injection, API abuse, or social engineering. The OmniGPT platform may also have had misconfigurations or vulnerabilities that allowed an attacker to bypass authentication controls and directly access the database containing user information.
In addition, the "Messages.txt" file involved in the secondary leak contains API keys, database credentials, and payment card information, which could be used to compromise other systems or tamper with data. Some documents uploaded by platform users contain sensitive information such as trade secrets and project data; if maliciously used, further business operations could be affected. The breach is a wake-up call for the industry to place greater emphasis on data security and privacy protection. Users should be particularly vigilant when using AI and big-data platforms, establish strict data-usage policies, and apply measures such as encryption, minimization, and anonymization to sensitive data. Otherwise, a similar breach could expose enterprises to legal, reputational, and financial losses.
Verizon Event Classification: Miscellaneous Errors
MITRE ATT&CK techniques used:

| Technique | Sub-technique | How it was used |
| --- | --- | --- |
| T1071 Application Layer Protocol | .001 Web Protocols | Attackers may interact with the stolen data through OmniGPT's web interface to access the leaked user information and sensitive data. |
| T1071 Application Layer Protocol | .001 Web Protocols | Leaked API keys and database credentials allow attackers to reach the system through the platform's API and perform unauthorized operations. |
| T1569 System Services | .002 Service Execution | Attackers may abuse system services or daemons to execute commands or programs. |
| T1020 Automated Exfiltration | N/A | Leaked file links and user-uploaded files may be accessed and downloaded by attackers to obtain more sensitive data for follow-on attacks. |
| T1083 File and Directory Discovery | N/A | The leaked files contain sensitive data (such as credentials and database information); attackers may access them to obtain further key business information and user data. |
Reference link: https://cyble.com/blog/omnigpt-leak-risk-ai-data/
Event 5: Nearly 12,000 valid DeepSeek API keys and credentials found hard-coded in public Common Crawl data
Event time: February 28, 2025
Leakage scale: Approximately 11,908 valid DeepSeek API keys, credentials, and authentication tokens
Event Review:
The Truffle Security team used its open-source tool TruffleHog to scan 400 TB of Common Crawl data from December 2024 (covering 2.67 billion web pages from 47.5 million hosts). The scan showed that approximately 11,908 valid DeepSeek API keys, credentials, and authentication tokens were hard-coded in a large number of web pages.

Figure 10. Suspected AWS key hard-coded in HTML
The leak of Mailchimp API keys also stood out in this study: about 1,500 Mailchimp API keys were hard-coded directly in JavaScript code. Such keys are commonly abused for phishing campaigns and data theft.
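The scanning approach itself is conceptually simple. Below is a minimal sketch of TruffleHog-style secret detection over crawled HTML/JavaScript (the two patterns are illustrative only; real scanners combine many more detectors and verify whether matched credentials are still live):

```python
import re

# Illustrative detectors: AWS access key IDs and a generic `api_key = "..."` assignment.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"""api[_-]?key['"]?\s*[:=]\s*['"]([A-Za-z0-9_\-]{16,})['"]""", re.I
    ),
}

def scan_page(html: str) -> list[tuple[str, str]]:
    """Return (pattern_name, match) pairs found in a single crawled page."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.findall(html):
            hits.append((name, match if isinstance(match, str) else match[0]))
    return hits

if __name__ == "__main__":
    sample = '<script>const api_key = "sk_live_0123456789abcdef";</script>'
    print(scan_page(sample))
```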
Event Analysis:
Common Crawl is a well-known non-profit web-crawl database that regularly captures and publishes snapshots of Internet pages. The dataset is stored in 90,000 WARC (Web ARChive) files that completely preserve the original HTML, JavaScript code, and server responses of the crawled websites. Common Crawl datasets are often used to train AI models, so the Truffle Security research reveals a growing problem: training models on corpora riddled with insecure code can lead the models to inherit those security weaknesses. Even if LLMs such as DeepSeek apply additional security measures during training and deployment, the pervasive hard-coded credentials in the training corpus can lead large models to treat such "unsafe" practices as normal.
In addition, hard-coding credentials is a common but insecure coding practice. The causes of such weaknesses are simple and pervasive, yet the risks they pose are serious: hard-coded credentials can lead to data breaches, service disruptions, supply chain attacks, and more. With the rapid development of large-model technology, leaked credentials also enable the LLM hijacking attacks described in Event 3, in which an attacker uses stolen cloud credentials and an OAI reverse proxy to resell access to the victim's cloud-hosted LLM services, leaving the victim with significant cloud service costs.
Verizon Event Classification: Miscellaneous Errors
MITRE ATT&CK techniques used:

| Technique | Sub-technique | How it was used |
| --- | --- | --- |
| T1596 Search Open Technical Databases | .005 Scan Databases | The attacker gathers information from public web-crawl databases. |
| T1588 Obtain Capabilities | .002 Tool | The attacker deploys a sensitive-information discovery tool. |
| T1586 Compromise Accounts | .003 Cloud Accounts | Attackers use sensitive-information discovery tools to find valid credentials in public databases. |
| T1090 Proxy | .002 External Proxy | Attackers use OAI reverse proxy software to centrally manage access to multiple LLM accounts. |
| T1496 Resource Hijacking | N/A | Attackers hijack LLM resources using the stolen credentials (LLMjacking). |
Reference link: https://cybersecuritynews.com/deepseek-data-leak-api-keys-and-passwords/
Suggestions for Prevention of LLM Data Leakage
Supply chain security hardening
For Event 2 (malicious dependency packages) and Event 5 (credentials exposed in public crawl data), the following measures can be taken:
1. Trusted verification of dependency packages:
- Use tools such as Sonatype Nexus Firewall in front of PyPI to intercept unsigned dependencies or dependencies from suspicious sources.
- Prohibit direct fetching of dependencies from public repositories in the development environment and mandate the use of corporate private repository proxies (such as Artifactory).
2. Threat monitoring of Supply chain:
- Integrate Dependabot/Snyk to automatically scan dependency vulnerabilities and block the introduction of high-risk components.
- Verify the signatures of open-source packages and confirm that their hashes match the officially published values (see the sketch after this list).
3. Data source cleaning:
- When collecting training data, filter sensitive information out of public datasets (such as Common Crawl), using regular expressions plus AI-based desensitization tools for double verification.
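As a complement to signature checks, here is a minimal sketch of hash verification before installation (it assumes you have already downloaded the artifact and obtained the expected SHA-256 from a vetted internal registry or the project's official release page):

```python
import hashlib
import sys

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a downloaded package artifact (wheel or sdist)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # Usage: python verify_hash.py <artifact> <expected_sha256>
    artifact, expected = sys.argv[1], sys.argv[2].lower()
    actual = sha256_of(artifact)
    if actual != expected:
        print(f"HASH MISMATCH: {actual} != {expected} -- do not install {artifact}")
        sys.exit(1)
    print(f"OK: {artifact} matches the pinned hash")
```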
Least privilege and access control
For Event 1 (database misconfiguration) and Event 4 (third-party platform breach), the following measures can be taken:
- Enable mutual TLS authentication by default for databases such as ClickHouse, and never expose management ports to the public Internet.
- Use tools such as Vault/Boundary to dynamically issue temporary credentials and avoid long-lived static keys.
- Follow the principle of least privilege, and restrict users to access only necessary resources through RBAC (role-based access control).
- Implement IP allowlisting and rate limiting for API calls to third-party tools (such as OmniGPT), as sketched after this list.
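A minimal sketch of the allowlist-plus-rate-limit idea (IPs and thresholds are placeholders; in practice this is usually enforced at the gateway or WAF layer rather than in application code):

```python
import time
from collections import defaultdict

# Placeholder allowlist and a fixed-window limit of 60 requests per client IP per minute.
ALLOWED_IPS = {"10.0.0.5", "10.0.0.6"}
MAX_REQUESTS_PER_MINUTE = 60
_counters: dict = defaultdict(int)

def is_request_allowed(client_ip: str) -> bool:
    """Reject clients that are not on the allowlist or have exceeded the per-minute quota."""
    if client_ip not in ALLOWED_IPS:
        return False
    window = int(time.time() // 60)          # current one-minute window
    _counters[(client_ip, window)] += 1
    return _counters[(client_ip, window)] <= MAX_REQUESTS_PER_MINUTE

if __name__ == "__main__":
    print(is_request_allowed("10.0.0.5"))     # True (on the allowlist, within limits)
    print(is_request_allowed("203.0.113.7"))  # False (not on the allowlist)
```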
Full-lifecycle protection of sensitive data
For Event 3 (LLM hijacking), the following measures can be taken:
- Data desensitization and encryption: enforce field-level encryption (such as AES-GCM) for user input and output data, and mask sensitive fields in logs.
- Enable real-time desensitization of LLM interaction content (for example, replacing credit card numbers and mobile phone numbers with placeholders); a minimal sketch follows this list.
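A minimal sketch of such real-time masking (the patterns are deliberately simple and illustrative; production systems would use validated detectors such as Luhn checks and locale-aware phone parsing):

```python
import re

# Illustrative patterns: a 16-digit card number and a loosely formatted international phone number.
CARD_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")
PHONE_RE = re.compile(r"\+?\d{1,3}(?:[- ]?\d{2,4}){2,4}\b")

def desensitize(text: str) -> str:
    """Replace card and phone numbers with placeholders before the text is stored or logged."""
    text = CARD_RE.sub("[CARD_NUMBER]", text)
    text = PHONE_RE.sub("[PHONE_NUMBER]", text)
    return text

if __name__ == "__main__":
    print(desensitize("My card is 4111 1111 1111 1111 and my phone is +86 138 0013 8000."))
```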
Summary
This report analyzes the large-model data breaches of January and February 2025 and systematically discusses their causes, covering both mainstream cloud attack techniques and human factors such as configuration errors. To describe the cloud data-leakage attack paths more clearly, we map the attack methods to the MITRE ATT&CK framework and explain them, helping readers better understand these attack mechanisms.
The NSFOCUS Science and Technology Innovation Institute has been researching cloud risk discovery and data breaches for many years. With the help of its Fusion data leakage detection platform, it has detected unauthorized access affecting millions of cloud-exposed assets, including but not limited to DevSecOps components, self-hosted repositories, public cloud object storage, cloud disks, OLAP/OLTP databases, and various storage middleware.
Fusion is an innovative product developed by the NSFOCUS Science and Technology Innovation Institute for data breach mapping. It integrates detection, identification, and data breach reconnaissance: it maps cloud components exposed on the Internet, identifies the organizations associated with those components and the impact surface of their risks, and covers the full lifecycle of data breach reconnaissance, from automated asset detection and risk discovery to data breach analysis and identification of the responsible parties.