With encryption technology widely deployed and new network technologies continuing to emerge, network structures are growing more complex and the volume of encrypted traffic is exploding. Furthermore, with the evolution and promotion of encryption protocols such as TLS 1.3, the era of full encryption is quietly arriving. While protecting users’ privacy, encryption has also profoundly changed the threat landscape of network security, giving malicious services room to hide. Traditional detection technology, however, is often powerless against malicious encrypted traffic. Against this background, detection and defense technology designed for encrypted traffic is imperative.
In this post, we will start with the benefits and risks of traffic encryption, go further into the difficulties of detecting encrypted traffic, and then share some thoughts on detection methods that we think work.
Status Quo of Encrypted Traffic Detection
1) The Trend of Fully Encrypted Traffic
People’s security awareness continues to improve, and encryption technology is now widely used. Fueled by demands for security and privacy protection, encrypted traffic has seen explosive growth on the network.
With the continuous evolution of encryption protocols such as TLS, the encryption of DNS, and the promotion of the QUIC protocol, the spread of encrypted applications and the encryption of network communication traffic have become an unstoppable trend. We are entering the era of full encryption.
2) Abuse of Traffic Encryption
Encrypted traffic is becoming more and more common. While encryption protects users’ privacy, it also brings risks to network security, because attackers can use it to hide their behavior.
On the one hand, abuse of cross-network traffic encryption may endanger the property of individuals and organizations. Threats such as ransomware, phishing attacks, and data leaks can use encryption to evade detection. Suppliers of Tesla, Boeing, and SpaceX suffered leaks of confidential data after refusing to pay a ransom. Foxconn was asked for a huge ransom for the same reason, and so was Microsoft, whose 250 million records were leaked.
On the other hand, abuse of full-scale encryption on the Internet poses a threat to cyber security and affects social stability. Examples include illegal trading of citizen identification information and digital currencies on the dark web, the spread of online rumors on social media platforms, and the loss of people’s property to online fraud.
3) Current State of Research
Encrypted traffic detection has been a research focus in both academia and industry. The object of analysis, i.e., the input form of the identification process, can be at the packet level, stream level, host level, community level, and so on. The objectives of traffic identification are also diverse: distinguishing encrypted from non-encrypted traffic, identifying the encryption protocol (e.g., SSL, SSH, IPsec), identifying the service (e.g., web browsing, streaming media, instant messaging, network storage), identifying the encrypted application software, identifying finer-grained parameters of encrypted traffic, website fingerprinting, and identifying abnormal encrypted traffic (such as malware communication traffic or traffic generated by hacking tools), to name a few.
Detecting encrypted traffic by decrypting it is not practical, so current research relies on non-decryption means to detect and identify encrypted traffic. Some methods still use the parts of encrypted traffic that remain in plaintext, such as the information transmitted during handshakes.
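As an illustration of the plaintext that can remain in a handshake, here is a minimal sketch that reads the Server Name Indication (SNI) from a TLS ClientHello. It assumes the whole ClientHello fits in a single TLS record and that the SNI extension is sent unencrypted; the function names `extract_sni` and `build_client_hello` are ours and exist only for this example.

```python
import struct

def extract_sni(record: bytes):
    """Return the SNI hostname from a single TLS ClientHello record, or None."""
    # Record header: content type 0x16 (handshake), version (2), length (2)
    if len(record) < 5 or record[0] != 0x16:
        return None
    pos = 5
    # Handshake header: type 0x01 (ClientHello) followed by a 3-byte length
    if len(record) < pos + 4 or record[pos] != 0x01:
        return None
    pos += 4
    pos += 2 + 32  # skip client version and random
    try:
        pos += 1 + record[pos]                         # session id
        (cs_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + cs_len                              # cipher suites
        pos += 1 + record[pos]                         # compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(len(record), pos + ext_total)
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:  # server_name extension
                # layout: list length (2), name type (1), name length (2), name
                if record[pos + 2] != 0:
                    return None
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                name = record[pos + 5:pos + 5 + name_len]
                return name.decode("ascii", "replace")
            pos += ext_len
    except (IndexError, struct.error):
        return None  # truncated or malformed record
    return None

def build_client_hello(hostname: str) -> bytes:
    """Build a tiny synthetic ClientHello carrying only an SNI extension (test helper)."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    sni = struct.pack("!H", len(entry)) + entry
    ext = struct.pack("!HH", 0, len(sni)) + sni
    body = (b"\x03\x03" + bytes(32) + b"\x00"          # version, random, empty session id
            + struct.pack("!H", 2) + b"\x13\x01"       # one cipher suite
            + b"\x01\x00"                              # one compression method (null)
            + struct.pack("!H", len(ext)) + ext)
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake

print(extract_sni(build_client_hello("example.com")))
```

A parser like this returns nothing useful once the client hides the server name, which is one reason handshake plaintext alone cannot carry the detection task.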
Difficulties of Encrypted Traffic Detection
1) Insufficient Signatures
The advent of the full encryption era leaves too little plaintext information. The payload can no longer serve as a signature for identifying encrypted traffic, and other signatures such as packet length sequence and packet arrival time are not sufficient on their own to distinguish different kinds of encrypted traffic. In short, the dimensions of available signatures are significantly reduced, and highly discriminative signatures are even rarer. The bottleneck in maintaining and improving identification performance is therefore the lack of informative classification signatures rather than the identification algorithms. We need to dig out hidden attributes and add classification features, and thus bring incremental information to the identification task.
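To make the remaining signatures concrete, the sketch below derives a few features from packet lengths and arrival times alone. The input format (a list of timestamp and signed-length pairs, positive for client-to-server) and the feature names are our assumptions for illustration, not a standard.

```python
from statistics import mean, pstdev

def flow_features(packets):
    """Summarize one flow from (timestamp_seconds, signed_length) pairs.

    A positive length means client-to-server, negative means server-to-client
    (a convention assumed here for the example).
    """
    lengths = [abs(length) for _, length in packets]
    times = [t for t, _ in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "n_packets": len(packets),
        "mean_len": mean(lengths) if lengths else 0.0,
        "std_len": pstdev(lengths) if lengths else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
        "bytes_up": sum(length for _, length in packets if length > 0),
        "bytes_down": sum(-length for _, length in packets if length < 0),
        # The first few signed lengths often reflect the handshake pattern.
        "first_lengths": [length for _, length in packets[:5]],
    }

example = [(0.0, 100), (0.5, -200), (1.5, 300)]
print(flow_features(example))
```

Features of this kind are cheap to compute, but as noted above they are often not discriminative enough alone, which is why extra attributes need to be mined.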
2) Concept Drift Problem
As the confrontation between cyber offense and defense grows more intense, identification targets will keep iterating, optimizing, upgrading, and even changing, and the characteristics of their encrypted traffic will change accordingly. This concept drift gradually degrades the accuracy of previously trained models. One possible solution is to adjust the structure of the model to accommodate the drift, for example by deepening or widening the layers, or by combining old and new models according to the change in data distribution.
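Before a model can be adjusted, the drift has to be noticed. The sketch below is not the model-restructuring approach described above; it is a simple companion check that compares one feature’s distribution at training time with recent traffic using the Population Stability Index (PSI). The bin count, the smoothing constant, and the 0.25 alert level (a common rule of thumb) are assumptions.

```python
import math

def psi(reference, recent, bins=10):
    """Population Stability Index between two samples of a single feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        total = len(values)
        # Additive smoothing keeps every bin non-zero so the log is defined.
        return [(c + 0.5) / (total + 0.5 * bins) for c in counts]

    p, q = histogram(reference), histogram(recent)
    # PSI = sum over bins of (q - p) * ln(q / p)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train_lengths = list(range(100))             # stand-in for training-time values
live_lengths = [x + 50 for x in train_lengths]  # same feature after a shift
if psi(train_lengths, live_lengths) > 0.25:
    print("distribution shift detected; consider retraining or adapting the model")
```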
3) Lack of Labeled Samples
Traditional machine learning methods rely on a large number of well-labeled samples, which not only require a great deal of costly manual work but may also risk violating user privacy. In addition, newly identified targets have few or no samples at the early stage, so the usual data requirements of machine learning cannot be met in this scenario. We need to investigate how to reduce the need for labeled data, and should consider methods such as few-shot learning, active learning, semi-supervised learning, and unsupervised learning.
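As one example of reducing labeling effort, the sketch below shows margin-based uncertainty sampling, a common active learning strategy: only the flows the current model is least sure about are sent to a human for labeling. The function name and the input format (one list of class probabilities per unlabeled flow) are assumptions for illustration.

```python
def select_for_labeling(probabilities, budget):
    """Return indices of the `budget` samples with the smallest top-two margin."""
    def margin(probs):
        ranked = sorted(probs, reverse=True)
        # With a single class there is nothing to be unsure about.
        return ranked[0] - ranked[1] if len(ranked) > 1 else 1.0

    order = sorted(range(len(probabilities)),
                   key=lambda i: margin(probabilities[i]))
    return order[:budget]

# Model outputs for three unlabeled flows over two classes.
predicted = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]]
print(select_for_labeling(predicted, budget=2))
```

The first flow is confidently classified and is skipped, so the labeling budget goes to the two flows near the decision boundary.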
4) Open Set Identification Problem
Various algorithms are applied to the identification of encrypted traffic, including supervised machine learning, unsupervised machine learning, semi-supervised machine learning, reinforcement learning, self-learning, etc. Among these, the most important research and applications still center on supervised machine learning. Take application identification as an example. The number of applications in the real world is in the millions or more, yet most current AI approaches implicitly assume that data from every application can be fed to the model during training in order to obtain a useful identification model. This is unrealistic. Therefore, to identify unknown samples in open environments, it is essential to study how to reduce the dependence on prior knowledge and how to improve the robustness and generalization of identification models.
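One simple way to picture open set identification is a classifier that is allowed to answer "unknown". The sketch below uses nearest-centroid classification with a per-class distance threshold; it is a toy under assumed names (`fit_centroids`, `predict_open_set`) and an assumed slack factor, not a production method.

```python
import math
from collections import defaultdict

def fit_centroids(samples, labels, slack=1.5):
    """Learn one centroid and one acceptance radius per known class."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    model = {}
    for y, xs in groups.items():
        centroid = [sum(column) / len(xs) for column in zip(*xs)]
        # Accept points up to `slack` times the farthest training sample.
        radius = slack * max(math.dist(x, centroid) for x in xs)
        model[y] = (centroid, radius)
    return model

def predict_open_set(model, x, unknown="unknown"):
    """Return the nearest known class, or `unknown` if x is outside its radius."""
    label, (centroid, radius) = min(
        model.items(), key=lambda item: math.dist(x, item[1][0]))
    return label if math.dist(x, centroid) <= radius else unknown

features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
classes = ["web", "web", "web", "video", "video", "video"]
model = fit_centroids(features, classes)
print(predict_open_set(model, [0.5, 0.5]))  # close to the "web" samples
print(predict_open_set(model, [5, 5]))      # far from every known class
```

A closed-set classifier would be forced to call the second point "web" or "video"; rejecting it keeps unseen applications from silently polluting the known classes.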
5) Inference Performance to be Improved
From a business perspective, AI inference consumes a lot of computing resources. Despite many optimization and acceleration methods, AI inference is still an order of magnitude slower than traditional techniques such as rule matching. Therefore, in real deployments, inference performance must be improved to keep the model usable, to obtain stable and timely results, and to meet the challenge of identifying encrypted traffic in real time in high-speed network environments.
The Way Forward for Encrypted Traffic Detection
1) Building a Dataset in a Real Network Environment
Because the network environment is so complex, models trained on closed data sets often perform poorly once they go online, and it is time to consider the impact of the environment. The packet length and payload length sequences of the same category of encrypted traffic differ across network environments (Wi-Fi, 4G/5G mobile communication, IoT, industrial control networks, or blockchain networks). A mismatch between the training environment and the actual environment can make the training data distribution differ from the actual data distribution, so results obtained in theory are often hard to reproduce on a real network.
How can we guarantee the performance of models after they are deployed on a live network? On the one hand, it is especially important for the identification scheme itself to adapt to different network environments. On the other hand, we can simulate the real network environment, build attack and defense scenarios close to it, and construct data sets under realistic conditions as far as possible, to minimize the loss in model performance when it goes live.
2) Finding Statistical Patterns in Big Data with AI
Full encryption of traffic causes traditional features with clear meanings to fail. In terms of technology trends, the next generation of detection is underpinned by statistical signatures of big data: computing statistics over a large amount of data and, from the results, extracting signatures that capture the underlying cause of the sample distribution. This process is carried out with AI, because the strength of AI is analyzing large amounts of data to find statistical patterns.
AI is already widely used in image and voice recognition and many other fields. In traffic identification, however, it is still in its infancy, and major challenges and technical issues remain to be solved. For the time being, AI may not be suitable as the final decision step, but it is helpful for data processing and for supporting decisions. AI-based encrypted traffic detection will remain a long-term research topic.
3) Not the Time to Give Up Traditional Techniques and Existing Capabilities
Encryption is not new. Constrained by the underlying cryptography, encryption protocols also develop gradually. Alongside new functions or signatures, they share some commonalities. First, the handshake process of an encryption protocol is essential, and some amount of plaintext is transmitted during it. Although the QUIC protocol adds an obfuscation mechanism for the first packet, this is not encryption in the strict sense. Even so, TLS and QUIC are reducing plaintext as much as possible. For example, TLS 1.3 encrypts more of the handshake than TLS 1.2, and the QUIC protocol is almost fully encrypted except for the first 8 bytes. Second, stream encryption does not change the packet length.
Despite the evolution of encryption protocols, techniques based on statistically significant strings and packet lengths remain effective, and we cannot abandon traditional techniques and existing capabilities. Generally speaking, the evolution of traffic encryption and the development of identification techniques form a spiral. Along the way, we have to rely on both AI and security experts to understand traffic data deeply, and to adopt different detection and identification schemes for traffic encrypted by different methods.
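The point about stream encryption and packet length can be shown with a toy example. The cipher below XORs the data with a keystream derived from SHA-256 in counter mode; it is an illustration only and must not be used for real protection. Real protocols typically add some per-record overhead on top, but the ciphertext length still tracks the plaintext length.

```python
import hashlib

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256 counter-mode keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        stream.extend(block)
        counter += 1
    # zip stops at the shorter input, so the output is exactly len(data) bytes.
    return bytes(a ^ b for a, b in zip(data, stream))

key = b"demo-key"
for message in (b"GET / HTTP/1.1", b"x" * 1400):
    ciphertext = keystream_xor(key, message)
    # Content is hidden, but the length is unchanged.
    print(len(message), len(ciphertext), ciphertext != message)
```

Because lengths survive encryption, packet length sequences stay available as a signature even when the payload is unreadable.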
4) Building a Layered Detection System
The composition of network traffic is very complicated, and the detection and identification of encrypted traffic is never a single problem that can be solved by training one model on one data set. To identify different kinds of encrypted traffic, a layered detection system should be built that solves different problems at different layers.
For different types of data, accurate identification depends on each stage of the process: data collection, data processing, signature extraction, signature selection, and fingerprint construction. Take multi-dimensional signature extraction as an example: we can extract metadata signatures such as packet load, session signatures such as packet length sequence, time signatures such as packet response time, and host signatures such as historical access behavior. Take fingerprint construction as another example: directly visible and comprehensible stateful fingerprints can be extracted from the traffic, conceptualized fingerprints generated by the traffic can be extracted indirectly, and behavioral fingerprints that characterize the behavior of the object can be derived through statistics, transformation, mapping, and so on.
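The layering idea can be sketched as a chain of detectors, where cheap checks run first and later layers only see what earlier ones could not decide. Everything here is hypothetical: the flow dictionary keys, the blocklist entry `bad.example`, and the uniform-length heuristic are placeholders chosen to keep the example small.

```python
BLOCKED_SNI = {"bad.example"}  # hypothetical blocklist entry

def plaintext_layer(flow):
    """Layer 1: decide from handshake plaintext when it is available."""
    if flow.get("sni") in BLOCKED_SNI:
        return "malicious (SNI match)"
    return None  # no opinion, pass to the next layer

def statistical_layer(flow):
    """Layer 2: decide from packet-length statistics."""
    lengths = flow.get("lengths", [])
    # Placeholder heuristic: many packets of identical size look beacon-like.
    if len(lengths) >= 4 and len(set(lengths)) == 1:
        return "suspicious (uniform packet lengths)"
    return None

def classify(flow, layers=(plaintext_layer, statistical_layer)):
    """Run layers in order and return the first verdict, else 'undecided'."""
    for layer in layers:
        verdict = layer(flow)
        if verdict is not None:
            return verdict
    return "undecided"

print(classify({"sni": "bad.example", "lengths": [120, 800]}))
print(classify({"sni": None, "lengths": [96, 96, 96, 96]}))
print(classify({"sni": "good.example", "lengths": [120, 800, 64]}))
```

In a fuller system the "undecided" flows would be the ones handed to heavier models or to analysts, which also limits how much traffic the expensive layers must process.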
5) Collaboration to Jointly Maintain Cybersecurity
Synergy with users: with traffic now confidential across the board, labeling data for the commonly used supervised models is very costly, but not all user behavior is highly private. If some users are willing to contribute their own data or help label data in specific scenarios, and the relevant beneficiaries pay them a fee in return, the difficulty of labeling can be greatly reduced.
Synergy with operators: at the level of large networks, or even of cyber confrontation between countries, mobilizing telecom operators’ traffic scheduling and labeling capacity will maximize the coverage and accuracy of event observation from a global vantage point. It also makes possible the continuous observation, accumulation, and analysis of specific target traffic across the network, and eventually the exploration of encrypted traffic detection methods based on big data network behavior analysis.
Synergy of data: on the one hand, data barriers exist among data holders, and between data holders and model builders. Facing both regulatory compliance and the pursuit of corporate profit, they do not open their data to others, apart from some open-source data or data shared between partners. On the other hand, taking public network hazards as an example, an improved ecosystem requires source tracking over the entire chain rather than single-point investigation with limited information. Therefore, on the premise of privacy protection, we should use privacy-preserving computation, secure multi-party computation, federated learning, and other technologies to ease the problem of data sharing to a certain extent and to enable multi-point collaborative analysis and research.
Synergy with universities: academia often does cutting-edge research, and we do see many excellent results. However, there seems to be no relatively objective standard for judging their value, and the focus of industry is not exactly the same: whether a result can be implemented well in the real world is crucial. Therefore, we advocate strengthening cooperation between business organizations and universities, combining the research strengths of universities with the product value of business organizations so that research results are better turned into practice.
As global network security technology develops, traffic encryption protects privacy but also makes malicious traffic detection harder, and putting detection technology into practice faces many challenges. In this post, we discussed the current situation and difficulties of encrypted traffic detection and offered some ideas for addressing them, hoping they will help maintain network security in the encryption era.