Abstract: Immense growth of network usage and the associated proliferation of networks, traffic, traffic classes, and diverse QoS requirements pose numerous challenges for network operators. Though data-driven approaches can provide better solutions for these challenges, limited data has been a barrier to developing those methods with high resiliency. In this work, we propose SyNIG (Synthetic Network Traffic Generation through Time Series Imaging), which utilizes Generative Adversarial Networks (GANs) for network traffic synthesis by converting time series data to a specific image format called GASF (Gramian Angular Summation Field). With GASF images, we encode the correlation between samples in 1D signals on a single 2D pixel map. Taking three types of network traffic (video streaming, accessing websites, and IoT), we synthesize over 200,000 traces using over 40,000 original traces, generalizing our method for different network traffic. We validate our method by demonstrating the fidelity of the synthetic data and applying them to several network related use cases showing improved performance.
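As a rough illustration of the GASF encoding named in this abstract, the sketch below rescales a series to [-1, 1], maps each sample to an angle with arccos, and fills pixel (i, j) with cos(phi_i + phi_j). This follows the standard GASF definition. The function name `gasf` and the handling of constant series are assumptions made for illustration, and the abstract does not specify SyNIG's actual preprocessing.

```python
import math


def gasf(series):
    """Return the Gramian Angular Summation Field of a 1D series as a list of rows."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Constant series: map every sample to 0 (an assumption for this sketch).
        scaled = [0.0 for _ in series]
    else:
        # Min-max rescale into [-1, 1] so that arccos is defined for every sample.
        scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    # Guard against tiny floating-point overshoot outside [-1, 1].
    scaled = [max(-1.0, min(1.0, v)) for v in scaled]
    # Polar encoding: each sample becomes an angle in [0, pi].
    phi = [math.acos(v) for v in scaled]
    # Pixel (i, j) encodes the pairwise relation between samples i and j.
    return [[math.cos(a + b) for b in phi] for a in phi]


if __name__ == "__main__":
    image = gasf([0.0, 1.0, 2.0])
    for row in image:
        print([round(v, 3) for v in row])
```

The resulting matrix is symmetric, and a trace of length n yields an n x n image, which is what makes the representation usable as input to image-based GANs.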
Abstract: Encrypted network traffic has been known to leak information about its underlying content through side-channel information leaks. Traffic fingerprinting attacks exploit this by using machine learning techniques to threaten user privacy by identifying user activities such as website visits, videos streamed, and messenger app activities. Although state-of-the-art traffic fingerprinting attacks achieve high performance, even undermining the latest defenses, most of them are developed under the closed-set assumption. To deploy them in practical situations, it is important to adapt them to the open-set scenario, which allows the attacker to identify its target content while rejecting other background traffic. At the same time, in practice, these models need to be deployed on in-network devices such as programmable switches, which have limited memory and computation power. Model weight quantization can reduce the memory footprint of deep learning models while also allowing inference to be done with integer operations instead of floating point operations. Open-set classification in the domain of traffic fingerprinting has not been explored well in prior work, and no prior work has explored the effect of quantization on the open-set performance of such models. In this work, we propose a framework for robust open-set classification of encrypted traffic based on three key ideas. First, we show that a well-regularized deep learning model improves open-set classification, and then we propose a novel open-set classification method with three variants that perform consistently over multiple datasets. Next, we show that traffic fingerprinting models can be quantized without a significant drop in either closed-set or open-set accuracy and can therefore be readily deployed on in-network computing devices. Finally, we show that when the above three components are combined, the resulting open-set classifier outperforms all other open-set classification methods evaluated across five datasets, with a minimum and maximum increase in F1 score of 8.9% and 77.3%, respectively.
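To make the weight quantization idea in this abstract concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization, which stores each weight as a small integer plus one shared floating-point scale. The abstract does not state which quantization scheme the authors use, so the scheme, the function names `quantize_int8` and `dequantize`, and the 8-bit width are all assumptions for illustration.

```python
def quantize_int8(weights):
    """Quantize a list of float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed 8-bit range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer levels back to approximate float weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(q, scale, worst)
```

Under this scheme each weight occupies one byte instead of four, and the rounding error per weight is bounded by half the scale, which gives some intuition for why accuracy can survive quantization.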
Traffic fingerprinting allows making inferences about encrypted traffic flows through passive observation. It has been used for tasks such as network performance management and analytics, and in attacker settings such as censorship and surveillance. A key challenge when implementing traffic fingerprinting in real-time settings is how state-of-the-art traffic fingerprinting models can be ported to programmable in-network computing devices with limited computing resources. Towards this, in this work, we characterize the performance of binarized traffic fingerprinting neural networks that are efficient and well-suited for in-network computing devices and propose a new data encoding method that is better suited for network traffic. Overall, we show that the proposed binary neural network with first-layer binarization and last-layer quantization reduces the performance requirement of hardware equipment while retaining accuracies of over 70% on binary datasets. Furthermore, when combined with our proposed encoding algorithm, binarized models on numeric datasets show further improvements, achieving over 65% accuracy.
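One common reason binarized networks suit constrained hardware is that a dot product between two +1/-1 vectors reduces to an XOR and a bit count. The sketch below illustrates that general technique only. The abstract does not describe the authors' implementation or their encoding method, so the helper names `binarize`, `pack_bits`, and `binary_dot` are hypothetical.

```python
def binarize(values):
    """Map each real value to +1 or -1 by sign (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +1/-1 vector into an int, setting bit i when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n using XOR and popcount."""
    # Each differing bit contributes -1 and each matching bit contributes +1.
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


if __name__ == "__main__":
    a = binarize([0.3, -1.2, 0.0, 2.5])
    b = binarize([-0.7, -0.1, 4.0, -3.0])
    packed = binary_dot(pack_bits(a), pack_bits(b), len(a))
    reference = sum(x * y for x, y in zip(a, b))
    print(packed, reference)
```

The packed result matches the ordinary dot product of the sign vectors while needing no multiplications, which is the kind of operation that maps well onto devices without floating-point units.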
Abstract: Novelty detection identifies outliers at any location, such as abnormalities (i.e., far distance outliers) and novel/unobserved patterns (i.e., close distance outliers). While many novelty detection approaches have been proposed in the literature, they generally focus on detecting one specific type of outlier. For example, Multi-Class Open Set Recognition (MCOSR) and One-Class Novelty Detection (OCND) approaches are applied for far and close distance outlier detection, respectively. However, in practice, it is difficult to measure in advance whether the distance between outliers and inliers is far or close. Recent work on outlier detection at any location with a unified model has yielded mixed performance. In this paper, we propose a new unified model, named Calibrated Reconstruction Based Adversarial AutoEncoder (CRAAE), for location agnostic outlier detection. The key idea is to integrate implicit and explicit confidence calibration strategies into a reconstruction based model to build a more accurate decision boundary. We leverage the category information disentangled from feature space to calibrate the decision metric (i.e., reconstruction error) constructed in the original data space. CRAAE also adds Uniform or Dirichlet noise into the artificial outlier generation process to represent various outliers. Experimental results show that CRAAE can outperform state-of-the-art unified models (e.g., GPND) and achieve performance similar to OCND and MCOSR methods in close and far distance outlier detection, respectively.
Existing deep learning approaches have achieved high performance in encrypted network traffic analysis tasks. However, some realistic scenarios, such as open-set recognition on dynamically changing tasks, challenge previous methods. Classic few-shot learning methods are used widely for these tasks in certain domains, such as computer vision and natural language processing. Nonetheless, few-shot open-set recognition for encrypted network traffic is still an unexplored area. This paper proposes a probability based task adaptive Siamese open-set recognition model for encrypted network traffic classification. Our contributions are threefold. First, we introduce generated positive and negative pairs into the Siamese Neural Network training process to shape a more precise similarity boundary through bidirectional dropout data augmentation. Second, we utilize a Dirichlet Process Gaussian Mixture Model (DPGMM) distribution to fit the similarity scores of the negative pairs constructed from the support set of each query task, and create a new open-set recognition metric. Third, by leveraging the extracted features from coarse and fine-granular levels, we construct a hierarchical cross entropy loss to improve the confidence of the similarity score. Extensive experiments on a network traffic dataset and the Omniglot dataset demonstrate the superiority of our proposed approaches, which obtain up to 4.5% and 1.2% performance gain, respectively, in terms of accuracy, as well as 4.0% and 1.8% in terms of area under the receiver operating characteristic (AUROC).
Abstract: The vast majority of Internet traffic is now end-to-end encrypted, and while encryption provides user privacy and security, it has made network surveillance an impossible task. Various parties are using this limitation to distribute problematic content such as fake news, copyrighted material, and propaganda videos. Recent advances in machine learning techniques have shown great promise in extracting content fingerprints from encrypted traffic captured at various points in IP core networks. Nonetheless, content fingerprinting from listening to encrypted wireless traffic remains a challenging task due to the difficulty in distinguishing re-transmissions and multiple flows on the same link. In this paper, we show the potential of fingerprinting Internet traffic by passively sniffing WiFi frames in the air, without connecting to the WiFi network, by leveraging deep learning methods. First, we show the possibility of building a generic traffic classifier using a hierarchical approach that is able to identify the most common traffic types in the Internet and reveal fine-granular details such as the exact content of the traffic. Second, we demonstrate the possibility of using Multi-Layer Perceptron (MLP) and Recurrent Neural Networks (RNNs) to identify streaming traffic, such as video and music, from a closed set, by sniffing WiFi traffic that is encrypted at both Media Access Control (MAC) and Transport layers. Overall, our results demonstrate that we can achieve over 95% accuracy in identifying traffic types such as web, video streaming, and audio streaming as well as identifying the exact content consumed by the user.
Abstract: Video streaming traffic has been dominating the global network, and the challenges have been exacerbated by the growing popularity of interactive videos, a.k.a. 360 videos, as they require more network resources. However, effective provision of network resources for video streaming traffic is problematic due to the inability to identify video traffic flows through the network because of end-to-end encryption. Despite the promise given for network security and privacy, end-to-end encryption also provides a shield for adversaries. To this end, encrypted traffic classification and content fingerprinting with advanced Machine Learning (ML) methods have been proposed. Nevertheless, achieving high performance requires a significant amount of training data, which is challenging to collect in operational networks due to the sheer volume of traffic and privacy concerns. As a solution, in this paper, we propose a novel Generative Adversarial Network (GAN) based data generation solution to synthesize video streaming data for two different tasks: 360/normal video classification and video fingerprinting. The solution consists of a percentile-based data mapping mechanism to enhance the data generation process, which is further supported by novel algorithms for data pre-processing and GAN model training. Taking over 6600 actual video traces and generating over 150,000 new traces, our ML-based traffic classification results show a 5–16% accuracy improvement in both tasks.
Abstract: The Square Kilometre Array (SKA) Low is a next generation radio telescope, consisting of 512 antenna stations spread over 65 km, to be built in Western Australia. The Correlator and BeamFormer (CBF) design is central to the telescope signal processing. CBF receives 6 Tera-bits-per-second (Tbps) of station data continuously and processes it in real time with a compute load of 2 peta-operations-per-second (Pops). The correlator calculates up to 22 million cross products between all pairs of stations, while the beamformers coherently sum station data to form more than 500 beams. The output of the correlator is up to 7 Tbps, and the beamformer 2 Tbps. The design philosophy, called “Atomic COTS”, is based on commercial-off-the-shelf (COTS) hardware. Data routing is implemented in network switches programmed using the P4 language and the signal processing occurs in COTS FPGA cards. The P4 language allows routing to be determined from the metadata in the Ethernet packets from the stations. That is, metadata describing the contents of the packet determines the routing. Each FPGA card inputs a fraction of the overall bandwidth for all stations and then implements the processing needed to generate complete science data products. Generation of complete science products in a single FPGA is named here as Atomic processing. A Tango distributed control system configures the multitude of processing modes as well as maintaining the overall health of the CBF system hardware. The resulting 6 Tbps in and 9 Tbps out, 2 Pops Atomic COTS network attached accelerator occupies five racks and consumes 60 kW.
HTTPS encrypted traffic flows leak information on underlying contents through various statistical properties such as packet lengths and timing, enabling traffic fingerprinting attacks. Recent traffic fingerprinting attacks leveraged Convolutional Neural Networks (CNNs) to record very high accuracies, undermining state-of-the-art defenses. In this paper, we analyze such CNNs to understand their inner workings, which helps in building efficient traffic classifiers and effective defenses. First, we experiment on three datasets and show that website fingerprinting CNNs focus mainly on transitions between uploads and downloads in trace fronts, while video fingerprinting CNNs focus more on finer shapes of periodic bursts. Next, we show that traffic fingerprinting CNNs exhibit transfer learning capabilities, allowing identification of new websites with less data. We also demonstrate how traffic fingerprinting CNNs outperform Recurrent Neural Networks (RNNs) due to their resilience to random shifts in data, which are common in network traces. We further generalize these observations on other publicly available network traffic datasets. Leveraging our observations, we propose two new defenses against traffic fingerprinting. Our first defense, FRONT-U, defends website visits by obfuscating transitions between uploads and downloads in trace fronts and provides similar privacy to the state-of-the-art defense FRONT, with half the data overhead. Our second defense, STOMA, defends streaming traffic by obfuscating the finer sub-bursts within major bursts of a trace using only the next few seconds, as opposed to using the entire trace as in the state-of-the-art.
In the last few years, Input/Output (I/O) bandwidth limitation of legacy computer architectures forced us to reconsider where and how to store and compute data across a large range of applications. This shift has been made possible with the concurrent development of both smart NICs and programmable switches with a common programming language (P4), and the advent of attached High Bandwidth Memory within smart NICs/FPGAs. Recently, proposals to use this kind of technology have emerged to tackle computer science related issues such as fast consensus algorithm in the network, network accelerated key-value stores, machine learning, or data-center data aggregation. In this paper, we introduce a novel architecture that leverages these advancements to potentially accelerate and improve the processing of radio-astronomy Digital Signal Processing (DSP), such as correlators or beamformers, at unprecedented continuous rates in what we have called the “Atomic COTS” design. We give an overview of this new type of architecture to accelerate digital signal processing, leveraging programmable switches and HBM capable FPGAs. We also discuss how to handle radio astronomy data streams to pre-process this stream of data for astronomy science products such as pulsar timing and search. Finally, we illustrate, using a proof of concept, how we can process emulated data from the Square Kilometer Array (SKA) project to time pulsars.
Traffic fingerprinting and the development of defenses against it have always been an arms race between attackers and defenders. The rapid evolution of deep learning methods makes developing stronger traffic fingerprinting models much easier, while overhead, latency, and deployment constraints restrict the abilities of the defenses. As such, there is always a need for novel defenses against traffic fingerprinting. In this paper, we propose SMAUG, a novel CGAN-based (Conditional Generative Adversarial Network) defense to protect video streaming traffic against fingerprinting. We first assess the performance of various GANs in video streaming traffic synthesis using multiple GAN quality metrics and show that CGAN outperforms other types of GANs such as basic GANs and WGANs (Wasserstein GANs). Our proposed defense, SMAUG, uses CGANs to synthesize video traffic flows and uses those synthesized flows to camouflage the original traffic that needs protection. We compare SMAUG with other state-of-the-art defenses, namely FPA and d*-private methods, as well as a kernel density estimation-based baseline, and show that SMAUG provides better privacy with lower overhead and delay.