Classification of Peer-to-Peer traffic by exploring the heterogeneity of traffic features through entropy

Gomes, João Vasco Paulo

http://hdl.handle.net/10400.6/1935

Use this identifier to reference this record.

Name:	Description:	Size:	Format:
Thesis_J_Gomes_UBI_2012_Final.pdf		8.95 MB	Adobe PDF

Authors

Gomes, João Vasco Paulo

Advisor(s)

Freire, Mário Marques

Monteiro, Paulo Miguel Nepomuceno Pereira

Abstract(s)

The ability to classify the traffic, based on the application or protocol that generated it, is essential for the effective management of computer networks. Although Internet applications were generally based on the client-server paradigm, generating traffic whose properties were well defined and easily predictable, the advent of peer-to-peer (P2P) computing brought the power to the edges, facilitating the direct exchange of contents between hosts and modifying the behavior of the traffic load in the networks of Internet Service Providers (ISPs) and organizations. In that context, the ability to identify the nature of the traffic became increasingly important. Nonetheless, the early traffic classification methods, based on the association of port numbers of transport-level protocols to applications or protocols, became ineffective when many Internet applications started to use random port numbers or ports normally used by other applications. The natural alternative was to look deep into the contents of the packets to search for data strings that could be used as a signature of the traffic of a target application. However, this approach, usually called Deep Packet Inspection (DPI), requires more computational resources, which may make it difficult to use for real-time monitoring in high-speed networks. Moreover, several applications have started to encrypt their traffic, preventing the use of DPI. In order to overcome these limitations, researchers are proposing new classification approaches, sometimes called classification in the dark, which are based on the traffic behavior and do not rely on the payload data. Although their accuracy is generally lower, in most cases they offer a good compromise between effectiveness and computational cost and are not affected by encryption techniques. Nevertheless, the search for more accurate behavioral methods is also leading to an increase in their complexity.
This thesis is focused on the identification of P2P traffic and aims to propose a classification approach capable of identifying traffic generated by P2P applications in real-time, without relying on the payload data. Since one of the differences between client-server and P2P paradigms is the dual role played by P2P hosts, the research work described herein, after a literature review, started with the study of the properties of the traffic from several P2P and non-P2P applications at its source. Instead of collecting the experimental data in an aggregation point, the traffic from each individual host, running a single application or a predefined set of applications, was captured immediately after its network connection. By doing so, it was possible to assure that the analyzed traffic was generated by the studied applications and that its properties were not affected by the aggregation of different types of traffic. The study included the statistical analysis of the following traffic features: the byte count per time unit, the inter-arrival time, and the packet length. The observation of the source traffic showed that the lengths of the packets generated by P2P and non-P2P applications present distinct patterns. The traffic from non-P2P applications usually results from connections with a stable behavior, mostly formed by small and large packets, used to send requests and acknowledgments and to receive contents, respectively. In these cases, both small and large packets generally present very homogeneous lengths. On the contrary, the P2P traffic is very heterogeneous in terms of packet lengths, as it results from the aggregation of several concurrent connections to different peers. Moreover, the distributed search mechanisms and the replies to requests from other peers also generate a large number of small packets with multiple lengths. Hence, a deeper study focused solely on packet length properties was performed and the set of analyzed applications was extended.
The entropy was used to measure the heterogeneity of the packet lengths and the results showed it was possible to differentiate both kinds of traffic. To improve the results in specific cases, the entropy of the packet lengths was also computed using slots of 200 bytes, which means that all the packet lengths within the same slot are treated as similar lengths in the entropy computation. Based on this approach, it was possible to propose a new behavioral classifier capable of identifying hosts running P2P applications, without using payload data. In order to make the method suitable for real-time analysis, the entropy is computed using a sliding window with a constant size of N packets. Although the proposed classification method was able to identify hosts running P2P applications by analyzing the heterogeneity of the packet lengths in the aggregated traffic of each host, it could not classify individual flows as being generated by P2P or non-P2P applications. In fact, the heterogeneity of the packet lengths observed in the traffic of each single host running P2P file-sharing or P2P media streaming applications resulted, mostly, from the aggregation of several connections with different properties, used to share contents with other peers. For this reason, the heterogeneity of individual flows is lower, even for P2P traffic. Nonetheless, in the case of P2P Voice over Internet Protocol (VoIP) traffic, the heterogeneity of the packet lengths results from the use of Variable Bit Rate (VBR) speech codecs and, thus, the heterogeneity is observable in the individual flow used to carry each VoIP session. Therefore, experimental traffic generated by P2P VoIP applications using several VBR and Constant Bit Rate (CBR) speech codecs was collected and used to study the lengths of the packets generated by VoIP sessions. The results of the analysis showed that the packet lengths depend on the speech codec used in each session.
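The sliding-window entropy of packet lengths described above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names, the use of Shannon entropy in bits, and the alignment of the slots (floor division starting at 0 bytes) are assumptions made for illustration only.

```python
import math
from collections import Counter, deque


def shannon_entropy(counts, total):
    """Shannon entropy, in bits, of a distribution given as symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(packet_lengths, window=500, slot=1):
    """Yield the entropy of the last `window` packet lengths.

    One value is produced per packet once the window is full.
    With slot > 1, lengths falling in the same `slot`-byte bin are
    treated as the same symbol (assumed bin alignment: length // slot).
    """
    recent = deque()    # symbols currently inside the window
    counts = Counter()  # occurrences of each symbol inside the window
    for length in packet_lengths:
        symbol = length // slot
        recent.append(symbol)
        counts[symbol] += 1
        if len(recent) > window:
            # Slide the window: drop the oldest packet and update its count.
            old = recent.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        if len(recent) == window:
            yield shannon_entropy(counts, window)
```

In this sketch, a stream of identical lengths yields an entropy of zero, while a mix of many distinct lengths pushes the value up. Passing `slot=200` merges nearby lengths into one symbol, which lowers the entropy of traffic whose lengths vary only slightly.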
Hence, the heterogeneity of the packet lengths from each VoIP session was measured using entropy, which was computed using a sliding window with a constant size of 500 packets. For each speech codec considered in the study, the intervals of packet lengths and entropy observed during the traffic analysis were compiled and, based on those intervals, a traffic classifier capable of identifying VoIP traffic using a single traffic feature was proposed. The classifier uses a set of behavioral signatures associated with each speech codec, formed by an interval of packet lengths and an interval of the entropy of the packet lengths. Besides being able to recognize VoIP traffic in the dark, the classifier is also capable of identifying the speech codec used in that VoIP session. After proposing the P2P VoIP traffic classifier, the research work focused on the traffic from P2P file-sharing and P2P media streaming applications. Unlike VoIP, the traffic generated by a single host running one of these applications results from many parallel connections with several peers. Hence, in this thesis, P2P file-sharing and P2P media streaming traffic is also referred to as one-to-many P2P traffic. The entropy of the packet lengths of individual flows from these applications is not sufficiently distinct from the entropy obtained from non-P2P individual flows. Therefore, several dimensions of the traffic were separately studied, including incoming, outgoing, or incoming and outgoing packets together, and also packets whose payload length is smaller than or equal to 100 bytes, greater than 100 bytes and smaller than or equal to 900 bytes, or greater than 900 bytes. The mean of the entropy of the packet lengths for each of these dimensions was computed for each flow of the analyzed applications, using a sliding window with a constant size of 100 packets. Additionally, the mean of the entropy of the inter-arrival times and of the remote host/port pairs with which a local host/port pair communicates was also computed.
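The per-dimension features described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function names, the dimension labels (`"in"`, `"out"`, `"both"`, `"small"`, `"medium"`, `"large"`), the packet representation as `(direction, payload_len)` tuples, and the treatment of direction and size as six independent dimensions are all hypothetical choices.

```python
import math
from collections import Counter
from statistics import mean


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def size_class(payload_len):
    """Map a payload length to one of the three size ranges named in the text."""
    if payload_len <= 100:
        return "small"
    if payload_len <= 900:
        return "medium"
    return "large"


def mean_window_entropy(lengths, window=100):
    """Mean entropy over all full sliding windows; None if there is no full window."""
    if len(lengths) < window:
        return None
    return mean(
        entropy(lengths[i:i + window])
        for i in range(len(lengths) - window + 1)
    )


def flow_features(packets, window=100):
    """Compute the mean packet-length entropy of one flow for each dimension.

    `packets` is an iterable of (direction, payload_len) tuples, where
    direction is "in" or "out" (an assumed representation).
    """
    dims = {"in": [], "out": [], "both": [], "small": [], "medium": [], "large": []}
    for direction, payload_len in packets:
        dims[direction].append(payload_len)
        dims["both"].append(payload_len)
        dims[size_class(payload_len)].append(payload_len)
    return {name: mean_window_entropy(vals, window) for name, vals in dims.items()}
```

The sketch recomputes the entropy for every window, which is simple but quadratic in the worst case; a real-time setting would more likely update the counts incrementally as the window slides.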
Based on the obtained results, a traffic classifier that does not rely on payload data was proposed. In the performance evaluation, the classifier was able to identify P2P traffic with an accuracy greater than 95%.