Analysis, characterization and classification of Internet traffic / Finamore, Alessandro. - (2012). [10.6092/polito/porto/2497191]

Analysis, characterization and classification of Internet traffic

FINAMORE, ALESSANDRO
2012

Abstract

The Internet is a global interconnection of networks and nowadays one of the most important telecommunication technologies. Born as a U.S. military project, it has evolved into a worldwide communication system used by people every day. This success is based on its ``freedom'': no single organization or administrative entity governs or maintains it. This freedom also explains the huge heterogeneity of Internet services available today, ranging from work activities (e.g., VoIP, e-mail) to entertainment (e.g., video games, streaming, peer-to-peer) and commerce (e.g., Amazon, eBay), just to name a few. The Internet is a fertile system in constant evolution. Every year new services and software platforms are launched, affecting not only the users' activities (e.g., social networks) but also the internal architecture of the networks (e.g., Content Delivery Networks vs. peer-to-peer) and the devices used to access the services (e.g., PCs vs. smartphones and tablets). The richness of the Internet scenario comes at the cost of internal complexity. Eric Schmidt, the CEO of Google, said: \emph{``The Internet is the first thing that humanity has built that humanity doesn't understand, the largest experiment in anarchy that we have ever had.''}\footnote{\url{http://www.brainyquote.com/quotes/authors/e/eric_schmidt.html}} At its origin, the Internet was designed to operate on a few standardized services. No one could have i) foreseen the success of this medium and ii) designed the network to cope with the plethora of today's services. If on the one hand this diversity gives the Internet a certain level of resiliency and has driven innovation, on the other hand understanding its internal mechanisms is a daunting task, made worse by the fast and constant deployment of new services and applications. However, behind what might seem a chaotic scenario, the Internet is composed of well-defined markets in which big players participate, each with precise interests:
\begin{description}
\item \textbf{Users}, representing the majority of the people who access the network. They are interested in \emph{Quality of Experience} (QoE), i.e., good performance when accessing the network, avoiding for example long delays due to initial buffering when streaming a video. They are also interested in \emph{Network Neutrality}, i.e., preserving their freedom to use the Internet independently of which service they access;
\item \textbf{Internet Service Providers (ISPs)}, organizations which provide Internet access to customers. They are interested in increasing revenues through i) \emph{network engineering}, to optimize the offered services, and ii) the study of users' activity, to devise new \emph{billing policies};
\item \textbf{Content providers}, organizations which sell a specific Internet service, e.g., video streaming or file hosting. Like ISPs, they are interested in finding new ways to generate revenues. At the same time, they also have to cope with illegal activities such as \emph{content piracy}, a common flaw since the early days of peer-to-peer systems;
\item \textbf{Government regulation agencies}, organizations which regulate some aspects of Internet activities. For example, they study \emph{Service Level Agreements} (SLAs) between users and ISPs, comparing the quality of the Internet access offered to users against the specifications written in the signed contract.
\end{description}
Other activities, such as \emph{security}, are important for more than one player: consider for example \emph{malware} and \emph{Denial of Service} (DoS) attacks, which can violate users' privacy, damage the network, and break laws. Overall, there are several motivations for studying the Internet, and since the early days the scientific community has made giant steps toward understanding it. Broadly, two requirements have to be satisfied. First of all, we need \emph{tools and methodologies} to inspect and characterize the traffic at different granularities, e.g., per packet, per flow, per port, or per user. In particular, \emph{traffic classification} is one of the most important activities performed by network operators: it identifies which application generated a given communication, allowing us to study not only the whole traffic aggregate but also how the different applications contribute to it.
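As an illustration of the per-flow granularity just mentioned, the following is a minimal sketch (not the thesis tooling; the packet field names are hypothetical) that groups captured packets into flows by the usual 5-tuple:
\begin{verbatim}
# Minimal illustrative sketch: group packets into flows by the
# 5-tuple (src IP, dst IP, src port, dst port, protocol).
# The packet field names are hypothetical.
from collections import defaultdict

def group_into_flows(packets):
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return flows
\end{verbatim}
A traffic classifier then labels each such flow with the application that generated it, using ports, payload content, or statistical features.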
Building on these tools and methodologies, we can further drill into \emph{user and network characterization}. For example, by monitoring the traffic over long periods, we can study applications' popularity trends and identify the rise of new technologies. We can perform \emph{anomaly detection}, i.e., study unexpected network conditions that might be related to either security issues or malfunctioning hardware. We can optimize routing policies, study inter-ISP traffic, investigate the energy consumption of network elements, or design caching schemes for social network content, to name just a few of the many research studies recently conducted in the literature. In this thesis, we present our contributions to the study of the Internet, discussing the tools and methodologies we developed to characterize network traffic. The thesis is divided into two parts. The first part focuses on traffic classification methodologies, starting from the problem definition and the solutions available in the literature, as reported in Chapter~\ref{chapter:traff_class}. In the remainder of the first part we focus on KISS, a novel traffic classification technique we propose based on \emph{Stochastic Packet Inspection} (SPI). In particular, in Chapter~\ref{chapter:kiss} we describe the framework used by the classifier, which is then validated in Chapters~\ref{sec:kiss_udp} and~\ref{sec:kiss_tcp} for UDP and TCP traffic, respectively. Chapter~\ref{chapter:compare} compares KISS with other state-of-the-art traffic classifiers, while in Chapter~\ref{sec:clustering} we extend the KISS framework with clustering techniques. Overall, KISS reaches a level of accuracy in traffic classification that is comparable to, or even better than, that of other traffic classifiers. Its flexible structure identifies a rich set of applications with limited resource requirements.
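To give a flavour of the statistics SPI builds on (the notation here is only a sketch; Chapter~\ref{chapter:kiss} gives the exact formulation), the first payload bytes of the packets of a flow can be split into groups of $b$ bits, and for each group $g$ one measures how far the observed symbol frequencies deviate from a uniform distribution with a Chi-Square-like test:
\[
X_g = \sum_{i=0}^{2^b - 1} \frac{\left(O_i^{(g)} - C/2^b\right)^2}{C/2^b} ,
\]
where $C$ is the number of observed packets and $O_i^{(g)}$ counts how many of them carry symbol $i$ in group $g$. Groups holding constant keywords, counters, or random data deviate from uniformity in characteristic ways, so the vector of $X_g$ values forms a statistical signature of the application-layer protocol that a decision engine (e.g., a Support Vector Machine) can classify.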
In the second part of the thesis we study YouTube, the popular video streaming system. Leveraging Tstat, a passive traffic analyzer, we developed a methodology to identify YouTube video downloads, and we conducted an in-depth analysis of many aspects of the service. In Chapter~\ref{sec:yt-overview} we start by presenting an overview of the system and its components, showing the internal mechanisms adopted. Chapter~\ref{sec:yt-methodology} reviews the methodologies available in the literature to study YouTube and presents our methodology, based on monitoring real users' activity across different locations, access technologies, and devices. In the remaining chapters we present the results of our analysis, grouped into four areas of interest: video content properties (Chapter~\ref{sec:yt-content}), internal load balancing and caching policies (Chapter~\ref{sec:yt-cdn}), users' habits and behaviours (Chapter~\ref{sec:user}), and download performance (Chapter~\ref{sec:yt-performance}). Results show that YouTube is a complex system in which several components interact according to precise policies that control the communications. Despite its great success, the system is far from perfect and there is room for further optimization. For example, mobile devices suffer more impairments during downloads than PCs do. Users stick to the default video resolution and are not interested in changing the quality during playback; instead, they commonly abort the download abruptly. This behaviour is particularly critical because, coupled with the aggressive buffering policies used to ensure playback continuity, it leads to wasting a non-negligible amount of traffic, i.e., users download a portion of the video that is never played.
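To make the notion of wasted traffic concrete, the following is a back-of-the-envelope sketch (the variable names and the figures are hypothetical, not measurements from Chapter~\ref{sec:yt-performance}):
\begin{verbatim}
# Bytes downloaded but never played before the user aborted.
# All names and figures are hypothetical, for illustration only.
def wasted_bytes(downloaded_bytes, video_bitrate_bps, watched_s):
    played_bytes = video_bitrate_bps / 8 * watched_s
    return max(0.0, downloaded_bytes - played_bytes)

# E.g., aborting after 30 s of a 600 kbit/s video with 15 MB
# already buffered wastes 15e6 - (600e3/8)*30 = 12.75 MB.
print(wasted_bytes(15e6, 600e3, 30))  # -> 12750000.0
\end{verbatim}
The more aggressively the player buffers ahead of the playback point, the larger the downloaded-but-unplayed portion when the user aborts.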