| Name: | Description: | Size: | Format: |
|---|---|---|---|
| | | 5.05 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Given the strong growth of data observed today, the concept of big data has gained popularity, giving rise to tools capable of processing, analyzing, and storing these large volumes of data. One of the challenges facing professionals and services that deal with this type of data is choosing the most suitable platform for big data processing. This dissertation investigates the performance of Apache Hadoop, Apache Spark, and Apache Flink, which represent the three most widely used big data processing platforms. The performance of Hadoop, Spark, and Flink is evaluated using the HiBench benchmark suite (HiBench-master 7), with five workloads selected: Sort, TeraSort, WordCount, K-means, and PageRank. The platforms were installed and configured on a homogeneous cluster of four nodes (physical machines), one master and three slaves. Two metrics were considered to evaluate the platforms' performance: execution time and data throughput. Resource usage, including memory, Central Processing Unit (CPU), disk (I/O), and network, was also characterized for different data scales: small, large, and gigantic. Several experiments were carried out, and the results show that the Spark cluster performed better on the WordCount, Sort, and TeraSort workloads with the gigantic data size, while Hadoop performed better with the small and large data sizes, although for WordCount the difference was small. On the other hand, when executing iterative algorithms such as K-means, Spark performed better with small and large inputs and, for PageRank, only with the small data size, while Hadoop improved its performance with the gigantic data size for K-means and the large size for PageRank. These results show that the relative performance of the two platforms in this experiment depends on the workload, the input data size, and the memory size.
The platforms were also compared by running the WordCount program from their example files. Flink performed better than Hadoop for all input data sizes and was 2x faster than Spark. Spark performed better than Hadoop for the 2 MB and 392 MB input sizes, but its performance was observed to degrade as the input size grew. Flink's performance improved significantly, especially for the 8 GB and 38 GB input sizes, after tuning the value of its memory fraction parameter.
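As a sketch of the configuration knobs the abstract refers to: the data scale is selected through HiBench's main configuration file, and the "memory fraction" tuned for Flink corresponds to a legacy Flink setting. The concrete values below are illustrative assumptions, not the dissertation's actual configuration:

```
# conf/hibench.conf (HiBench 7) — data scale profile applied by each
# workload's prepare scripts; accepted values include tiny, small,
# large, huge and gigantic
hibench.scale.profile    gigantic

# flink-conf.yaml (legacy Flink memory model) — share of the JVM heap
# reserved for Flink's managed memory; this is the "memory fraction"
# parameter whose tuning improved the 8 GB and 38 GB runs
taskmanager.memory.fraction: 0.7
```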
Description
Keywords
Benchmarks Workloads Cloud Computing Performance Flink Hadoop Spark