Title:

Performance Optimization System for Hadoop and Spark Frameworks

Author:

Astsatryan Hrachya ; Kocharyan Aram ; Hagimont Daniel ; Lalayan Arthur

Type:

article

Uncontrolled Keywords:

Hadoop ; Spark ; data compression ; CPU/IO tradeoff ; performance optimization

Abstract:

The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed on several machines. Data compression reduces data size and transfer time between disks and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper aims to present a system enabling the selection of the compression tools and tuning the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures based on simulation analyzes.

Publisher:

De Gruyter

Date submitted:

06.07.2020

Date accepted:

25.09.2020

Date modified:

10.09.2020

DOI:

10.2478/cait-2020-0056

Language:

English

Journal or Publication Title:

Cybernetics and Information Technologies

Volume:

20

Number:

6

URL:


Additional Information:

The paper is supported by the European Union’s Horizon 2020 researchinfrastructures programme under grant agreement No 857645, project NI4OS Europe(National Initiatives for Open Science in Europe).

Affiliation:

Institute for Informatics and Automation Problems of the National Academy of Sciences of the Republicof Armenia ; Université Fédérale Toulouse Midi-Pyrénées, Toulouse, France ; National Polytechnic University of Armenia

Year:

2020

References:

Chen, J., Y. Chen, X. Du, C. Li, J. Lu, S. Zhao, X. Zhou. Big Data Challenge: A Data Management Perspective. – Frontiers of Computer Science, Vol. 7, 2013, No 2, pp. 157-164. ; Lublinsky, B., K. T. Smith, A. Yakubovich. Professional Hadoop Solutions. Indiana, USA, John Wiley & Sons, 2013, p. 504. ; Zaharia, M., R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi. Apache Spark: A Unified Engine for Big Data Processing. – Communications of the ACM, Vol. 59, 2016, No 11, pp. 56-65. ; Cheng, D., X. Zhou, P. Lama, J. Mike, C. Jiang. Energy Efficiency Aware Task Assignment with DVFS in Heterogeneous Hadoop Clusters. – IEEE Transactions on Parallel and Distributed Systems, Vol. 29, 2017, No 1, pp. 70-82. ; Nitu, V., A. Kocharyan, H. Yaya, A. Tchana, D. Hagimont, H. Astsatryan. Working Set Size Estimation Techniques in Virtualized Environments: One Size Does Not Fit All – ACM Meas. Anal. Comput. Syst., Vol. 2, 2018, pp. 1-21. ; Kothuri, P., D. Garcia, J. Hermans. Developing and Optimizing Applications in Hadoop.– Journal of Physics: Conference Series, Vol. 898, 2017, No 5. ; Dean, J., S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. – Communications of the ACM, Vol. 51, 2008, No 1, pp. 107-113. ; Won, H., M. C. Nguyen, M. S. Gil, Y. S. Moon, K. Y. Whang. Moving Metadata from Ad Hoc Files to Database Tables for Robust, Highly Available, and Scalable HDFS. – The Journal of Supercomputing, Vol. 73, 2017, No 6, pp. 2657-2681. ; Uthayakumar, J., T. Vengattaraman, P. Dhavachelvan. A Survey on Data Compression Techniques: From the Perspective of Data Quality, Coding Schemes, Data Type and Applications. – Journal of King Saud University – Computer and Information Sciences, 2018. ; Liu, L. Y., J. F. Wang, R. J. Wang, J. Y. Lee. Design and Hardware Architectures for Dynamic Huffman Coding – IEEE Proceedings-Computers and Digital Techniques, Vol. 142, 1995, No 6, pp. 411-418. ; Fenwick, P. M. The Burrows-Wheeler Transform for Block Sorting Text Compression: Principles and Improvements. – The Computer Journal, Vol. 39, 1996, No 9, pp. 731-740. ; Fang, J., J. Chen, Z. Al-Ars, P. Hofstee, J. Hidders. Work-in-Progress: A High-Bandwidth Snappy Decompressor in Reconfigurable Logic. – In: Proc. of IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Turin, Italy, 30 September – 5 October 2018, pp. 1-2. ; Liu, W., F. Mei, C. Wang, M. O’Neill, E. E. Swartzlander. Data Compression Device Based on Modified LZ4 Algorithm. – IEEE Transactions on Consumer Electronics, Vol. 64, 2018, No 1, pp. 110-117. ; Rattanaopas, K., S. Kaewkeeree. Improving Hadoop MapReduce Performance with Data Compression: A Study Using Wordcount Job. – In: Proc. of 14th IEEE International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON’17), 2017, pp. 564-567. ; Haider, A., X. Yang, N. Liu, X. H. Sun, S. He. IC-Data: Improving Compressed Data Processing in Hadoop. – In: Proc. of 22nd IEEE International Conference on High Performance Computing (HiPC’15), 2015, pp. 356-365. ; Chen, Y., A. Ganapathi, R. H. Katz. To Compress or Not to Compress-Compute vs IO Tradeoffs for Mapreduce Energy Efficiency. – In: Proc. of 1st ACM SIGCOMM Workshop on Green Networking, 2010, pp. 23-28. ; Lang, W., J. M. Patel. Energy Management for MapReduce Clusters. – In: Proc. of VLDB Endowment, Vol. 3, 2010, No 1-2, pp. 129-139. ; Li, W., H. Yang, Z. Luan, D. Qian. Energy Prediction for Mapreduce Workloads. – In: Proc. of 9th IEEE International Conference on Dependable, Autonomic and Secure Computing, 2011, pp. 443-448. ; Wirtz, T., R. Ge. Improving Mapreduce Energy Efficiency for Computation Intensive Workloads. – In: Proc. of IEEE International Green Computing Conference and Workshops, 2011, pp. 1-8. ; Leverich, J., C. Kozyrakis. On the Energy (in) Efficiency of Hadoop Clusters. – ACM SIGOPS Operating Systems Review, Vol. 44, 2010, No 1, pp. 61-65. ; Tiwari, N., S. Sarkar, U. Bellur, M. Indrawan. An Empirical Study of Hadoop’s Energy Efficiency on a HPC Cluster. – Procedia Computer Science, Vol. 29, 2014, pp. 62-72. ; Tatineni, M., J. Greenberg, R. Wagner, E. Hocks, C. Irving. Hadoop Deployment and Performance on Gordon Data Intensive Supercomputer. – In: Proc. of Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, 2013, pp. 1-3. ; Narkhede, S., T. Baraskar. HMR Log Analyzer: Analyze Web Application Logs over Hadoop MapReduce. – International Journal of UbiComp (IJU), Vol. 4, 2013, No 3, pp. 41-51. ; Krishna, K., M. N. Murty. Genetic k-Means Algorithm. – IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 29, No 3, 1999, pp. 433-439. ; Zhao, W., H. Ma., Q. He. Parallel K-Means Clustering Based on MapReduce. – In: CloudCom 2009. LNCS 5931. Berlin, Springer, 2009, pp. 674-679. ; Astsatryan, H., V. Sahakyan, Y. Shoukourian, P. H. Cros, M. Dayde, J. Dongarra, P. Oster. Strengthening Compute and Data Intensive Capacities of Armenia. – In: Proc. of 14th IEEE RoEduNet International Conference – Networking in Education and Research (NER’15), Craiova, Romania; September 2015, pp. 28-33. ; Astsatryan, H., W. Narsisian, A. Kocharyan, G. da Costa, A. Hankel, A. Oleksiak. Energy Optimization Methodology for e-Infrastructure Providers. – Willey Concurrency and Computation: Practice and Experience, Vol. 29, 2017, No 10. DOI: 10.1002/cpe.4073. ; Nitu, V., A. Kocharyan, H. Yaya, A. Tchana, D. Hagimont, H. Astsatryan. Working Set Size Estimation Techniques in Virtualized Environments: One Size Does Not Fit All. – Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 2, 2018, No 1, pp. 1-22.

Indexing:

ACM Digital Library ; Baidu Scholar ; Cabell's Whitelist ; CNKI Scholar (China National Knowledge Infrastructure) ; CNPIEC - cnpLINKer ; Dimensions ; DOAJ (Directory of Open Access Journals) ; EBSCO (relevant databases) ; EBSCO Discovery Service ; Engineering Village ; Genamics JournalSeek ; Google Scholar ; Inspec ; Japan Science and Technology Agency (JST) ; J-Gate ; JournalGuide ; JournalTOCs ; KESLI-NDSL (Korean National Discovery for Science Leaders) ; Mathematical Reviews (MathSciNet) ; Microsoft Academic ; MyScienceWork ; Naver Academic ; Naviga (Softweco) ; Primo Central (ExLibris) ; ProQuest (relevant databases) ; Publons ; QOAM (Quality Open Access Market) ; ReadCube ; SCImago (SJR) ; Scopus ; Semantic Scholar ; Sherpa/RoMEO ; Summon (ProQuest) ; TDNet ; Ulrich's Periodicals Directory/ulrichsweb ; WanFang Data ; Web of Science - Emerging Sources Citation Index ; WorldCat (OCLC)