- ris
- BibTeX

Performance Optimization System for Hadoop and Spark Frameworks

Astsatryan Hrachya; Kocharyan Aram; Hagimont Daniel; Lalayan Arthur

Performance Optimization System for Hadoop and Spark Frameworks

Astsatryan Hrachya ; Kocharyan Aram ; Hagimont Daniel ; Lalayan Arthur

Hadoop ; Spark ; data compression ; CPU/IO tradeoff ; performance optimization

The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed on several machines. Data compression reduces data size and transfer time between disks and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper aims to present a system enabling the selection of the compression tools and tuning the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures based on simulation analyzes.

10.2478/cait-2020-0056

Cybernetics and Information Technologies

The paper is supported by the European Union’s Horizon 2020 researchinfrastructures programme under grant agreement No 857645, project NI4OS Europe(National Initiatives for Open Science in Europe).

Institute for Informatics and Automation Problems of the National Academy of Sciences of the Republicof Armenia ; Université Fédérale Toulouse Midi-Pyrénées, Toulouse, France ; National Polytechnic University of Armenia

Chen, J., Y. Chen, X. Du, C. Li, J. Lu, S. Zhao, X. Zhou. Big Data Challenge: A Data Management Perspective. – Frontiers of Computer Science, Vol. 7, 2013, No 2, pp. 157-164. ; Lublinsky, B., K. T. Smith, A. Yakubovich. Professional Hadoop Solutions. Indiana, USA, John Wiley & Sons, 2013, p. 504. ; Zaharia, M., R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi. Apache Spark: A Unified Engine for Big Data Processing. – Communications of the ACM, Vol. 59, 2016, No 11, pp. 56-65. ; Cheng, D., X. Zhou, P. Lama, J. Mike, C. Jiang. Energy Efficiency Aware Task Assignment with DVFS in Heterogeneous Hadoop Clusters. – IEEE Transactions on Parallel and Distributed Systems, Vol. 29, 2017, No 1, pp. 70-82. ; Nitu, V., A. Kocharyan, H. Yaya, A. Tchana, D. Hagimont, H. Astsatryan. Working Set Size Estimation Techniques in Virtualized Environments: One Size Does Not Fit All – ACM Meas. Anal. Comput. Syst., Vol. 2, 2018, pp. 1-21. ; Kothuri, P., D. Garcia, J. Hermans. Developing and Optimizing Applications in Hadoop.– Journal of Physics: Conference Series, Vol. 898, 2017, No 5. ; Dean, J., S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. – Communications of the ACM, Vol. 51, 2008, No 1, pp. 107-113. ; Won, H., M. C. Nguyen, M. S. Gil, Y. S. Moon, K. Y. Whang. Moving Metadata from Ad Hoc Files to Database Tables for Robust, Highly Available, and Scalable HDFS. – The Journal of Supercomputing, Vol. 73, 2017, No 6, pp. 2657-2681. ; Uthayakumar, J., T. Vengattaraman, P. Dhavachelvan. A Survey on Data Compression Techniques: From the Perspective of Data Quality, Coding Schemes, Data Type and Applications. – Journal of King Saud University – Computer and Information Sciences, 2018. ; Liu, L. Y., J. F. Wang, R. J. Wang, J. Y. Lee. Design and Hardware Architectures for Dynamic Huffman Coding – IEEE Proceedings-Computers and Digital Techniques, Vol. 142, 1995, No 6, pp. 411-418. ; Fenwick, P. M. The Burrows-Wheeler Transform for Block Sorting Text Compression: Principles and Improvements. – The Computer Journal, Vol. 39, 1996, No 9, pp. 731-740. ; Fang, J., J. Chen, Z. Al-Ars, P. Hofstee, J. Hidders. Work-in-Progress: A High-Bandwidth Snappy Decompressor in Reconfigurable Logic. – In: Proc. of IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Turin, Italy, 30 September – 5 October 2018, pp. 1-2. ; Liu, W., F. Mei, C. Wang, M. O’Neill, E. E. Swartzlander. Data Compression Device Based on Modified LZ4 Algorithm. – IEEE Transactions on Consumer Electronics, Vol. 64, 2018, No 1, pp. 110-117. ; Rattanaopas, K., S. Kaewkeeree. Improving Hadoop MapReduce Performance with Data Compression: A Study Using Wordcount Job. – In: Proc. of 14th IEEE International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON’17), 2017, pp. 564-567. ; Haider, A., X. Yang, N. Liu, X. H. Sun, S. He. IC-Data: Improving Compressed Data Processing in Hadoop. – In: Proc. of 22nd IEEE International Conference on High Performance Computing (HiPC’15), 2015, pp. 356-365. ; Chen, Y., A. Ganapathi, R. H. Katz. To Compress or Not to Compress-Compute vs IO Tradeoffs for Mapreduce Energy Efficiency. – In: Proc. of 1st ACM SIGCOMM Workshop on Green Networking, 2010, pp. 23-28. ; Lang, W., J. M. Patel. Energy Management for MapReduce Clusters. – In: Proc. of VLDB Endowment, Vol. 3, 2010, No 1-2, pp. 129-139. ; Li, W., H. Yang, Z. Luan, D. Qian. Energy Prediction for Mapreduce Workloads. – In: Proc. of 9th IEEE International Conference on Dependable, Autonomic and Secure Computing, 2011, pp. 443-448. ; Wirtz, T., R. Ge. Improving Mapreduce Energy Efficiency for Computation Intensive Workloads. – In: Proc. of IEEE International Green Computing Conference and Workshops, 2011, pp. 1-8. ; Leverich, J., C. Kozyrakis. On the Energy (in) Efficiency of Hadoop Clusters. – ACM SIGOPS Operating Systems Review, Vol. 44, 2010, No 1, pp. 61-65. ; Tiwari, N., S. Sarkar, U. Bellur, M. Indrawan. An Empirical Study of Hadoop’s Energy Efficiency on a HPC Cluster. – Procedia Computer Science, Vol. 29, 2014, pp. 62-72. ; Tatineni, M., J. Greenberg, R. Wagner, E. Hocks, C. Irving. Hadoop Deployment and Performance on Gordon Data Intensive Supercomputer. – In: Proc. of Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, 2013, pp. 1-3. ; Narkhede, S., T. Baraskar. HMR Log Analyzer: Analyze Web Application Logs over Hadoop MapReduce. – International Journal of UbiComp (IJU), Vol. 4, 2013, No 3, pp. 41-51. ; Krishna, K., M. N. Murty. Genetic k-Means Algorithm. – IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 29, No 3, 1999, pp. 433-439. ; Zhao, W., H. Ma., Q. He. Parallel K-Means Clustering Based on MapReduce. – In: CloudCom 2009. LNCS 5931. Berlin, Springer, 2009, pp. 674-679. ; Astsatryan, H., V. Sahakyan, Y. Shoukourian, P. H. Cros, M. Dayde, J. Dongarra, P. Oster. Strengthening Compute and Data Intensive Capacities of Armenia. – In: Proc. of 14th IEEE RoEduNet International Conference – Networking in Education and Research (NER’15), Craiova, Romania; September 2015, pp. 28-33. ; Astsatryan, H., W. Narsisian, A. Kocharyan, G. da Costa, A. Hankel, A. Oleksiak. Energy Optimization Methodology for e-Infrastructure Providers. – Willey Concurrency and Computation: Practice and Experience, Vol. 29, 2017, No 10. DOI: 10.1002/cpe.4073. ; Nitu, V., A. Kocharyan, H. Yaya, A. Tchana, D. Hagimont, H. Astsatryan. Working Set Size Estimation Techniques in Virtualized Environments: One Size Does Not Fit All. – Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 2, 2018, No 1, pp. 1-22.

ACM Digital Library ; Baidu Scholar ; Cabell's Whitelist ; CNKI Scholar (China National Knowledge Infrastructure) ; CNPIEC - cnpLINKer ; Dimensions ; DOAJ (Directory of Open Access Journals) ; EBSCO (relevant databases) ; EBSCO Discovery Service ; Engineering Village ; Genamics JournalSeek ; Google Scholar ; Inspec ; Japan Science and Technology Agency (JST) ; J-Gate ; JournalGuide ; JournalTOCs ; KESLI-NDSL (Korean National Discovery for Science Leaders) ; Mathematical Reviews (MathSciNet) ; Microsoft Academic ; MyScienceWork ; Naver Academic ; Naviga (Softweco) ; Primo Central (ExLibris) ; ProQuest (relevant databases) ; Publons ; QOAM (Quality Open Access Market) ; ReadCube ; SCImago (SJR) ; Scopus ; Semantic Scholar ; Sherpa/RoMEO ; Summon (ProQuest) ; TDNet ; Ulrich's Periodicals Directory/ulrichsweb ; WanFang Data ; Web of Science - Emerging Sources Citation Index ; WorldCat (OCLC)