Apache Spark is an open-source, general-purpose framework for distributed cluster computing and one of the most efficient technologies for processing massive amounts of distributed data. Spark provides the Dataset and GraphX APIs, which offer high-level functions for manipulating data distributed among the nodes of a cluster, together with powerful optimization that takes the distributed nature of the data into account. Huge amounts of biological data are now publicly available. One such source is UniProt, a comprehensive, high-quality, and freely accessible resource of protein sequence and functional information. UniProt publishes its semantic data in RDF format and offers a SPARQL interface for querying it. In this work, we store UniProt’s datasets in our cluster’s distributed storage, deployed in the cloud, and implement some basic functionality in Spark, using the Dataset and GraphX APIs for Scala, to run queries on the UniProt data.

Institute for Informatics and Automation Problems of NAS RA

Armenia

2019

September 23-27

CSIT Conference 2019

Yerevan, Armenia