Title:

Using Apache Spark for Biological Data Processing

Author:

Shahinyan Tigran

Type:

Conference

Co-author(s):

Berberyan Levon

Uncontrolled Keywords:

Distributed computing; Spark; semantic web

Abstract:

Apache Spark is an open-source, general-purpose distributed cluster-computing framework. It is among the most efficient technologies for processing massive amounts of distributed data. Spark provides the Dataset and GraphX APIs, which offer high-level functions for manipulating data distributed among the nodes of a cluster, together with powerful optimization that takes the distributed nature of the data into account. Large volumes of biological data are now publicly available. One such source is UniProt, a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. UniProt publishes its semantic data in RDF format and offers a SPARQL interface for querying it. In this work we store UniProt's datasets in our cluster's distributed storage, deployed in the cloud. We implement some basic functionality in Spark, using the Dataset and GraphX APIs for Scala, to run queries on the UniProt data.
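To make the approach described in the abstract more concrete, the following is a minimal sketch of how RDF triples could be loaded into a Spark Dataset and a GraphX graph in Scala. It is not the authors' implementation. The `Triple` case class, the `UniProtSketch` object, the `parseLine` helper, the input path `uniprot-sample.nt`, and the assumption that the data has been exported as simple N-Triples lines are all hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical triple model for illustration; not taken from the paper.
case class Triple(subject: String, predicate: String, obj: String)

object UniProtSketch {
  // Parse one simple N-Triples line of the form "<s> <p> <o> ."
  // Blank lines and comment lines yield None; a literal object containing
  // spaces is kept whole because the split is limited to three parts.
  def parseLine(line: String): Option[Triple] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else trimmed.stripSuffix(".").trim.split("\\s+", 3) match {
      case Array(s, p, o) => Some(Triple(s, p, o))
      case _              => None
    }
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for trying the sketch on one machine.
    val spark = SparkSession.builder()
      .appName("uniprot-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input path; on a cluster this would point at distributed storage.
    val path = args.headOption.getOrElse("uniprot-sample.nt")
    val triples = spark.read.textFile(path).flatMap(l => parseLine(l).toSeq)

    // Dataset API: count how many triples use each predicate.
    triples.groupBy($"predicate").count().show(truncate = false)

    // GraphX: assign a numeric id to every distinct subject and object.
    val vertexIds = triples.rdd
      .flatMap(t => Seq(t.subject, t.obj))
      .distinct()
      .zipWithUniqueId()

    // Resolve subject and object strings to their ids with two joins.
    val edges = triples.rdd
      .map(t => (t.subject, t))
      .join(vertexIds)
      .map { case (_, (t, sId)) => (t.obj, (sId, t.predicate)) }
      .join(vertexIds)
      .map { case (_, ((sId, p), oId)) => Edge(sId, oId, p) }

    val graph = Graph(vertexIds.map(_.swap), edges)
    println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

    spark.stop()
  }
}
```

The string-to-id joins are used here because GraphX requires `Long` vertex identifiers; hashing the URIs would be shorter but could produce collisions on a dataset of UniProt's size.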

Language:

English

Affiliation:

Institute for Informatics and Automation Problems of NAS RA

Country:

Armenia

Year:

2019

Time period:

September 23-27

Conference title:

CSIT Conference 2019

Place:

Yerevan, Armenia