Using Apache Spark for Biological Data Processing

Shahinyan Tigran

Using Apache Spark for Biological Data Processing

Download

Description
Information

Title: Using Apache Spark for Biological Data Processing

Abstract:

Apache Spark is an open source distributed general-purpose cluster-computing framework. It’s one of the most efficient technologies for processing massive amounts of distributed data. Spark provides DataSet and GraphX API-s which allow high level functions for manipulations of data distributed among the nodes of a cluster. It provides powerful optimization which considers the distributed nature of data. Currently there are sources of huge amounts of biological data which are publicly available for usage. One of such sources is UniProt providing a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Uniprot provides its semantic data in RDF format with a SPARQL interface for querying among it. In current work we store UniProt’s datasets into our cluster’s distributed storage deployed in the cloud. We provide some basic functionality implemented in Spark using DataSet and GraphX API-s for Scala language to provide queries on the UniProt’s data.