Performance Evaluation of Parallel Strategies in Public Clouds: a Study with Phylogenomic Workflows

OLIVEIRA, D.; OCAÑA, K.; OGASAWARA, E.; DIAS, J.; GONCALVES, J.; BAIAO, F.; MATTOSO, M. Performance Evaluation of Parallel Strategies in Public Clouds: a Study with Phylogenomic Workflows. Future Generation Computer Systems, v. 29(7), p. 1816-1825, 2013.
Keywords: cloud computing; scientific workflows; MapReduce; Hadoop; phylogenomics

Abstract

Data analysis is an exploratory process that demands high performance computing (HPC). SciPhylomics, for example, is a data-intensive workflow that aims at producing phylogenomic trees based on an input set of protein sequences of genomes to infer evolutionary relationships among living organisms. SciPhylomics can benefit from parallel processing techniques provided by existing approaches such as the SciCumulus cloud workflow engine and MapReduce implementations such as Hadoop. Despite some performance fluctuations, computing clouds provide a new dimension for HPC due to their elasticity and availability features. In this paper, we present a performance evaluation for SciPhylomics executions in a real cloud environment. The workflow was executed using two parallel execution approaches (SciCumulus and Hadoop) on the Amazon EC2 cloud. Our results reinforce the benefits of parallelizing data for the phylogenomic inference workflow using MapReduce-like parallel approaches in the cloud. The performance results demonstrate that this class of bioinformatics experiment is suitable to be executed in the cloud despite its need for high performance capabilities. The evaluated workflow shares many features with other data-intensive workflows, which gives a first indication that these cloud execution results can be extrapolated to other classes of experiments.

Highlights

  • Parallel models compared: Hadoop and SciCumulus adaptive cloud workflow engine.
  • Parallel approaches in the cloud speed up bioinformatics workflows.
  • SciCumulus adaptive techniques adjust the number of virtual machines and load.
  • Adaptive solution outperformed static Hadoop, making better use of cloud elasticity.
  • Scientists may finally analyze provenance of computations while running workflows.