
Computational Methods for Bioinformatics Analysis and Neuromorphic Computing / Urgese, Gianvito. - (2016).

Computational Methods for Bioinformatics Analysis and Neuromorphic Computing

URGESE, GIANVITO
2016

Abstract

The latest biological discoveries and the exponential growth of increasingly sophisticated biotechnologies have led, in the current century, to a revolution that has reshaped the way genetics is studied. This revolution, which began in recent decades, is still ongoing thanks to the introduction of new technologies capable of producing huge amounts of biological data in a relatively short time and at a far lower cost than a few decades ago. These new technologies are known as Next Generation Sequencing (NGS). NGS platforms perform massively parallel sequencing of both RNA and DNA molecules, making it possible to retrieve the nucleic acid sequence of millions of DNA or RNA fragments in a single machine run. The introduction of such technologies rapidly changed the landscape of genetic research, providing the ability to answer questions with heretofore unimaginable accuracy and speed. Moreover, the advent of NGS, with the consequent need for ad-hoc strategies for data storage, sharing, and analysis, is transforming genetics into a big-data research field. Indeed, the large amount of data coming from sequencing technologies and the complexity of biological processes call for novel computational tools (Bioinformatics tools) and informatics resources to exploit this kind of information and gain new insights into human beings, living organisms, and the mechanisms of pathologies. At the same time, a new scientific discipline called Neuromorphic Computing has been established to develop SW/HW systems with brain-inspired features, such as a high degree of parallelism and low power consumption. These platforms are usually employed to support the simulation of the nervous system, thus enabling the study of the mechanisms underlying brain function.

In this scenario, my research program focused on the development of optimised HW/SW algorithms and tools to process biological information from Bioinformatics and Neuromorphic studies. The main objective of the methodologies proposed in this thesis was to achieve a high level of sensitivity and specificity in data analysis while minimising the computational time. To reach these goals, several bottlenecks identified in state-of-the-art tools were removed through the careful design of three new optimised algorithms. The work that led to this thesis is part of three collaborative projects: two concerning the design of Bioinformatics sequence alignment algorithms and one aimed at optimising the resource usage of a Neuromorphic platform. In the next paragraphs, the projects are briefly introduced.

Dynamic Gap Selector Project

This project concerned the design and implementation of a new gap model for dynamic programming sequence alignment algorithms. Smith-Waterman (S-W) and Needleman-Wunsch (N-W) are widespread methods for performing local and global alignments of biological sequences such as proteins, DNA, and RNA molecules, which are represented as sequences of letters. Both algorithms use scoring procedures to evaluate the matches and errors encountered during the sequence alignment process. These scoring strategies are designed to account for insertions and deletions through the identification of gaps in the aligned sequences. The Affine gap model is considered the most accurate model for the alignment of biomolecules. However, its application to the S-W and N-W algorithms is quite expensive in terms of both computational time and memory requirements when compared to less demanding models such as the Linear gap model. To overcome these drawbacks, an optimised version of the Affine gap model called Dynamic Gap Selector (DGS) has been developed. The alignment scores computed using DGS are very similar to those computed using the gold-standard Affine gap model, while applying DGS in the S-W and N-W alignment procedures reduces the memory requirements by a factor of 3 and the number of operations by a factor of 2 with respect to the standard Affine gap model.
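As a concrete illustration of the baseline that DGS optimises, the following minimal Python sketch implements the classical Gotoh-style recurrences for global (N-W) alignment under the standard Affine gap model. It is an illustrative example only, not the DGS algorithm; the scoring parameters are arbitrary values chosen for the sketch.

    # Minimal Gotoh-style global alignment (N-W with the standard Affine gap model).
    # Illustrative sketch only: scoring parameters are arbitrary example values.
    NEG_INF = float("-inf")

    def affine_global_score(a, b, match=2, mismatch=-1, gap_open=-4, gap_extend=-1):
        n, m = len(a), len(b)
        # Three DP matrices: M ends with a match/mismatch, X with a gap in `a`,
        # Y with a gap in `b` -- hence the ~3x memory cost of the Affine model.
        M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
        X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
        Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
        M[0][0] = 0
        for j in range(1, m + 1):                 # leading gap in `a`
            X[0][j] = gap_open + (j - 1) * gap_extend
        for i in range(1, n + 1):                 # leading gap in `b`
            Y[i][0] = gap_open + (i - 1) * gap_extend
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
                X[i][j] = max(M[i][j - 1] + gap_open, X[i][j - 1] + gap_extend)
                Y[i][j] = max(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_extend)
        return max(M[n][m], X[n][m], Y[n][m])

    print(affine_global_score("GATTACA", "GCATGCA"))

The three dynamic programming matrices make visible where the roughly threefold memory overhead of the Affine model comes from: a Linear gap model needs only one matrix and one recurrence per cell, and, according to the results summarised above, DGS approximates the Affine scores while cutting the memory footprint by a factor of 3 and the operation count by a factor of 2.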
isomiR-SEA Project

One of the most attractive research fields currently investigated by several interdisciplinary research teams is the study of small and medium-length RNA sequences with regulatory functions on protein production, respectively called microRNAs (miRNAs) and long non-coding RNAs (lncRNAs). In the second project, an alignment algorithm specific for miRNA detection and characterisation has been designed and implemented. miRNAs are a class of short RNAs (18-25 bases) that play essential roles in a variety of cellular processes such as development, metabolism, regulation of the immunological response, and tumorigenesis. Several tools have been developed in recent years to align and analyse the huge amount of data coming from the sequencing of short RNA molecules. However, these tools still lack accuracy and completeness because they use general alignment procedures that do not take into account the structural characteristics of miRNA molecules. Moreover, they are not able to detect specific miRNA variants, called isomiRs, that have recently been found to be relevant for the regulation of miRNA targets. To overcome these limitations, a miRNA-based alignment algorithm has been designed and developed. The isomiR-SEA algorithm is specifically tailored to detect the different miRNA variants (isomiRs) present in RNA-Seq data and to provide users with a detailed picture of the isomiR spectrum characterising the sample under investigation. The accuracy of the implemented alignment policy is reflected in the precise quantification of miRNAs and isomiRs, and in the detailed profiling of miRNA-target mRNA interaction sites. This information, hidden in raw miRNA sequencing data, can be very useful for properly characterising miRNAs and adopting them as reliable biomarkers able to describe multifactorial pathologies such as cancer.
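To give a flavour of what a miRNA-aware matching policy looks like, the toy Python sketch below anchors a read on the conserved seed region of a known miRNA (bases 2-8 by the usual convention) and reports 5'/3' end differences as isomiR-like variants instead of rejecting them as generic mismatches. This is a simplified illustration and not the isomiR-SEA implementation; the function name, parameters, and example sequences are invented for the example.

    # Toy illustration of a seed-anchored, miRNA-aware matching policy -- NOT the
    # actual isomiR-SEA implementation. A read must contain the conserved seed of
    # the canonical miRNA; 5'/3' end differences are then reported as isomiR-like
    # variants instead of being discarded as generic mismatches.

    def classify_read(read, mirna, seed_start=1, seed_len=7):
        """Label `read` against the canonical `mirna` sequence (both 5'->3' strings)."""
        seed = mirna[seed_start:seed_start + seed_len]   # seed region, bases 2-8
        pos = read.find(seed)
        if pos == -1:
            return "no-seed-match"                       # seed not conserved in the read
        shift_5p = pos - seed_start                      # offset of the read's 5' end
        end_in_mirna = seed_start + (len(read) - pos)    # where the read's 3' end falls
        shift_3p = end_in_mirna - len(mirna)             # trimming (<0) or extension (>0)
        if shift_5p == 0 and shift_3p == 0 and read == mirna:
            return "canonical"
        return "isomiR(5p_shift={}, 3p_shift={})".format(shift_5p, shift_3p)

    # Example with an illustrative miRNA-like sequence and a 3'-trimmed read.
    canonical = "UGAGGUAGUAGGUUGUAUAGUU"
    print(classify_read(canonical, canonical))        # -> canonical
    print(classify_read(canonical[:-2], canonical))   # -> isomiR(5p_shift=0, 3p_shift=-2)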
SNN Partitioning and Placement Project

In the Neuromorphic Computing field, SpiNNaker is one of the state-of-the-art massively parallel neuromorphic platforms. It is designed to simulate Spiking Neural Networks (SNNs), but it is affected by several bottlenecks in the neuron partitioning and placement phases executed during the simulation configuration. In this activity, carried out within the European Flagship Human Brain Project, a top-down methodology has been developed to improve the scalability and reliability of SNN simulations on massively many-core and densely interconnected platforms. In this context, SNNs mimic brain activity by emulating spikes exchanged among populations of neurons, and many-core platforms are emerging computing resources for achieving real-time SNN simulations. Neurons are mapped to parallel cores, and spikes are sent in the form of packets over the on-chip and off-chip network. However, due to the heterogeneity and complexity of neuron population activity, achieving an efficient exploitation of the platform resources is a challenge, often impacting simulation reliability and limiting the size of the biological networks that can be simulated. To address this challenge, the proposed methodology makes use of customised SNN configurations capable of extracting detailed profiling information about the network usage of on-chip and off-chip resources, thus making it possible to recognise the bottlenecks in the spike propagation system. These bottlenecks are then taken into account during the SNN Partitioning and Placement of a graph describing the SNN interconnections onto the chips and cores available on the SpiNNaker board. The advantages of the proposed SNN Partitioning and Placement applied to SpiNNaker have been evaluated in terms of traffic reduction and the consequent simulation reliability. The results demonstrate that it is possible to consistently reduce packet traffic and improve simulation reliability by means of an effective neuron placement.
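The sketch below illustrates the general idea of traffic-aware placement with a simple greedy heuristic: population pairs that exchange the most spike packets (according to profiling) are co-located on the same chip whenever core capacity allows, so that fewer packets cross the off-chip network. This is an illustrative toy, not the methodology developed in the project or the SpiNNaker toolchain; all names and numbers are invented for the example.

    # Illustrative greedy, traffic-aware placement heuristic (not the actual
    # SpiNNaker toolchain or the project's methodology): populations exchanging
    # the most spike packets are co-located on the same chip when possible.
    from collections import defaultdict

    def greedy_placement(populations, traffic, cores_per_chip=16):
        """
        populations: dict name -> number of cores the population needs
                     (each population is assumed to fit on a single chip)
        traffic:     dict (name_a, name_b) -> profiled packets/s between the pair
        Returns a dict name -> chip index.
        """
        placement = {}
        free_cores = defaultdict(lambda: cores_per_chip)
        chips_used = 0

        def place(pop, preferred=None):
            nonlocal chips_used
            if pop in placement:
                return
            need = populations[pop]
            if preferred is not None and free_cores[preferred] >= need:
                chip = preferred
            else:
                chip = next((c for c in range(chips_used) if free_cores[c] >= need), None)
                if chip is None:              # no existing chip has room: open a new one
                    chip = chips_used
                    chips_used += 1
            placement[pop] = chip
            free_cores[chip] -= need

        # Visit the heaviest-traffic pairs first and try to co-locate them.
        for a, b in sorted(traffic, key=traffic.get, reverse=True):
            place(a)
            place(b, preferred=placement[a])
        for pop in populations:               # populations with no profiled traffic
            place(pop)
        return placement

    demo_populations = {"input": 1, "excitatory": 4, "inhibitory": 2}
    demo_traffic = {("input", "excitatory"): 900,
                    ("excitatory", "inhibitory"): 700,
                    ("inhibitory", "excitatory"): 650}
    print(greedy_placement(demo_populations, demo_traffic))

In this toy example all three populations fit on one chip, so every profiled spike stays on the on-chip network; the methodology summarised above pursues the same goal at the scale of real SNN simulations, driven by measured rather than assumed traffic.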


Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2646486