Bioinformatics methods and tools for identifying disease-associated B-cell populations, miRNA patterns and gene fusions in RNA-Seq data

Paciello, Giulia

The advent of Next Generation Sequencing (NGS) technologies dramatically reshaped genomics by making it possible to produce huge amounts of sequencing data at reduced per-base cost. The capability to determine Deoxyribonucleic Acid (DNA) and Ribonucleic Acid (RNA) sequences without any a priori knowledge of the cytogenetic profile of the host cell shed new light on several biological processes involving both coding and non-coding genes. NGS also revolutionised cancer genomics, where these techniques have been widely used to identify genomic alterations such as gene fusions, Single Nucleotide Polymorphisms (SNPs) or Copy Number Variations (CNVs), responsible for cancer onset and progression. The possibility of recognizing these alterations, and of focusing on those with the highest driver impact in the pathology under investigation, has led over the last decade to a novel medical approach referred to as Personalized Medicine, in which patient-specific sequencing data are exploited to prevent, diagnose and treat the disease with tailored therapeutic procedures. The complexity and sheer size of NGS datasets make constant interaction among several professions essential: biologists, biotechnologists, mathematicians, physicists, physicians and computer scientists must work together to design appropriate strategies for data storage, management, analysis and interpretation. This multidisciplinary research field, in which computational methodologies and algorithms are applied to gain novel insights into living systems and pathologies, is called Bioinformatics. The work proposed in this thesis addresses one of the objectives of bioinformatics, namely the development of computational methods and tools for the identification of genomic alterations in sequencing data, specifically RNA Sequencing (RNA-Seq) data.
The implemented methodologies and tools, namely VDJSeq-Solver, isomiR-SEA and FuGePrior, were designed to investigate three genomic aberrations (i.e., abnormal B-cell clonal populations, deregulated miRNA and isomiR expression patterns, and gene fusion occurrence) known to be correlated with the onset, progression and prognosis of multi-factorial diseases such as cancer. These tools implement a novel in silico approach for disease characterization from sequencing data and, although they rely on computational algorithms, were designed taking the latest biological knowledge into account. Their performance was assessed on both private and public datasets, and the results were validated by wet-lab experiments or, when this was not possible, evaluated against up-to-date scientific literature. The activities summarized in this manuscript arise from a productive interaction with different universities, hospitals and research institutes. Specifically, the design and testing of VDJSeq-Solver benefited from the medical and molecular expertise shared with the Department of Diagnostics and Public Health of the University of Verona (Italy). The first release of isomiR-SEA benefited from my collaboration with the Laboratory of Oncogenomics of the Institute for Cancer Research (IRCC) in Candiolo (Italy). Finally, the concepts underlying FuGePrior stem from my research activity within the European Project Next Generation Sequencing for Targeted Personalized Therapy of Leukaemia (NGS-PTL). All the research presented in this thesis was published in peer-reviewed scientific journals.

PORTO @ Archivio Istituzionale della Ricerca