Non-Linear Feature Selection for Biomedical Data Interpretation

Rosati, Samanta

doi:10.6092/polito/porto/2533487

The rapid development of computing-related disciplines and technologies provides large amounts of data and variables. Knowledge Discovery in Databases (KDD) is a new research field aiming to provide tools and methods for “identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. KDD is a process that starts from the analysis of the application domain and the identification of the target data. It then improves the data quality through dimensionality reduction and, finally, applies data mining methods to obtain new information and knowledge. Dimensionality reduction is one of the most important steps in KDD for the successful discovery of new knowledge. In most situations, several of the variables collected in a dataset are irrelevant, redundant, or a possible source of noise for the subsequent analysis phases. Automated methods for Feature Selection (FS) retain only the aspects of the data that are truly relevant and discard all redundant and irrelevant variables. This is important not only for Data Mining (DM) purposes, but also for improving the performance of signal processing algorithms. In the biomedical field, applying such techniques is very complicated because medical datasets are often characterized by a small number of objects (usually the patients involved in the study) associated with a very large number of variables extracted from medical analyses, and by classes that are usually unbalanced. Moreover, medical data are frequently affected by incompleteness and uncertainty arising from measurement or human errors.
The aim of this thesis is to provide a new methodology for analyzing biomedical data. This approach can be used both for extracting useful knowledge from real datasets and for improving signal and image processing techniques.
Within the whole KDD process, we focused on the application of non-linear and automated methods for FS, essentially based on Rough Set Theory (RST). RST is a powerful methodology that does not require any a priori information or model assumption about the data; it uses only knowledge directly derivable from the given data. Moreover, RST is able to model imperfect and incomplete knowledge, which usually characterizes medical datasets.
The first application concerns the characterization of the cerebral hemodynamics of migraine sufferers. A dataset of 26 parameters was built from the Near-InfraRed Spectroscopy (NIRS) signals measured on 65 migraineurs and 15 healthy subjects. The QuickReduct Algorithm (QRA), an automated and non-linear FS method based on RST, was applied and compared with conventional ANOVA. The results show that the variables selected by the QRA not only classified the subjects more accurately than those selected by ANOVA, but were also related to physiological aspects connected with the pathology.
The second application deals with the characterization of diabetic oxygenation patterns during ankle flexo-extensions. The NIRS signals were acquired in 31 diabetic patients and 16 control subjects, before and after training protocols. Starting from a dataset of 24 variables, the most discriminative feature subsets were selected using five automated FS algorithms, all based on RST. Good discriminative power was obtained for all subsets, together with very useful information for the assessment of diabetic peripheral vascular impairment.
Finally, by combining these automated FS techniques with knowledge-based systems and traditional image processing methods, a tool for the automatic segmentation of different kinds of ultrasound carotid images is proposed.
Several methods have been presented in the literature for the automated analysis of ultrasound (US) images; their performance, however, is highly sensitive to the variability introduced by noise, differing vessel morphology, and the presence of disease (plaques). In this study, we started by identifying four different classes of pixels according to their physiological meaning: lumen, intima-media complex, adventitia, and noisy lumen. FS by QRA was then performed on a dataset of 600 pixels per class, each characterized by 211 features. For each pixel, in addition to its intensity, we considered as features different parameters essentially based on the intensity of the surrounding pixels and belonging to two categories: statistical moment estimates and texture features. This process led to the selection of 12 variables, which were used to classify each pixel into one of the identified classes by means of three Feed-Forward Neural Networks (FFNNs) used in parallel. Once all pixels in the region of interest were classified, it was possible to automatically identify the lumen-intima (LI) and media-adventitia (MA) interfaces of the carotid wall. The results are very encouraging: the profiles identified by our tool are comparable with those drawn manually by human operators, even in the presence of noise or plaques in the image.
We further improved the segmentation performance obtained with this tool by developing a dual-snake system that evolves by means of a Fuzzy Inference System (FIS). This FIS takes as input seven variables related to both the internal and the external forces involved in the snake definition and, through a set of 26 rules, outputs the movements of the LI and MA profiles on the image. With this approach, we overcome the main limitations of the classical snake methodology, which requires the user to initialize the snakes and to choose the correct parameters for the snake evolution.
In contrast, the proposed fuzzy-snake system is completely automatic and independent of the user, because the two snakes start their evolution from the LI and MA interfaces identified with the pixel classification tool. Moreover, no parameters are required for the force equilibrium, thanks to the capability of the FIS to balance the different contributions. We validated our system on 180 pathological and non-pathological images; the results showed very good performance compared with manual segmentation and with classical semi-automatic techniques, improving on the results obtained with the pixel classification tool.
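As an illustration of the RST-based feature selection referred to above, the following is a minimal sketch of a QuickReduct-style greedy search in Python. It is not the implementation used in the thesis: the function names (`partition`, `dependency`, `quickreduct`), the toy table, and the early-exit safeguard are assumptions made for this example, and the attributes are assumed to be already discretized (continuous NIRS or image features would need a discretization step first). The search adds, one at a time, the attribute that most increases the rough-set dependency degree, i.e. the fraction of objects whose class is determined unambiguously by the selected attributes, and stops when the subset reaches the dependency degree of the full attribute set.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes (same values on attrs)."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Dependency degree: share of rows lying in the positive region,
    i.e. in equivalence classes that contain a single decision label."""
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def quickreduct(rows, labels):
    """Greedy forward selection driven by the dependency degree."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    best = dependency(rows, labels, reduct)
    while best < target:
        candidate, candidate_gamma = None, best
        for a in all_attrs:
            if a in reduct:
                continue
            gamma = dependency(rows, labels, reduct + [a])
            if gamma > candidate_gamma:
                candidate, candidate_gamma = a, gamma
        if candidate is None:
            # Safeguard added for this sketch: stop if no single
            # attribute improves the dependency degree.
            break
        reduct.append(candidate)
        best = candidate_gamma
    return reduct


# Invented toy table: 5 objects, 3 discretized attributes, 2 classes.
rows = [
    (0, 0, 1),
    (0, 1, 1),
    (1, 0, 0),
    (1, 1, 0),
    (0, 0, 0),
]
labels = ["healthy", "healthy", "migraine", "migraine", "healthy"]

selected = quickreduct(rows, labels)
print("selected attribute indices:", selected)
```

In this toy table the first attribute alone separates the two classes, so the search stops after a single step. Because the search is greedy, the returned subset is not guaranteed to be the smallest possible one.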

PORTO @ Archivio Istituzionale della Ricerca