Towards Self-Learning Data Transformation

BARALIS, ELENA MARIA; CERQUITELLI, TANIA; CHIUSANO, SILVIA ANNA; DI CORSO, EVELINA

Large volumes of data are being collected at an ever-increasing rate in many modern applications, ranging from social networks to scientific computing and smart environments. Because they are generated by a wide variety of events, real datasets are usually characterized by inherent sparseness. Furthermore, the features used to model real-world objects and human actions may have very large domains and variable distributions. The variability in data distribution increases with data volume, thus increasing the complexity of data analytics. Data-driven analysis is a multi-step process in which data scientists tackle the complex task of configuring the analytics system to transform data into actionable knowledge. A plethora of algorithms is available for performing any given data analysis phase (e.g., data transformation), but the choice of algorithm usually has to be tailored to the data under analysis. In many analytics processes that target sparse data collections, such as collections of documents or medical treatments, suitable transformations of the input data need to be explored to gain insights from the data, reduce sparseness, and make the overall analysis problem more tractable. Furthermore, different weighting functions (e.g., term/item frequencies, GF-IDF) can be exploited to highlight the relevance of specific objects in the collection. However, many methods exist, and selecting the best ones currently relies on the domain expert. In this work we argue for a new self-learning engine able to suggest to the analyst good transformation methods and weighting schemes for a given data collection, so as to yield higher-quality knowledge.
This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on: (i) an engine capable of characterizing data distributions through various indices (e.g., hapax legomena, Guiraud’s index of lexical richness), exploring different data weighting strategies (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), and evaluating and comparing solutions through different quality indices (e.g., WSSSE, Rand index, F-measure, precision, recall); (ii) a knowledge base storing the results of experiments on previously processed datasets, including the data characterization and the selected results; (iii) a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. We implemented a preliminary version of SELF-DATA on the Apache Spark framework, which supports parallel and scalable processing and different data transformation and analytics activities. It characterizes the data distribution through different quality indices and runs all the tests that combine a given weighting strategy with a data transformation method before applying cluster analysis through K-means. The identified solutions are compared and ranked in terms of the quality of the extracted knowledge (i.e., the quality of the discovered clusters). For each analyzed dataset, the top-2 solutions are selected and stored in the knowledge base. A preliminary validation performed on 10 news collections indicates that the term frequency and logarithmic entropy weighting methods measure item relevance effectively on very sparse datasets, and that PCA outperforms LSI in the presence of a larger data domain.
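
As an illustration of the data characterization and weighting steps named above, the following is a minimal sketch in plain Python, not the SELF-DATA implementation (which runs on Apache Spark). The function names (characterize, normalized_tf, log_entropy) are hypothetical. The sketch assumes documents are already tokenized, takes Guiraud's index as types divided by the square root of tokens, normalizes term frequency by document length, and uses the common log-entropy formulation (local weight log(1 + tf), global weight 1 plus the normalized entropy term); the abstract does not state which variants the authors use.

```python
import math
from collections import Counter


def characterize(docs):
    """Corpus-level lexical indices for a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    n_tokens = sum(counts.values())   # N: total number of tokens
    n_types = len(counts)             # V: number of distinct terms
    hapax = sum(1 for c in counts.values() if c == 1)  # terms seen exactly once
    guiraud = n_types / math.sqrt(n_tokens) if n_tokens else 0.0  # V / sqrt(N)
    return {"tokens": n_tokens, "types": n_types,
            "hapax": hapax, "guiraud": guiraud}


def normalized_tf(docs):
    """Term frequency divided by document length (one assumed normalization)."""
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        total = sum(tf.values())
        weighted.append({t: c / total for t, c in tf.items()})
    return weighted


def log_entropy(docs):
    """Log-entropy weights: log(1 + tf) times a per-term global weight."""
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]
    gf = Counter()                    # global frequency of each term
    for tf in tfs:
        gf.update(tf)
    global_w = {}
    for term, g in gf.items():
        if n_docs < 2:                # entropy is undefined for a single document
            global_w[term] = 1.0
            continue
        ent = sum((tf[term] / g) * math.log(tf[term] / g)
                  for tf in tfs if tf[term] > 0)
        global_w[term] = 1.0 + ent / math.log(n_docs)
    return [{t: math.log(1 + c) * global_w[t] for t, c in tf.items()}
            for tf in tfs]


if __name__ == "__main__":
    corpus = [["spark", "cluster", "spark"], ["spark", "news"]]
    print(characterize(corpus))
    print(normalized_tf(corpus))
    print(log_entropy(corpus))
```

Under this formulation, a term spread evenly across all documents receives a global weight of 0, while a term concentrated in a single document receives a global weight of 1, which is why log-entropy weighting tends to emphasize discriminative terms in sparse collections.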

PORTO @ Archivio Istituzionale della Ricerca