Towards Self-Learning Data Transformation

Publication type: Conference proceedings paper
MIUR type: Contribution in Conference Proceedings (Proceeding) > Abstract in conference proceedings
Title: Towards Self-Learning Data Transformation
Authors: Baralis, Elena; Cerquitelli, Tania; Chiusano, Silvia; Di Corso, Evelina
University authors:
Referee type: Scientific committee
Conference title: Women in Machine Learning Workshop 2016
Event location: Barcelona, Spain
Event date: 4-5 December 2016
Abstract: Large volumes of data are being collected at an ever-increasing rate in modern applications, ranging from social networks to scientific computing and smart environments. Because they are generated by a large variety of events, real datasets are usually characterized by inherent sparseness. Furthermore, the features used to model real-world objects and human actions may have very large domains and variable distributions. The variability in data distribution increases with data volume, thus increasing the complexity of data analytics. Data-driven analysis is a multi-step process in which data scientists tackle the complex task of configuring the analytics system to transform data into actionable knowledge. A plethora of algorithms is available for performing a given data analysis phase (e.g., data transformation), but the algorithm selection is usually tailored to the data under analysis. In many analytics processes on sparse data collections, such as document collections and medical treatment collections, suitable transformations of the input data need to be explored to gain insights from the data, reduce sparseness, and make the overall analysis problem more tractable. Furthermore, different weighting functions (e.g., term/item frequencies, GF-IDF) can be exploited to highlight the relevance of specific objects in the collection. However, many alternative methods exist, and the selection of the best ones is guided by the domain expert. In this work we argue for a new self-learning engine able to suggest to the analyst good transformation methods and weighting schemes for a given data collection, i.e., those that yield higher-quality knowledge.
This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on: (i) an engine capable of characterizing data distributions through various indices (e.g., hapax legomena, Guiraud's index of lexical richness), exploring different data weighting strategies (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), and evaluating and comparing solutions through different quality indices (e.g., WSSSE, Rand index, F-measure, precision, recall); (ii) a knowledge base storing the results of experiments on previously processed datasets, including data characterization and the selected results; and (iii) a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. We implemented a preliminary version of SELF-DATA on the Apache Spark framework, supporting parallel and scalable processing and different data transformation analytics activities. It is able to characterize the data distribution through different quality indices and to run all the tests combining a given weighting strategy with a data transformation method before applying cluster analysis through K-means. The identified solutions are compared and ranked in terms of the quality of the extracted knowledge (i.e., the quality of the discovered clusters). For each analyzed dataset, the top-2 solutions are selected and stored in the knowledge base. The preliminary validation performed on 10 news collections highlights that the term frequency and logarithmic entropy weighting methods are effective at measuring item relevance in very sparse datasets, and that the PCA method outperforms LSI in the presence of a larger data domain.
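The characterization indices and the logarithmic entropy weighting named in the abstract can be illustrated with a minimal sketch. It is not the authors' Spark implementation: the function names `lexical_indices` and `log_entropy_weights`, the toy collection, and the choice of the natural logarithm for the local weight are all assumptions made for illustration. Guiraud's index is taken as the number of distinct terms divided by the square root of the number of tokens, and the log-entropy weight as the usual product of a local term log(1 + tf) and a global term 1 + sum(p log p) / log(n).

```python
import math
from collections import Counter


def lexical_indices(docs):
    """Characterize a tokenized collection (hypothetical helper).

    Returns token count, distinct-term count, number of hapax legomena
    (terms occurring exactly once), and Guiraud's index.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    guiraud = n_types / math.sqrt(n_tokens) if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types,
            "hapax": hapax, "guiraud": guiraud}


def log_entropy_weights(docs):
    """Log-entropy weighting of a tokenized collection (hypothetical helper).

    Local weight: log(1 + tf). Global weight: 1 + sum_j p_j*log(p_j) / log(n),
    where p_j is the share of a term's occurrences found in document j.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]
    global_freq = Counter()
    for tf in tfs:
        global_freq.update(tf)

    global_w = {}
    for term, total in global_freq.items():
        if n_docs < 2:
            # Entropy normalization is undefined for a single document.
            global_w[term] = 1.0
            continue
        entropy = 0.0
        for tf in tfs:
            if term in tf:
                p = tf[term] / total
                entropy += p * math.log(p)
        global_w[term] = 1.0 + entropy / math.log(n_docs)

    return [{t: math.log(1 + f) * global_w[t] for t, f in tf.items()}
            for tf in tfs]


if __name__ == "__main__":
    # Toy collection, assumed for illustration only.
    collection = [["data", "mining", "data"], ["mining", "cluster"]]
    print(lexical_indices(collection))
    print(log_entropy_weights(collection))
```

In this sketch a term spread evenly over all documents receives a global weight of zero, while a term concentrated in a single document keeps its full local weight, which is why such a scheme can help highlight relevant items in sparse collections.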
Date: 2016
Status: Published
Publication language: English
Keywords: self-learning methodologies, data transformation methods, sparse data distribution
Departments: DAUIN - Dipartimento di Automatica e Informatica
Related URLs:
    Subject area: Area 09 - Industrial and Information Engineering > INFORMATION PROCESSING SYSTEMS
    Deposit date: 30 Sep 2016 16:38
    Last modified (IRIS): 15 May 2017 16:19:13
    Date added (PORTO): 17 May 2017 02:03
    Permalink: http://porto.polito.it/id/eprint/2651485

    Attachments

    PDF (WiML_Cerquitelli_etAl_PORTO.pdf) - Preprint
    Document access: Not visible (accessible only to the record owner)
    License: Not public - Private / restricted access.
