Data miners' little helper: data transformation activity cues for cluster analysis on document collections / Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia Anna. - Print. - (2017), pp. 1-6. (Paper presented at the conference Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, held in Amantea on June 19-22, 2017) [10.1145/3102254.3102288].

Data miners' little helper: data transformation activity cues for cluster analysis on document collections

Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia Anna
2017

Abstract

In this paper we propose a new self-learning engine that streamlines the analytics process by enabling analysts to mine massive data repositories with minimal user intervention. In the context of cluster analysis on a collection of documents, this new system, named SELF-DATA (SELF-learning DAta TrAnsformation), suggests to the analyst how to configure the whole mining process for a given dataset. SELF-DATA relies on an engine that explores different data weighting schemes (e.g., normalized term frequencies) and data transformation methods (e.g., PCA) before applying the cluster analysis, evaluates and compares the resulting solutions through different quality indices (e.g., weighted Silhouette), and presents the top-k solutions to the analyst. SELF-DATA will also include a knowledge base storing the results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best configuration of the whole mining process for an unexplored dataset. The first implementation of SELF-DATA, running on Apache Spark, has been validated on 5 collections of documents. Experimental results highlight that TF-IDF and logarithmic entropy are effective at measuring item relevance in sparse datasets, and that the LSI method outperforms PCA when the dictionary is large.
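As an illustration of the two weighting schemes named in the abstract, the sketch below computes TF-IDF and log-entropy weights for a tiny term-count matrix in plain Python. It uses common textbook definitions (normalized term frequency times log inverse document frequency; a logarithmic local weight times an entropy-based global weight), which are an assumption here: the abstract does not state the exact formulas used in SELF-DATA. The function names `tf_idf` and `log_entropy` and the toy matrix `counts` are hypothetical.

```python
import math


def tf_idf(counts):
    """Weight raw counts with normalized TF times log inverse document frequency.

    counts: list of documents, each a list of raw term counts over the same vocabulary.
    Assumed textbook definition, not necessarily the one used by SELF-DATA.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for doc in counts if doc[t] > 0) for t in range(n_terms)]
    weighted = []
    for doc in counts:
        total = sum(doc)
        row = []
        for t in range(n_terms):
            if doc[t] == 0 or total == 0:
                row.append(0.0)
            else:
                tf = doc[t] / total  # normalized term frequency
                row.append(tf * math.log(n_docs / df[t]))
        weighted.append(row)
    return weighted


def log_entropy(counts):
    """Weight raw counts with log(1 + tf) times an entropy-based global term weight.

    Global weight of term t: 1 + sum_d p(t, d) * log p(t, d) / log(n_docs),
    where p(t, d) is the share of the term's occurrences that fall in document d.
    Assumed textbook definition, not necessarily the one used by SELF-DATA.
    """
    n_docs = len(counts)
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")
    n_terms = len(counts[0])
    # Global frequency: total occurrences of each term in the collection.
    gf = [sum(doc[t] for doc in counts) for t in range(n_terms)]
    global_w = []
    for t in range(n_terms):
        entropy = 0.0
        for doc in counts:
            if doc[t] > 0:
                p = doc[t] / gf[t]
                entropy += p * math.log(p)
        global_w.append(1.0 + entropy / math.log(n_docs))
    return [
        [math.log(1 + doc[t]) * global_w[t] for t in range(n_terms)]
        for doc in counts
    ]


# Toy collection: 3 documents, 3 terms. Term 0 occurs equally in every document,
# terms 1 and 2 each occur in a single document.
counts = [[2, 0, 1], [2, 3, 0], [2, 0, 0]]
for name, scheme in (("tf-idf", tf_idf), ("log-entropy", log_entropy)):
    print(name, [[round(w, 3) for w in row] for row in scheme(counts)])
```

In this toy collection both schemes give (near) zero weight to the term that is spread evenly over all documents and a positive weight to the terms concentrated in one document, which is the behavior that makes such weights useful for measuring item relevance in sparse data.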
ISBN: 978-1-4503-5225-3
Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2678073