This paper proposes PASTA (PArameter-free Solutions for Textual Analysis), a large-scale engine providing strategies to automatically tune the algorithm parameters for the whole text clustering process. A data weighting strategy (e.g., TF-IDF) and a transformation method of input data (e.g., LSI) are explored before performing the cluster analysis to reduce sparseness and make the overall analysis problem more effectively tractable. PASTA includes auto-selection strategies to off-load the end-user from parameter tuning and achieve a good quality of the clustering results. PASTA's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. As a case study, PASTA has been validated on three collections of Wikipedia documents. The experimental results show the effectiveness and the efficiency of the proposed solution in analyzing collections of documents without tuning algorithm parameters and in discovering cohesive and well-separated groups of documents.
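
The general idea of weighting documents, clustering them, and selecting a parameter automatically can be illustrated with a minimal single-machine sketch. It is not PASTA's implementation: the abstract does not say which auto-selection criteria are used, so choosing the number of clusters K by the average silhouette score is an assumption made here, as are the whitespace tokenizer, the deterministic farthest-first centroid initialization, and all function names. The LSI step and the Spark distribution are omitted to keep the example dependency-free.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Dense, L2-normalised TF-IDF vectors over the corpus vocabulary."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    vocab = sorted(doc_freq)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        row = [tf[t] / len(tokens) * math.log(n_docs / doc_freq[t]) if tf[t] else 0.0
               for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        vectors.append([w / norm for w in row])
    return vectors


def dist(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points, k):
    """Deterministic initialisation (assumption): start from the first point,
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iters=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Average silhouette; singleton clusters contribute 0."""
    if len(set(labels)) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_k(points, k_values):
    """Return (score, k, labels) for the candidate K with the best silhouette."""
    best = None
    for k in k_values:
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best


docs = [
    "spark cluster distributed computing",
    "distributed computing spark engine",
    "spark engine cluster computing",
    "pasta recipe tomato basil",
    "tomato basil pasta sauce",
    "recipe pasta sauce tomato",
]
best_score, best_k, best_labels = select_k(tfidf_vectors(docs), [2, 3, 4])
```

On this toy corpus the two topics share no terms, so the silhouette criterion should favor K = 2 without the user supplying it. A distributed version would replace each function with the corresponding Spark operation.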

Self-tuning techniques for large scale cluster analysis on textual data collections / DI CORSO, Evelina; Cerquitelli, Tania; Ventura, Francesco. - PRINT. - (2017), pp. 1-6. (Paper presented at the ACM SIGAPP Symposium On Applied Computing, held in Marrakesh, Morocco, April 3rd-7th, 2017) [10.1145/3019612.3019661].

Self-tuning techniques for large scale cluster analysis on textual data collections

DI CORSO, EVELINA; CERQUITELLI, TANIA; VENTURA, FRANCESCO
2017


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2662148