Model selection and mixture approaches in the spectral clustering algorithm

Di Nuzzo, Cinzia

The central aim of this thesis is the study of spectral clustering. Spectral clustering is a well-known method in machine learning, here, it is dealt with from a statistical point of view. Spectral clustering methods have become very popular for finding non-convex clusters of data. These methods are based on the graph theory, where the data are represented by the vertices in an undirected graph and the edges are weighted by the similarities between pairs of units. In particular, the spectral approach is based on the properties of the pairwise similarity matrix coming from a suitable kernel function and then the clustering problem is reformulated as a graph partition problem. The spectral clustering algorithm does not work directly on the raw data but on the embedded data in a suitable feature space having a smaller (and even a very smaller) dimension than the space of the original data. This implies that this algorithm is a handy approach for handling high-dimensional data. However, the spectral clustering algorithm runs once two features have been selected as input: the number of clusters and the similarity measure that describes the local neighborhood relationships between the units. The similarity measure depends on the choice of a suitable kernel function based on some hyper-parameter(s). Throughout this thesis, the selection of the number of clusters and the selection of the kernel function will be referred to the model selection, even if there is no underlying probabilistic assumption on the data and therefore the term ``model selection" should be interpreted in a broad sense. In this framework, the thesis focuses on research on two significant topics: model selection and applications to different kinds of data sets. As regards the first topic, in this thesis, two criteria for the selection of the number of clusters and of the kernel function with its hyper-parameter(s) have been proposed. The first criterion combines different approaches taking information from different graphical and geometrical features deriving from the embedded data, and for this reason, I refer to this approach as joint-graphical. Secondly, some results in the framework of an automatic approach for the selection of the hyper-parameter(s) of the spectral clustering algorithm through mixture models according to the maximum likelihood approach have been presented. In this context, the spectral clustering algorithm is based on two main steps: the first step concerns essentially a data transformation with a dimensionality reduction based on an embedding coming from a kernel function; the second step concerns an appropriate clustering algorithm of the embedded data. Usually, the latter step is performed using the $k$-means algorithm, here, mixture models have been taken into account showing to be more robust approaches with respect to the choice of the hyper-parameter(s) that is in general quite critical. The second topic analyzed concerns applications of spectral clustering in different kinds of data sets: text data, mixed-type data and three-way data. The thesis is organized as follows. Chapter 1 is devoted to summarize the main results of the spectral clustering literature. In Chapter 2 some practical issues about the selection of kernel function and its hyper-parameters as well as the number of clusters proposed in the literature are presented. Chapter 3 concerns the document clustering and the model-based clustering of the embedded data is introduced; moreover, a new suitable kernel function for text clustering is also proposed. Chapter 4 presents an application about the mixed-type data and some categorical kernel functions are also introduced. Chapter 5 and Chapter 6 present methodological contributions regarding the model selection. More precisely, in Chapter 5 a joint-graphical method to select both the number of clusters and the hyper-parameter in the kernel function is proposed; while Chapter 6 concerns the proposal of an automatic method for the selection of the hyper-parameter(s) in the kernel function via Gaussian mixture model according to the maximum likelihood approach. Chapter 6 presents also original application of the spectral clustering technique in three-way data clustering. In Chapter 7 some ideas for future research are presented. Further examples and comparisons of the spectral clustering algorithms and kernel functions are provided in the Appendix A. Finally, in Appendix B, model-based approach to clustering is summarized.