Cluster analysis is a multivariate analysis technique that can be applied in many fields, from computer science, medicine and biology to archaeology and marketing, whenever it is necessary to classify a large amount of information into distinguishable groups.
Cluster analysis is used to group statistical units (records) that have common characteristics and assign them to categories not defined a priori.
In Cluster Analysis you can use:
-quantitative variables, i.e. numeric ones;
-qualitative variables, which take categories (e.g. gender, level of education, marital status, etc.).
The distance matrix D shows how much the statistical units differ from each other; the choice of the variables to be considered is decisive for the result.
Before creating the distance matrix, the starting matrix must be standardized, so each variable will have the same weight as the others.
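As an illustration of the standardization step, here is a minimal sketch in Python (used here only for illustration; the module's own analysis is carried out in R). It assumes z-score standardization with the sample standard deviation, which the text does not specify, and the data values and the name `standardize` are hypothetical.

```python
from statistics import mean, stdev

def standardize(rows):
    """Return a z-scored copy of rows (a list of equal-length numeric lists)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1); a constant column
    # would give 0 here and must be dropped before standardizing.
    stds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical starting matrix: height in cm and income in euros.
data = [
    [170.0, 20000.0],
    [180.0, 30000.0],
    [160.0, 25000.0],
    [175.0, 45000.0],
]
Z = standardize(data)
```

After this step every column of `Z` has mean 0 and standard deviation 1, so income no longer dominates height simply because it is measured on a larger scale.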
To obtain the distance matrix D it is necessary to calculate the distances between the points.
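A minimal Python sketch of this calculation follows. It assumes the Euclidean distance, which is only one possible choice and is not stated in the text; the points and the name `distance_matrix` are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def distance_matrix(points):
    """Return the symmetric matrix D with D[i][j] = distance(points[i], points[j])."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric, so each pair is computed only once.
            D[i][j] = D[j][i] = dist(points[i], points[j])
    return D

# Hypothetical (already standardized) statistical units.
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(points)
```

The diagonal of `D` is zero because each unit has no distance from itself.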
The link rule determines how clusters are formed. We can choose among the following types of link:
Single linkage: groups are merged according to the minimum distance between their observations. This link favors the homogeneity of the elements of each group to the detriment of differentiation between groups.
Complete linkage: groups are merged according to the smallest of the maximum distances between their points: first the greatest distance between each pair of groups is calculated, then the pair with the smallest such distance is merged. This link highlights differences between groups rather than internal homogeneity.
Average linkage: groups are merged according to the minimum average distance: first the average distance between all the observations of each pair of groups is calculated, then the pair with the smallest average is merged. This link is less sensitive to extreme values, so it is more robust.
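The three link rules (single, complete and average) differ only in how the distances between the points of two groups are summarized. The following Python sketch shows this under the assumption of Euclidean distance; the function names and the two groups are hypothetical.

```python
from math import dist
from statistics import mean

def pairwise(group_a, group_b):
    """All distances between a point of group_a and a point of group_b."""
    return [dist(p, q) for p in group_a for q in group_b]

def single_link(group_a, group_b):
    # Distance between the two closest points of the groups.
    return min(pairwise(group_a, group_b))

def complete_link(group_a, group_b):
    # Distance between the two farthest points of the groups.
    return max(pairwise(group_a, group_b))

def average_link(group_a, group_b):
    # Mean of all point-to-point distances between the groups.
    return mean(pairwise(group_a, group_b))

# Hypothetical groups of statistical units.
group_a = [[0.0, 0.0], [0.0, 1.0]]
group_b = [[3.0, 0.0], [4.0, 0.0]]

d_single = single_link(group_a, group_b)
d_complete = complete_link(group_a, group_b)
d_average = average_link(group_a, group_b)
```

For any two groups the single distance is never larger than the average one, which in turn is never larger than the complete one. At each step the algorithm merges the pair of groups for which the chosen distance is smallest.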
As groups are merged, the distance between the clusters being joined tends to increase; for this reason we adopt a stopping rule that determines the number of groups we want to obtain.
To do this we use the tree cutting technique:
-Looking at the longest branches;
-Through the criterion of parsimony (usually 4-5 clusters, homogeneous inside and heterogeneous outside);
-With the scree plot of the fusion distances (cutting where the graph flattens, or where the transition from g to g+1 groups shows a strong increase);
-Taking care that there are no outliers (clusters composed of a single point).
-Taking care that there are no outliers (clusters composed of a single point).
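The scree-plot criterion can be sketched in Python as follows. This is a naive illustration, not the procedure used in the module: it assumes Euclidean distance and complete linkage by default, and it cuts the tree just before the largest increase in fusion distance. The names `agglomerate` and `cut_at_largest_jump` and the data are hypothetical.

```python
from math import dist

def agglomerate(points, link=max):
    """Naive agglomerative clustering.

    `link` summarizes the pairwise distances between two clusters:
    min gives single linkage, max gives complete linkage.
    Returns a list of (fusion_distance, clusters_after_merge) per merge.
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((d, [list(c) for c in clusters]))
    return history

def cut_at_largest_jump(history, n_points):
    """Number of groups left just before the largest jump in fusion distance."""
    distances = [d for d, _ in history]
    jumps = [distances[k + 1] - distances[k] for k in range(len(distances) - 1)]
    k = jumps.index(max(jumps))  # merge k is the last one before the jump
    return n_points - (k + 1)

# Hypothetical one-dimensional data with two well separated groups.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
history = agglomerate(points)
n_groups = cut_at_largest_jump(history, len(points))
groups = history[len(points) - n_groups - 1][1]
```

Plotting the fusion distances in `history` against the merge step gives the scree plot. Before accepting the cut it is still worth checking the sizes of the clusters in `groups`, since a cluster made of a single point signals an outlier.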
After importing the dataset into R, we start the cluster analysis:
The result obtained with the single linkage:
The same procedure is carried out for the complete linkage and the average linkage.
Comparing the three linkages, the most suitable is the complete linkage, as it separates the clusters better and avoids too much internal homogeneity at the expense of external heterogeneity. It also prevents the formation of outliers, i.e. clusters composed of a single point.
Statistical units, cluster, intra-cluster, inter-cluster, dissimilarity index, merge distance, dendrogram.
Objectives/goals:
The aim of this module is to introduce and explain the technique of Cluster Analysis.
At the end of this module you will be able to:
Conduct a Cluster Analysis
This training module presents the multidimensional analysis technique called Cluster Analysis, also known as automatic group analysis.
Cluster analyses are used to group statistical units that have characteristics in common and assign them to categories not defined a priori. The groups that are formed must be as homogeneous as possible inside (intra-cluster) and heterogeneous outside (inter-cluster).
This type of analysis has many applications: computer science, medicine, biology, marketing.
The last part of the module will be dedicated to the application of cluster analysis with the R software.