= Performing a cluster analysis in R =

The code below (thanks to Emily Ruzich) uses R to perform several commonly used procedures for identifying clusters, including k-means and hierarchical approaches, and to assess the resulting solutions using a scree plot based upon the within-cluster variance and a measure of the between- to within-cluster variation (the average silhouette width) as suggested by Kaufman and Rousseeuw (1990). Plots can also be saved to pdf files from within R. A pdf primer (Zakharov, 2016) on using k-means clustering in psychology is [[attachment:kmeans.pdf | here.]]

{{{
aqdat <- read.csv("U://My Documents//aq 4 point data_reduced.csv")

# the deparse and sink commands are not needed and can be ignored unless putting all the commands here into a single function
# the deparse function tells R to treat all inputs to deparse as string variables regardless of whether or not they are functions or data sets; useful for using as strings later in a function
#filename <- deparse(substitute(aqdat))
#sink(paste("sib_output_cluster_",filename,".txt",sep=""), split=F, append=T, type="output")

# Clustering ------------------------------------------------------------
## http://www.statmethods.net/advstats/cluster.html
## See also Grayson ppt

# compute the total sum of squares across all variables
# = (number of rows - 1) x sum of the variances of each column = (N-1)(sum of variable variances)
wss <- (nrow(aqdat)-1)*sum(apply(aqdat,2,var))
# as a check, this corresponds to the within-cluster sum of squares for a one-cluster
# solution, i.e. for i=1 (the first element of wss below)

# work out the within-cluster sum of squares for the 2- to 10-cluster solutions
# (consider calling set.seed() first so the k-means results are reproducible)
for (i in 2:10) wss[i] <- sum(kmeans(aqdat,centers=i)$withinss)

# scree plot comparing the total within-cluster sum of squares with the number of clusters;
# as with factor analysis, choose the number of clusters at the kink ("elbow"),
# since this shows where the clusters are stabilising
plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# the plot can be saved to a pdf file

# alternative method - prints the suggested number of clusters based on the optimum average silhouette width.
# The average silhouette width measures how much more similar a data point is to
# the points in its own cluster than to the points in the neighbouring cluster. It is
# calculated as the average of the silhouette widths of all the points in the
# data - the higher the better - see L. Kaufman, P. Rousseeuw,
# Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. 1990.
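# (Added illustration, not part of the original script: a minimal sketch of how the
# average silhouette width could be computed directly for each k-means solution
# using the cluster package, as a cross-check on the pamk output below; the
# object names d_sil and asw are just examples)
library(cluster)
d_sil <- dist(aqdat)   # distance matrix for the silhouette calculation
asw <- numeric(10)
for (i in 2:10) {
  km <- kmeans(aqdat, centers=i)
  asw[i] <- mean(silhouette(km$cluster, d_sil)[, "sil_width"])
}
# plot the average silhouette width against the number of clusters; prefer the peak
plot(2:10, asw[2:10], type="b", xlab="Number of Clusters", ylab="Average silhouette width")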
# SEE ALSO THE JESS SHEN THESIS PDF FILE IN THE FOLDER (EMILY RUZICH CLUSTERING IN H RING
# FOLDER/CLUSTER ANALYSES, TAKEN FROM http://www.eecis.udel.edu/~compbio/thesis/JessShenThesis.pdf)
install.packages("fpc") # only needed the first time
library(fpc)
aa <- pamk(aqdat,krange=2:10); print("OUTPUT FOR PAMK FUNCTION"); print(aa)

### 2 clusters ----------------------------------------------------------

### K-Means Cluster Analysis
fit_kmeans_2 <- kmeans(aqdat, 2)
# get cluster descriptive statistics
km2_means <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=mean); print("KM2 MEANS"); print(km2_means)
km2_sds <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=sd); print("KM2 SDS"); print(km2_sds)
km2_ranges <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=range); print("KM2 RANGES"); print(km2_ranges)
# append the k-means cluster assignments
aqdat2 <- data.frame(aqdat, fit_kmeans_2$cluster)

# repeat for the 3-cluster solution
fit_kmeans_3 <- kmeans(aqdat, 3)
# get cluster descriptive statistics
km3_means <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=mean); print("KM3 MEANS"); print(km3_means)
km3_sds <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=sd); print("KM3 SDS"); print(km3_sds)
km3_ranges <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=range); print("KM3 RANGES"); print(km3_ranges)
# append the k-means cluster assignments
mydata <- data.frame(aqdat, fit_kmeans_3$cluster)

### Ward Hierarchical Clustering
d <- dist(aqdat, method = "euclidean") # distance matrix
# note: from R 3.1.0 the "ward" method has been renamed "ward.D", with a corrected "ward.D2" also available
fit_ward_2 <- hclust(d, method="ward.D2")
plot(fit_ward_2) # display the dendrogram
clustnumber2 <- cutree(fit_ward_2, k=2) # cut the tree into 2 clusters
rect.hclust(fit_ward_2, k=2, border="red") # draw red borders around the 2 clusters on the dendrogram
aqdat2 <- data.frame(aqdat, clustnumber2)

fit_ward_3 <- hclust(d, method="ward.D2")
#title <- paste("sibplots_cluster_",filename,"_warddendrogram",sep="")
#pdf(paste(title,".pdf",sep=""))
plot(fit_ward_3) # display the dendrogram
clustnumber3 <- cutree(fit_ward_3, k=3) # cut the tree into 3 clusters
rect.hclust(fit_ward_3, k=3, border="red") # draw red borders around the 3 clusters on the dendrogram
#dev.off() # only needed if the pdf() call above is uncommented

library(cluster)
par(mfrow=c(1,2))
# comparing the cluster solutions from k-means:
# displays the 2- and 3-cluster solutions side by side on one plot
clusplot(aqdat, fit_kmeans_2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
clusplot(aqdat, fit_kmeans_3$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

#sink()
write.csv(mydata, file="U://My Documents//aqdat.with.cluster.assignments.csv")
}}}

In addition, silhouette plots can be used (e.g. via the pam function in R) to choose an optimal number of clusters [[http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html | (see here).]] Solutions whose clusters have higher silhouette widths (the nearer to 1 the better) are preferred, since higher silhouette widths represent more distinctive clusters.
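For example, a silhouette plot for a 2-cluster pam solution could be produced along the following lines (a minimal sketch assuming the same aqdat data frame as above; the object name fit_pam_2 is illustrative):

{{{
library(cluster)

# partitioning around medoids (pam) for a 2-cluster solution
fit_pam_2 <- pam(aqdat, k=2)

# silhouette plot: each bar is one observation's silhouette width,
# and the average silhouette width is printed at the foot of the plot
plot(silhouette(fit_pam_2), main="Silhouette plot: 2-cluster pam solution")

# the average silhouette width is also stored in the fitted object
fit_pam_2$silinfo$avg.width
}}}

Repeating this for a range of k and comparing the average silhouette widths gives the same kind of comparison that the pamk function in the code above carries out automatically.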
More on clustering methods is [[attachment:shin.pdf | given in the pdf here]], which was downloaded from [[http://www.eecis.udel.edu/~compbio/thesis/JessShenThesis.pdf | here.]]

__References__

Kaufman, L. and Rousseeuw, P. (1990) ''Finding Groups in Data: An Introduction to Cluster Analysis.'' Wiley.

Smith, L. (2018) [[attachment:Clustering.pdf | Cooking up statistics: the science and the art.]] ''Significance'' '''15(5)''' 28-32. An award-winning article featuring uses of agglomerative clustering, dendrograms, and Euclidean and Manhattan distances.

Zakharov, K. (2016) Application of k-means clustering in psychological studies. ''The Quantitative Methods for Psychology'' '''12(2)''' 87-100. doi:10.20982/tqmp.12.2.p087