= Performing a cluster analysis in R =

The code below (thanks to Emily Ruzich) uses R to perform several commonly used procedures for identifying clusters, including k-means and hierarchical approaches, and to assess the resulting solutions using a scree plot based upon the within-cluster variance and a measure of the between- to within-cluster variation (the average silhouette width) as suggested by Kaufman and Rousseeuw (1990). Plots can also be saved to pdf files from within R. A pdf primer (Zakharov, 2016) on using k-means clustering in psychology is [[attachment:kmeans.pdf | here.]]

{{{
aqdat <- read.csv("U://My Documents//aq 4 point data_reduced.csv")

# the deparse and sink commands are not needed and can be ignored unless putting all the commands here into a single function
# the deparse function tells R to treat all inputs to deparse as string variables regardless of whether or not they are functions or data sets; useful for using as strings later in a function
#filename <- deparse(substitute(aqdat))
#sink(paste("sib_output_cluster_",filename,".txt",sep=""), split=F, append=T, type="output")

# Clustering ------------------------------------------------------------
## http://www.statmethods.net/advstats/cluster.html
## See also Grayson ppt

# compute the total sum of squares across all variables
# = (number of rows - 1) x sum of the variances of each column = (N-1)(sum of variable variances)
wss <- (nrow(aqdat)-1)*sum(apply(aqdat,2,var))
# as a check, this corresponds to the within-cluster sum of squares for a one-cluster
# solution, i.e. for i=1 (the first element of wss below)

# work out the within-cluster sum of squares for the 2- to 10-cluster solutions
# (consider calling set.seed() first so the k-means results are reproducible)
for (i in 2:10) wss[i] <- sum(kmeans(aqdat,centers=i)$withinss)

# scree plot comparing the total within-cluster sum of squares with the number of clusters;
# as with factor analysis, choose the number of clusters at the kink ("elbow"),
# since this shows where the clusters are stabilising
plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# the plot can be saved to a pdf file

# alternative method - prints the suggested number of clusters based on the optimum average silhouette width.
# The average silhouette width measures how much more similar a data point is to
# the points in its own cluster than to the points in the neighbouring cluster. It is
# calculated as the average of the silhouette widths of all the points in the
# data - the higher the better - see L. Kaufman, P. Rousseeuw,
# Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. 1990.
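# (Added illustration, not part of the original script: a minimal sketch of how the
# average silhouette width could be computed directly for each k-means solution
# using the cluster package, as a cross-check on the pamk output below; the
# object names d_sil and asw are just examples)
library(cluster)
d_sil <- dist(aqdat)   # distance matrix for the silhouette calculation
asw <- numeric(10)
for (i in 2:10) {
  km <- kmeans(aqdat, centers=i)
  asw[i] <- mean(silhouette(km$cluster, d_sil)[, "sil_width"])
}
# plot the average silhouette width against the number of clusters; prefer the peak
plot(2:10, asw[2:10], type="b", xlab="Number of Clusters", ylab="Average silhouette width")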
# SEE ALSO THE JESS SHEN THESIS PDF FILE IN THE FOLDER (EMILY RUZICH CLUSTERING IN H RING
# FOLDER/CLUSTER ANALYSES, TAKEN FROM http://www.eecis.udel.edu/~compbio/thesis/JessShenThesis.pdf)
install.packages("fpc") # only needed the first time
library(fpc)
aa <- pamk(aqdat,krange=2:10); print("OUTPUT FOR PAMK FUNCTION"); print(aa)

### 2 clusters ----------------------------------------------------------

### K-Means Cluster Analysis
fit_kmeans_2 <- kmeans(aqdat, 2)
# get cluster descriptive statistics
km2_means <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=mean); print("KM2 MEANS"); print(km2_means)
km2_sds <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=sd); print("KM2 SDS"); print(km2_sds)
km2_ranges <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=range); print("KM2 RANGES"); print(km2_ranges)
# append the k-means cluster assignments
aqdat2 <- data.frame(aqdat, fit_kmeans_2$cluster)

# repeat for the 3-cluster solution
fit_kmeans_3 <- kmeans(aqdat, 3)
# get cluster descriptive statistics
km3_means <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=mean); print("KM3 MEANS"); print(km3_means)
km3_sds <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=sd); print("KM3 SDS"); print(km3_sds)
km3_ranges <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=range); print("KM3 RANGES"); print(km3_ranges)
# append the k-means cluster assignments
mydata <- data.frame(aqdat, fit_kmeans_3$cluster)

### Ward Hierarchical Clustering
d <- dist(aqdat, method = "euclidean") # distance matrix
# note: from R 3.1.0 the "ward" method has been renamed "ward.D", with a corrected "ward.D2" also available
fit_ward_2 <- hclust(d, method="ward.D2")
plot(fit_ward_2) # display the dendrogram
clustnumber2 <- cutree(fit_ward_2, k=2) # cut the tree into 2 clusters
rect.hclust(fit_ward_2, k=2, border="red") # draw red borders around the 2 clusters on the dendrogram
aqdat2 <- data.frame(aqdat, clustnumber2)

fit_ward_3 <- hclust(d, method="ward.D2")
#title <- paste("sibplots_cluster_",filename,"_warddendrogram",sep="")
#pdf(paste(title,".pdf",sep=""))
plot(fit_ward_3) # display the dendrogram
clustnumber3 <- cutree(fit_ward_3, k=3) # cut the tree into 3 clusters
rect.hclust(fit_ward_3, k=3, border="red") # draw red borders around the 3 clusters on the dendrogram
#dev.off() # only needed if the pdf() call above is uncommented

library(cluster)
par(mfrow=c(1,2))
# comparing the cluster solutions from k-means:
# displays the 2- and 3-cluster solutions side by side on one plot
clusplot(aqdat, fit_kmeans_2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
clusplot(aqdat, fit_kmeans_3$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

#sink()
write.csv(mydata, file="U://My Documents//aqdat.with.cluster.assignments.csv")
}}}

In addition, silhouette plots can be used (e.g. via the pam function in R) to choose an optimal number of clusters [[http://ecology.msu.montana.edu/labdsv/R/labs/lab13/lab13.html | (see here).]] Solutions whose clusters have higher silhouette widths (the nearer to 1 the better) are preferred, since higher silhouette widths represent more distinctive clusters.
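For example, a silhouette plot for a 2-cluster pam solution could be produced along the following lines (a minimal sketch assuming the same aqdat data frame as above; the object name fit_pam_2 is illustrative):

{{{
library(cluster)

# partitioning around medoids (pam) for a 2-cluster solution
fit_pam_2 <- pam(aqdat, k=2)

# silhouette plot: each bar is one observation's silhouette width,
# and the average silhouette width is printed at the foot of the plot
plot(silhouette(fit_pam_2), main="Silhouette plot: 2-cluster pam solution")

# the average silhouette width is also stored in the fitted object
fit_pam_2$silinfo$avg.width
}}}

Repeating this for a range of k and comparing the average silhouette widths gives the same kind of comparison that the pamk function in the code above carries out automatically.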
More on clustering methods is [[attachment:shin.pdf | given in the pdf here]], which was downloaded from [[http://www.eecis.udel.edu/~compbio/thesis/JessShenThesis.pdf | here.]]

__References__

Kaufman, L. and Rousseeuw, P. (1990) ''Finding Groups in Data: An Introduction to Cluster Analysis.'' Wiley.

Smith, L. (2018) [[attachment:Clustering.pdf | Cooking up statistics: the science and the art.]] ''Significance'' '''15(5)''' 28-32. An award-winning article featuring uses of agglomerative clustering, dendrograms, and Euclidean and Manhattan distances.

Zakharov, K. (2016) Application of k-means clustering in psychological studies. ''The Quantitative Methods for Psychology'' '''12(2)''' 87-100. doi:10.20982/tqmp.12.2.p087