Title: | Hierarchical Cluster Analysis of Nominal Data |
---|---|
Description: | Similarity measures for hierarchical clustering of objects characterized by nominal (categorical) variables. Evaluation criteria for nominal data clustering. |
Authors: | Zdenek Sulc [aut, cre], Jana Cibulkova [aut], Hana Rezankova [aut], Jaroslav Hornicek [aut] |
Maintainer: | Zdenek Sulc <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.8.0 |
Built: | 2024-11-07 04:41:29 UTC |
Source: | https://github.com/cran/nomclust |
The function calculates a dissimilarity matrix based on the AN similarity measure.
anderberg(data)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
The Anderberg similarity measure was presented in (Anderberg, 1973). The measure assigns higher weights to infrequent matches and mismatches. It takes on values from zero to one. The minimum similarity is attained when there are no matches, and the maximum when all values match, see (Boriah et al., 2008).
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Anderberg M.R. (1973). Cluster analysis for applications. Academic Press, New York.
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.anderberg <- anderberg(data20)
Converts objects of the class "nomclust" to the class "agnes, twins".
as.agnes(x, ...)
x |
The "nomclust" object containing components "dend" and "prox". |
... |
Further arguments passed to or from other methods. |
The function returns an object of class "agnes, twins".
Zdenek Sulc.
Contact: [email protected]
# sample data
data(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomclust(data20, measure = "lin", method = "average", clu.high = 5, prox = TRUE)

# nomclust plot
plot(hca.object)

# obtaining the agnes, twins object
hca.object.agnes <- as.agnes(hca.object)

# agnes plot
plot(hca.object.agnes)

# obtaining the hclust object
hca.object.hclust <- as.hclust(hca.object)

# hclust plot
plot(hca.object.hclust)
The function calculates a dissimilarity matrix based on the BU similarity measure.
burnaby(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Burnaby similarity measure was presented in (Burnaby, 1970). The measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values, see (Boriah et al., 2008).
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Burnaby T. (1970). On a method for character weighting a similarity coefficient, employing the concept of information.
Mathematical Geology, 2(1), 25-38.
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
anderberg, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.burnaby <- burnaby(data20)

# dissimilarity matrix calculation with variable weights
weights.burnaby <- burnaby(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The dataset contains five different characteristics of 24 clustering algorithms. The "Type" variable expresses the principle on which the clustering is based. There are five possible categories: density, grid, hierarchical, model-based, and partitioning. The binary variable "OptClu" indicates if the clustering algorithm offers the optimal number of clusters. The variable "Large" indicates if the clustering algorithm was designed to cluster large datasets. The "TypicalType" variable presents the typical data type for which the clustering algorithm was designed. There are three possible categories: categorical, mixed, and quantitative. Since some clustering algorithms support more data types, the binary variable "MoreTypes" indicates this support.
data("CA.methods")
data("CA.methods")
A data frame containing 5 variables and 24 cases.
created by the authors of the nomclust package
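Because all five variables are categorical, the dataset can be clustered directly with the functions of the package. The following lines are only an illustrative sketch; the chosen measure, linkage method, and clu.high value are arbitrary choices, not recommendations.

# clustering the 24 algorithms described in CA.methods
data(CA.methods)
hca.ca <- nomclust(CA.methods, measure = "lin", method = "average", clu.high = 5)

# quick summary of the obtained clusters
summary(hca.ca)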
This dataset consists of 5 nominal variables and 20 cases. Its main aim is to demonstrate the desired entry data structure for the nomclust package.
data(data20)
A data frame containing 5 variables and 20 cases.
created by the authors of the nomclust package
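A quick way to inspect this entry structure with base R (shown only for illustration):

# cases in rows, nominal variables in columns
data(data20)
str(data20)
head(data20)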
The function dend.plot() visualizes the hierarchy of clusters using a dendrogram. The function also enables a user to mark the individual clusters with colors. The number of displayed clusters can be defined either by a user or by one of the five evaluation criteria.
dend.plot( x, clusters = "BIC", style = "greys", colorful = TRUE, clu.col = NA, main = "Dendrogram", ac = TRUE, ... )
x |
An output of the nomclust() or nomprox() function containing the dend component. |
clusters |
Either a numeric value or a character string with the name of the evaluation criterion expressing the number of displayed clusters in a dendrogram. The following evaluation criteria can be used: |
style |
A character string or a vector of colors that defines the graphical style of the produced plots. There are two predefined styles in the nomclust package, namely "greys" and "dark". |
colorful |
A logical argument specifying if the output will be colorful or black and white. |
clu.col |
An optional vector of colors which allows a researcher to apply user-defined colors for displayed (marked) clusters in a dendrogram. |
main |
A character string with the chart title. |
ac |
A logical argument indicating if an agglomerative coefficient will be present in the output. |
... |
Other graphical arguments compatible with the generic plot() function. |
The function can be applied to a nomclust() or nomprox() output containing the dend component. This component is not available when the optimization process is used.
The function returns a dendrogram describing the hierarchy of clusters that can help to identify the optimal number of clusters.
Jana Cibulkova and Zdenek Sulc.
Contact: [email protected]
# sample data
data(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomclust(data20, measure = "iof", eval = TRUE)

# a basic plot
dend.plot(hca.object)

# a dendrogram with color-coded clusters according to the BIC index
dend.plot(hca.object, clusters = "BIC", colorful = TRUE)

# using a dark style and specifying own colors in a solution with three clusters
dend.plot(hca.object, clusters = 3, style = "dark", clu.col = c("blue", "red", "green"))

# a black and white dendrogram
dend.plot(hca.object, clusters = 3, style = "dark", colorful = FALSE)
The function calculates a dissimilarity matrix based on the ES similarity measure.
eskin(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Eskin similarity measure was proposed by Eskin et al. (2002) and examined by Boriah et al. (2008). It is constructed to assign higher weights to mismatches on variables with more categories.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Eskin E., Arnold A., Prerau M., Portnoy L. and Stolfo S. (2002). A geometric framework for unsupervised anomaly detection.
In D. Barbara and S. Jajodia (Eds): Applications of Data Mining in Computer Security, p. 78-100. Norwell: Kluwer Academic Publishers.
anderberg, burnaby, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.eskin <- eskin(data20)

# dissimilarity matrix calculation with variable weights
weights.eskin <- eskin(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function visualizes the values of up to eight evaluation criteria for the range of cluster solutions defined by the user in the nomclust, evalclust or nomprox functions. It also indicates the optimal number of clusters determined by these criteria and produces charts for the evaluation criteria available in the nomclust package.
eval.plot( x, criteria = "all", style = "greys", opt.col = "red", main = "Cluster Evaluation", ... )
x |
An output of the "nomclust" object containing the |
criteria |
A character string or a character vector specifying the criteria to be visualized. One particular criterion, a vector of criteria, or all the available criteria (criteria = "all") can be selected. |
style |
A character string or a vector of colors that defines the graphical style of the produced plots. There are two predefined styles in the nomclust package, namely "greys" and "dark". |
opt.col |
An argument specifying the color used to indicate the optimal numbers of clusters. |
main |
A character string with the chart title. |
... |
Other graphical arguments compatible with the generic plot() function. |
The function can display up to eight evaluation criteria. Namely, Within-cluster mutability coefficient (WCM), Within-cluster entropy coefficient (WCE),
Pseudo F Indices based on the mutability (PSFM) and the entropy (PSFE), Bayesian (BIC), and Akaike (AIC) information criteria for categorical data, the BK index, and the silhouette index (SI).
The function returns a series of up to eight plots with evaluation criteria values and the graphical indication of the optimal numbers of clusters (for AIC, BIC, BK, PSFE, PSFM, SI).
Jana Cibulkova and Zdenek Sulc.
Contact: [email protected]
dend.plot, nomclust, evalclust, nomprox.
# sample data
data(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomclust(data20, measure = "iof", eval = TRUE)

# a default series of plots
eval.plot(hca.object)

# changing the color indicating the optimum number of clusters
eval.plot(hca.object, opt.col = "darkorange")

# selecting only AIC and BIC criteria with the dark style
eval.plot(hca.object, criteria = c("AIC", "BIC"), style = "dark")

# selecting only SI
eval.plot(hca.object, criteria = "SI")
The function evaluates clustering results by a set of evaluation criteria (cluster validity indices).
evalclust(data, clusters, diss = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
clusters |
A data.frame or a list of cluster memberships obtained based on the dataset defined in the parameter data. |
diss |
An optional parameter. A matrix or a dist object containing dissimilarities calculated based on the dataset defined in the parameter data. |
The function calculates a set of evaluation criteria if the original dataset and the cluster membership variables are provided. The function calculates up to 13 evaluation criteria described by (Sulc et al., 2018) and (Corter and Gluck, 1992) and provides the optimal number of clusters based on these criteria. It is primarily focused on evaluating hierarchical clustering results obtained by similarity measures different from those that occur in the nomclust package. Thus, it can serve for the comparison of various similarity measures for categorical data.
The function returns a list with three components.
The eval component contains up to 13 evaluation criteria as vectors in a list. Namely, Within-cluster mutability coefficient (WCM), Within-cluster entropy coefficient (WCE), Pseudo F Indices based on the mutability (PSFM) and the entropy (PSFE), Bayesian (BIC), and Akaike (AIC) information criteria for categorical data, the BK index, Category Utility (CU), Category Information (CI), Hartigan Mutability (HM), Hartigan Entropy (HE) and, if the prox component is present, the silhouette index (SI) and the Dunn index (DI).

The opt component is present in the output together with the eval component. It displays the optimal number of clusters for the evaluation criteria from the eval component, except for WCM and WCE, where the optimal number of clusters is based on the elbow method.

The call component contains the function call.
Zdenek Sulc.
Contact: [email protected]
Corter J.E., Gluck M.A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin 111(2), p. 291–303.
Sulc Z., Cibulkova J., Prochazka J., Rezankova H. (2018). Internal Evaluation Criteria for Categorical Data in Hierarchical Clustering: Optimal Number of Clusters Determination, Metodoloski Zveski, 15(2), p. 1-20.
# sample data
data(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomclust(data20, measure = "iof", method = "average", clu.high = 7)

# the cluster memberships
data20.clu <- hca.object$mem

# obtaining evaluation criteria for the provided dataset and cluster memberships
data20.eval <- evalclust(data20, clusters = data20.clu)

# visualization of the evaluation criteria
eval.plot(data20.eval)

# silhouette index can be calculated if the dissimilarity matrix is provided
data20.eval <- evalclust(data20, clusters = data20.clu, diss = hca.object$prox)
eval.plot(data20.eval, criteria = "SI")
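As mentioned in the details, the function can also assess partitions created outside the nomclust package. The sketch below is only illustrative and relies on two assumptions: the cluster package (for the Gower dissimilarity via daisy()) is installed, and a data.frame with one membership column per cluster solution, as produced by cutree(), is an acceptable form of the clusters argument.

# an external dissimilarity and clustering (Gower coefficient + base hclust)
library(cluster)
data(data20)
ext.diss <- daisy(as.data.frame(lapply(data20, as.factor)), metric = "gower")
ext.tree <- hclust(ext.diss, method = "average")

# membership vectors for the 2- to 5-cluster solutions, one column per solution
ext.mem <- as.data.frame(cutree(ext.tree, k = 2:5))

# evaluation of the external partitions by the nomclust criteria
ext.eval <- evalclust(data20, clusters = ext.mem, diss = ext.diss)
eval.plot(ext.eval)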
The function calculates a dissimilarity matrix based on the GA similarity measure.
gambaryan(data)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
The Gambaryan similarity measure was presented in (Gambaryan, 1964). The measure assigns low weight to matches where the matching value occurs in about half the dataset, i.e., in between being frequent and rare, see (Boriah et al., 2008).
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Gambaryan P. (1964). A mathematical model of taxonomy.
SSR, 17(12), 47-53.
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
anderberg, burnaby, eskin, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.gambaryan <- gambaryan(data20)
The function calculates a dissimilarity matrix based on the G1 similarity measure.
goodall1(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Goodall 1 similarity measure was presented in (Boriah et al., 2008). It is a simple modification of the original Goodall measure (Goodall, 1966). The measure assigns higher weights to infrequent matches.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Goodall D.W. (1966). A new similarity index based on probability. Biometrics, 22(4), p. 882.
anderberg, burnaby, eskin, gambaryan, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.goodall1 <- goodall1(data20)

# dissimilarity matrix calculation with variable weights
weights.goodall1 <- goodall1(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the G2 similarity measure.
goodall2(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Goodall 2 similarity measure was presented in (Boriah et al., 2008). It is a simple modification of the original Goodall measure (Goodall, 1966). The measure assigns a higher weight to infrequent matches, provided that there are other categories that are even less frequent than the matching one.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Goodall D.W. (1966). A new similarity index based on probability. Biometrics, 22(4), p. 882.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.goodall2 <- goodall2(data20)

# dissimilarity matrix calculation with variable weights
weights.goodall2 <- goodall2(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the G3 similarity measure.
goodall3(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Goodall 3 similarity measure was presented in (Boriah et al., 2008). It is a simple modification of the original Goodall measure (Goodall, 1966). The measure assigns a higher weight if the infrequent categories match, regardless of the frequencies of the other categories.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Goodall D.W. (1966). A new similarity index based on probability. Biometrics, 22(4), p. 882.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall4, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.goodall3 <- goodall3(data20)

# dissimilarity matrix calculation with variable weights
weights.goodall3 <- goodall3(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the G4 similarity measure.
goodall4(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Goodall 4 similarity measure was presented in (Boriah et al., 2008). It is a simple modification of the original Goodall measure (Goodall, 1966). It assigns higher weights to matches on frequent categories.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Goodall D.W. (1966). A new similarity index based on probability. Biometrics, 22(4), p. 882.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, iof, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.goodall4 <- goodall4(data20)

# dissimilarity matrix calculation with variable weights
weights.goodall4 <- goodall4(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the IOF similarity measure.
iof(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The IOF (Inverse Occurrence Frequency) measure was originally constructed for text mining tasks, see (Sparck Jones, 1972); later, it was adjusted for categorical variables, see (Boriah et al., 2008). The measure assigns higher weight to mismatches on less frequent values and vice versa.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Sparck Jones K. (1972). A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation, 28(1), p. 11-21. Later: Journal of Documentation, 60(5) (2002), p. 493-502.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, lin, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.iof <- iof(data20)

# dissimilarity matrix calculation with variable weights
weights.iof <- iof(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the LIN similarity measure.
lin(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Lin measure was introduced by Lin (1998) and presented in (Boriah et al., 2008). The measure assigns higher weights to more frequent categories in case of matches and lower weights to less frequent categories in case of mismatches.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Lin D. (1998). An information-theoretic definition of similarity.
In: ICML '98: Proceedings of the 15th International Conference on Machine Learning. San Francisco, p. 296-304.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin1, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.lin <- lin(data20)

# dissimilarity matrix calculation with variable weights
weights.lin <- lin(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the LIN1 similarity measure.
lin1(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Lin 1 similarity measure was introduced in (Boriah et al., 2008) as a modification of the original Lin measure (Lin, 1998). It has a complex system of weights. In case of a mismatch, lower similarity is assigned if either the mismatching values are very frequent or their relative frequency is in between the relative frequencies of the mismatching values. Higher similarity is assigned if the mismatched categories are infrequent and there are a few other infrequent categories. In case of a match, lower similarity is given to matches on frequent categories or matches on categories that have many other values of the same frequency. Higher similarity is given to matches on infrequent categories.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Lin D. (1998). An information-theoretic definition of similarity.
In: ICML '98: Proceedings of the 15th International Conference on Machine Learning. San Francisco, p. 296-304.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, of, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.lin1 <- lin1(data20)

# dissimilarity matrix calculation with variable weights
weights.lin1 <- lin1(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function performs and evaluates hierarchical cluster analysis of nominal data.
nomclust( data, measure = "lin", method = "average", clu.high = 6, eval = TRUE, prox = 100, var.weights = NULL )
data |
A data.frame or a matrix with cases in rows and variables in columns. |
measure |
A character string defining the similarity measure used for computation of the proximity matrix in HCA: "anderberg", "burnaby", "eskin", "gambaryan", "goodall1", "goodall2", "goodall3", "goodall4", "iof", "lin", "lin1", "of", "sm", "smirnov", "ve", "vm". |
method |
A character string defining the clustering method. The following methods can be used: "average", "complete", "single". |
clu.high |
A numeric value expressing the maximal number of clusters for which the cluster membership variables are produced. |
eval |
A logical value; if TRUE, evaluation of the clustering results is performed. |
prox |
A logical value or a numeric value. The logical value TRUE indicates that the proximity matrix is part of the output. A numeric value (integer) indicates the maximal number of cases in a dataset for which the proximity matrix will be included in the output. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The function runs hierarchical cluster analysis (HCA) with objects characterized by nominal variables (without natural order of categories). It completely covers the clustering process, from the dissimilarity matrix calculation to the cluster quality evaluation. The function enables a user to choose from the similarity measures for nominal data summarized by (Boriah et al., 2008) and by (Sulc and Rezankova, 2019). Next, it offers to choose from three linkage methods that can be used for categorical data. It is also possible to assign user-defined variable weights. The obtained clusters can be evaluated by up to 13 evaluation criteria (Sulc et al., 2018) and (Corter and Gluck, 1992). The output of the nomclust() function may serve as an input for the visualization functions dend.plot and eval.plot in the nomclust package.
The function returns a list with up to six components.
The mem component contains cluster membership partitions for the selected numbers of clusters in the form of a list.

The eval component contains up to 13 evaluation criteria as vectors in a list. Namely, Within-cluster mutability coefficient (WCM), Within-cluster entropy coefficient (WCE), Pseudo F Indices based on the mutability (PSFM) and the entropy (PSFE), Bayesian (BIC), and Akaike (AIC) information criteria for categorical data, the BK index, Category Utility (CU), Category Information (CI), Hartigan Mutability (HM), Hartigan Entropy (HE) and, if the prox component is present, the silhouette index (SI) and the Dunn index (DI).

The opt component is present in the output together with the eval component. It displays the optimal number of clusters for the evaluation criteria from the eval component, except for WCM and WCE, where the optimal number of clusters is based on the elbow method.

The dend component can be found in the output together with the prox component. It contains all the necessary information for dendrogram creation.

The prox component contains the dissimilarity matrix in the form of the "dist" object.

The call component contains the function call.
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Corter J.E., Gluck M.A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin 111(2), p. 291–303.
Sulc Z., Cibulkova J., Prochazka J., Rezankova H. (2018). Internal Evaluation Criteria for Categorical Data in Hierarchical Clustering: Optimal Number of Clusters Determination, Metodoloski Zveski, 15(2), p. 1-20.
Sulc Z. and Rezankova H. (2019). Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. Journal of Classification, 35(1), p. 58-72. DOI: 10.1007/s00357-019-09317-5.
evalclust, nomprox, eval.plot, dend.plot.
# sample data
data(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomclust(data20, measure = "lin", method = "average", clu.high = 5, prox = TRUE)

# assigning variable weights
hca.weights <- nomclust(data20, measure = "lin", method = "average", clu.high = 5, prox = TRUE, var.weights = c(0.7, 1, 0.9, 0.5, 0))

# quick clustering summary
summary(hca.object)

# quick cluster quality evaluation
print(hca.object)

# visualization of the evaluation criteria
eval.plot(hca.object)

# a quick dendrogram
plot(hca.object)

# a dendrogram with three designated clusters
dend.plot(hca.object, clusters = 3)

# obtaining values of evaluation indices as a data.frame
data20.eval <- as.data.frame(hca.object$eval)

# getting the optimal numbers of clusters as a data.frame
data20.opt <- as.data.frame(hca.object$opt)

# extracting cluster membership variables as a data.frame
data20.mem <- as.data.frame(hca.object$mem)

# obtaining a proximity matrix
data20.prox <- as.matrix(hca.object$prox)

# setting the maximal number of objects for which a proximity matrix is provided in the output to 30
hca.object <- nomclust(data20, measure = "iof", method = "complete", clu.high = 5, prox = 30)

# transforming the nomclust object to the class "hclust"
hca.object.hclust <- as.hclust(hca.object)

# transforming the nomclust object to the class "agnes, twins"
hca.object.agnes <- as.agnes(hca.object)
The function performs hierarchical cluster analysis based on a dissimilarity matrix.
nomprox( diss, data = NULL, method = "average", clu.high = 6, eval = TRUE, prox = 100 )
diss |
A proximity matrix or a dist object calculated based on the dataset defined in the parameter data. |
data |
A data.frame or a matrix with cases in rows and variables in columns. |
method |
A character string defining the clustering method. The following methods can be used: "average", "complete", "single". |
clu.high |
A numeric value that expresses the maximal number of clusters for which the cluster membership variables are produced. |
eval |
A logical value; if TRUE, evaluation of clustering results is performed. |
prox |
A logical value or a numeric value. The logical value TRUE indicates that the proximity matrix is part of the output. A numeric value (integer) indicates the maximal number of cases in a dataset for which the proximity matrix will be included in the output. |
The function performs hierarchical cluster analysis in situations when the proximity (dissimilarity) matrix was calculated externally, for instance, in a different R package, in a user-defined function, or in other software. It offers three linkage methods that can be used for categorical data. The obtained clusters can be evaluated by up to 13 evaluation criteria (Sulc et al., 2018) and (Corter and Gluck, 1992). A sketch of such an external workflow follows the examples below.
The function returns a list with up to six components:
The mem component contains cluster membership partitions for the selected numbers of clusters in the form of a list.

The eval component contains up to 13 evaluation criteria as vectors in a list. Namely, Within-cluster mutability coefficient (WCM), Within-cluster entropy coefficient (WCE), Pseudo F Indices based on the mutability (PSFM) and the entropy (PSFE), Bayesian (BIC), and Akaike (AIC) information criteria for categorical data, the BK index, Category Utility (CU), Category Information (CI), Hartigan Mutability (HM), Hartigan Entropy (HE) and, if the prox component is present, the silhouette index (SI) and the Dunn index (DI).

The opt component is present in the output together with the eval component. It displays the optimal number of clusters for the evaluation criteria from the eval component, except for WCM and WCE, where the optimal number of clusters is based on the elbow method.

The dend component can be found in the output only together with the prox component. It contains all the necessary information for dendrogram creation.

The prox component contains the dissimilarity matrix in the form of the "dist" object.

The call component contains the function call.
Zdenek Sulc.
Contact: [email protected]
Corter J.E., Gluck M.A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin 111(2), p. 291–303.
Sulc Z., Cibulkova J., Prochazka J., Rezankova H. (2018). Internal Evaluation Criteria for Categorical Data in Hierarchical Clustering: Optimal Number of Clusters Determination, Metodoloski Zveski, 15(2), p. 1-20.
nomclust
, evalclust
, eval.plot
.
# sample data
data(data20)

# computation of a dissimilarity matrix using the iof similarity measure
diss.matrix <- iof(data20)

# creating an object with results of hierarchical clustering
hca.object <- nomprox(diss = diss.matrix, data = data20, method = "complete", clu.high = 5, eval = TRUE, prox = FALSE)

# quick clustering summary
summary(hca.object)

# quick cluster quality evaluation
print(hca.object)

# visualization of the evaluation criteria
eval.plot(hca.object)

# a dendrogram can be displayed if the object contains the prox component
hca.object <- nomprox(diss = diss.matrix, data = data20, method = "complete", clu.high = 5, eval = TRUE, prox = TRUE)

# a quick dendrogram
plot(hca.object)

# a dendrogram with three designated clusters
dend.plot(hca.object, clusters = 3)
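A sketch of the externally calculated dissimilarity mentioned in the details. It assumes the cluster package is available; the Gower coefficient computed by daisy() stands in for any proximity matrix obtained outside nomclust.

# an external (non-nomclust) dissimilarity used as the input of nomprox()
library(cluster)
data(data20)
ext.diss <- daisy(as.data.frame(lapply(data20, as.factor)), metric = "gower")

# hierarchical clustering and evaluation based on the external dissimilarity
hca.ext <- nomprox(diss = ext.diss, data = data20, method = "average", clu.high = 5)
eval.plot(hca.ext)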
The function calculates a dissimilarity matrix based on the OF similarity measure.
of(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The OF (Occurrence Frequency) measure was originally constructed for text mining tasks, see (Sparck Jones, 1972); later, it was adjusted for categorical variables, see (Boriah et al., 2008). It assigns higher weight to mismatches on less frequent values and vice versa.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Sparck Jones K. (1972). A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation, 28(1), p. 11-21. Later: Journal of Documentation, 60(5) (2002), p. 493-502.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, sm, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.of <- of(data20)

# dissimilarity matrix calculation with variable weights
weights.of <- of(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
The function calculates a dissimilarity matrix based on the SM similarity measure.
sm(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The simple matching coefficient (Sokal, 1958) represents the simplest way of measuring similarity. It does not impose any weights. For a given variable, it assigns the value 1 in case of a match and the value 0 otherwise.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Sokal R., Michener C. (1958). A statistical method for evaluating systematic relationships. In: Science bulletin, 38(22),
The University of Kansas.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, smirnov, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.sm <- sm(data20)

# dissimilarity matrix calculation with variable weights
weights.sm <- sm(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
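The per-variable scoring can also be illustrated by hand. This is a conceptual sketch with made-up vectors; it does not reproduce the internal code of sm(), and the final 1 - similarity step is shown only as a common way to turn a similarity into a dissimilarity.

# two hypothetical cases described by five nominal variables
x <- c("a", "b", "c", "a", "b")
y <- c("a", "b", "d", "a", "c")

# per-variable score: 1 for a match, 0 for a mismatch
matches <- sum(x == y)

# simple matching similarity = proportion of matching variables
sm.sim <- matches / length(x)

# a common dissimilarity transformation
sm.dis <- 1 - sm.sim
sm.dis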
The function calculates a dissimilarity matrix based on the SV similarity measure.
smirnov(data)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
The Smirnov similarity measure was presented in (Smirnov, 1968). The measure assigns high similarity to matches when the frequency of the matching value is low, and the other values occur frequently, see (Boriah et al., 2008).
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Smirnov E.S. (1968). On exact methods in systematics.
Systematic Zoology, 17(1), 1-13.
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, ve, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.smirnov <- smirnov(data20)
The function calculates a dissimilarity matrix based on the VE similarity measure.
ve(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Variable Entropy similarity measure was introduced in (Sulc and Rezankova, 2019). It treats the similarity between two categories based on the within-cluster variability expressed by the normalized entropy. The measure assigns higher weights to rare categories.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Boriah S., Chandola V., Kumar V. (2008). Similarity measures for categorical data: A comparative evaluation.
In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
Sulc Z. and Rezankova H. (2019). Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. Journal of Classification, 35(1), p. 58-72. DOI: 10.1007/s00357-019-09317-5.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, vm.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.ve <- ve(data20)

# dissimilarity matrix calculation with variable weighting
prox.ve.2 <- ve(data20, var.weights = c(1, 0.8, 0.6, 0.4, 0.2))

# dissimilarity matrix calculation with variable weights
weights.ve <- ve(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))
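To give a feel for the quantity behind the measure, the normalized entropy of a single nominal variable can be computed as follows. This is a conceptual illustration only; the exact per-category weighting applied inside ve() is not reproduced here.

# normalized entropy of the first variable of data20, scaled to [0, 1]
data(data20)
x <- as.factor(data20[, 1])
p <- as.numeric(table(x)) / length(x)   # relative category frequencies
norm.entropy <- -sum(p * log(p)) / log(nlevels(x))
norm.entropy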
The function calculates a dissimilarity matrix based on the VM similarity measure.
vm(data, var.weights = NULL)
data |
A data.frame or a matrix with cases in rows and variables in columns. |
var.weights |
A numeric vector setting weights to the used variables. One can choose the real numbers from zero to one. |
The Variable Mutability similarity measure was introduced in (Sulc and Rezankova, 2019). It treats the similarity between two categories based on the within-cluster variability expressed by the normalized mutability. The measure assigns higher weights to rarer categories.
The function returns an object of the class "dist".
Zdenek Sulc.
Contact: [email protected]
Sulc Z. and Rezankova H. (2019). Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. Journal of Classification, 35(1), p. 58-72. DOI: 10.1007/s00357-019-09317-5.
anderberg, burnaby, eskin, gambaryan, goodall1, goodall2, goodall3, goodall4, iof, lin, lin1, of, sm, smirnov, ve.
# sample data
data(data20)

# dissimilarity matrix calculation
prox.vm <- vm(data20)

# dissimilarity matrix calculation with variable weights
weights.vm <- vm(data20, var.weights = c(0.7, 1, 0.9, 0.5, 0))