clara(x, k, metric="euclidean", stand=F, samples=5, sampsize=40 + 2 * k)
x
data matrix or data frame; each row corresponds to an observation,
and each column corresponds to a variable. All variables must be numeric.
Missing values (NAs) are allowed.

k
integer, the number of clusters.

metric
character string specifying the metric to be used for calculating
dissimilarities between objects.
The currently available options are "euclidean" and "manhattan".
Euclidean distances are root sum-of-squares of differences, and
manhattan distances are the sum of absolute differences.

stand
logical flag: if TRUE, then the measurements in x are standardized before
calculating the dissimilarities. Measurements are standardized for each
variable (column), by subtracting the variable's mean value and dividing by
the variable's mean absolute deviation.

samples
integer, number of samples to be drawn from the dataset.

sampsize
integer, number of objects in each sample. sampsize should be higher than
the number of clusters (k) and at most the number of objects (nrow(x)).
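The two metrics and the standardization described for the metric and stand arguments can be written out in a few lines of base R. This is only an illustrative sketch: the helper names standardize_mad, euclidean_dist and manhattan_dist are hypothetical and are not part of clara, and the distance helpers assume complete observations (no NAs).

```r
# Hypothetical helpers, base R only; not the code used inside clara.

# Standardize each column: subtract the column mean, then divide by the
# column's mean absolute deviation (not the standard deviation).
standardize_mad <- function(x) {
  x <- as.matrix(x)
  centred <- sweep(x, 2, colMeans(x, na.rm = TRUE), "-")
  mean_abs_dev <- colMeans(abs(centred), na.rm = TRUE)
  sweep(centred, 2, mean_abs_dev, "/")
}

# Euclidean distance: root sum-of-squares of the differences.
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))

# Manhattan distance: sum of the absolute differences.
manhattan_dist <- function(a, b) sum(abs(a - b))

euclidean_dist(c(0, 0), c(3, 4))   # 5
manhattan_dist(c(0, 0), c(3, 4))   # 7
```

After standardization every column has mean 0 and mean absolute deviation 1, so variables measured on very different scales contribute comparably to the dissimilarities.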
clara is fully described in chapter 3 of Kaufman and Rousseeuw (1990).
Compared to other partitioning methods such as pam, it can deal with
much larger datasets. Internally, this is achieved by considering
sub-datasets of fixed size, so that the time and storage requirements
become linear in nrow(x) rather than quadratic.

Each sub-dataset is partitioned into k clusters using the same
algorithm as in the pam function.
Once k representative objects (medoids) have been selected from the
sub-dataset, each object of the entire dataset is assigned
to the nearest medoid.
The sum of the dissimilarities of the objects to their closest medoid is
used as a measure of the quality of the clustering. The sub-dataset
for which the sum is minimal is retained.
A further analysis is carried out on the final partition.

Each sub-dataset is forced to contain the medoids obtained from the best
sub-dataset until then.
Randomly drawn objects are added to this set until sampsize
has been reached.

An object of class "clara" representing the clustering.
See clara.object for details.
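The sampling procedure described above can be sketched roughly as follows. This is not the implementation behind clara: the function name clara_sketch is hypothetical, it assumes the pam function from R's cluster package is available, it uses Euclidean distances only, and it omits standardization, missing-value handling and the further analysis of the final partition.

```r
library(cluster)  # assumed available; provides pam()

# Hypothetical sketch of the sub-sampling loop; not the real clara code.
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(sampsize > k, sampsize <= n)
  best <- list(cost = Inf, medoids = integer(0))
  for (s in seq_len(samples)) {
    # Force the best medoids found so far into the sub-dataset, then add
    # randomly drawn objects until sampsize is reached.
    pool <- setdiff(seq_len(n), best$medoids)
    extra <- pool[sample.int(length(pool), sampsize - length(best$medoids))]
    idx <- c(best$medoids, extra)
    # Partition the sub-dataset into k clusters with pam.
    fit <- pam(x[idx, , drop = FALSE], k)
    med <- idx[fit$id.med]  # medoid row numbers in the full dataset
    # Distance from every object of the entire dataset to each medoid.
    d <- sapply(med, function(m) sqrt(rowSums(sweep(x, 2, x[m, ], "-")^2)))
    cost <- sum(apply(d, 1, min))  # sum of distances to the closest medoid
    if (cost < best$cost) {
      best <- list(cost = cost, medoids = med,
                   clustering = max.col(-d, ties.method = "first"))
    }
  }
  best
}
```

Each pass computes dissimilarities only within a sub-dataset of sampsize objects plus the distances from all objects to the k medoids, which is why the cost grows linearly with nrow(x) rather than quadratically.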
Partitioning methods like pam, clara, and fanny require that the number
of clusters be given by the user.
Hierarchical methods like agnes, diana, and mona construct a
hierarchy of clusterings, with the number of clusters ranging from one to
the number of objects.
For small datasets, pam can be used directly.
clara.object, partition.object, pam, plot.partition.
# generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2)
clarax
clarax$clusinfo
plot(clarax)
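A variation on the example, using the non-default arguments documented above, might look like the following. It assumes the clara function from R's cluster package; the component names medoids and clustering are taken from that package's clara.object and may differ in other implementations, and the seed value is arbitrary.

```r
library(cluster)  # assumed available; provides clara()

set.seed(42)  # arbitrary seed, only to make the generated data reproducible
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))

# Manhattan distances on standardized measurements, drawing 10 samples
# of 50 objects each instead of the default 5 samples of 40 + 2 * k.
fit <- clara(x, 2, metric = "manhattan", stand = TRUE,
             samples = 10, sampsize = 50)

fit$medoids            # one row per cluster: the representative objects
table(fit$clustering)  # number of objects assigned to each cluster
fit$clusinfo           # per-cluster summary, as in the example above
```

Drawing more or larger samples raises the chance that some sub-dataset contains good medoids, at the price of more computation.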