

clara(cluster)                               R Documentation

_C_l_u_s_t_e_r_i_n_g _L_a_r_g_e _A_p_p_l_i_c_a_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n_:

     Returns a list representing a clustering of the data
     into `k' clusters.

_U_s_a_g_e_:

     clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
           sampsize = 40 + 2 * k)

_A_r_g_u_m_e_n_t_s_:

       x: data matrix or data frame; each row corresponds to
          an observation, and each column corresponds to a
          variable.  All variables must be numeric.  Missing
          values (NAs) are allowed.

       k: integer, the number of clusters.  It is required
          that 0 < k < n, where n is the number of
          observations.

  metric: character string specifying the metric to be used
          for calculating dissimilarities between
          observations.  The currently available options are
          "euclidean" and "manhattan".  Euclidean distances
          are the square root of the sum of squared
          differences, and manhattan distances are the sum
          of absolute differences.

   stand: logical flag: if TRUE, the measurements in `x' are
          standardized before the dissimilarities are
          calculated.  Each variable (column) is
          standardized by subtracting its mean and dividing
          by its mean absolute deviation.

 samples: integer, number of samples to be drawn from the
          dataset.

sampsize: integer, number of observations in each sample.
          `sampsize' should be larger than the number of
          clusters (`k') and at most the number of
          observations (nrow(`x')).
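
     The standardization and the two metrics described above can
     be sketched in base R as follows.  This is an illustration
     only, not the code used inside `clara'; the helper name
     `standardize_mad' and the toy matrix are hypothetical, and
     skipping NAs when computing the column statistics is an
     assumption of this sketch.

```r
# Hypothetical helper: standardize each column by subtracting its mean
# and dividing by its mean absolute deviation.
standardize_mad <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x, na.rm = TRUE))   # subtract column means
  mean_abs_dev <- colMeans(abs(centered), na.rm = TRUE)
  sweep(centered, 2, mean_abs_dev, "/")                # divide by mean abs. deviation
}

# Toy data: 3 observations (rows), 2 variables (columns).
x <- rbind(c(1, 10), c(2, 20), c(3, 60))
z <- standardize_mad(x)

# Dissimilarity between the first two standardized observations.
d <- z[1, ] - z[2, ]
euclidean <- sqrt(sum(d^2))   # square root of the sum of squared differences
manhattan <- sum(abs(d))      # sum of absolute differences
```

     After standardization every column has mean absolute
     deviation 1, so variables measured on very different scales
     contribute comparably to either metric.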

_D_e_t_a_i_l_s_:

     `clara' is fully described in chapter 3 of Kaufman and
     Rousseeuw (1990).  Compared to other partitioning
     methods such as `pam', it can deal with much larger
     datasets.  Internally, this is achieved by considering
     sub-datasets of fixed size, so that the time and
     storage requirements become linear in nrow(`x') rather
     than quadratic.
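
     A rough count makes the difference concrete.  The numbers
     below assume 3000 observations and k = 3 (the size of the
     `xclara' data used in the Examples) and the default
     `sampsize'; they count dissimilarities only and are not a
     measurement of the actual implementation.

```r
n <- 3000                  # number of observations
k <- 3                     # number of clusters
sampsize <- 40 + 2 * k     # default sample size: 46

# All pairwise dissimilarities of the full dataset grow quadratically in n.
pairs_full <- n * (n - 1) / 2                    # 4498500

# Pairwise dissimilarities within one sub-dataset do not depend on n.
pairs_sample <- sampsize * (sampsize - 1) / 2    # 1035

# Assigning every observation to its nearest medoid is linear in n.
assign_dists <- n * k                            # 9000
```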

     Each sub-dataset is partitioned into `k' clusters using
     the same algorithm as in the `pam' function.  Once `k'
     representative objects (medoids) have been selected
     from the sub-dataset, each observation of the entire
     dataset is assigned to the nearest medoid.  The sum of
     the dissimilarities of the observations to their
     closest medoid is used as a measure of the quality of
     the clustering.  The sub-dataset for which this sum is
     minimal is retained, and a further analysis is carried
     out on the final partition.

     Each sub-dataset is forced to contain the medoids
     obtained from the best sub-dataset found so far.
     Randomly drawn observations are added to this set
     until `sampsize' has been reached.
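
     The sampling loop described above can be sketched as
     follows.  This is a simplified illustration under stated
     assumptions, not the implementation of `clara': the
     function name `clara_sketch' and its return value are
     hypothetical, only Euclidean distances are used, missing
     values are not handled, and the further analysis of the
     final partition is omitted.  It relies on `pam' and its
     `id.med' component from the cluster package.

```r
library(cluster)  # provides pam(); assumed to be installed

# Hypothetical sketch of the sampling loop; x must be numeric without NAs.
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(cost = Inf, medoid_rows = integer(0), clustering = NULL)

  for (s in seq_len(samples)) {
    # Start from the best medoids found so far (none in the first round),
    # then add randomly drawn rows until sampsize is reached.
    others <- setdiff(seq_len(n), best$medoid_rows)
    n_extra <- sampsize - length(best$medoid_rows)
    rows <- c(best$medoid_rows, others[sample.int(length(others), n_extra)])

    # Partition the sub-dataset around k medoids.
    fit <- pam(x[rows, , drop = FALSE], k)
    medoid_rows <- rows[fit$id.med]   # map back to rows of the full dataset

    # Euclidean distance from every observation to each medoid (n x k matrix).
    d <- sapply(medoid_rows, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))

    # Assign each observation to its nearest medoid and sum the distances.
    clustering <- max.col(-d, ties.method = "first")
    cost <- sum(d[cbind(seq_len(n), clustering)])

    # Retain the sub-dataset whose medoids give the smallest total.
    if (cost < best$cost) {
      best <- list(cost = cost, medoid_rows = medoid_rows,
                   clustering = clustering)
    }
  }
  best
}

set.seed(1)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
res <- clara_sketch(x, 2)
table(res$clustering)   # cluster sizes over the full dataset
```

     Only the `sampsize' rows of each sub-dataset are passed to
     `pam'; the full dataset is touched only in the assignment
     step, which is what keeps the cost linear in nrow(`x').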

_V_a_l_u_e_:

     An object of class `"clara"' representing the
     clustering.  See `clara.object' for details.
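
     As a brief illustration, the returned object can be
     inspected like any R list.  The component names used below
     (`medoids', `clustering', `clusinfo') are taken from the
     `clara.object' help page; if in doubt, names(clarax) lists
     what a given version provides.

```r
library(cluster)  # provides clara(); assumed to be installed

set.seed(42)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
clarax <- clara(x, 2)

class(clarax)              # includes "clara"
clarax$medoids             # coordinates of the k medoids
table(clarax$clustering)   # number of observations per cluster
clarax$clusinfo            # per-cluster summary
```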

_B_A_C_K_G_R_O_U_N_D_:

     Cluster analysis divides a dataset into groups
     (clusters) of observations that are similar to each
     other.  Partitioning methods like `pam', `clara', and
     `fanny' require that the number of clusters be given by
     the user.  Hierarchical methods like `agnes', `diana',
     and `mona' construct a hierarchy of clusterings, with
     the number of clusters ranging from one to the number
     of observations.

_N_O_T_E_:

     For small datasets (say, with fewer than 200
     observations), the function `pam' can be used directly.

_R_e_f_e_r_e_n_c_e_s_:

     Kaufman, L. and Rousseeuw, P.J. (1990).  Finding Groups
     in Data: An Introduction to Cluster Analysis.  Wiley,
     New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997).
     Integrating Robust Clustering Techniques in S-PLUS.
     Computational Statistics and Data Analysis, 26, 17-37.

_S_e_e _A_l_s_o_:

     `clara.object', `pam', `partition.object',
     `plot.partition'.

_E_x_a_m_p_l_e_s_:

     ## generate 500 objects, divided into 2 clusters.
     x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
                cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
     clarax <- clara(x, 2)
     clarax            # print the clustering
     clarax$clusinfo   # per-cluster summary
     plot(clarax)

     ## `xclara' is an artificial data set with 3 clusters of 1000 bivariate
     ## objects each.
     data(xclara)
     ## Plot similar to Figure 5 in Struyf et al (1996)
     plot(clara(xclara, 3), ask = TRUE)


