

pam(cluster)                                 R Documentation

_P_a_r_t_i_t_i_o_n_i_n_g _A_r_o_u_n_d _M_e_d_o_i_d_s

_D_e_s_c_r_i_p_t_i_o_n_:

     Returns a list representing a clustering of the data
     into `k' clusters.

_U_s_a_g_e_:

     pam(x, k, diss = FALSE, metric = "euclidean", stand = FALSE)

_A_r_g_u_m_e_n_t_s_:

       x: data matrix or data frame, or dissimilarity
          matrix, depending on the value of the `diss'
          argument.

          In case of a matrix or data frame, each row
          corresponds to an observation, and each column
          corresponds to a variable. All variables must be
          numeric.  Missing values (NAs) are allowed.

          In case of a dissimilarity matrix, `x' is
          typically the output of `daisy' or `dist'. Also a
          vector of length n*(n-1)/2 is allowed (where n is
          the number of observations), and will be
          interpreted in the same way as the output of the
          above-mentioned functions. Missing values (NAs)
          are not allowed.
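          The n*(n-1)/2 convention matches the
          lower-triangle storage used by `dist'. A quick
          check (the data here are made up for
          illustration):

```r
## A `dist' object for n observations stores the
## n*(n-1)/2 lower-triangle dissimilarities that pam()
## also accepts as a plain vector.
n <- 5
d <- dist(matrix(rnorm(n * 2), ncol = 2))
length(d)  # n*(n-1)/2 = 10
```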

       k: integer, the number of clusters.  It is required
          that 0 < k < n, where n is the number of
          observations.

    diss: logical flag: if TRUE, then `x' will be considered
          a dissimilarity matrix. If FALSE, then `x' will be
          considered a matrix of observations by variables.

  metric: character string specifying the metric to be used
          for calculating dissimilarities between
          observations.  The currently available options
          are "euclidean" and "manhattan".  Euclidean
          distances are root sums of squares of
          differences, and Manhattan distances are sums of
          absolute differences.  If `x' is already a
          dissimilarity matrix, then this argument will be
          ignored.
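          The two metrics can be sketched as follows (the
          points are made up for illustration):

```r
## Euclidean vs. Manhattan dissimilarity between two
## made-up observations a and b.
a <- c(0, 0)
b <- c(3, 4)
sqrt(sum((a - b)^2))  # euclidean: 5
sum(abs(a - b))       # manhattan: 7
## dist() reproduces both values:
m <- rbind(a, b)
as.numeric(dist(m, method = "euclidean"))  # 5
as.numeric(dist(m, method = "manhattan"))  # 7
```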

   stand: logical flag: if TRUE, the measurements in `x'
          are standardized before calculating the
          dissimilarities. Measurements are standardized
          for each variable (column) by subtracting the
          variable's mean value and dividing by the
          variable's mean absolute deviation.  If `x' is
          already a dissimilarity matrix, then this
          argument will be ignored.
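          The standardization described above can be
          sketched as follows (`x' here is a made-up
          matrix; `pam' performs this internally when
          stand = TRUE):

```r
## Standardize each column by subtracting its mean and
## dividing by its mean absolute deviation, as described
## above.  x is a made-up numeric matrix.
x   <- cbind(c(1, 2, 3, 10), c(100, 200, 300, 400))
ctr <- colMeans(x)
dev <- sweep(x, 2, ctr)                         # deviations from the mean
s   <- apply(dev, 2, function(v) mean(abs(v)))  # mean absolute deviation
xs  <- sweep(dev, 2, s, "/")                    # standardized measurements
colMeans(xs)  # each column now has mean 0
```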

_D_e_t_a_i_l_s_:

     `pam' is fully described in chapter 2 of Kaufman and
     Rousseeuw (1990).  Compared to the k-means approach in
     `kmeans', the function `pam' has the following
     features: (a) it also accepts a dissimilarity matrix;
     (b) it is more robust because it minimizes a sum of
     dissimilarities instead of a sum of squared Euclidean
     distances; (c) it provides a novel graphical display,
     the silhouette plot (see `plot.partition'), which also
     helps in selecting the number of clusters.

     The `pam' algorithm is based on the search for `k'
     representative objects, or medoids, among the
     observations of the dataset. These observations should
     represent the structure of the data. After finding a
     set of `k' medoids, `k' clusters are constructed by
     assigning each observation to the nearest medoid. The
     goal is to find `k' representative objects which
     minimize the sum of the dissimilarities of the
     observations to their closest representative object.
     The algorithm first looks for a good initial set of
     medoids (this is called the BUILD phase). Then it
     finds a local minimum for the objective function, that
     is, a solution such that there is no single switch of
     an observation with a medoid that will decrease the
     objective (this is called the SWAP phase).
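     The assignment step and the objective that SWAP tries
     to decrease can be sketched as follows (a toy
     illustration, not the actual `pam' implementation; the
     medoid indices are chosen arbitrarily):

```r
## Toy sketch: assign each observation to its nearest
## medoid and evaluate the objective (summed dissimilarity
## to the chosen medoids).  Not the real BUILD/SWAP code.
set.seed(1)
x <- rbind(matrix(rnorm(10, 0, 0.5), ncol = 2),
           matrix(rnorm(10, 5, 0.5), ncol = 2))
d   <- as.matrix(dist(x))
med <- c(1, 6)                        # arbitrary candidate medoids
clus <- apply(d[, med, drop = FALSE], 1, which.min)
obj  <- sum(d[cbind(seq_len(nrow(d)), med[clus])])
## SWAP would try exchanging a medoid with a non-medoid
## observation whenever that exchange lowers obj.
```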

_V_a_l_u_e_:

     an object of class `"pam"' representing the clustering.
     See `pam.object' for details.

_B_A_C_K_G_R_O_U_N_D_:

     Cluster analysis divides a dataset into groups
     (clusters) of observations that are similar to each
     other.  Partitioning methods like `pam', `clara', and
     `fanny' require that the number of clusters be given
     by the user.  Hierarchical methods like `agnes',
     `diana', and `mona' construct a hierarchy of
     clusterings, with the number of clusters ranging from
     one to the number of observations.

_N_O_T_E_:

     For datasets larger than (say) 200 observations, `pam'
     needs a lot of computation time. In that case the
     function `clara' is preferable.

_R_e_f_e_r_e_n_c_e_s_:

     Kaufman, L. and Rousseeuw, P.J. (1990).  Finding Groups
     in Data: An Introduction to Cluster Analysis.  Wiley,
     New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1996).
     Clustering in an Object-Oriented Environment.  Journal
     of Statistical Software, 1.  <URL:
     http://www.stat.ucla.edu/journals/jss/>

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997).
     Integrating Robust Clustering Techniques in S-PLUS,
     Computational Statistics and Data Analysis, 26, 17-37.

_S_e_e _A_l_s_o_:

     `pam.object', `clara', `daisy', `partition.object',
     `plot.partition', `dist'.

_E_x_a_m_p_l_e_s_:

     # generate 25 objects, divided into 2 clusters.
     x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
                cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
     pamx <- pam(x, 2)
     pamx
     summary(pamx)
     plot(pamx)

     pam(daisy(x, metric = "manhattan"), 2, diss = TRUE)

     data(ruspini)
     ## Plot similar to Figure 4 in Struyf et al (1996)
     plot(pam(ruspini, 4), ask = TRUE)

