

partition(multiv)                            R Documentation

_P_a_r_t_i_t_i_o_n_i_n_g _b_y _I_t_e_r_a_t_i_v_e _O_p_t_i_m_i_z_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n_:

     Returns cluster memberships.  In the case of a data
     matrix, the user specifies either the requested number
     of clusters, or (implicitly) as many clusters as there
     are rows in an initial guess at the cluster centers.
     Incorporates the S function `kmeans'.
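
     As a minimal sketch of the two calling modes (assuming
     a suitable data matrix `a' is already to hand):

     p1 <- partition(a, 3)         # request 3 clusters explicitly
     p2 <- partition(a, a[1:3, ])  # 3 clusters, implied by the 3 rows
                                   # of the initial guess at centers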

_U_s_a_g_e_:

     partition(a, ng, iter.max=15, option="mindst", diagnostics=F)

     partition(a, centers, iter.max=15, diagnostics=F)

_A_r_g_u_m_e_n_t_s_:

       a: matrix of multivariate data.  Each row corresponds
          to an observation, and each column corresponds to
          a variable.

      ng: Number of groups or clusters.

 centers: matrix of initial guesses for the cluster centers.
          Each row represents a cluster center, and thus
          `centers' must have the same number of columns as
          `a'.  The number of rows in `centers' is the num-
          ber of clusters that will be formed.

iter.max: maximum number of iterations.

  option: Either `mindst' (default) or `exch'.  Options for
          the Spaeth (1985) algorithm. In the former case,
          the variances of the groups are optimized by
          assigning the objects to groups such that they are
          minimally distant from group centers.  In the lat-
          ter case, the variances are optimized by exchang-
          ing the objects between groups such that they are
          minimally distant from group centers.

diagnostics: FALSE (the default) means that cluster
          cardinalities, cluster center coordinates, and
          cluster compactness values are not printed to the
          command window; set to TRUE to print them.

Value:

 cluster: vector of integers, ranging from `1' to `ng' (or
          number of rows in `centers'), with length the same
          as the number of rows of `a'.  The `i'th value
          indicates the group in which the `i'th observation
          belongs.

 centers: matrix containing the coordinates of the final
          group centers.

withinss: vector of length `ng' or number of rows in `cen-
          ters'.  The `i'th value gives the within group sum
          of squares for the `i'th group.

    size: vector of length `ng', or the number of rows in
          `centers'.  The `i'th value gives the number of
          observations in group `i'.

          In the case of the Spaeth algorithm (where the
          number of clusters is given), only the first of
          these components, `cluster', is returned.
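
     For example (a sketch, with `centers' a hypothetical
     matrix of initial guesses), the criterion being
     minimized and the cluster cardinalities can be
     inspected as:

     km <- partition(a, centers)
     sum(km$withinss)     # overall within-group sum of squares
     km$size              # number of observations in each cluster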

_M_E_T_H_O_D_:

     Consider first the clustering of coordinate data.  The
     object is to find a partition of the observations with
     `ng' (or the number of rows in `centers') groups that
     minimizes the sum of `withinss' values.  To actually
     guarantee the minimum would be computationally infeasi-
     ble in many settings; this function finds a local mini-
     mum, that is, a solution such that there is no single
     switch of an observation from one group to another
     group that will decrease the objective criterion.  In
     the case of the Spaeth (1985) algorithms, the local
     minimum arrived at is dependent on the initial arbi-
     trary assignment of observations to groups.  This arbi-
     trary assignment is carried out using a random number
     generator.  Subsequent executions of the `partition'
     function will make use of different initial arbitrary
     assignments.  The function `partition' should therefore
     be run a number of times, and the best result kept.
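
     A minimal sketch of this advice, assuming a requested
     number of groups `ng' (so that only the membership
     vector is returned) and computing the within-group sum
     of squares directly from the memberships:

     best.wss <- Inf
     for(run in 1:10) {
          pp <- partition(a, ng)
          # within-group sum of squares from the memberships
          wss <- 0
          for(g in 1:ng) {
               xg <- a[pp$cluster == g, , drop=F]
               if(nrow(xg) > 0)
                    wss <- wss + sum(sweep(xg, 2, apply(xg, 2, mean))^2)
          }
          if(wss < best.wss) {
               best.wss <- wss
               best <- pp
          }
     }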

     The input data are not standardized: it may be
     advisable to divide each column by its range, or to
     standardize in some other way.
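
     For instance, dividing each column by its range before
     clustering might be done as follows (a sketch only;
     columns with zero range would need separate handling):

     rng <- apply(a, 2, function(v) diff(range(v)))
     a.std <- sweep(a, 2, rng, "/")
     pp <- partition(a.std, ng)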

     Note that this routine is best used by specifying a
     small number of groups; specifying a large number of
     groups may cause many empty classes.  When this hap-
     pens, group centroids and compactness values may be
     undefined.  This may also happen when there are very
     large values in the input data set.  Cluster member-
     ships, though, can still be used.
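
     A quick check for empty classes, using only the
     returned memberships (a sketch):

     pp <- partition(a, ng)
     table(factor(pp$cluster, levels=1:ng))  # zero counts flag empty classes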

     Sample timings for the Spaeth (1985) algorithms
     (accessed by specifying the desired number of groups):
     33000 rows, 4 columns, 3 groups took about 40 secs. on
     a Sun SPARCstation 1; 50 groups took about 140 secs.

     When deciding on the number of clusters, Hartigan
     (1975, pp. 90-91) suggests the following rough rule of
     thumb.  If `k' is the result of `partition' with `k'
     groups, and `kplus1' is the result with `k+1' groups,
     then it is justifiable to add the extra group when

     (sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(a)-k-1)

     is greater than 10.
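
     As a sketch, supposing `k3' and `k4' are results
     obtained with hypothetical matrices `centers3' and
     `centers4' of initial guesses at 3 and 4 cluster
     centers (so that `withinss' is returned):

     k3 <- partition(a, centers3)
     k4 <- partition(a, centers4)
     h <- (sum(k3$withinss)/sum(k4$withinss) - 1) * (nrow(a) - 3 - 1)
     h > 10    # TRUE suggests the move from 3 to 4 groups is justified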

     When a partition is being obtained from a dendrogram,
     the procedure is straightforward: a slice is made,
     orthogonal to the cluster criterion value axis, and the
     clusters read off.

_R_e_f_e_r_e_n_c_e_s_:

     Spaeth, H. (1985).  Cluster Dissection and Analysis:
     Theory, Fortran Programs, Examples.  Ellis Horwood,
     Chichester.

     Hartigan, J.A. (1975).  Clustering Algorithms.  Wiley,
     New York.

     Hartigan, J.A. and Wong, M.A. (1979).  `A k-means clus-
     tering algorithm', Applied Statistics, vol. 28, pp.
     100-108.

_S_e_e _A_l_s_o_:

     `kmeans' (incorporated into this function); `hierclust'
     (hierarchical clustering routines, incorporating
     `hclust'); `modclust' (hierarchical clustering, incor-
     porating `mclust').

_E_x_a_m_p_l_e_s_:

     # Firstly, specifying the desired number of groups.
     ng <- 3
     pp <- partition(a, ng)
     # How well did the partitioning do?  diagnostics=T prints cluster
     # cardinalities, centers, and compactness values.
     pp <- partition(a, ng, diagnostics=T)
     # Plot the results in the plane comprising the first two columns of `a'
     x <- a[,1]       # for convenience
     y <- a[,2]
     plot(x, y, type="n")     # set up plot scales, etc.
     points(x[pp$cluster==1], y[pp$cluster==1], pch="*")
     points(x[pp$cluster==2], y[pp$cluster==2], pch=".")
     points(x[pp$cluster==3], y[pp$cluster==3], pch="o")

     # Secondly, specifying guesses at cluster centers.
     irismean <- t(apply(iris, c(2, 3), 'mean'))
     x <- rbind(iris[,,1], iris[,,2], iris[,,3])
     km <- partition(x, irismean)
     wrong <- km$cluster!=rep(1:3, c(50, 50, 50))
     spin(x, highlight=wrong)
     plot(x[,2], x[,3], type="n")
     text(x[!wrong, 2], x[!wrong, 3], km$cluster[!wrong])
     # identify cluster membership that is correct
     points(x[wrong, 2], x[wrong, 3], pch=15)
     # boxes for points in error
     title(main="K-Means Clustering of the Iris Data")

     # Now the hierarchy slicing case.
     # Produce a hierarchical clustering and plot it.
     clhier    <- hierclust(a)
     motif()
     plclust(clhier)
     # Derive the 4-cluster partition from this.  Update plot of dendrogram.
     clresult2 <- partition(clhier, 4, showslice=T)

