

clustindex(cclust)                           R Documentation

_C_l_u_s_t_e_r _I_n_d_e_x_e_s

_D_e_s_c_r_i_p_t_i_o_n_:

     Calculates the values of several clustering indexes for
     `clres', the result of a clustering algorithm, e.g., an
     object of class "cclust".  The values of the indexes can
     be used independently to determine the number of
     clusters existing in a data set.

_U_s_a_g_e_:

      clustindex ( clres, x, index = "all" )

_A_r_g_u_m_e_n_t_s_:

   clres: An object holding a clustering result, e.g., of
          class "cclust"

       x: Data matrix

   index: The indexes to be calculated: "calinski", "cindex",
          "db", "hartigan", "ratkowsky", "scott", "marriot",
          "ball", "trcovw", "tracew", "friedman", "rubin",
          "ssi", "likelihood", or "all" for all the
          indexes.

_D_e_t_a_i_l_s_:

     The indexes are described in three groups, based on the
     statistics mainly used to compute them.

     The first group is based on the sum of squares within
     (SSW) and between (SSB) the clusters. These statistics
     measure the dispersion of the data points within a
     cluster and between the clusters, respectively. These
     indexes are:

        * calinski: (SSB/(k-1))/(SSW/(n-k)), where n is the
          number of data points and k is the number of clus-
          ters.

        * hartigan: log(SSB/SSW).

        * ratkowsky: mean(sqrt(varSSB/varSST)), where varSSB
          stands for the SSB for every variable and varSST
          for the total sum of squares for every variable.

        * ball: SSW/k, where k is the number of clusters.
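
     The four formulas above can be sketched in base R as
     follows. This is an illustration only, not the code
     used by `clustindex': the function name `ssIndexes' and
     its arguments are hypothetical, the cluster assignment
     is taken from `kmeans', and values returned by the
     package may differ in detail.

```r
# Sketch only: recompute the first group of indexes from the formulas above.
# `ssIndexes`, `x` and `cluster` are illustrative names, not part of cclust.
ssIndexes <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  groups <- sort(unique(cluster))
  k <- length(groups)

  # Total sum of squares for every variable (around the overall mean)
  varSST <- colSums(sweep(x, 2, colMeans(x))^2)

  # Within-cluster sum of squares for every variable
  varSSW <- numeric(ncol(x))
  for (g in groups) {
    xg <- x[cluster == g, , drop = FALSE]
    varSSW <- varSSW + colSums(sweep(xg, 2, colMeans(xg))^2)
  }

  # Between-cluster sum of squares for every variable
  varSSB <- varSST - varSSW

  SSW <- sum(varSSW)
  SSB <- sum(varSSB)
  c(calinski  = (SSB / (k - 1)) / (SSW / (n - k)),
    hartigan  = log(SSB / SSW),
    ratkowsky = mean(sqrt(varSSB / varSST)),
    ball      = SSW / k)
}

set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)
idx <- ssIndexes(x, km$cluster)
idx
```

     Calling the function for several values of k and
     comparing the results is one way to use such an index
     for choosing the number of clusters.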

     The second group is based on the statistics of T, i.e.,
     the scatter matrix of the data points, and W, which is
     the sum of the scatter matrices in every group. These
     indexes are:

        * scott: n*log(|T|/|W|), where n is the number of
          data points and |.| stands for the determinant
          of a matrix.

        * marriot: k^2 |W|, where k is the number of clus-
          ters.

        * trcovw: Trace Cov W.

        * tracew: Trace W.

        * friedman: Trace W^(-1) B, where B is the scatter
          matrix of the cluster centers.

        * rubin: |T|/|W|.
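
     A minimal base-R sketch of the scatter matrices and of
     most indexes in this group is given below. It is an
     illustration under stated assumptions, not the package
     code: `scatterIndexes' is a hypothetical name, B is
     taken as T - W (the size-weighted scatter of the
     cluster centers), and `trcovw' is left out because its
     definition is not spelled out here.

```r
# Sketch only: scatter-matrix based indexes from the formulas above.
# `scatterIndexes` is an illustrative name, not part of cclust.
scatterIndexes <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  groups <- sort(unique(cluster))
  k <- length(groups)

  # T: scatter matrix of all data points around the overall mean
  Tmat <- crossprod(sweep(x, 2, colMeans(x)))

  # W: sum of the scatter matrices of the individual clusters
  W <- matrix(0, ncol(x), ncol(x))
  for (g in groups) {
    xg <- x[cluster == g, , drop = FALSE]
    W <- W + crossprod(sweep(xg, 2, colMeans(xg)))
  }

  # B: between-cluster scatter, assumed here to be T - W
  B <- Tmat - W

  c(scott    = n * log(det(Tmat) / det(W)),
    marriot  = k^2 * det(W),
    tracew   = sum(diag(W)),
    friedman = sum(diag(solve(W) %*% B)),
    rubin    = det(Tmat) / det(W))
}

set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)
s <- scatterIndexes(x, km$cluster)
s
```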

     The third group consists of four indexes that do not
     belong to the previous groups and have nothing in
     common with each other.

        * cindex: the C-Index is a cluster similarity
          measure, expressed as:
          [d_(w)-min(d_(w))]/[max(d_(w))-min(d_(w))], where
          d_(w) is the sum of all n_(d) within-cluster
          distances, min(d_(w)) is the sum of the n_(d)
          smallest pairwise distances in the data set, and
          max(d_(w)) is the sum of the n_(d) largest pairwise
          distances. Computing the C-Index requires all
          pairwise distances in the data set to be computed
          and stored. For binary data, storing the distances
          causes no problems, since there are only a few
          possible distances. However, computing all
          distances can make this index prohibitive for
          large data sets.

        * db: R=(1/n)*sum(R_(i)), where R_(i) stands for the
          maximum value of R_(ij) for i != j, and
          R_(ij)=(SSW_(i)+SSW_(j))/DC_(ij), where DC_(ij) is
          the distance between the centers of two clusters
          i, j.

        * likelihood: under the assumption of independence
          of the variables within a cluster, a cluster
          solution can be regarded as a mixture model for
          the data, where the cluster centers give the
          probabilities for each variable to be 1.
          Therefore, the negative log-likelihood can be
          computed and used as a quantitative measure of a
          cluster solution. Note that the assumptions for
          applying special penalty terms, as in AIC or BIC,
          are not fulfilled in this model; such terms also
          show no effect for these data sets.

        * ssi: this ``Simple Structure Index'' combines
          three elements which influence the interpretabil-
          ity of a solution, i.e., the maximum difference of
          each variable between the clusters, the sizes of
          the most contrasting clusters and the deviation of
          a variable in the cluster centers compared to its
          overall mean. These three elements are multiplica-
          tively combined and normalized to give a value
          between 0 and 1.
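
     The `cindex' and `db' formulas above can be sketched in
     base R as shown below. This is an illustration only,
     not the package code. The function names are
     hypothetical, Euclidean distances are assumed, SSW_(i)
     is read as the within-cluster sum of squares of cluster
     i, and the average in `db' is taken over the clusters;
     all of these are assumptions.

```r
# Sketch only: C-Index and Davies-Bouldin style index from the formulas above.
# `cIndex` and `dbIndex` are illustrative names, not part of cclust.
cIndex <- function(x, cluster) {
  d <- as.matrix(dist(x))                 # all pairwise Euclidean distances
  pairs <- upper.tri(d)                   # count every pair once
  same <- outer(cluster, cluster, "==")   # TRUE for pairs in the same cluster
  dAll <- sort(d[pairs])
  nd <- sum(pairs & same)                 # number of within-cluster pairs
  dw <- sum(d[pairs & same])              # sum of within-cluster distances
  minDw <- sum(head(dAll, nd))            # nd smallest pairwise distances
  maxDw <- sum(tail(dAll, nd))            # nd largest pairwise distances
  (dw - minDw) / (maxDw - minDw)
}

dbIndex <- function(x, cluster) {
  x <- as.matrix(x)
  groups <- sort(unique(cluster))
  k <- length(groups)                     # needs at least two clusters
  centers <- do.call(rbind, lapply(groups, function(g)
    colMeans(x[cluster == g, , drop = FALSE])))
  # Within-cluster sum of squares of every cluster (assumed meaning of SSW_i)
  ssw <- sapply(seq_len(k), function(i) {
    xi <- x[cluster == groups[i], , drop = FALSE]
    sum(sweep(xi, 2, centers[i, ])^2)
  })
  DC <- as.matrix(dist(centers))          # distances between cluster centers
  R <- outer(ssw, ssw, "+") / DC
  diag(R) <- -Inf                         # exclude i == j from the maximum
  mean(apply(R, 1, max))                  # average over clusters (assumption)
}

set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)
ci <- cIndex(x, km$cluster)
db <- dbIndex(x, km$cluster)
c(cindex = ci, db = db)
```

     The full distance matrix built in `cIndex' grows
     quadratically with the number of data points, which is
     the cost referred to in the description of `cindex'.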

_V_a_l_u_e_:

     Returns a vector with the index values.

_A_u_t_h_o_r_(_s_)_:

     Evgenia Dimitriadou and Andreas Weingessel

_R_e_f_e_r_e_n_c_e_s_:

     Andreas Weingessel, Evgenia Dimitriadou and Sara Dol-
     nicar, An Examination Of Indexes For Determining The
     Number Of Clusters In Binary Data Sets,
     <URL: http://www.wu-wien.ac.at/am/workpap.html#29>
     and the references therein.

_S_e_e _A_l_s_o_:

     `cclust', `kmeans'

_E_x_a_m_p_l_e_s_:

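     # a 2-dimensional example
     library(cclust)
     # two Gaussian groups of 50 points each, centered at 0 and 1
     x<-rbind(matrix(rnorm(100,sd=0.3),ncol=2),
              matrix(rnorm(100,mean=1,sd=0.3),ncol=2))
     # cluster into 2 groups with at most 20 iterations
     cl<-cclust(x,2,20,verbose=TRUE,method="kmeans")
     # compute all indexes for this clustering
     resultindexes <- clustindex(cl,x, index="all")
     resultindexes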

