

hierclust(multiv)                            R Documentation

_H_i_e_r_a_r_c_h_i_c_a_l _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n_:

     Agglomerates the closest pair of observations, and
     replaces them with a cluster.  Agglomerates the next
     closest pair of observations or clusters.  Continues
     through `n-1' agglomerations, assuming there are `n'
     observations.

_U_s_a_g_e_:

     hierclust(a, method=1, bign = F, diagnostics=F, sim=" ",
                  movie=F, option="thru", alpha=1.0, repres="chull",
                  show="change")

     hierclust(dst, method="compact", sim=)

_A_r_g_u_m_e_n_t_s_:

       a: data matrix to be analyzed, the rows representing
          observations and the columns variables; or, alter-
          natively, lower-diagonal dissimilarity matrix in
          vector or list format.  In latter case, the number
          of values is `n(n-1)/2' where `n' is the number of
          observations.

          (See below for second variant of usage of this
          function.)

          First variant:

  method: First possibility: integer taking the value
          between 1 and 7.  These methods correspond to:
          Ward's minimum variance or error sum of squares
          method (1); single linkage or nearest neighbor
          method (2); complete linkage or diameter (3);
          average linkage, group average, or UPGMA method
          (4); McQuitty's or WPGMA method (5); median,
          Gower's or WPGMC method (6); and centroid or UPGMC
          method (7).  If the `movie' option is used, only
          `method's 1 to 4 are supported.  Second possibil-
          ity: string `compact' (default; complete linkage),
          `average' (average linkage), or `connected' (sin-
          gle linkage).

     sim: structure giving similarities rather than dissimi-
          larities.  This can either be a symmetric matrix
          or a vector with a `Size' attribute.  Only one of
          dissimilarities or similarities may be specified
          (or, of course, as an alternative, a data matrix.

    bign: Is `n' big?  If storage is problemsome, a differ-
          ent implementation of the Ward criterion may be
          tried.  This determines dissimilarities on the
          fly, and hence requires `O(n)' storage.

diagnostics: show, or suppress, some information directed to
          the command window.  Suppression is useful if
          functional programming capabilities are availed
          of, leading to lazy evaluation; this often results
          in multiple evaluations, with consequent repeti-
          tion of diagnostic information.

   movie: whether interactive display of construction of
          hierarchy is wanted.  This necessitates an open
          plot window.  It does not produce an output
          object.  All remaining arguments apply only in the
          case of the `movie=T'.

  option: (Only for `movie'.) "thru" (default) or "prompt":
          Go right through all n-1 agglomerations, or ask
          the user whether to continue at each agglomera-
          tion?  Here is what we recommend: On the first
          occasion, go right through.  Check out the cluster
          criterion values reported on in the command win-
          dow.  Then using the 'prompt' option, go as far as
          the desired partition, and print it out.

   alpha: (Only for `movie'.) 1.0 (default) or 0.0 < alpha
          <= 1.0.  Only used by the minimum variance agglom-
          erative criterion. Alpha downweights the variance
          component on the principal axis of the cluster,
          and thereby redefines the cluster criterion such
          that linear clusters are preferred.  A value of
          alpha = 1.0 gives exactly the traditional Ward's
          minimum variance method.  See Murtagh and Raftery
          (1984) for more on this altered criterion.

  repres: (Only for `movie'.) repres = "chull" (default),
          "lines", "gauss": Representation of clusters: For
          input data which has dimensionality greater than
          2, only "chull" is allowed (since the representa-
          tions were found to be too problematic, given that
          the clustering takes place in m-dimensions, and
          the the representation is necessarily a 2-dimen-
          sional projection of this).  Representation
          "chull" = convex hull, or straight line in case of
          collinear points.  (We test for collinearity using
          principal components analysis.)  Representation
          "lines" = lines of best fit (least squares fit is
          used, and the extremities are defined by the
          intersection of the min. and max. x values on this
          LS fit line). "gauss" = 1-sigma circles.  Side-
          effect of latter: warning messages indicate when-
          ever circles overlap plot boundaries.

    show: (Only for `movie'.) "change" or "all": In this
          method's present implementation, the clusters at
          each stage are `highlighted' by means of a differ-
          ent symbol.  Option show = "change" has this high-
          lighting changed at each agglomeration, so that
          only the most recent agglomerands are shown.
          Option show = "all" does not turn off the high-
          lighting in this way.

          Second variant (which re-wraps S function
          `hclust': cf. help file on `hclust'):

          Exactly one of dst or sim must be specified.

     dst: a distance structure or distance matrix.  Normally
          this will be the result of the function `dist',
          but it can be any data of the form returned by
          `dist', or  a  full,  symmetric matrix.  Missing
          values are not allowed.

  method: character string giving  the  clustering  method.
          The three  methods  currently implemented are
          "average", "con- nected" (single linkage) and
          "compact" (complete linkage).  (The first three
          characters of the method are sufficient.)

    sim=: structure giving similarities  rather  than  dis-
          tances.  This  can  either be a symmetric matrix
          or a vector with a "Size" attribute.  Missing val-
          ues are not allowed.

_V_a_l_u_e_:

     When `movie' is not specified in the first variant,
     this is a list describing the hierarchical clustering
     of the observations.  The `movie' option is consider-
     ably slower than alternative options, due to the pro-
     cessing involved.  Hence its use is recommended only on
     small data sets (less than 100 observations).

   merge: an `n-1' by 2 matrix. Row `i' of `merge' describes
          the merging of clusters at step `i' of the clus-
          tering.  If an element `j' in the row is negative,
          then observation `-j' was merged at this stage.
          If `j' is positive then the merge was with the
          cluster formed at the (earlier) stage `j' of the
          algorithm. Thus negative entries in `merge' indi-
          cate agglomerations of singletons, and positive
          entries indicate agglomerations of non-singletons.

  height: a set of `n-1' non-decreasing real values.  The
          clustering "height": that is, the value of the
          criterion associated with the clustering `method'
          for the particular agglomeration.

   order: a vector giving the permutation of the original
          observations suitable for plotting, in the sense
          that a cluster plot using this ordering and matrix
          `merge' will not have crossings of the branches.

          In hierarchical cluster displays, a decision is
          needed at each merge to specify which subtree
          should go on the left and which on the right.
          Since, for `n' observations there are `n-1'
          merges, there are `2^(n-1)' possible orderings for
          the leaves in a cluster tree, or dendrogram.  The
          default algorithm in `hc' is to order the subtree
          so that the tighter cluster is on the left (the
          last, i.e. most recent, merge of the left subtree
          is at a lower value than the last merge of the
          right subtree).  Observations are the tightest
          clusters possible, and merges involving two obser-
          vations place them in order by their observation
          sequence number.

_N_O_T_E_:

     The squared Euclidean distance is used in the `movie'
     case, and in the case when a data matrix is input. Note
     that the `movie' option returns invisibly.

_M_E_T_H_O_D_:

     In the case of all methods when `bign = F' (default),
     the Lance-Williams dissimilarity update formula is used
     to allow the 7 methods to be implemented.

     In the direct data matrix case, a list of nearest
     neighbors is maintained, which allows the closest pair
     of clusters or observations to be determined by scan-
     ning this list to find the pair of agglomerands on each
     iteration.  Hence `O(n)' operations are required for
     each of `n-1' iterations, i.e.  `O(n^2)' computation
     time in total.  The initial setting up of the `n' near-
     est neighbors also requires `O(n^2)' time.  The pro-
     cessing to be carried out following each agglomeration
     is `O(n)'.  In total, all methods implemented in `hc'
     are of time complexity `O(n^2)'.  Unless `bign = T'
     (see below), storage complexity is `O(n^2)'.  Typical
     times on a Sun SPARCstation 1 for 400x4 and 800x4
     arrays: 12 and 48 secs.

     Ward's minimum variance method aims at finding compact,
     spherical clusters.  The complete link method finds
     similar clusters. The single link method, which is
     closely related to the minimal spanning tree, adopts a
     `friends of friends' clustering strategy.  The other
     methods can be regarded as aiming for clusters with
     characteristics somewhere between the single and com-
     plete link methods.

     In the case of the Ward method, and `bign = T', the
     nearest-neighbor chain algorithm is used, which gives
     an `O(n^2)' computation time algorithm. The input
     matrix is stored, but not the dissimilarities produced.
     Computation time is marginally inferior to the imple-
     mentation of Ward's method when `bign = F' (default).
     Sample times for a 400x4 and an 800x4 array on a Sun
     SPARCstation 1: 12.5 and 50 secs.

_R_e_f_e_r_e_n_c_e_s_:

     A small sample:

     B. Everitt, Cluster Analysis Heinemann Educ. Books,
     London 1974;

     P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy Free-
     man, San Francisco, 1973;

     M.R. Anderberg, Cluster Analysis for Applications Aca-
     demic Press, New York, 1973;

     A.D. Gordon, Classification Chapman and Hall, London,
     1981; and

     F. Murtagh, Multidimensional Clustering Algorithms
     COMPSTAT Lectures 4, Physica-Verlag, Wuerzburg, 1985
     (for algorithmic details of algorithms used).

_S_e_e _A_l_s_o_:

     `plclust' for plotting a dendrogram, and `cutree' for
     cutting.  For finding assignments of objects to clus-
     ters, use `members'.  `cutree', and `dist' are already
     available in S-Plus.  Function `members' is available
     in this function suite.

_E_x_a_m_p_l_e_s_:

     # Default Ward criterion
     h <- hierclust(a)
     plclust(h)
     # Real-time display of clustering
     hierclust(a, movie=T)
     # Single link criterion
     h <- hierclust(dist(a), method=connected)
     # Get cluster assignments corresponding to 5-cluster partition:
     k <- members(h)
     k$class[,4]
     # Large `n'
     h <- hierclust(a, bign=T)
     # An example of the incorporated `hclust' function, which allows processing
     # of dissimilarity data:
     hierclust(dist(a))

