

daisy(cluster)                               R Documentation

_D_i_s_s_i_m_i_l_a_r_i_t_y _M_a_t_r_i_x _C_a_l_c_u_l_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n_:

     Returns a matrix containing all the pairwise dissimi-
     larities (distances) between observations in the
     dataset.  The original variables may be of mixed types.

_U_s_a_g_e_:

     daisy(x, metric = "euclidean", stand = FALSE, type = list())

_A_r_g_u_m_e_n_t_s_:

       x: data matrix or data frame. Dissimilarities will be
          computed between the rows of `x'.  Columns of
          class `numeric' will be recognized as interval
          scaled variables, columns of class `factor' will
          be recognized as nominal variables, and columns of
          class `ordered' will be recognized as ordinal
          variables.  Other variable types should be speci-
          fied with the `type' argument.  Missing values
          (NAs) are allowed.

  metric: character string specifying the metric to be used.
          The currently available options are "euclidean"
          and "manhattan".  Euclidean distances are root
          sum-of-squares of differences, and manhattan dis-
          tances are the sum of absolute differences.  If
          not all columns of `x' are numeric, then this
          argument will be ignored.

   stand: logical flag: if TRUE, then the measurements in
          `x' are standardized before calculating the dis-
          similarities. Measurements are standardized for
          each variable (column), by subtracting the vari-
          able's mean value and dividing by the variable's
          mean absolute deviation.  If not all columns of
          `x' are numeric, then this argument will be
          ignored.

    type: list containing some (or all) of the types of the
          variables (columns) in `x'. The list may contain
          the following components: `ordratio' (ratio scaled
          variables to be treated as ordinal variables),
          `logratio' (ratio scaled variables that must be
          logarithmically transformed), `asymm' (asymmetric
          binary variables). Each component's value is a
          vector, containing the names or the numbers of the
          corresponding columns of `x'.  Variables not men-
          tioned in the `type' list are interpreted as usual
          (see argument `x').
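
     The standardization described for `stand' can be sketched
     by hand in base R.  This is an illustration only, not the
     code used inside `daisy'; the helper name `standardize_mad'
     and the small matrix are assumptions made for the example.

```r
# Hypothetical helper (not part of the cluster package): standardize one
# numeric column by subtracting its mean and dividing by its mean
# absolute deviation, as described for the `stand' argument.
standardize_mad <- function(v) {
  m <- mean(v, na.rm = TRUE)
  s <- mean(abs(v - m), na.rm = TRUE)   # mean absolute deviation
  (v - m) / s
}

# Made-up data: two interval scaled variables, four observations
x <- cbind(a = c(1, 2, 3, 6), b = c(10, 20, 30, 40))
z <- apply(x, 2, standardize_mad)

# Pairwise distances between the rows of the standardized data,
# for the two metrics named in the `metric' argument
d_euc <- dist(z, method = "euclidean")
d_man <- dist(z, method = "manhattan")
```

     After this transformation every column has mean 0 and mean
     absolute deviation 1, so no single variable dominates the
     distances merely because of its measurement unit.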

_D_e_t_a_i_l_s_:

     `daisy' is fully described in chapter 1 of Kaufman and
     Rousseeuw (1990).  Compared to `dist' whose input must
     be numeric variables, the main feature of `daisy' is
     its ability to handle other variable types as well
     (e.g. nominal, ordinal, asymmetric binary) even when
     different types occur in the same dataset.

     In the `daisy' algorithm, missing values in a row of `x'
     are not included in the dissimilarities involving that
     row.  If all variables are interval scaled, the metric
     is "euclidean", and ng is the number of columns in
     which neither row i nor row j has an NA, then the
     dissimilarity d(i,j) returned is sqrt(ncol(x)/ng) times
     the Euclidean distance between the two vectors of
     length ng shortened to exclude NAs.  The rule is
     similar for the "manhattan" metric, except that the
     coefficient is ncol(x)/ng.  If ng is zero, the
     dissimilarity is NA.
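
     The missing-value rule above can be written out for a
     single pair of rows.  The following is a minimal sketch in
     base R under the stated rule, not the implementation used
     by `daisy'; the function name `na_scaled_dist' and the two
     example rows are hypothetical.

```r
# Hypothetical helper: dissimilarity between two numeric rows that may
# contain NAs, rescaled by the number of columns observed in both rows.
na_scaled_dist <- function(xi, xj, metric = c("euclidean", "manhattan")) {
  metric <- match.arg(metric)
  ok <- !is.na(xi) & !is.na(xj)   # columns with no NA in either row
  ng <- sum(ok)
  if (ng == 0) return(NA_real_)   # nothing to compare
  p <- length(xi)                 # plays the role of ncol(x)
  diffs <- xi[ok] - xj[ok]
  if (metric == "euclidean") {
    sqrt(p / ng) * sqrt(sum(diffs^2))
  } else {
    (p / ng) * sum(abs(diffs))
  }
}

# Made-up rows with 4 columns; only columns 1 and 4 are complete (ng = 2)
ri <- c(1, NA, 3, 4)
rj <- c(2, 5, NA, 6)
d_e <- na_scaled_dist(ri, rj, "euclidean")  # sqrt(4/2) * sqrt(1 + 4)
d_m <- na_scaled_dist(ri, rj, "manhattan")  # (4/2) * (1 + 2)
```

     The coefficient compensates for the dropped columns, so a
     pair of rows with many NAs is not made to look closer
     simply because fewer differences enter the sum.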

     When some variables have a type other than interval
     scaled, the dissimilarity between two rows is the
     weighted sum of the contributions of each variable.
     The weight becomes zero when that variable is missing
     in either or both rows, or when the variable is
     asymmetric binary and both values are zero.  In all
     other situations, the weight of the variable is 1.
     The contribution of a nominal or binary variable to the
     total dissimilarity is 0 if both values are equal, and
     1 otherwise.  The contribution of other variables is
     the absolute difference of both values, divided by the
     total range of that variable.  Ordinal variables are
     first converted to ranks.  If nok is the number of
     nonzero weights, the dissimilarity is multiplied by the
     factor 1/nok and thus ranges between 0 and 1.  If nok
     is zero, the dissimilarity is NA.
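
     The mixed-type rule can be illustrated for one pair of
     rows.  This is a simplified sketch in base R, not the
     algorithm as implemented in `daisy'.  The helper
     `mixed_dissim' and the data frame are hypothetical, and two
     assumptions are made: the integer level codes of an ordered
     factor serve as its ranks, and the first level of an
     asymmetric binary factor is its `zero' value.

```r
# Hypothetical helper: dissimilarity between rows i and j of a data frame
# with mixed column types. `asymm' names the asymmetric binary columns.
mixed_dissim <- function(df, i, j, asymm = character(0)) {
  total <- 0
  nok <- 0                                  # number of nonzero weights
  for (nm in names(df)) {
    col <- df[[nm]]
    a <- col[i]
    b <- col[j]
    if (is.na(a) || is.na(b)) next          # weight 0: missing value
    if (is.ordered(col)) {
      # Ordinal: level codes used as ranks (assumption), then range-scaled
      r <- as.integer(col)
      contrib <- abs(r[i] - r[j]) / diff(range(r, na.rm = TRUE))
    } else if (is.factor(col)) {
      # Asymmetric binary: weight 0 when both values are the first level
      if (nm %in% asymm && as.integer(a) == 1 && as.integer(b) == 1) next
      contrib <- as.numeric(a != b)         # 0 if equal, 1 otherwise
    } else {
      # Interval scaled: absolute difference divided by the column range
      contrib <- abs(a - b) / diff(range(col, na.rm = TRUE))
    }
    total <- total + contrib
    nok <- nok + 1
  }
  if (nok == 0) NA_real_ else total / nok
}

# Made-up data: numeric, nominal, ordinal and binary columns
df <- data.frame(
  size   = c(10, 20, 40),
  colour = factor(c("red", "blue", "red")),
  grade  = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  flag   = factor(c(0, 0, 1))
)
d12 <- mixed_dissim(df, 1, 2, asymm = "flag")  # flag skipped: both zero
d13 <- mixed_dissim(df, 1, 3, asymm = "flag")  # all four columns count
```

     For rows 1 and 2 the `flag' column gets weight zero because
     both values are zero, so the sum of the three remaining
     contributions is divided by nok = 3 rather than 4.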

_V_a_l_u_e_:

     an object of class `"dissimilarity"' containing the
     dissimilarities among the rows of `x'.  This is
     typically the input for the functions `pam', `fanny',
     `agnes' or `diana'.  See `dissimilarity.object' for
     details.

_B_A_C_K_G_R_O_U_N_D_:

     Dissimilarities are used as inputs to cluster analysis
     and multidimensional scaling.  The choice of metric may
     have a large impact on the results of either method.

_R_e_f_e_r_e_n_c_e_s_:

     Kaufman, L. and Rousseeuw, P.J. (1990).  Finding Groups
     in Data: An Introduction to Cluster Analysis.  Wiley,
     New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997).
     Integrating Robust Clustering Techniques in S-PLUS,
     Computational Statistics and Data Analysis, 26, 17-37.

_S_e_e _A_l_s_o_:

     `dissimilarity.object', `dist', `pam', `fanny',
     `clara', `agnes', `diana'.

_E_x_a_m_p_l_e_s_:

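     library(cluster)

     data(agriculture)
     ## Example 1 in ref
     ## Compute the dissimilarities using the Euclidean metric and
     ## without standardization
     daisy(agriculture, metric = "euclidean", stand = FALSE)

     data(flower)
     ## Example 2 in ref
     ## Treat column 3 as an asymmetric binary variable
     daisy(flower, type = list(asymm = 3))
     ## Treat columns 1 and 3 as asymmetric binary, and column 7 as
     ## a ratio scaled variable handled as ordinal
     daisy(flower, type = list(asymm = c(1, 3), ordratio = 7))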

