

DNA(mlbench)                                 R Documentation

_P_r_i_m_a_t_e _s_p_l_i_c_e_-_j_u_n_c_t_i_o_n _g_e_n_e _s_e_q_u_e_n_c_e_s _(_D_N_A_)

_D_e_s_c_r_i_p_t_i_o_n_:

     The data set consists of 3,186 data points (splice
     junctions). Each data point is described by 180 binary
     indicator variables, and the problem is to recognize
     the 3 classes (ei, ie, neither), i.e., the boundaries
     between exons (the parts of the DNA sequence retained
     after splicing) and introns (the parts of the DNA
     sequence that are spliced out).

     The StatLog DNA dataset is a processed version of the
     Irvine database described below. The main difference is
     that each symbolic variable representing a nucleotide
     (only A, G, T, C) was replaced by 3 binary indicator
     variables. Thus the original 60 symbolic attributes
     were changed into 180 binary attributes. The names of
     the examples were removed. The examples with
     ambiguities were removed (there were very few of them,
     only 4). The StatLog version of this dataset was
     produced by Ross King at Strathclyde University. For
     original details see the Irvine database documentation.

     The nucleotides A,C,G,T were given indicator values as
     follows:

          A -> 1 0 0
          C -> 0 1 0
          G -> 0 0 1
          T -> 0 0 0
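
     The mapping above can be sketched in a few lines of R.
     This is only an illustration of the coding scheme, not
     the procedure used to build the StatLog files;
     encode_sequence() and the codes list are hypothetical
     names introduced here, and ambiguous bases are
     rejected rather than handled.

```r
# Indicator codes copied from the table above; T is the all-zero level.
codes <- list(A = c(1, 0, 0),
              C = c(0, 1, 0),
              G = c(0, 0, 1),
              T = c(0, 0, 0))

# encode_sequence() is a hypothetical helper, not part of mlbench.
# It turns a string of n nucleotides into a vector of 3 * n indicators.
encode_sequence <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  stopifnot(all(bases %in% names(codes)))  # ambiguous bases are not handled
  unlist(codes[bases], use.names = FALSE)
}

x <- encode_sequence("ACGT")
length(x)  # 3 indicators per nucleotide
```

     Under this scheme a 60-nucleotide window yields
     3 * 60 = 180 indicator values, which matches the
     attribute count given above.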

     Hint. Much better performance is generally observed if
     attributes closest to the junction are used. In the
     StatLog version, this means using attributes A61 to
     A120 only.
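
     One way to follow the hint is to select the columns by
     position. The sketch below assumes that the 180
     indicator columns come first, in sequence order, and
     that the class is the last column of the data frame;
     central_attributes() is a hypothetical helper, and the
     column names in the installed package may differ from
     the A61 to A120 labels used above.

```r
# central_attributes() is a hypothetical helper, not part of mlbench.
# Assumption: indicator columns come first and the class column is last.
central_attributes <- function(df, from = 61, to = 120) {
  stopifnot(ncol(df) > to)
  df[, c(from:to, ncol(df)), drop = FALSE]
}

# Only runs when the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(DNA, package = "mlbench")
  DNA_central <- central_attributes(DNA)
  print(dim(DNA_central))
}
```

     Selecting by position keeps the class column alongside
     the 60 retained attributes, so the result can be passed
     directly to a classifier.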

_U_s_a_g_e_:

     data(DNA)

_F_o_r_m_a_t_:

     A data frame with 3,186 observations on 180 variables,
     all nominal, and a target class.
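
     The stated layout can be checked after loading the
     data. The following is a minimal sketch;
     describe_layout() is a hypothetical helper, and it
     assumes the target class is stored as the last column.

```r
# describe_layout() is a hypothetical helper, not part of mlbench.
# Assumption: the class is the last column of the data frame.
describe_layout <- function(df) {
  predictors <- df[, -ncol(df), drop = FALSE]
  list(
    n_obs        = nrow(df),
    n_predictors = ncol(predictors),
    all_nominal  = all(vapply(predictors, is.factor, logical(1))),
    class_levels = levels(df[[ncol(df)]])
  )
}

# Only runs when the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(DNA, package = "mlbench")
  str(describe_layout(DNA))
}
```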

_S_o_u_r_c_e_:

        * Source:
          - all examples taken from Genbank 64.1 (ftp site:
          genbank.bio.net)
          - categories "ei" and "ie" include every "split-
          gene" for primates in Genbank 64.1
          - non-splice examples taken from sequences known
          not to include a splicing site

        * Donor: G. Towell, M. Noordewier, and J. Shavlik,
          {towell,shavlik}@cs.wisc.edu,
          noordewi@cs.rutgers.edu

     These data have been taken from:

        * ftp.stams.strath.ac.uk/pub/Statlog

     and were converted to R format by
     Evgenia.Dimitriadou@ci.tuwien.ac.at.

_R_e_f_e_r_e_n_c_e_s_:

     machine learning:
     - M. O. Noordewier and G. G. Towell and J. W. Shavlik,
     1991; "Training Knowledge-Based Neural Networks to
     Recognize Genes in DNA Sequences".  Advances in Neural
     Information Processing Systems, volume 3, Morgan
     Kaufmann.

     - G. G. Towell and J. W. Shavlik and M. W. Craven,
     1991; "Constructive Induction in Knowledge-Based Neural
     Networks", In Proceedings of the Eighth International
     Machine Learning Workshop, Morgan Kaufmann.

     - G. G. Towell, 1991; "Symbolic Knowledge and Neural
     Networks: Insertion, Refinement, and Extraction", PhD
     Thesis, University of Wisconsin - Madison.

     - G. G. Towell and J. W. Shavlik, 1992; "Interpretation
     of Artificial Neural Networks: Mapping Knowledge-based
     Neural Networks into Rules", In Advances in Neural
     Information Processing Systems, volume 4, Morgan
     Kaufmann.

