The files in this directory contain the Ribosome Binding Sites (RBS) data
set and associated imperfect domain theory that were used in the following
papers (among others).   These papers are also available at the University of
Wisconsin Machine Learning (UW-ML) site via the World Wide Web
(http://www.cs.wisc.edu/~shavlik/mlrg/publications.html) or anonymous
ftp (ftp.cs.wisc.edu, then cd to machine-learning/shavlik-group).

           Opitz, D. W. and Shavlik, J. W.  (1996).  "Actively Searching for
        an Effective Neural-Network Ensemble."  Connection Science,
        vol. 8, Nos. 3-4, (pp. 337-353).

           Opitz, D. W.  (1995).  "An Anytime Approach to Connectionist 
        Theory Refinement: Refining the Topologies of Knowledge-Based Neural 
        Networks."  Technical Report 1281, University of Wisconsin-Madison, 
        Computer Sciences Department.   (PhD Thesis)

           Opitz, D. W. and Shavlik, J. W.  (1994). "Using Genetic Search
        to Refine Knowledge-Based Neural Networks."  Machine Learning:
        Proceedings of the 11th International Conference, (pp. 208-216),
        New Brunswick, NJ.  Morgan Kaufmann.

           Opitz, D. W. and Shavlik, J. W.  (1993).  "Heuristically Expanding
        Knowledge-Based Neural Networks."  Proc. of the 13th International
        Joint Conference on Artificial Intelligence,  (pp. 1360-1365),
        Chambery, France.  Morgan Kaufmann.

    This domain consists of two parts: (1) an imperfect domain theory
("rbs.theory") and (2) two sets of data, the positive instances
("rbs.pos") and the negative instances ("rbs.neg").  Mick Noordewier
developed the theory and gathered the examples.

    DNA is a linear sequence of four "nucleotides" -- adenine, guanine,
thymine, and cytosine -- that are commonly abbreviated by the letters
{A,G,T,C}.   Genes are subsequences of DNA that serve as blueprints for
proteins.  Knowing these subsequences is very important because proteins
provide most of the structure, function, and regulatory mechanisms of cells.

    The process of gene expression into a protein has two steps.  The
first is the synthesis of an mRNA molecule using DNA as a template and the
second is the synthesis of a protein molecule using the mRNA strand as
a template.  A complex molecule (called a Ribosome) performs the task of
reading the mRNA strand and assembling the protein chain.  The sites where
the Ribosome binds to the mRNA strand are called Ribosome Binding Sites (RBS).
Knowing these sites would aid in determining where genes occur in DNA 
sequences.  An introduction to the problem of recognizing genes in DNA
and a survey of machine-learning approaches to this problem can be found in:

        M. W. Craven and J. W. Shavlik (1994).  "Machine Learning
        Approaches to Gene Recognition."  IEEE Expert, 9(2).

  The domain theory presented here uses a special notation for specifying
locations in a DNA sequence by numbering each location with respect to a
fixed, biologically-meaningful reference point.  Negative numbers are
locations preceding the reference point, while positive numbers are locations
that follow this point.  The following is an example:

      Location numbers:  -3   -2  -1                    1   2   3
      Sequence:           A    T   A  (REFERENCE POINT) C   G   A
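
   The numbering scheme above (no location 0) can be sketched in code.
The following is only an illustration; the function names and the
"first_location" parameter are hypothetical and do not appear in the
data files.

```python
def index_to_location(index, first_location):
    """Map a 0-based index in a window to its location number.

    Locations skip 0: the base just before the reference point is -1
    and the base just after it is 1.
    """
    location = first_location + index
    # Indices that reach or pass the reference point shift up by one.
    return location if location < 0 else location + 1


def location_to_index(location, first_location):
    """Inverse of index_to_location; location 0 does not exist."""
    if location == 0:
        raise ValueError("there is no location 0")
    offset = location - first_location
    return offset if location < 0 else offset - 1


# The six-base example above: A T A (REFERENCE POINT) C G A
window = "ATACGA"
for i, base in enumerate(window):
    print(index_to_location(i, first_location=-3), base)
```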

   The above papers used an input window of 49 sequential nucleotide
("base-pair") positions.  The window starts at location -24 and ends at
location 25 (note that there is no location 0).  With a window size of 49
bases, this domain consists of 1877 examples (1511 negative examples and
366 positive examples).   The negative examples (located in "rbs.neg") are
generated from a (putative) RBS-free head of lambda that is 1559 bases
long.  Sliding an input window of 49 bases across this sequence one base
at a time yields 1559 - 49 + 1 = 1511 (partially overlapping) negative
examples.  The positive examples (located in "rbs.pos") consist of 366
instances, where each instance has the name of the RBS, followed by the
corresponding sequence of 49 bases (starting at -24 and ending at 25).
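
   As a rough sketch of how the negative examples can be generated, the
code below slides a 49-base window across a sequence one base at a time.
It assumes the sequence has already been read into a single string; the
function name is hypothetical, and the placeholder string only has the
stated length of 1559 bases (it is NOT the real lambda sequence).

```python
def sliding_windows(sequence, window_size=49):
    """Yield every contiguous window of window_size bases, left to right."""
    for start in range(len(sequence) - window_size + 1):
        yield sequence[start:start + window_size]


# Placeholder of the stated length (1559 bases); not real data.
placeholder = "ACGT" * 389 + "ACG"

windows = list(sliding_windows(placeholder))
print(len(placeholder))  # 1559
print(len(windows))      # 1559 - 49 + 1 = 1511
```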

   The following results were produced by ten-fold cross validation.
In ten-fold cross validation, the data is first split into ten sets of
(approximately) equal size.  Then, ten times, one set is held out, while the
remaining nine sets are used to train the system.  The held-out set is used
to measure the test-set performance of the system's final concept.
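
   The procedure just described can be sketched as follows.  This is a
generic illustration, not the code used in the papers: the random fold
assignment, the function names, and the "train_and_test" callback are
all assumptions.

```python
import random


def assign_folds(n_examples, n_folds=10, seed=0):
    """Randomly assign each example to one of n_folds near-equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [0] * n_examples
    for rank, idx in enumerate(indices):
        folds[idx] = rank % n_folds
    return folds


def cross_validate(examples, labels, folds, train_and_test):
    """Hold out each fold once and sum the number of correct test answers.

    train_and_test(train, test) is a hypothetical callback that trains on
    the (example, label) pairs in train and returns how many pairs in
    test the resulting concept classifies correctly.
    """
    correct = 0
    for held_out in sorted(set(folds)):
        train = [(x, y) for x, y, f in zip(examples, labels, folds)
                 if f != held_out]
        test = [(x, y) for x, y, f in zip(examples, labels, folds)
                if f == held_out]
        correct += train_and_test(train, test)
    return correct
```

   Every example is tested exactly once, so the summed count can be
divided by the total number of examples to obtain test-set correctness.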

  System   Test-set correctness   Comments
  ------   --------------------   --------
  KBANN     1709/1877 = 91.05%    See Towell & Shavlik (1994), AI Journal,
                                  vol. 70, for a description of KBANN.
  TopGen    1724/1877 = 91.85%    See Opitz & Shavlik (1993) above.
  REGENT    1730/1877 = 92.17%    See Opitz & Shavlik (1994) above.
  ADDEMUP   1746/1877 = 93.02%    See Opitz & Shavlik (1996) above.
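
   For anyone adding a new row to this table, the percentages are the
number of correctly classified test examples, summed over the ten folds,
divided by 1877.  A small sketch (the function name is hypothetical):

```python
TOTAL_EXAMPLES = 1877  # 1511 negative + 366 positive


def percent_correct(n_correct, total=TOTAL_EXAMPLES):
    """Test-set correctness as a percentage of all examples."""
    return 100.0 * n_correct / total


# Counts taken from the table above.
reported = {"KBANN": 1709, "TopGen": 1724, "REGENT": 1730, "ADDEMUP": 1746}
for system, n_correct in reported.items():
    print(f"{system:8s} {n_correct}/{TOTAL_EXAMPLES} = "
          f"{percent_correct(n_correct):.2f}%")
```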

   To aid future comparisons, we include the fold number where each example
was held out as a member of the test set; the example was a member of the
training set during the other nine folds.  This fold number is contained
in "rbs.pos.folds" for the positive examples, and in "rbs.neg.folds" for
the negative examples.  The Nth number in "rbs.neg.folds" refers
to the Nth example that can be generated by sliding the window, from
left to right, across the RBS-free head of lambda.
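
   A sketch of how the fold numbers might be used to rebuild a split is
given below.  It assumes the contents of a ".folds" file are
whitespace-separated integers, one per example, in example order; check
the actual files before relying on this.  The function names are
hypothetical.

```python
def parse_folds(text):
    """Parse fold numbers from the text of a .folds file.

    Assumes whitespace-separated integers (an unverified assumption
    about the file layout).
    """
    return [int(token) for token in text.split()]


def split_by_fold(examples, folds, held_out):
    """Return (train, test), where test holds the examples in fold held_out."""
    if len(examples) != len(folds):
        raise ValueError("need exactly one fold number per example")
    train = [ex for ex, f in zip(examples, folds) if f != held_out]
    test = [ex for ex, f in zip(examples, folds) if f == held_out]
    return train, test


# Toy illustration with made-up fold numbers (not real data).
toy_examples = ["w1", "w2", "w3", "w4"]
toy_folds = parse_folds("1 2 1 3")
train, test = split_by_fold(toy_examples, toy_folds, held_out=1)
print(train)  # ['w2', 'w4']
print(test)   # ['w1', 'w3']
```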

Questions about this data set should be directed to Dave Opitz
(opitz@cs.umt.edu) or Jude Shavlik (shavlik@cs.wisc.edu).