The files in this directory contain the E. coli promoter data set and the associated imperfect domain theory used in the following papers (among others). These papers are also available at the University of Wisconsin Machine Learning (UW-ML) site via the World Wide Web (http://www.cs.wisc.edu/~shavlik/mlrg/publications.html) or anonymous ftp (ftp.cs.wisc.edu, then cd to machine-learning/shavlik-group). Note that this is a larger domain theory and data set than the 106-instance version donated to the UCI Repository in 1990 and appearing in the directory ../promoters.

  Opitz, D. W. and Shavlik, J. W. (1996). "Generating Accurate and Diverse
  Members of a Neural-Network Ensemble." Advances in Neural Information
  Processing Systems, Denver, CO, MIT Press. (An extended version will
  appear in Connection Science.)

  Opitz, D. W. (1995). "An Anytime Approach to Connectionist Theory
  Refinement: Refining the Topologies of Knowledge-Based Neural Networks,"
  Technical Report 1281, University of Wisconsin-Madison, Computer Sciences
  Department. (PhD thesis)

  Opitz, D. W. and Shavlik, J. W. (1994). "Using Genetic Search to Refine
  Knowledge-Based Neural Networks." Machine Learning: Proceedings of the
  11th International Conference, (pp. 208-216), New Brunswick, NJ. Morgan
  Kaufmann.

  Opitz, D. W. and Shavlik, J. W. (1993). "Heuristically Expanding
  Knowledge-Based Neural Networks." Proceedings of the 13th International
  Joint Conference on Artificial Intelligence, (pp. 1360-1365), Chambery,
  France. Morgan Kaufmann.

This domain consists of two parts: (1) an imperfect domain theory ("promoters.theory") and (2) two sets of data: the positive instances ("promoters.pos") and the negative instances ("promoters.neg"). Mick Noordewier developed the theory and gathered the examples.

DNA is a linear sequence of four "nucleotides" -- adenine, guanine, thymine, and cytosine -- commonly abbreviated by the letters {A,G,T,C}. Genes are subsequences of DNA that serve as blueprints for proteins.
Proteins are very important since they provide most of the structure, function, and regulatory mechanisms of cells.

This domain is that of promoter recognition in a sequence of E. coli DNA. Promoters are short DNA sequences that precede genes. Currently, biologists are able to study only small sections of DNA; however, at the end of the Human Genome Project there will be long runs of DNA that have not been analyzed. Being able to recognize promoters is therefore important because it would make it possible to quickly locate the start of a gene in an otherwise unanalyzed sequence. An introduction to the problem of recognizing genes in DNA, and a survey of machine-learning approaches to this problem, can be found in:

  M. W. Craven and J. W. Shavlik (1994). "Machine Learning Approaches to
  Gene Recognition." IEEE Expert, 9(2).

The domain theory presented here uses a special notation for specifying locations in a DNA sequence: each location is numbered with respect to a fixed, biologically meaningful reference point. Negative numbers are locations preceding the reference point, while positive numbers are locations that follow it. The following is an example:

  Location numbers:  -3 -2 -1 (REFERENCE POINT)  1  2  3
  Sequence:           A  T  A                    C  G  A

The above papers have used an input of 57 sequential nucleotide ("base-pair") positions. The inputs start at location -50 and end at location 7 (note that there is no location 0). With a window size of 57 bases, this domain consists of 5155 examples (4921 negative examples and 234 positive examples). The negative examples are generated from a (putative) promoter-free head of lambda that is 4977 bases long; for an input window size of 57 bases, 4921 (partially overlapping) negative examples can be generated. Each positive instance is given as the name of the promoter, followed by its corresponding 57-base sequence (starting at -50 and ending at 7). An "X" means that the base is missing at that position.
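The location numbering and window-generation conventions above can be sketched in a few lines of Python. The function names (`location_to_index`, `negative_windows`) are illustrative, not part of the data set; the sketch only assumes what the text states: locations run -50..-1 and 1..7 with no location 0, and negatives are all overlapping 57-base windows of a 4977-base sequence.

```python
def location_to_index(loc):
    """Map a location number (-50..-1, 1..7; no 0) to a 0-based index
    into a 57-character example string."""
    if loc == 0 or not -50 <= loc <= 7:
        raise ValueError("valid locations are -50..-1 and 1..7")
    # Negative locations -50..-1 occupy indices 0..49;
    # positive locations 1..7 occupy indices 50..56.
    return loc + 50 if loc < 0 else loc + 49

def negative_windows(sequence, width=57):
    """All partially overlapping windows, scanning left to right."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
```

For a 4977-base sequence this yields 4977 - 57 + 1 = 4921 windows, matching the negative-example count quoted above.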
Using the whole data set, Opitz & Shavlik (1993) reported the following test-set correctness rates (from five runs of a five-fold cross validation):

  System   Test-set correctness   Comments
  ------   --------------------   --------
  KBANN    97.69%                 See (Towell & Shavlik, 1994; AI Journal,
                                  v70) for a description of KBANN.
  TopGen   97.98%                 See Opitz & Shavlik (1993) above.

Due to runtime limitations, we (in Opitz & Shavlik, 1994) were only able to use a subset of the negative examples. In this case, we used all 234 positive examples, but only 702 of the 4921 possible negative examples (i.e., a 3-to-1 ratio of negatives to positives). The following results were produced from a ten-fold cross validation on this smaller data set. In a ten-fold cross validation, the data is first split into ten equal sets. Then, ten times, one set is held out while the remaining nine sets are used to train the system; the held-out set is used to measure the test-set performance of the learner's final concept.

  System    Test-set correctness   Comments
  ------    --------------------   --------
  KBANN     877/936 = 93.70%       See (Towell & Shavlik, 1994; AI Journal,
                                   v70) for a description of KBANN.
  TopGen    887/936 = 94.76%       See Opitz & Shavlik (1993) above.
  REGENT    897/936 = 95.83%       See Opitz & Shavlik (1994) above.
  ADDEMUP   908/936 = 97.01%       See Opitz & Shavlik (1996) above.

To aid future comparisons, we include the fold number where each example was held out as a member of the test set (for the smaller data set only); the example was a member of the training set during the other nine folds. This fold number is contained in "promoter.pos.folds" for the positive examples, and in "promoter.neg.folds" for the negative examples. A zero means that the example was not used. The Nth number in "promoter.neg.folds" refers to the Nth example that can be generated by sliding the window, scanning from left to right, across the promoter-free head of lambda.
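Reconstructing one train/test split from the fold files can be sketched as follows. This is only an illustration: the helper names are mine, and it assumes the fold files hold whitespace-separated integers, one per example, with 0 marking unused examples as described above.

```python
def read_folds(path):
    """Read fold numbers (one integer per example) from a folds file."""
    with open(path) as f:
        return [int(tok) for tok in f.read().split()]

def split_fold(examples, folds, test_fold):
    """Return (train, test) for one fold of the ten-fold cross validation.

    Examples whose fold number is 0 were not used; examples whose fold
    number equals test_fold form the test set; all others form the
    training set for this fold.
    """
    train, test = [], []
    for ex, fold in zip(examples, folds):
        if fold == 0:
            continue  # a zero means the example was not used
        (test if fold == test_fold else train).append(ex)
    return train, test
```

Repeating `split_fold` for test_fold = 1..10 reproduces the ten held-out sets; each used example appears in exactly one test set and in the training set of the other nine folds.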
Questions about this data set should be directed to Dave Opitz (opitz@cs.umt.edu) or Jude Shavlik (shavlik@cs.wisc.edu).