The files in this directory contain the E. coli promoter data set and the associated imperfect domain theory used in the following papers (among others). These papers are also available at the University of Wisconsin Machine Learning (UW-ML) site via the World Wide Web (http://www.cs.wisc.edu/~shavlik/mlrg/publications.html) or anonymous ftp (ftp.cs.wisc.edu, then cd to machine-learning/shavlik-group). Note that this is a larger domain theory and data set than the 106-instance version donated to the UCI Repository in 1990 and appearing in the directory ../promoters.

  Opitz, D. W. and Shavlik, J. W. (1996). "Generating Accurate and Diverse
  Members of a Neural-Network Ensemble." Advances in Neural Information
  Processing Systems, Denver, CO, MIT Press. (An extended version will
  appear in Connection Science.)

  Opitz, D. W. (1995). "An Anytime Approach to Connectionist Theory
  Refinement: Refining the Topologies of Knowledge-Based Neural Networks,"
  Technical Report 1281, University of Wisconsin-Madison, Computer Sciences
  Department. (PhD thesis)

  Opitz, D. W. and Shavlik, J. W. (1994). "Using Genetic Search to Refine
  Knowledge-Based Neural Networks." Machine Learning: Proceedings of the
  11th International Conference, (pp. 208-216), New Brunswick, NJ. Morgan
  Kaufmann.

  Opitz, D. W. and Shavlik, J. W. (1993). "Heuristically Expanding
  Knowledge-Based Neural Networks." Proceedings of the 13th International
  Joint Conference on Artificial Intelligence, (pp. 1360-1365), Chambery,
  France. Morgan Kaufmann.

This domain consists of two parts: (1) an imperfect domain theory ("promoters.theory") and (2) two sets of data: the positive instances ("promoters.pos") and the negative instances ("promoters.neg"). Mick Noordewier developed the theory and gathered the examples.

DNA is a linear sequence of four "nucleotides" -- adenine, guanine, thymine, and cytosine -- commonly abbreviated by the letters {A,G,T,C}. Genes are subsequences of DNA that serve as blueprints for proteins.
Proteins are very important since they provide most of the structure, function, and regulatory mechanisms of cells.

This domain is that of promoter recognition in a sequence of E. coli DNA. Promoters are short DNA sequences that precede genes. Currently, biologists are able to study only small sections of DNA; however, at the end of the Human Genome Project there will be long runs of DNA that have not been analyzed. Being able to recognize promoters is therefore important because it would make it possible to quickly locate the start of a gene in an otherwise unanalyzed sequence. An introduction to the problem of recognizing genes in DNA, and a survey of machine-learning approaches to this problem, can be found in:

  M. W. Craven and J. W. Shavlik (1994). "Machine Learning Approaches to
  Gene Recognition." IEEE Expert, 9(2).

The domain theory presented here uses a special notation for specifying locations in a DNA sequence: each location is numbered with respect to a fixed, biologically meaningful reference point. Negative numbers are locations preceding the reference point, while positive numbers are locations that follow it. The following is an example:

  Location numbers:  -3 -2 -1 (REFERENCE POINT)  1  2  3
  Sequence:           A  T  A                    C  G  A

The above papers have used an input of 57 sequential nucleotide ("base-pair") positions. The inputs start at location -50 and end at location 7 (note that there is no location 0). With a window size of 57 bases, this domain consists of 5155 examples (4921 negative examples and 234 positive examples). The negative examples are generated from a (putative) promoter-free head of lambda that is 4977 bases long; for an input window size of 57 bases, 4921 (partially overlapping) negative examples can be generated. Each positive instance is given as the name of the promoter, followed by its corresponding 57-base sequence (starting at -50 and ending at 7). An "X" means that the base is missing at that position.
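The location numbering and window-generation conventions above can be sketched in a few lines of Python. The function names (`location_to_index`, `negative_windows`) are illustrative, not part of the data set; the sketch only assumes what the text states: locations run -50..-1 and 1..7 with no location 0, and negatives are all overlapping 57-base windows of a 4977-base sequence.

```python
def location_to_index(loc):
    """Map a location number (-50..-1, 1..7; no 0) to a 0-based index
    into a 57-character example string."""
    if loc == 0 or not -50 <= loc <= 7:
        raise ValueError("valid locations are -50..-1 and 1..7")
    # Negative locations -50..-1 occupy indices 0..49;
    # positive locations 1..7 occupy indices 50..56.
    return loc + 50 if loc < 0 else loc + 49

def negative_windows(sequence, width=57):
    """All partially overlapping windows, scanning left to right."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
```

For a 4977-base sequence this yields 4977 - 57 + 1 = 4921 windows, matching the negative-example count quoted above.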
Using the whole data set, Opitz & Shavlik (1993) reported the following test-set correctness rates (from five runs of a five-fold cross validation):

  System   Test-set correctness   Comments
  ------   --------------------   --------
  KBANN    97.69%                 See (Towell & Shavlik, 1994; AI Journal,
                                  v70) for a description of KBANN.
  TopGen   97.98%                 See Opitz & Shavlik (1993) above.

Due to runtime limitations, we (in Opitz & Shavlik, 1994) were only able to use a subset of the negative examples. In this case, we used all 234 positive examples, but only 702 of the 4921 possible negative examples (i.e., a 3-to-1 ratio of negatives to positives). The following results were produced from a ten-fold cross validation on this smaller data set. In a ten-fold cross validation, the data is first split into ten equal sets. Then, ten times, one set is held out while the remaining nine sets are used to train the system; the held-out set is used to measure the test-set performance of the learner's final concept.

  System    Test-set correctness   Comments
  ------    --------------------   --------
  KBANN     877/936 = 93.70%       See (Towell & Shavlik, 1994; AI Journal,
                                   v70) for a description of KBANN.
  TopGen    887/936 = 94.76%       See Opitz & Shavlik (1993) above.
  REGENT    897/936 = 95.83%       See Opitz & Shavlik (1994) above.
  ADDEMUP   908/936 = 97.01%       See Opitz & Shavlik (1996) above.

To aid future comparisons, we include the fold number where each example was held out as a member of the test set (for the smaller data set only); the example was a member of the training set during the other nine folds. This fold number is contained in "promoter.pos.folds" for the positive examples, and in "promoter.neg.folds" for the negative examples. A zero means that the example was not used. The Nth number in "promoter.neg.folds" refers to the Nth example that can be generated by sliding the window, scanning from left to right, across the promoter-free head of lambda.
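Reconstructing one train/test split from the fold files can be sketched as follows. This is only an illustration: the helper names are mine, and it assumes the fold files hold whitespace-separated integers, one per example, with 0 marking unused examples as described above.

```python
def read_folds(path):
    """Read fold numbers (one integer per example) from a folds file."""
    with open(path) as f:
        return [int(tok) for tok in f.read().split()]

def split_fold(examples, folds, test_fold):
    """Return (train, test) for one fold of the ten-fold cross validation.

    Examples whose fold number is 0 were not used; examples whose fold
    number equals test_fold form the test set; all others form the
    training set for this fold.
    """
    train, test = [], []
    for ex, fold in zip(examples, folds):
        if fold == 0:
            continue  # a zero means the example was not used
        (test if fold == test_fold else train).append(ex)
    return train, test
```

Repeating `split_fold` for test_fold = 1..10 reproduces the ten held-out sets; each used example appears in exactly one test set and in the training set of the other nine folds.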
Questions about this data set should be directed to Dave Opitz (opitz@cs.umt.edu) or Jude Shavlik (shavlik@cs.wisc.edu).