The files in this directory contain the Ribosome Binding Sites (RBS) data set and the associated imperfect domain theory that were used in the following papers (among others). These papers are also available from the University of Wisconsin Machine Learning (UW-ML) site via the World Wide Web (http://www.cs.wisc.edu/~shavlik/mlrg/publications.html) or anonymous ftp (ftp.cs.wisc.edu, then cd to machine-learning/shavlik-group).

   Opitz, D. W. & Shavlik, J. W. (1996). "Actively Searching for an Effective Neural-Network Ensemble." Connection Science, vol. 8, nos. 3-4, (pp. 337-353).

   Opitz, D. W. (1995). "An Anytime Approach to Connectionist Theory Refinement: Refining the Topologies of Knowledge-Based Neural Networks." Technical Report 1281, University of Wisconsin-Madison, Computer Sciences Department. (PhD Thesis)

   Opitz, D. W. & Shavlik, J. W. (1994). "Using Genetic Search to Refine Knowledge-Based Neural Networks." Machine Learning: Proceedings of the 11th International Conference, (pp. 208-216), New Brunswick, NJ. Morgan Kaufmann.

   Opitz, D. W. & Shavlik, J. W. (1993). "Heuristically Expanding Knowledge-Based Neural Networks." Proc. of the 13th International Joint Conference on Artificial Intelligence, (pp. 1360-1365), Chambery, France. Morgan Kaufmann.

This domain consists of two parts: (1) an imperfect domain theory ("rbs.theory") and (2) two sets of data, the positive instances ("rbs.pos") and the negative instances ("rbs.neg"). Mick Noordewier developed the theory and gathered the examples.

DNA is a linear sequence of four "nucleotides" -- adenine, guanine, thymine, and cytosine -- that are commonly abbreviated by the letters {A,G,T,C}. Genes are subsequences of DNA that serve as blueprints for proteins. Knowing these subsequences is very important because proteins provide most of the structure, function, and regulatory mechanisms of cells. The process of gene expression into a protein has two steps.
The first is the synthesis of an mRNA molecule using DNA as a template, and the second is the synthesis of a protein molecule using the mRNA strand as a template. A complex molecule (called a Ribosome) performs the task of reading the mRNA strand and assembling the protein chain. The sites where the Ribosome binds to the mRNA strand are called Ribosome Binding Sites (RBS). Knowing these sites would aid in determining where genes occur in DNA sequences.

An introduction to the problem of recognizing genes in DNA and a survey of machine-learning approaches to this problem can be found in:

   Craven, M. W. & Shavlik, J. W. (1994). "Machine Learning Approaches to Gene Recognition." IEEE Expert, 9(2).

The domain theory presented here uses a special notation for specifying locations in a DNA sequence: each location is numbered with respect to a fixed, biologically meaningful reference point. Negative numbers are locations preceding the reference point, while positive numbers are locations that follow this point. The following is an example:

   Location numbers:  -3 -2 -1                    1  2  3
   Sequence:           A  T  A  (REFERENCE POINT) C  G  A

The above papers have used an input window of 49 sequential nucleotide ("base-pair") positions. The inputs start at location -24 and end at location 25 (note there is no location 0, so this range covers 24 + 25 = 49 positions). With a window size of 49 bases, this domain consists of 1877 examples (1511 negative examples and 366 positive examples).

The negative examples (located in "rbs.neg") are generated from a (putative) RBS-free head of lambda that is 1559 bases long. For an input window size of 49 bases, 1511 (partially overlapping) negative examples can be generated (1559 - 49 + 1 = 1511).

The positive examples (located in "rbs.pos") consist of 366 instances, where each instance has the name of the RBS, followed by the corresponding sequence of 49 bases (starting at -24 and ending at 25).

The following results were produced from a ten-fold cross validation. In a ten-fold cross validation, the data is first split into ten sets of (approximately) equal size.
Then, ten times, one set is held out while the remaining nine sets are used to train the system. The held-out set is used to measure the test-set performance of the system's final concept.

   System    Test-set correctness    Comments
   ------    --------------------    --------
   KBANN     1709/1877 = 91.05%      See (Towell & Shavlik, 1994; AI Journal; v70) for a description of KBANN.
   TopGen    1724/1877 = 91.85%      See Opitz & Shavlik (1993) above.
   REGENT    1730/1877 = 92.17%      See Opitz & Shavlik (1994) above.
   ADDEMUP   1746/1877 = 93.02%      See Opitz & Shavlik (1996) above.

To aid future comparisons, we include, for each example, the number of the fold in which it was held out as a member of the test set; the example was a member of the training set during the other nine folds. These fold numbers are contained in "rbs.pos.folds" for the positive examples, and in "rbs.neg.folds" for the negative examples. The Nth number in "rbs.neg.folds" refers to the Nth example that can be generated by sliding the window, from left to right, across the RBS-free head of lambda.

Questions about this data set should be directed to Dave Opitz (opitz@cs.umt.edu) or Jude Shavlik (shavlik@cs.wisc.edu).
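
The location numbering, the sliding-window generation of negative examples, and the use of the fold numbers described above can be sketched in Python as follows. This is only a minimal illustration, not the code used in the papers: the function names (window_locations, sliding_windows, split_by_fold) are hypothetical, the functions work on in-memory strings and lists rather than parsing the data files (whose exact layout is not described here), and the fold labels are simply whatever values appear in the ".folds" files.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

WINDOW = 49            # input window size used in the papers cited above
FIRST_LOCATION = -24   # leftmost location; note that there is no location 0


def window_locations(size: int = WINDOW, first: int = FIRST_LOCATION) -> List[int]:
    """Return the location numbers of a window, skipping location 0."""
    locations: List[int] = []
    loc = first
    while len(locations) < size:
        if loc != 0:
            locations.append(loc)
        loc += 1
    return locations


def sliding_windows(sequence: str, size: int = WINDOW) -> List[str]:
    """Return every window of `size` bases, sliding left to right one base at a time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def split_by_fold(
    examples: Sequence[T], folds: Sequence[int], held_out: int
) -> Tuple[List[T], List[T]]:
    """Split examples into (train, test); the Nth fold number labels the Nth example."""
    train = [e for e, f in zip(examples, folds) if f != held_out]
    test = [e for e, f in zip(examples, folds) if f == held_out]
    return train, test


if __name__ == "__main__":
    locs = window_locations()
    print(locs[0], locs[-1], len(locs))   # -24 25 49

    # Placeholder sequence of the stated length (NOT the real lambda data).
    placeholder = ("ACGT" * 390)[:1559]
    print(len(sliding_windows(placeholder)))   # 1511
```

Keeping the fold assignment as a separate list that is matched to the examples by position mirrors how the ".folds" files are described above, so the same split can be reproduced when comparing against the reported results.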