The files in this directory contain the data set that was presented in the paper:

  M. W. Craven and J. W. Shavlik, "Learning to Represent Codons: A Challenge Problem for Constructive Induction", Proc. of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, August 1993, pp. 1319-1324.

This data set is a simplified version of the one presented in the paper:

  M. W. Craven and J. W. Shavlik, "Learning to Predict Reading Frames in E. coli DNA sequences", Proc. of the 26th Hawaii International Conference on System Sciences, Wailea, HI, January 1993, pp. 773-782.

The data set contains short segments from E. coli DNA sequences. Roughly, the learning task is to distinguish protein-coding sequences from noncoding sequences. More precisely, the task is to distinguish in-frame sequences taken from protein-coding regions from out-of-frame and opposite strand sequences. This task involves learning some of the essential characteristics of genes.

The data is contained in four files. Each line in these files represents a separate instance, and comprises three fields: a name, a DNA segment, and a class label. The name indicates from where the instance was extracted (more on this later). The DNA segment is a 15-character string composed from the alphabet {A, G, C, T}. These four letters represent the four nucleotides that are the key building blocks of DNA. The class label for an instance is either "coding" or "noncoding".

Our IJCAI paper describes two different input representations for this data set: one represents the sequences as nucleotides, the other represents the sequences as codons. A codon is simply a string of three consecutive nucleotides. The first codon in our representation starts with the first nucleotide in a segment, and the codons do not overlap each other. For example, the codons that form the segment "GAACGTTTCCCAAAT" are GAA, CGT, TTC, CCA, and AAT.
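The instance format and the codon split described above can be sketched in Python as follows. This is only an illustrative sketch, not code from the papers. The field delimiter is not specified in this file, so the sketch assumes the three fields are separated by commas and/or whitespace and that names contain neither. The function names parse_instance and split_codons, and the instance name used in the usage note, are hypothetical.

```python
import re


def parse_instance(line):
    """Split one data line into (name, segment, label).

    Assumption: fields are separated by commas and/or whitespace.
    Adjust the pattern if the actual files use a different delimiter.
    """
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    name, segment, label = fields
    # A segment is a 15-character string over the alphabet {A, G, C, T}.
    if len(segment) != 15 or set(segment) - set("AGCT"):
        raise ValueError(f"not a 15-character A/G/C/T segment: {segment!r}")
    # The class label is either "coding" or "noncoding".
    if label not in ("coding", "noncoding"):
        raise ValueError(f"unexpected class label: {label!r}")
    return name, segment, label


def split_codons(segment):
    """Return the non-overlapping codons, starting at the first nucleotide."""
    return [segment[i:i + 3] for i in range(0, len(segment), 3)]
```

For instance, parse_instance("EXAMPLE_1 GAACGTTTCCCAAAT coding") would return the name, the segment, and the label as a tuple, and split_codons applied to that segment gives the five codons listed in the example above.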
The "nucleotides" representation used in our IJCAI paper represents each segment as 60 binary features (15 nucleotides in the window x 4 possible values each) where each feature represents a nucleotide/value combination (e.g., nucleotide_1=A?, nucleotide_1=G?, ..., nucleotide_15=C?, nucleotide_15=T?). The "codons" representation represents each segment as 320 binary features (5 codons in the window x 64 possible values each) where each feature represents a codon/value combination (e.g., codon_1=AAA?, codon_1=AAC?, ..., codon_5=TTC?, codon_5=TTT?).

Each of the four data files contains 5000 instances, 50% of which are coding and 50% of which are noncoding. In our experiments, we formed training sets by selecting instances from 3 of the files, and then tested the resulting classifiers on the 5000 instances in the remaining file.

All of the instances in this data set were taken from the GenBank database. Each instance name indicates the GenBank locus name and the position within that entry from which the segment was taken. More information about GenBank can be found in:

  C. Burks, "GenBank: Current Status and Future Directions", Methods in Enzymology, volume 183, Academic Press, 1990, pp. 3-22.

An introduction to the problem of recognizing genes in DNA and a survey of machine-learning approaches to this problem can be found in:

  M. W. Craven and J. W. Shavlik, "Machine Learning Approaches to Gene Recognition", IEEE Expert, 9(2), 1994.

Questions about this data set should be directed to Mark Craven (craven@cs.wisc.edu).
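As a rough illustration of the two input representations and the four-file evaluation scheme described in this file, the following Python sketch builds the 60-feature and 320-feature binary vectors for a segment and enumerates train/test splits that hold out one file at a time. It is a sketch under stated assumptions, not the code used in the papers. The feature ordering (A, G, C, T within each position) is an arbitrary choice. The split helper uses all instances from the three training files, whereas the experiments selected instances from them. All function names are hypothetical.

```python
from itertools import product

# Assumption: the order of values within each position is arbitrary;
# A, G, C, T follows the alphabet as written in this file.
NUCLEOTIDES = "AGCT"
CODONS = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]  # 64 codons


def encode_nucleotides(segment):
    """15 nucleotides x 4 values = 60 binary features (one 1 per position)."""
    features = []
    for base in segment:
        features.extend(1 if base == value else 0 for value in NUCLEOTIDES)
    return features


def encode_codons(segment):
    """5 codons x 64 values = 320 binary features (one 1 per codon)."""
    features = []
    for i in range(0, len(segment), 3):
        codon = segment[i:i + 3]
        features.extend(1 if codon == value else 0 for value in CODONS)
    return features


def leave_one_file_out(files):
    """Yield (train, test) pairs, holding out each file in turn.

    `files` is a list of per-file instance lists. Assumption: every
    instance of the other files goes into the training set.
    """
    for i, test in enumerate(files):
        train = [inst for j, f in enumerate(files) if j != i for inst in f]
        yield train, test
```

With four files, leave_one_file_out yields four splits, each testing on the instances of one file and training on the instances of the other three.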