The files in this directory contain the data set that was 
presented in the paper:

	M. W. Craven and J. W. Shavlik, "Learning to Represent Codons:
	A Challenge Problem for Constructive Induction", Proc. of the
	13th International Joint Conference on Artificial Intelligence,
	Chambery, France, August 1993, pp. 1319-1324.

This data set is a simplified version of the one presented in the paper:

	M. W. Craven and J. W. Shavlik, "Learning to Predict Reading Frames
	in E. coli DNA sequences", Proc. of the 26th Hawaii International
	Conference on System Sciences, Wailea, HI, January 1993, pp. 773-782.

The data set contains short segments from E. coli DNA sequences. 
Roughly, the learning task is to distinguish protein-coding sequences
from noncoding sequences.  More precisely, the task is to distinguish
in-frame sequences taken from protein-coding regions from out-of-frame
and opposite-strand sequences.  This task involves learning some of
the essential characteristics of genes.

The data is contained in four files.  Each line in these files represents
a separate instance, and comprises three fields: a name, a DNA segment,
and a class label.  The name indicates where the instance was
extracted from (see the GenBank note below).  The DNA segment is a
15-character string over the alphabet {A, G, C, T}.  These four letters represent
the four nucleotides that are the key building blocks of DNA.
The class label for an instance is either "coding" or "noncoding".
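
The three-field layout described above can be checked with a short
parser.  This is only a sketch: the field delimiter is not stated here,
so the code ASSUMES that fields are separated by commas and/or
whitespace.  The function name parse_instance and the instance name
EXAMPLE-100 are hypothetical and do not come from the data files.

```python
import re

NUCLEOTIDES = set("AGCT")
LABELS = {"coding", "noncoding"}


def parse_instance(line):
    """Split one line into (name, segment, label) and validate each field.

    ASSUMPTION: fields are separated by commas and/or whitespace.
    Adjust the pattern if the files use a different delimiter.
    """
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    name, segment, label = fields
    # The segment must be exactly 15 characters drawn from {A, G, C, T}.
    if len(segment) != 15 or not set(segment) <= NUCLEOTIDES:
        raise ValueError(f"bad DNA segment: {segment!r}")
    if label not in LABELS:
        raise ValueError(f"bad class label: {label!r}")
    return name, segment, label


# EXAMPLE-100 is a made-up instance name used only for illustration.
print(parse_instance("EXAMPLE-100 GAACGTTTCCCAAAT coding"))
```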

Our IJCAI paper describes two different input representations for
this data set:  one represents the sequences as nucleotides, the
other represents the sequences as codons.  A codon is simply a string of
three consecutive nucleotides.  The first codon in our representation
starts with the first nucleotide in a segment, and the codons do not
overlap each other.  For example, the codons that form the segment
"GAACGTTTCCCAAAT" are GAA, CGT, TTC, CCA, and AAT.

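The codon split described above can be sketched in a few lines.  The
function name split_codons is an illustrative choice, not something
taken from the papers.

```python
def split_codons(segment):
    """Return the non-overlapping codons of a segment, starting at its first nucleotide."""
    if len(segment) % 3 != 0:
        raise ValueError("segment length must be a multiple of 3")
    return [segment[i:i + 3] for i in range(0, len(segment), 3)]


print(split_codons("GAACGTTTCCCAAAT"))
# ['GAA', 'CGT', 'TTC', 'CCA', 'AAT']
```
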
The "nucleotides" representation used in our IJCAI paper represents
each segment as 60 binary features (15 nucleotides in the window x
4 possible values each), where each feature represents a
nucleotide/value combination (e.g., nucleotide_1=A?, nucleotide_1=G?,
..., nucleotide_15=C?, nucleotide_15=T?).  The "codons" representation
represents each segment as 320 binary features (5 codons
in the window x 64 possible values each), where each
feature represents a codon/value combination (e.g., codon_1=AAA?, codon_1=AAC?,
..., codon_5=TTC?, codon_5=TTT?).
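
Both encodings can be sketched as plain one-of-n (one-hot) feature
vectors.  The order of the values within each position (A, G, C, T for
nucleotides, and the order produced by itertools.product for codons) is
an ASSUMPTION made for this sketch; the feature order used in the
original experiments is not specified here.  The function names are
illustrative.

```python
from itertools import product

NUCLEOTIDES = "AGCT"  # assumed value order; any fixed order works
CODONS = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]  # 64 codons


def encode_nucleotides(segment):
    """One binary feature per (position, nucleotide) pair: 15 x 4 = 60."""
    features = []
    for base in segment:
        features.extend(1 if base == value else 0 for value in NUCLEOTIDES)
    return features


def encode_codons(segment):
    """One binary feature per (codon position, codon) pair: 5 x 64 = 320."""
    features = []
    for i in range(0, len(segment), 3):
        codon = segment[i:i + 3]
        features.extend(1 if codon == value else 0 for value in CODONS)
    return features


segment = "GAACGTTTCCCAAAT"
print(len(encode_nucleotides(segment)), len(encode_codons(segment)))
# 60 320
```

In each vector exactly one feature is on per position, so a segment has
15 active features in the nucleotides encoding and 5 in the codons
encoding.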

Each of the four data files contains 5000 instances, 50% of which are
coding and 50% of which are noncoding.  In our experiments, we formed
training sets by selecting instances from 3 of the files, and then
tested the resulting classifiers on the 5000 instances in the remaining file.
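
The train/test protocol above amounts to a four-fold split with one
file held out per fold.  The sketch below is a simplification: it uses
ALL instances from the three training files, whereas the text says only
that training sets were selected from them, so the original experiments
may have used subsets.  The function name and the toy data are
illustrative.

```python
def leave_one_file_out(files):
    """Yield (train, test) pairs, holding out each file's instances in turn.

    `files` is a list of lists, one inner list of instances per data file.
    """
    for held_out in range(len(files)):
        train = [
            instance
            for index, instances in enumerate(files)
            if index != held_out
            for instance in instances
        ]
        yield train, files[held_out]


# Toy stand-ins for the four data files (two instances each).
toy_files = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"], ["d1", "d2"]]
for train, test in leave_one_file_out(toy_files):
    print(len(train), len(test))
```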

All of the instances in this data set were taken from the GenBank database.
Each instance name indicates the GenBank locus name and the position within
that entry from which the segment was taken.  More information about GenBank
can be found in:

	C. Burks, "GenBank: Current Status and Future Directions",
	Methods in Enzymology, volume 183, Academic Press, 1990, pp. 3-22.

An introduction to the problem of recognizing genes in DNA and a survey
of machine-learning approaches to this problem can be found in:

	M. W. Craven and J. W. Shavlik, "Machine Learning Approaches to Gene
	Recognition", IEEE Expert, 9(2), 1994.

Questions about this data set should be directed to Mark Craven
(craven@cs.wisc.edu).