This directory contains data files for "Prediction and Validation of
Operons in Uncharacterized Prokaryotes", by Morgan N. Price, Katherine
H. Huang, Eric J. Alm, and Adam P. Arkin.

Contact: mprice@cs.cmu.edu

All data is as of February 2004.

Up-to-date operon predictions are available from the VIMSS Comparative
Genomics browser:
http://www.microbesonline.org/

This README file describes
ftp://vimssftp.lbl.gov/pub/UnsupervisedOperons/

Subdirectories:
	Genomics	source code
	data		microarray data and experimentally known operons
	out		output of operon predictions

The files
	VIMSS2AnnotatedId
	VIMSS2Name
	VIMSS2GenbankId

contain mappings from our VIMSS ids to other names. All the other
files use VIMSS ids, typically with the column name "VIMSS", "Gene1",
or "Gene2".

The "out" directory includes files for all 124 genomes, named by the
NCBI taxonomy id (shown as NNN), including values of the features used
in predictions and predicted p-values:

	caiNNN -- CAI scores for each gene. (The "CAI" column
		actually gives 1/CAI.)
	gncNNN.cai -- CAI scores for each pair (CAI corresponds to Gene1;
		nCAI to Gene2), along with gene neighbor and COG features
	gncNNN.caidxdy -- the bins (unsmoothed p-values) for the untrained
		CAI feature
	gncNNN.gc -- feature values and their p-values (with "bbf" prefix)
	gncNNN.genomes -- similarity of that genome to other genomes, used in
		clustering the genomes
	gncNNN.orfs -- orthologous pairs of genes
	gncNNN.phylo -- phylogenetic profiling scores, including "MI" (mutual
		information)
	gncNNN.pred -- scores for same-strand pairs, including the
		unsupervised predictions pOp = P(Operon | all features),
		bOp = TRUE if the pair is predicted to be in an operon (pOp>0.5),
		raw feature values Sep (separation), GNScore/GNMinus/GNWithin
		(gene neighbor scores), COGsimClass (whether COG function matches),
		CAI and nCAI (CAI for Gene1 and Gene2, respectively), the
		log likelihoods for individual features (bbf* and cfCOG), 
		comparative/functional predictions pOpLogistic, and
		distance-only predictions pOpDistance (as P(Operon|X) or
		P(Operon|d)). For further explanation check the source code,
		especially util/utils.R:OperonsPredict, or send email.
	gncNNN.scores -- gene neighbor scores
	gncNNN.sep -- bins (unsmoothed p-values) for the untrained
		distance (or "Sep") feature

"out" also includes summary files:
	gncAllGenomesBN -- all pairwise similarities of genomes, which
		cluster each genome is in, and cluster sizes
	gncAllGenomes.stats -- summary of predictions over 124 genomes
	gncAllGenomes.statsNaivefOperon -- summary of predictions using
		"strand-naive" estimate of P(Operon|Same)
	gncAllGenomes.withMI -- summary of predictions using phylogenetic
		profiles
	genomeNames -- genome names and sources of sequence

"out" also contains information about clustering genomes:
	clusterConvAdj5 -- clustering of genomes by gene neighbor method
	clusterBoth50 -- clustering of genomes by gene content
		similarity (used in computation of phylogenetic profiling)

"out" also contains data for these genomes with microarray data
	ec	E.coli K12
	bs	B.subtilis
	hp	Helicobacter pylori  (data for Sydney strain 1,
			sequence for strain 26695)
	ct	Chlamydia trachomatis
	sc	Synechocystis sp. PCC 6803
	hb	Halobacterium sp. NRC-1

including the files:

	bsAcc2 -- resampling of genes for accuracy estimation
	bsSingleArray -- agreement of single microarrays with predictions
	bsSubsampled -- jackknife on microarray conditions for accuracy
			estimation

"out" also includes trained predictions and tests on training data, for
E.coli (ec) and B.subtilis (bs):

	ecAOCall -- trained predictions for all same-strand gene pairs
		(known=0 means not experimentally studied)
	ecAOCpredsD -- untrained and trained predictions for known
		not-operon pairs
	ecAOCpredsS -- untrained and trained predictions for known operon pairs

The "data" directory contains data for the 6 genomes with microarray data,
e.g. for B.subtilis:

	bsCors -- orfPairs is the microarray similarity (Pearson correlation)
		between adjacent pairs. The S{12}{ct} are total intensities,
		sumD1 is the summed absolute difference in log ratio for the
		1st gene, sumD is the summed absolute difference in log ratio
		over both genes. The rows are in the same order as in the
		gnc1423.scores file.
	bsCors.named -- the same information with VIMSS ids, gene names, etc.
	bsRatios -- the microarray normalized log-ratios
	bsSEs -- description of the microarrays
	bsTreatmentDyes -- microarray intensities matrix
	bsControlDyes -- microarray intensities matrix
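
The orfPairs similarity is a Pearson correlation. A minimal Python
sketch of that statistic for two genes' log-ratio profiles follows. It
is not the authors' code: the function name pearson and the example
vectors are made up, and it assumes the two profiles have the same
conditions and no missing values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Invented log-ratio profiles for two adjacent genes across four arrays.
gene1_ratios = [0.5, -1.0, 1.5, 0.0]
gene2_ratios = [1.0, -2.0, 3.0, 0.0]
r = pearson(gene1_ratios, gene2_ratios)
```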

"data" also contains the experimentally known
within-operon and non-operon pairs for E.coli (ecNewSameOp, ecNewAdj)
and B.subtilis (bsSameOp, bsAdjNotOp).

For Halobacterium, the hbCors and hbCors.named files include
"SignedRSqr" instead of "orfPairs". SignedRSqr = sign(r) * r^2, where
r is the Pearson correlation.
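
The SignedRSqr definition above can be written out as a short Python
sketch, together with its inverse for recovering r. The function names
are hypothetical and are not taken from the Genomics source.

```python
import math

def signed_r_squared(r):
    """SignedRSqr = sign(r) * r^2."""
    return math.copysign(r * r, r)

def pearson_from_signed_r_squared(s):
    """Invert SignedRSqr: r = sign(s) * sqrt(|s|)."""
    return math.copysign(math.sqrt(abs(s)), s)

example = signed_r_squared(-0.5)
```

The transform keeps the sign of the correlation, so negatively
correlated pairs still get negative values.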

"Genomics" contains a snapshot of the source code. Most of
the R analyses, including the operon prediction algorithm and the
training of features, but not the feature computations themselves, can
be run with the data provided and routines in 

	Genomics/util/utils.R

The key functions are:
	pseudocountLL -- maximum likelihood estimation within a bin
		with a Dirichlet prior
	BinnedBinaryFit, BinnedBinaryFit2, and ClassFit -- training a feature
	OperonsData -- loading data from the files
	OperonsPredict -- untrained predictions
	PredictOperonsTrained -- trained predictions

Also see Genomics/util/plots.R for the functions that created the plots
in the paper.

The scripts that create the features are in Genomics/seq (see especially
the master script Genomics/seq/runOperons) and have been tested only
under Linux.  These run off of a SQL database (MySQL, to be
specific). We hope to provide public SQL access to this database at a
later date. The data in the database can be browsed at
http://www.microbesonline.org

Lastly, the file "Delta.tar.gz" includes operon predictions for
several delta-proteobacteria, including Geobacter metallireducens and
G. sulfurreducens, which were not included in the analyses described
in the paper.