This directory contains data files for "Prediction and Validation of
Operons in Uncharacterized Prokaryotes", by Morgan N. Price, Katherine H.
Huang, Eric J. Alm, and Adam P. Arkin.

Contact: mprice@cs.cmu.edu

All data is as of February 2004. Up-to-date operon predictions are
available from the VIMSS Comparative Genomics browser:
http://www.microbesonline.org/

This README file describes ftp://vimssftp.lbl.gov/pub/UnsupervisedOperons/

Subdirectories:
    Genomics -- source code
    data -- microarray data and experimentally known operons
    out -- output of operon predictions

The files
    VIMSS2AnnotatedId
    VIMSS2Name
    VIMSS2GenbankId
contain mappings from our VIMSS ids to other names. All the other files
use VIMSS ids, typically with the column name "VIMSS", "Gene1", or "Gene2".

The "out" directory includes files for all 124 genomes, named by the NCBI
taxonomy id (shown as NNN), including values of the features used in
predictions and predicted p-values:
    caiNNN -- CAI scores for each gene. (The "CAI" column actually gives 1/CAI.)
    gncNNN.cai -- CAI scores for each pair (CAI corresponds to Gene1; nCAI
        to Gene2), along with gene neighbor and COG features
    gncNNN.caidxdy -- the bins (unsmoothed p-values) for the untrained CAI
        feature
    gncNNN.gc -- feature values and their p-values (with "bbf" prefix)
    gncNNN.genomes -- similarity of that genome to other genomes, used in
        clustering the genomes
    gncNNN.orfs -- orthologous pairs of genes
    gncNNN.phylo -- phylogenetic profiling scores, including "MI" (mutual
        information)
    gncNNN.pred -- scores for same-strand pairs, including:
        the unsupervised predictions pOp = P(Operon | all features);
        bOp = TRUE if the pair is predicted to be in an operon (pOp > 0.5);
        raw feature values Sep (separation), GNScore/GNMinus/GNWithin
            (gene neighbor scores), COGsimClass (whether COG function
            matches), CAI and nCAI (CAI for Gene1 and Gene2, respectively);
        the log likelihoods for individual features (bbf* and cfCOG);
        comparative/functional predictions pOpLogistic and distance-only
            predictions pOpDistance (as P(Operon|X) or P(Operon|d)).
        For further explanation, check the source code, especially
        util/utils.R:OperonsPredict, or send email.
    gncNNN.scores -- gene neighbor scores
    gncNNN.sep -- bins (unsmoothed p-values) for the untrained distance (or
        "Sep") feature

"out" also includes summary files:
    gncAllGenomesBN -- all pairwise similarities of genomes, which cluster
        each genome is in, and cluster sizes
    gncAllGenomes.stats -- summary of predictions over 124 genomes
    gncAllGenomes.statsNaivefOperon -- summary of predictions using the
        "strand-naive" estimate of P(Operon|Same)
    gncAllGenomes.withMI -- summary of predictions using phylogenetic
        profiles
    genomeNames -- genome names and sources of sequence

"out" also contains information about clustering genomes:
    clusterConvAdj5 -- clustering of genomes by gene neighbor method
    clusterBoth50 -- clustering of genomes by gene content similarity (used
        in computation of phylogenetic profiling)

"out" also contains data for these genomes with microarray data:
    ec -- E.coli K12
    bs -- B.subtilis
    hp -- Helicobacter pylori (data for Sydney strain 1, sequence for
        strain 26695)
    ct -- Chlamydia trachomatis
    sc -- Synechocystis sp. PCC 6803
    hb -- Halobacterium sp. NRC-1
including the files:
    bsAcc2 -- resampling of genes for accuracy estimation
    bsSingleArray -- agreement of single microarrays with predictions
    bsSubsampled -- jackknife on microarray conditions for accuracy
        estimation

"out" also includes trained predictions and tests on training data, for
E.coli (ec) and B.subtilis (bs):
    ecAOCall -- trained predictions for all same-pair genes (known=0 means
        not experimentally studied)
    ecAOCpredsD -- untrained and trained predictions for known not-operon
        pairs
    ecAOCpredsS -- untrained and trained predictions for known operon pairs

The "data" directory contains data for the 6 genomes with microarray data,
e.g. for B.subtilis:
    bsCors -- orfPairs is the microarray similarity (Pearson correlation)
        between adjacent pairs. The S{12}{ct} are total intensities, sumD1
        is the summed absolute difference in log ratio for the 1st gene,
        and sumD is the summed absolute difference in log ratio over both
        genes.
        The rows are in the same order as in the gnc1423.scores file.
    bsCors.named -- the same information with VIMSS ids, gene names, etc.
    bsRatios -- the microarray normalized log-ratios
    bsSEs -- description of the microarrays
    bsTreatmentDyes -- microarray intensities matrix
    bsControlDyes -- microarray intensities matrix

"data" also contains the experimentally known within-operon and
non-operon pairs for E.coli (ecNewSameOp, ecNewAdj) and B.subtilis
(bsSameOp, bsAdjNotOp).

For Halobacterium, the hbCors and hbCors.named files include "SignedRSqr"
instead of "orfPairs". SignedRSqr = sign(r) * r^2, where r is the Pearson
correlation.

"Genomics" contains a snapshot of the source code. Most of the R analyses,
including the operon prediction algorithm and the training of features,
but not the feature computations themselves, can be run with the data
provided and the routines in Genomics/util/utils.R. The key functions are:
    pseudocountLL -- maximum likelihood estimation within a bin with a
        Dirichlet prior
    BinnedBinaryFit, BinnedBinaryFit2, and ClassFit -- training a feature
    OperonsData -- loading data from the files
    OperonsPredict -- untrained predictions
    PredictOperonsTrained -- trained predictions
Also see Genomics/util/plots.R for the functions that created the plots in
the paper.

The scripts to create the features are in Genomics/seq (see especially the
master script Genomics/seq/runOperons) and have been tested only under
Linux. These scripts run off of a SQL database (mysql, to be specific). We
hope to provide public SQL access to this database at a later date. The
data in the database can be browsed at http://www.microbesonline.org

Lastly, the file "Delta.tar.gz" includes operon predictions for several
delta-proteobacteria, including Geobacter metallireducens and
G. sulfurreducens, which were not included in the analyses described in
the paper.
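
Three of the column conventions described in this README are easy to
misread: the "CAI" column of caiNNN holds 1/CAI, bOp is pOp thresholded at
0.5, and SignedRSqr is sign(r) * r^2. The following Python sketch restates
them as small helpers. It is only an illustration: the function names are
hypothetical and are not part of Genomics/util/utils.R, and it does not
read the files, since this README does not describe their delimiter or
header layout.

```python
import math


def cai_from_column(value: float) -> float:
    """Recover CAI from the "CAI" column of caiNNN, which stores 1/CAI."""
    return 1.0 / value


def b_op(p_op: float) -> bool:
    """Return True when a pair is predicted to be in an operon (pOp > 0.5)."""
    return p_op > 0.5


def signed_r_squared(r: float) -> float:
    """Compute SignedRSqr = sign(r) * r^2 for a Pearson correlation r."""
    # copysign gives r*r the sign of r, so a negative correlation stays negative.
    return math.copysign(r * r, r)
```

For example, signed_r_squared(-0.5) gives -0.25, so anticorrelated pairs
stay distinguishable from correlated ones after squaring.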