What is an overgo?
An overgo is a designed oligonucleotide marker that
· Has 50% GC content (Tm of about 66-67 °C).
· Is at least 36 base pairs in length. We go up to 52 bp.
· Is synthesized from two overlapping oligos. The overlap ranges from 8 bp to 12 bp and should have 50% GC content.
· Should be found uniquely (at a limit of 22 matching base pairs) in the set of sequences it is designed from.
If it is not obvious from the above description, the individual lengths of the two actual oligos that make up the overgo will vary depending on the combination of ultimate overgo length and the desired overlap. As an example, if the desired overgo length is 36 bases with an overlap of 8 bp then each individual oligo will be 22 bp in length. Another, perhaps obvious, observation is that the length of the overlap must be even (8, 10, etc.) and thus the overgo lengths must be a number divisible by 4; e.g., 36, 40, 44, etc.
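The arithmetic above can be written out as a short Python sketch. Two oligos of equal length n that overlap by o bases span 2n - o bases in total, so n = (overgo length + overlap) / 2. The function name is made up for illustration and is not part of SOOP.

```python
def oligo_length(overgo_length: int, overlap: int) -> int:
    """Length of each of the two oligos that make up an overgo."""
    total = overgo_length + overlap
    if total % 2:
        # Two equal-length oligos are only possible when the sum is even.
        raise ValueError("overgo length plus overlap must be even")
    return total // 2


# A few combinations within the ranges described above.
for length, overlap in [(36, 8), (40, 8), (52, 12)]:
    print(length, overlap, oligo_length(length, overlap))
```

For the 36 base overgo with an 8 bp overlap this gives the 22 bp oligos mentioned in the text.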
While overgos by themselves are simple, often other considerations make the design difficult.
· Often there are multiple overgos that match the above parameters for any given design sequence. Thus there is the desire to select the “best” overgo. This may involve looking at potential matches to ESTs, other species, etc.
· Sometimes there are no overgos that match a given design sequence, at which point decisions have to be made about modifying the GC%, the length, etc. Of course modifying these parameters can cause problems in the laboratory.
· Sometimes the set of design sequences does not equal the ultimate set of sequences that will be used in the hybridization experiments in the lab. For example, designing overgos from a given set of ESTs but then hybridizing the overgos to an entire genome. In such a case, and if possible, a double check of the uniqueness of the overgos versus the hybridized sequence(s) is a good idea.
· In order to reduce the number of experiments, overgos are often pooled together in rows and columns. Two dimensional pools of 12 x 12 overgos or three dimensional pools of 6 x 6 x 6 overgos are some examples. The design sequences should not match more than one overgo within a pool. Note that an overgo must be unique within the entire group of design sequences but, unless pooled, a design sequence could have more than one overgo matching it.
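To illustrate the two dimensional pooling idea, here is a minimal Python sketch. It assumes the simplest possible layout, in which the overgos are numbered 0 to 143 and laid out row by row on a 12 x 12 grid; the function names are hypothetical and our own plate programs arrange the overgos differently.

```python
def pool_coordinates(index: int, rows: int = 12, cols: int = 12) -> tuple[int, int]:
    """Return the (row pool, column pool) holding the overgo at a 0-based index."""
    if not 0 <= index < rows * cols:
        raise ValueError("index is outside the grid")
    return divmod(index, cols)


def decode(row_pool: int, col_pool: int, cols: int = 12) -> int:
    """Recover the overgo index from a positive row pool and column pool."""
    return row_pool * cols + col_pool


# 144 overgos are covered by 12 row pools plus 12 column pools.
print(pool_coordinates(77), decode(*pool_coordinates(77)))
```

A clone that lights up in exactly one row pool and one column pool points to a single overgo, which is why a sequence matching two overgos in the same pool set causes ambiguity.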
Overview of the design steps
1. Pick out all possible overgos from the design sequences. We use a heavily modified version of the ‘SOOP’ program to do this work although, if SOOP is not available, it would be fairly trivial to write a program to do the work. Just find all sequences that are of the proper length (e.g., 36 bases), overall GC% and proper GC% in the overlap.
2. In order to determine uniqueness, compare each overgo to the design sequences. We use BLAST or BLAT to do this.
Those are really the only steps required in order to design overgos. They are really quite simple markers. However the devil is in the details – data transformation, data extraction, determining the best overgo, deciding what to do when there are no good overgos, arraying the overgos into pools, etc. We have a bunch of small custom written programs that do this work for us. Depending on the project only some of the programs may be used. These programs are found in the directory ‘overgo_programs’. Being familiar with UNIX tools (‘wc’, ‘grep’, etc.) is helpful.
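As a rough illustration of step 1, the following Python sketch lists every window of the requested length whose overall GC fraction and overlap GC fraction are both close to the target. It is not the SOOP algorithm. It assumes that the overlap sits in the middle of the overgo and that the same wiggle room applies to the overlap as to the whole overgo; all names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def candidate_overgos(seq, length=36, overlap=8, target=0.5, wiggle=0.06):
    """Yield (0-based start, overgo) for every window passing both GC tests."""
    seq = seq.upper()
    flank = (length - overlap) // 2  # bases on each side of the central overlap
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if set(window) - set("ACGT"):
            continue  # skip windows containing N or other odd characters
        core = window[flank:flank + overlap]
        if (abs(gc_fraction(window) - target) <= wiggle
                and abs(gc_fraction(core) - target) <= wiggle):
            yield start, window


for start, overgo in candidate_overgos("ACGT" * 10):
    print(start, overgo)
```

Uniqueness (step 2) still has to be checked separately with BLAT or BLAST.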
Some of the projects we have done
· Designing overgos from 9 contigs of a chromosome. We compared these to the contigs as well as the rest of the incomplete genome and other data sources (ESTs, other species’ incomplete genomes.) Multiple overgos were designed for each contig. We tried for a spacing of about 20 kbase between overgos.
· Overgos from the putative contigs of chromosomes. As above, multiple overgos were designed for each chromosome.
· Overgos from a set of sequences. We attempted to find exactly one good overgo per sequence. There was no genomic reference sequence available. The overgos were of various lengths.
· A set of test overgos that had differing overgo lengths and overlaps. Our preliminary results indicate that longer overgos with larger overlaps perform better but not dramatically so. This is not a big surprise, as one would naturally expect a longer oligonucleotide to hybridize better than a short one. Of course using longer overgos reduces the number of potential overgos to choose from.
The following is a step-by-step outline of our procedure.
1) The program SOOP [Arjun Prasad, E. Green, aprasad@nhgri.nih.gov]
was modified to display all overgos found in a sequence file; i.e., a single
contig. Previously the program would only display the best overgo for any
given sequence.
For each sequence, run the soop program as in the example below. The '-b5000' option means to pick the best 5000 (or fewer) overgos. Other options that could be used, shown with their defaults, are '-i0.80' (percent identity), '-l36' (overgo length), '-o8' (primer overlap), '-t0.5' (GC%), '-w0.06' (wiggle room for GC%), and '-B0.1' (reduce the cut-off and get a lot of overgos). It is best to choose lots of overgos and thus the '-b' option should be large. The input file is the sequence in FastA format.
Example:
soop -b300000 chr03.con > chr03.overgos
You may need to get rid of spaces and dashes in the sequence names. The program ‘squeeze_spaces.pl’ will do this. The program ‘clean_actgn.pl’ will get rid of spurious characters (including spaces) in the sequence that can mess up SOOP.
There is really nothing that special about the SOOP program and it could be re-written if so desired. However, since we have it, it became a handy tool. Using ‘grep -i pip’ on the output from SOOP will detect bad files that SOOP may generate.
2) Next comes one of two small custom written programs. The first one (and the quicker), ‘japonica_to_fasta.pl’, converts the overgo file to a FASTA format file. This one is good if you have only one SOOP output file and are certain that only the normal ACGT characters are present. The second program, ‘soop_to_fasta.pl’, takes all of the SOOP output files and not only converts them to FASTA format but also throws away any lines with N’s in the sequence and any overgos that do not meet the Tm constraint (67 degrees plus or minus 2 degrees). This latter program can take a long time to run. You will want to use ‘soop_to_fasta.pl’ if you are using varying overgo lengths that you need to combine together. The sequence names will have the position and length appended to them; for example, “AW3245-450-40” for position 450 and overgo length 40 of sequence “AW3245”.
japonica_to_fasta.pl <chr03.overgos >chr03.tfa
cat *.overgos | soop_to_fasta.pl > all.tfa
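The naming convention and the N filter described above can be sketched in Python as follows. This is not the code of ‘soop_to_fasta.pl’: the SOOP output format is not parsed here (the input is assumed to be already split into name, position and sequence), and the Tm check is left out because the Tm formula is not given in this document. The function name is hypothetical.

```python
def to_fasta(records):
    """Build FASTA text from (sequence name, position, overgo sequence) tuples."""
    lines = []
    for name, position, overgo in records:
        overgo = overgo.upper()
        if "N" in overgo:
            continue  # throw away overgos containing N's
        # Append position and overgo length to the name, e.g. AW3245-450-40.
        lines.append(f">{name}-{position}-{len(overgo)}")
        lines.append(overgo)
    return "\n".join(lines)


print(to_fasta([("AW3245", 450, "ACGT" * 10), ("AW3245", 900, "ACGN" * 10)]))
```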
3) Then the overgos need to be compared to a variety of different databases in order to determine a ranking of the overgos. Either BLAT or BLAST can be used. For BLAT the ‘minMatch’ parameter must be 1, otherwise very bad things will happen. If the overgos are BLASTed it is important to get the filtering done correctly. We wish to find all matches to the design sequences in question as well as (if available) the genome as a whole, and thus filtering must be off for these databases. We do this so that any overgo that matches more than once to the database can be discarded. For the other databases we wish to find the minimal number of hits and thus filtering can be left on.
For BLAT the filtering is done by using a '-minScore' of 20; remember the ‘-minMatch=1’. For BLAST the program is run with the options '-e 0.001' (expectation score), '-b 1' (alignments; we do not need to see many), '-v 50' (descriptions; make sure we see a lot), and '-F F' for filtering to be turned off. Examples:
blat -minScore=16 -tileSize=8 -minIdentity=50 -minMatch=1 -noTrimA chr03.con chr03.tfa chr03.blat
blastall -e 0.001 -b 1 -v 50 -i chr03.tfa -p blastn -d japonica-chromosome3 -o chr3.blastn -F F
blastall -e 0.001 -b 1 -v 50 -i chr03.tfa -p blastn -d japonica-ests -o japests.blastn -F T
At a minimum the overgos need to be BLAT/BLASTed versus the design sequences in order to determine uniqueness. Depending on the project other databases (genome, est, other species, repeat sequences, etc.) may be used as well.
BLAT may be faster than BLAST however both can take a long time to run.
4) The BLAT/BLAST output then needs to have the number of hits extracted from it. This is done via one of two programs: ‘japonica_count_blat_hits’ (BLAT) or ‘japonica_count_hits’ (BLAST). These programs count up how many times each overgo matches the database. The cutoff is in base pairs for BLAT and in bit score for BLAST. Bit score is basically double the base pairs; i.e., '64' for 32 base pairs. For the design sequences and whole genome the cutoff should be 22 base pairs (we want to make sure that each overgo is unique, and if it matches the design sequences more than once at 22+ base pairs then it is not unique). For the other databases we use a cut-off of 32 base pairs since what we are trying to answer is “does the overgo match the database?” This assumes overgos of 36 bp length.
japonica_count_blat_hits -cutoff=22 chr03.blat > chr03.score
japonica_count_blat_hits -cutoff=32 japests.blat > japests.score
japonica_count_hits chr3.blastn chr3.score 44
japonica_count_hits japests.blastn japests.score 64
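If the counting programs are not at hand, the BLAT case can be approximated with a few lines of Python. The sketch below assumes the standard tab-separated PSL output of BLAT (21 columns, with the number of matching bases in column 1 and the query name in column 10) and counts only that first 'match' column; whether ‘japonica_count_blat_hits’ counts in exactly this way is not known here. The function names are hypothetical.

```python
from collections import Counter


def count_psl_hits(lines, cutoff=22):
    """Count, per query name, the PSL alignments with at least `cutoff` matching bases."""
    hits = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 columns and start with a number; header lines do not.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches, query_name = int(fields[0]), fields[9]
        if matches >= cutoff:
            hits[query_name] += 1
    return hits


def bit_score_cutoff(base_pairs):
    """Rule of thumb from the text: the BLAST bit score is about double the base pairs."""
    return 2 * base_pairs


print(bit_score_cutoff(22), bit_score_cutoff(32))
```

An overgo whose count against the design sequences is greater than 1 at the 22 bp cutoff is not unique.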
Sometimes scores should be combined. As an example, when looking at chromosomal contigs from two different places (e.g., TIGR and IRGSP), one would like to find the worst-case scenario for matches to the genome. The program japonica_combine_scores will take multiple score files on the command line and then output the worst-case score across all of the files. Example:
japonica_combine_scores.pl chr03.score-from-tigr chr03.score-from-irgsp > chr03.score
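The combining step can be pictured with the Python sketch below. It assumes that each score file has already been read into a dictionary mapping overgo name to hit count, and that 'worst case' means the highest hit count seen for an overgo in any file. Both points are assumptions, as is the function name; the real score file format is not described here.

```python
def combine_scores(*score_tables):
    """Merge {overgo name: hit count} tables, keeping the largest count per overgo."""
    combined = {}
    for table in score_tables:
        for name, hits in table.items():
            combined[name] = max(hits, combined.get(name, 0))
    return combined


tigr = {"AW1-10-36": 1, "AW2-55-36": 3}
irgsp = {"AW1-10-36": 2, "AW3-90-36": 0}
print(combine_scores(tigr, irgsp))
```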
For BLAST the above program just calls ‘blast_strip’ and so running this latter program directly is feasible. Or any other program that can count the number of hits inside a BLAT or BLAST output file could be used.
5) At this point you have a list of overgos and a count of how often each overgo matches each database. The techniques for determining the best overgos to use and how to arrange the overgos into pools vary dramatically from project to project. No matter which method you use, make sure that you double check your overgos for uniqueness once you are done with the selection process. Depending on where you order your oligos from, you may need to break down the overgos into forward and reverse oligos and put them on an order form.
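Breaking an overgo down into its two oligos can be sketched in Python as follows. The sketch assumes the usual arrangement in which the forward oligo is the 5' part of the overgo and the reverse oligo is the reverse complement of the 3' part, so that the two oligos anneal over the overlap. The function name is hypothetical and this is not the code of our spreadsheet programs.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_overgo(overgo, overlap=8):
    """Return (forward oligo, reverse oligo), both written 5' to 3'."""
    overgo = overgo.upper()
    if (len(overgo) + overlap) % 2:
        raise ValueError("overgo length plus overlap must be even")
    n = (len(overgo) + overlap) // 2  # length of each oligo
    forward = overgo[:n]
    # Reverse complement of the last n bases of the overgo.
    reverse = overgo[-n:].translate(_COMPLEMENT)[::-1]
    return forward, reverse


forward, reverse = split_overgo("T" * 14 + "AAAACCCC" + "G" * 14)
print(forward)
print(reverse)
```

For a 36 base overgo with an 8 bp overlap both oligos come out at 22 bases, and the last 8 bases of each are reverse complements of one another.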
5a) For the “overgo test” project we just manually chose the 36 overgos to work with. A variety of UNIX programs (grep, cut) helped in this regard. The program ‘make_spreadsheet_test_overgos.pl’ was used to make up an order form.
5b) For soybean singleton sequences we used the ‘summarize_stripped.pl’ and ‘combine_with_lotus_medicago.pl’ programs to combine the output from searching versus the design sequences, soybean repeat sequences, medicago sequences, lotus sequences, and other soybean sequences into a CSV file. Putting these into plates was straightforward since we did not have to worry about distances between overgos nor cross-hybridization. Actually we did need to worry about the latter, but since we had no information on the relationships between the design sequences we had no information about possible cross-hybridization. The program ‘add_plate_row_column.pl’ added the plate information to the CSV file. The program ‘make_soybean_spreadsheet.pl’ made the ordering spreadsheet.
Some other programs used for checking the output are ‘soybean_length’ (to print length, GC% and repeats) and ‘soybean_temperature’ (similar, also prints Tm plus length and GC%).
5c) For the rice chromosome project we did need to worry about cross-hybridization, the distances between overgos along the genome, as well as pooling effects. A rather complex set of programs was written to do these tasks.
The program japonica_make_relationship_file.pl was used to combine the scores for each overgo with the overgos themselves into one large CSV (comma separated value) file. The program takes the FastA overgo file plus the various 'count' files as generated above. Example:
japonica_make_relationship_file -soop=chr03.tfa -chromosome=chr03.score -japest=japests.score > raw.csv
An overall score for each overgo is calculated using the custom program ‘japonica_score’.
japonica_score < raw.csv > scored.csv
Scoring parameters. EST hits give a positive score, genomic hits tend to give negative scores, and more than one hit to the chromosome gives a score of zero, as does more than one hit to the genome. Not all of the databases have to be present; i.e., run through BLAT/BLAST and counted.
10 = base score
+5 if any hits to grain EST database (excluding Japonica and Indica)
+3 if any hits to Indica EST database
+3 if any hits to Japonica chromosome EST database
+1 if any hits to Japonica EST database
-1 if 4 or more hits to Indica genome database (BACs)
-5 if 8 or more hits to Indica genome database
0 if 2 or more hits to Japonica genome database
+15 if already in a plate (optional)
Other pluses or minuses as specified in the 'other' scoring.
If there is not exactly one hit to the source Japonica chromosome then the score is 0.
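The scoring table can be read as the following Python sketch. It is not the code of ‘japonica_score’. The database labels are made up for illustration, the two Indica genome penalties are assumed not to add together (8 or more hits gives -5 rather than -6), and the 'other' scoring is left out.

```python
def overgo_score(hits, in_plate=False):
    """Score one overgo from a {database label: hit count} dictionary."""
    # Exactly one hit to the source chromosome is required, and two or more
    # hits to the Japonica genome also give a score of zero.
    if hits.get("chromosome", 0) != 1 or hits.get("japonica_genome", 0) >= 2:
        return 0
    score = 10  # base score
    if hits.get("grain_est", 0) > 0:
        score += 5
    if hits.get("indica_est", 0) > 0:
        score += 3
    if hits.get("japonica_chromosome_est", 0) > 0:
        score += 3
    if hits.get("japonica_est", 0) > 0:
        score += 1
    indica_genome = hits.get("indica_genome", 0)
    if indica_genome >= 8:
        score -= 5  # assumed to replace, not add to, the -1 penalty
    elif indica_genome >= 4:
        score -= 1
    if in_plate:
        score += 15  # optional bonus for overgos already in a plate
    return score


print(overgo_score({"chromosome": 1, "grain_est": 2, "indica_genome": 4}))
```

Databases that were not searched are simply missing from the dictionary and contribute nothing to the score.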
After the scores have been determined then suitable overgos are selected. We want to find the best scoring overgo every 15,000 - 25,000 basepairs given a minimum score of 8. Naturally there may be no suitable overgos within this range and thus a larger distance between overgos will sometimes be used. The custom program 'japonica_make_list' is used to do this. The distances between overgos can be varied. This is often done in order to get close to an even number of plates (hint: word count the number of lines in the output file; s.txt below.) Also see step 9 below on how to select a sub-set of overgos.
japonica_make_list -mindist=22000 scored.csv > s.txt
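One way to picture the selection is the greedy Python sketch below. It is not the algorithm of 'japonica_make_list'. It assumes each candidate is a (position, score) pair, picks the best scoring candidate that lies 15,000 to 25,000 bases past the previous pick, and falls back to the next usable candidate further along when that window is empty. All names are hypothetical.

```python
def select_overgos(candidates, min_dist=15000, max_dist=25000, min_score=8):
    """Greedily choose (position, score) pairs spaced min_dist to max_dist apart."""
    usable = sorted(c for c in candidates if c[1] >= min_score)
    chosen = []
    while usable:
        if chosen:
            low = chosen[-1][0] + min_dist
            usable = [c for c in usable if c[0] >= low]
            if not usable:
                break
            # Widen the window to the next candidate if the preferred one is empty.
            high = max(chosen[-1][0] + max_dist, usable[0][0])
        else:
            # First pick: best candidate near the start of the sequence.
            high = usable[0][0] + (max_dist - min_dist)
        window = [c for c in usable if c[0] <= high]
        best = max(window, key=lambda c: c[1])
        chosen.append(best)
        usable = [c for c in usable if c[0] > best[0]]
    return chosen


candidates = [(0, 9), (5000, 12), (16000, 8), (24000, 11), (30000, 7), (70000, 10)]
print(select_overgos(candidates))
```

Changing min_dist and max_dist changes how many overgos are selected, which is the knob used above to get close to an even number of plates.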
The output of 'make_list' is put into the CSV file via the 'japonica_insert_column' program. Column 16 is the one to be updated:
japonica_insert_column.pl -change=16 -new=s.txt -data=0 scored.csv > selected.csv
The distances between the overgos can be checked with 'japonica_distances'. After the overgos are deemed satisfactory they can be put into plates via 'japonica_make_plates'. The latter program tries to space the overgos so that closely related overgos (based on distance) are as far apart as possible: different plates, different rows and columns. The disadvantage of doing this is that all of the plates must be run before much data can be extracted. On the other hand, resolution of conflicted overgos should be easier. Finally, the actual ordering of the overgos is done via a spreadsheet. The program 'japonica_make_ready_for_spreadsheet' does some of this work although manual intervention is still required.
japonica_distances < selected.csv
japonica_make_plates <selected.csv > plated.csv
japonica_make_ready_for_spreadsheet <plated.csv
The spreadsheet program creates 3 files:
final.csv
final.tfa
final-parts.tfa
Naturally it is a good idea to take the final.tfa file
and Blat/Blast it against the genome database in order to make sure that it
indeed has unique probes.
Some checking programs include ‘japonica_double_check.pl’,
‘japonica_blat_minmax.pl’, ‘japonica_reduce_list.pl’, and
‘japonica_row_column_distances.pl’.