Overgo selection

Revised June, 2006

What is an overgo?

An overgo is a designed oligonucleotide marker that

·       Has 50% GC content (Tm of about 66-67 °C).

·       Is at least 36 base pairs in length. We go up to 52 bp.

·       Is synthesized from two overlapping oligos. The overlap ranges from 8 bp to 12 bp and should have 50% GC content.

·       Should be found uniquely (at a limit of 22 matching base pairs) in the set of sequences it is designed from.

If it is not obvious from the above description, the individual lengths of the two actual oligos that make up the overgo will vary depending on the combination of ultimate overgo length and the desired overlap.  As an example, if the desired overgo length is 36 bases with an overlap of 8 bp, then each individual oligo will be (36 + 8) / 2 = 22 bases in length.  Another, perhaps obvious, observation is that the length of the overlap must be even (8, 10, etc.) and thus the overgo lengths must be a number divisible by 4; e.g., 36, 40, 44, etc.
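The oligo-length arithmetic above can be sketched in a few lines of Python. This is only an illustration, not one of our programs: the function names are made up, and it assumes that both oligos have the same length and that the second oligo is ordered as the reverse complement of the right-hand end of the overgo.

```python
# Sketch only: split an overgo into the two oligos that would be ordered.
# Assumes equal-length oligos with the overlap in the centre of the overgo.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def oligo_length(overgo_len, overlap):
    """Length of each oligo: (overgo length + overlap) / 2."""
    if (overgo_len + overlap) % 2 != 0:
        raise ValueError("overgo length plus overlap must be even")
    return (overgo_len + overlap) // 2


def split_overgo(overgo, overlap):
    """Return (forward_oligo, reverse_oligo), both written 5' to 3'."""
    n = oligo_length(len(overgo), overlap)
    forward = overgo[:n]                       # left-hand end, as is
    reverse = reverse_complement(overgo[-n:])  # right-hand end, other strand
    return forward, reverse
```

For a 36-base overgo with an 8 bp overlap, split_overgo returns two 22-base oligos whose 3' ends are complementary over 8 bases.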

While overgos by themselves are simple, often other considerations make the design difficult.

·       Often there are multiple overgos that match the above parameters for any given design sequence. Thus there is the desire to select the “best” overgo.  This may involve looking at potential matches to ESTs, other species, etc. 

·       Sometimes there are no overgos that match a given design sequence.  At that point decisions have to be made about modifying the GC%, the length, etc.  Of course modifying these parameters can cause problems in the laboratory.

·       Sometimes the set of design sequences does not equal the ultimate set of sequences that will be used in the hybridization experiments in the lab.  An example is designing overgos from a given set of ESTs but then hybridizing the overgos to an entire genome.  In such a case, if possible, a double check of the uniqueness of the overgos versus the hybridized sequence(s) is a good idea.

·       In order to reduce the number of experiments, overgos are often pooled together in rows and columns. Two dimensional pools of 12 x 12 overgos or three dimensional pools of 6 x 6 x 6 overgos are some examples.  A design sequence should not match more than one overgo within a pool.  Note that an overgo must be unique within the entire group of design sequences but, unless pooled, a design sequence could have more than one overgo matching it.

 

Overview of the design steps

1.     Pick out all possible overgos from the design sequences.  We use a heavily modified version of the ‘SOOP’ program to do this work although, if SOOP is not available, it would be fairly trivial to write a program to do the work. Just find all sequences that are of the proper length (e.g., 36 bases), overall GC% and proper GC% in the overlap.

2.     In order to determine uniqueness, compare each overgo to the design sequences.  We use BLAST or BLAT to do this.

Those are really the only steps required in order to design overgos.  They are really quite simple markers.  However the devil is in the details – data transformation, data extraction, determining the best overgo, deciding what to do when there are no good overgos, arraying the overgos into pools, etc.  We have a bunch of small custom written programs that do this work for us.  Depending on the project only some of the programs may be used.  These programs are found in the directory ‘overgo_programs’.  Being familiar with UNIX tools (‘wc’, ‘grep’, etc.) is helpful.
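As a rough illustration of step 1, the following Python sketch slides a window along a sequence and keeps every window that passes the GC tests. It is not the SOOP algorithm. The default length, overlap, GC target and tolerance are taken from the SOOP options quoted later in this document ('-l36', '-o8', '-t0.5', '-w0.06'); applying the same tolerance to the overlap region, and placing the overlap at the centre of the window, are assumptions.

```python
# Sketch only; not the SOOP algorithm. Positions are 0-based.

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_overgos(seq, length=36, overlap=8, target=0.5, wiggle=0.06):
    """Yield (start, overgo) for every window that passes the GC tests."""
    seq = seq.upper()
    mid_start = (length - overlap) // 2          # overlap assumed central
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if set(window) - set("ACGT"):
            continue                             # skip N's and other characters
        if abs(gc_fraction(window) - target) > wiggle:
            continue                             # overall GC% out of range
        middle = window[mid_start:mid_start + overlap]
        if abs(gc_fraction(middle) - target) > wiggle:
            continue                             # overlap GC% out of range
        yield start, window
```

Every candidate produced this way still has to pass the uniqueness check in step 2.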

 

Some of the projects we have done

·       Designing overgos from 9 contigs of a chromosome. We compared these to the contigs as well as the rest of the incomplete genome and other data sources (ESTs, other species’ incomplete genomes).  Multiple overgos were designed for each contig.  We tried for a spacing of about 20 kb between overgos.

·       Overgos from the putative contigs of chromosomes.  As above, multiple overgos were designed for each chromosome.

·       Overgos from a set of  sequences.  We attempted to find exactly one good overgo per sequence. There was no genomic reference sequence available.  The overgos were of various lengths.

·       A set of test overgos that had differing overgo lengths and overlaps.  Our preliminary results indicate that longer overgos with larger overlaps perform better but not dramatically so.  This is not a big surprise as one would naturally expect a longer oligonucleotide to hybridize better than a short one.  Of course using longer overgos reduces the number of potential overgos to choose from.

 

The following is a step-by-step outline of our procedure.

     1) The program SOOP [Arjun Prasad, E. Green, aprasad@nhgri.nih.gov] was modified to display all overgos found in a sequence file; i.e., a single contig.  Previously the program would only display the best overgo for any given sequence.   

For each sequence, run the SOOP program as follows.  The '-b5000' option means to pick the best 5000 (or fewer) overgos.  Other options and their defaults that could be used are '-i0.80' (percent identity), '-l36' (overgo length), '-o8' (primer overlap), '-t0.5' (GC%), '-w0.06' (wiggle room for GC%), '-B0.1' (reduce the cut-off and get a lot of overgos). It is best to choose lots of overgos and thus the value given to "-b" should be large. The input file is the sequence in FASTA format.    Example:

soop -b300000 chr03.con > chr03.overgos

You may need to get rid of spaces and dashes in the sequence names.  The program ‘squeeze_spaces.pl’ will do this.  The program ‘clean_actgn.pl’ will get rid of spurious characters (including spaces) in the sequence that can mess up SOOP.

There is really nothing that special about the SOOP program and it could be re-written if so desired.  However, since we have it, it became a handy tool.   Using ‘grep -i pip’ on the output from SOOP will detect bad files that SOOP may generate.

 

     2) Next comes one of two small custom written programs.  The first one (and the quicker) is called ‘japonica_to_fasta.pl’; it converts the overgo file to a FASTA format file.  This one is good if you have only one SOOP output file and are certain that only the normal ACGT characters are present.  A second program called ‘soop_to_fasta.pl’ takes all of the SOOP output files and not only converts them to FASTA format but also throws away any lines with N’s in the sequence and any overgos that do not meet the Tm constraint (67 degrees plus or minus 2 degrees).  This latter program can take a long time to run.  You will want to use ‘soop_to_fasta.pl’ if you are using varying overgo lengths that you need to combine together.  The sequence names will have the position and length appended to them; for example, “AW3245-450-40” for position 450 and overgo length 40 of sequence “AW3245”.

japonica_to_fasta.pl <chr03.overgos >chr03.tfa

cat *.overgos | soop_to_fasta.pl > all.tfa


     3) Then the overgos need to be compared to a variety of different databases in order to determine a ranking of the overgos. Either BLAT or BLAST can be used.  For BLAT the ‘minMatch’ parameter must be 1; otherwise critical hits will be missed.  If the overgos are BLASTed it is important to get the filtering done correctly.  We wish to find all matches to the design sequences in question as well as (if available) the genome as a whole, and thus filtering must be off for these databases. We do this so that any overgo that matches more than once to the database can be discarded. For the other databases we wish to find the minimal number of hits and thus filtering can be left on.


     For BLAT the filtering is done by using a '-minScore' of 20 (the example below uses 16); remember the ‘-minMatch=1’. For BLAST the program is run with the options '-e 0.001' (expectation score), '-b 1' (alignments; we do not need to see many), '-v 50' (descriptions; make sure we see a lot), and '-F F' for filtering to be turned off. Examples:

blat -minScore=16 -tileSize=8 -minIdentity=50 -minMatch=1 -noTrimA chr03.con chr03.tfa chr03.blat

blastall -e 0.001 -b 1 -v 50 -i chr03.tfa -p blastn -d japonica-chromosome3 -o chr3.blastn -F F
blastall -e 0.001 -b 1 -v 50 -i chr03.tfa -p blastn -d japonica-ests -o japests.blastn -F T

     At a minimum the overgos need to be BLAT/BLASTed versus the design sequences in order to determine uniqueness.  Depending on the project other databases (genome, est, other species, repeat sequences, etc.) may be used as well.

     BLAT may be faster than BLAST; however, both can take a long time to run.

 


     4) The BLAT/BLAST output then needs to have the number of hits extracted from it.  This is done via one of two programs: ‘japonica_count_blat_hits’ (BLAT) or ‘japonica_count_hits’ (BLAST).  These programs count how many times each overgo matches the database. The cutoff is in base pairs for BLAT and in bit score for BLAST. Bit score is basically double the base pairs, i.e., '64' for 32 base pairs.  For the design sequences and whole genome the cutoff should be 22 base pairs (we want to make sure that each overgo is unique, and thus if it matches the design sequences more than once at 22+ base pairs then it is not unique).  For the other databases we use a cut-off of 32 base pairs since what we are trying to answer is “does the overgo match the database?”  This assumes overgos of 36 bp length.

japonica_count_blat_hits -cutoff=22 chr03.blat > chr03.score
japonica_count_blat_hits -cutoff=32 japests.blat > japests.score

japonica_count_hits chr3.blastn chr3.score 44
japonica_count_hits japests.blastn japests.score 64
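If the counting programs are not to hand, the BLAT side of this step can be approximated as follows. This is a sketch, not the ‘japonica_count_blat_hits’ program. It assumes BLAT's standard 21-column PSL output, in which column 1 is the number of matching bases and column 10 is the query (overgo) name. The helper for the BLAST cut-off just applies the "double the base pairs" rule of thumb given above.

```python
# Sketch only; not the japonica_count_blat_hits program.
from collections import Counter


def count_psl_hits(lines, cutoff=22):
    """Count, per overgo, the alignments with at least `cutoff` matching bases."""
    hits = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue                      # PSL header or blank line
        if int(fields[0]) >= cutoff:
            hits[fields[9]] += 1          # column 10 = query name
    return hits


def bit_score_cutoff(base_pairs):
    """Rule of thumb from the text: bit score is about twice the base pairs."""
    return 2 * base_pairs
```

When the database is the set of design sequences, any overgo whose count is greater than 1 at a cut-off of 22 is not unique.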

      Sometimes scores should be combined. As an example, when looking at chromosomal contigs from two different places (e.g., TIGR and IRGSP), one would like to find the worst case scenario for matches to the genome. The program ‘japonica_combine_scores.pl’ will take multiple score files on the command line and then output the worst-case score across all of the files. Example:

japonica_combine_scores.pl chr03.score-from-tigr chr03.score-from-irgsp > chr03.score
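The idea of combining scores can be sketched as below. This is not the ‘japonica_combine_scores.pl’ program: it assumes each score file has already been read into a dictionary of overgo name to hit count, and that "worst case" means the largest hit count seen in any file.

```python
# Sketch only; not the japonica_combine_scores.pl program.

def combine_worst_case(*score_tables):
    """Merge {overgo_name: hit_count} tables, keeping the largest count."""
    combined = {}
    for table in score_tables:
        for name, hits in table.items():
            combined[name] = max(hits, combined.get(name, 0))
    return combined
```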

     For BLAST the counting program above (‘japonica_count_hits’) just calls ‘blast_strip’, and so running this latter program directly is feasible.  Any other program that can count the number of hits inside a BLAT or BLAST output file could be used as well.

 

     5) At this point you have a list of overgos and a count of how often each overgo matches each database.  The techniques for determining the best overgos to use and for arranging the overgos into pools vary dramatically from project to project.   No matter which method you use, make sure that you double check your overgos for uniqueness once you are done with the selection process.  Depending on where you order your oligos from, you may need to break down the overgos into forward and reverse oligos and put them on an order form.

 

 

     5a) For the “overgo test” project we just manually chose the 36 overgos to work with.  A variety of UNIX programs (grep, cut) helped in this regard.  The program ‘make_spreadsheet_test_overgos.pl’ was used to make up an order form.

 

     5b) For soybean singleton sequences we used the ‘summarize_stripped.pl’ and ‘combine_with_lotus_medicago.pl’ programs to combine the output from searching versus the design sequences, soybean repeat sequences, medicago sequences, lotus sequences, and other soybean sequences into a CSV file.   Putting these into plates was straightforward since we did not have to worry about distances between overgos nor cross-hybridization.  Actually we did need to worry about the latter, but since we had no information on the relationships between the design sequences we had no information about possible cross-hybridization.  The program ‘add_plate_row_column.pl’ added the plate information to the CSV file.  The program ‘make_soybean_spreadsheet.pl’ made the ordering spreadsheet.

 

     Some other programs used for checking the output are ‘soybean_length’ (to print length, GC% and repeats) and ‘soybean_temperature’ (similar, also prints Tm plus length and GC%).

 

     5c) For the rice chromosome project we did need to worry about cross-hybridization, the distances between overgos along the genome, and pooling effects.  A rather complex set of programs was written to do these tasks.

 

    The program ‘japonica_make_relationship_file.pl’ was used to combine the scores for each overgo with the overgos themselves into one large CSV (comma separated value) file.  The program takes the FASTA overgo file plus the various 'count' files as generated above. Example:

japonica_make_relationship_file -soop=chr03.tfa -chromosome=chr03.score -japest=japests.score > raw.csv

         An overall score for each overgo is calculated using the custom program ‘japonica_score’.  

japonica_score < raw.csv > scored.csv

Scoring parameters. EST hits give a positive score, genomic hits tend to give negative scores, and more than one hit to the chromosome gives a score of zero, as does more than one hit to the genome. Not all of the databases have to be present (i.e., run through BLAT/BLAST and counted).

10 = base score

+5 if any hits to grain EST database (excluding Japonica and Indica)
+3 if any hits to Indica EST database
+3 if any hits to Japonica chromosome EST database
+1 if any hits to Japonica EST database

-1 if 4 or more hits to Indica genome database (BACs)
-5 if 8 or more hits to Indica genome database
0 if 2 or more hits to Japonica genome database

+15 if already in a plate (optional)
Other pluses or minuses as specified in the 'other' scoring.

If there is not exactly 1 and only 1 hit to the source Japonica chromosome then the score is 0.
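The scoring rules above can be written out as a short function. This is a sketch, not the ‘japonica_score’ program. The dictionary keys are made-up names for the hit counts against each database, a missing key is treated as zero hits (database not searched), and the hit count against the source chromosome is assumed to be always present. It also assumes that the -1 and -5 penalties for Indica genome hits do not stack.

```python
# Sketch only; not the japonica_score program. Key names are hypothetical.

def overgo_score(hits, in_plate=False, other=0):
    """Score one overgo from a {database_name: hit_count} dictionary."""
    # Uniqueness rules override everything else.
    if hits.get("japonica_chromosome", 0) != 1:
        return 0
    if hits.get("japonica_genome", 0) >= 2:
        return 0

    score = 10                                   # base score
    if hits.get("grain_est", 0) > 0:
        score += 5
    if hits.get("indica_est", 0) > 0:
        score += 3
    if hits.get("japonica_chromosome_est", 0) > 0:
        score += 3
    if hits.get("japonica_est", 0) > 0:
        score += 1

    indica_genome = hits.get("indica_genome", 0)
    if indica_genome >= 8:                       # assumed to replace the -1
        score -= 5
    elif indica_genome >= 4:
        score -= 1

    if in_plate:                                 # optional bonus
        score += 15
    return score + other                         # 'other' pluses or minuses
```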

      After the scores have been determined, suitable overgos are selected. We want to find the best scoring overgo every 15,000 - 25,000 base pairs, given a minimum score of 8.  Naturally there may be no suitable overgos within this range and thus a larger distance between overgos will sometimes be used. The custom program 'japonica_make_list' is used to do this. The distances between overgos can be varied. This is often done in order to get close to an even number of plates (hint: count the number of lines in the output file with ‘wc’; s.txt below.) Also see step 9 below on how to select a sub-set of overgos.

japonica_make_list -mindist=22000 scored.csv > s.txt
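One way to picture the selection is a greedy walk from left to right along the chromosome. The sketch below is not the ‘japonica_make_list’ algorithm, whose details are not given here. It assumes the best scoring overgo is taken from the window between min_dist and max_dist past the previous pick, and that when the window is empty the next usable overgo beyond it is taken instead. The function and parameter names are made up.

```python
# Sketch only; not the japonica_make_list algorithm.

def pick_spaced(overgos, min_dist=15000, max_dist=25000, min_score=8):
    """overgos: iterable of (position, score, name). Returns chosen names."""
    usable = sorted((pos, score, name) for pos, score, name in overgos
                    if score >= min_score)
    chosen = []
    last_pos = None
    while True:
        if last_pos is None:
            # First pick: look in a window starting at the first usable overgo.
            ahead = usable
            limit = usable[0][0] + (max_dist - min_dist) if usable else 0
        else:
            ahead = [o for o in usable if o[0] >= last_pos + min_dist]
            limit = last_pos + max_dist
        if not ahead:
            break
        window = [o for o in ahead if o[0] <= limit]
        # Best score wins; ties go to the leftmost overgo. If the window is
        # empty, fall back to the nearest usable overgo beyond it.
        best = max(window, key=lambda o: (o[1], -o[0])) if window else ahead[0]
        chosen.append(best[2])
        last_pos = best[0]
    return chosen
```

Changing min_dist and max_dist changes how many overgos are returned, which is how the list can be tuned towards a whole number of plates.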

      The output of 'make_list' is put into the CSV file via the 'japonica_insert_column' program. Column 16 is the one to be updated:

japonica_insert_column.pl -change=16 -new=s.txt -data=0 scored.csv > selected.csv


       The distances between the overgos can be checked with 'japonica_distances'.  After the overgos are deemed satisfactory they can be put into plates via 'japonica_make_plates'.  The latter program tries to space the overgos so that closely related overgos (based on distance) are as far apart (different plates, different rows and columns) as possible.  The disadvantage of doing this is that all of the plates must be run before much data can be extracted.  On the other hand, resolution of conflicted overgos should be easier.  Finally, actually ordering the overgos is done via a spreadsheet. The program 'japonica_make_ready_for_spreadsheet' does some of this work although manual intervention is still required.

japonica_distances < selected.csv
japonica_make_plates <selected.csv > plated.csv
japonica_make_ready_for_spreadsheet <plated.csv

The spreadsheet program creates 3 files:
final.csv
final.tfa
final-parts.tfa


       Naturally it is a good idea to take the final.tfa file and Blat/Blast it against the genome database in order to make sure that it indeed has unique probes.   Some checking programs include ‘japonica_double_check.pl’, ‘japonica_blat_minmax.pl’, ‘japonica_reduce_list.pl’, and ‘japonica_row_column_distances.pl’.


The three most important lessons that I have learned during the last couple of years of overgo design.
  1. Physically overgos do not extend much more than about 3 bases past their oligo lengths. Exactly how far depends on the bases in the extended region and the radioactive nucleotides used. As an example, if you are designing overgos of 36 bp length with an 8 bp overlap then the oligos, as ordered, are 22 bp in length. In the lab these 22 bp oligos do not get extended to 36 bp but rather they get extended to around 23-26 base pairs. Thus it is critically important that these be unique at a cut-off of 22 bp. Not 36 bp!
  2. In the Blast search make sure that filter is off. In the Blat search make sure that minMatch is set to 1. Otherwise you will miss critical hits.
  3. In plating overgos make sure that within a plate (12 by 12, 8 by 8, etc.) there is only one overgo that will match any given target sequence. Otherwise there can be problems in deconvoluting the pool data. This restriction is not strict but highly suggested. Note that the restriction of having only one matching overgo in any given row or column is strict; there is no good way to deconvolute this data.
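The plating rule in lesson 3 can be checked mechanically once the plate layout and the list of matches are known. The sketch below is not one of our checking programs. The two input dictionaries are hypothetical structures: 'layout' maps each overgo name to its (plate, row, column), and 'matches' maps each overgo name to the set of target sequences it hits.

```python
# Sketch only; input structures and names are hypothetical.
from collections import defaultdict


def pool_conflicts(layout, matches):
    """Return (strict, soft) lists of (target, overgo_a, overgo_b).

    strict: same plate and same row or column (cannot be deconvoluted).
    soft:   same plate but different row and column (best avoided).
    """
    by_target = defaultdict(list)
    for overgo, targets in matches.items():
        for target in targets:
            by_target[target].append(overgo)

    strict, soft = [], []
    for target, overgos in by_target.items():
        for i, a in enumerate(overgos):
            for b in overgos[i + 1:]:
                plate_a, row_a, col_a = layout[a]
                plate_b, row_b, col_b = layout[b]
                if plate_a != plate_b:
                    continue                  # different plates never clash
                if row_a == row_b or col_a == col_b:
                    strict.append((target, a, b))
                else:
                    soft.append((target, a, b))
    return strict, soft
```

An empty 'strict' list is the minimum requirement before ordering; an empty 'soft' list is the highly suggested target.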