Obtaining unique oligos from an EST database is a problem of great importance in bioinformatics, particularly in the discovery of new genes and the mapping of the human genome. Many algorithms have been developed to find unique oligos, and many of them are much less time consuming than the traditional brute force approach. Zheng et al. (2004) presented an algorithm that solves the unique oligos search problem efficiently. We implement this algorithm as well as several new algorithms based on theorems included in this paper. We demonstrate that these new algorithms obtain unique oligos much faster than previous ones. We also parallelize the new algorithms to further reduce the time needed to find unique oligos. All algorithms are run on ESTs obtained from a Barley EST database.
Expressed Sequence Tags (or ESTs) are fragments of DNA that are about 200–800 bases long generated from the sequencing of complementary DNA. ESTs have many applications. They were used in the Human Genome Project in the discovery of new genes and are often used in the mapping of genomic libraries. They can be used to infer functions of newly discovered genes based on comparison to known genes [
An oligonucleotide (or oligo) is a subsequence of an EST. Oligos are short, typically no longer than 50 nucleotide bases. They are often referred to by their length with the suffix “mer”; for example, an oligo of length 9 is called a 9mer. Oligos matter for EST databases because an oligo that is unique in an EST database serves as a representative of its EST sequence. The oligos contained in these EST databases have applications in many areas such as PCR primer design, microarrays, and probing genomic libraries [
In this paper we will improve on the algorithms presented in [
In this paper we use the notation
Many algorithms have been presented to solve this problem [
Suppose one has two
Suppose by contradiction that for any
Using this observation, an algorithm was presented in [
Assuming there are
In [
Suppose one has two
Suppose by contradiction that we cannot find any
Based on Theorem
[Algorithm listing, steps (1)-(9). Surviving fragments: the input description "(maximum number of mismatches between nonunique oligos)"; step (8) begins "goo2(".]
[Algorithm listing, steps (1)-(5). Surviving fragments: the input description "(maximum number of mismatches between nonunique oligos)"; step (4) begins "goo(".]
A third theorem was also briefly mentioned [
Suppose one has two
Suppose by contradiction that for any
The algorithm is somewhat similar to Algorithm
It is important to note the
[Algorithm listing, steps (1)-(9). Surviving fragments: the input description "(maximum number of mismatches between nonunique oligos)"; step (8) begins "goo2(".]
We implement these algorithms using C on a machine with 12 Intel Core i7 CPU 80 @ 3.33 GHz processors and 12 GB of memory. The datasets we use in this implementation are Barley ESTs taken from the genetic software HarvEST by Steve Wanamaker and Timothy Close of the University of California, Riverside (
Results of serial algorithms.
Algorithm                Dataset         Time taken (secs)  Nonunique oligos
Algorithm   28   6   4   1 (78 ESTs)     163                46,469
Algorithm   28   6   7   1 (78 ESTs)     131                46,469
Algorithm   27   6   9   1 (78 ESTs)     231                46,564
Algorithm   28   6   4   2 (2838 ESTs)   197,500            1,611,241
Algorithm   28   6   7   2 (2838 ESTs)   117,714            1,611,241
Algorithm   27   6   9   2 (2838 ESTs)   94,317             1,614,235
Results of parallel algorithms on 12 processors.
Algorithm                Dataset         Time taken (secs)  Nonunique oligos
Algorithm   28   6   4   1 (78 ESTs)     33                 46,469
Algorithm   28   6   7   1 (78 ESTs)     29                 46,469
Algorithm   27   6   9   1 (78 ESTs)     66                 46,564
Algorithm   28   6   4   2 (2838 ESTs)   40,420             1,611,241
Algorithm   28   6   7   2 (2838 ESTs)   22,848             1,611,241
Algorithm   27   6   9   2 (2838 ESTs)   18,375             1,614,235
In all of these algorithms, the main portion is a for loop that iterates through each index of the hash table, and the loop iterations are independent of each other. These two factors make the algorithms good candidates for parallelism. Rather than process the hash table one index at a time, our parallel algorithms process groups of indices simultaneously. Ignoring the communication between processors, our algorithms optimally parallelize our three serial algorithms.
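The loop structure described above can be sketched in C with an OpenMP parallel for. This is only an illustration under stated assumptions, not the implementation used in this paper: the array `bucket_sizes`, the helper `pairs_in_bucket`, and the function `count_candidate_pairs` are hypothetical names, and the per-index work (counting candidate pairs within one hash-table index) is a stand-in for whatever each iteration actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-index work: the number of candidate pairs among the
   entries stored at one hash-table index (bucket_size choose 2).
   It reads only its own bucket, so iterations do not depend on each other. */
static long pairs_in_bucket(int bucket_size)
{
    return (long)bucket_size * (bucket_size - 1) / 2;
}

/* Visit every hash-table index. When compiled with OpenMP enabled
   (for example -fopenmp with GCC) the iterations are shared among
   threads; otherwise the pragma is ignored and the loop runs serially
   with the same result. */
long count_candidate_pairs(const int *bucket_sizes, int table_size)
{
    long total = 0;
    int i;

    assert(bucket_sizes != NULL && table_size >= 0);

    /* reduction(+:total) gives each thread a private partial sum;
       schedule(dynamic) hands out indices in small groups, which helps
       when some indices hold far more entries than others. */
    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (i = 0; i < table_size; i++) {
        total += pairs_in_bucket(bucket_sizes[i]);
    }
    return total;
}
```

Because each iteration touches only its own index and the only shared value is combined through a reduction, no locking is needed inside the loop.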
Many APIs in different programming languages aid in the task of parallel programming. Examples for the C programming language include OpenMP and POSIX Threads (Pthreads). OpenMP allows one to easily parallelize a C program across the cores of a multicore machine [
A new trend in parallel programming is the use of GPUs. GPUs are the processing units on a computer's graphics card. C has several APIs that allow one to carry out GPU programming. Two such APIs are OpenCL and CUDA [
In the second implementation of our algorithms we use OpenMP to parallelize our algorithms across the 12 cores of our machine. Comparing the serial and parallel results, the parallel algorithms run roughly 3.5 to 5 times faster than the serial ones (for example, 163 seconds versus 33 seconds on dataset 1). This is a substantial improvement, although it falls short of the ideal case in which the time taken by the parallel algorithms equals that of the serial algorithms divided by the number of processors.
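The comparison between serial and parallel times can be made concrete by computing speedup and parallel efficiency from the reported timings. The following is a minimal sketch; the function names `speedup` and `efficiency` are introduced here for illustration and are not part of the implementation described in this paper.

```c
#include <assert.h>

/* Speedup: how many times faster the parallel run is than the serial run. */
double speedup(double serial_secs, double parallel_secs)
{
    assert(parallel_secs > 0.0);
    return serial_secs / parallel_secs;
}

/* Parallel efficiency: speedup divided by the number of processors.
   A value of 1.0 means the parallel time equals the serial time
   divided by the number of processors. */
double efficiency(double serial_secs, double parallel_secs, int processors)
{
    assert(processors > 0);
    return speedup(serial_secs, parallel_secs) / processors;
}
```

For instance, taking the first row of each results table (163 seconds serial, 33 seconds parallel on 12 processors), `speedup(163, 33)` is about 4.9 and `efficiency(163, 33, 12)` is about 0.41.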
In this paper we used three algorithms to solve the unique oligos search problem which are extensions of the algorithm presented in [
[Algorithm listing, steps (1)-(7).]
[Algorithm listing, steps (1)-(3).]
[Algorithm listing, steps (1)-(2).]
[Algorithm listing, steps (1)-(30). Surviving fragments: an input note "(depending on the filtration algorithm using this function)"; steps (11) and (24) begin "mark the".]
[Algorithm listing, steps (1)-(16). Surviving fragment: step (10) begins "mark the".]