List Info

Thread: Papers/techniques on early culling of Smith-Waterman operation?




Papers/techniques on early culling of Smith-Waterman operation?
user name
2006-06-15 23:01:29
Hi people,

Do you know of any technique to end a Smith-Waterman
operation early,  
by saying that if the operation cannot find a match of a
certain  
desired minimum strength, then we won't do anymore? We'll
end the  
operation early? We would at every row, check the
"Best score we can  
possibly get based upon the current best score". If
the "Best score  
we could possibly get" is below our desired score, we
end the SW  
operation.

Or do you know of any papers on this subject? Or even can
you think  
of good keywords for me to search with? Perhaps the way I'm
 
describing it isn't the best way.

So, let's say you have a standard SW operation. You have a
square  
Matrix, and you run from left to right across the matrix,
line by  
line down the matrix. Let's say, the matrix has 90 rows, as
we are  
trying to match a protein sequence of 90 letters.

Now, is there anyway, some way, by logical analysis or
guesswork, to  
tell perhaps by the time we get to row 70, we know that
we'll not be  
able to get any better score matches than the one we got,
and thus  
cull the SW operation? So that we don't search through rows
71 to 90?  
The numbers 70 or 90 don't matter, as long as we can check
at every  
row, whether the "best match score" gets better
or not.

I have something like this for a global alignment producer,
and it  
slices the time down to a small fraction!! Even though it
only kills  
the last third or so, of the alignment, it still cuts down
the time  
required down to less than 66%, perhaps even down to 20%.

Basically, I've spent a few months, as a hobby, putting my
inventive  
capacity into making a SmithWaterman-based, BLAST-like
searcher. I do  
things like this because I can, although I think this is
probably the  
most complex thing I'll ever work on in my entire life.

My BLAST-like searcher, stores all of it's data in a tree
(a "trie"  
actually). Protein sequences that are shared across many
proteins,  
are stored in the root or earlier branches of the tree. So
any work  
done on the early parts of the tree is reused across many
many  
proteins. This is good, it saves CPU-time. Most of the work
then, is  
done processing the leaves of the tree.

This is why the last few rows of the SW Matrix are so
important,  
because most of the work is done there to process them! This
is why I  
want to shave the last few rows off, when possible. But of
course,  
the idea is to process the entire leaf, if the "best
score match" is  
going to get better. For that reason some kind of quick
hueristic  
predictive trick would be desired, so that we don't cull
good leaves.

Any ideas anyone? This would mean so much to me if I can
find a  
paper, or if anyone even knows what I am talking about and
can give  
me some help.

If anyone doesn't know what I am talking about but feels
like helping  
should you understand me, please let me know that I need to
explain  
this in a different better way?

I've spent many months on this algorithm. I have a nearly
identical  
algorithm that already works to do global alignmnents, it
just  
doesn't do local alignments, and the global aligner is
pretty quick  
actually. It's just the CPU-saving tricks I've done with
global  
alignment, don't seem to cross over to local alignment, and
I'm  
scratching my brain wondering if these past few months were
a waste  
or not.

Because of this "loss of tricks" (that worked
with global alignment)  
situation, I am asking for new tricks, here!

Thankfully, I was smart enough to do this work part-time, as
a  
hobby 
I also have a proper normal job that doesn't rely on the  
success of proposed inventions to make me money. I decided
it wasn't  
a good idea to starve myself just for an idea that might
never  
produce anything worthwhile.

Help anyone! Thanks a lot 

--
http://elfdata.com/plugin/




_______________________________________________
Biodevelopers mailing list
Biodevelopersbioinformatics.org
https://bioinformatics.org/mailman/listinfo/biodevelope
rs
Any simple clustering program for Hierarchical Single Linkage available?
user name
2006-06-18 06:52:46

Hello All,

I need a simple , freely downloadable , Clustering program , that will cluster a large number of records (>10,000) using Hierarchical Clustering Methods, and using Single Linkage Clustering Technique.

I would like to cluster a large number of proteins based on a similarity score that will be used as input and to output a tree in the form of a text file.

This program needs to be run from command line, prefereably a C executable from a UNIX machine. The input would be a text file of a distance matrix and the output will also be a text file of clusters or dendrograms.

I came across OC - Geoffery Barton - University of Dundee, that fits my required description, however the output is just a PS file, and thus cannot be manipulated further.

I also came across Eisen Lab's - CLuster3.0 , however, this is more complicated and used for mostly microarray data, since the input and output file fits the requirement for microarray sets.

Could someone please suggest me a simple program that would fit my description required? Would really appreciate it.

Thank you.

Shalini Sridhar

MRes.Bioinformatics and Computational Biology, Univ of Leeds,Leeds.

Any simple clustering program for Hierarchical Single Linkage available?
user name
2006-06-18 12:47:12
Shalini Sridhar wrote:
> Hello All,
> 
> I need a simple , freely downloadable , Clustering
program , that will 
> cluster a large number of records (>10,000) using
Hierarchical 
> Clustering Methods, and using Single Linkage Clustering
Technique.
> 
> I would like to cluster a large number of proteins
based on a similarity 
> score that will be used as input and to output a tree
in the form of a 
> text file.
> 
> This program needs to be run from command line,
prefereably a C 
> executable from a UNIX machine. The input would be a
text file of a 
> distance matrix and the output will also be a text file
of clusters or 
> dendrograms.
> 
> I came across OC - Geoffery Barton - University of
Dundee, that fits my 
> required description, however the output is just a PS
file, and thus 
> cannot be manipulated further.
> 
> I also came across Eisen Lab's - CLuster3.0 , however,
this is more 
> complicated and used for mostly microarray data, since
the input and 
> output file fits the requirement for microarray sets.
> 
> Could someone please suggest me a simple program that
would fit my 
> description required? Would really appreciate it.
> 

I found the statistical analysis software R exelent for
basic clustering.

Not sure if it will scale to a 10,000 x 10,000 size matrix,
but on a 
machine with lots of memory it may do fine.




> Thank you.
> 
> Shalini Sridhar
> 
> MRes.Bioinformatics and Computational Biology, Univ of
Leeds,Leeds.
> 
> 
>
------------------------------------------------------------
------------
> 
> _______________________________________________
> Biodevelopers mailing list
> Biodevelopersbioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biodevelope
rs

_______________________________________________
Biodevelopers mailing list
Biodevelopersbioinformatics.org
https://bioinformatics.org/mailman/listinfo/biodevelope
rs
[1-3]

about | contact  Other archives ( Real Estate discussion Medical topics )