|
List Info
Thread: Papers/techniques on early culling of Smith-Waterman operation?
|
|
| Papers/techniques on early culling of
Smith-Waterman operation? |

|
2006-06-15 23:01:29 |
Hi people,
Do you know of any technique to end a Smith-Waterman
operation early,
by saying that if the operation cannot find a match of a
certain
desired minimum strength, then we won't do anymore? We'll
end the
operation early? We would at every row, check the
"Best score we can
possibly get based upon the current best score". If
the "Best score
we could possibly get" is below our desired score, we
end the SW
operation.
Or do you know of any papers on this subject? Or even can
you think
of good keywords for me to search with? Perhaps the way I'm
describing it isn't the best way.
So, let's say you have a standard SW operation. You have a
square
Matrix, and you run from left to right across the matrix,
line by
line down the matrix. Let's say, the matrix has 90 rows, as
we are
trying to match a protein sequence of 90 letters.
Now, is there anyway, some way, by logical analysis or
guesswork, to
tell perhaps by the time we get to row 70, we know that
we'll not be
able to get any better score matches than the one we got,
and thus
cull the SW operation? So that we don't search through rows
71 to 90?
The numbers 70 or 90 don't matter, as long as we can check
at every
row, whether the "best match score" gets better
or not.
I have something like this for a global alignment producer,
and it
slices the time down to a small fraction!! Even though it
only kills
the last third or so, of the alignment, it still cuts down
the time
required down to less than 66%, perhaps even down to 20%.
Basically, I've spent a few months, as a hobby, putting my
inventive
capacity into making a SmithWaterman-based, BLAST-like
searcher. I do
things like this because I can, although I think this is
probably the
most complex thing I'll ever work on in my entire life.
My BLAST-like searcher, stores all of it's data in a tree
(a "trie"
actually). Protein sequences that are shared across many
proteins,
are stored in the root or earlier branches of the tree. So
any work
done on the early parts of the tree is reused across many
many
proteins. This is good, it saves CPU-time. Most of the work
then, is
done processing the leaves of the tree.
This is why the last few rows of the SW Matrix are so
important,
because most of the work is done there to process them! This
is why I
want to shave the last few rows off, when possible. But of
course,
the idea is to process the entire leaf, if the "best
score match" is
going to get better. For that reason some kind of quick
hueristic
predictive trick would be desired, so that we don't cull
good leaves.
Any ideas anyone? This would mean so much to me if I can
find a
paper, or if anyone even knows what I am talking about and
can give
me some help.
If anyone doesn't know what I am talking about but feels
like helping
should you understand me, please let me know that I need to
explain
this in a different better way?
I've spent many months on this algorithm. I have a nearly
identical
algorithm that already works to do global alignmnents, it
just
doesn't do local alignments, and the global aligner is
pretty quick
actually. It's just the CPU-saving tricks I've done with
global
alignment, don't seem to cross over to local alignment, and
I'm
scratching my brain wondering if these past few months were
a waste
or not.
Because of this "loss of tricks" (that worked
with global alignment)
situation, I am asking for new tricks, here!
Thankfully, I was smart enough to do this work part-time, as
a
hobby
I also have a proper normal job that doesn't rely on the
success of proposed inventions to make me money. I decided
it wasn't
a good idea to starve myself just for an idea that might
never
produce anything worthwhile.
Help anyone! Thanks a lot
--
http://elfdata.com/plugin/
_______________________________________________
Biodevelopers mailing list
Biodevelopers bioinformatics.org
https://bioinformatics.org/mailman/listinfo/biodevelope
rs
|
|
| Any simple clustering program for
Hierarchical Single Linkage available? |

|
2006-06-18 06:52:46 |
|
Hello All,
I need a simple , freely downloadable , Clustering program , that will cluster a large number of records (>10,000) using Hierarchical Clustering Methods, and using Single Linkage Clustering Technique.
I would like to cluster a large number of proteins based on a similarity score that will be used as input and to output a tree in the form of a text file.
This program needs to be run from command line, prefereably a C executable from a UNIX machine. The input would be a text file of a distance matrix and the output will also be a text file of clusters or dendrograms.
I came across OC - Geoffery Barton - University of Dundee, that fits my required description, however the output is just a PS file, and thus cannot be manipulated further.
I also came across Eisen Lab's - CLuster3.0 , however, this is more complicated and used for mostly microarray data, since the input and output file fits the requirement for microarray sets.
Could someone please suggest me a simple program that would fit my description required? Would really appreciate it.
Thank you.
Shalini Sridhar
MRes.Bioinformatics and Computational Biology, Univ of Leeds,Leeds.
|
| Any simple clustering program for
Hierarchical Single Linkage available? |

|
2006-06-18 12:47:12 |
Shalini Sridhar wrote:
> Hello All,
>
> I need a simple , freely downloadable , Clustering
program , that will
> cluster a large number of records (>10,000) using
Hierarchical
> Clustering Methods, and using Single Linkage Clustering
Technique.
>
> I would like to cluster a large number of proteins
based on a similarity
> score that will be used as input and to output a tree
in the form of a
> text file.
>
> This program needs to be run from command line,
prefereably a C
> executable from a UNIX machine. The input would be a
text file of a
> distance matrix and the output will also be a text file
of clusters or
> dendrograms.
>
> I came across OC - Geoffery Barton - University of
Dundee, that fits my
> required description, however the output is just a PS
file, and thus
> cannot be manipulated further.
>
> I also came across Eisen Lab's - CLuster3.0 , however,
this is more
> complicated and used for mostly microarray data, since
the input and
> output file fits the requirement for microarray sets.
>
> Could someone please suggest me a simple program that
would fit my
> description required? Would really appreciate it.
>
I found the statistical analysis software R exelent for
basic clustering.
Not sure if it will scale to a 10,000 x 10,000 size matrix,
but on a
machine with lots of memory it may do fine.
> Thank you.
>
> Shalini Sridhar
>
> MRes.Bioinformatics and Computational Biology, Univ of
Leeds,Leeds.
>
>
>
------------------------------------------------------------
------------
>
> _______________________________________________
> Biodevelopers mailing list
> Biodevelopers bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biodevelope
rs
_______________________________________________
Biodevelopers mailing list
Biodevelopers bioinformatics.org
https://bioinformatics.org/mailman/listinfo/biodevelope
rs
|
|
[1-3]
|
|