Hi,
I don't know exactly what you are looking for,
but if you assume all polymorphisms are single base substitutions and that there
are no insertions or deletions (is this correct??), then the basic code is
pretty easy. Just look at each position in each sequence and see if it matches
the reference. If so, keep going. If not, record a polymorphism.
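
A minimal sketch of that position-by-position check, assuming both sequences are the same length and already line up base for base (substitutions only). The function name find_substitutions is made up for illustration, and the output uses the "reference base/position/polymorphic base" notation with 1-based positions.

```python
def find_substitutions(reference, sequence):
    """Compare two equal-length sequences and return substitutions
    as strings such as 'A750G' (positions are 1-based)."""
    if len(reference) != len(sequence):
        # Assumption: no insertions or deletions, so lengths must agree.
        raise ValueError("sequences differ in length; indels are not handled here")
    polymorphisms = []
    pairs = zip(reference.upper(), sequence.upper())
    for index, (ref_base, seq_base) in enumerate(pairs):
        if ref_base != seq_base:
            polymorphisms.append(f"{ref_base}{index + 1}{seq_base}")
    return polymorphisms


print(find_substitutions("ACGTACGT", "ACGTGCGA"))  # ['A5G', 'T8A']
```
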
Allowing insertions or deletions is trickier
because there is a chance that your sequences will get out of alignment with
each other and that would cause massive problems. You would probably have to
check the alignment at every position. I am not sure offhand what the best way to
do this would be, but I think it would not be too hard...
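
One possible way to keep the sequences in register, sketched here as an assumption and not as a tested recommendation: let Python's standard difflib.SequenceMatcher find the matching blocks and read the differences off its opcodes. difflib is a general text-diff tool, not a biological aligner, so it can pick a poor alignment in repetitive regions and may be slow on full genomes; a real pairwise aligner would be safer. The tuple format (kind, position, reference bases, sequence bases) is invented for this example.

```python
from difflib import SequenceMatcher


def find_variants(reference, sequence):
    """Return a list of (kind, position, ref_bases, seq_bases) tuples.
    Positions are 1-based on the reference. For insertions, the position
    is the reference base immediately before the inserted bases."""
    reference, sequence = reference.upper(), sequence.upper()
    # autojunk=False stops difflib from ignoring very common characters.
    matcher = SequenceMatcher(None, reference, sequence, autojunk=False)
    variants = []
    for tag, r1, r2, s1, s2 in matcher.get_opcodes():
        ref_part, seq_part = reference[r1:r2], sequence[s1:s2]
        if tag == "equal":
            continue
        if tag == "replace" and len(ref_part) == len(seq_part):
            # Same-length replacement: report base-by-base substitutions.
            for offset, (ref_base, seq_base) in enumerate(zip(ref_part, seq_part)):
                if ref_base != seq_base:
                    variants.append(("sub", r1 + offset + 1, ref_base, seq_base))
        elif tag == "delete":
            variants.append(("del", r1 + 1, ref_part, ""))
        elif tag == "insert":
            variants.append(("ins", r1, "", seq_part))
        else:
            # Unequal-length replacement: report as a combined change.
            variants.append(("delins", r1 + 1, ref_part, seq_part))
    return variants


print(find_variants("ACGGTCAT", "ACGGCAT"))  # [('del', 5, 'T', '')]
```
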
Ethan
Ethan Strauss, Ph.D.
Bioinformatics Scientist
Promega Corporation
2800 Woods Hollow Rd.
Madison, WI 53711
608-274-4330 / 800-356-9526
ethan.strauss promega.com
Hi,
I have a bioinformatics project that involves finding polymorphisms in
mitochondrial DNA (mtDNA). The
polymorphisms are typically denoted as "reference base/position/polymorphic
base", as in A750G. I'd like to add
a software tool to our company website where a visitor could paste in a set of
mitochondrial genomes, and a reference sequence, and get back a list of
polymorphisms. Something
like:
>Seq1
A458G, T4899A....
>SEQ2
T678C, G6789C....
etc.
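
As a rough illustration of the input and output handling described above, here is a sketch that splits pasted FASTA text into records and prints a report in that layout. The names parse_fasta and format_report are hypothetical, and the polymorphism lists are assumed to have been computed elsewhere.

```python
def parse_fasta(text):
    """Split pasted FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def format_report(results):
    """Render (name, [polymorphisms]) pairs, one header line per sequence."""
    lines = []
    for name, polymorphisms in results:
        lines.append(f">{name}")
        lines.append(", ".join(polymorphisms))
    return "\n".join(lines)


print(format_report([("Seq1", ["A458G", "T4899A"]), ("SEQ2", ["T678C", "G6789C"])]))
```
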
We sequence mitochondrial DNA for customers interested in learning about
their ancient ancestry.
The site will be freely available. It will be attached to our company site, www.argusbio.com, which is still in
development at LunarPages. The author's name and an email link
could be listed on the page.
A full-length genome is 16,569 bases long. Typically two people will have around 30
to 50 differences in their mtDNAs - more (but fewer than 100) if they have very
different ancestry (African vs European, for example). These polymorphisms determine the
person's mitochondrial haplogroup.
It would be very helpful if the program were able to determine which
haplogroup the mtDNA belongs in based on the list of polymorphisms. I have tables of diagnostic
polymorphisms used for classing mt genomes.
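
One simple way this lookup could work, offered only as a sketch: score each haplogroup by the fraction of its diagnostic polymorphisms found in the sample and report the best match. The table below is a placeholder with made-up group names and markers, not real diagnostic data; the actual tables would be substituted. Real haplogroup assignment is usually hierarchical, which this flat scoring ignores.

```python
def classify_haplogroup(polymorphisms, diagnostic_table):
    """Return (best_group, score), where score is the fraction of that
    group's diagnostic markers present in the sample. Returns (None, 0.0)
    if no group has any marker present."""
    observed = set(polymorphisms)
    best_name, best_score = None, 0.0
    for name, markers in diagnostic_table.items():
        markers = set(markers)
        if not markers:
            continue
        score = len(observed & markers) / len(markers)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# PLACEHOLDER data for illustration only; not real haplogroup markers.
example_table = {
    "group_1": {"A10G", "C20T"},
    "group_2": {"A10G", "G30A", "T40C"},
}

print(classify_haplogroup(["A10G", "C20T", "T99C"], example_table))  # ('group_1', 1.0)
```
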
It would also be very useful if there were an option to generate a fasta
file that consisted of just polymorphic sites. So if someone put in 100 full-length
genomes, and a reference genome, the output would be fasta sequences where each
base varied from the reference in at least one test sequence. This output would be much easier to
align with CLUSTALW than the full-length sequences, which are typically > 99%
invariant.
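
A sketch of how the polymorphic-sites output might be produced, assuming every test sequence has the same length as the reference and lines up with it base for base (no indels). The function name variable_sites_fasta is hypothetical, and including the reference as the first record is a choice made for this example.

```python
def variable_sites_fasta(reference, records):
    """Keep only the columns where at least one test sequence differs
    from the reference. Returns (positions, fasta_text); positions are
    1-based reference coordinates of the retained columns."""
    reference = reference.upper()
    sequences = [(name, seq.upper()) for name, seq in records]
    for name, seq in sequences:
        if len(seq) != len(reference):
            # Assumption: sequences are pre-aligned and of equal length.
            raise ValueError(f"{name} differs in length from the reference")
    sites = [
        i for i, ref_base in enumerate(reference)
        if any(seq[i] != ref_base for _, seq in sequences)
    ]
    lines = [">reference", "".join(reference[i] for i in sites)]
    for name, seq in sequences:
        lines.append(f">{name}")
        lines.append("".join(seq[i] for i in sites))
    return [i + 1 for i in sites], "\n".join(lines)


positions, fasta = variable_sites_fasta(
    "ACGTACGT", [("Seq1", "ACGTGCGT"), ("SEQ2", "TCGTACGT")]
)
print(positions)  # [1, 5]
print(fasta)
```

Returning the original positions alongside the reduced sequences lets each column be mapped back to its place in the full genome.
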
I am looking for some ideas of how best to implement this
web-based tool.
Thanks,
David B. Whyte, Ph.D.
Argus Biosciences, LLC
650-954-1055
dwhyte argusbio.com
www.argusbio.com