Thread: Assembly programs




Assembly programs
user name
2006-07-06 23:26:34
Hi folks:

   Was asked recently about genome assembly, and I gave the answer that
Chris gave below.  What bugs me is that I haven't followed the assembly
work for a while, and all I remember are the TIGR tools.

   Basically what I am asking is whether or not people have built
assembly algorithms to run on smaller memory machines, or do we still
need large memory SMPs to do the job?  64 GB and up, or can we run some
set of tools in under 16 GB on lots of cluster nodes?

   Thanks!

Joe

Chris Dagdigian wrote:
> 
> Hi François,
> 
> First off, what assembly program are you trying to run on your cluster?
> Are you sure it is even capable of running in parallel across many
> machines? Most people I know doing assembly are doing it within a single
> large SMP system because shared memory is easier/faster and (I think...)
> there is a relative lack of "true parallel" assembly algorithms.
> 
> Here are some helpful official Grid Engine URLs:
> 
> - http://gridengine.sunsource.net (main site for the codebase)
> 
> - http://docs.sun.com/app/docs/coll/1017.3  (official documentation site)
> 
> I also run a site at http://gridengine.info but that may not be helpful
> until you are at least up and running.
> 
> Some specific suggestions for you and your current setup:
> 
> (1) Ignore the 'qmon' GUI. You won't be using it anyway with your
> assembler and it just gets in the way of the more flexible command line
> programs. Stick with the unix binaries like "qstat", "qrsh" and
> "qsub". You won't be able to use SGE to its fullest unless you are
> comfortable with the command line programs.
> 
> (2) Send us (or me) the output of the command "qstat -f" when run on
> your system. It may explain why you could not run the simple.sh example
> job.
> 
> (3) Learn where your spool logs are; they will be invaluable in
> debugging failures. The default location is something along the lines of
> $SGE_ROOT/<cell>/spool/ -- in particular you want to look at the last
> few lines of "qmaster/messages", "qmaster/schedd/messages" and any
> messages files belonging to exec hosts that are not behaving.
> 
> Regards,
> Chris
> 
> On Jul 6, 2006, at 4:42 PM, francois.fauteux2mail.mcgill.ca wrote:
> 
>> Hi;
>>
>> I am totally new to grid computing. I recently tried to run a
>> sequence assembly process on a G5 (8 GB RAM) but the process
>> required more memory.
>>
>> I installed N1SGE6 on 3 Mac G5s under 10.4.7 (connected through a
>> router) (altogether 13 GB RAM) and I would like to run the assembly
>> process in parallel through the cluster, hoping that memory resources
>> would be sufficient for the process to complete.
>>
>> I would appreciate hints as to a "for-dummies-fast-how-to" to configure
>> the cluster / submit the job properly.
>>
>> I installed master and hosts with default settings. First try with
>> examples/simple.sh returns (w. qmon):
>> No free slots for interactive job!
>> while 5 CPUs are available.
>>
>> Any hint as to how to properly configure the
>> cluster/project/queues/parallel environments, or to use qsub with
>> useful options for a fast start, would be greatly
>> appreciated; thanks.
>>
>> François
>>
>> _______________________________________________
>> Bioclusters maillist  -  Bioclustersbioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landmanscalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
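
Chris's suggestion (3) in the quoted message can be scripted. What follows is a minimal sketch, not part of the original thread: it walks a spool directory, finds every file named "messages", and prints the last few lines of each. The function names are invented for illustration, and the layout under $SGE_ROOT/<cell>/spool/ is assumed to match what Chris describes as "something along the lines of", so adjust the path for your own install.

```python
from collections import deque
from pathlib import Path


def tail(path, n=5):
    """Return the last n lines of a text file, reading it as a stream."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the final n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def find_message_logs(spool_dir):
    """Yield every regular file named 'messages' beneath spool_dir."""
    for path in sorted(Path(spool_dir).rglob("messages")):
        if path.is_file():
            yield path


def report(spool_dir, n=5):
    """Print the last n lines of each messages file that was found."""
    for log in find_message_logs(spool_dir):
        print(f"== {log}")
        for line in tail(log, n):
            print(line)
```

For example, if the cell is named "default" (an assumption about your install), calling report(Path(os.environ["SGE_ROOT"]) / "default" / "spool") after an "import os" would show the tail of the qmaster, scheduler and exec host logs in one pass.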
Assembly programs
user name
2006-07-06 23:41:15
Have not had time to dig into this further but I'm pulling these
app names from notes I had taken during a recent conversation about
assembly with someone ...

The person was a current heavy user of "CAP3" on a 32 GB Solaris/SPARC
system and was looking at a program called "PCAP" as a way of running
across a cluster since the 32 GB memory machine was no longer
performing well on large assembly problems. Also mentioned repeatedly
as a possible parallel-and-low-memory-requirements alternative was EULER.

CAP3: http://www.genome.org/cgi/content/full/9/9/868

PCAP and CAP3 seem to be from the same authors but the main website
cited by Google seems to be down at the moment.

EULER looks pretty interesting and seems to live here:
http://nbcr.sdsc.edu/euler/


-Chris



Assembly programs
user name
2006-07-07 21:45:00
Joe,

    We have worked on the WGS assembly of Galdieria sulphuraria, a
unicellular red alga with an estimated genome size of ~12 Mb.  We did it on
a 4-way Opteron box with 16 GB of RAM using Arachne from the Broad Institute
(formerly Whitehead). See http://www.broad.mit.edu/wga

    As genome assembly projects go it is not that large (bigger than a
bacterium but way, way smaller than a mammal).  The 16 GB of RAM was plenty
for an assembly of this size.  The original papers on Arachne cited memory
efficiency as one of the design goals and IIRC they were doing fruit fly on
12 GB machines.  Another nice thing about Arachne was that it was reasonably
straightforward to get up and running.  I did muck around with the TIGR
Assembler and EULER a bit but was never able to get them working properly.
I should point out that 3 of the 4 CPUs on our box were idling since most of
Arachne is not "SMP aware"; only the initial parsing of the read, quality &
info files is (you can launch multiple processes if there are multiple input
files).  Arachne is not "cluster capable" either, but a decent Opteron box
with oodles of RAM can be had for a pretty good price these days.

    
Kevin M. Carr

**************************
Bioinformatics Specialist
Research Technology
  Support Facility
202-D Biochemistry Bldg.
Michigan State University
East Lansing, MI  48824

Ph: (517) 353-6794
Fax:(517) 353-8638
**************************
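
Kevin's observation that three of the four CPUs sat idle is what Amdahl's law predicts when only a small part of a run can use more than one processor. As a rough sketch (not from the thread), the function below computes the ideal speedup for a given parallel fraction and worker count. The function name is invented, and the 10% parsing share used in the example is a made-up figure, not a measurement of Arachne.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup when only parallel_fraction of the run scales.

    parallel_fraction: share of single-CPU runtime that can be spread
    across workers (0.0 to 1.0). workers: number of CPUs used.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical case: parsing is 10% of the run, on a 4-way box.
print(round(amdahl_speedup(0.10, 4), 2))  # prints 1.08
```

Under that assumed split, four CPUs buy about an 8% shorter run, which is consistent with the point that RAM, not CPU count, was the resource that mattered on that machine.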

