Thread: Versioning databases




Versioning databases
user name
2006-06-05 02:33:40
Sounds nice.  I had thought of also (somehow) saving diffs in a db so
you could generate the test db you used previously.  Don't know if
there is interest in this, but we had a prototype of this a few years
ago.

Joe

Michael James wrote:
> Some biological databases actually come in versions,
>  for example: we are up to the TIGR4 rice genome and
>  Swiss-Prot (UniProtKB/Swiss-Prot Release 50.0 of 30-May-2006).
> 
> Others just change daily, NCBI:nr  NCBI:nt  etc.
> 
> All this effort creates a problem for repeatability,
>  the blast results you get next week
>  won't quite be the ones you got today.
> 
> It seems to me that the situation would be improved
>  by tagging results "BLAST against ncbi.nih.gov nr 2006-06-05 000"
> 
> This means we need to come up with a versioning scheme
>  and for anything without, I'd suggest something as simple as
>     issuing_authority  database  date        3_digit_release_number
> eg  ncbi.nih.gov       nr        2006-06-05  000
> 
> For uniqueness, use the internet name for issuing_authority.
> 
> The database is the filename stripped of all qualifiers
> Remove things like  .gz  .00.tar.gz  
> 
> The date in ISO format!
> 
> 3 more digits to ensure uniqueness.
> 
> 
> Such a scheme would also be
>  a big win for us database administrators.
> We could start to weave it through the tangled web
>  of different providers and formats
>  so we actually know the original issuing authority
>  for the file we are downloading.
> 
> What do you think?
> michaelj
> 
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landmanscalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
_______________________________________________
Bioclusters maillist  -  Bioclustersbioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters
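
The tagging scheme proposed in the quoted message (issuing authority, database, ISO date, 3-digit release number) could be sketched roughly as below. This is only an illustration under assumptions: the class name DbVersion, the helper strip_qualifiers, and the list of suffixes counted as "qualifiers" are all made up here and are not part of the proposal.

```python
import re
from dataclasses import dataclass
from datetime import date

# Suffixes treated as "qualifiers" in this sketch; the list is an assumption.
_QUALIFIERS = (".gz", ".tar", ".Z", ".bz2", ".zip")


def strip_qualifiers(filename: str) -> str:
    """Reduce a download name such as 'nr.00.tar.gz' to the bare name 'nr'."""
    name = filename
    changed = True
    while changed:
        changed = False
        for suffix in _QUALIFIERS:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                changed = True
        # Drop a trailing numeric volume part such as '.00'.
        stripped = re.sub(r"\.\d+$", "", name)
        if stripped != name:
            name = stripped
            changed = True
    return name


@dataclass(frozen=True)
class DbVersion:
    authority: str    # internet name of the issuer, e.g. "ncbi.nih.gov"
    database: str     # filename stripped of qualifiers, e.g. "nr"
    released: date    # rendered in ISO format (YYYY-MM-DD)
    release: int = 0  # 3-digit counter to keep same-day releases unique

    def tag(self) -> str:
        if not 0 <= self.release <= 999:
            raise ValueError("release number must fit in 3 digits")
        return (f"{self.authority} {self.database} "
                f"{self.released.isoformat()} {self.release:03d}")

    @classmethod
    def parse(cls, text: str) -> "DbVersion":
        authority, database, day, release = text.split()
        if not re.fullmatch(r"\d{3}", release):
            raise ValueError("release number must be exactly 3 digits")
        return cls(authority, database, date.fromisoformat(day), int(release))


version = DbVersion("ncbi.nih.gov", strip_qualifiers("nr.00.tar.gz"),
                    date(2006, 6, 5))
print("BLAST against " + version.tag())
# prints: BLAST against ncbi.nih.gov nr 2006-06-05 000
```

Splitting on whitespace works only because none of the four fields can contain a space; a provider name or database name with spaces would need a different separator.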
Versioning databases
user name
2006-06-05 03:02:45
It seems much of this could be addressed by a svn repository. I know I'd sure appreciate typing 'svn update nt'. What was in your prototype?

Versioning databases
user name
2006-06-05 03:06:10
Just a simple postgresql saving of compressed deltas with a simple
front end.  SVN wasn't popular at the time, and cvs didn't look like
it could handle it.  Even svn might blow lots of time in diff
calculation.

Mike Cariaso wrote:
> 
> It seems much of this could be addressed by a svn repository. I know I'd
> sure appreciate typing 'svn update nt'. What was in your prototype?

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landmanscalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
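
The "compressed deltas in a database" idea from this message might look something like the sketch below. It is not the prototype described above, whose details are not given in the thread. Assumptions: sqlite3 stands in for PostgreSQL so the example runs on its own, a snapshot is a dict mapping sequence ID to sequence, and the names DeltaStore, make_delta and apply_delta are hypothetical.

```python
import json
import sqlite3
import zlib


def make_delta(old: dict, new: dict) -> dict:
    """Record-level delta: IDs that vanished, plus records added or changed."""
    removed = sorted(key for key in old if key not in new)
    upserts = {key: seq for key, seq in new.items() if old.get(key) != seq}
    return {"removed": removed, "upserts": upserts}


def apply_delta(snapshot: dict, delta: dict) -> dict:
    """Return a new snapshot with the delta applied."""
    removed = set(delta["removed"])
    result = {key: seq for key, seq in snapshot.items() if key not in removed}
    result.update(delta["upserts"])
    return result


class DeltaStore:
    """Stores each version as a zlib-compressed delta against the previous one."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS delta ("
            " version INTEGER PRIMARY KEY AUTOINCREMENT,"
            " tag TEXT UNIQUE NOT NULL,"
            " body BLOB NOT NULL)"
        )

    def commit(self, tag: str, snapshot: dict) -> None:
        previous = self.checkout()  # latest stored version, {} if none yet
        delta = make_delta(previous, snapshot)
        blob = zlib.compress(json.dumps(delta).encode("utf-8"))
        with self.conn:
            self.conn.execute(
                "INSERT INTO delta (tag, body) VALUES (?, ?)", (tag, blob)
            )

    def checkout(self, tag=None) -> dict:
        """Rebuild the snapshot for `tag` (or the latest) by replaying deltas."""
        rows = self.conn.execute(
            "SELECT tag, body FROM delta ORDER BY version"
        ).fetchall()
        if tag is not None and all(row_tag != tag for row_tag, _ in rows):
            raise KeyError(tag)
        snapshot = {}
        for row_tag, body in rows:
            delta = json.loads(zlib.decompress(body).decode("utf-8"))
            snapshot = apply_delta(snapshot, delta)
            if row_tag == tag:
                break
        return snapshot


store = DeltaStore()
store.commit("ncbi.nih.gov nr 2006-06-05 000", {"seq1": "ACGT", "seq2": "GGCC"})
store.commit("ncbi.nih.gov nr 2006-06-06 000", {"seq1": "ACGT", "seq3": "TTAA"})
print(store.checkout("ncbi.nih.gov nr 2006-06-05 000"))
```

Keying the delta on record IDs avoids running a text diff over the whole file, which is one way around the diff-calculation cost mentioned above. The trade-off is that rebuilding an old version replays every delta up to it, so a real system would probably want periodic full snapshots.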
Versioning databases
user name
2006-06-05 14:43:11
On 5 Jun 2006, at 4:06 am, Joe Landman wrote:

> Just a simple postgresql saving of compressed deltas with a simple
> front end.  SVN wasn't popular at the time, and cvs didn't look
> like it could handle it.  Even svn might blow lots of time in diff
> calculation.


Oooh yes.  I've seen what happens when even small things like
bacterial genomes are kept in CVS repositories.  It's not pretty.

Putting nt in it?  *shudder*

I'm not sure I'd trust diff and patch to do the right thing anyway,
especially with repetitive or highly similar sequences.  Does diff
use enough context to be reliable for DNA sequence?  I doubt it.
Diff is essentially a sequence alignment algorithm anyway, and we all
know all about those.

Tim
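
The worry about context can be made concrete with a small sketch. It counts how many places a block of context lines matches in a line-wrapped sequence. The 60-column wrap, the 3 lines of context (the usual default for unified diffs), and the helper names wrap and context_match_positions are assumptions for illustration only.

```python
def wrap(sequence: str, width: int = 60) -> list[str]:
    """Split a sequence into fixed-width lines, as in a FASTA body."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def context_match_positions(lines: list[str], context: list[str]) -> list[int]:
    """Return every line index where the block of context lines matches."""
    size = len(context)
    return [
        start
        for start in range(len(lines) - size + 1)
        if lines[start:start + size] == context
    ]


# A tandem repeat whose period divides the line width gives identical lines:
# ten lines of ACGT repeats followed by one line of T.
repetitive = wrap("ACGT" * 15 * 10 + "T" * 60)
context = repetitive[0:3]  # 3 lines of context
positions = context_match_positions(repetitive, context)
print(len(positions))  # prints 8: the same context fits 8 different places
```

In that case the context lines alone cannot say where a hunk belongs, so a patch tool has to rely on the line numbers in the hunk header, and those are only trustworthy when the patch is applied to exactly the file the diff was made from.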