
Downloads

Last updated April 24, 2008

This page provides access to the data that we have generated over the years by using our manually guided AceView software to annotate the genes of human, mouse, Arabidopsis and C. elegans. The data section is organized chronologically, with current data at the top and the archive at the bottom. The format is documented where needed. Some of the software we develop is available, although not necessarily fully documented.

CURRENT DATA FOR: HUMAN March 6, 2008 · MOUSE April 24, 2008 · ARABIDOPSIS September 2007


AceView data Downloads


Conditions of use: Generating and manually curating this data is no small enterprise; it is very hard work, and we would appreciate recognition.

If you intend to perform a large-scale analysis of the AceView data with a view to publication, please contact us to make sure we are not currently doing it ourselves. This data is unique; we have performed a large number of (still unpublished) analyses, and may ask to be considered as collaborators. Thank you.

If, on the other hand, you just need the AceView data as a tool in your research or applications, you may download it from this site or from UCSC or EBI, but please acknowledge AceView by citing our publication: Danielle Thierry-Mieg and Jean Thierry-Mieg, AceView: a comprehensive cDNA-supported gene and transcripts annotation, Genome Biology 2006, 7(Suppl 1):S12.

If you want to receive an announcement when we update the data on the site, please mail us to be added to our mailing list.  

Files for download are usually in tar.gz format. Help documentation is here.

The Arabidopsis thaliana September 2007 AceView release aligns 1,188,694 cDNA sequences (available September 15, 2007) into 22,177/32,925 spliced/any genes and transcripts (33,787 spliced) on the Arabidopsis NCBI genome 7.0 (August 2007).

The files available are:

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (6.1 MB)

·  The mRNA sequences (including UTR parts) for the genes, known or unknown, but excluding the clouds, in fasta format (16.3 MB) (see the definition of clouds in the FAQ)

·  The amino acid sequences of the best protein from each transcript, provided they look like ‘good proteins’, in fasta format (7.3 MB)  Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products, most have not been observed experimentally.

·  All AceView mRNAs, with no restriction: a comprehensive non-redundant curated representation of all data submitted as cDNA sequences to GenBank and dbEST as of March 2007, in fasta format (16.7 MB).

·  All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (8.4 MB)

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (1.5 MB)

 

If you need other kinds of files or analyses, please email us; we are open to collaborations. Help on formats is here.

The Mouse September 2007 AceView release aligns more than 4 million cDNA sequences (available August 26, 2007) into genes and transcripts on the Mus musculus NCBI genome 37/mm9 (July 2007).

The files available are:

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (20.0 MB)

·  The mRNA sequences (including UTR parts) for the genes, known or unknown, but excluding the clouds, in fasta format (67.0 MB) (see the definition of clouds in the FAQ)

·  The amino acid sequences of the best protein from each transcript, provided they look like ‘good proteins’, in fasta format (15.1 MB)  Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products, most have not been observed experimentally.

·  All AceView mRNAs, with no restriction: a comprehensive non-redundant curated representation of all data submitted as cDNA sequences to GenBank and dbEST as of March 2007, in fasta format (90.7 MB).

·  All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (31.5 MB)

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (2.9 MB)

·  Added April 24, 2008: Mapping of mouse Affymetrix 430-2 expression microarray probes on AceView transcripts, on RefSeq (NM, NR, M, R from April 24, 2008) and on the genome build 37 (22 MB) (see this help for format)

The Affymetrix Mouse 430-2 expression array contains 496,468 probe sequences. Of these, 467,032 (94.1%) map to the current mouse genome, 466,730 (94.0%) map to AceView transcripts, and only 301,873 (60.8%) map to RefSeq. Using the AceView mapping should therefore significantly improve the information you get from your arrays.
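The percentages quoted above can be rechecked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the paragraph above, while the names TOTAL_PROBES, MAPPED and percent_mapped are ours and do not come from any AceView file or tool.

```python
# Probe counts copied from the text above (Affymetrix Mouse 430-2).
TOTAL_PROBES = 496_468
MAPPED = {"genome": 467_032, "AceView": 466_730, "RefSeq": 301_873}


def percent_mapped(count: int, total: int = TOTAL_PROBES) -> float:
    """Return the share of probes mapped, as a percentage with one decimal."""
    return round(100 * count / total, 1)


for target, count in MAPPED.items():
    print(f"{target}: {percent_mapped(count)}%")

# Number of probes that map to AceView transcripts beyond the RefSeq count.
print(MAPPED["AceView"] - MAPPED["RefSeq"])  # 164857
```

The last line shows the practical size of the difference: about 165 thousand more probes find a target in AceView than in RefSeq.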

 

If you need other kinds of files or analyses, please email us; we are open to collaborations. Help on formats is here.

The Human Apr07 release aligns more than 7 million cDNA sequences (available March 26, 2007 in GenBank/ dbEST/ RefSeq) into genes on the human genome assembly NCBI_36/hg18 of March 2006.

The files available are:

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (26.6 MB)

·  The mRNA sequences (including UTR parts) for the genes, known or unknown, but excluding the clouds, in fasta format (79.3 MB) (see the definition of clouds in the FAQ)

·  The amino acid sequences of the best protein from each transcript, provided they look like ‘good proteins’, in fasta format (15.6 MB)  Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products, most have not been observed experimentally.

·  All AceView mRNAs, with no restriction: a comprehensive non-redundant curated representation of all data submitted as cDNA sequences to GenBank and dbEST as of March 2007, in fasta format (119.3 MB).

·  All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (44.9 MB)

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (3.4 MB)

 

·  Added March 6, 2008: Mapping of human expression microarray probes (53 MB) (see this help)

 

If you need other kinds of files or analyses, please email us; we are open to collaborations. Help on formats is here.

Archive data

The Mouse June 2007 AceView release aligns more than 3 million cDNA sequences (available June 22, 2007) into genes and transcripts on the Mus musculus NCBI genome 37 (June 2007).  The files available are:

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (17.4 MB)

·  The mRNA sequences (including UTR parts) for the genes, known or unknown, but excluding the clouds, in fasta format (60.5 MB) (see the definition of clouds in the FAQ)

·  The amino acid sequences of the best protein from each transcript, provided they look like ‘good proteins’, in fasta format (12.2 MB)  Warning: like most proteins in SwissProt or RefSeq, these are conceptual translation products, most have not been observed experimentally.

·  All AceView mRNAs, with no restriction: a comprehensive non-redundant curated representation of all data submitted as cDNA sequences to GenBank and dbEST as of March 2007, in fasta format (81.3 MB).

·  All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (28.9 MB)

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (2.7 MB)

The Human August 2005 AceView annotations were performed on the human genome NCBI_35/hg17 of July 2004.

This release aligns the ESTs, mRNAs and RefSeqs available in GenBank or dbEST on September 24 to the human genome build NCBI_35/hg17 of July 2004.

Two files were archived in July 07; the files still available are:

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (26.8 MB)

·  The amino acid sequences of the best and good CDSs or ORFs in non-cloud mRNA, in fasta format (21.9 MB)

·  All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format (33.9 MB)

·  The mRNA sequences (including UTR parts) for all non-cloud genes, in fasta format (77.8 MB)

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins) (3.2 MB)

The Human December 2003 (Dec03) AceView annotations were performed on the human genome NCBI_34/hg16 of July 2003.

·  The amino acid sequences of each CDS or ORF in fasta format (16.3 MB)

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (18.5 MB)

·  The mRNA sequences (including UTR parts) in fasta format (77.9 MB)

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM) (730 kB)

The human AceView annotations were performed on the previous human genome build, NCBI_33 genome build of May 2003.

The amino acid sequences of each CDS or ORF in fasta format (14.5 MB)

The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (20.5 MB)

The mRNA sequences (including UTR parts) in fasta format (74.6 MB)

The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM) (604 kB)

The human AceView annotations were performed on the previous human genome build, NCBI_31 genome build of December 2002.

The amino acid sequences of each CDS or ORF in fasta format (14.6 MB)

The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (10.5 MB)

The mRNA sequences (including UTR parts) in fasta format (65.6 MB)

The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM) (582 kB)

The human AceView annotations were performed on the previous human genome build, NCBI_30 genome build of August 25 2002.

The amino acid sequences of each CDS or ORF in fasta format (11.5 MB)

The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (12.9 MB)

The mRNA sequences (including UTR parts) in fasta format (46.8 MB)

Please quote us as D., J. and Y. Thierry-Mieg, M. Potdevin, M. Sienkiewicz, V. Simonyan, www.humangenes.org: Construction and automatic annotation of cDNA-supported genes using Acembly, unpublished.

The human AceView annotations were performed on the previous human genome build, NCBI_29 of May 25, 2002:

The amino acid sequences of each CDS or ORF in fasta format (12.0 MB)

The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (15.3 MB)

Please quote us as D., J. and Y. Thierry-Mieg, M. Potdevin, V. Simonyan, www.humangenes.org: Construction and automatic annotation of cDNA-supported genes using Acembly, unpublished.

The human AceView annotations were performed on the previous human genome build, NCBI_28 of January 2002:

The amino acid sequences of each CDS or ORF in fasta format (9.5 MB)

The mRNA sequences (including UTR parts) in fasta format (40.8 MB)

The coordinates of exons/introns, CDS and UTR of each mRNA in gff format (9.5 MB)


(use shift-mouse click in Netscape)

 

Help on formats

Last updated March 21st, 2008

The files currently available are listed below.

·  The coordinates of exons/introns, CDS and UTR of each mRNA in gff format, for non-cloud genes (1 file per chromosome, MT is mitochondria)

·  The amino acid sequences of the best and good CDSs or ORFs in non-cloud mRNA, in fasta format

·  All peptides and proteins, with no restriction, for decoding mass spectrometry data, in fasta format

·  The mRNA sequences (including UTR parts) for all non-cloud genes, in fasta format

·  The highly significant PFAM hits (position on genome of the corresponding gene, title and accession of the PFAM, coordinates of the domain in the proteins)

·  The relation between all the aligned ESTs/mRNAs, the genes, the locusIds and the alternative variants, with indication of quality of match, tissue of origin, type of AceView gene, for all genes (main, putative and cloud).

·  Mapping of probes from microarrays assaying whole genome expression, from the Affymetrix HG-U133 Plus 2.0 GeneChip, Agilent Whole Human Genome Oligo Microarray, G4112A and Illumina Human-6 BeadChip, 48K v1.0 platforms

 Please use shift-mouse click in Netscape. Most files are in gzip format.

 

The relation between all the aligned ESTs/mRNAs, the genes, the locusIds and the alternative variants, with indication of quality of match, tissue of origin, type of AceView gene, for all genes (main, putative and cloud).

This file describes the association of each mRNA/EST accession satisfactorily aligned in AceView to the corresponding AceView gene and transcript.

It is ordered by map position, and includes all genes, even the "cloud" genes. We indicate the official gene name and the LocusID when available. We show, for each accession, the alternative variant to which the cDNA contributes, but please be aware that we try to minimize the number of variants to which a given cDNA belongs, to avoid combinatorics in number of AceView transcripts: even when a sequence could equally well participate in two or more variants, we try to assign it to only one. Most alternative variants, except those with structural defects that we filter out, are included.

The file was generated at the request of various users, among them the GeneCards team; it has been available from the “downloads” page since build 34. For build 35, we added the type of AceView gene (main, putative or cloud), the tissue for each accession, and the characteristics and quality of the AceView EST/mRNA to genome match.

For build 34, the file has 5,101,247 lines and 5 columns; for build 35, the file has 6,397,928 lines and 12 columns (we added columns 4, 7, 8, 9, 10, 11 and 12).

The current content and format are described below:

 column 1: chromosome position (Position)

This is a volatile name that changes with each build, not a stable identifier: it starts with the chromosome name, followed by the coordinate on that chromosome, in base pairs (e.g. 1_1287: chromosome 1, base 1287). It is handy because it allows us to export the table in genome order, and it can be helpful when you parse the data.
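As a minimal sketch of how this field could be split when parsing the file, the Python function below separates the chromosome name from the coordinate. The function name parse_position is ours, and the choice to split on the last underscore (in case a chromosome name itself contains one) is an assumption, not something the file format documents.

```python
def parse_position(position: str) -> tuple[str, int]:
    """Split a Position value such as '1_1287' into (chromosome, base)."""
    # Split on the last underscore only; whether chromosome names can
    # contain underscores is an assumption made for safety.
    chromosome, _, coordinate = position.rpartition("_")
    if not chromosome or not coordinate.isdigit():
        raise ValueError(f"unexpected Position value: {position!r}")
    return chromosome, int(coordinate)


print(parse_position("1_1287"))  # ('1', 1287)
```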

column 2: AceView gene name (AceViewID).

This should be stable from release to release, although it evolves as new official gene names are generated. See how we generate the names in AceView: basically, we use official names if they exist, else PFAM derived names, else invented AceView names.

column 3:   set of LocusID or GeneID from Entrez Gene (previously LocusLink LocusID)

Examples:

·         1234 LocusID 1234

·         NULL: there is no LocusID yet, because this gene has not yet been modeled in LocusLink/Gene at NCBI.

·         1234;9876  this gene has 2 LocusID, 1234 and 9876. This actually occurs for 1,373 human genes in build 34, out of 19,413 AceView genes with a LocusID. In build 35, 1,498 human genes out of 24,520 AceView genes with a GeneID have more than one GeneID (23,356 GeneIDs have one or more corresponding AceView genes). Some AceView genes with more than one GeneID correspond to a single gene that was split in the RefSeq/LocusLink model; others correspond to genes producing two different types of proteins but fused into a single gene in AceView because at least one cDNA bridges the two, even if only through the UTRs. AceView considers this a single (possibly complex) gene, and names it by concatenating the LocusLink/official gene names (see this help and example: PEX19andWDR42A, where clone AGENCOURT_14368013 NIH_MGC_181 Homo sapiens cDNA clone IMAGE:30398500, accession CD518985, creates a nice bridge). Finally, a few correspond to repeated genes in close tandem arrangement that we may have unintentionally merged.
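The three cases listed above (a single identifier, NULL, and several identifiers separated by semicolons) can be handled with a small helper. This is a sketch only: the name parse_locus_ids is hypothetical, and returning an empty list for NULL is our own convention.

```python
def parse_locus_ids(field: str) -> list[int]:
    """Return the LocusID/GeneID values of column 3 as a list of integers.

    An empty list stands for NULL (no LocusID assigned yet).
    """
    field = field.strip()
    if not field or field == "NULL":
        return []
    return [int(part) for part in field.split(";")]


print(parse_locus_ids("1234"))       # [1234]
print(parse_locus_ids("NULL"))       # []
print(parse_locus_ids("1234;9876"))  # [1234, 9876]
```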

 column 4: AceView gene type (Type)

Since build 35, the AceView genes have been categorized as Main gene, Putative gene or Cloud gene, as defined here.

 column 5: AceView transcript variant name (Variant)

  This is very often identical to the gene name, except when there is evidence for alternative splicing, then this column shows how we sub-cluster the mRNAs. The relation accession/ alternative variant to which the cDNA contributes is indicated in the next column, but please be aware that we try to minimize the number of variants to which a given cDNA belongs, to avoid combinatorics in number of AceView transcripts: even when a sequence could equally well participate in two or more variants, we try to assign it to only one. On the other hand, it may happen that an accession belongs to several different genes or different variants.

 column 6: mRNA or EST GenBank accession aligned in this gene (Accession)

We give the accession without a version, but for each new AceView release, we provide explicitly the date at which the mRNA/EST data were downloaded from GenBank.

 column 7: Number of basepairs of the accession to be aligned (Length)

This number is derived from the length in base pairs, as found in GenBank, minus the polyA and any vector sequence that AceView recognized and clipped. This is the length in base pairs that we think has to be aligned on the genome.

 column 8: Number of unaligned bases, on the 5’ side (Start)

The value in this column is 0 if the alignment starts at the first base that needs to be aligned (base 1 of the GenBank accession if the vector was correctly clipped by the submitter, or the first base after the vector if some unclipped vector sequence was recognized in the GenBank accession). For example, the value in this column is 20 if we failed to align the first 20 bases, with no justification: the cDNA might have some structural anomaly, or the reference genome might be missing a piece (or, rarely, our alignment procedure might have a bug).

column 9: Number of bases of the mRNA accession actually aligned on the genome in AceView (Ali)

column 10: number of base differences in the aligned region between the cDNA and the genome (Err). A single base transition, transversion, addition or deletion counts as 1.

column 11: Quality of the AceView alignment (Quality), scored from 1 (very best alignments) to 9. The quality factor reflects both the % length aligned and the % differences from the genome.

Scores also depend on whether the sequence is supposed to be high quality (mRNA) or single pass (EST). For mRNAs, the score is measured over the entire length to be aligned (column 7); for ESTs, it is scored on a maximum of 600 bp. Scores follow the empirical chart below: 

%length aligned:   > 98%       90-98%      80-90%      50-80%      Less than 50%
                                                       or 800 bp   or 800 bp
%bp differences    mRNA  EST   mRNA  EST   mRNA  EST   mRNA  EST   mRNA  EST
< 0.1%             1     1     2     2     3     3     4     5     7     9
0.1 to 1%          2     2     3     3     4     4     5     6     8     (10)
1 to 2%            3     3     4     4     5     5     6     7     (10)  (11)
2 to 3%            4     4     5     5     6     6     7     9     (11)  (11)
3 to 4%            5     5     6     6     7     7     9     (10)  (11)  (11)
4 to 5%            7     6     8     7     9     8     (10)  (11)  (11)  (11)
> 5%               9     7     (10)  8     (10)  9     (11)  (11)  (11)  (11)

 

Note: Qualities are initially scored from 1 to 11, but if no structural rearrangement is involved, we ultimately only retain qualities from 1 to 9. This scoring system was tuned empirically, first on the nematode and then on the unfinished human genome, but it is useful: qualities are used internally in AceView as a critical step to keep only the best matches during the initial clean-up phase. These numbers may also be used as a quick indicator in other projects, such as GeneCards/GeneTide, to explain some discrepancies between various clustering methods.
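For readers who want to reproduce the scoring on their own alignments, the chart above can be encoded as a lookup table. The Python sketch below is not the AceView implementation: the names QUALITY_CHART, coverage_column and lookup_quality are ours, the handling of values that fall exactly on a bin boundary is an assumption, and the "or 800 bp" clause and the 600 bp cap for ESTs are deliberately not modeled. The function expects the two percentages to be computed beforehand.

```python
# Each row is (upper bound of the %bp-difference bin, cells); None = "> 5%".
# Cells run from "> 98%" aligned down to "less than 50%" aligned, and each
# cell holds (mRNA score, EST score), copied from the chart above.
QUALITY_CHART = [
    (0.1, [(1, 1), (2, 2), (3, 3), (4, 5), (7, 9)]),
    (1.0, [(2, 2), (3, 3), (4, 4), (5, 6), (8, 10)]),
    (2.0, [(3, 3), (4, 4), (5, 5), (6, 7), (10, 11)]),
    (3.0, [(4, 4), (5, 5), (6, 6), (7, 9), (11, 11)]),
    (4.0, [(5, 5), (6, 6), (7, 7), (9, 10), (11, 11)]),
    (5.0, [(7, 6), (8, 7), (9, 8), (10, 11), (11, 11)]),
    (None, [(9, 7), (10, 8), (10, 9), (11, 11), (11, 11)]),
]


def coverage_column(pct_aligned: float) -> int:
    """Map % length aligned to a column index (boundary handling assumed)."""
    if pct_aligned > 98:
        return 0
    if pct_aligned >= 90:
        return 1
    if pct_aligned >= 80:
        return 2
    if pct_aligned >= 50:
        return 3
    return 4


def lookup_quality(pct_aligned: float, pct_diff: float, is_mrna: bool) -> int:
    """Return the initial quality score (1 to 11) read from the chart."""
    column = coverage_column(pct_aligned)
    for upper, cells in QUALITY_CHART:
        if upper is None or pct_diff < upper:
            mrna_score, est_score = cells[column]
            return mrna_score if is_mrna else est_score
    raise AssertionError("unreachable: the last row has no upper bound")


print(lookup_quality(99.0, 0.05, is_mrna=True))   # 1
print(lookup_quality(85.0, 4.5, is_mrna=False))   # 8
```

Scores of 10 and 11 returned by this lookup correspond to the parenthesized cells, which according to the note above are only retained when a structural rearrangement is involved.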

 To give ideas on the statistics, for build 35, there are

Aligned at quality    #accessions
1                     1,254,893
2                     2,502,199
3                     1,014,606
4                     573,650
5                     387,665
6                     254,449
7                     212,953
8                     114,645
9                     82,866
(10)                  29,844
(11)                  559

 

 column 12: Tissue, as copied from the GenBank submission (not yet systematic). We would prefer to use a standardized vocabulary, such as TissueInfo developed by Lucy Skrabanek and Fabien Campagne, but this is not yet done.
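Putting the twelve columns together, one line of the build 35 file could be read into a typed record as sketched below in Python. Several things here are assumptions rather than documented facts: that the columns are tab-separated, that the numeric columns always contain integers, and the field names themselves. The example row is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class AccessionRecord:
    position: str    # column 1, e.g. "1_1287"
    gene: str        # column 2, AceView gene name
    locus_ids: str   # column 3, e.g. "1234", "NULL" or "1234;9876"
    gene_type: str   # column 4, main, putative or cloud
    variant: str     # column 5, transcript variant name
    accession: str   # column 6, GenBank accession without version
    length: int      # column 7, bases to be aligned
    start: int       # column 8, unaligned bases on the 5' side
    aligned: int     # column 9, bases actually aligned
    errors: int      # column 10, base differences
    quality: int     # column 11, alignment quality
    tissue: str      # column 12, free text from GenBank


def parse_record(line: str) -> AccessionRecord:
    """Parse one line, assuming 12 tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    numbers = (int(value) for value in fields[6:11])
    return AccessionRecord(*fields[:6], *numbers, fields[11])


# Invented example row, not taken from a real AceView file.
row = "1_1287\tGENE1\t1234\tmain\tGENE1.a\tAB000001\t500\t0\t498\t2\t1\tbrain"
record = parse_record(row)
print(record.accession, record.quality)  # AB000001 1
```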

 

Mapping of probes from microarrays assaying whole genome expression, from Affymetrix HG-U133 Plus 2.0 GeneChip, Agilent Whole Human Genome Oligo Microarray, G4112A and Illumina Human-6 BeadChip, 48K v1.0 platforms

File first added March 6, 2008 for human, April 24, 2008  for Mouse.

Mapping probes to genes in a systematic fashion is essential for comparing expression results obtained on different platforms, as well as for properly interpreting results from a single platform (for example, see the MAQC project). Multiple users asked us to make public the mapping of probes to AceView transcripts and to the genome for some of the most widely used commercial microarray platforms. In the first release (March 6, 2008), we provide alignments for the Illumina Human-6 BeadChip (48k v1.0), Agilent WHG (G4112A) and Affymetrix HG-U133 Plus 2.0 GeneChip expression arrays, used in MAQC.

The mapping is performed on the current human genome (build 36) as well as on three annotated transcripts sets, two from NCBI and one from EBI. We used all AceView transcripts (current release April 2007), all RefSeq (NM and XM, but not including the few non coding NR/XR) (downloaded January 2008), or all Ensembl models (downloaded January 2008).

We use our program ProbeAlign, set to allow up to 2 mismatches (base difference or indel) for Agilent 60-mer probes and Illumina 50-mer, and up to 1 mismatch for Affymetrix 25-mer probes.
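To make these thresholds concrete, the short Python sketch below records the three settings and derives the minimum fraction of matching probe positions each one implies. It is not ProbeAlign code: the table and function names are ours, and counting each mismatch as one probe position is a simplifying assumption (an indel does not map exactly onto one position).

```python
# Probe length and mismatches allowed, as stated in the text above.
PLATFORM_SETTINGS = {
    "Agilent": (60, 2),
    "Illumina": (50, 2),
    "Affymetrix": (25, 1),
}


def passes_threshold(platform: str, mismatches: int) -> bool:
    """True if a hit has no more mismatches than the platform allows."""
    _length, allowed = PLATFORM_SETTINGS[platform]
    return mismatches <= allowed


def minimum_identity(platform: str) -> float:
    """Smallest fraction of matching positions an accepted hit can have."""
    length, allowed = PLATFORM_SETTINGS[platform]
    return (length - allowed) / length


for name in PLATFORM_SETTINGS:
    print(f"{name}: {minimum_identity(name):.1%}")
```

Under this simplification the three settings are roughly equivalent, each requiring about 96 to 97% identity over the probe.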

 You may ask for more data of that kind as we are willing to similarly map any platform which provides the sequence of their probes to the public.  This naturally excludes ABI or TaqMan.

 

The format is defined below; all files are ordered on the name of the probe (column 4):

Column 1: name of target mRNA (from AceView, RefSeq or Ensembl), or of chromosome if mapping is to reference genome

Column 2: (t1) coordinate of the first base of the match on the target mRNA

Column 3: (t2) coordinate of the last base of the match on the target. (t1 and t2 also give the strand)

Column 4: name of the arrayed probe

Column 5: (p1) coordinate of the first base of the match on the arrayed probe

Column 6: (p2) coordinate of the last base of the match on the probe

Column 7: number of uncalled bases (N) in probe [rare; in some control probes]

Column 8: number of mismatches between probe and target. One mismatch is a difference, an insertion or deletion affecting a single nucleotide

Column 9: coordinate in the probe of the start of the longest exact match

Column 10: coordinate of the end of the longest exact match

Column 11: length of the longest exact match

Column 12: position of the first mismatch

Column 13: type of the first mismatch: may be base_in_transcript > base_in_probe (e.g. a>g) OR +base for extra base in probe (insertion) OR -base for base missing in probe (deletion)

Column 14: only in Affymetrix, indicates the probeset
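A line of the probe mapping files could be parsed along the following lines in Python. This sketch rests on assumptions that the description above does not settle: that columns are tab-separated, that a match is on the forward strand when t1 is not greater than t2, and that any value in column 13 not matching the three documented patterns means "no mismatch". The names ProbeHit, classify_mismatch and parse_hit, and the example row, are invented for illustration.

```python
from typing import NamedTuple, Optional


class ProbeHit(NamedTuple):
    target: str                  # column 1
    t1: int                      # column 2
    t2: int                      # column 3
    probe: str                   # column 4
    p1: int                      # column 5
    p2: int                      # column 6
    uncalled: int                # column 7
    mismatches: int              # column 8
    strand: str                  # derived from t1 and t2 (assumed convention)
    first_mismatch: Optional[str]  # derived from column 13
    probeset: Optional[str]      # column 14, Affymetrix only


def classify_mismatch(code: str) -> Optional[str]:
    """Classify column 13: 'a>g', '+base' or '-base'."""
    if ">" in code:
        return "substitution"
    if code.startswith("+") and len(code) > 1:
        return "insertion"   # extra base in probe
    if code.startswith("-") and len(code) > 1:
        return "deletion"    # base missing in probe
    return None


def parse_hit(line: str) -> ProbeHit:
    """Parse one mapping line, assuming 13 or 14 tab-separated columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 13:
        raise ValueError(f"expected at least 13 columns, found {len(f)}")
    t1, t2 = int(f[1]), int(f[2])
    return ProbeHit(
        target=f[0], t1=t1, t2=t2, probe=f[3],
        p1=int(f[4]), p2=int(f[5]),
        uncalled=int(f[6]), mismatches=int(f[7]),
        strand="+" if t1 <= t2 else "-",
        first_mismatch=classify_mismatch(f[12]),
        probeset=f[13] if len(f) > 13 else None,
    )


# Invented example row, not taken from a real mapping file.
row = "GENE1.a\t120\t144\tprobe_1\t1\t25\t0\t1\t1\t12\t12\t13\ta>g\tset_1"
hit = parse_hit(row)
print(hit.strand, hit.first_mismatch, hit.probeset)  # + substitution set_1
```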

 

We also provide:

·         A list of the 276,184 MAQC validated probes from all MAQC platforms, according to our unpublished analysis (File: ConfirmedTitratingProbes.txt). These probes map uniquely to a single gene and do not cross-hybridize. They have two desirable properties when hybridized to mixes of RNA samples A (Universal RNA Stratagene) and B (Ambion brain). First, they are “titrating”, i.e. signals for the 4 samples containing mixes of the two RNAs, A (A100:B0), C (A75:B25), D (A25:B75) and B (A0:B100), are in proper order, either monotonically increasing or decreasing, when we average the normalized signals over the 15 replicas (minus a few outlier arrays). In addition, the direction of the differential expression for the gene measured agrees with at least 2 other platforms: A/C/D/B signals vary in the same direction in a majority of platforms assaying the gene, so in this sense, all probes in these files coherently and sensitively measure the validated differential expression A>B or B>A for the genes. Probes are ordered alphabetically: ABI, AFX (Affymetrix), AGL (Agilent), EPN (Eppendorf), GEH (General Electric Healthcare), GEX, ILM (Illumina), NCI (Operon), QGN (QuantiGene) and TAQ (TaqMan). (Note that this is a minimal list of validated probes, as some probes on the array might not have been testable by the two RNA samples selected for MAQC. But at least these probes have been proven to be good, sensitive and reliable.)

·         A file giving a comparable measure of the melting temperature for all probes (files ending in .Tm).

·         The mapping in AceView of all the MAQC probes to their desired (and undesired) target genes, including the identification of the specific alternative transcripts targeted, is available as a zip or a tar.gz compressed file. This file was generated as part of the MicroArray Quality Control project (MAQC study): we used ProbeAlign to map all probe sequences to the human genes (from the previous version of AceView, April 2005), including their putatively cross-hybridizing targets. The total numbers of genes tested with gene-specific probes by each genome-wide array platform participating in the project, i.e. 29,040 genes for Affymetrix, 22,106 for Agilent, 21,943 for Applied Biosystems, 35,392 for Codelink General Electric, 27,463 for Illumina and 19,025 for the NCI/Operon array, are detailed in the Supplementary data, page 14, on the Nature Biotechnology website.
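The "titrating" property described in the first item of this list amounts to a monotonicity test on four averaged signals. The Python sketch below illustrates the idea only; it is not the code used in our MAQC analysis. The function name is_titrating is hypothetical, the inputs are assumed to be the already averaged, normalized signals for samples A, C, D and B, and requiring strict monotonicity is our assumption.

```python
def is_titrating(a: float, c: float, d: float, b: float) -> bool:
    """True if the signals for A, C, D, B are strictly monotonic.

    Sample order follows the mixing ratios: A (A100:B0), C (A75:B25),
    D (A25:B75), B (A0:B100).
    """
    series = [a, c, d, b]
    pairs = list(zip(series, series[1:]))
    increasing = all(x < y for x, y in pairs)
    decreasing = all(x > y for x, y in pairs)
    return increasing or decreasing


print(is_titrating(10.0, 8.0, 5.0, 2.0))  # True: A > C > D > B
print(is_titrating(1.0, 3.0, 2.0, 4.0))   # False: C and D are out of order
```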

We hope this information is useful; do not hesitate to send us feedback about your needs or questions.


Software

Last updated March 6th, 2008

The source code of some of the programs developed by the AceView group at NCBI is available from this page, under the GNU General Public License.

AceDB@NCBI

The source code of the NCBI version of the AceDB object-oriented database system is available here, under the GNU General Public License. This code is supported by Jean Thierry-Mieg at NCBI. All Unix/Linux 32- or 64-bit platforms are supported, including IBM, Sun, Intel, AMD, alpha ... MacX, and Windows/Cygnus. Documentation is available here.

The AceDB database manager was written in the early 90s by Richard Durbin, now at Sanger, and Jean Thierry-Mieg, author of AceView, now at NCBI. However, over the years, the codes have evolved to suit the needs of the two main AceDB authors and their users. In particular, the Sanger version has contained since 2005 a bug that can potentially lose data (http://www.acedb.org/Software/Downloads/). This bug has never affected our NCBI AceDB version.

The AceDB@NCBI server supports the AceView web site. Relative to the Sanger version, we have incorporated many optimizations and new graphics. The tabular query interface “TableMaker” was expanded to enable the selection of sets of genes with complex combinations of properties, including sequence constraints. We have developed with Mark Sienkiewicz a C language programmers’ interface, AceC, which is part of the present distribution. The human server with its 29 million objects and the mouse and worm servers all currently run on a Suse Linux box with 8 gigabytes of RAM and two dual-core Intel processors.

AceView

This package contains the complete source code of the AceView software, including the NCBI version of the acedb database manager, which we originally developed in 1990 in collaboration with Richard Durbin and which has been used in a number of genome projects in many laboratories worldwide. The AceView code is stand-alone and distributed under the GNU General Public License. It compiles and runs on all the Unix platforms we have ever tested. It is built around the acedb database manager. You may download the source code here.

It includes the Acedb database manager, written in the early 90s by Richard Durbin, now at Sanger, and Jean Thierry-Mieg, now at NCBI. As a database engine, the NCBI version is compatible with the Sanger Centre version: the data files can be freely exchanged between the two systems, they can even run from the same disc, and they both support AcePerl. However, over the years, the codes have evolved to suit the needs of the two main Acedb authors and their users. We have incorporated in the AceView NCBI version of Acedb a very powerful cDNA alignment code, and many optimizations and new graphics to support sequencing trace editing, genome assembly, mRNA to genome alignment and biological annotation of the genes. "TableMaker" was expanded to enable the selection of sets of genes with complex combinations of properties, including sequence constraints. The AceView web site is supported by a new version of the Acedb database server and a C language programmer’s interface, AceC, which are part of the present distribution. The human servers currently run on a standard Intel Linux box with 4 GB of RAM and two processors.

Support for the NCBI version is of course provided by Jean at NCBI. All Unix/Linux 32- or 64-bit platforms are supported, including IBM, Sun, Intel, Opteron, alpha ... MacX, and Windows/Cygnus. The README file in the same directory contains the instructions on how to compile and test the code. This README file is also included in the source code tar.gz file. An AceView demo, human chromosome Y as of October 2005, is available here. Please follow the instructions in the README file.

UCSCtrackCompare

This package can be used to compare the genome annotation tracks available from the magnificent UCSC genome browser maintained by Jim Kent's group. The script UCSCdownload.csh can be used to download the tracks discussed in our paper 'The genomewide AceView annotation closely matches the hand curated Gencode transcript annotation'. The chromConfig.txt and trackConfig.txt configuration files describe which tracks you wish to compare over which regions. Finally, the executable UCSCtrackCompare performs the comparison. Called without parameters, it exports its self-documentation.

You may either download the executable, or recompile. The source code is provided in this directory for your convenience if you wish to read it. But since it links against the acedb libraries, the easiest way to recompile the code is to download our AceView package, install it, and then go to the wacext subdirectory of the AceView package and issue the command 'make UCSCtrackCompare'. The executable will appear in ../bin.$ACEDB_MACHINE

Here is the users guide, the source code, the track download script, the chromosome configuration file, the track configuration file, the models.wrm (principal schema file) of our AceView database, our full schema, and some executables for Solaris, Mac, Linux.

ProbeAlign

This package can be used to align relatively short DNA sequences ("probes") to a reference set of long DNA sequences ("targets"). Typically, we use it to align microarray probes (say, the 600,000 AFX probes) to the human genome or transcriptome.

SWFC/Flash

The Flash diagrams are generated using the open software SWFC. The acedb graphic package includes several drivers, which allow acedb images to be exported in X11, PostScript, GIF... and now .sc, the input format of the swfc compiler, which in turn generates .swf files, i.e. Adobe Flash files.

Jean Thierry-Mieg

Last modified: Tue Oct 9 18:16:32 EDT 2007