Schema Change Document Description:

 

The table that lists the schema change has three columns.  The first two columns show the build id. The third column has change description.

Table names are in bold. Only columns are indented and only columns with changes are shown under the table.

Please see data dictionary for full table details.

 ‘ -- -- --‘ means no corresponding objects ( table or column) for the build.

For example, in build 118, a new table “SNPAlleleFreq” is added.

 

Build 127 Schema Change

Date last updated: March 2, 2007

 

-          For human, SubInd table is replaced with a partitioned view of a set of underlying tables by chromosomes in two separate databases to keep each database size small for ease of maintenance: human_gty1 has SubInd for chromosome 1-7.  human_gyt2 has SubInd for chromosome 8-22, X, Y, Mitochondrial.

-          Table name format is: SubInd_ch1, SubInd_ch2 etc. SubInd_Un is for any snp that is mapped to a contig but the contig’s chromosome is unknown; SubInd_NotOn is for any snp that can not be aligned to any contigs.

 

 

 

Build 125 Schema Change

Date last updated: September 28, 2005

 

-          Major changes are in alignment types in mapping tables.  Detailed description of new mapping tables is in:  ftp://ftp.ncbi.nih.gov/snp/specs/b125_mapping.doc

-          The other major change is that starting b125 the old dbSNP database is split by organisms into 39 different snp databases with identical schema.

-          Except mapping table changes, other changes are minimal.  Please check back around October 6, 2005 for updated b125 ER diagram.

 

Build 123 Schema Change

Date last updated: November 3, 2004

 

 

Build 122

Build 123

Note

 

 

 

    AlleleMotif                        

    AlleleMotif                         

 

        repeat_cnt                     

        repeat_cnt                     

 

          Datatype:int                 

          Datatype:real                

We have submitted variation as (ACTGTGGGGACACCTCGCCTGCACCCACCTCCCTCCCCCCAGCTCCCGAGC)10.5/11/11.5/12/14/15/21.  So repeat_cnt need to real to record the repeat value of "10.5" in this case.

 

 

 

    ContigInfo                         

    ContigInfo                         

 

        contig_acc                     

        contig_acc                     

 

          Datatype:varchar(15)         

          Datatype:varchar(25)         

To accomate longer contig accession names.

 

 

 

    -- -- --                            

    db_build_pipeline                  

This is an internal table to track tasks. It is in draft state.

 

 

 

    -- -- --                           

    dn_Motif_rev                       

Motif table has rev_motif_id to point to the reverse motif sequence in the same Motif table.  This table is a denormalized table that makes it easier to find the reverse form of a Motif.  This table is used in internal computations.

 

 

 

    -- -- --                            

    dn_UniGty_allele                   

This table is used in internal computation.

 

 

 

    -- -- --                           

    dn_UniVariation_rev                

Similar to dn_Motif_rev. This table makes it easier to get the reverse form of a univariation string.

 

 

 

 

 

 

    -- -- --                           

    hsSupplContigInfo                  

dbSNP build 123 human snp are mapped to Genome build 35.  To assist users who still work with NCBI build 34.3 map, we have also mapped the new dbSNP build 123 refSNP(rs) set against build 34.3. The results are in SNPContigLoc_b34_3.  This table stores the contig information for build 34.3. It has exactly the same structure as ContigInfo, which has build 35 contig information.

 

 

 

    -- -- --                           

    hsSupplMapInfo                     

Please see hsSupplContigInfo note. This table is the "SNPMapInfo" counterpart for build 123 rs onto build 34.3. SNPMapInfo, of course, is the map statistics for dbSNP build 123 rs hits on NCBI build 35.

 

 

 

    -- -- --                           

    IndGrpByPop                        

This table assigns "individual group such as Asian" to population id. Individual group is a way to group snp frequency data for analysis. Ind_grp is in SubmittedIndividual, IndivBySource and Individual table.

 

 

 

    -- -- --                           

    IndGrpByPopInd                     

This table assigs "individual group" to each individual id within a population. This is in addition to the above IndGrpByPop because submitters define population and sometimes a submitted population contains individuals from several "Individual groups".

 

 

 

    IndMergeArch                       

    IndMergeArch                       

We assign individual id (ind_id) based on submission information for example Coriell number. When there is no information on submitted individual id, we assign new ind_id. Later we may get Coriell numbers and find out that several separate ind_id are actually  one. We merge these ind_id. The merge process is saved in this table.

        old_ind_id                     

        -- -- --                       

 

 

 

 

        -- -- --                       

        ind_id_high                    

old_ind_id-> new_ind_id was in previous build. To be more precise, we changed to column name to: ind_id_high and ind_id_low. Since when we merge ind_id, we always keep the ind_id that has the lowest number.

 

 

 

        new_ind_id                     

        -- -- --                        

 

 

 

 

        -- -- --                       

        ind_id_low                     

 

 

 

 

        -- -- --                       

        build_id                       

 

 

 

 

        -- -- --                        

        create_time                    

 

 

 

 

    -- -- --                           

    IUPACna                            

This is a table that assists simple computation of IUPack bases.

 

 

 

 

 

 

    -- -- --                           

    mmContigExon_35_1                  

Started mapping of mouse to NCBI build 35.1.

 

 

 

 

 

 

    RsMergeArch                        

    RsMergeArch                        

 

        build_id                       

        build_id                       

smallint ranges from -32,768 - 32,767. Enough for our build_id. A quick way to save space.

          Datatype:int                 

          Datatype:smallint            

 

 

 

 

        -- -- --                       

        rsCurrent                      

When an rs merges away several times, we need to trace rsHigh->rsLow several time to get the current rs. For convenience, rsCurrent and orien2Current are added. If an rs only merged once, then rsLow is the same as rsCurrent.

        -- -- --                       

        orien2Current                  

see above.

 

 

 

    SNP                                

    SNP                                

 

        avg_heterozygosity             

        avg_heterozygosity             

 

          Datatype:float               

          Datatype:real                

"real" offers plenty precision for the field and saves storage space for this table.

 

 

 

        het_se                         

        het_se                         

 

          Datatype:float               

          Datatype:real                 

see above.

 

 

 

        avg_max_heterozygosity         

        -- -- --                       

Remove not used field to save space.

        snp_class                      

        -- -- --                       

snp_class is the class for the variation type. This is replace by field univar_id which points to the unified variation string for the snp in UniVariation table. Use UniVariation.subsnp_class for this information instead of the old snp_class.

        -- -- --                        

        exemplar_subsnp_id             

For a refSNP exemplar ss, we use to use view SNPSeqPick. When querying large sets of data,  using SNPSeqPick is inefficient and slow. Add this new field simplifies many queries that need to get exemplar ss data for an rs, for example the query to get rs fasta.

        -- -- --                       

        univar_id                      

 This univar_id points to the unified variation string for the snp in UniVariation table. This makes it easy to get all the alleles for an rs.

        -- -- --                       

        cnt_subsnp                     

A refSNP is a cluster of subsnp(ss). Cnt_subsnp is an important property of rs. Although it can be derived from SNPSubSNPLink table, it is convenient to save it here. This is a tradeoff between space and speed.

 

 

 

    SNP_mRNA_pos                       

    -- -- --                           

This table used to store a snp position on an mrna_acc. The position is derived from ContigExon. But due to gaps and alignment exceptions, the position in this table has not been precise.  This table is only used to link to NCBI SeqViewer when a snp is on mRNA. It meant to center the SeqViewer on the snp. But since mRNA sequence is usually short enough to page through, we rather simply bring the mRNA SeqViewer on the first position than keep an inprecise position. We may store this information again in the future when better ways to compute is found.

 

 

 

    SNPAllele                          

    -- -- --                           

This table is unnecessary now that we have univar_id in SNP and we have UniVariAllele table which lists all alleles for each unified variation. For ex. For '(CT)3/4/5/6', UniVariAllele will have allele_id for (CT)3, (CT)4, (CT)5 and (CT)6.

 

 

 

    -- -- --                           

    SNPAncestralAllele                 

We used to have ancestral_allele_flag in SNPAllele table. Since SNPAllele table can be dropped, we need this table to save the ancestral allele data.

 

 

 

    SNPContigLoc                       

    SNPContigLoc                       

 

        allele                         

        allele                          

 

          Datatype:varchar(255)        

          Datatype:varchar(1200)       

New mapping program in build 35 can find alignment with bigger gaps, so allele length needs to be increased. We see that most of the long alleles contain repeats. In the future we may try to condense the allele using repeat string and number the way we do with STR in variation string.

 

 

 

    -- -- --                           

    SNPContigLoc_b34_3                 

This table has exactly the same structure as SNPContigLoc, which has dbSNP build 123 rs hits on NCBI build 35 and other alternative assemblies. But this table SNPContigLoc_b34_3 has dbSNP build 123 rs hits on NCBI build 34.3 to assist users who still use the 34.3 genome maps.

 

 

 

 

 

 

    SNPFlankStatus                     

        -- -- --                       

Starting build 123, mapping programs use megablast internal repeat masker. Before build 123, dbSNP repeat masks flank sequences and save the masking status in SNPFlankStatus.   So this table is no longer needed.

 

 

 

 

 

 

 

 

 

    SNPHWProb                          

    SNPHWProb                          

 

        ind_cnt                         

        ind_cnt                        

 

          Datatype:int                 

          Datatype:smallint            

smallint ranges from -32,768 - 32,767. Enough for this field. A quick way to save space.

 

 

 

    -- -- --                            

    SNPIndGtyFlag                      

Since a snp is a cluster of ss and an ind_id is a cluster of submitted_ind_id. When we have individual genotype for several ss and several submitted_ind_id that cluster to one rs and one ind_id, individual genotype should be same. But we have found out many records with different genotype data.  This table flags this interesting fact.

 

 

 

 

 

 

    SNPSubSNPLink                      

    SNPSubSNPLink                       

 

        comment                        

        comment                        

 

          Null Option:NOT NULL         

          Null Option:NULL             

Comment field is raraly if ever used but occupies large space. So we nullified it but kept the field for compatible with old sql codes. In the future, we may drop this completely.

 

 

 

    SubInd                             

    SubInd                             

 

        batch_id                       

        batch_id                       

 

          Datatype:int                 

          Datatype:smallint            

smallint ranges from -32,768 - 32,767. Enough for this field. A quick way to save space.

 

 

 

        submitted_ind_id               

        submitted_ind_id               

 

          Datatype:int                 

          Datatype:smallint            

smallint ranges from -32,768 - 32,767. Enough for this field. A quick way to save space.

 

 

 

 

 

 

    SubSNP                             

    SubSNP                             

We added three derived fields to SubSNP table to speed up data dump by organism, chromosome, rs and then within each rs cluster, property of each ss.

        -- -- --                       

        snp_id                         

 

        -- -- --                       

        tax_id                         

 

        -- -- --                       

        chr_id                          

 

 

 

 

    SubSNPHWProb                       

    -- -- --                           

Hardy-Weinberg probability (HWP) are computed at each rs level (SNPHWProb) and at each submitted ss and population level( FreqSummaryBySsPop).  This table is therefore dropped.

 

 

 

    SubSNPRepMask                      

    -- -- --                           

We dropped this table for the same reason in SNPFlankStatus note.  Starting from dbSNP build 123, mapping uses megablast internal repeat mask, so we do not track the mask status of submitted snp seqeunce with dbSNP.

 

 

 

    -- -- --                           

    UniVariAllele                      

For each unified variation, this table lists all alleles within the variations. Please see SNP.univar_id note for more details.

 

 

 

    UniVariation                       

    UniVariation                       

This table keeps all unique variation.

        allele_cnt                     

        allele_cnt                     

 

          Datatype:int                 

          Datatype:smallint            

smallint is enough. No need for int.

 

 

 

        -- -- --                        

        src_code                       

This code is a foreign key to UniVariationSrcCode. It tracks the source the first time a univariation is added. For ex. The variation string is added for a reverse form of a variation.

 

 

 

        -- -- --                       

        rev_univar_id                  

Added this field so it is easy to find the reverse form of a univariation.

 

 

 

    -- -- --                           

    UniVariationSrcCode                 

Please see UniVariation.src_code and the table data for detail.

 

 

 

    -- -- --                           

    VariComplexAllele                  

This is an internal table to handle complex variation forms.

 

Build 122 Schema Change

Date last updated: August 18, 2004

 

Build 121

Build 122

Note

    -- -- --                           

    ChiSqPValueLookUp                   

This table replaces ChiSqPValue. This is used to get the p-value from ChiSq when computing H-W Prob.

 

 

 

    dn_Allele_rev                      

    dn_Allele_rev                      

This table is used internally for processing.

        rev_flag                       

        rev_flag                       

 

          Datatype:bit                 

          Datatype:tinyint             

 

 

 

 

    -- -- --                            

    dn_Gty_rev                         

Used for internal processing.

 

 

 

    dn_snpFxnCnt                       

    dn_snpFxnCnt                       

This is a summary table showing count of snp in each functional class(intron, coding etc).

        -- -- --                       

        tax_id                         

Added tax_id to this table since we start to have genomes for more organisms.

 

 

 

    dn_table_rowcount                  

    dn_table_rowcount                  

This table lists the row count of each table. When users re-create their own dbSNP, they can use this table to check if they data were loaded OK.

        org_prefix                     

        -- -- --                        

org_prefix and tax_id is not used yet. Currently, the table rowcount just the rowcount for a table. In the future when we have organism division databases, we may have tables shared by multiple organism. tax_id field will be used to show row count per tax_id.

        -- -- --                       

        tax_id                         

 

 

 

 

    FreqSummaryBySsPop                 

    FreqSummaryBySsPop                 

Column type changes for this table. Float to real to save storage space.

        source                         

        source                         

 

          Datatype:char(2)             

          Datatype:varchar(1)          

We compute H-W prob from either individual genotype data(I), or just genotype frequency (P). So we only need varcchar(1).

 

 

 

        chisq                          

        chisq                          

 

          Datatype:float               

          Datatype:real                

 

          Null Option:NOT NULL         

          Null Option:NULL             

 

 

 

 

        df                             

        df                             

 

          Null Option:NOT NULL         

          Null Option:NULL             

 

 

 

 

        hwp                            

        hwp                            

 

          Datatype:float               

          Datatype:real                

 

          Null Option:NOT NULL         

          Null Option:NULL             

 

 

 

 

    -- -- --                           

    ggaContigExon                      

The new set of chicken mapping table are added. They all have prefix 'gga'.  Organism prefix is defined in OrganismTax.prefix column.

 

 

 

    -- -- --                           

    ggaContigInfo                      

 

 

 

    -- -- --                           

    ggaMapLink                         

 

 

 

    -- -- --                           

    ggaMapLinkPid                      

 

 

 

    -- -- --                           

    ggaMapList                         

 

 

 

 

    -- -- --                           

    ggaSNPContigLoc                    

 

 

 

    -- -- --                           

    ggaSNPContigLocusId                

 

 

 

    -- -- --                           

    ggaSNPMapInfo                      

 

 

 

    -- -- --                           

    IndGrpCode                         

Added this new table to group individuals. See the data in IndGrpCode for a list of groups. Currently, the grouping only applies to human.

 

 

 

    IndivBySource                       

    IndivBySource                      

        -- -- --                       

        src_ind_grp                    

This column stores grouping supplied by sample source ,such as Coriell.

 

 

 

    Individual                          

    Individual                         

 

        -- -- --                       

        ind_grp                        

This column stores the grouping code assigned by dbSNP.

 

 

 

    IndivSourceCode                     

    IndivSourceCode                    

        -- -- --                       

        src_type                       

Added this column to distinguish various sources. For example, Coriell's src_type is "repository" and CEPH is "curator".

 

 

 

        -- -- --                       

        display_order                  

One individual often has several id from different source. This value will be used to order the display.

 

 

 

    -- -- --                            

    IndMergeArch                       

As we get more individual sample information, we may see that previously two separated individuals are actually the same. This table keeps individual merging history. This is not used yet.

 

 

 

 

 

 

    miscMapLink                        

    miscMapLink                        

        -- -- --                       

        i_tax_id                       

For organisms with no NCBI genome assembly, the hit to GenBank accessions are stored in this table. Since there're multiple organisms, tax_id was added. But since tax_id is in SNP table, we could get tax_id by joining SNP and this table. So this column may be unnecessary and be taken out in future schema modifications.

 

 

 

    Publication                        

    Publication                        

 

        asn                            

        -- -- --                       

This field was not used.

 

 

 

    SNPHWProb                           

    SNPHWProb                          

This table stays basically the same except the column name listed below. The column name is not consistent with table FreqSummaryBySsPop which has H-W prob for each ss over an population.

        degreeFreedom                  

        df                             

 

 

 

 

        chi_sq                         

        chisq                          

 

 

 

 

        hw_prob                        

        hwp                             

 

 

 

 

        -- -- --                       

        ind_cnt                        

Added column ind_cnt for individual count.

 

 

 

        source                         

        -- -- --                        

This field is removed because it is not as meaningful for an refSNP(rs) as the source in FreqSummaryBySsPop: some member ss may have individual genotype while other may only have genotype frequency.

 

 

 

    SubmittedIndividual                 

    SubmittedIndividual                

        -- -- --                       

        loc_ind_grp                    

loc_ind_grp will store the individual group name submitted by submitter.

 

 

 

    -- -- --                            

    UniVar_Valid_Allele                

This table is used internally.

 

 

Build 121 Schema Change

Date last updated: June 2, 2004

 

To keep dbSNP in manageable size, we have partitioned four large tables by ss# and moved the data with ss# less than 20 million to a supplemental database called dbSNP_sup.

The four large tables that have been partitioned are: SubSNPSeq3, SubSNPSeq5, SubSNPCommLine and SubSNPAcc.

 

For ss# between 1 and 10,000,000, it is stored in table <tablename>_p1, where <tablename> represents the above four tables.

For ss# between 10,000,001 and 20,000,000, it is stored in table <tablename>_p2

Currently, we do not have ss# over 30,000,000, when we do, we will create <tablename>_p3 for ss# between 20,000,001 and 30,000,000,

 

 For example:

SubSNPSeq3_p1 has 3’ sequence for all ss# between 1 and 10,000,000.

SubSNPSeq3_p2 has 3’ sequence for all ss# between 10,000,001 and 20,000,000.

 

SubSNPSeq3_p1 and SubSNPSeq3_p1 still have the same column definitions as the original SubSNPSeq3.

 

In dbSNP, for ss# over 20,000,000, we still store them in dbSNP, the main database.  We renamed these four tables by adding a suffix “_ins” to note that we will be add new data into these tables as new submissions come in.   The original names for these four tables are now partitioned views across dbSNP and dbSNP_sup having all ss#.  For more details on partitioned view,

 

This change will be reflected in ftp bcp files. Using SubSNPSeq3 as an example, instead of having a huge file for SubSNPSeq3 prior to this buid, we now have bcp file for SubSNPSeq_p1, SubSNPSeq_p2 and SubSNPSeq_ins.  Between builds, sequence for old snp seldom change (they only change when submitter updates them which is rare). We will have separate small update file for SubSNPSeq3_p1 and SubSNPSeq3_p2.  So for each new build, users only need to get the new bcp file from SubSNPSeq3_ins.

 

 

Build 120 Schema Change

Date last updated: April 24, 2004

Note:  ER diagram and Data dictionary for build 120 will be updated on soon.

 

 

Build 119

Build 120

Note

    agContigInfo                       

    agContigInfo                       

Please see change note in "Build 117 Schema Change - ContigInfo" in this document.

        -- -- --                       

        group_term                     

 

        -- -- --                       

        group_label                    

 

        -- -- --                       

        contig_label                   

 

 

 

 

    agSNPMapInfo                       

    agSNPMapInfo               

Please see change note in "Build 117 Schema Change-SNPMapInfo" in this document.

        -- -- --                        

        hap_cnt                        

 

        -- -- --                       

        unplaced_cnt                   

 

        -- -- --                       

        ref_cnt                        

 

        -- -- --                        

        assembly_cnt                   

 

 

 

 

    -- -- --                           

    dn_table_rowcount

Added this table to simply store the row count for each table. So users who create local dbSNP can compare the counts.

 

 

 

    -- -- --                           

    dn_validationSummary

An internal validation table.

 

 

 

    Individual                         

    Individual                  

SubmittedIndividual has tax_id due to annotation process. It is possible several submitted individuals are from the same sample. This curated sample data is in Individual table. So to normalize data,  tax_id should be in Individual table. Currently, we have tax_id in both SubmittedIndividual and Individual tables. Eventually, we will only have tax_id in Individual table.

        -- -- --                       

        tax_id                         

 

 

 

 

SNPHapMapPriority

    SNPHapMapPriority

 

        priority_code                  

        priority_code                  

 

Datatype:tinyint             

          Datatype:smallint            

HapMapClassCode has 9 codes, so priority_code needs to be a smallint.

 

Build 119 Schema Change

Date last updated: Jan. 30, 2004

Note:  ER diagram and Data dictionary for build 119 will be updated on soon.

Please see most recent data dictionary at: ftp://ftp.ncbi.nih.gov/snp/mssql/schema/DataDictionary_b118.html

 

Build 118

Build 119

Note

 

 

 

    -- -- --                           

    dgContigInfo                       

Added 'dg' prefixed tables for dog snp mapping data.

    -- -- --                           

    dgSNPContigLoc                     

 

    -- -- --                           

    dgSNPContigLocusId                 

 

    -- -- --                           

    dgSNPMapInfo                        

 

 

 

 

    -- -- --                           

    dn_chr_locusList                   

This table is a de-normalized table, storing list of locus_id for each contig chromosomes. This is used in individual genotype summary page.

 

 

 

    -- -- --                           

    dn_gty_rsCnt_byChr                 

This table is a de-normalized table, storing genotyped snp count for each individual on chromosomes.

 

 

 

    -- -- --                            

    dn_gty_rsCnt_byLocus               

This table is a de-normalized table, storing genotyped snp count for each individual on a locus.

 

 

 

    MethodClass                        

    MethodClass                         

 

 

 

 

        -- -- --                       

        validation_status              

Added validation_status. Value 0 means the method is computational. 1 - method can be considered validating when multiple ss  are in the same rs cluster.

 

 

 

 

 

 

    SNPContigLocusId                   

    SNPContigLocusId                   

Added mrna_ver and protein_ver. Contig_acc, mrna_acc and protein_acc fields are changed to varcahr(15) to be consistent with accession fields in other tables such as SNPContigLoc.  The following int field are changed to tinyint: contig_ver, fxn_class, reading_frame to save space.

 

 

 

    SubmittedIndividual                

    SubmittedIndividual                

 

        -- -- --                       

        loc_ind_id                     

Submitter individual id can be case sensitive. For ex. Mouse strain name use case in special ways. This field is needed to preserve case information.

 

Build 118 Schema Change

Date last updated: Nov. 14, 2003

Please see data dictionary at: ftp://ftp.ncbi.nih.gov/snp/mssql/schema/DataDictionary_b118.html

 

Build 117

Build 118

Note

 

 

 

    -- -- --                           

    dgMapLink                          

Added mapping data for dog.

 

 

 

    -- -- --                           

    dgMapLinkPid                       

Added mapping data for dog.

 

 

 

    dn_batchCount                      

    dn_batchCount                      

 

        build_id                        

        -- -- --                       

We will keep snp counts in each batch for current batch only. So there is no need for this column.

 

 

 

    dn_handleCount                     

    dn_handleCount                     

 

        build_id                       

        -- -- --                       

We will keep snp counts in each batch for current batch only. So there is no need for this column.

 

 

 

    dn_IND_batchCount                  

    dn_IND_batchCount                  

 

        build_id                       

        -- -- --                       

We will keep snp counts in each batch for current batch only. So there is no need for this column.

 

 

 

    SubSNPGenotypeSum                  

    FreqSummaryBySsPop                 

This table is renamed to be consistent with the two tables used to compute this summary: GtyFreqBySsPop and GtyFreqBySsPop. The new name clearly shows that the data ins for each SubSNP within a specific population.

        source                         

        source                         

 

          Datatype:char(1)             

          Datatype:char(2)             

Use char(2): GF - source data is genotype frequency, IG - source data is individual genotype, AF - source data is allele frequency.

        non_founder_cnt                

        non_founder_ind_cnt            

Renamed so it is clear that the number in this column count of individuals, not chromosomes.

        -- -- --                       

        het                            

heterozygosity

        -- -- --                       

        het_se                         

standard error for the above heterozygosity.

 

 

 

    mmContigInfo                       

    mmContigInfo                       

The changes in this table is the same in ContigInfo. Please see change description for ContigInfo in build 117.

        contig_class                    

        -- -- --                       

 

        -- -- --                       

        group_term                     

 

        -- -- --                       

        group_label                    

 

        -- -- --                       

        contig_label                   

 

 

 

 

    mmSNPContigLocusId                 

    mmSNPContigLocusId                 

 

        -- -- --                       

        mrna_ver                        

Added this to keep track the version of the mrna accession, like the contig_ver.

        -- -- --                       

        protein_ver                    

Added this to keep track the version of a protein accession, like the contig_ver.

 

 

 

    mmSNPMapInfo                       

    mmSNPMapInfo                       

The changes in this table are the same in ContigInfo. Please see change description for ContigInfo in build 117.

        -- -- --                        

        ref_cnt                        

 

        -- -- --                       

        assembly_cnt                   

 

 

 

 

    -- -- --                           

    SNPAlleleFreq                      

This new table keeps the average allele frequency for refSNP.

 

 

 

    -- -- --                           

    SNPGtyFreq                         

This new table keeps the average genotype freqeuncy for refSNP.

 

 

 

    SubmittedIndividual                

    SubmittedIndividual                

 

        -- -- --                       

        loc_ind_alias                  

Added this column to track the different id for a submitted individual for genotype data.

 

 

 

    t_vwSNP_avgFreq                    

    -- -- --                           

This is replaced with SNPAlleleFreq.

 

 

 

    t_vwSNP_MinorAlleleCnt2More        

    -- -- --                           

This  table is no longer needed.

 

 

 

    t_vwSubPopGty_alleleFreq           

    -- -- --                           

This table is replaced with GtyFreqBySsPop.

 

 

 

 t_vwSubSNP_validationByMinorAlleleCnt 

    -- -- --                            

This table is no longer needed.

 

 

 

 

Build 117 Schema Change

 

Date last updated: Oct. 16, 2003

Note:  Data dictionary and ER diagram for build 117 will be available on Oct.17, 2003.

 

Build 116

Build 117

Note

 

 

 

    Allele                             

    Allele                             

Removed 3 columns in Allele table because an allele may have many motifs (ex. (GAT)10AAT(GAT)5). So keep allele motif relationship in enw table AlleleMotif.

        motif_id                       

        -- -- --                       

 

        repeat_cnt                      

        -- -- --                       

 

        rev_allele                     

        -- -- --                       

 

        -- -- --                       

        last_updated_time              

 

 

 

 

    -- -- --                           

 

Added this table to save computed allele frequency per subsnp and population. The frequency computed from individual genotype takes the highest precedence which then takes precedence over the allele frequency computed from genotype frequency. When none of the above data is available, we use the submitted allele frequency.

 

 

 

    -- -- --                           

    AlleleMotif                        

Added this table to keep the relationship between an allele and all its motif.

 

 

 

    ContigInfo           

ContigInfo

contig_class was used to track the haplotype data of humans and the strain data of mouse for each of the various contigs. But it became clear that different ncbi assembly tasks required different ways of grouping the alternatives so the single category contig_class was broken out into three fields. For dbSNP, the group_term gives the best breakdown of the various alternative forms, but contig_label and/or group_label give a better idea of the common name associated with haplotype, strain, alternate_assembly, etc.

        contig_class                   

        -- -- --                       

 

        -- -- --                        

        group_term                     

id used to group contigs together for annotation purposes. This allows
                annotaters to group contigs by assembly (see note below for PAR
                special case).
                Possible values:
                        ref_[strain/haplotype] (haplotype human, strain others)
                        alt_[strain/haplotype]
                        ref_PAR
                        alt_assembly
                the ref_ and alt_ prefixes assume that these are contigs
                of type NT/NW and the NCBI-assembly team will be loading these to ID.
                ref_PAR: The pseudoautosomal region requires special processing.
                The Y contigs of the PAR region will have this group_term to
                make sure the NCBI-assembly team don't penalize features in this
  region for going to two chromosomes. These contigs will be joined back into the
                reference assembly upon loading to MapViewer
                alt_assembly: this designation is used when the NCBI-assembly
  team have an extra assembly
                for a chromosome and the NCBI-assembly team don't want to load
  the annotation/contigs from this chromosome to ID.

        -- -- --                       

        group_label                    

Label used to label an assembly in the MapViewer.
                On a non-sequence based map, this term can serve to group
                objects into a common coordinate system.
                Example values:
                        reference
                        C57BL/6J
                        129_substrain

        -- -- --                       

        contig_label                   

Label used to label a particular contig in an assembly. For
                example, in mouse, the NCBI-assembly team have grouped all contigs that are of
                type 129/??? for annotation purposes. When the NCBI-assembly team display a
                particular contig, it would be useful to display the
                actual strain name (129/Sv, 129/Ola, etc).
                Example values:
                        C57BL/6J
                        129/SvJ
                        129/SvEvTac

 

 

 

    -- -- --                            

    GtyFreqBySsPop                  

Added this table to save computed genotype frequency per subsnp and population. The frequency computed from individual genotype takes the precedence over submitted genotype frequency.

 

 

 

    ObsVariation                       

    ObsVariation                       

 

        -- -- --                       

        var_flag                       

refers to VarFlagCode.code

 

 

 

    PopClassUniqCount                   

    -- -- --                           

 Not used.

 

 

 

    -- -- --                           

    rnContigExon                       

Added mapping information for rat, with prefix 'rn' added to mapping tables. This note also applies to the following 7 tables.

 

 

 

    -- -- --                           

    rnContigInfo                       

 

 

 

 

    -- -- --                           

    rnMapLink                          

 

 

 

 

    -- -- --                           

    rnMapLinkPid                       

 

 

 

 

    -- -- --                           

    rnMapList                          

 

 

 

 

    -- -- --                            

    rnSNPContigLoc                     

 

 

 

 

    -- -- --                           

    rnSNPContigLocusId                 

 

 

 

 

    -- -- --                           

    rnSNPMapInfo                        

 

 

 

 

    SnpFunctionCode                    

    SnpFunctionCode                    

 

        -- -- --                       

        is_coding                      

Added this column for internal query.

 

 

 

    SNPMapInfo                         

    SNPMapInfo                         

 

        -- -- --                       

        ref_cnt                        

set greater than zero if the snp maps onto the reference strand.

        -- -- --                       

        assembly_cnt                   

Counts the number of alternate assemblies (excluding reference) which the snp hits.

 

 

 

    sts_markers                        

    -- -- --                            

 

 

 

 

    SubPop                             

    SubPop                             

Change some float column to real to save storage space.

 

 

 

    -- -- --                           

    ValidGtyByVariation                 

Internal use currently.

 

 

 

    -- -- --                           

    VarFlagCode                        

Some submitted variation pattern is problematic. Define codes to track problems.

 

Build 116 Schema Change

 

Date: Aug 7, 2003

 

The following details all schema changes between build 116 and 115, sorted by table names.

Table names are in bold. Only columns are indented and only columns with changes are shown under the table. See data dictionary for full table details.

  -- -- --‘ means no corresponding objects ( table or column) for the build.

For example, in build 116, pop_cnt is added to table dn_batchCount.

 

Please see ER diagram for more details: ftp://ftp.ncbi.nih.gov/snp/mssql/schema/erd_dbSNP.pdf

Please see data dictionary at: ftp://ftp.ncbi.nih.gov/snp/mssql/schema/DataDictionary_b116.html

 

 

 

Build 116

Build 115

Note

    dn_batchCount                      

    dn_batchCount                      

Count to speed up query.

        pop_cnt                         

        -- -- --                       

 For frequency and genotype batch, added count of population and count of individual.

        ind_cnt                        

        -- -- --                       

 

 

 

 

    dn_handleCount                     

    dn_handleCount                     

        rs_cnt                         

        rs_cnt                         

Make the column nullable, so 'NULL' means not computed yet, while '0' means the result.

          Null Option:NULL             

          Null Option:NOT NULL         

        rs_validated_cnt               

        rs_validated_cnt               

          Null Option:NULL             

          Null Option:NOT NULL          

 

 

 

   dn_IND_batchCount

   -- -- --

This count table helps speed up "New Batches" web query.

 

 

 

    -- -- --                           

    dn_new_batch                   

dn_new_xxx_batch tables are no longer needed. It was used internally to support query most recent batches. Now web page uses Batch.submit_time to decide submission order.

    -- -- --                           

    dn_new_ind_batch                   

 

 

 

 

    -- -- --                           

    dn_new_nov_batch                   

 

 

 

    -- -- --                           

    dn_new_pop_batch                   

 

 

 

    -- -- --                           

    dn_new_snp_batch                   

 

 

 

    GtyAllele                          

    GtyAllele                          

The new table makes it easy to compute allele frequency from genotype submission on forward or reverse strand.

        -- -- --                       

        allele_id                      

 

        rev_flag                       

        -- -- --                       

 

        fwd_allele_id                  

        -- -- --                        

 

        unigty_id                      

        -- -- --                       

 

        create_time                    

        create_time                    

 

          Null Option:NOT NULL         

          Null Option:                  

 

          Default:-- -- --             

          Default:DefCurTime           

        last_updated_time              

        -- -- --                       

 

 

 

 

    IndivAlias                         

    -- -- --                           

Genotype submission includes individual id. Different submitters sometimes use different individual id for the same DNA sample.  Ex. Mouse strain 'AKR' or 'AKR/J' are the same.

 

 

 

    IndivBySource                       

    IndivBySource                      

Minor database change

        src_id                         

        src_id                         

 

          Datatype:int                 

          Datatype:numeric(3)           

 

 

 

    IndivSourceCode                    

    IndivSourceCode                    

Minor database change

        code                           

        code                           

 

          Datatype:int                  

          Datatype:numeric(3)          

 

 

 

    LoadHistory                        

    LoadHistory                        

        -- -- --                       

        create_time                    

Each build releases different component in slight different time.  Now each individual component release time is stored here.

        ftp_done_date                  

        -- -- --                       

 

        entrez_done_date               

        -- -- --                       

 

        blastdb_done_date              

        -- -- --                       

 

        web_date                       

        -- -- --                       

 

 

 

 

    mmSNPMapInfo                        

    mmSNPMapInfo                       

See data dictionary.

        hap_cnt                        

        -- -- --                       

 

        unplaced_cnt                   

        -- -- --                       

 

 

 

 

    ObsGenotype                        

    ObsGenotype                        

See data dictionary.

        -- -- --                       

        submitted_strand_code          

        -- -- --                       

        obs_ss_fwd                     

        -- -- --                       

        create_time                    

 

        -- -- --                       

        unigty_id                      

 

        -- -- --                        

        allele_id_1                    

 

        -- -- --                       

        allele_id_2                    

 

 

 

 

    ObsVariation                       

    ObsVariation                       

subsnp_clas and iupack_code are moved to table UniVariation.

        -- -- --                       

        subsnp_class                   

        univar_id                      

        univar_id                      

 

          Null Option:NULL              

          Null Option:NOT NULL         

        -- -- --                       

        iupack_code                    

 

 

 

    OrganismTax                        

    OrganismTax                        

Added division_cd in preparation to split dbSNP database into several divisions. Divisions are loosely based on GenBank divisions.

        division_cd                    

        -- -- --                       

 

    Population                          

    Population                         

When curating submitted individuals, it helps to know the source of the population samples.  So src_id is added. For ex. If we know a population's source is NIHPDR, then we know '1' or 'P001' or 'PD001' is the same sample in NIHPDR panel. We currently manually set this field based on population description.

        src_id                         

        -- -- --                       

 

 

 

 

   SNPContigLocusId

   SNPContigLocusId

 

      fxn_class

      fxn_class

data type changed from int to tinyint to save storage.

      reading_frame

      reading_frame

same as above.

      contig_ver

      contig_ver

same as above.

 

 

 

    SNP                                 

    SNP                                

 

        -- -- --                       

        snp_type                       

snp_type was used to record snp withdraw reason code defined in SnpTypes. We currently do not keep withdrawn snp in SNP table. Withdrawn snp is in SNPHistory.

        tax_id                         

        -- -- --                       

tax_id, snp_class and validation_status was orignianlly in SNP table. From build 111 to build 115, they were moved to t_vwSNP to faciliate database processing. But during database size increase in build 116, views proved too slow to merit the separating these three fields. So starting build 116, these three fields are back in SNP table.

        snp_class                       

        -- -- --                       

 

        validation_status              

        -- -- --                       

 

 

 

 

 

 

 

    SnpFunctionCode                    

    SnpFunctionCode                    

        top_level_class                

        -- -- --                       

Added this field to group the functional class code into "exon" and "other".

 

 

 

    SNPHWProb                          

    SNPHWProb                          

        chi_sq                         

        chi_sq                         

float' takes 8 byte storage. 'Real' takes 4 byte. In dbSNP, 'real' offers enough precision.

          Datatype:real                

          Datatype:float                

 

        hw_prob                        

        hw_prob                        

 

          Datatype:real                

          Datatype:float               

 

 

 

 

    SNPMapInfo                         

    SNPMapInfo                         

        hap_cnt                        

        -- -- --                       

 

        unplaced_cnt                   

        -- -- --                       

 

 

 

 

 

 

 

    SnpValidationCode                  

    SnpValidationCode                  

        abbrev                         

        abbrev                         

 

          Datatype:varchar(40)         

          Datatype:varchar(20)         

With the addition of double hit validation type, the abbreviation string gets longer.

        create_time                    

        -- -- --                       

 

        last_updated_time              

        -- -- --                        

 

 

 

 

        -- -- --                       

   SNPValBy_Illumina_Apr1003                                                                                   

The snp list is this table is now in SNPVal and used to set the validation_status for the snp.

 

 

 

        -- -- --                       

   SNPValBy_PERLEGEN_OCT0802

The snp list is this table is now in SNPVal and used to set the validation_status for the snp.

 

 

 

    StrandCode                          

    StrandCode                         

        rs_to_ss_orien                 

        -- -- --                       

See data in this table to see why this new column is needed.

 

 

 

    -- -- --                           

    sts_markers                        

 

 

 

    SubInd                             

    SubInd                             

Streamlined SubInd table by normalizing observed genotype into ObsGenotype table.

        -- -- --                        

        allele_cnt                     

 

        -- -- --                       

        a_bit                          

 

        -- -- --                       

        c_bit                          

 

        -- -- --                        

        g_bit                          

 

        -- -- --                       

        t_bit                          

 

        -- -- --                       

        del_bit                        

 

        -- -- --                       

        oth_bit                        

 

        -- -- --                       

        observed                       

 

        submitted_rs                   

        -- -- --                        

 

 

 

 

    SubIndDeletedBySubmitter           

    -- -- --                           

Added this table for internal tracking.

 

 

 

    SubPop                             

    SubPop                             

        subpop_id                      

        -- -- --                       

Added this column to shorten primary key, especially since we have added more tables referencing SubPop table, such as SubPopGty.

 

 

 

    SubPopAllele                        

    SubPopAllele                       

        subpop_id                      

        -- -- --                       

See above. In the future SubPopAllele will use this column to reference SubPop so batch_id, pop_id,subsnp_id,type will no be needed. These four columns are planed to be dropped in build 117.

 

 

 

    SubPopDeletedBySubmitter           

    SubPopDeletedBySubmitter           

Added this table for internal tracking.

        -- -- --                       

        build_id                       

 

        delete_build_id                

        -- -- --                       

 

 

 

 

    SubPopGty                          

    -- -- --                           

In the past, genotype frequency were stored in SubPopAllele. But submitted genotype frequency has different property from submitted allele frequency. So this table is created to keep just genotype frequency to faciliate computing, for ex, to derive allele frequency based on genotype frequency.

 

 

 

    SubPopHet                          

    -- -- --                           

This table is created, but not populated yet. We plan to move the heterozygosity frequency from SubPopAllele into this table in build 117.

 

 

 

    SubPopRsMerge                      

    -- -- --                           

Submitters may submit frequency on an rs that was already merged to another rs. This table keeps a record of this fact. Since we always stored frequency on the rs exemplar ss, the effect of this merging is that the frequency may be stored under a different ss#.

 

 

 

    SubSNP                             

    SubSNP                             

        -- -- --                       

        hw_prob                        

SubSNPHWProb is used for this values. Less than 5% subsnp has data that make the computation of Hardy-Weinberg probability possible. Storing this value in SubSNP table wastes space since SubSNP has over 10 millions rows.

        -- -- --                       

        subsnp_type                    

This was intended to indicate the withdraw detail of a subsnp. Now SubSNP table does not keep deleted snp. Deleted SubSNP is tracked in SubSNPDeletedBySubmitter.

        top_or_bot_strand              

        -- -- --                       

This is an experimental way of identifying strand.

        validation_status              

        -- -- --                       

Since build 111, we started to compute validation status from a view. But with data growth, it is no longer efficient to compute this value in each query.  Maintaining a table copy of view t_vwSubSNP_validation defeats the beauty of views. So we are returning this field back into SubSNP.

 

 

 

    SubSNPSeqPos                       

    -- -- --                           

In build 116, we received submission of snp with flanking sequence over 200Kb for snp's in repeat regions and the submitter already know the genome location of the snp. In this case, storing all flanking sequence does not provide value since the flanks will be repeat masked out anyway. So we save the accession and the snp position of the accession.

 

 

 

    SubSNPSeqTypeCode                  

    -- -- --                           

Explains the meaning of 'type' field in table SubSNPSeq3 and SubSNPSeq5.

 

 

 

    -- -- --                           

    t_vwSNP                             

 

 

 

    t_vwSubPopGty_alleleFreq           

    -- -- --                           

 

 

 

 

    -- -- --                           

    t_vwSubSNP_validation              

See note for SubSNP.validation_status.

 

 

 

 t_vwSubSNP_validationByMinorAlleleCnt 

                                       

 

 

 

 

    UniGty                             

    -- -- --                           

 

 

 

 

    UniVariation                       

    -- -- --                           

 

 

 

Build 115 Schema Change

 

Date: June 6, 2003

 

-               Dropped tables: dn_new_snp_batch, dn_new_pop_batch, dn_new_ind_batch, dn_new_nov_batch

These are internal table that are no longer needed.

-               Added tax_id to SubmittedIndividual table because some populations group individuals from different organism.

               

 

Build 114 Schema Change

Date:                               April 29, 2003

 

Table

Column

Note

ObsGenotype

 

This table keeps all submitted genotypes and submitted strand code in obs and submitted_strand_code columns.  Obs_upp is the all uppercase version of obs.  Obs_ss_fwd is the genotype on the forward strand of the SubSNP. Ex.  If genotype is submitted as A/C|STRAND_SS_REV, then obs has “A/C”, obs_ss_fwd has “G/T”.

SubSNPGenoSum is renamed to SubSNPGenotypeSum.

  • Freq_obs and genotype_obs are dropped.
  • Hw_prob is renamed to hwp
  • degreeFreedom is renamed to df
  • chi_sq is renamed to chisq

Source has value “I” or “P”:

“I” means the genotype frequency is summarized from individual genotypes.

“P” means the genotype frequency is submitted for a population.

SubSNPHWProb

  • Hw_prob is renamed to hwp
  • degreeFreedom is renamed to df
  • chi_sq is renamed to chisq

Source has same means as in SubSNPGenotypeSum.source

SubPopDeletedBySubmitter

 

Added this table to keep track deleted data, so summary statistics needs to be recomputed for the affected batch and refSNP cluster.

t_vwSNP

-          Dropped rarely used columns.

- Current columns are snp_id, validation_status, snp_class and tax_id, create_time ( is the time this row is inserted ).

 

Dn_new_snp_batch

Dn_new_pop_batch

Dn_new_ind_batch

Dn_new_nov_batch

 

dropped

Dn_new_batch

 

Added to replace dn_new_<batch_type>_batch, where <batch_type> has value of snp,pop,ind or nov.

 

 

Build 112 Schema Change

Date:                               April 2, 2003

 

Table

Column

Note

SubSNP

  • observed is replaced with ObsVariation.pattern. 
  • subsnp_class is replaced with ObsVariation.subsnp_class.
  • Dropped columns allele_cnt,a_bit,c_bit,g_bit,t_bit,del_bit,oth_bit,allele_bits

To get the observed variation and subsnp_class for a give subsnp, use ObsVariation.pattern, join SubSNP and ObsVariation on SubSNP.variation_id=ObsVariation.var_id

Batch

·         Organism is replaced with OrganismTax.organism.

Join Batch and OrganismTax on tax_id to get the organism of a batch.

 

Build 111 Schema Change

 

Date:                               March 23, 2003

 

There are a few major schema changes between build 110 and build 111. 

 

Tables/Columns in Build 111 that are replacing old tables/columns in build 110:

 

Table

Column

Note

t_vwSubSNP_validation

 

- replace SubSNP.validaton_status

 

 

validation_status

- t_vwSubSNP_validation is a direct copy from view vwSubSNP_validation.  Please see vwSubSNP_validation.sql for details.

SNPSeqPick

 

- This is a view. It replaces the old table with the same name and columns: SNPSeqPick.

snp_id,

subsnp_id

- see SNPSeqPick.sql for view details.

t_vwSNP

 

- replace the same columns in SNP.

 

PCR_confirmed_ind

assay_sample_size

cnt_subsnp

indv_homozygosity_obs_ind

num_segregate

popdata_sample_size

snp_class

tax_id

total_sample_size

validation_status

t_vwSNP is direct copy from view vwSNP. Please see vwSNP.sql for details on how each field is computed. 

SubPopAllele

cnt

cnt_int

- cnt is changed to type float. This will help eliminate rounding error in computing average allele frequency etc.

- cnt_int will be deleted after cnt is set to be number of chromosomes for allele frequency and number of individuals for genotype and heterozygosity data.

 

New table and columns:

 

Table

Column

Note

Further change in Build 112

SubSNP

variation_id

variation_id is foreign key to ObsVariation (var_id).

- SubSNP.observed will be replaced with ObsVariation.pattern.

- SubSNP.subsnp_class will replaced with ObsVariation.subsnp_class.

Motif

motif_id

motif,

rev_motif_id

This table keeps all unique motifs in an allele.

For ex. An observed variation is (CA)2/3/4/6.

Allele table will have

four rows: (CA)2, (CA)3, (CA)4, (CA)6. 

Each row

with the same motif_id, referring to one motif

record of  motif (CA).

-  rev_motif_id is id of the reverse motif. For ex. (CA)’s reverse motif is (TG). rev_motif_id is the motif_id for motif (TG).

 

Allele

allele_id,

allele

motif_id

repeat_cnt

- This table keeps all unique occurrence of an observed allele in a variation submission.

 

VariAllele

var_id,

allele_id

- This table associates a variation with 2 or more alleles.

 

ObsVariation

var_id,

pattern,

subsnp_class

- This table keeps all unique submitted observed field from the old SubSNP.observed field. 

 

ObsGenotype

gty_id,

var_id,

allele_id1,

allele_id2

For an observed variation, this table lists all possible genotypes.  For ex. If an observed variation is ‘A/C’, then there could possibly be three genotypes: A/A, A/C, C/C.

 

SubInd

gty_id

foreign key to ObsGenotype(gty_id)

- The following columns in SubInd will be replace with gty_id:

observed,

allele_cnt

a_bit,

c_bit

g_bit

t_bit

del_bit

oth_bit

 SubPopAllele

allele_id

foreign key to Allele(allele_id).

-The following columns in SubPopAllele will be replaced with allele_id:

allele, 

other

t_vwSNP_avgFreq

 

- This is a table directly copied from view vwSNP_avgFreq.

snp_id,

allele,

freq

 

- This information is previously only available on refSNP page.

ChiSqPValue

chi_sq

df

pvalue

This is a lookup table with chi-square probability distribution use by Hardy-weinberg probability computation.

 

 

 

 

Schema Change from builds before 111: