Sequence Locations and Identifiers


Introduction
Seq-id: Identifying Sequences
Seq-id: Semantics of Use
Seq-id: The C Implementation
NCBI ID Database: Imposing Stable Seq-ids
Seq-loc: Locations on a Bioseq
Seq-loc: The C Implementation
ASN.1 Specification: seqloc.asn
C Structures and Functions: objloc.h


Introduction

As described in the Biological Sequences chapter, a Bioseq always has at least one identifier. This means that any valid biological sequence can be referenced by using this identifier. However, not all identifiers are created equal. They may differ in their basic structure (e.g. a GenBank accession number is required to have an uppercase letter followed by exactly five digits, while the NCBI GenInfo Id uses a simple integer identifier). They also differ in how they are used (e.g. the sequence identified by the GenBank accession number may change from release to release, while the sequence identified by the NCBI GenInfo Id will always be exactly the same sequence).

Locations of regions on Bioseqs are always given as integer offsets, also described in the Biological Sequences chapter. So the first residue is always 0 and the last residue is always (length - 1). Further, since all the classes of Bioseqs from bands on a gel to genetic or physical maps to sequenced DNA use the same integer offset convention, locations always have the same form and meaning even when moving between very different types of Bioseq representations. This allows alignment, comparison, and display functions, among others, to have the same uniform interface and semantics, no matter what the underlying Bioseq class. Specialized numbering systems are supported but only as descriptive annotation (see Numbering in Biological Sequences and Feature types "seq" and "num" in Sequence Features). The internal conventions for positions on sequences are always the same.

There are no implicit Bioseq locations. All locations include a sequence identifier. This means Features, Alignments, and Graphs are always independent of context and can always be exchanged, submitted to databases, or stored as independent objects. The main consequence of this is that information ABOUT regions of Bioseqs can be developed and contributed to the public scientific discussion without any special rights of editing the Bioseq itself needing to be granted to anyone but the original author of the Bioseq. Bioseqs in the public databases, then, no longer need an anointed curator (beyond the original author) to be included in ongoing scientific discussion and data exchange by electronic media.

Seq-id: Identifying Sequences

In a pure sense, a Seq-id is meant to unambiguously identify a Bioseq. Unfortunately, different databases have different semantic rules regarding the stability and ambiguity of their best available identifiers. For this reason a Bioseq can have more than one Seq-id, so that the Seq-id with the best semantics for a particular use can be selected from all that are available for that Bioseq, or so that a new Seq-id with different semantics can be conferred on an existing Bioseq. Further, Seq-id is defined as a CHOICE of datatypes which may differ considerably in their structure and semantics from each other. Again, this is because differing sequence databases use different conventions for identifying sequences and it is important not to lose this critical information from the original data source.

One Seq-id type, "gi", has been implemented specifically to make a simple, absolutely stable Seq-id available for sequence data derived from any source. It is discussed in detail below.

A Textseq-id structure is used in many Seq-ids described below. It has four possible fields: a name, an accession number, a release, and a version. Formally, all fields are OPTIONAL, although to be useful, a Textseq-id should have at least a name or an accession or both. This style of Seq-id is used by GenBank, EMBL, DDBJ, PIR, SWISS-PROT, and PRF, but the semantics of its use differ considerably depending on the database. However, none of these databases guarantees the stability of name or accession (i.e. that it points at a specific sequence), so to be unambiguous the id must also have the release of the database in which the sequence with this id appeared. See the discussion under Seq-id: Semantics of Use for details.

Seq-id: Semantics of Use

Different databases use their ids in different ways, and these patterns may change over time. An attempt is made in this section to describe current usage and offer some guidelines for interpreting Seq-ids.

local: Privately Maintained Data

The local Seq-id is an Object-id (see discussion in General Use Objects), which is a CHOICE of a string or an integer. This is to reconcile the requirement that all Bioseqs have a Seq-id and the needs of local software tools to manipulate data produced or maintained privately. This might be pre-publication data, data still being developed, or proprietary data. The Object-id will accommodate either a string or a number as is appropriate for the local environment. It is the responsibility of local software to keep the local Seq-ids unique. A local Seq-id is not globally unique, so when Bioseqs with such identifiers are published or exchanged, context (i.e. the submitter or owner of the id) must be maintained or a new id class must be applied to the Bioseq (e.g. the assignment of a GenBank accession upon direct data submission to GenBank).

other: A Local Textseq-id

The type "other" is a Textseq-id only; it does not carry context (what database is this from?). So it is meant to be used only under conditions similar to those for "local", above, but it allows the name/accession system to be used locally instead of being limited to a single string or integer as "local" is.

general: Ids from Local Databases

The Seq-id type "general" uses a Dbtag (see discussion in General Use Objects), which is an Object-id as in Seq-id.local, above, with an additional string to identify a source database. This means that an integer or string id from a smaller database can create Seq-ids which both cite the database source and make the local Seq-ids globally unique (usually). For example, the EcoSeq database is a collection of E. coli sequences derived from many sources, curated and maintained by Kenn Rudd. Each sequence in EcoSeq has a unique descriptive name which is used as its primary identifier. A "general" Seq-id for the EcoSeq entry "EcoAce" could be written as follows:

Seq-id ::= general {
    db "EcoSeq" ,
    tag str "EcoAce" }
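The same value can be sketched in C. The block below is only an illustration: DemoObjectId, DemoDbtag, and demo_format_general() are hypothetical stand-ins rather than the toolkit's ObjectId and Dbtag types, and the "tag id" form for the integer alternative is an assumption based on the Object-id CHOICE of a string or an integer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; the real toolkit types are ObjectId and Dbtag. */
typedef struct {
    const char *str;   /* string form of the id; NULL when the integer form is used */
    long        id;    /* integer form of the id */
} DemoObjectId;

typedef struct {
    const char  *db;   /* name of the source database, e.g. "EcoSeq" */
    DemoObjectId tag;  /* id of the entry within that database */
} DemoDbtag;

/* Write the ASN.1 value notation of a "general" Seq-id into buf.
   Returns what snprintf returns: the length the full text needs. */
int demo_format_general(char *buf, size_t size, const DemoDbtag *dbt)
{
    assert(buf != NULL && dbt != NULL && dbt->db != NULL);
    assert(strlen(dbt->db) > 0);   /* without a database name the id carries no context */
    if (dbt->tag.str != NULL)
        return snprintf(buf, size, "general { db \"%s\" , tag str \"%s\" }",
                        dbt->db, dbt->tag.str);
    return snprintf(buf, size, "general { db \"%s\" , tag id %ld }",
                    dbt->db, dbt->tag.id);
}
```

Filling a DemoDbtag with db "EcoSeq" and the string tag "EcoAce" reproduces the value shown above on a single line.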

gibbsq, gibbmt: GenInfo Backbone Ids

The journal scanning component of GenBank was originally known as the "GenInfo Backbone" database. This database is built by NCBI in collaboration with Library Operations at the National Library of Medicine (NLM) by building on the journal abstracting work done for MEDLINE. This collaboration means more than 3500 different journals (more than 350,000 articles a year) are scanned for sequence-containing publications, both nucleic acid and protein. New sequence data which cannot be proven to have been already directly submitted to the sequence databases is entered into the GenInfo Backbone. The data is released as part of the normal NCBI sequence database releases.

The Backbone database is a relational database which distinguishes between a simple sequence (equivalent to a virtual or a raw Bioseq) and a complex Bioseq (equivalent to a segmented Bioseq). As a result, every raw or virtual Bioseq produced from the Backbone will have a gibbsq (GenInfo Backbone Seq Id). If that Bioseq is a component of a segmented Bioseq, then the segmented Bioseq will have a gibbmt (GenInfo Backbone Molecule Type Id) but no gibbsq. If the raw or virtual Bioseq is not part of a segmented Bioseq, then it will have both a gibbsq and a gibbmt (the sequence and the molecule are the same).

This may seem confusing, and is, in fact, simply the result of the design of this database. For a user of Bioseqs derived from the GenInfo Backbone, it is enough to know three things. Every Bioseq from the Backbone will have a gibbsq, a gibbmt, or both. The gibbsq and gibbmt are simple integers from two independent series. Either a gibbsq or a gibbmt is sufficient to retrieve an entry, but the gibbsq is preferred if available to reference a specific sequence.

While sequences identified by a gibbsq or gibbmt are in practice very stable, they are not guaranteed to be stable. If a correction must be made to a sequence in the Backbone, its id will not be changed. See "gi" below for a guaranteed stable id. Backbone sequences for nucleic acids are assigned a GenBank accession number by NCBI in addition to their backbone ids.

genbank, embl, ddbj: The International Nucleic Acid Sequence Databases

NCBI (GenBank) in the U.S., the European Molecular Biology Laboratory datalibrary (EMBL) in Europe, and the DNA Database of Japan (DDBJ) in Japan are members of an international collaboration of nucleic acid sequence databases. Each collects data, often directly submitted by authors, and makes releases of its data in its own format independently of the others. However, there are agreements in place for all the parties to exchange information with each other in an attempt to avoid duplication of effort and provide a worldwide comprehensive database to their users. So a release by one of these databases is actually a composite of data derived from all three sources.

All three databases assign a mnemonic name (called a LOCUS name by GenBank and DDBJ, and an entry name by EMBL) which is meant to carry meaning encoded into it. The first few letters indicate the organism and next few a gene product, and so on. There is no concerted attempt to keep an entry name the same from release to release, nor is there any attempt for the same entry to have the same entry name in the three different databases (since they construct entry names using different conventions). While many people are used to referring to entries by name (and thus name is included in a Textseq-id) it is a notoriously unreliable way of identifying a Bioseq and should normally be avoided.

All three databases also assign an Accession Number to each entry. Accession numbers do not convey meaning, other than in a bookkeeping sense. Unlike names, accession numbers are meant to be the same for the same entry, no matter which database one looks in. Thus, accession number is the best id for a Bioseq from this collaboration. Unfortunately, rules for the use of accession numbers have not required that an accession number uniquely identify a sequence. A database may change an accession when it merely changes the annotation on an entry. Conversely, a database may not change an accession even though it has changed the sequence itself. There is no consistency about when such events may occur. There is also no exact method of recording the history of an entry in this collaboration, so such accession number shifts make it possible for outside users of the databases to lose track of entries. With all these caveats, accession numbers are still the best identifiers available within this collaboration.

A database release may be considered a snapshot of the database at a frozen moment of time. So a name or accession AND the database release IS a unique identifier for a Bioseq. For this reason it is provided in the Textseq-id structure. Be warned however, that depending on what data service is being queried, retrieval may not make use of the release information. Finally, EMBL assigns a version number to each entry. For entries derived from EMBL, the combination of accession number and version number is supposed to uniquely identify a sequence.
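The rule above can be written as a small check. This is a minimal sketch, assuming a stand-in structure (DemoTextSeqId) that mirrors the four Textseq-id fields; the structure, the has_version flag, and the function name are hypothetical and are not the toolkit's TextSeqId definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring the four Textseq-id fields. */
typedef struct {
    const char *name;        /* NULL if absent */
    const char *accession;   /* NULL if absent */
    const char *release;     /* NULL if absent */
    int         has_version; /* nonzero if a version was supplied */
    int         version;
} DemoTextSeqId;

/* Returns 1 if the id pins down one specific sequence by the rules in the
   text above, 0 if it only names an entry whose sequence may change. */
int demo_textseqid_is_specific(const DemoTextSeqId *id)
{
    assert(id != NULL);
    int has_label = (id->name != NULL) || (id->accession != NULL);
    if (has_label && id->release != NULL)
        return 1;                 /* name or accession plus database release */
    if (id->accession != NULL && id->has_version)
        return 1;                 /* accession plus version (EMBL-derived entries) */
    return 0;
}
```

An id carrying only an accession returns 0 here, which matches the warning that an accession alone does not guarantee a particular sequence.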

pir: PIR International

The PIR database is also produced through an international collaboration with contributors in the US at the Protein Identification Resource of the National Biomedical Research Foundation (NBRF), in Europe at the Martinsried Institute for Protein Sequences (MIPS), and in Japan at the International Protein Information Database in Japan (JIPID). They also use an entry name and accession number. The PIR accession numbers, however, are not related to the GenBank/EMBL/DDBJ accession numbers in any way and have a very different meaning. In PIR, the entry name identifies the sequence, which is meant to be the "best version" of that protein. The accession numbers are in transition from a meaning more similar to the GenBank/EMBL/DDBJ accessions, to one in which an accession is associated with protein sequences exactly as they appeared in specific publications. Thus, at present, PIR ids may have both an accession and a name. They will move to more typically having either a name or an accession, depending on what is being cited: the "best" sequence or an original published sequence.

swissprot: SWISS-PROT

The SWISS-PROT database was created by Amos Bairoch at the University of Geneva in Switzerland (thus the name) and he continues to direct and develop it in its current collaborative relationship with EMBL. SWISS-PROT is derived from many sources including PIR, the GenInfo Backbone, and the translated coding regions from the GenBank/EMBL/DDBJ nucleic acid databases, among others. SWISS-PROT follows the same name and accession number conventions as GenBank/EMBL/DDBJ. The name is meant to be easily remembered and codes biological information, but is not a stable identifier from release to release. The accession is meant to be a stable identifier from release to release, but conveys only bookkeeping information. Unlike PIR accession numbers, the SWISS-PROT accession numbers are coordinated with those of GenBank/EMBL/DDBJ and do not conflict.

prf: Protein Research Foundation

The Protein Research Foundation in Japan has a large database of protein sequence and peptide fragments derived from the literature. Again, there is a name and an accession number. Since this database is meant only to record the sequence as it appeared in a particular publication, the relationship between the id and the sequence is quite stable in practice.

patent: Citing a Patent

The minimal information needed to unambiguously identify a sequence in a patent is, first, an unambiguous identification of the patent (by the Patent-seq-id.cit, see Bibliographic References for a discussion of Id-pat) and, second, an integer serial number to identify the sequence within the patent. The sequence data for sequence-related patents are now being submitted to the international patent offices in computer readable form, and the serial number for the sequence is assigned by the processing office. However, older sequence-related patents were not assigned serial numbers by the processing patent offices. For those sequences the serial number is assigned arbitrarily (but still uniquely). Note that a sequence with a Patent-seq-id just appeared as part of a patent document. It is NOT necessarily what was patented by the patent document.

pdb: Citing a Biopolymer Chain from a Structure Database

The Protein Data Bank (PDB, also known as the Brookhaven Database) is a collection of data about structures of biological entities such as hemoglobin or cytochrome c. The basic entry in PDB is a structural model of a molecule, not a sequence as in most sequence databases. A molecule may have multiple chains. So a PDB-seq-id has a string for the PDB entry name (called PDB-mol-id here) and a single character for a chain identifier within the molecule. The use of the single character simply mirrors PDB practice. The character may be a digit, a letter, or even a space (ASCII 32). As with the databases using the Textseq-id, the sequence of the chain in PDB associated with this information is not stable, so to be unambiguous the id must also include the release date.

giim: GenInfo Import Id

A Giimport-id is a temporary id used to identify sequences imported into the GenInfo system at NCBI from a variety of sources. Currently this id type is used in the NCBI ASN.1 and Entrez:Sequences releases to provide a uniform id type across sequences from all sources. The giim is not stable from release to release. The use of giim is a temporary measure until long term, stable identifiers such as "gi" below can be assigned (first or second quarter of 1993).

gi: A Stable, Uniform Id Applied to Sequences From All Sources

A Seq-id of type "gi" is a simple integer assigned to a sequence by the NCBI "ID" database. It can be applied to a Bioseq of any representation class, nucleic acid or protein. It uniquely identifies a sequence from a particular source. If the sequence changes at all, then a new "gi" is assigned. The "gi" does not change if only annotations are changed. Thus the "gi" provides a simple, uniform way of identifying a stable coordinate system on a Bioseq provided by data sources which may not themselves have stable ids. This is the identifier of choice for all references to Bioseqs through features or alignments. See discussion below.

Seq-id: The C Implementation

A Seq-id is implemented in C as a ValNode with a typedef SeqIdPtr ValNodePtr. The type of the Seq-id is given in ValNode->choice and a series of #defines are used to indicate the type of the Seq-id. The ValNode->data.intvalue is used for the integer types and ValNode->data.ptrvalue for the other types as in the following table.

Seq-id

Value  #define          ASN.1 name  Type in ValNode->data
0      SEQID_NOT_SET    not-set     not needed
1      SEQID_LOCAL      local       ObjectIdPtr
2      SEQID_GIBBSQ     gibbsq      integer
3      SEQID_GIBBMT     gibbmt      integer
4      SEQID_GIIM       giim        GiimPtr
5      SEQID_GENBANK    genbank     TextSeqIdPtr
6      SEQID_EMBL       embl        TextSeqIdPtr
7      SEQID_PIR        pir         TextSeqIdPtr
8      SEQID_SWISSPROT  swissprot   TextSeqIdPtr
9      SEQID_PATENT     patent      PatentSeqIdPtr
10     SEQID_OTHER      other       TextSeqIdPtr
11     SEQID_GENERAL    general     DbtagPtr
12     SEQID_GI         gi          integer
13     SEQID_DDBJ       ddbj        TextSeqIdPtr
14     SEQID_PRF        prf         TextSeqIdPtr
15     SEQID_PDB        pdb         PDBSeqIdPtr

Since a SeqIdPtr is a ValNodePtr, a special SeqIdNew() is not provided, although the usual SeqIdAsnRead(), SeqIdAsnWrite(), and SeqIdFree() functions are provided. Since SET OF and SEQUENCE OF Seq-id are common, SeqIdSetAsnRead(), SeqIdSetAsnWrite(), and SeqIdSetFree() functions are provided. They assume that the SeqIdPtr passed is the head of a chain of SeqIds connected through the ValNodePtr->next and with the last ValNodePtr->next equal to NULL. SeqIdDup() provides a fast function for duplicating SeqIds.
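The chain layout described above can be illustrated without the toolkit headers. In this sketch DemoValNode is a hypothetical stand-in for ValNode (only the choice, data, and next members discussed in this section are modelled), the DEMO_ constants copy their values from the Seq-id table, and the three helper functions are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the Seq-id table above. */
#define DEMO_SEQID_EMBL 6
#define DEMO_SEQID_GI   12

/* Hypothetical stand-in for the toolkit's ValNode. */
typedef struct DemoValNode {
    unsigned char choice;            /* which Seq-id type this node holds */
    union {
        long  intvalue;              /* used by gibbsq, gibbmt, gi */
        void *ptrvalue;              /* used by all structured id types */
    } data;
    struct DemoValNode *next;        /* NULL on the last node of the chain */
} DemoValNode;

/* Append node to the chain and return the (possibly new) head. */
DemoValNode *demo_seqid_append(DemoValNode *head, DemoValNode *node)
{
    assert(node != NULL && node->next == NULL);
    if (head == NULL)
        return node;
    DemoValNode *last = head;
    while (last->next != NULL)
        last = last->next;
    last->next = node;
    return head;
}

/* Count the Seq-ids in a chain. */
size_t demo_seqid_count(const DemoValNode *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Return the gi found in the chain, or -1 if the chain has none. */
long demo_seqid_find_gi(const DemoValNode *head)
{
    for (; head != NULL; head = head->next)
        if (head->choice == DEMO_SEQID_GI)
            return head->data.intvalue;
    return -1;
}
```

Software that prefers the stable id would walk the chain this way, look for the gi first, and fall back to another id type only when no gi is present.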

A large number of additional functions for manipulating SeqIds are described in the Sequence Utilities chapter.

NCBI ID Database: Imposing Stable Seq-ids

As described in the Data Model chapter, Bioseqs provide a simple integer coordinate system through which a host of different data and analytical results can be easily associated with each other, even with scientists working independently of each other and on heterogeneous systems. For this model to work, however, these integer coordinate systems require stable identifiers. If one scientist notes a coding region from positions 10-50 of sequence "A", then the database adds a single base pair at position 5 of "A" without changing the identifier of "A", then at the next release of the database the scientist's coding region is now frame-shifted one position and invalid. Unfortunately this is currently the case due to the casual use of sequence identifiers by most existing databases.
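The arithmetic behind that example is small but easy to lose when the identifier does not change. The sketch below assumes the scientist somehow learns where the insertion happened, which is exactly the information an unchanged identifier hides; DemoInterval and demo_remap_after_insert() are hypothetical names used only for illustration.

```c
#include <assert.h>

/* A location as a pair of 0-based offsets, from <= to. */
typedef struct {
    long from;
    long to;
} DemoInterval;

/* Remap an interval after `count` residues are inserted in front of the
   residue that used to sit at offset `pos`. */
DemoInterval demo_remap_after_insert(DemoInterval iv, long pos, long count)
{
    assert(iv.from >= 0 && iv.from <= iv.to && pos >= 0 && count >= 0);
    if (pos <= iv.from)
        iv.from += count;   /* insertion lies upstream of the whole interval */
    if (pos <= iv.to)
        iv.to += count;     /* insertion lies upstream of, or inside, the interval */
    return iv;
}
```

For the case in the text, an interval of 10-50 with one base pair inserted at position 5 becomes 11-51. An annotation that still says 10-50 now covers different residues.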

Since NCBI integrates data from many different databases which follow their own directions, we must impose stable ids on an unstable starting material. While a daunting task, it is not, in the main, impossible. We have built a database called "ID", whose sole task is to assign and track stable sequence ids. ID assigns "gi" numbers, simple arbitrary integers which stably identify a particular sequence coordinate system.

The first time ID "sees" a Bioseq, say EMBL accession A00000, it checks to see if it has a Bioseq from EMBL with this accession already. If not, it assigns a new GI, say 5, to the entry and adds it to the Bioseq.id chain (the original EMBL id is not lost). It also replaces all references in the entry (say in the feature table) to EMBL A00000 with references to GI 5. This makes the annotations now apply to a stable coordinate system.

Now EMBL sends an update of the entry which is just a correction to the feature table. The same process occurs, except this time there is a previous entry with the same EMBL accession number. ID retrieves the old entry and compares the sequence of the old entry with the new entry. Since they are identical it reassigns GI 5 to the same entry, converts the new annotations, and stores it as the most current view of that EMBL entry.

Now ID gets another update to A00000, but this time the sequence is different. ID assigns a new GI, say 6, to this entry. It also updates the sequence history (Seq-inst.hist, see the Biological Sequences chapter) of both old and new entries to make a doubly linked list. The GI 5 entry has a pointer that it has been replaced by GI 6, and the GI 6 entry has a pointer showing it replaced GI 5. When NCBI makes a new data release the entry designated GI 6 will be released to represent EMBL entry A00000. However, the ASN.1 form of the data contains an explicit history. A scientist who annotated a coding region on GI 5 can discover that it has been replaced by GI 6. The GI 5 entry can still be retrieved from ID, aligned with GI 6, and the scientist can determine if her annotation is still valid on the new entry. If she annotated using the accession number instead of the GI, of course, she could be out of luck.
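The three cases just described (first sighting, annotation-only update, changed sequence) can be traced with a toy in-memory model. This is only a sketch of the bookkeeping, not the ID database: the DemoIdDb table, its fixed sizes, and all function names are assumptions made for illustration, and sequences are compared as plain strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_ENTRIES 32

typedef struct {
    char accession[16];
    char sequence[64];
    int  gi;
    int  replaces;      /* gi that this entry replaced, 0 if none */
    int  replaced_by;   /* gi that replaced this entry, 0 if none */
} DemoEntry;

typedef struct {
    DemoEntry entries[DEMO_MAX_ENTRIES];
    int       count;
    int       next_gi;
} DemoIdDb;

void demo_id_init(DemoIdDb *db, int first_gi)
{
    assert(db != NULL && first_gi > 0);
    db->count = 0;
    db->next_gi = first_gi;
}

/* Most recent entry stored for this accession, or NULL if never seen. */
static DemoEntry *demo_id_current(DemoIdDb *db, const char *accession)
{
    for (int i = db->count - 1; i >= 0; i--)
        if (strcmp(db->entries[i].accession, accession) == 0)
            return &db->entries[i];
    return NULL;
}

/* Register a new or updated entry; returns its gi, or -1 if the table is full. */
int demo_id_submit(DemoIdDb *db, const char *accession, const char *sequence)
{
    assert(strlen(accession) < sizeof db->entries[0].accession);
    assert(strlen(sequence) < sizeof db->entries[0].sequence);
    DemoEntry *old = demo_id_current(db, accession);
    if (old != NULL && strcmp(old->sequence, sequence) == 0)
        return old->gi;                   /* same sequence: keep the same gi */
    if (db->count >= DEMO_MAX_ENTRIES)
        return -1;
    DemoEntry *e = &db->entries[db->count++];
    snprintf(e->accession, sizeof e->accession, "%s", accession);
    snprintf(e->sequence, sizeof e->sequence, "%s", sequence);
    e->gi = db->next_gi++;
    e->replaces = (old != NULL) ? old->gi : 0;
    e->replaced_by = 0;
    if (old != NULL)
        old->replaced_by = e->gi;         /* completes the two-way history link */
    return e->gi;
}

/* Follow the history forward: the gi that replaced this one, 0 if none. */
int demo_id_replaced_by(const DemoIdDb *db, int gi)
{
    for (int i = 0; i < db->count; i++)
        if (db->entries[i].gi == gi)
            return db->entries[i].replaced_by;
    return 0;
}
```

Starting the counter at 5 and submitting accession A00000 twice with the same sequence and once with a changed sequence yields gi 5, 5, and 6, with gi 5 recorded as replaced by gi 6.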

Since ID is attempting to order a chaotic world, mistakes will inevitably be made. However, it is clear that in the vast majority of cases it is possible to impose stable ids. As scientists and software begin to use the GI ids and reap the benefits of stable ids, the world may gradually become less chaotic. The Seq-inst.hist data structure can even be used by data suppliers to actively maintain an explicit history without ID having to infer it, which would be the ideal case.

Seq-loc: Locations on a Bioseq

A Seq-loc is a location on a Bioseq of any representation class, nucleic acid or protein. All Bioseqs provide a simple integer coordinate system from 0 to (length - 1) and all Seq-locs refer to that coordinate system. All Seq-locs also explicitly identify the Bioseq (coordinate system) to which they apply with a Seq-id. Most objects which are attached to or reference sequences do so through a Seq-loc. Features are blocks of data attached by a Seq-loc. An alignment is just a collection of correlated Seq-locs. A segmented sequence is built from other sequences by reference to Seq-locs.

Seq-locs come in many varieties.

null: A Gap

A null Seq-loc can be used in a Seq-loc with many components to indicate a gap of unknown size. For example it is used in segmented sequences to indicate such gaps between the sequenced pieces.

empty: A Gap in an Alignment

An alignment (see Sequence Alignments) may require that every Seq-loc refer to a Bioseq, even for a gap. The empty type fulfills this need.

whole: A Reference to a Whole Bioseq

This is just a shorthand for the Bioseq from 0 to (length - 1). This form is falling out of favor at NCBI because it means one must retrieve the referenced Bioseq to determine the length of the location. An interval covering the whole Bioseq is equivalent to this and more useful. On the other hand, if an unstable Seq-id is used here, it always applies to the full length of the Bioseq, even if the length changes. This was the original rationale for this type, and it may still be valid while unstable sequences persist.

int: An Interval on a Bioseq

An interval is a single continuous region of defined length on a Bioseq. A single integer value (Seq-interval.from), another single integer value (Seq-interval.to), and a Seq-id (Seq-interval.id) are required. The "from" and "to" values must be in the range 0 to (length - 1) of the Bioseq cited in "id". If there is uncertainty about either the "from" or "to" value, it is expressed in the additional fields "fuzz-from" and/or "fuzz-to", and the "from" and "to" values can be considered a "best guess" location. This design means that simple software can ignore fuzzy values, but they are not lost to more sophisticated tools.

The "from" value is ALWAYS less than or equal to the "to" value, no matter what strand the interval is on. It may be convenient for software to present intervals on the minus strand with the "to" value before the "from" value, but internally this is NEVER the case. This requirement means that software which determines overlaps of locations need never treat plus or minus strand locations differently and it greatly simplifies processing.

The value of Seq-interval.strand is the only value different in intervals on the plus or minus strand. Seq-interval.strand is OPTIONAL since it is irrelevant for proteins, but operationally it will DEFAULT to plus strand on nucleic acid locations where it is not supplied.

The plus or minus strand is an attribute on each simple Seq-loc (interval or point) instead of as an operation on an arbitrarily complex location (as in the GenBank/EMBL/DDBJ flatfile Feature Table) since it means even very complex locations can be processed to a base pair location in simple linear order, instead of requiring that the whole expression be processed and resolved first.

packed-int: A Series of Intervals

A Packed-seqint is simply a SEQUENCE OF Seq-interval. That means the location is resolved by evaluating a series of Seq-intervals in order. Note that the Seq-intervals in the series need not all be on the same Bioseq or on the same strand.

pnt: A Single Point on a Sequence

A Seq-point is essentially one-half of a Seq-interval and the discussion (above) about fuzziness and strand applies equally to Seq-point.

packed-pnt: A Collection of Points

A Packed-seqpnt is an optimization for attaching a large number of points to a single Bioseq. Information about the Seq-id, strand, or fuzziness need not be duplicated for every point. Of course, this also means it must apply equally to all points as well. This would typically be the case for listing all the cut sites of a certain restriction enzyme, for example.
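The saving is easiest to see in a structure sketch. DemoPackedPoints below is a hypothetical layout, not the toolkit's PackSeqPnt: the Seq-id is reduced to a gi and the points are held in a plain array, so that one id and one strand serve every point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Packed-seqpnt. */
typedef struct {
    long        gi;       /* one Seq-id shared by every point (a gi, for brevity) */
    int         strand;   /* one strand shared by every point */
    size_t      num;      /* number of points */
    const long *points;   /* 0-based offsets on the Bioseq */
} DemoPackedPoints;

/* Count the points that fall inside the window from..to (inclusive). */
size_t demo_points_in_window(const DemoPackedPoints *pp, long from, long to)
{
    assert(pp != NULL && from <= to);
    size_t hits = 0;
    for (size_t i = 0; i < pp->num; i++)
        if (pp->points[i] >= from && pp->points[i] <= to)
            hits++;
    return hits;
}
```

For a list of restriction enzyme cut sites, a caller would fill the points array once and query windows of interest without repeating the id or strand per site.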

mix: An Arbitrarily Complex Location

A Seq-loc-mix is simply a SEQUENCE OF Seq-loc. The location is resolved by resolving each Seq-loc in order. The component Seq-locs may be of any complexity themselves, making this definition completely recursive. This means a relatively small amount of software code can process locations of extreme complexity with relative ease.

A Seq-loc-mix might be used to represent a segmented sequence with gaps of unknown length. In this case it would consist of some elements of type "int" for intervals on Bioseqs and some of type "null" representing gaps of unknown length. Another use would be to combine a Seq-interval representing an untranslated leader, with a Packed-seqint from a multi-exon coding region feature, and another Seq-interval representing an untranslated 3' end, to define the extent of an mRNA on a genomic sequence.
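The recursive definition translates into equally short recursive code. The sketch below totals the residues covered by a location; DemoLoc is a hypothetical, flattened stand-in for a Seq-loc that models only the "null", "int", and "mix" types, with choice values taken from the Seq-loc table in the C implementation section, and a gap of unknown size is counted as zero by assumption.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_LOC_NULL = 1, DEMO_LOC_INT = 4, DEMO_LOC_MIX = 8 };

/* Hypothetical stand-in for a Seq-loc node. */
typedef struct DemoLoc {
    int choice;                 /* one of the DEMO_LOC_ values */
    long from, to;              /* used when choice is DEMO_LOC_INT */
    struct DemoLoc *children;   /* used when choice is DEMO_LOC_MIX: head of chain */
    struct DemoLoc *next;       /* next sibling inside an enclosing mix */
} DemoLoc;

/* Total number of residues covered by a location, resolved in order. */
long demo_loc_length(const DemoLoc *loc)
{
    long total = 0;
    assert(loc != NULL);
    switch (loc->choice) {
    case DEMO_LOC_INT:
        return loc->to - loc->from + 1;
    case DEMO_LOC_MIX:
        for (const DemoLoc *c = loc->children; c != NULL; c = c->next)
            total += demo_loc_length(c);   /* components may themselves be mixes */
        return total;
    default:
        return 0;                          /* null: gap of unknown size */
    }
}
```

A mix holding the interval 0-9, a null gap, and a nested mix of the intervals 20-29 and 40-44 resolves to 25 residues with no special handling for the nesting.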

equiv: Equivalent Locations

This form is simply a SET OF Seq-loc which are equivalent to each other. Such a construct could be used to represent alternative splicing, for example (and is when translating the GenBank/EMBL/DDBJ location "one-of"). However note that such a location can never resolve to a single result. Further, if there are multiple "equiv" forms in a complex Seq-loc, it is unclear if all possible combinations are valid. In general this construct should be avoided unless there is no alternative.

bond: A Chemical Bond Between Two Residues

The data elements in a Seq-bond are just two Seq-points. The meaning is that these two points have a chemical bond between them (which is different than describing just the location of two points). At NCBI we have restricted its use to covalent bonds. Note that the points may be on the same Bioseq (intra-chain bond) or on different Bioseqs (inter-chain bond); because each Seq-point carries its own Seq-id, this is always completely explicit.

feat: A Location Indirectly Referenced Through A Feature

This one is really for the future, when not only Bioseqs, but features have stable ids. The meaning is "the location of this feature". This way one could give a valid location by citing, for example a Gene feature, which would resolve to the location of that gene on a Bioseq. When identifiable features become common (see Sequence Features) this will become a very useful location.

Seq-loc: The C Implementation

Since a Seq-loc is a CHOICE of many types, a SeqLocPtr is typedefed as a ValNodePtr. The ValNodePtr->choice indicates the type of SeqLoc and a series of #defines provide the values in a convenient way. The ValNodePtr->data.ptrvalue contains a pointer to the appropriate data structure as in the table below:

Seq-loc

Value  #define            ASN.1 name  Type in ValNode->data
1      SEQLOC_NULL        null        not needed
2      SEQLOC_EMPTY       empty       SeqIdPtr
3      SEQLOC_WHOLE       whole       SeqIdPtr
4      SEQLOC_INT         int         SeqIntPtr
5      SEQLOC_PACKED_INT  packed-int  SeqLocPtr
6      SEQLOC_PNT         pnt         SeqPntPtr
7      SEQLOC_PACKED_PNT  packed-pnt  PackSeqPntPtr
8      SEQLOC_MIX         mix         SeqLocPtr
9      SEQLOC_EQUIV       equiv       SeqLocPtr
10     SEQLOC_BOND        bond        SeqBondPtr
11     SEQLOC_FEAT        feat        ChoicePtr

Note that SEQLOC_MIX and SEQLOC_EQUIV types have a SeqLocPtr in their data.ptrvalue. This is expected since they are a SEQUENCE OF or SET OF Seq-loc and data.ptrvalue contains a pointer to the head of the linked list of ValNodes connected through their ->next pointers. SEQLOC_PACKED_INT is implemented this way as well, for simplicity, although each Seq-loc in that chain will be, by definition, of type SEQLOC_INT.

Like Seq-id, above, there is no SeqLocNew() function since it is just a ValNode, but there are the usual SeqLocAsnRead(), SeqLocAsnWrite(), and SeqLocFree() functions. In addition there are SeqLocSetAsnWrite(), SeqLocSetAsnRead(), and SeqLocSetFree() functions.

PackSeqPnt has some extra functions as well. PackSeqPntNum() returns the number of points in the PackSeqPnt. PackSeqPntGet() will return a point given an index (0 to (number of points - 1)) of the point. PackSeqPntPut() will add a point to the PackSeqPnt. These functions are to hide the complexity of managing the set of points.

A series of #defines for nucleic acid strands are provided to map to the ASN.1 ENUMERATED type. They are:

#define Seq_strand_unknown 0
#define Seq_strand_plus 1
#define Seq_strand_minus 2
#define Seq_strand_both 3
#define Seq_strand_both_rev 4
#define Seq_strand_other 255
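As an illustration of how these values might be used, the sketch below maps a strand value to the value seen from the opposite strand. demo_strand_flip() is a hypothetical helper written for this example, not a toolkit function, and leaving "unknown" and "other" unchanged is an assumption.

```c
#include <assert.h>

/* Repeated from the list above so that the sketch compiles on its own. */
#define Seq_strand_unknown  0
#define Seq_strand_plus     1
#define Seq_strand_minus    2
#define Seq_strand_both     3
#define Seq_strand_both_rev 4
#define Seq_strand_other    255

/* Hypothetical helper: the strand value as seen from the opposite strand. */
int demo_strand_flip(int strand)
{
    assert(strand >= 0 && strand <= 255);   /* range of the values listed above */
    switch (strand) {
    case Seq_strand_plus:     return Seq_strand_minus;
    case Seq_strand_minus:    return Seq_strand_plus;
    case Seq_strand_both:     return Seq_strand_both_rev;
    case Seq_strand_both_rev: return Seq_strand_both;
    default:                  return strand;   /* unknown and other are left alone */
    }
}
```

Since only the strand field differs between a plus-strand and a minus-strand interval, reverse-complementing a simple location under this sketch changes this one value and leaves "from" and "to" untouched.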

In addition, there are a large number of utility functions for working with SeqLocs described in the chapter on Sequence Utilities. These allow traversal of complex locations, comparison of locations for overlap, conversion of coordinates in locations, and the ability to open a window on a Bioseq through a location.

ASN.1 Specification: seqloc.asn

--$Revision: 2.0 $

--**********************************************************************

--

--  NCBI Sequence location and identifier elements

--  by James Ostell, 1990

--

--**********************************************************************

 

NCBI-Seqloc DEFINITIONS ::=

BEGIN

 

EXPORTS Seq-id, Seq-loc, Seq-interval, Packed-seqint, Seq-point, Packed-seqpnt,

        Na-strand, Giimport-id;

 

IMPORTS Object-id, Int-fuzz, Dbtag, Date FROM NCBI-General

        Id-pat FROM NCBI-Biblio

        Feat-id FROM NCBI-Seqfeat;

 

--*** Sequence identifiers ********************************

--*

 

Seq-id ::= CHOICE {

    local Object-id ,      -- local use

    gibbsq INTEGER ,         -- Geninfo backbone seqid

    gibbmt INTEGER ,         -- Geninfo backbone moltype

    giim Giimport-id ,       -- Geninfo import id

    genbank Textseq-id ,

    embl Textseq-id ,

    pir Textseq-id ,

    swissprot Textseq-id ,

    patent Patent-seq-id ,

    other Textseq-id ,       -- catch all

    general Dbtag ,          -- for other databases

    gi INTEGER ,             -- GenInfo Integrated Database

    ddbj Textseq-id ,        -- DDBJ

    prf Textseq-id ,         -- PRF SEQDB

    pdb PDB-seq-id }         -- PDB sequence

 

Patent-seq-id ::= SEQUENCE {

    seqid INTEGER ,         -- number of sequence in patent

    cit Id-pat }           -- patent citation

 

Textseq-id ::= SEQUENCE {

    name VisibleString OPTIONAL ,

    accession VisibleString OPTIONAL ,

    release VisibleString OPTIONAL ,

    version INTEGER OPTIONAL }

 

Giimport-id ::= SEQUENCE {

    id INTEGER ,               -- the id to use here

    db VisibleString OPTIONAL ,  -- dbase used in

    release VisibleString OPTIONAL }   -- the release

 

PDB-seq-id ::= SEQUENCE {

    mol PDB-mol-id ,           -- the molecule name

    chain INTEGER DEFAULT 32 , -- a single ASCII character, chain id

    rel Date OPTIONAL }        -- release date, month and year

 

PDB-mol-id ::= VisibleString  -- name of mol, 4 chars

  

--*** Sequence locations **********************************

--*

 

Seq-loc ::= CHOICE {

    null NULL ,           -- not placed

    empty Seq-id ,        -- to NULL one Seq-id in a collection

    whole Seq-id ,        -- whole sequence

    int Seq-interval ,    -- from to

    packed-int Packed-seqint ,

    pnt Seq-point ,

    packed-pnt Packed-seqpnt ,

    mix Seq-loc-mix ,

    equiv Seq-loc-equiv ,  -- equivalent sets of locations

    bond Seq-bond ,

    feat Feat-id }         -- indirect, through a Seq-feat

   

 

Seq-interval ::= SEQUENCE {

    from INTEGER ,

    to INTEGER ,

    strand Na-strand OPTIONAL ,

    id Seq-id ,    -- WARNING: this used to be optional

    fuzz-from Int-fuzz OPTIONAL ,

    fuzz-to Int-fuzz OPTIONAL }

 

Packed-seqint ::= SEQUENCE OF Seq-interval

 

Seq-point ::= SEQUENCE {

    point INTEGER ,

    strand Na-strand OPTIONAL ,

    id Seq-id ,     -- WARNING: this used to be optional

    fuzz Int-fuzz OPTIONAL }

 

Packed-seqpnt ::= SEQUENCE {

    strand Na-strand OPTIONAL ,

    id Seq-id ,

    fuzz Int-fuzz OPTIONAL ,

    points SEQUENCE OF INTEGER }

 

Na-strand ::= ENUMERATED {          -- strand of nucleic acid

    unknown (0) ,

    plus (1) ,

    minus (2) ,              

    both (3) ,                -- in forward orientation

    both-rev (4) ,            -- in reverse orientation

    other (255) }

 

Seq-bond ::= SEQUENCE {         -- bond between residues

    a Seq-point ,           -- connection to at least one residue

    b Seq-point OPTIONAL }  -- other end may not be available

 

Seq-loc-mix ::= SEQUENCE OF Seq-loc   -- this will hold anything

 

Seq-loc-equiv ::= SET OF Seq-loc      -- for a set of equivalent locations

 

END

C Structures and Functions: objloc.h

/*  objloc.h

* ===========================================================================

*

*                            PUBLIC DOMAIN NOTICE                         

*               National Center for Biotechnology Information

*                                                                         

*  This software/database is a "United States Government Work" under the  

*  terms of the United States Copyright Act.  It was written as part of   

*  the author's official duties as a United States Government employee and

*  thus cannot be copyrighted.  This software/database is freely available

*  to the public for use. The National Library of Medicine and the U.S.   

*  Government have not placed any restriction on its use or reproduction. 

*                                                                         

*  Although all reasonable efforts have been taken to ensure the accuracy 

*  and reliability of the software and data, the NLM and the U.S.          

*  Government do not and cannot warrant the performance or results that   

*  may be obtained by using this software or data. The NLM and the U.S.   

*  Government disclaim all warranties, express or implied, including      

*  warranties of performance, merchantability or fitness for any particular

*  purpose.                                                               

*                                                                         

*  Please cite the author in any work or product based on this material.  

*

* ===========================================================================

*

* File Name:  objloc.h

*

* Author:  James Ostell

*  

* Version Creation Date: 4/1/91

*

* $Revision: 2.0 $

*

* File Description:  Object manager interface for module NCBI-Seqloc

*

* Modifications: 

* --------------------------------------------------------------------------

* Date    Name        Description of modification

* -------  ----------  -----------------------------------------------------

*

*

* ==========================================================================

*/

 

#ifndef _NCBI_Seqloc_

#define _NCBI_Seqloc_

 

#ifndef _ASNTOOL_

#include <asn.h>

#endif

#ifndef _NCBI_General_

#include <objgen.h>

#endif

#ifndef _NCBI_Biblio_

#include <objbibli.h>

#endif

 

typedef ValNodePtr SeqIdPtr;

typedef ValNodePtr SeqLocPtr;

 

#ifndef _NCBI_Seqfeat_

#include <objfeat.h>      /* after Seqloc to avoid cycles */

#endif

 

#ifdef __cplusplus

extern "C" {

#endif

 

/*****************************************************************************

*

*   Seqloc loader

*

*****************************************************************************/

extern Boolean SeqLocAsnLoad PROTO((void));

 

/*****************************************************************************

*

*   internal structures for NCBI-Seqloc objects

*

*****************************************************************************/

 

/*****************************************************************************

*

*   SeqId is a choice using a ValNode, most types in data.ptrvalue

*      except integers, in data.intvalue

*   choice:

*   0 = not set

    1 = local Object-id ,      -- local use

    2 = gibbsq INTEGER ,         -- Geninfo backbone seqid

    3 = gibbmt INTEGER ,         -- Geninfo backbone moltype

    4 = giim Giimport-id ,       -- Geninfo import id

    5 = genbank Textseq-id ,

    6 = embl Textseq-id ,

    7 = pir Textseq-id ,

    8 = swissprot Textseq-id ,

    9 = patent Patent-seq-id ,

    10 = other Textseq-id ,       -- catch all

    11 = general Dbtag          -- for other databases

    12 = gi  INTEGER          -- GenInfo Integrated Database

    13 = ddbj Textseq-id

   14 = prf Textseq-id ,         -- PRF SEQDB

   15 = pdb PDB-seq-id          -- PDB sequence

*

*****************************************************************************/

#define SEQID_NOT_SET ( (Uint1)0)

#define SEQID_LOCAL ( (Uint1)1)

#define SEQID_GIBBSQ ( (Uint1)2)

#define SEQID_GIBBMT ( (Uint1)3)

#define SEQID_GIIM ( (Uint1)4)

 

/*---

 * WARNING: CODE in objloc.c, especially SeqIdPrint() requires that

 * GENBANK through SwissProt be contiguous numbers

 * in the following order.

 *-----*/

#define SEQID_GENBANK ( (Uint1)5)

#define SEQID_EMBL ( (Uint1)6)

#define SEQID_PIR ( (Uint1)7)

#define SEQID_SWISSPROT ( (Uint1)8)

 

 

#define SEQID_PATENT ( (Uint1)9)

#define SEQID_OTHER ( (Uint1)10)

#define SEQID_GENERAL ( (Uint1)11)

#define SEQID_GI ( (Uint1)12)

#define SEQID_DDBJ ((Uint1)13)

#define SEQID_PRF ((Uint1)14)

#define SEQID_PDB ((Uint1)15)

 

Boolean SeqIdAsnWrite PROTO((SeqIdPtr anp, AsnIoPtr aip, AsnTypePtr atp));

SeqIdPtr SeqIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqIdPtr SeqIdFree PROTO((SeqIdPtr anp));

SeqIdPtr SeqIdDup PROTO((SeqIdPtr oldid));

 

/*****************************************************************************

*

*   These routines process sets or sequences of SeqId's

*

*****************************************************************************/

Boolean SeqIdSetAsnWrite PROTO((SeqIdPtr anp, AsnIoPtr aip, AsnTypePtr settype, AsnTypePtr elementtype));

SeqIdPtr SeqIdSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr settype, AsnTypePtr elementtype));

SeqIdPtr SeqIdSetFree PROTO((SeqIdPtr anp));

 

 

/*****************************************************************************

*

*   PatentSeqId

*

*****************************************************************************/

typedef struct patentseqid {

    Int2 seqid;

    IdPatPtr cit;

} PatentSeqId, PNTR PatentSeqIdPtr;

 

PatentSeqIdPtr PatentSeqIdNew PROTO((void));

Boolean PatentSeqIdAsnWrite PROTO((PatentSeqIdPtr psip, AsnIoPtr aip, AsnTypePtr atp));

PatentSeqIdPtr PatentSeqIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

PatentSeqIdPtr PatentSeqIdFree PROTO((PatentSeqIdPtr psip));

 

/*****************************************************************************

*

*   TextSeqId

*

*****************************************************************************/

typedef struct textseqid {

    CharPtr name,

        accession,

        release;

   Int2 version;             /* INT2_MIN (ncbilcl.h) = not set */

} TextSeqId, PNTR TextSeqIdPtr;

 

TextSeqIdPtr TextSeqIdNew PROTO((void));

Boolean TextSeqIdAsnWrite PROTO((TextSeqIdPtr tsip, AsnIoPtr aip, AsnTypePtr atp));

TextSeqIdPtr TextSeqIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

TextSeqIdPtr TextSeqIdFree PROTO((TextSeqIdPtr tsip));

 

/*****************************************************************************

*

*   Giim

*

*****************************************************************************/

typedef struct giim {

    Int4 id;

    CharPtr db,

        release;

} Giim, PNTR GiimPtr;

 

GiimPtr GiimNew PROTO((void));

Boolean GiimAsnWrite PROTO((GiimPtr gip, AsnIoPtr aip, AsnTypePtr atp));

GiimPtr GiimAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

GiimPtr GiimFree PROTO((GiimPtr gip));

 

 

/*****************************************************************************

*

*   PDBSeqId

*

*****************************************************************************/

typedef struct pdbseqid {

    CharPtr mol;

   Uint1 chain;        /* 0 = no chain set.  default = 32 */

   DatePtr rel;

} PDBSeqId, PNTR PDBSeqIdPtr;

 

PDBSeqIdPtr PDBSeqIdNew PROTO((void));

Boolean PDBSeqIdAsnWrite PROTO((PDBSeqIdPtr tsip, AsnIoPtr aip, AsnTypePtr atp));

PDBSeqIdPtr PDBSeqIdAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

PDBSeqIdPtr PDBSeqIdFree PROTO((PDBSeqIdPtr tsip));

 

/*****************************************************************************

*

*   SeqLoc

*   SeqLoc is a choice using a ValNode, most types in data.ptrvalue

*      except integers, in data.intvalue

*   choice:

    1 = null NULL ,           -- not placed

    2 = empty Seq-id ,        -- to NULL one Seq-id in a collection

    3 = whole Seq-id ,        -- whole sequence

    4 = int Seq-interval ,    -- from to

    5 = packed-int Packed-seqint ,

    6 = pnt Seq-point ,

    7 = packed-pnt Packed-seqpnt ,

    8 = mix SEQUENCE OF Seq-loc ,

    9 = equiv SET OF Seq-loc ,  -- equivalent sets of locations

    10 = bond Seq-bond

    11 = feat Feat-id    -- indirect through a feature

*

*****************************************************************************/

#define SEQLOC_NULL ( (Uint1)1)

#define SEQLOC_EMPTY ( (Uint1)2)

#define SEQLOC_WHOLE ( (Uint1)3)

#define SEQLOC_INT ( (Uint1)4)

#define SEQLOC_PACKED_INT ( (Uint1)5)

#define SEQLOC_PNT ( (Uint1)6)

#define SEQLOC_PACKED_PNT ( (Uint1)7)

#define SEQLOC_MIX ( (Uint1)8)

#define SEQLOC_EQUIV ( (Uint1)9)

#define SEQLOC_BOND ( (Uint1)10)

#define SEQLOC_FEAT ( (Uint1)11)

 

Boolean SeqLocAsnWrite PROTO((SeqLocPtr anp, AsnIoPtr aip, AsnTypePtr atp));

SeqLocPtr SeqLocAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqLocPtr SeqLocFree PROTO((SeqLocPtr anp));

 

 

/*****************************************************************************

*

*   these routines work on set/seq of SeqLoc

*

*****************************************************************************/

Boolean SeqLocSetAsnWrite PROTO((SeqLocPtr anp, AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));

SeqLocPtr SeqLocSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr orig, AsnTypePtr element));

SeqLocPtr SeqLocSetFree PROTO((SeqLocPtr anp));

 

/*****************************************************************************

*

*   SeqInt

*

*****************************************************************************/

typedef struct seqint {

    Int4 from,

        to;

    Uint1 strand;

    SeqIdPtr id;    /* seq-id */

    IntFuzzPtr if_from,

               if_to;

} SeqInt, PNTR SeqIntPtr;

 

SeqIntPtr SeqIntNew PROTO((void));

Boolean SeqIntAsnWrite PROTO((SeqIntPtr sip, AsnIoPtr aip, AsnTypePtr atp));

SeqIntPtr SeqIntAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqIntPtr SeqIntFree PROTO((SeqIntPtr sip));

 

/*****************************************************************************

*

*   Packed-int

*

*****************************************************************************/

 

Boolean PackSeqIntAsnWrite PROTO((SeqLocPtr sip, AsnIoPtr aip, AsnTypePtr atp));

SeqLocPtr PackSeqIntAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

 

/*****************************************************************************

*

*   SeqLocMix

*

*****************************************************************************/

 

Boolean SeqLocMixAsnWrite PROTO((SeqLocPtr anp, AsnIoPtr aip, AsnTypePtr atp));

SeqLocPtr SeqLocMixAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

 

/*****************************************************************************

*

*   SeqLocEquiv

*

*****************************************************************************/

 

Boolean SeqLocEquivAsnWrite PROTO((SeqLocPtr anp, AsnIoPtr aip, AsnTypePtr atp));

SeqLocPtr SeqLocEquivAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

 

/*****************************************************************************

*

*   SeqPnt

*

*****************************************************************************/

typedef struct seqpoint {

    Int4 point;

    Uint1 strand;

    SeqIdPtr id;    /* seq-id */

    IntFuzzPtr fuzz;

} SeqPnt, PNTR SeqPntPtr;

 

SeqPntPtr SeqPntNew PROTO((void));

Boolean SeqPntAsnWrite PROTO((SeqPntPtr spp, AsnIoPtr aip, AsnTypePtr atp));

SeqPntPtr SeqPntAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqPntPtr SeqPntFree PROTO((SeqPntPtr spp));

 

/*****************************************************************************

*

*   PackSeqPnt

*

*****************************************************************************/

#define PACK_PNT_NUM 100     /* number of points per block */

 

typedef struct packseqpnt {

    SeqIdPtr id;    /* seq-id */

    IntFuzzPtr fuzz;

    Uint1 strand,

          used;       /* number of pnts used */

    Int4 pnts[PACK_PNT_NUM];

    struct packseqpnt PNTR next;   /* builds up chain of points */

} PackSeqPnt, PNTR PackSeqPntPtr;

 

PackSeqPntPtr PackSeqPntNew PROTO((void));

Boolean PackSeqPntAsnWrite PROTO((PackSeqPntPtr pspp, AsnIoPtr aip, AsnTypePtr atp));

PackSeqPntPtr PackSeqPntAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

PackSeqPntPtr PackSeqPntFree PROTO((PackSeqPntPtr pspp));

Int4 PackSeqPntGet PROTO((PackSeqPntPtr pspp, Int4 index));

Boolean PackSeqPntPut PROTO((PackSeqPntPtr pspp, Int4 point));

Int4 PackSeqPntNum PROTO((PackSeqPntPtr pspp));

 

/*****************************************************************************

*

*   SeqBond

*

*****************************************************************************/

typedef struct seqbond {

    SeqPntPtr a,

                b;

} SeqBond, PNTR SeqBondPtr;

 

SeqBondPtr SeqBondNew PROTO((void));

Boolean SeqBondAsnWrite PROTO((SeqBondPtr sbp, AsnIoPtr aip, AsnTypePtr atp));

SeqBondPtr SeqBondAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

SeqBondPtr SeqBondFree PROTO((SeqBondPtr sbp));

 

 

/*****************************************************************************

*

*   strand types

*

*****************************************************************************/

#define Seq_strand_unknown 0

#define Seq_strand_plus 1

#define Seq_strand_minus 2

#define Seq_strand_both 3

#define Seq_strand_both_rev 4

#define Seq_strand_other 255

 

#ifdef __cplusplus

}

#endif

 

#endif