Philosophy and Definition for a Universal Genetic Sequence Database 1

Thomas D. Schneider 2

version = 1.82 of philgen.tex 2000 Feb 29

ABSTRACT

Modern sequence databases have many problems because they are not carefully defined. For example, when one searches for homology with a given sequence, many duplicate sequences are found in each database and similar but not identical results are obtained from other databases. The extra copies of inconsistent information are slowing down research. This situation could easily be avoided by removing redundancy in the databases, but this goal is not a fundamental component of the database design and has been neglected. A clear statement of goals for storing genetic sequence information is required. To this end, five documents, in decreasing order of importance, are proposed:
(1) The Philosophy document defines guiding principles for the design and use of the database.
(2) The Definition document identifies what is to be stored in the database, following the guidance of the philosophy. It is machine and computer-language independent.
(3) The Implementation document translates the definition into computer code which can be run on machines.
(4) Examples of all database objects allow users to test their database analysis programs.
(5) A Tutorial explains how the database is organized.
A philosophy and the beginnings of a definition are given in this paper.



Individual researchers can help by:
1. providing correct and complete sequence data;
2. annotating their sequences with experimentally deduced features;
3. providing standard genetic names for all features;
4. updating the annotation as new knowledge is obtained;
5. periodically verifying the accuracy of the data and annotation in the public database;
6. volunteering to curate data in their areas of expertise.

GenBank runs the risk of becoming so much noise. It isn't just a matter of collecting all sequences like some avid bowerbird, researchers should be able to get information OUT of GenBank as well.--Ingrid Jakobsen [1]

Unless the architecture and the nature of the information represented in a database are concisely and comprehensively defined, the database itself will be of limited value.--David George [2]

At the time of writing, data-base technology is widely misunderstood. Its role as the foundation stone of future data processing is often not appreciated. The techniques used in many organizations contain the seeds of immense future difficulties. Data independence is often thrown to the winds. Data organizations in use prevent the data being employed as it should be.--James Martin [3]



INTRODUCTION

The way data are stored in a database affects the kind and quality of the science that can be done with the database. My scientific work has led me to a particular view of how genetic sequence databases can be organized to make them useful for research. Fifteen years ago I began work on E. coli ribosome binding sites by typing into a computer the sequences around the initiation codons of 63 genes. I soon realized that if I wanted to look at different regions relative to the starts, I would have to edit all the sequences by hand. This was likely to lead to many errors [4], and I would have to repeat the editing for each new analysis. To avoid that, I decided to store all the known sequences and then to write a program which could extract the portions I needed for a particular analysis. To tell the program what I wanted, I invented a language for doing the extractions and named it Delila, for DEoxyribonucleic acid LIbrary LAnguage [5,6,7]. In this language one first specifies a sequence region:

  organism S.cerevisiae;
  chromosome III;
  gene LEU2;
Then one requests particular parts of that sequence:
  get from gene begin -30 to gene begin +40;
This will extract the sequence from 30 bases before the first base of the initiation codon to 40 bases after it. This instruction is ``relative'' because it defines a sequence relative to a defined chromosomal location. Alternatively, one can give the absolute coordinates:
  get from 90935 -30 to 90935 +40;
or equivalently
  get from 90905 to 90975;
Editing instructions is much easier, safer, faster and more reliable than editing sequences directly. In addition, other programs can create or modify a list of instructions, so powerful ways of manipulating large sets of sequences became possible [8]. The Delila program has been enormously useful for many studies. In particular, it formed the basis for the first use of neural nets [9] and information theory [10,11,12] to analyze binding sites. In a recent analysis we manipulated almost 3600 sequences by this method [13], and in unpublished work we have easily worked with 8000 sequence fragments.
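The equivalence between the relative and absolute forms of the instruction shown earlier is simple coordinate arithmetic. The following is a minimal Python sketch of that arithmetic, not Delila itself. The table `FEATURE_BEGIN` and the function `resolve` are hypothetical names introduced for illustration; the coordinate 90935 is taken from the example above.

```python
# Hypothetical feature table: gene name -> coordinate of the first base
# of the initiation codon (value taken from the LEU2 example in the text).
FEATURE_BEGIN = {"LEU2": 90935}

def resolve(gene, from_offset, to_offset):
    """Turn a relative request into absolute (start, end) coordinates."""
    begin = FEATURE_BEGIN[gene]
    return begin + from_offset, begin + to_offset

# Counterpart of "get from gene begin -30 to gene begin +40;" for LEU2.
print(resolve("LEU2", -30, +40))
```

If the coordinate system of the database changes, only the feature table needs to be updated; the relative instruction itself stays the same.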

At about the same time that Delila was being created, the international sequence databases were formed to reduce the worldwide duplication of effort in capturing sequence data. The databases have been highly successful in this endeavor. It would be very useful to write a ``Delila II'' program which could take full advantage of the larger databases. As with Delila I, Delila II could use the names of genetic features to define the sequences of interest. Since the location of many genetic features is well defined, or could be defined by international convention [14], this approach would make the Delila instructions reasonably insensitive to changes in the growing database. Scientists around the world could exchange their lists of sequence instructions, and could build on each other's work.

As described in the next section, the present structure of the GenBank database [15] makes these kinds of powerful sequence manipulations difficult. We are gathering sequences at an exponentially increasing rate, but the locations of genetic objects are not being stored so that they can be retrieved automatically. Because there is no overall idea of how the data could be used, the data are not being stored in a sensible way. Just as we once had a crisis about capturing sequence data, we now are approaching a crisis about compiling and annotating sequences.

There are two basic reasons for these problems. First, there are no published principles guiding the design of the database. Second, there is no published and accepted documentation that defines the entire database. Without a written document, a researcher cannot recognize that the database structure is incorrect. Documentation that does exist is often incomplete and unclear. The documentation usually describes implementation of the database and does not say why the database was built this way. There is no defined set of rules by which the database is run. The content and quality of the data vary.

To solve these problems, this paper proposes the adoption of a collection of guiding principles for sequence databases. A system of fundamental or motivating principles is a ``philosophy'' [16]. Without a guiding philosophy we will eventually be unable to use much of the data we are spending millions of dollars to collect [17,18].

A philosophy is not enough. We must use the philosophy to precisely define the contents of the database. We must publish the definition so that the ideas can be widely disseminated and critiqued. The importance of documenting computer programs is widely understood; the same should apply to databases. Examples of such documents are given in the references [5,19,20,21].

I do not merely mean that we need manuals which explain the database to outsiders or describe its current status. Instead, I envision documents equivalent to the constitution and laws of a country. They guide the course of events. They are ``alive'' and continuously debated. They are changed when times change. But if they are not changed, then they are strictly followed. GenBank needs such guidance.

This paper also proposes that we should store genetic sequence information in a way close to that found in nature. This would allow the sequence information to be gathered into huge but well structured and easily understood data objects. Instructions for obtaining those objects would be easy to write, and instruction lists could be exchanged between working scientists. Attaining a clean and organized sequence database will require the concerted efforts of not only all the database staffs to precisely define the data structure, but also those of every molecular biologist who works with sequence data to provide and check the data in the structure. Design and development of advanced database access languages such as Delila II will be blocked until the database is clean.





DATABASE PROBLEMS: a ``Taxonomy'' of Errors

There are many kinds of errors in the GenBank database [22,23,24,25,26,27]. We discuss some of them here, along with examples, to give the reader an idea of the enormous magnitude of the problem and to explain how these problems affect designing and using Delila-like sequence access languages for scientific research.

Readers are encouraged to report the errors they find in GenBank and to make sure that the correction appears in the database. The current electronic mail address for error reports is update@ncbi.nlm.nih.gov.

These are difficult problems and we cannot expect them to be solved overnight. One step forward is to clarify our goals so that we can work together toward solutions. The following Philosophy and Definition are proposed as a starting point.





A DATABASE PHILOSOPHY

Each principle strongly affects how one goes about designing and constructing a database, but the detailed design is not given by the principle. Principles are like axioms in geometry, while (ideally) the design and implementation are like theorems built up from the axioms.

Principle 1: There is only one official copy of each piece of information in the primary database.

This principle implies that all other unofficial copies must be derived automatically. If duplications were allowed, then one duplicate could be modified without update of another. This leads to inconsistencies. Inconsistencies lead directly to errors when the wrong item is deleted or to data loss when an item is deleted which happens to carry non-redundant data [3].

When two authors publish overlapping sequences, a single merged view of the data should be created which contains all features in one place. Discrepancies should be recorded so that each original publication can be reconstructed automatically. In the present system, every scientist must perform the merge, and must locate all the differences because there is no guarantee that a merged view exists. Worse, when a merged sequence has been generated, the original entries are still kept, so the poor researcher must fight even more duplication. The multiplication of this wasted effort is enormous on a world-wide scale. Conversely, funding for curators (people who merge, correct and annotate sequences) is highly economical because a single person can save effort on the part of many people around the world.

A common objection to this proposal is to ask whether one can trust the person who does the merge. This objection is invalid, not only because a person doing such a merge must be an expert in the relevant sequences, but also because attempting to merge sequence data almost invariably reveals inconsistencies in the sequence data itself. By resolving these conflicts, the data are improved. In the current plans of NCBI, the original fragments are kept and can be used to generate a merged sequence [29]. This allows any researcher to confirm the merge, but also allows one to see it as merged. However, Principle 1 implies that it would be better to have only one copy of the data, and to produce the original unmerged and erroneous reports upon demand. The database would become easier to use because there would be far fewer sequences and they would be more carefully reviewed. Furthermore, if sequences are kept separate and merges are to be performed automatically, the failure mode is unmerged sequences, which is not what most biologists want. Conversely, if sequences are kept merged, computation is only needed when the original sequence is requested. This looks exactly like a Delila extraction (``get sequence of reference 6''), and therefore would be easy to do.
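To make the idea of keeping one merged copy and regenerating the original reports on demand concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any existing database: the class `MergedSequence`, its fields, and the toy sequence are all hypothetical, coordinates are 0-based with exclusive ends, and only single-base substitutions are recorded as discrepancies.

```python
from dataclasses import dataclass, field

@dataclass
class MergedSequence:
    """One official merged sequence plus per-reference discrepancy records."""
    sequence: str
    # reference id -> (start, end) extent of that report within the merge
    extents: dict = field(default_factory=dict)
    # reference id -> {position in merge: base originally reported there}
    discrepancies: dict = field(default_factory=dict)

    def get_reference(self, ref):
        """Rebuild the sequence exactly as reference `ref` reported it."""
        start, end = self.extents[ref]
        bases = list(self.sequence[start:end])
        for pos, base in self.discrepancies.get(ref, {}).items():
            bases[pos - start] = base
        return "".join(bases)

# Two overlapping reports; reference 6 reported A where the merge has T.
merged = MergedSequence(
    sequence="ATGGCGTTAACC",
    extents={5: (0, 8), 6: (4, 12)},
    discrepancies={6: {6: "A"}},
)
print(merged.get_reference(5))
print(merged.get_reference(6))
```

The merged sequence is stored once, and each original report costs only its extent and its list of differences.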



Principle 2: A user should not be required to extract the original paper(s) in order to work with a sequence.

The user should be provided with complete information describing each database object so that they can access, manipulate, process, do statistics on, read or check a sequence. When one is manipulating thousands of sequence fragments [13], it is impractical to look up every relevant paper. Even if one could obtain and read all the papers, the current doubling time of 1.8 years for the database makes it impossible to keep up to date.

This principle implies that the database should capture biological knowledge so robustly that users will never feel that they must go back to the literature to learn the features of a sequence. 100 years from now nobody will want to open thousands of dusty and cracked journal volumes or scan thousands of feet of microfilm to read the original papers--they will want to look directly in the database, and they will want to be able to automatically identify every type of sequence object.



Principle 3: The database structure should allow...
...ved subsets of the database to have the same form as the original database.

This allows one to use the same tools on the subset as on the original set. Any analysis of sequence data can be thought of as using a subset of a large database, even if the subset contains only one sequence member. For example, a subset could be the sequence of the plasmid pBR322. With a simple program, a subset of that subset could be created which contains HaeIII restriction digestion fragments. If this principle is followed, it would be easy to apply the same program again to the HaeIII fragment subset but with HpaII to produce a double digest. Another kind of useful subset is a set of aligned sequences. Every kind of sequence analysis can be performed on subsets of the database, so tools for creating subsets should have the highest priority, and the database should be arranged to facilitate extraction of subsets, as in the Delila system [6,8,9,7,]. The current database was not designed to permit the construction of subsets. This makes it difficult to analyze small regions embedded in a large sequence such as a whole chromosome [31].
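The double-digest example can be sketched in a few lines of Python. This is only an illustration of the principle that a subset has the same form as its parent: the function `digest` takes a list of fragments and returns a list of fragments, so its output can be fed straight back in. The function name, the `ENZYMES` table layout, and the short toy sequence are assumptions made for this sketch; the molecule is treated as linear and single-stranded, whereas pBR322 itself is circular.

```python
# Recognition site and cut offset (the cut falls after this many bases of the site).
# HaeIII cuts GG^CC; HpaII cuts C^CGG.
ENZYMES = {"HaeIII": ("GGCC", 2), "HpaII": ("CCGG", 1)}

def digest(fragments, enzyme):
    """Cut every fragment at each site of `enzyme`; return a list of fragments."""
    site, offset = ENZYMES[enzyme]
    result = []
    for frag in fragments:
        start = 0
        pos = frag.find(site)
        while pos != -1:
            cut = pos + offset
            result.append(frag[start:cut])
            start = cut
            pos = frag.find(site, pos + 1)
        result.append(frag[start:])
    return result

toy = ["AAGGCCTTCCGGAAGGCCTT"]          # toy sequence, not pBR322
single = digest(toy, "HaeIII")          # subset of the database
double = digest(single, "HpaII")        # subset of the subset, same form
print(single)
print(double)
```

Because input and output share one form, no special code is needed for the second digest.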



Principle 4: Scientists are primarily responsibl...
...tabase editors and reviewers arbitrate the final form of each data object.

Each person who submits a sequence is responsible for its completeness and accuracy. It is impossible for a database staff to ensure this. Furthermore, as the scientist learns more about a sequence, he or she is obliged to enter the new data. If this is not done, the scientist's work will be effectively unavailable to other scientists because data in the original papers cannot be found by electronic searches. As additional data are contributed by other investigators, an editorial/referee process similar to that of major scientific journals could ensure that any discrepancies are resolved and that new data are properly incorporated or merged. The community must deal with issues such as how many errors in a sequence of a given length are acceptable, and what mechanisms will be used to detect both errors and the misinterpretation of natural variants as sequencing errors.



Principle 5: The primary responsibility of the d...
...phy, a definition, an implementation description, examples and a tutorial.

``This [not only] includes the physical organization of the data themselves ... but also includes providing a well thought out plan concerning how contributing scientists will interact with the database and for developing procedures for ensuring that these efforts are effectively coordinated.'' --David George [2]

When database design is mixed with implementation and the philosophy is neglected, it is difficult to know what the overall strategy is or how a database is likely to be represented in the future. A set of easily accessible documents which thoroughly address these issues would solve many problems. These documents should continue to be developed as advances in computer technology and molecular biology are made.

1. Philosophy. The philosophy of the database describes the principles which guide its organization. This lets everyone know WHY the database has been structured as it is. The philosophy must be written out in detail because only then can people challenge the fundamental ideas implied by the current implementation.

2. Definition or Design. The definition document describes WHAT kinds of data are in the database, and their organization. This includes a database schema which diagrams the interconnections between the data items [3,5]. The principles that comprise the philosophy define a framework for designing the database but there can be many designs that match a given philosophy. The organization of a genetic database should allow input, modification and retrieval of information, yet it should also present the data in a form corresponding to the user's mental model of genetics. Without these characteristics it is difficult to write efficient programs which access and manipulate the data and it is difficult to answer biological questions. All data types should be defined in detail. The way that the data are to be stored physically is not part of the definition. It is, however, necessary for the author of the definition document to create definitions that can be implemented efficiently.

This document has not existed for GenBank, and as a result there is no agreement as to what should be contained in the database and how the data should be organized. This led to inconsistently used feature types and varying degrees of annotation. The entire database must be defined, not just the ``features'' [21].

3. Implementation. HOW the database is constructed in a specific computer language is described by this document. It is possible to have several different implementations for the same philosophy and definition. 100 years from now the implementation may well be different, but the design could be the same.

The following kinds of questions are answered by the implementation document: Should the database be relational (i.e. like a table of numbers) or object oriented (i.e. like a set of nested objects)? Should storage be in flat files (i.e. a linear set of characters) or as structures resident in memory (i.e. always in active computer memory)? Limitations on the lengths of names and other implementation-dependent decisions are described. Syntax may be defined in languages like BNF [32] or ASN.1 [29], but these do not address the philosophy or definitions issues.

A good implementation-free database definition allows easier migration to new computer platforms [6,7,20].

Maintaining a clear distinction between implementation, design and philosophy is critical for the long term usefulness of a database. Implementation of a database without a clearly stated definition and philosophy is a poor engineering practice that has led to the problems described in this paper. Just as one draws plans before building a bridge, one should write the philosophy and definition before implementing any computer program. If changes are required, the definition should be altered before the database or code is modified. This was done for the Delila system [5]. This disciplined approach has several healthy consequences:

(a) The database or code is always fully documented.
(b) Other people can comment on proposed changes before they are implemented.
(c) Changes are made only after careful consideration and users are well warned of impending changes.

What can be done given that we already have a database, but don't have these documents? The appropriate response is to step back, set aside all political [33], commercial [34] and other considerations and develop the philosophy and design from a fresh viewpoint. Current database implementation can then be guided toward the new design.

The philosophy, definition and implementation documents each serve a different purpose and should be separate documents. Existing documents mix the highest definitions of database structure and philosophy with the niggling details of what character should go where in a particular data object [20,29,21]. Not only is this confusing to the reader, but it completely obscures the overall strategy. If the document is unclear, it is difficult to recognize the problems associated with it, and therefore difficult to suggest corrections. The absence of a full set of documentation is a disservice to the end user.

4. Examples. Examples of every database object should be given so that scientists can see how to express themselves in the database language. A small test set containing every defined data item would be most useful for the development of applications that call upon GenBank data.

5. Tutorial. This document describes how to create and maintain the database so that anybody can retrieve information from it and help to build it.

Most of the problems with GenBank described earlier in this paper could have been prevented if the documents described above had been created and an extensive series of programs designed to check the integrity of the database had been written and used. Such check programs must be written based on a defining document, not from the current implementation of the database, because that is not definitive. A test of the defining and implementation documents is that they allow anybody to write a program to check the database structure.
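As a rough illustration of a check program written from a defining document rather than from the current implementation, the Python sketch below encodes a tiny fragment of a definition as data and tests an entry against it. Everything here is hypothetical: the `DEFINITION` table, the field names, and the nesting of species, chromosome and piece (borrowed from the natural organization proposed later in this paper) are assumptions made for the example, not an existing GenBank specification.

```python
# Hypothetical fragment of a definition: each object type lists the fields it
# must carry and the type of object it may contain.
DEFINITION = {
    "species":    {"required": {"name"}, "child": "chromosome"},
    "chromosome": {"required": {"name"}, "child": "piece"},
    "piece":      {"required": {"name", "sequence"}, "child": None},
}

def check(obj, expected_type="species", path=""):
    """Return a list of violations of DEFINITION found in `obj` and its children."""
    here = f"{path}/{obj.get('name', '?')}"
    if obj.get("type") != expected_type:
        return [f"{here}: expected {expected_type}, found {obj.get('type')}"]
    rule = DEFINITION[expected_type]
    missing = sorted(rule["required"] - set(obj))
    errors = [f"{here}: missing field '{name}'" for name in missing]
    children = obj.get("children", [])
    if rule["child"] is None:
        if children:
            errors.append(f"{here}: may not contain other objects")
    else:
        for child in children:
            errors.extend(check(child, rule["child"], here))
    return errors

entry = {"type": "species", "name": "E.coli", "children": [
    {"type": "chromosome", "name": "c1", "children": [
        {"type": "piece", "name": "p1"},   # the sequence field is missing
    ]},
]}
for problem in check(entry):
    print(problem)
```

The checker contains no knowledge of its own about the database; changing the definition changes what is checked, which is the property argued for above.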



Principle 6: Everyone using the database is responsible for reporting errors, however small.

If an error is not reported, it will, of course, remain in the database where it could lead to erroneous biological or medical interpretations. Suppose that a gene is sequenced from two individuals, one of whom is a normal homozygote and the other of whom is a carrier of a recessive allele. Both sequences are reported but the recessive variant is not documented as a mutation, although this was known from another source. Someone with the disorder is found to carry two copies of the recessive allele, but because of the incorrect annotation in the database, that individual is not recognized to be expressing a mutant gene product, but is simply thought to carry a natural polymorphism. This could have serious consequences for diagnosis and treatment of such a patient. It is conceivable that both the database staff and the authors could be held legally responsible. We need to find incentives that encourage investigators to report corrections.



Principle 7: For determining scientific preceden...
...ject data into the database has higher priority than physical publication.

Placing data into the database is a form of electronic publishing [34] that establishes precedence.

``That is if you were the first to clone and sequence a particular gene, the next person who sequences the homologue would find your sequence already in the database, and would be obliged to cite your entry, even if it had not yet been published.''--Brian Fristensky [35]

This principle is actually a variation (or corollary) of Principle 1, in that it gives the highest authority to the unique copy of each data item in the database. Publication of the data in other media has lower authority.

Sequence objects that are not entered into the database should not be considered discoveries, just as an invention is not a patent until it is published by the Patent Office. For example, suppose that a DNA segment has been sequenced and inserted into the database. Two years later someone confirms a predicted protein sequence and publishes it but does not update the database. Others who are interested in studying thousands of genes simultaneously do not have the resources to search for this additional data, and therefore don't use it or are forced to redo the experiments. Credit would not be given to the original work. Unannotated information will be lost or inaccessible in the coming electronic age.

This principle does not deny that the original publication contains valuable information and provides insights into the structure and function of the sequence that is described in the database.



Principle 8: There is only one biology.

It is well recognized that the biology of other organisms is similar to human biology. The pervasive use of nucleic acids, the central dogma and general metabolism point to the unity of biology, while general physical principles and evolution will undoubtedly apply to biological systems on other planets [36,37,12]. Biology is universal.

This principle has two consequences.

1. There should only be one international sequence database. The plethora of formats makes sequence analysis much more difficult than it should be. Submissions to multiple databases increase the likelihood of redundant and inconsistent data. Politically motivated squabbling over database format and over who should accept data submissions should no longer be tolerated by molecular biologists. To maintain consistency, there should be only one submission point and only one dispersion point, overseen by an internationally supported standards committee [14].

This does not mean that all the work should be done in one place and that there should be only one group of people doing database work. To the contrary, there is so much work to be done that much of it should be spread out to other data centers. But this should be controlled from a single point to avoid duplication of effort [33]. Loose federations of databases, advocated by some [38], would cause intractable inconsistencies. `...the community would be better served by the convenience of ``one-stop shopping'' ' [14].

All data which can be associated with a sequence should be directly accessible in a single unified format and within a single data structure. The profusion of specialized databases of different formats limits the usefulness of computer tools that are applicable to many biological problems [39]. Close linkage between nucleic acid, protein, structural and other databases is only a first step towards a complete and uniform synthesis of molecular biological data.

This is a variation of Principle 1, but applied to whole databases. Derived databases should be automatically created from the common database so that they can be kept up to date with no human intervention (Principle 3). This would ensure that complete and consistent annotation is in the subsets and we will no longer be forced to navigate through a forest of inconsistent databases.

The lack of an internationally accepted standard format means that scientists from different parts of the world have a difficult time exchanging data.

2. The database can follow a natural taxonomic and genetic organization plan. The current databases are oriented towards human efforts. Thus the concept of an ``entry'' in GenBank is a reflection of individual researchers' efforts at obtaining a sequence in a particular genetic region. As more sequences become available, they are all thrown together in a messy pile of entries. Everybody who wants to do thorough work is forced by this disorganization to run programs to figure out where the overlaps are and is forced to read many papers to determine how or whether the sequences are related. They are also forced to merge the features themselves.

The current implementation of GenBank emphasizes sequences and neglects new information about the sequences--there is no reliable mechanism for capturing information not associated with the original sequence. Information from different scientists must be merged together, yet it must be possible to trace back to the original data in every case. However, being able to trace back is much less important than providing a clean and merged view of the data for scientific research. The justification for creating the sequence database is scientific, not historical, research.

In nature, sequences have natural packages. The primary one is the taxonomic classification focused on species. Thus we should look at a database and see a series of data objects representing species and particular strains. Each species has a set of chromosomes. As sequencing proceeds, the chromosome sequences are filled out, but this view of the data does not change. When the chromosome is finished, there would be a single data object in the database rather than a pile of disconnected data like yeast chromosome III. For example, this is the organization of the Online Mendelian Inheritance in Man catalogue [40].

One common objection to a natural scheme is that it is only one ``view'' on the data. However, the natural viewpoint (as described more fully in the next section) is the one that biologists take of biology, and fundamental biology does not change rapidly, so this is the most stable possible representation of the data [3]. Representing the original publication--as is now practiced--is unstable because it ignores new data. There is no robust mechanism for capturing new feature data and corrections into the current GenBank. Following Principle 3, any desired subset (or ``view'') could be created from a natural scheme. For example, if we wish to study all of the cytochromes, we would simply list the species, strain, chromosome (if known) and genetic location of each one. For careful scientific research one would always want this list, so why not make it play an active part in obtaining the data? Such a list of names can be made stable by an international standards committee, and it will remain stable as the data coalesce. In contrast, the current schemes eventually destroy the usefulness of every name list created. Scientists are unable to exchange stable lists or to update each other's lists. This slows down research.

A frequently espoused ``view'' of the data is the sequencer's own scientific contribution. This could easily be created from the merged data by running a simple program. Although we may feel proprietary attachments to the data today, 100 years from now nobody will care who sequenced what. So we may as well begin now to store the data in a natural and objective form. The longer we wait, the more difficult this task will become.

One question that commonly arises is how we should handle polymorphisms, strain differences and mobile genetic elements. The simplest way to do this is to store a single canonical sequence and then to record all the changes necessary to create a particular strain automatically. This appears to raise difficult questions. Which of the possible sequences should it be? Who will decide this? Surprisingly, it doesn't matter. We can overlap all known sequences to generate a canonical collage. By recording the strain information carefully, a program can generate any particular pure strain we want.
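The scheme of one canonical sequence plus recorded changes can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function `make_strain`, the strain names, and the toy sequence are hypothetical; each change is a `(start, end, replacement)` triple in 0-based canonical coordinates with an exclusive end; and the changes for one strain are assumed not to overlap.

```python
def make_strain(canonical, changes):
    """Apply (start, end, replacement) edits given in canonical coordinates."""
    result = canonical
    # Apply from the right so that earlier coordinates are not shifted
    # by insertions or deletions made further along the sequence.
    for start, end, replacement in sorted(changes, reverse=True):
        result = result[:start] + replacement + result[end:]
    return result

canonical = "ATGGCGTTAACC"
STRAIN_CHANGES = {
    "strainA": [],                               # identical to the canonical sequence
    "strainB": [(3, 4, "A"), (8, 8, "TTT")],     # one substitution, one insertion
    "strainC": [(4, 8, "")],                     # one deletion
}
for name, changes in STRAIN_CHANGES.items():
    print(name, make_strain(canonical, changes))
```

Which sequence serves as the canonical one only changes the contents of the change lists, not the strains that can be generated, which is why the choice does not matter.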

A DATABASE DEFINITION



The preliminary definition given below is for the organization of natural sequences. A clean way to handle artificial constructions, which conforms to Principle 1, is to define a language in which one can specify the natural and synthetic sequences which are to be joined together [19]. It would then be possible to store only the instructions for creating the constructs. This would allow the artificial sequences to be created upon demand, so they would not require redundancy or much storage.
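Storing only the instructions for a construct might look like the following Python sketch. The instruction format is invented here for illustration and is not the language of reference [19]: a construct is a list of steps, each either `("natural", name, start, end)` referring to a stored natural sequence (0-based, end exclusive) or `("synthetic", bases)` giving synthetic bases literally. The names `NATURAL` and `build` and the toy sequences are hypothetical.

```python
# Hypothetical store of natural sequences, each kept exactly once.
NATURAL = {"vector": "ATGGCGTTAACCGGTA"}

def build(instructions):
    """Assemble a construct from "natural" and "synthetic" steps."""
    parts = []
    for step in instructions:
        if step[0] == "natural":
            _, name, start, end = step
            parts.append(NATURAL[name][start:end])
        elif step[0] == "synthetic":
            parts.append(step[1])
        else:
            raise ValueError(f"unknown instruction: {step[0]}")
    return "".join(parts)

# Insert a synthetic linker into the natural sequence after base 6.
construct = [
    ("natural", "vector", 0, 6),
    ("synthetic", "GAATTC"),
    ("natural", "vector", 6, 16),
]
print(build(construct))
```

Only the three instruction steps need to be stored for the construct; the natural sequence is never copied.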



Following Principle 8, the remainder of the database should be organized into a nested series of named objects: species, strain, chromosome, genetic locus and individual sequence object or sub-locus. A ``schema'' is a model of the data one wishes to represent in a database [3]. Rather than developing a complete formal schema of a natural organization, Fig. 2 and Fig. 3 show an example of the general concept.


  
Figure 2: Overview of a ``natural'' organization for sequence data.
The database (rectangle) consists of several species (ellipses). Each species contains one or more chromosomes (circles c1, c2, and c3). Individual sequenced pieces of nucleic acid are stored within the chromosome data objects (line segments or squiggles p1, p2 and p3). Strain information could be stored either at the species level (which is likely to be inefficient in storage because it is redundant) or as specific changes at the sequence level (which would be storage-efficient and could use the same format as mutations and sequence conflicts).


  
Figure 3: Detail of species 2, chromosome 1 in Fig. 2.
The region separating sequence piece 1 (line segment p1) and piece 2 (p2) is dashed to indicate that it is not sequenced. Three genetic loci (rectangles) are defined, each containing several sequence objects (arrows o1, o2, and o3). Note that the loci and objects may overlap without causing difficulty in the scope of names, so long as the names and the extents of loci and objects are chosen carefully. For example, locus 2 object 1 (in the unsequenced region) is distinct from locus 3 object 1 (which has been sequenced), while in locus 1, object 3 overlaps object 2. Object orientation is indicated by the direction of the arrows. Objects and loci can span across disconnected pieces.



Sequence Object Definition: A sequence object is a (possibly discontinuous) region of genetic material with distinct properties. According to Principle 2, the data defining an object and associated with that object must be complete. New information about the object must be continuously added to the database to keep it up to date, even if no new sequence information has been obtained.

1.
Every object has a type. A rigidly controlled list of types must be defined. New types are added as new biological features are discovered. The types must form a logical, non-overlapping set of definitions of biological objects. Each type must be defined as part of the written definition and carefully distinguished from related types so that objects are not assigned the wrong type. Having a consistently defined type makes it possible to automatically obtain a comprehensive list of objects for study. Clear, written definitions are absolutely required. A complete list of types and their definitions is beyond the scope of this paper. For the current GenBank definitions see [21].

2.
Every object has a name. For a computer to find and manipulate an object, the object must have a name which is unique within a certain scope. Complete names have four parts, corresponding to the types species, strain, genetic locus and specific sequence object. For example, E. coli, K12, lac, Z refers to the lacZ coding region. (Chromosome names are optional because genetic loci are, by convention, unique to whole genomes.) The species and strain information must be associated with the object, since we must be able to perform manipulations which create chimeric sequences, and the origin of those sequences must be maintained. Names allow one to define genetic locations RELATIVE to an object. This has the great advantage that the position of the object may shift without affecting one's instructions for obtaining the object. For example, after specifying the object mentioned above, we can then say that we are interested in the region from -60 to +40 around the beginning of Z (i.e. the ribosome binding site). This particular instruction has specified the same sequence for many years, and this will remain true even as the entire sequence of E. coli is being completed. In contrast, an absolute position (e.g. 314159) written down by a user of a database subset is useless as soon as sequences are merged because at least part of the sequence must be renumbered.

It is possible to assign each data object an official multi-part name. The most logical and useful starting point for choosing a name for an object is its standard genetic name [14]. Although these names may change as standards become better, they are more stable than anything else. The database defines the international standard names. A list of synonyms for outdated names could be associated with each object. By this means, one may take an old list of objects and use it for the most part in a later version of the database. This would allow scientists to exchange reliable, stable lists that define subsets of the database, and therefore it is extremely important for the future of sequence analysis.

In computer programming languages, such as Pascal [41], names only apply to a certain region of code called the scope of the name. The same concept already applies to genetic names (Fig. 2, Fig. 3). The widely accepted convention is that all genetic locus names are consistent within one species so that each name only refers to one locus. Because the scope is limited to a species, there is no confusion as long as the species is always given as part of the name. Names within a given genetic locus should also have their scope limited to that locus. Thus an intron named ``5'' is acceptable when used within its scope. Until there are scope rules in a database, such simple names cannot be used reliably [21].

The ``note'' in the current GenBank implementation does not satisfy the naming requirements because the data contained within it cannot be obtained precisely with a single straightforward algorithm. The use of notes for naming objects should be completely eliminated. All such notes should be replaced with appropriate biological names.
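The advantage of name-relative locations can be illustrated with a minimal Python sketch. It is not part of the definition: the `ObjectName` class, the `relative_region` helper, the toy sequence and the coordinates are all hypothetical, and the sketch assumes that relative position 0 is the first base of the object on the plus strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectName:
    """Four-part name: species, strain, genetic locus, sequence object."""
    species: str
    strain: str
    locus: str
    obj: str


def relative_region(sequence, zero, rel_from, rel_to):
    """Return the bases from rel_from to rel_to (inclusive) around `zero`.

    `zero` is the 1-based absolute coordinate of the object's first base,
    which this sketch treats as relative position 0.
    """
    lo, hi = zero + rel_from, zero + rel_to
    if lo < 1 or hi > len(sequence):
        raise ValueError("requested region runs off the known sequence")
    return sequence[lo - 1:hi]


lacZ = ObjectName("E. coli", "K12", "lac", "Z")

# Hypothetical toy data: a short fragment and the absolute start of the object.
fragment = "TTTTGGAGGAAACAGCTATGACCATG"
starts = {lacZ: 18}
before = relative_region(fragment, starts[lacZ], -3, 2)

# Merging with 100 upstream bases renumbers every absolute coordinate,
# so the database updates the stored start of the named object.
merged = "N" * 100 + fragment
starts_after_merge = {lacZ: 118}
after = relative_region(merged, starts_after_merge[lacZ], -3, 2)

# The name-relative instruction (-3 to +2 around Z) yields the same bases.
print(before, after)
```

Only the database's own record of where Z begins had to change; the user's instruction, written in terms of the name, did not.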

3.
Every object which represents a change of the sequence has that change recorded in a computer-manipulable format. Without a precise algorithm for how to change the object, programs which perform large statistical analyses of the database cannot be built. Once again, ``note'' fails to satisfy this requirement.

A single scheme can be used to define insertions, deletions and base changes. A set {L, R, S} consisting of two base positions (L, left and R, right, integers with L < R) and a ``substitution sequence'' (S) is defined for each mutation. The operation to generate the mutant sequence has two steps. First, the sequence between but not including L and R is deleted. Second, the substitution sequence S is placed between the original positions corresponding to L and R. Examples:

{3, 4, `A'} inserts an A between positions 3 and 4;

{3, 5, `φ'} deletes base 4 (φ is a symbol designating an empty sequence);

{3, 5, `G'} replaces base 4 with a G;

{3, 20, `GTGC'} replaces bases 4 through 19 with GTGC.

L and R may have values outside the bounds of the sequence being modified to allow for deletions or attachment of new sequence to one or the other end of a sequence. It is possible for S to point to a sequence elsewhere in the database, by giving the full name of the other sequence. This would allow the creation of artificial constructs.
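The two-step operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `apply_mutation` is hypothetical, positions are numbered from 1, the empty sequence is written as an empty string, and the case where S names another sequence in the database is not handled.

```python
def apply_mutation(sequence, left, right, substitution=""):
    """Apply one {L, R, S} change to a sequence numbered from 1.

    Bases strictly between `left` and `right` are deleted, then
    `substitution` is placed between the bases that were at L and R.
    L and R may lie outside the sequence to act on either end.
    """
    if left >= right:
        raise ValueError("L must be less than R")
    head = sequence[:max(left, 0)]        # bases 1..L are kept
    tail = sequence[max(right - 1, 0):]   # bases R..end are kept
    return head + substitution + tail


seq = "ACGTACGTACGTACGTACGTACGT"  # hypothetical 24-base sequence

print(apply_mutation(seq, 3, 4, "A"))      # insert A between bases 3 and 4
print(apply_mutation(seq, 3, 5))           # delete base 4
print(apply_mutation(seq, 3, 5, "G"))      # replace base 4 with G
print(apply_mutation(seq, 3, 20, "GTGC"))  # replace bases 4 through 19
print(apply_mutation(seq, 0, 1, "TT"))     # attach TT to the left end
```

Because every change is a small record of this kind, a program could in principle apply a recorded list of them to a canonical sequence to generate a particular strain.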

4.
No object is ever duplicated. Duplication in the primary database often leads to inconsistency when one of the objects is subsequently corrected but the other is not. It also wastes space. Duplicated objects bias massive statistical analyses of sequences, and may invalidate their conclusions. This is an example of Principle 1.

5.
Every object ALWAYS has a machine-parsable record of the evidence supporting it. Evidence for sequence objects falls into two categories: evidence determined by experimental data (such as footprinting, S1 mapping, etc.) and evidence determined by a computer program using only sequence data. In both cases a list of method names and references must be defined. For programs it is important to record the algorithm, the program name, the version number and a reference. Computer programs that search sequences embody models about what is in sequences and are therefore subjective. Consensus sequences should never be used because they are an extremely poor model [13]. Objects located by their sequence patterns are effectively unsubstantiated hypotheses and should always be considered tentative. They should be removed when experimental data supporting or refuting them are published. Computer search results are useful as predictions, but unless they are identifiable, they interfere with statistical analysis of the database by contaminating data sets. A cleaner solution would be to disallow any objects that do not have experimental evidence.

6.
Every object always has one or more references. These allow one to locate and repeat the original experimental data or program run. The ability to do this would allow anybody to check the database for errors and fraud.

7.
Every object has a natural location. Ultimately, this is given by a chromosome name and positions on the chromosome. Despite the current trend, binding sites are not discrete box-like objects defined by a consensus sequence. This is most clearly demonstrated by sequence logos [39,11]. It is generally misleading to record two outer points as the edges of a binding site because the site may extend well beyond the conventionally accepted consensus sequence or ``box'' [13]. A consensus or ``box'' is merely a poor model of a binding site, not the site itself. Because determining the boundary should be the subject of careful and continuing experimental, information-theoretic or statistical analysis, binding site locations should only be recorded as a ``zero position'' and an orientation. Sites may be either asymmetric or symmetric. Asymmetric sites require orientation information, but dyadic symmetry sites do not. Symmetric sites may be ``odd'', in which case there is a central base, so there is an odd number of bases no matter what the extent of the site is. Only the position of the central base needs to be recorded. By our convention [11], this base is the zero position because it is convenient to label it 0 on aligned listings of odd or asymmetric sites extracted from a database (Fig. 1). Symmetric sites may also be ``even'' and therefore lack a central base. In this case the zero position to be recorded in a database is between two bases, having a position of i + 1/2, where i is an integer. For the purposes of producing aligned listings or sequence logos of even binding sites extracted from a database, we set i=0, so that the center of the symmetry lies between bases 0 and 1.
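The zero-position convention for odd and even sites can be made concrete with a small Python sketch. The `BindingSite` class and its `aligned_label` method are hypothetical names introduced only for illustration; the sketch assumes plus-strand numbering and does not handle reverse-oriented sites.

```python
from dataclasses import dataclass
from fractions import Fraction
from math import floor
from typing import Optional


@dataclass(frozen=True)
class BindingSite:
    zero: Fraction              # integer for odd or asymmetric sites, i + 1/2 for even sites
    orientation: Optional[int]  # +1 or -1 for asymmetric sites, None for symmetric sites

    def is_even(self):
        """True when the zero position falls between two bases."""
        return self.zero.denominator == 2

    def aligned_label(self, position):
        """Label of an absolute base position on an aligned listing.

        For an odd site the central base gets label 0. For an even site
        base i gets label 0 and base i + 1 gets label 1, so the centre of
        symmetry lies between labels 0 and 1.
        """
        return position - floor(self.zero)


odd_site = BindingSite(Fraction(10), None)      # central base at position 10
even_site = BindingSite(Fraction(21, 2), None)  # centre between bases 10 and 11

print(odd_site.aligned_label(10), even_site.aligned_label(10), even_site.aligned_label(11))
```

Storing the half-integer exactly (here as a `Fraction`) keeps the record of an even site distinct from that of an odd site at the neighbouring base.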

8.
Whenever possible, objects should be constructed by using the unique names of smaller objects. For example, exons may end at a splice donor site, the end of the transcript, the end of the actual sequence or the end of the known sequence. Except for the last, these are not distinguished in GenBank, so it is impossible to automatically list all true donor sites by using the end of the exon. Introns, exons, and other long sequence objects should be recorded by naming two other objects such as the donor and acceptor sites. These binding sites are the actual objects recognized by the cellular machinery so they are the fundamental data to be recorded. Naming them allows direct automatic access to the individual sites, in addition to the defined regions.
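Several of the requirements above (a controlled list of types, unique names, no duplication, and machine-parsable evidence with references) can be checked mechanically. The following Python sketch shows one way this might look. Everything in it is an assumption made for illustration: the class names, the tiny type list, the placeholder citation keys, and the choice to treat the pair of name and type as the unique key.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabularies; real lists belong in the written definition.
OBJECT_TYPES = {"exon", "intron", "splice-donor", "splice-acceptor"}
EVIDENCE_KINDS = {"experimental", "computational"}


@dataclass(frozen=True)
class Evidence:
    kind: str       # "experimental" or "computational"
    method: str     # e.g. "S1 mapping", or a program name with its version
    reference: str  # citation key of the supporting publication


@dataclass(frozen=True)
class SequenceObject:
    name: tuple      # (species, strain, locus, object name)
    type: str        # must come from the controlled list
    evidence: tuple  # one or more Evidence records


class ObjectStore:
    def __init__(self):
        self._objects = {}

    def add(self, obj):
        if obj.type not in OBJECT_TYPES:
            raise ValueError(f"undefined type: {obj.type}")
        if not obj.evidence:
            raise ValueError("an object needs at least one evidence record")
        for item in obj.evidence:
            if item.kind not in EVIDENCE_KINDS or not item.reference:
                raise ValueError("evidence needs a known kind and a reference")
        key = (obj.name, obj.type)
        if key in self._objects:
            raise ValueError(f"duplicate object: {key}")
        self._objects[key] = obj

    def of_type(self, type_name, experimental_only=False):
        """List objects of one type, optionally only those with experimental support."""
        found = [o for o in self._objects.values() if o.type == type_name]
        if experimental_only:
            found = [o for o in found
                     if any(e.kind == "experimental" for e in o.evidence)]
        return found


store = ObjectStore()
donor = SequenceObject(
    ("Homo sapiens", "example strain", "agene", "d"), "splice-donor",
    (Evidence("experimental", "S1 mapping", "placeholder-ref-1"),))
predicted = SequenceObject(
    ("Homo sapiens", "example strain", "agene", "d2"), "splice-donor",
    (Evidence("computational", "hypothetical-scanner 1.0", "placeholder-ref-2"),))
store.add(donor)
store.add(predicted)

print(len(store.of_type("splice-donor")))
print(len(store.of_type("splice-donor", experimental_only=True)))
```

With the evidence kind recorded explicitly, a statistical analysis can exclude computer predictions in a single step instead of contaminating its data set.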



Sequence Piece Definition: A sequence (or ``piece'') is a series of nucleotides or amino acids represented by alphabetic symbols. Alignment gaps are allowed, and are symbolized by a dash (-). Each sequence has associated with it:

1.
A type. This indicates the kind of molecule being described. It may be DNA, RNA or protein. It is useful to allow definition of alphabetic (a to z), numeric, and symbolic vectors also, as this allows sequence logos to be defined in subsets of the database [39].

2.
A number of strands. Typical molecules are single or double stranded, but higher numbers are possible.

3.
A topology. The molecule may be linear, circular or repeated. Branched sugars or RNAs will also have to be defined eventually.

4.
A coordinate system. This follows from Principle 3 because when one is working with a partial fragment of a sequence, it is useful to have the original numbering system maintained in order to compare results output from different programs. The Delila System does this [6]. For example, suppose one wanted to determine RNA folded structures for a series of mutant sequences. A flexible coordinate system would maintain most of the numbering despite small insertions or deletions. This would facilitate comparison of the different folds. By Principle 3, the original database should also be capable of handling such complex coordinate systems. When a single sequence is derived from several other sequences, each portion may have its own coordinates. Type, number of strands, topology and coordinates may be listed compactly in the form:
DNA ds C(1 100)(120 150)(15 1)(150 200)
This indicates that a double stranded (ds) circular (C) DNA sequence is being described. The sequence, given 5' to 3', starts at base 1 and proceeds to 100. There is a gap in the numbering (e.g. from a deletion or a chimeric construction), followed by a segment numbered from 120 up to 150, then an inserted segment numbered from 15 down to 1. Finally the circle is closed with sequences running from 150 to 200. With this method, mutant sequences can be created which still have nearly the same numbering as the original sequence. This facilitates comparison between sequences. In this scheme, the coordinate of each base must be defined with two numbers. For example, 2@135 could be a notation for base 135 in the second segment of the example given above.

In nature no two chromosomes are the same, so how can we have a coordinate system? In any particular case we have a specific sequence. It is technically easy to store a canonical sequence and then to store instructions for creating all the polymorphisms and variations observed in nature. Features could be automatically moved to any specific strain sequence created on the fly.
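The compact description given in item 4 is simple enough to parse mechanically. The Python sketch below is an illustration only: the function names are hypothetical, the grammar is inferred from the single example in the text, and the topology letter is stored without interpretation.

```python
import re


def parse_piece(description):
    """Parse a line such as 'DNA ds C(1 100)(120 150)(15 1)(150 200)'."""
    match = re.fullmatch(
        r"\s*(\w+)\s+(\w+)\s+(\w+)\s*((?:\(\s*-?\d+\s+-?\d+\s*\)\s*)+)",
        description)
    if match is None:
        raise ValueError("unrecognized piece description")
    molecule, strands, topology, body = match.groups()
    segments = [(int(a), int(b))
                for a, b in re.findall(r"\(\s*(-?\d+)\s+(-?\d+)\s*\)", body)]
    return molecule, strands, topology, segments


def segment_length(segment):
    first, last = segment
    return abs(last - first) + 1  # segments may be numbered up or down


def ordinal(segments, address):
    """Convert 'k@n' (segment k, coordinate n) to a 1-based position along the piece."""
    k_text, n_text = address.split("@")
    k, n = int(k_text), int(n_text)
    if not 1 <= k <= len(segments):
        raise ValueError("no such segment")
    first, last = segments[k - 1]
    if not min(first, last) <= n <= max(first, last):
        raise ValueError("coordinate is not in that segment")
    preceding = sum(segment_length(s) for s in segments[:k - 1])
    return preceding + abs(n - first) + 1


molecule, strands, topology, segments = parse_piece(
    "DNA ds C(1 100)(120 150)(15 1)(150 200)")
print(segments)
print(sum(segment_length(s) for s in segments))
print(ordinal(segments, "2@135"), ordinal(segments, "4@150"))
```

Coordinate 150 occurs in both the second and the fourth segment of the example, which is why a base needs the two-number form to be addressed unambiguously.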

5.
An alignment. When a set of sequences is extracted and aligned, a single base of each sequence must be designated as the aligned base. By Principle 3, this subset of the database must be in the same form as the original database. The first base of the sequence can be the default alignment point. The primary database would not have alignment gaps.

6.
A species designation list. More than one species may be required to describe artificial constructions. In this case each coordinate segment must be identified.

7.
A map location, in standard genetic coordinates. This includes the orientation of the sequence relative to the chromosome (if known).

8.
A list of associated objects for the sequence. Every object is associated with its originating species (or inorganic synthesis), but there can be a large number of objects on a nucleic acid. To allow programs full flexibility, these objects must include not only introns, exons and coding regions, but also the known points of protein modification, crosslinking and three dimensional structure (Principle 8). This complete viewpoint allows one (for example) to directly extract a subset of the database and then to compare protein structure to intron structure.

9.
References. A list of references for the sequence and objects.



Reference Definition: A reference is a citation to the primary scientific literature. The reference must include: authors, title, journal, volume, pages, year. The name, address, phone, email address and other identifying information of the originator should also be stored when it is available. Because this kind of extra information would allow more rapid use of the database and would tend to keep authors responsible for upgrading their data, it should be made available to the public as part of the database. For implementation, a practical modern format for these data is the BibTeX format since it can be parsed by machine and allows direct typesetting [42].

CONCLUSION

A major goal for a genetic sequence database is to allow us to easily isolate specific sequences for analysis. We would like to be able to write:

species "Homo sapiens"; (* define the species *)
strain "J. D. Watson";  (* define the strain *)
maternal chromosome;    (* define the chromosome set *)
locus gene ADH;         (* a locus whose type is gene, named ADH *)
locus exon 3;           (* a sub-locus of the ADH locus *)
get all locus;          (* by implication the last defined sub-locus would be used *)
and obtain the sequence. This is not possible now, for the many reasons described in this paper. To support instructions like this, we should store named and typed objects with as few absolute coordinates as possible, and the data should be entirely machine-parsable:
locus   type              name   location
-----   ----              ----   -------- 
agene   cap               c      2456 orientation: +
agene   splice-donor      d      4595 orientation: +
agene   splice-acceptor   a      4896 orientation: +
agene   polyA             p      5041 orientation: +
agene   exon              1      cap c, splice-donor d - 1
agene   intron            1      splice-donor d, splice-acceptor a
agene   exon              2      splice-acceptor a + 1, polyA p
agene   mRNA              m      exon 1, exon 2
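As a minimal sketch of how a program might use such a table, the Python below stores absolute coordinates only for the four point-like sites and derives every region from their names. The dictionary layout and function names are hypothetical, and all objects are assumed to be in the + orientation, as in the table.

```python
# Absolute coordinates are stored only for the point-like sites.
sites = {
    ("cap", "c"): 2456,
    ("splice-donor", "d"): 4595,
    ("splice-acceptor", "a"): 4896,
    ("polyA", "p"): 5041,
}

# Each region names two sites, with an offset in bases from each.
regions = {
    ("exon", "1"): (("cap", "c", 0), ("splice-donor", "d", -1)),
    ("intron", "1"): (("splice-donor", "d", 0), ("splice-acceptor", "a", 0)),
    ("exon", "2"): (("splice-acceptor", "a", 1), ("polyA", "p", 0)),
}

# Composite objects name other regions.
composites = {("mRNA", "m"): [("exon", "1"), ("exon", "2")]}


def resolve_region(key, site_table):
    """Return (start, end) of a region by looking up its named endpoints."""
    (type1, name1, offset1), (type2, name2, offset2) = regions[key]
    return site_table[(type1, name1)] + offset1, site_table[(type2, name2)] + offset2


def resolve_composite(key, site_table):
    """Return the list of (start, end) pairs that make up a composite object."""
    return [resolve_region(part, site_table) for part in composites[key]]


print(resolve_composite(("mRNA", "m"), sites))

# If merging renumbers the sequence by 1000 bases, only the four site
# coordinates change; every region and composite follows automatically.
shifted = {key: position + 1000 for key, position in sites.items()}
print(resolve_composite(("mRNA", "m"), shifted))
```

The list of true splice donor sites is then simply the set of entries of type splice-donor, with no need to infer them from exon ends.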

The problems described here led me to propose a basic philosophy and a design for the database. Discussing fundamental design issues, rather than the nitty-gritty of implementation, may seem to be many years behind the current status of the databases, but we must begin to have this level of discourse if the databases are ever to become well designed, comprehensible, stable, and more scientifically valuable. The task before us is monumental. Every molecular biologist has a stake in the future form of the database and we can each play an important role in helping to achieve a richly annotated but clean database.



ACKNOWLEDGEMENTS

I thank Brian Fristensky for suggesting the priority principle for GenBank publication; David George for his rewording of Principles 4 and 5, and for recognizing the relationship between Principles 1 and 7; Pete Rogan for suggesting the editorial review system in Principle 4 and the medical example discussed in Principle 6; and Sue Aldor for wording in Principle 8. Many useful discussions of these topics on the internet news groups bionet.general, bionet.molbio.bio-matrix and bionet.molbio.genbank would not have been possible without the support of the bionet news groups by David Kristofferson [43]. I thank David Lipman for suggesting the ``taxonomy'' of errors. Stacy Bartram discovered the duplicated accession numbers. I thank Michael J. Cinkosky, Kirill Degtiarenko, Paul N. Hengen, Ingrid Jakobsen, David Lipman, Maureen Madden, Jake Maizel, Jim Ostell, Peter Rogan, Denise Rubens, Kenn Rudd, and Bruce Shapiro for useful discussions and insightful comments on the manuscript.



 
Tom Schneider
2000-02-29