Thomas D. Schneider
version = 1.82 of philgen.tex 2000 Feb 29
ABSTRACT
Modern sequence databases have many problems
because they are not carefully defined.
For example,
when one searches for homology with a given sequence,
many duplicate sequences are found in each database
and similar but not identical results are obtained from other databases.
The extra copies of inconsistent information
are slowing down research.
This situation could easily be avoided by removing redundancy in
the databases, but this goal is not a fundamental component
of the database design and has been neglected.
A clear statement of goals for storing genetic
sequence information is required.
To this end,
five documents, in decreasing order of importance, are proposed:
(1) The Philosophy document
defines guiding principles for the design and use of the database.
(2) The Definition document identifies what is to be stored in the database,
following the guidance of the philosophy.
It is machine and computer-language independent.
(3) The Implementation document translates the definition into
computer code which can be run on machines.
(4) Examples of all database objects allow users to test their
database analysis programs.
(5) A Tutorial explains how the database is organized.
A philosophy and the beginnings of a
definition are given in this paper.
Individual researchers can help by:
1. providing correct and complete sequence data;
2. annotating their sequences with experimentally deduced features;
3. providing standard genetic names for all features;
4. updating the annotation as new knowledge is obtained;
5. periodically verifying the accuracy of the data and annotation
in the public database;
6. volunteering to curate data in their area of expertise.
GenBank runs the risk of becoming so much noise. It isn't just a matter of collecting all sequences like some avid bowerbird, researchers should be able to get information OUT of GenBank as well.--Ingrid Jakobsen [1]
Unless the architecture and the nature of the information represented in a database are concisely and comprehensively defined, the database itself will be of limited value.--David George [2]
At the time of writing, data-base technology is widely misunderstood. Its role as the foundation stone of future data processing is often not appreciated. The techniques used in many organizations contain the seeds of immense future difficulties. Data independence is often thrown to the winds. Data organizations in use prevent the data being employed as it should be.--James Martin [3]
INTRODUCTION
The way data are stored in a database affects the kind and quality of the science that can be done with the database. My scientific work has led me to a particular view of how genetic sequence databases can be organized to make them useful for research. Fifteen years ago I began work on E. coli ribosome binding sites by typing into a computer the sequences around the initiation codons of 63 genes. I soon realized that if I wanted to look at different regions relative to the starts, I would have to edit all the sequences by hand. This was likely to lead to many errors [4], and I would have to repeat the editing for each new analysis. To avoid that, I decided to store all the known sequences and then to write a program which could extract the portions I needed for a particular analysis. To tell the program what I wanted, I invented a language for doing the extractions and named it Delila, for DEoxyribonucleic acid LIbrary LAnguage [5,6,7]. In this language one first specifies a sequence region:
organism S.cerevisiae; chromosome III; gene LEU2;
Then one requests particular parts of that sequence:
get from gene begin -30 to gene begin +40;
This will extract the sequence from 30 bases before the first base of the initiation codon to 40 bases after it. This instruction is ``relative'' because it defines a sequence relative to a defined chromosomal location. Alternatively, one can give the absolute coordinates:
get from 90935 -30 to 90935 +40;
or equivalently
get from 90905 to 90975;
Editing instructions is much easier, safer, faster and more reliable than editing sequences directly. In addition, other programs can create or modify a list of instructions, so powerful ways of manipulating large sets of sequences became possible [8]. The Delila program has been enormously useful for many studies. In particular, it formed the basis for the first use of neural nets [9] and information theory [10,11,12] to analyze binding sites. In a recent analysis we manipulated almost 3600 sequences by this method [13], and in unpublished work we have easily worked with 8000 sequence fragments.
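The equivalence between the relative and absolute forms above can be checked with a few lines of Python. This is only an illustrative sketch, not part of Delila: the function name `resolve_get` and the lookup table `FEATURES` are hypothetical, and the coordinate 90935 is simply the anchor value used in the absolute example in the text.

```python
# Hypothetical table mapping a named feature to an absolute coordinate.
# The value 90935 is the anchor used in the absolute "get" example above.
FEATURES = {("S.cerevisiae", "III", "LEU2", "begin"): 90935}


def resolve_get(anchor, from_offset, to_offset):
    """Convert a relative request (anchor plus offsets) to absolute coordinates."""
    return anchor + from_offset, anchor + to_offset


anchor = FEATURES[("S.cerevisiae", "III", "LEU2", "begin")]
start, end = resolve_get(anchor, -30, +40)
print(start, end)  # 90905 90975
```

If the coordinate of the named feature changes in a later database release, only the table entry changes; the relative instruction itself stays the same.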
At about the same time that Delila was being created, the international sequence databases were formed to reduce the worldwide duplication of effort in capturing sequence data. The databases have been highly successful in this endeavor. It would be very useful to write a ``Delila II'' program which could take full advantage of the larger databases. As with Delila I, Delila II could use the names of genetic features to define the sequences of interest. Since the location of many genetic features is well defined, or could be defined by international convention [14], this approach would make the Delila instructions reasonably insensitive to changes in the growing database. Scientists around the world could exchange their lists of sequence instructions, and could build on each other's work.
As described in the next section, the present structure of the GenBank database [15] makes these kinds of powerful sequence manipulations difficult. We are gathering sequences at an exponentially increasing rate, but the locations of genetic objects are not being stored so that they can be retrieved automatically. Because there is no overall idea of how the data could be used, the data are not being stored in a sensible way. Just as we once had a crisis about capturing sequence data, we now are approaching a crisis about compiling and annotating sequences.
There are two basic reasons for these problems. First, there are no published principles guiding the design of the database. Second, there is no published and accepted documentation that defines the entire database. Without a written document, a researcher cannot recognize that the database structure is incorrect. Documentation that does exist is often incomplete and unclear. The documentation usually describes implementation of the database and does not say why the database was built this way. There is no defined set of rules by which the database is run. The content and quality of data varies.
To solve these problems, this paper proposes the adoption of a collection of guiding principles for sequence databases. A system of fundamental or motivating principles is a ``philosophy'' [16]. Without a guiding philosophy we will eventually be unable to use much of the data we are spending millions of dollars to collect [17,18].
A philosophy is not enough. We must use the philosophy to precisely define the contents of the database. We must publish the definition so that the ideas can be widely disseminated and critiqued. The importance of documenting computer programs is widely understood; the same should apply to databases. Examples of such documents are given in the references [5,19,20,21].
I do not merely mean that we need manuals which explain the database to outsiders or describe its current status. Instead, I envision documents equivalent to the constitution and laws of a country. They guide the course of events. They are ``alive'' and continuously debated. They are changed when times change. But if they are not changed, then they are strictly followed. GenBank needs such guidance.
This paper also proposes that we should store genetic sequence information in a way close to that found in nature. This would allow the sequence information to be gathered into huge but well structured and easily understood data objects. Instructions for obtaining those objects would be easy to write, and instruction lists could be exchanged between working scientists. Attaining a clean and organized sequence database will require the concerted efforts of not only all the database staffs to precisely define the data structure, but also those of every molecular biologist who works with sequence data to provide and check the data in the structure. Design and development of advanced database access languages such as Delila II will be blocked until the database is clean.
DATABASE PROBLEMS: a ``Taxonomy'' of Errors
There are many kinds of errors in the GenBank database [22,23,24,25,26,27]. We discuss some of them here, along with examples, to give the reader an idea of the enormous magnitude of the problem and to explain how these problems affect designing and using Delila-like sequence access languages for scientific research.
The sequences are a part of the data set described in reference [13]. The LOCUS name (some of which no longer exist in the database) is followed by the alignment or ``zero'' position (L) and the sequence. Except for the last two sequences, L is also the length of sequence given on the intron side. The set of numbers above the sequences are the positions (relative to the alignment position) written vertically. The data shown are all of the data provided by the authors in GenBank database number 62 (December 1989); dashes indicate missing data. The edge of available sequence data would rarely correlate perfectly with multiples of 5, so we conclude that different authors arbitrarily decided what was important, and did not report all the data which they have. Our statistical analysis showed that the acceptor site extends from position +2 to at least position -25, and perhaps as far as -30 and beyond [13]. Thus the blocks of intentionally missing data affect statistical studies of these sequences.
There are an estimated 1000 or more such cases that the National Center for Biotechnology Information (NCBI, Bethesda, MD) has identified for clean up, now that NCBI is responsible for GenBank (J. Ostell, personal communication). The most common consequence for biologists is multiple ``hits'' obtained during sequence searches. These repeatedly waste a researcher's time and prevent one from easily seeing some important matches because searches always limit the degree of matching. Duplication also blocks surveys of the database because each researcher must comb through perhaps thousands of duplicated and overlapping sequences to obtain a clean set for analysis. Because of name changes, this tedious process must be repeated in its entirety every time a new version of the database appears. Automated cleaning methods may make errors or will miss useful data.
The GenBank database is kept in two parts, the old entries and the new and changed entries. The changes are periodically incorporated into the database. When one asks for sequences by electronic mail, one receives both the old and the new entries. From the user's viewpoint, all recently corrected sequences are duplicated!
When duplication is found, the researcher then must look up the original papers and try to decide which is correct. Typically the researcher will not tell the database staff the results, so every other researcher has to repeat the effort. Clearly all this clerical work cannot be performed if a single researcher wants to manipulate 100,000 sequence fragments!
A severe example of unmerged sequences is the 315357 bp yeast chromosome III [31] (Accession X59720 in GenBank 82.0). 95 entries other than the complete sequence have a source organism Saccharomyces cerevisiae and have ``III'' somewhere else in the entry. Because chromosome location is not consistently recorded, not all of these are on chromosome III. (For example, this set includes subtilisin-like protease III.) There is no way at present to automatically locate just the entries relevant to chromosome III. Conversely, features, such as threonine tRNA, are not recorded in the larger entry. As a result these known annotations are often missed. The authors could have merged all these sequences together to create a single entry.
One consequence of incorrect feature placement is that it prevents one from writing general instructions such as ``get all Saccharomyces cerevisiae exons from beginning to ending''.
misc_feature 3616..3628
/number=3
/note="potential pseudo [4]; putative"
The ``misc_feature'' of the current
implementation is not acceptable by the criterion that it is not precise.
Although it would be best to eliminate them,
this would not allow new data types to be recorded.
I suggest that notes and misc features be used only rarely.
They should also be temporary and eliminated
as soon as possible.
Readers are encouraged to report the errors they find in GenBank and to make sure that the correction appears in the database. The current electronic mail address for error reports is update@ncbi.nlm.nih.gov.
These are difficult problems and we cannot expect them to be solved overnight. One step forward is to clarify our goals so that we can work together toward solutions. The following Philosophy and Definition are proposed as a starting point.
A DATABASE PHILOSOPHY
Each principle strongly affects how one goes about designing and constructing a database, but the detailed design is not given by the principle. Principles are like axioms in geometry, while (ideally) the design and implementation are like theorems built up from the axioms.

This principle implies that all other unofficial copies must be derived automatically. If duplications were allowed, then one duplicate could be modified without update of another. This leads to inconsistencies. Inconsistencies lead directly to errors when the wrong item is deleted or to data loss when an item is deleted which happens to carry non-redundant data [3].
When two authors publish overlapping sequences, a single merged view of the data should be created which contains all features in one place. Discrepancies should be recorded so that each original publication can be reconstructed automatically. In the present system, every scientist must perform the merge, and must locate all the differences because there is no guarantee that a merged view exists. Worse, when a merged sequence has been generated, the original entries are still kept, so the poor researcher must fight even more duplication. The multiplication of this wasted effort is enormous on a world-wide scale. Conversely, funding for curators (people who merge, correct and annotate sequences) is highly economical because a single person can save effort on the part of many people around the world.
A common objection to this proposal is to ask whether one can trust the person who does the merge. This is invalid, not only because a person doing such a merge must be an expert in the relevant sequences, but also because attempting to merge sequence data almost invariably reveals inconsistencies in the sequence data itself. By resolving these conflicts, the data are improved. In the current plans of NCBI, the original fragments are kept and can be used to generate a merged sequence [29]. This allows any researcher to confirm the merge, but also allows one to see it as merged. However, Principle 1 implies that it would be better to have only one copy of the data, and to produce the original unmerged and erroneous reports upon demand. The database would become easier to use because there would be far fewer sequences and they would be more carefully reviewed. Furthermore, if sequences are kept separate and merges are to be performed automatically, the failure mode is unmerged sequences, which is not what most biologists want. Conversely, if sequences are kept merged, computation is only needed when the original sequence is requested. This looks exactly like a Delila extraction (``get sequence of reference 6''), and therefore would be easy to do.

This principle implies that the database should capture biological knowledge so robustly that users will never feel that they must go back to the literature to learn the features of a sequence. 100 years from now nobody will want to open thousands of dusty and cracked journal volumes or scan thousands of feet of microfilm to read the original papers--they will want to look directly in the database, and they will want to be able to automatically identify every type of sequence object.



``This [not only] includes the physical organization of the data themselves ... but also includes providing a well thought out plan concerning how contributing scientists will interact with the database and for developing procedures for ensuring that these efforts are effectively coordinated.'' --David George [2]
When database design is mixed with implementation and the philosophy is neglected it is difficult to know what the overall strategy is or how a database is likely to be represented in the future. A set of easily accessible documents which thoroughly address these issues would solve many problems. These documents should continue to be developed as advances in computer technology and molecular biology are made.
This document has not existed for GenBank, and as a result there is no agreement as to what should be contained in the database and how the data should be organized. This led to inconsistently used feature types and varying degrees of annotation. The entire database must be defined, not just the ``features'' [21].
The following kinds of questions are answered by the implementation document: Should the database be relational (i.e. like a table of numbers) or object oriented (i.e. like a set of nested objects)? Should storage be in flat files (i.e. a linear set of characters) or as structures resident in memory (i.e. always in active computer memory)? Limitations on the lengths of names and other implementation-dependent decisions are described. Syntax may be defined in languages like BNF [32] or ASN.1 [29], but these do not address the philosophy or definitions issues.
A good implementation-free database definition allows easier migration to new computer platforms [6,7,20].
Maintaining a clear distinction between implementation, design and philosophy is critical for the long term usefulness of a database. Implementation of a database without a clearly stated definition and philosophy is a poor engineering practice that led to the problems described in this paper. Just as one draws plans before building a bridge, one should write the philosophy and definition before implementing any computer program. If changes are required, the definition should be altered before the database or code is modified. This was done for the Delila system [5]. This disciplined approach has several healthy consequences:
What can be done given that we already have a database, but don't have these documents? The appropriate response is to step back, set aside all political [33], commercial [34] and other considerations and develop the philosophy and design from a fresh viewpoint. Current database implementation can then be guided toward the new design.
The philosophy, definition and implementation documents each serve a different purpose and should be separate documents. Existing documents mix the highest definitions of database structure and philosophy with the niggling details of what character should go where in a particular data object [20,29,21]. Not only is this confusing to the reader, but it completely obscures the overall strategy. If the document is unclear, it is difficult to recognize the problems associated with it, and therefore difficult to suggest corrections. The absence of a full set of documentation is a disservice to the end user.
Most of the problems with GenBank described earlier in this paper could have been prevented if the documents described above had been created and an extensive series of programs designed to check the integrity of the database had been written and used. Such check programs must be written based on a defining document, not from the current implementation of the database, because that is not definitive. A test of the defining and implementation documents is that they allow anybody to write a program to check the database structure.


``That is if you were the first to clone and sequence a particular gene, the next person who sequences the homologue would find your sequence already in the database, and would be obliged to cite your entry, even if it had not yet been published.''--Brian Fristensky [35]
This principle is actually a variation (or corollary) of Principle 1, in that it gives the highest authority to the unique copy of each data item in the database. Publication of the data in other media has lower authority.
Sequence objects that are not entered into the database should not be considered discoveries, just as an invention is not a patent until it is published by the Patent Office. For example, suppose that a DNA segment has been sequenced and inserted into the database. Two years later someone confirms a predicted protein sequence and publishes it but does not update the database. Others who are interested in studying thousands of genes simultaneously do not have the resources to search for this additional data, and therefore don't use it or are forced to redo the experiments. Credit would not be given to the original work. Unannotated information will be lost or inaccessible in the coming electronic age.
This principle does not deny that the original publication contains valuable information and provides insights into the structure and function of the sequence that is described in the database.
It is well recognized that the biology of other organisms is similar to human biology. The pervasive use of nucleic acids, the central dogma and general metabolism point to the unity of biology, while general physical principles and evolution will undoubtedly apply to biological systems on other planets [36,37,12]. Biology is universal.
This principle has two consequences.
This does not mean that all the work should be done in one place and that there should be only one group of people doing database work. To the contrary, there is so much work to be done that much of it should be spread out to other data centers. But this should be controlled from a single point to avoid duplication of effort [33]. Loose federations of databases, advocated by some [38], would cause intractable inconsistencies. `...the community would be better served by the convenience of ``one-stop shopping'' ' [14].
All data which can be associated with a sequence should be directly accessible in a single unified format and within a single data structure. The profusion of specialized databases of different formats limits the usefulness of computer tools that are applicable to many biological problems [39]. Close linkage between nucleic acid, protein, structural and other databases is only a first step towards a complete and uniform synthesis of molecular biological data.
This is a variation of Principle 1, but applied to whole databases. Derived databases should be automatically created from the common database so that they can be kept up to date with no human intervention (Principle 3). This would ensure that complete and consistent annotation is in the subsets and we will no longer be forced to navigate through a forest of inconsistent databases.
The lack of an internationally accepted standard format means that scientists from different parts of the world have a difficult time exchanging data.
The current implementation of GenBank emphasizes sequences and neglects new information about the sequences--there is no reliable mechanism for capturing information not associated with the original sequence. Information from different scientists must be merged together, yet it must be possible to trace back to the original data in every case. However, being able to trace back is much less important than providing a clean and merged view of the data for scientific research. The justification for creating the sequence database is scientific, not historical, research.
In nature, sequences have natural packages. The primary one is the taxonomic classification focused on species. Thus we should look at a database and see a series of data objects representing species and particular strains. Each species has a set of chromosomes. As sequencing proceeds, the chromosome sequences are filled out, but this view of the data does not change. When the chromosome is finished, there would be a single data object in the database rather than a pile of disconnected data like yeast chromosome III. For example, this is the organization of the Online Mendelian Inheritance in Man catalogue [40].
One common objection to a natural scheme is that it is only one ``view'' on the data. However, the natural viewpoint (as described more fully in the next section), is the one that biologists take of biology, and fundamental biology does not change rapidly so this is the most stable possible representation of the data [3]. Representing the original publication--as is now practiced--is unstable because it ignores new data. There is no robust mechanism for capturing new feature data and corrections into the current GenBank. Following Principle 3, any desired subset (or ``view'') could be created from a natural scheme. For example, if we wish to study all of the cytochromes, we would simply list the species, strain, chromosome (if known) and genetic location of each one. For careful scientific research one would always want this list, so why not make it play an active part in obtaining the data? Such a list of names can be made stable by an international standards committee, and it will remain stable as the data coalesces. In contrast, the current schemes eventually destroy the usefulness of every name list created. Scientists are unable to exchange stable lists or to update each other's lists. This slows down research.
A frequently espoused ``view'' of the data is the sequencer's own scientific contribution. This could easily be created from the merged data by running a simple program. Although we may feel proprietary attachments to the data today, 100 years from now nobody will care who sequenced what. So we may as well begin now to store the data in a natural and objective form. The longer we wait, the more difficult this task will become.
One question that commonly arises is how we should handle polymorphisms, strain differences and mobile genetic elements. The simplest way to do this is to store a single canonical sequence and then to record all the changes necessary to create a particular strain automatically. This appears to raise difficult questions. Which of the possible sequences should it be? Who will decide this? Surprisingly, it doesn't matter. We can overlap all known sequences to generate a canonical collage. By recording the strain information carefully, a program can generate any particular pure strain we want.
A DATABASE DEFINITION
The preliminary
definition given below is for the organization of natural sequences.
A clean way to handle artificial constructions, which conforms to
Principle 1, is
to define a language in which one can
specify the natural and synthetic sequences
which are to be joined together [19].
It would then be possible to store only the instructions for creating
the constructs.
This would allow the artificial sequences to be created
upon demand, so they would not require redundancy or much storage.
Following Principle 8, the remainder of the
database should be organized
into a nested series of named objects:
species, strain, chromosome, genetic locus and individual sequence object
or sub-locus.
A ``schema'' is a model of the data one wishes to represent in a
database [3]. Rather than developing a complete formal schema
of a natural organization,
Fig. 2
and
Fig. 3
show an example of the general concept.
The region separating sequence piece 1 (line segment p1) and piece 2 (p2) is dashed to indicate that it is not sequenced. Three genetic loci (rectangles) are defined, each containing several sequence objects (arrows o1, o2, and o3). Note that the loci and objects may overlap without causing difficulty in the scope of names, so long as the names and the extents of loci and objects are chosen carefully. For example, locus 2 object 1 (in the unsequenced region) is distinct from locus 3 object 1 (which has been sequenced), while in locus 1, object 3 overlaps object 2. Object orientation is indicated by the direction of the arrows. Objects and loci can span across disconnected pieces.
Sequence Object Definition:
A sequence object is a
(possibly discontinuous)
region of genetic material with distinct properties.
According to Principle 2, the data defining an object
and associated with that object must be
complete. New information about the object must be continuously added to the
database to keep it up to date, even if no new sequence information has been
obtained.
It is possible to assign each data object an official multi-part name. The most logical and useful starting point for choosing a name for an object is its standard genetic name [14]. Although these names may change as standards become better, they are more stable than anything else. The database defines the international standard names. A list of synonyms for outdated names could be associated with each object. By this means, one may take an old list of objects and use it for the most part in a later version of the database. This would allow scientists to exchange reliable, stable lists that define subsets of the database, and therefore it is extremely important for the future of sequence analysis.
In computer programming languages, such as Pascal [41], names only apply to a certain region of code called the scope of the name. The same concept already applies to genetic names (Fig. 2, Fig. 3). The widely accepted convention is that all genetic locus names are consistent within one species so that each name only refers to one locus. Because the scope is limited to a species, there is no confusion as long as the species is always given as part of the name. Names within a given genetic locus should also have their scope limited to that locus. Thus an intron named ``5'' is acceptable when used within its scope. Until there are scope rules in a database, such simple names cannot be used reliably [21].
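The scoping rule can be illustrated with a nested lookup. The following Python sketch is an assumption-laden illustration rather than a proposed schema: the store `DATABASE`, the functions `resolve` and `find_unscoped`, the species name and all coordinates are hypothetical, loosely following the loci and objects of the figure.

```python
# Hypothetical nested store: species -> locus -> object name -> (start, end).
# All names and coordinates are invented for illustration.
DATABASE = {
    "species A": {
        "locus 1": {"object 2": (120, 180), "object 3": (160, 220)},
        "locus 2": {"object 1": (300, 340)},
        "locus 3": {"object 1": (410, 470)},
    },
}


def resolve(db, species, locus, name):
    """Look up an object by its full multi-part name."""
    return db[species][locus][name]


def find_unscoped(db, name):
    """List every (species, locus) scope in which a bare object name occurs."""
    return [
        (species, locus)
        for species, loci in db.items()
        for locus, objects in loci.items()
        if name in objects
    ]


# The bare name "object 1" is ambiguous, but each scoped name is unique.
print(find_unscoped(DATABASE, "object 1"))
print(resolve(DATABASE, "species A", "locus 2", "object 1"))
print(resolve(DATABASE, "species A", "locus 3", "object 1"))
```

A short name such as ``5'' or ``object 1'' is therefore safe only when the enclosing species and locus are always supplied with it.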
The ``note'' in the current GenBank implementation does not satisfy the naming requirements because the data contained within it cannot be obtained precisely with a single straightforward algorithm. The use of notes for naming objects should be completely eliminated. All such notes should be replaced with appropriate biological names.
A single scheme can be used to define insertions, deletions and base changes. A set {L, R, S} consisting of two base positions (L, left and R, right, integers with L < R) and a ``substitution sequence'' (S) is defined for each mutation. The operation to generate the mutant sequence has two steps. First, the sequence between but not including L and R is deleted. Second, the substitution sequence S is placed between the original positions corresponding to L and R. Examples:
{3, 4, `A'} inserts an A between positions 3 and 4;
{3, 5, `'} deletes base 4 (here S is the empty sequence);
{3, 5, `G'} replaces base 4 with a G;
{3, 20, `GTGC'} replaces bases 4 through 19 with GTGC.
L and R may have values outside the bounds of the sequence being modified to allow for deletions or attachment of new sequence to one or the other end of a sequence. It is possible for S to point to a sequence elsewhere in the database, by giving the full name of the other sequence. This would allow the creation of artificial constructs.
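The two-step operation can be written down directly. The Python sketch below is one possible reading of the {L, R, S} definition, assuming 1-based base positions and a sequence held as a plain string; the function name `apply_mutation` and the test sequence are hypothetical. Values of L and R outside the sequence are clamped so that new sequence can be attached to either end.

```python
def apply_mutation(seq, left, right, sub):
    """Apply one {L, R, S} change to a sequence with 1-based positions.

    Bases strictly between positions left and right are deleted, and the
    substitution sequence sub is placed between them.
    """
    if not left < right:
        raise ValueError("L must be less than R")
    kept_left = seq[:max(left, 0)]        # bases 1..L
    kept_right = seq[max(right - 1, 0):]  # bases R..end
    return kept_left + sub + kept_right


seq = "ACGTACGT"
print(apply_mutation(seq, 3, 4, "A"))   # ACGATACGT  (insertion)
print(apply_mutation(seq, 3, 5, ""))    # ACGACGT    (deletion of base 4)
print(apply_mutation(seq, 3, 5, "G"))   # ACGGACGT   (base 4 replaced by G)
print(apply_mutation(seq, 0, 1, "TT"))  # TTACGTACGT (attached to the left end)
print(apply_mutation(seq, 8, 9, "CC"))  # ACGTACGTCC (attached to the right end)
```

Because insertion, deletion and replacement share one representation, a program that checks or replays the recorded changes needs only this single operation.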
Sequence Piece Definition:
A sequence (or ``piece'')
is a series of nucleotides or amino acids represented by alphabetic
symbols. Alignment gaps are allowed, and are symbolized by a dash (-). Each
sequence has associated with it:
In nature no two chromosomes are the same, so how can we have a coordinate system? In any particular case we have a specific sequence. It is technically easy to store a canonical sequence and then to store instructions for creating all the polymorphisms and variations observed in nature. Features could be automatically moved to any specific strain sequence created on the fly.
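As a rough illustration of how a strain sequence and its feature coordinates might be generated on the fly, the Python sketch below stores one canonical sequence plus a list of {L, R, S} changes given in canonical coordinates. It is a minimal sketch under stated assumptions: the changes do not overlap, a moved feature does not lie inside a deleted region, and the names `build_strain` and `move_feature` and the example data are hypothetical.

```python
def apply_mutation(seq, left, right, sub):
    """Delete bases strictly between 1-based positions left and right, insert sub."""
    return seq[:max(left, 0)] + sub + seq[max(right - 1, 0):]


def build_strain(canonical, mutations):
    """Create a strain sequence from the canonical sequence.

    Changes are applied from right to left so that the canonical
    coordinates of the remaining changes stay valid.
    """
    strain = canonical
    for left, right, sub in sorted(mutations, reverse=True):
        strain = apply_mutation(strain, left, right, sub)
    return strain


def move_feature(position, mutations):
    """Translate a canonical feature position into strain coordinates."""
    shift = 0
    for left, right, sub in mutations:
        if right <= position:  # the change lies entirely to the left
            shift += len(sub) - (right - left - 1)
    return position + shift


canonical = "AAACCCGGGTTT"
changes = [(3, 4, "T"), (9, 11, "")]  # one insertion and one deletion
strain = build_strain(canonical, changes)
print(strain)                    # AAATCCCGGGTT
print(move_feature(7, changes))  # 8
```

The first G of the canonical sequence (position 7) is found at position 8 of the strain, so a feature recorded once on the canonical sequence can follow the sequence into any strain.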
Reference Definition:
A reference is a citation to the primary scientific literature. The reference
must include: authors, title, journal, volume, pages, year.
The name, address, phone, email address
and other identifying information
of the originator
should also be stored
when it is available. Because
this kind of extra information would allow more
rapid use of the database and would tend to keep authors responsible
for upgrading their data, it should be made available to the public
as part of the database. For implementation, a practical
modern format for these data is the BibTeX format
since it can be parsed by machine and allows direct typesetting
[42].
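A definition of this kind lends itself to automatic checking. The Python sketch below tests a reference record against the required fields listed above; the record layout, the function name `missing_fields` and the example values are hypothetical placeholders and do not describe a real citation or an existing database format.

```python
# Required fields taken from the Reference Definition above.
REQUIRED_FIELDS = ("authors", "title", "journal", "volume", "pages", "year")


def missing_fields(reference):
    """Return the required fields that are absent or empty in a reference."""
    return [name for name in REQUIRED_FIELDS if not reference.get(name)]


# Hypothetical record with placeholder values; the pages field is left out.
record = {
    "authors": "A. Author and B. Author",
    "title": "An example title",
    "journal": "Example Journal",
    "volume": "1",
    "year": "1990",
    "email": "author@example.org",  # optional originator information
}

print(missing_fields(record))  # ['pages']
```

A check program built this way depends only on the definition, not on any particular implementation of the database.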
CONCLUSION
A major goal for a genetic sequence database is to allow us to easily isolate specific sequences for analysis. We would like to be able to write:
species "Homo sapiens"; (* define the species *)
strain "J. D. Watson"; (* define the strain *)
maternal chromosome; (* define the chromosome set *)
locus gene ADH; (* a locus whose type is gene, named ADH *)
locus exon 3; (* a sub-locus of the ADH locus *)
get all locus; (* by implication the last defined sub-locus would be used *)
and obtain the sequence. This is not possible now for many reasons described in this paper. To support instructions like this, we should store named and typed objects with as few absolute coordinates as possible and the data should be entirely machine parsable:
locus  type             name  location
-----  ----             ----  --------
agene  cap              c     2456 orientation: +
agene  splice-donor     d     4595 orientation: +
agene  splice-acceptor  a     4896 orientation: +
agene  polyA            p     5041 orientation: +
agene  exon             1     cap c, splice-donor d - 1
agene  intron           1     splice-donor d, splice-acceptor a
agene  exon             2     splice-acceptor a + 1, polyA p
agene  mRNA             m     exon 1, exon 2
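To show how few absolute coordinates such a table needs, the Python sketch below resolves the derived objects from the four anchored ones. The coordinates are those of the table; the data layout (`ANCHORS`, `RANGES`, `COMPOSITES`) and the function names are hypothetical, one possible encoding among many.

```python
# Only these four objects carry absolute coordinates (taken from the table).
ANCHORS = {"c": 2456, "d": 4595, "a": 4896, "p": 5041}

# Derived objects: (start anchor, start offset, end anchor, end offset).
RANGES = {
    "exon 1": ("c", 0, "d", -1),
    "intron 1": ("d", 0, "a", 0),
    "exon 2": ("a", +1, "p", 0),
}

# Objects built by joining other objects.
COMPOSITES = {"mRNA m": ["exon 1", "exon 2"]}


def resolve_range(name):
    """Return the absolute (start, end) coordinates of a derived object."""
    start_anchor, start_offset, end_anchor, end_offset = RANGES[name]
    return ANCHORS[start_anchor] + start_offset, ANCHORS[end_anchor] + end_offset


def resolve_composite(name):
    """Return the list of (start, end) segments making up a joined object."""
    return [resolve_range(part) for part in COMPOSITES[name]]


print(resolve_range("exon 1"))      # (2456, 4594)
print(resolve_range("intron 1"))    # (4595, 4896)
print(resolve_range("exon 2"))      # (4897, 5041)
print(resolve_composite("mRNA m"))  # [(2456, 4594), (4897, 5041)]
```

If a splice site is later corrected, only its anchor value changes, and every exon, intron and mRNA defined relative to it follows automatically.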
The problems described here led me to propose a basic philosophy and a design for the database. Discussing fundamental design issues, rather than the nitty gritty of implementation, may seem to be many years behind the current status of the databases, but we must begin to have this level of discourse if the databases are ever to become well designed, comprehensible, stable, and more scientifically valuable. The task before us is monumental. Every molecular biologist has a stake in the future form of the database and we can each play an important role in helping to achieve a richly annotated but clean database.
ACKNOWLEDGEMENTS
I thank Brian Fristensky for suggesting the priority principle for GenBank publication; David George, for his rewording on Principles 4 and 5, and for recognizing the relationship between Principles 1 and 7; Pete Rogan for suggesting the editorial review system in Principle 4 and the medical example discussed in Principle 6; and Sue Aldor for wording in Principle 8. Many useful discussions of these topics on the internet news groups bionet.general, bionet.molbio.bio-matrix and bionet.molbio.genbank would not have been possible without the support of the bionet news groups by David Kristofferson [43]. I thank David Lipman for suggesting the ``taxonomy'' of errors. Stacy Bartram discovered the duplicated accession numbers. I thank Michael J. Cinkosky, Kirill Degtiarenko, Paul N. Hengen, Ingrid Jakobsen, David Lipman, Maureen Madden, Jake Maizel, Jim Ostell, Peter Rogan, Denise Rubens, Kenn Rudd, and Bruce Shapiro for useful discussions and insightful comments on the manuscript.