Introduction and Biological Databases
Overview
Introduction
What Is Bioinformatics?
Goal
Scope
Applications
Limitations
Introduction to Biological Databases
What Is a Database?
Types of Databases
Biological Databases
Pitfalls of Biological Databases
Information Retrieval from Biological Databases
What is Bioinformatics?
Bioinformatics involves the technology
that uses computers for storage,
retrieval, manipulation, and distribution
of information related to biological
macromolecules
Bioinformatics & Computational Biology
Goal
Better understand a living cell and how it
functions at the molecular level
Scope
The development of computational tools
and databases
The application of these tools and
databases in generating biological
knowledge
Scope
Tools development:
Writing software for sequence,
structural, and functional analysis
Construction and curation of biological
databases
Tools: Used in three areas
Molecular Sequence Analysis
Molecular Structural Analysis
Molecular Functional Analysis
Sequence Analysis
Sequence Alignment
Sequence Database Searching
Motif and Pattern Discovery
Gene and Promoter Finding
Reconstruction of Evolutionary
Relationships
...
Structural Analysis
• Protein and nucleic acid structure
Analysis
Comparison
Classification
Prediction
Functional Analysis
Gene Expression Profiling
Protein-Protein Interaction Prediction
Protein Subcellular Localization
Prediction
Metabolic Pathway Reconstruction
...
Applications
Drug design
Agricultural biotechnology
Forensic DNA analysis
Limitations
Fighting a battle without intelligence is
inefficient and dangerous
Introduction to
Biological Databases
What is a Database?
Types of Databases:
Relational Databases
Object-Oriented Databases
Biological Databases
Primary Databases
Secondary Databases
Databases in Bioinformatics
Sequence databases
Sequence analysis
Functional genomics
Literature databases
Structural databases
Metabolic pathway databases
Specialized databases
Pitfalls of Biological Databases
Errors in Sequence Databases
Redundancy in the Primary Sequence
Databases
False or Incomplete Gene Annotations
Errors in Nucleotide Sequences
sequencing errors
frame-shifts
contamination with sequences from
cloning vectors
Take exceptional care with sequences
produced before the 1990s
Redundancy
repeated submission
identical or overlapping sequences by
the same or different authors
revision of annotations
dumping of expressed sequence tags
(EST) data
poor database management
Bioinformatics Databases
Growing steadily in number
Growing amazingly in size
Specialization
Which genome they contain (mouse, human, all of them)
Which types of information about the genome they contain
Contain information such as
Sequences: of bases and of residues
Structure: 3d conformations of known proteins
Families: Which sets of genes are known to be homologous
Annotations: which processes each gene is involved in
And lots of other information
The definitive source...
• More than 1300 databases
• http://nar.oxfordjournals.org/content/39/suppl_1.toc
DNA Sequence
databases
Main repositories:
GenBank (US)
(http://www.ncbi.nlm.nih.gov/Genbank/index.html)
EMBL (Europe)
(http://www.ebi.ac.uk/embl/)
DDBJ (Japan)
(http://www.ddbj.nig.ac.jp/)
Primary databases
The three repositories exchange data daily, so their DNA sequences are identical
EMBL Database
Number of entries
(current 199,575,971)
Graphs created on 22 November 2010
http://www.ebi.ac.uk/embl/Services/DBStats/
www.ncbi.nlm.nih.gov
ENTREZ
NCBI (USA) National Center for Biotechnology Information
PubMed: the biomedical literature
Nucleotide sequence database (Genbank)
Protein sequence database
http://www.ncbi.nlm.nih.gov/Entrez/
Structure: three-dimensional macromolecular structures
Genome: complete genome assemblies
PopSet: population study data sets
OMIM: Online Mendelian Inheritance in Man
Taxonomy: organisms in GenBank
Books: online books
ProbeSet: gene expression and microarray datasets
3D Domains: domains from Entrez Structure
UniSTS: markers and mapping data
SNP: single nucleotide polymorphisms
CDD: conserved domains
Journals: journals in Entrez
UniGene: gene-oriented clusters of transcript sequences
PMC: full-text digital archive of life sciences journal literature
PubMed is…
• National Library of Medicine's search service
• >20 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via side bar)
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Entrez is a search and retrieval
system
Sequence Databases
Annotated sequence databases
SWISS-PROT, GenBank etc…
Usage: identifying function, retrieving information
Low-annotation sequence databases
EST databases, high-throughput genome sequences
Usage: discovery of new genes
General Protein Databases
SWISS-PROT
– Manually curated
– high-quality annotations, less data
GenPept/TREMBL
– Translated coding sequences from GenBank/EMBL
– Few annotations, more up to date
PIR
– Phylogenetic-based annotations
All 3 now combining efforts to form UniProt (http://www.uniprot.org)
Low-annotation Databases
ESTs (Expressed Sequence Tags)
Low-quality sequences generated by
high-volume sequencing of the 3' or 5' ends of
cDNAs
High-throughput genome sequences
Produced by mass-sequencing of
genomic DNA
Non-redundant Databases
Sequence data only: cannot be browsed, can
only be searched using a sequence
Combine sequences from more than one
database
Examples:
NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)
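The idea behind a non-redundant set can be sketched in a few lines: records pooled from several databases are collapsed so that each distinct sequence is kept once, together with every identifier it was submitted under. The identifiers and sequences below are made up for illustration, and the function name is hypothetical; building the real NR sets involves more than this exact-match merge.

```python
from collections import OrderedDict


def merge_identical(records):
    """Collapse (identifier, sequence) pairs so each distinct sequence appears once.

    Returns a mapping from the normalized sequence to the list of
    identifiers that carried it, in the order they were seen.
    """
    merged = OrderedDict()
    for identifier, sequence in records:
        # Normalize case and drop spaces so trivially different copies match.
        key = sequence.upper().replace(" ", "")
        merged.setdefault(key, []).append(identifier)
    return merged


# Hypothetical entries pooled from three source databases.
pooled = [
    ("swissprot|EX0001", "MKTAYIAKQR"),
    ("genpept|EX0002", "mktayiakqr"),   # same protein, lower case
    ("pdb|1EXA", "GSHMLEDPVA"),
]

nr = merge_identical(pooled)
for sequence, identifiers in nr.items():
    print(sequence, identifiers)
```

Only exact duplicates are merged here; overlapping or near-identical submissions, which the redundancy slide also mentions, would need an alignment-based comparison.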
Sequence & Structure Databases
PDB (Protein Databank)
Stores 3-dimensional atomic coordinates for biological molecules including
protein and nucleic acids
Data obtained by X-ray crystallography, NMR, or computer modelling
http://www.rcsb.org/pdb/
MMDB (Molecular Modelling database)
Over 28,000 3D macromolecular structures, including proteins and
polynucleotides
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
SCOP (Structural Classification of Proteins)
Classification of proteins according to structural and evolutionary relationships
File Formats
GenBank/GB, genbank flatfile format
NBRF format
EMBL, EMBL flatfile format
Swissprot
GCG, single sequence format of GCG software
DNAStrider, for common Mac program
Pearson/Fasta, a common format used by Fasta programs and others
Phylip3.2, sequential format for Phylip programs
Phylip, interleaved format for Phylip programs (v3.3, v3.4)
Plain/Raw, sequence data only (no name, document, numbering)
MSF multi sequence format used by GCG software
PAUP's multiple sequence (NEXUS) format
ASN.1 format used by NCBI
EMBL Format
ID   TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaa
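Every line of an EMBL flat file starts with a two-letter line code (ID, AC, DE, SQ, ...), which makes the format easy to read with a script. The sketch below groups lines by code and checks the base composition on the SQ line against the stated length. The excerpt is copied from the TRBG361 entry; the function name is hypothetical, and this is only an illustration rather than a complete EMBL parser.

```python
from collections import defaultdict

# A few lines from the TRBG361 entry (X56734).
EXCERPT = """\
ID   TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
"""


def group_by_line_code(text):
    """Map each two-letter line code to the list of its line contents."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not content:
            continue  # XX lines are spacers and carry no data
        fields[code].append(content)
    return fields


fields = group_by_line_code(EXCERPT)

# Accession numbers are separated by semicolons on the AC line.
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]

# The SQ line gives the length followed by per-base counts.
parts = fields["SQ"][0].split(";")
length = int(parts[0].split()[1])
counts = {}
for part in parts[1:]:
    if part.strip():
        number, base = part.split()
        counts[base] = int(number)

print(accessions)
print(length, counts, sum(counts.values()) == length)
```

The per-base counts on the SQ line add up to the entry length, which is a cheap sanity check when reading flat files.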
GenBank Format
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQY
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLL
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNG
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTA
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGN
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPAN
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTN
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMD
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTS
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANK
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLS
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNL
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKS
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSME
                     VDFSNKSNVNVGQVKDIHGRIPEML
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaa
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtag
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taat
      241
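A GenBank flat file is keyword-driven in the same way, so individual lines can be picked apart with simple string handling. The sketch below reads the LOCUS line, the BASE COUNT line and the CDS location of the SCU49845 record. The three strings are copied from that record; the function names are hypothetical, and a real parser would also need to handle joined and complemented locations.

```python
import re

# Lines taken from the SCU49845 record (U49845).
LOCUS_LINE = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
BASE_COUNT_LINE = "BASE COUNT     1510 a   1074 c    835 g   1609 t"
CDS_LOCATION = "687..3158"


def parse_locus(line):
    """Split a LOCUS line into its whitespace-separated fields."""
    tokens = line.split()
    return {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule": tokens[4],
        "division": tokens[5],
        "date": tokens[6],
    }


def parse_base_count(line):
    """Return a dict such as {'a': 1510, ...} from a BASE COUNT line."""
    return {base: int(n) for n, base in re.findall(r"(\d+)\s+([acgt])\b", line)}


def span_length(location):
    """Length of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1


locus = parse_locus(LOCUS_LINE)
base_count = parse_base_count(BASE_COUNT_LINE)
cds_length = span_length(CDS_LOCATION)

print(locus)
print(base_count, sum(base_count.values()) == locus["length"])
print(cds_length, cds_length % 3 == 0)
```

The base counts sum to the length on the LOCUS line, and the CDS span is a whole number of codons, as expected for a complete coding sequence.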
Swissprot format
Specialized Sequence Databases
Focus on a specific type of sequences
Sequences are often modified or specially
annotated
Usage depends on the database
Examples:
Ribosomal RNA databases
Immunology databases
Protein domain databases
Pfam (http://www.sanger.ac.uk/Software/Pfam/)
• Collection of multiple sequence alignments and hidden Markov models covering many
common protein domains and families
SMART (a Simple Modular Architecture Research Tool)
• Identification and annotation of genetically mobile domains and the analysis of domain
architectures
• (http://smart.embl-heidelberg.de/help/smart_about.shtml)
CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
• Combines SMART and Pfam databases
• Easier and quicker search
Sequence Motif Databases
Scan Prosite (http://www.expasy.org/prosite)
and PRINTS
(http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
Store conserved motifs occurring in
nucleic acid or protein sequences
Motifs can be stored as consensus
sequences, alignments, or using
statistical representations such as
residue frequency tables
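As an illustration of the last representation, the sketch below builds a residue frequency table from a small set of aligned motif instances and derives a consensus sequence from it. The four peptide strings are made up for the example and the function names are hypothetical; motif databases use curated alignments and more elaborate scoring.

```python
from collections import Counter


def residue_frequency_table(aligned):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    table = []
    for column in zip(*aligned):
        counts = Counter(column)
        table.append({res: n / len(aligned) for res, n in counts.items()})
    return table


def consensus(table):
    """Pick the most frequent residue in each column."""
    return "".join(max(column, key=column.get) for column in table)


# Made-up aligned instances of a four-residue motif.
motif_instances = ["ACVK", "CCVR", "ASVK", "ACVK"]

table = residue_frequency_table(motif_instances)
for position, column in enumerate(table, start=1):
    print(position, column)
print(consensus(table))
```

The consensus string throws away the variability that the frequency table keeps, which is why statistical representations are usually the more sensitive way to search.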
Ribosomal RNA Databases
RDP (Michigan State University, USA)
http://rdp.cme.msu.edu/html/
rRNA database (University of Antwerp,
Belgium)
http://rrna.uia.ac.be/
ribosomal RNA sequences are pre-aligned
according to their secondary structure
Usage: creating data sets for molecular
phylogeny, especially for microbial taxonomy
and identification
Immunological Sequence Databases
The Kabat Database of Sequences of Proteins of
Immunological Interest
www.hgmp.mrc.ac.uk/Bioinformatics/Databases/kabatp-help.html
Sequences are classified according to antigen
specificity, and available in pre-aligned format
The Immunogenetics database (IMGT)
http://imgt.cnusc.fr:8104/
Focuses on immunoglobulins, T-cell receptors and MHC
genes
Genome Databases
Focus on one organism or group of organisms:
Colibase (E. coli and related species)
http://colibase.bham.ac.uk/
GDB (human) http://www.gdb.org/
Flybase (Drosophila) http://flybase.bio.indiana.edu/
WormBase (C. elegans) http://wormbase.org
AtDB (Arabidopsis) http://www.arabidopsis.org
SGD (S. cerevisiae) http://genome-www.stanford.edu/Saccharomyces/
Expression Databases
RNA expression
Results of microarray experiments measuring the change in
specific mRNA content under certain conditions
ArrayExpress (EBI) and GEO (NCBI)
Not user friendly
Proteome databases
2D gel electrophoresis images representing the protein content of a
cell or tissue under specific conditions
SWISS-2DPAGE at http://us.expasy.org/ch2d/
Other Database Types
Literature
MEDLINE (http://ncbi.nlm.nih.gov/PubMed/)
HighWire (http://www.highwire.org)
Variation
dbSNP (http://ncbi.nlm.nih.gov/SNP/)
HGBase (http://hgbase.interactiva.de)
Metabolic pathways
KEGG (http://kegg.genome.ad.jp/kegg/)
WIT (http://wit.mcs.anl.gov/WIT2)
Organisms and nomenclature
Taxonomies (e.g.: http://ncbi.nlm.nih.gov/Taxonomy/ )
Mendel (http://mbclserver.rutgers.edu/CPGN)
Methods for Accessing Data
local installation
screen scraping
BioPerl
FTP sites
Local Installations
SRS
Need to obtain a license from Lion Biosciences
Download data from FTP sites
Ensembl
"framework to organize biology around the
sequences of large genomes"
www.ensembl.org
Screen Scraping
URL spoofing
construction of URLs that replicate the query
html parsing
extraction of results from html pages returned by query
Requirements
html module
knowledge of query mechanism
Method NOT advocated by most data providers
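The two steps above can be sketched with the Python standard library. Everything specific here is an assumption made for illustration: the base URL, the `db` and `term` parameter names, and the layout of the returned page are hypothetical and do not describe any real provider. Because a real page layout can change without notice, scripts of this kind break easily, which is one reason providers discourage the method.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "http://example.org/search"  # hypothetical query endpoint


def build_query_url(term, database):
    """Construct a URL that replicates what the web form would submit."""
    return BASE_URL + "?" + urlencode({"db": database, "term": term})


class TitleCellParser(HTMLParser):
    """Collect the text of every <td class="title"> cell on a results page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


# Stand-in for the HTML a query might return (made-up layout).
SAMPLE_PAGE = """<table>
<tr><td class="id">U49845</td><td class="title">S. cerevisiae AXL2 gene</td></tr>
<tr><td class="id">X56734</td><td class="title">T. repens beta-glucosidase mRNA</td></tr>
</table>"""

url = build_query_url("beta glucosidase", "nucleotide")
parser = TitleCellParser()
parser.feed(SAMPLE_PAGE)

print(url)
print(parser.titles)
```

No network request is made; in practice the page would be fetched from the constructed URL before being parsed.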
BioPerl
BioPerl is a collection of modules that
facilitates the development of Perl scripts
for bioinformatics applications.
www.bioperl.org
SWISSPROT
http://www.ebi.ac.uk/swissprot/
European/Swiss Bioinformatics Institute 1986
Highly accurate, hand curated resource
Aims:
Have a high level of annotation
Often by the people who have been
working with the gene
Have a low level of redundancy
Have a high level of integration with other
databases
TrEMBL
http://www.ebi.ac.uk/trembl/
SWISSPROT’s Big Brother
All genes which have been left out of SWISSPROT
Computer annotated rather than human annotated
PROSITE
http://ca.expasy.org/prosite/
Families of proteins
Can search using regular expressions
Similar to unix commands using wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as:
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
Families exhibit these patterns
So we can search over families
1574 documents about 1308 different patterns
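The example pattern can be translated into an ordinary regular expression and run against a sequence. The converter below is a minimal sketch that handles only the elements used on this slide (residue sets in square brackets, excluded sets in braces, x for any residue, and repeat counts in parentheses); other parts of the PROSITE syntax, such as the < and > anchors, are not covered. The function name and the two test sequences are made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(4)" into its core and optional repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            regex = core                       # single residue or [set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)

# Made-up sequences: the first contains the motif, the second ends it with E.
hit = re.search(regex, "MKAGVLLLLKP")
miss = re.search(regex, "MKAGVLLLLEP")
print(hit.group() if hit else None, miss)
```

The translated expression is `[AC].V.{4}[^ED]`, which reads exactly like the interpretation given on the slide.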
PFAM
http://pfam.sanger.ac.uk/
Maintained by the Sanger Centre (Cambridge)
Protein families aligned using HMMs
Hidden Markov Models (see later lecture)
Given a new sequence
Find families which the sequence might fit into
Sequence Coverage
11912 families
Split into Pfam-A (high quality) and Pfam-B (low quality)
SCOP and CATH
SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
CATH http://www.cathdb.info/
Structural Classification of Proteins
Hierarchically ordered and manually curated
38221 PDB Entries
110800 Domains
CATH
Classification of protein domain structures
124 folds
226 Superfamily
1148 Sequence family
14473 Domain
Using Databases
with the FASTA Format
May need to know the FASTA format
For residue sequences
First line must start with a > sign
First line contains identification information
for gene
Other lines contain the residue sequence
OK to have a ragged right format
Usually OK to have lower case (but check)
Example FASTA Format
Header fields:
gi|121664: GenInfo number, assigned by the NCBI
sp: indicates that SWISS-PROT was the source database
P00435|GSHC_BOVIN: SWISS-PROT accession number and identifier
GLUTATHIONE PEROXIDASE: molecule name
>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
mcaaqrsaaalaaaaprtvyafsarplaggepfnlsslrgkvllienvak
slcgttvrdytqmndlqrrlgprglvvlgfpcnqfghqenakneeilncl
yvrpgggfepnfmlfekcevngekahplfaflrevlptpsddatalmtdp
kfitwspvcrndvswnfekflvgpdgvpvrrysrrfltidiepdietlls
qgasa
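The rules above are simple enough to implement directly. The following sketch reads FASTA text into (header, sequence) pairs and splits an NCBI-style header on the | character, using the glutathione peroxidase record as input. It is written in Python for brevity, and the function names are hypothetical; it is an illustration of the format rather than a replacement for an established parser.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # ragged line lengths are fine
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split 'gi|...|sp|...|NAME description' into fields and description."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


EXAMPLE = """>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
mcaaqrsaaalaaaaprtvyafsarplaggepfnlsslrgkvllienvak
slcgttvrdytqmndlqrrlgprglvvlgfpcnqfghqenakneeilncl
yvrpgggfepnfmlfekcevngekahplfaflrevlptpsddatalmtdp
kfitwspvcrndvswnfekflvgpdgvpvrrysrrfltidiepdietlls
qgasa
"""

records = parse_fasta(EXAMPLE)
header_fields, description = split_ncbi_header(records[0][0])
print(header_fields, description)
print(len(records[0][1]), "residues")
```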
Analyzing Results
Using PERL Scripts
Database servers now do:
Increasingly specific analysis of your results
But you will eventually need to do analysis
Ideal programming language is PERL
Designed to manipulate text and files
Can use it to play around with (manipulate) strings
Will be using it in the coursework
PERL Tutorial
PDB Format
The PDB format consists of a collection of fixed-format records that
describe:
Atomic coordinates,
Chemical and biochemical features
Experimental details of the structure determination
Some structural features such as
Secondary structure assignments,
Hydrogen bonding
Biological assemblies
Active sites
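Because the records are fixed-format, each field sits at known column positions and can be read by slicing the line. The sketch below does this for one ATOM record, which carries the atomic coordinates. The column positions follow the published layout of the ATOM record; the sample line itself, including its coordinate values, is made up for illustration, and the function name is hypothetical.

```python
# A made-up ATOM record, written in pieces so the column ranges are visible.
SAMPLE_ATOM = (
    "ATOM      1  N   MET A   1    "   # columns 1-30: name, serial, atom, residue, chain, number
    "  27.340  24.430   2.614"         # columns 31-54: x, y, z coordinates
    "  1.00  9.67"                     # columns 55-66: occupancy, temperature factor
    "           N"                     # columns 67-78: padding and element symbol
)


def parse_atom_record(line):
    """Extract the main fields of an ATOM record by column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_record(SAMPLE_ATOM)
print(atom)
```

Splitting on whitespace is tempting but unreliable for this format, since adjacent fields can run together when values are wide; slicing by column avoids that.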