BGGN-213 Bioinformatics Project Guide

The find-a-gene project is an assignment for BGGN-213 requiring students to submit a report in PDF format addressing ten specific questions related to bioinformatics techniques. Key tasks include performing BLAST searches, generating sequence alignments, creating phylogenetic trees, and analyzing protein structures. The preliminary report is due on November 16, 2017, with the final submission required by December 5, 2017, and late submissions will not be accepted.

BGGN-213: FOUNDATIONS OF BIOINFORMATICS

The find-a-gene project assignment

[Link]
Dr. Barry Grant
Nov 2017

Overview:
The find-a-gene project is a required assignment for BGGN-213. You should prepare a written
report in PDF format that has responses to each question labeled [Q1] - [Q10] below. You may
wish to consult the scoring rubric at the end of this document and the example report provided
online.
The objective of this assignment is for you to demonstrate your grasp of database searching,
sequence analysis, structure analysis and the R environment that we have covered in class.

Due Date:
Your responses to questions Q1-Q4 are due at the beginning of class Thursday November
16th (11/16/17). Note that these answers can be obtained very quickly (at best within 10 or 15
minutes), so if you don’t succeed at first, just keep trying.
The complete assignment, including responses to all questions, is due at the beginning of class
Tuesday December 5th (12/5/17). Late responses will not be accepted under any
circumstances.

Submission instructions:
Email your PDF document as an attachment named BGGN213_F17_[yourUCSDname].pdf to
both Ileana (ileenamitra@[Link]) and me (bjgrant@[Link]). For example, my
document would be named BGGN213_F17_bjgrant.pdf
Be sure to include your UCSD email and PID number on the first page of your report.
Email your preliminary report with answers to Q1-Q4 by November 16th so we can determine if
you have found a novel gene. Submit this preliminary report as one document with screen shots
of the results inserted appropriately.
See the demonstration report linked to on the course website for an example of format. I will
email you my decision; proceed with subsequent questions only after we are sure you have
found a novel gene.
For the final report add your results for Q5-Q10 to the preliminary report and send a final
document containing the results for all questions. Please do not send only Q5-Q10 answers as
the final report.

Questions:
[Q1] Tell me the name of a protein you are interested in. Include the species and the accession
number. This can be a human protein or a protein from any other species as long as its function
is known.
If you do not have a favorite protein, select human RBP4 or KIF11. Do not use beta globin as
this is in the worked example report that I provide you with online.

[Q2] Perform a BLAST search against a DNA database, such as a database consisting of
genomic DNA or ESTs. The BLAST server can be at NCBI or elsewhere. Include details of the
BLAST method used, database searched and any limits applied (e.g. Organism).

Also include the output of that BLAST search in your document. If appropriate, change the font
to Courier size 10 so that the results are displayed neatly. You can also screen capture a
BLAST output (e.g. Alt + Print Screen on a PC, or ⌘-Shift-4 on a Mac. On a Mac the pointer
becomes a bullseye. Select the area you wish to capture and release. The image is saved as a
file called Screen Shot [].png in your Desktop directory). It is not necessary to print out all
of the BLAST results if there are many pages.

On the BLAST results, clearly indicate a match that represents a protein sequence,
encoded from some DNA sequence, that is homologous to your query protein. I need to
be able to inspect the pairwise alignment you have selected, including the E value and
score. It should be labeled a "genomic clone" or "mRNA sequence", etc. - but include no
functional annotation.

In general, [Q2] is the most difficult for students because it requires you to have a “feel”
for how to interpret BLAST results. You need to distinguish between a perfect match to
your query (i.e. a sequence that is not “novel”), a near match (something that might be
“novel”, depending on the results of [Q4]), and a non-homologous result.
If you are having trouble finding a novel gene try restricting your search to an organism
that is poorly annotated.

[Q3] Gather information about this “novel” protein. At a minimum, show me the protein
sequence of the “novel” protein as displayed in your BLAST results from [Q2] as FASTA
format (you can copy and paste the aligned sequence subject lines from your BLAST
result page if necessary) or translate your novel DNA sequence using a tool called
EMBOSS Transeq at the EBI. Don’t forget to translate all six reading frames; the ORF
(open reading frame) is likely to be the longest sequence without a stop codon. It may
not start with a methionine if you don’t have the complete coding region. Make sure the
sequence you provide includes a header/subject line and is in traditional FASTA format.
Here, tell me the name of the novel protein, and the species from which it derives. It is
very unlikely (but still definitely possible) that you will find a novel gene from an
organism such as S. cerevisiae, human or mouse, because those genomes have
already been thoroughly annotated. It is more likely that you will discover a new gene in
a genome that is currently being sequenced, such as bacteria or plants or protozoa.
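
The six-frame translation described in [Q3] can be sketched in plain Python. This is only an illustration of the idea, not a replacement for EMBOSS Transeq. It assumes the standard genetic code, and the function names, frame labels ("+1" to "-3") and the toy DNA string are invented for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate all three offsets on both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(frames):
    """Return (frame, peptide) for the longest stretch without a stop codon."""
    best = ("", "")
    for name, protein in frames.items():
        for segment in protein.split("*"):
            if len(segment) > len(best[1]):
                best = (name, segment)
    return best


toy_dna = "ATGGCCAAGTTTTAA"  # made-up sequence for illustration
print(six_frames(toy_dna)["+1"])  # MAKF*
```

The longest stop-free stretch is only a heuristic for the ORF, so compare the chosen frame against the subject lines of your BLAST alignment before using it.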

[Q4] Prove that this gene, and its corresponding protein, are novel. For the purposes of
this project, “novel” is defined as follows. Take the protein sequence (your answer to
[Q3]), and use it as a query in a blastp search of the nr database at NCBI.
• If there is a match with 100% amino acid identity to a protein in the database, from the
same species, then your protein is NOT novel (even if the match is to a protein with a
name such as “unknown”). Someone has already found and annotated this sequence,
and assigned it an accession number.
• If the top match reported has less than 100% identity, then it is likely that your protein
is novel, and you have succeeded.
• If there is a match with 100% identity, but to a different species than the one you
started with, then you have likely succeeded in finding a novel gene.
• If there are no database matches to the original query from [Q1], this indicates that
you have partially succeeded: yes, you may have found a new gene, but no, it is not
actually homologous to the original query. You should probably start over.
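
The four rules above can be written down as a small decision function. This is a sketch of one way to encode them: the function name, argument names and returned strings are assumptions made for this example, and checking homology to the [Q1] query first is an ordering chosen here rather than one stated in the assignment.

```python
def classify_novelty(top_identity_pct, same_species, homologous_to_q1=True):
    """Apply the [Q4] novelty rules to the top blastp hit.

    top_identity_pct  -- percent identity of the top hit (0-100)
    same_species      -- True if that hit is from the species of your sequence
    homologous_to_q1  -- False if nothing matches the original [Q1] query
    """
    if not homologous_to_q1:
        return "start over: not homologous to the original query"
    if top_identity_pct >= 100.0:
        if same_species:
            return "not novel"
        return "likely novel (identical match is from a different species)"
    return "likely novel"


print(classify_novelty(100.0, same_species=True))   # not novel
print(classify_novelty(87.5, same_species=True))    # likely novel
```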

[Q5] Generate a multiple sequence alignment with your novel protein, your original
query protein, and a group of other members of this family from different species. For
this assignment, a multiple sequence alignment typically includes between 5 and 20
proteins, although the exact number is up to you. Include the multiple sequence
alignment in your report. Use Courier font with a size appropriate to fit page width.
Side-note: Indicate your sequence in the alignment by choosing an appropriate name
for each sequence in the input unaligned sequence file (i.e. edit the sequence file so
that the species, or short common, names (rather than accession numbers) display in
the output alignment and in the subsequent answers below). The goal in this step is to
create an interesting alignment for building a phylogenetic tree that illustrates
species divergence.
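
The header renaming suggested in the side-note can be done by hand in a text editor, or with a few lines of Python such as the sketch below. The function name is an assumption, and the accession-like identifiers and sequences in the toy input are placeholders invented for this example, not real database entries.

```python
def rename_fasta_headers(fasta_text, names):
    """Replace each FASTA header with a short display name.

    names maps the first word of a header (usually the accession)
    to the label you want shown in the alignment and tree.
    Headers without an entry in names are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            accession = tokens[0] if tokens else ""
            line = ">" + names.get(accession, line[1:])
        out.append(line)
    return "\n".join(out)


# Placeholder input: identifiers and sequences are made up.
unaligned = ">ACC0001 some long description\nMKVLA\n>ACC0002 another description\nMKILA"
labels = {"ACC0001": "Human", "ACC0002": "Novel_fish"}
print(rename_fasta_headers(unaligned, labels))
```

Short names without spaces tend to survive alignment and tree-building tools better than long descriptions.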

[Q6] Create a phylogenetic tree, using either a parsimony or distance-based approach.
Bootstrapping and tree rooting are optional. Use “simple phylogeny” online from the EBI
or any respected phylogeny program (such as MEGA, PAUP, or Phylip). Paste an image
of your Cladogram or tree output in your report.

[Q7] Generate a sequence identity based heatmap of your aligned sequences using R.
If necessary convert your sequence alignment to the ubiquitous FASTA format (Seaview
can read in clustal format and “Save as” FASTA format for example). Read this FASTA
format alignment into R with the help of functions in the Bio3D package. Calculate a
sequence identity matrix (again using a function within the Bio3D package). Then
generate a heatmap plot and add to your report. Do make sure your labels are visible
and not cut at the figure margins.
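
To make clear what a sequence identity matrix contains, here is a small Python sketch of the calculation. The report itself should use R and Bio3D as asked above. The normalization used here (identical non-gap columns divided by the length of the shorter ungapped sequence) is one common convention and is an assumption; it may differ from what Bio3D computes. The function names and toy alignment are invented for this example.

```python
def pairwise_identity(a, b, gap="-"):
    """Fraction of identical non-gap columns between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    shorter = min(len(a) - a.count(gap), len(b) - b.count(gap))
    return matches / shorter if shorter else 0.0


def identity_matrix(alignment):
    """Return (names, square matrix) for a dict of name -> aligned sequence."""
    names = list(alignment)
    matrix = [
        [pairwise_identity(alignment[r], alignment[c]) for c in names]
        for r in names
    ]
    return names, matrix


# Toy alignment, made up for illustration.
toy = {"seqA": "MKV-L", "seqB": "MKVAL", "seqC": "MRV-L"}
names, matrix = identity_matrix(toy)
for name, row in zip(names, matrix):
    print(name, row)
```

A heatmap is a coloured rendering of this square matrix, with the sequence names as row and column labels.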

[Q8] Using R/Bio3D (or an online blast server if you prefer), search the main protein
structure database for the most similar atomic resolution structures to your aligned
sequences.
List the top 3 unique hits (i.e. not hits representing different chains from the same
structure) along with their Evalue and sequence identity to your query. Please also add
annotation details of these structures. For example include the annotation terms PDB
identifier (structureId), Method used to solve the structure (experimentalTechnique),
resolution (resolution), and source organism (source).
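
Picking the top 3 unique hits means collapsing chains of the same structure before counting. The Python sketch below shows that filtering step on its own. It assumes hit identifiers of the form "ID_chain" and a list of dictionaries with "pdb_id" and "evalue" keys. Both the layout and the values in the toy list are invented for this example and are not real search results.

```python
def top_unique_hits(hits, n=3):
    """Keep the best-scoring hit per structure, ignoring the chain suffix."""
    seen = set()
    unique = []
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        structure = hit["pdb_id"].split("_")[0].upper()
        if structure in seen:
            continue  # another chain of a structure we already kept
        seen.add(structure)
        unique.append(hit)
        if len(unique) == n:
            break
    return unique


# Placeholder hits: identifiers and numbers are made up.
toy_hits = [
    {"pdb_id": "1AAA_A", "evalue": 1e-80, "identity": 62.0},
    {"pdb_id": "1AAA_B", "evalue": 1e-80, "identity": 62.0},
    {"pdb_id": "2BBB_A", "evalue": 1e-75, "identity": 60.5},
    {"pdb_id": "3CCC_A", "evalue": 1e-40, "identity": 41.0},
    {"pdb_id": "4DDD_A", "evalue": 1e-10, "identity": 30.0},
]
for hit in top_unique_hits(toy_hits):
    print(hit["pdb_id"], hit["evalue"], hit["identity"])
```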

HINT: You can use a single sequence from your alignment or generate a consensus
sequence from your alignment using the Bio3D function consensus(). The Bio3D
functions [Link](), [Link]() and [Link]() are likely to be of most relevance
for completing this task. Note that the results of [Link]() contain the hits PDB
identifier (or [Link]) as well as Evalue and identity. The results of [Link]() contain
the other annotation terms noted above.
Note that if your consensus sequence has lots of gap positions then it will be better to
use an original sequence from the alignment for your search of the PDB. In this case
you could choose the sequence with the highest identity to all others in your alignment by
calculating the row-wise maximum from your sequence identity matrix.
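
One way to read this hint is sketched below in Python. Because every sequence is 100% identical to itself, the diagonal of the identity matrix is always the largest value in its row, so this sketch scores each row by its mean identity to the other sequences and picks the highest. That scoring rule, the function name and the toy matrix are assumptions made for this example.

```python
def most_representative(names, matrix):
    """Return the name whose row has the highest mean off-diagonal identity."""
    def mean_off_diagonal(i):
        others = [v for j, v in enumerate(matrix[i]) if j != i]
        return sum(others) / len(others) if others else 0.0

    best = max(range(len(names)), key=mean_off_diagonal)
    return names[best]


# Toy identity matrix, made up for illustration.
toy_names = ["seqA", "seqB", "seqC"]
toy_matrix = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.5],
    [0.8, 0.5, 1.0],
]
print(most_representative(toy_names, toy_matrix))  # seqA
```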

[Q9] Generate a molecular figure of one of your identified PDB structures using VMD.
You can optionally highlight conserved residues that are likely to be functional. Please
use a white or transparent background for your figure (i.e. not the default black).
Based on sequence similarity, how likely is this structure to be similar to your “novel”
protein?

[Q10] Perform a “Target” search of ChEMBL ( [Link] ) with your novel sequence. Are there
any Target Associated Assays and ligand efficiency data reported that may be useful
starting points for exploring potential inhibition of your novel protein?

Scoring Rubric:
[45 total points available]

Q1 (4 points)
Protein name 1
Species 1
Accession number 1
Function known 1

Q2 (6 points)
Blast method 1
Database searched 1
Limits applied 1
Search output list (top hits) 1
Alignment of choice 1
Evalue and other alignment stats 1

Q3 (3 points)
Protein sequence of choice matches Subject above 1
Name in header 1
Species 1

Q4 (3 points)
Blastp output list with identities & Evalue 1
Top alignment shown with alignment statistics 1
Results indicate a “novel” gene found 1

Q5 (3 points)
MSA labeled with useful names 1
MSA trimmed appropriately (i.e. no gap overhangs) 1
Pasted MSA fits report page width (i.e. font, format) 1

Q6 (1 point)
Figure illustrates sequence clustering pattern 1

Q7 (10 points)
Heatmap figure included in report 5
Heatmap is legible (i.e. no labels obscured) 5

Q8 (10 points)
PDB identifiers from multiple species reported 5
Annotation of PDB source, resolution and technique 4
Annotation of Evalue and Sequence Identity 1

Q9 (4 points)
Structure figure provided 2
Uses white background for molecular figure 1
Figure of high resolution (i.e. not just snapshot) 1

Q10 (1 point)
Evidence of ChEMBL searches 1
