Skip to content

a script which takes a de novo transcriptome assembly [fasta format], a reference genome [fasta file], and RNAseq reads [fastq format] as input, and outputs a gtf file

Notifications You must be signed in to change notification settings

terooatt/fasta-to-gtf

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 

Repository files navigation

fasta-to-gtf

This is a python script which uses a transcriptome assembly in fasta format to generate a gtf file (for use in genome/annotation visualization comparison).

It takes as input:

  1. a de novo transcriptome assembly [fasta format]
  2. a reference genome [fasta format]

It outputs a gtf file (tab separated columns) with the following fields:

  1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. feature - feature type name, e.g. Gene, Variation, Similarity
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A floating point value.
  7. strand - defined as + (forward) or - (reverse).
  8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
  9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

System requirements

  • Linux OS (tested on Fedora/Scientific Linux/Ubuntu)
  • Python version 2.7
  • BLAT version 34

About

a script which takes a de novo transcriptome assembly [fasta format], a reference genome [fasta file], and RNAseq reads [fastq format] as input, and outputs a gtf file

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%