Linux Command Line Exercises For NGS Data Processing
by Umer Zeeshan Ijaz
The purpose of this tutorial is to introduce students to the frequently used tools for NGS analysis as well as
giving experience in writing one-liners. Copy the required files to your current directory, change directory (cd)
to the linuxTutorial folder, and do all the processing inside:
[uzi@quince-srv2 ~/]$ cp -r /home/opt/MScBioinformatics/linuxTutorial .
[uzi@quince-srv2 ~/]$ cd linuxTutorial
[uzi@quince-srv2 ~/linuxTutorial]$
I have deliberately chosen Awk in the exercises as it is a language in itself and is used more often to
manipulate NGS data as compared to the other command line tools such as grep, sed, perl etc.
Furthermore, having a good command of awk will make it easier to understand advanced tutorials such as
Illumina Amplicons Processing Workflow.
In Linux, we use a shell, a program that takes your commands from the keyboard and gives them to
the operating system. Most Linux systems utilize the Bourne Again SHell (bash), but there are several additional
shell programs on a typical Linux system such as ksh, tcsh, and zsh. To see which shell you are using, type
[uzi@quince-srv2 ~/linuxTutorial]$ echo $SHELL
/bin/bash
To see where you are in the file system:
[uzi@quince-srv2 ~/linuxTutorial]$ pwd
/home/uzi/linuxTutorial
List the files in the current directory:
[uzi@quince-srv2 ~/linuxTutorial]$ ls
data
Now try different commands from the sheet given below:
Linux Commands Cheat Sheet
File System
ls                      list items in current directory
ls -l                   list items in current directory and show in long format to see permissions, size,
                        and modification date
ls -a                   list all items in current directory, including hidden files
ls -F                   list all items in current directory and show directories with a slash and
                        executables with a star
ls dir                  list all items in directory dir
cd dir                  change directory to dir
cd ..                   go up one directory
cd /                    go to the root directory
cd ~                    go to your home directory
cd -                    go to the last directory you were just in
pwd                     show present working directory
mkdir dir               make directory dir
rm file                 remove file
rm -r dir               remove directory dir recursively
cp file1 file2          copy file1 to file2
cp -r dir1 dir2         copy directory dir1 to dir2 recursively
mv file1 file2          move (rename) file1 to file2
ln -s file link         create symbolic link to file
touch file              create or update file
cat file                output the contents of file
less file               view file with page navigation
head file               output the first 10 lines of file
tail file               output the last 10 lines of file
tail -f file            output the contents of file as it grows, starting with the last 10 lines
vim file                edit file
alias name='command'    create an alias for a command
System
shutdown                shut down machine
reboot                  restart machine
date                    show the current date and time
whoami                  who you are logged in as
finger user             display information about user
man command             show the manual for command
df                      show disk usage
du                      show directory space usage
free                    show memory and swap usage
whereis app             show possible locations of app
which app               show which app will be run by default
Process Management
ps                      display your currently active processes
top                     display all running processes
kill pid                kill process id pid
kill -9 pid             force kill process id pid
Permissions
ls -l                   list items in current directory and show permissions
chmod ugo file          change permissions of file to ugo; u is the user's permissions, g is the group's
                        permissions, and o is everyone else's permissions. The values of u, g, and o can be
                        any number between 0 and 7.
                        7  full permissions
                        6  read and write only
                        5  read and execute only
                        4  read only
                        3  write and execute only
                        2  write only
                        1  execute only
                        0  no permissions
Networking
wget file               download a file
curl file               download a file
scp user@host:file dir  secure copy a file from remote server to the dir directory on your machine
scp file user@host:dir  secure copy a file from your machine to the dir directory on a remote server
scp -r user@host:dir dir  secure copy the directory dir from remote server to the directory dir on
                        your machine
ssh user@host           connect to host as user
ssh -p port user@host   connect to host on port as user
ssh-copy-id user@host   add your key to host for user to enable a keyed or passwordless login
ping host               ping host and output results
whois domain            get information for domain
dig domain              get DNS information for domain
dig -x host             reverse lookup host
lsof -i tcp:1337        list all processes running on port 1337
Searching
grep pattern files      search for pattern in files
grep -r pattern dir     search recursively for pattern in dir
grep -rn pattern dir    search recursively for pattern in dir and show the line number found
grep -r pattern dir --include='*.ext'  search recursively for pattern in dir and only search in
                        files with .ext extension
command | grep pattern  search for pattern in the output of command
find dir -name file     find all instances of file under dir by walking the real file system
locate file             find all instances of file using the indexed database built by the updatedb
                        command. Much faster than find
sed -i 's/day/night/g' file  find all occurrences of day in a file and replace them with night;
                        s means substitute and g means global; sed also supports regular expressions
Compression
tar cf file.tar files   create a tar named file.tar containing files
tar xf file.tar         extract the files from file.tar
tar czf file.tar.gz files  create a tar with Gzip compression
tar xzf file.tar.gz     extract a tar using Gzip
gzip file               compresses file and renames it to file.gz
gzip -d file.gz         decompresses file.gz back to file
Shortcuts
ctrl+a                  move cursor to beginning of line
ctrl+e                  move cursor to end of line
alt+f                   move cursor forward 1 word
alt+b                   move cursor backward 1 word
Reference: http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
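Each permission digit in the cheat sheet is the sum of read (4), write (2) and execute (1). As a minimal sketch of that arithmetic, the function below turns a three-character string such as rw- into its digit. perm_digit is a hypothetical helper written for this illustration, not a standard command.

```shell
# perm_digit: hypothetical helper that converts a string like "r-x" into its
# octal digit by adding 4 for read, 2 for write, and 1 for execute.
perm_digit() {
    d=0
    case $1 in r??) d=$((d + 4)) ;; esac
    case $1 in ?w?) d=$((d + 2)) ;; esac
    case $1 in ??x) d=$((d + 1)) ;; esac
    echo "$d"
}

# rwx for the user, r-x for the group, r-- for everyone else
echo "$(perm_digit rwx)$(perm_digit r-x)$(perm_digit r--)"
```

The three digits printed here are the value you would pass to chmod as the ugo argument.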
Exercise 1: Extracting reads from a FASTA file based on supplied IDs
Awk is a programming language which allows easy manipulation of structured data and is mostly used for
pattern scanning and processing. It searches one or more files to see if they contain lines that match the
specified patterns and then performs the associated actions. The basic syntax is:
awk '/pattern1/ {Actions}
/pattern2/ {Actions}' file
Awk works as follows:
Awk reads the input files one line at a time.
For each line, it tries the given patterns in the given order; if a pattern matches, it performs the
corresponding action.
If no pattern matches, no action is performed.
In the above syntax, either the search pattern or the action is optional, but not both.
If the search pattern is not given, Awk performs the given actions for each line of the input.
If the action is not given, Awk prints all lines that match the given patterns, which is the default action.
Empty braces without any action do nothing. They won't perform the default printing operation.
Each statement in Actions should be delimited by a semicolon.
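The rules above can be checked on a tiny file. This is a sketch on made-up data (the ids and sequences are placeholders, not the tutorial files), with one awk call per rule:

```shell
# Build a small two-column file so the example does not depend on the tutorial data.
tmp=$(mktemp)
printf 'id1\tAAA\nid2\tCCC\nid3\tGGG\n' > "$tmp"

# Pattern only: the default action prints the matching line.
pattern_only=$(awk '/id2/' "$tmp")

# Action only: the action runs for every line.
action_only=$(awk '{print $1}' "$tmp")

# Empty braces: the line matches, but nothing is printed.
empty_action=$(awk '/id2/ {}' "$tmp")

# Two statements in one action, separated by a semicolon.
two_statements=$(awk '/id3/ {n = length($2); print $1 "," n}' "$tmp")

printf '%s\n' "$pattern_only" "$action_only" "[$empty_action]" "$two_statements"
rm -f "$tmp"
```

The empty brackets in the output show that the empty action suppressed the default print.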
Say you have data/test.tsv with the following contents:
$ cat data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
By default, Awk prints every line from the file.
$ awk '{print;}' data/test.tsv
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
We print the line which matches the pattern blah_C3:
$ awk '/blah_C3/' data/test.tsv
blah_C3 ACTTATATATATATA
Awk has a number of built-in variables. For each record, i.e. line, it splits the record on whitespace
by default and stores the pieces in the $n variables. If the line has 5 words, they will be stored in $1, $2,
$3, $4 and $5. $0 represents the whole line. NF is a built-in variable which represents the total number of
fields in a record.
$ awk '{print $1","$2;}' data/test.tsv
blah_C1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2,ACTTTATATATT
blah_C3,ACTTATATATATATA
blah_C4,ACTTATATATATATA
blah_C5,ACTTTATATATT
$ awk '{print $1","$NF;}' data/test.tsv
blah_C1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2,ACTTTATATATT
blah_C3,ACTTATATATATATA
blah_C4,ACTTATATATATATA
blah_C5,ACTTTATATATT
Awk has two important patterns which are specified by the keywords BEGIN and END. The syntax is as
follows:
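As a minimal sketch of the usual layout (the input lines here are made up for illustration), the BEGIN block runs once before any input is read, the unlabelled block runs for every line, and the END block runs once after the last line:

```shell
# BEGIN prints a header, the middle block handles each line, END prints a summary.
summary=$(printf 'a\tACGT\nb\tAC\nc\tACGTAC\n' | awk '
    BEGIN { print "id,length" }          # runs before the first line is read
    { total += length($2); print $1 "," length($2) }
    END { print "total," total }         # runs after the last line
')
echo "$summary"
```

Variables such as total keep their value across lines, which is why the END block can report a running sum.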
We can also skip executing later blocks for a given line by using the next keyword:
$ awk '{printf $1"\t"}NR==3{print "";next}{print $1}' data/test.tsv
blah_C1 blah_C1
blah_C2 blah_C2
blah_C3
blah_C4 blah_C4
blah_C5 blah_C5
$ awk 'NR==3{print "";next}{printf $1"\t"}{print $1}' data/test.tsv
blah_C1 blah_C1
blah_C2 blah_C2
blah_C4 blah_C4
blah_C5 blah_C5
You can also use getline to load the contents of another file in addition to the one you are reading. For
example, in the statement given below, the while loop will load each line from test.tsv into k until no
more lines are to be read:
$ awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}' data/test.tsv
BEGIN:blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
BEGIN:blah_C2	ACTTTATATATT
BEGIN:blah_C3	ACTTATATATATATA
BEGIN:blah_C4	ACTTATATATATATA
BEGIN:blah_C5	ACTTTATATATT
blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2	ACTTTATATATT
blah_C3	ACTTATATATATATA
blah_C4	ACTTATATATATATA
blah_C5	ACTTTATATATT
You can also store data in memory with the syntax VARIABLE_NAME[KEY]=VALUE, which you can later
use through the for (INDEX in VARIABLE_NAME) command:
$ awk '{i[$1]=1}END{for (j in i) print j"<="i[j]}' data/test.tsv
blah_C1<=1
blah_C2<=1
blah_C3<=1
blah_C4<=1
blah_C5<=1
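The same array syntax can also count things. As a sketch, the hypothetical helper count_sequences below (not part of the tutorial files) uses the sequence as the key and increments its value once per line, so identical sequences are tallied together. Because awk does not guarantee the order of a for (key in array) loop, the result is piped through sort.

```shell
# count_sequences: hypothetical helper; reads ID<TAB>sequence lines from stdin
# and prints each distinct sequence with the number of IDs that carry it.
count_sequences() {
    awk -F"\t" '{n[$2]++} END{for (s in n) print s "\t" n[s]}' | sort
}

printf 'blah_C2\tACTTTATATATT\nblah_C3\tACTTATATATATATA\nblah_C4\tACTTATATATATATA\nblah_C5\tACTTTATATATT\n' | count_sequences
```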
Given all that you have learned so far, we are going to extract reads from a FASTA file based on IDs supplied
in a file. Say we are given a FASTA file with the following contents:
[uzi@quince-srv2 ~/linuxTutorial]$ cat data/test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
and an IDs file:
[uzi@quince-srv2 ~/linuxTutorial]$ cat data/IDs.txt
blah_C4
blah_C5
After looking at the file, it is immediately clear that the sequences may span multiple lines (for example, for
blah_C1). If we want to match an ID, we can first linearize the file by using the conditional operator as
discussed above to have the delimited information of each sequence in one line, and then make logic to
perform further functionality on each line later. Our logic is that for lines that contain header information
(/^>/) we can do something differently, and for other lines we use printf to remove the newline character:
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{printf /^>/ ? $0 : $0}' data/test.fa
>blah_C1ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>blah_C2ACTTTATATATT>blah_C3ACTTATATAT
ATATA>blah_C4ACTTATATATATATA>blah_C5ACTTTATATATT
We can then put each sequence on a separate line and also put a tab character ("\t") between the header
and the sequence:
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{printf /^>/ ? "\n"$0 : $0}' data/test.fa

>blah_C1ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2ACTTTATATATT
>blah_C3ACTTATATATATATA
>blah_C4ACTTATATATATATA
>blah_C5ACTTTATATATT
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{printf /^>/ ? "\n"$0"\t" : $0}' data/test.fa

>blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2	ACTTTATATATT
>blah_C3	ACTTATATATATATA
>blah_C4	ACTTATATATATATA
>blah_C5	ACTTTATATATT
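We can then use an NR==1 block to stop printing a newline character before the first header (as you can see,
there is an empty line) and use next to ignore the later block:
[uzi@quince-srv2 ~/linuxTutorial]$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' data/test.fa
>blah_C1	ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2	ACTTTATATATT
>blah_C3	ACTTATATATATATA
>blah_C4	ACTTATATATATATA
>blah_C5	ACTTTATATATT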
We can then pipe this stream to another awk statement using "\t" as a delimiter (which you can specify
using -F) and use gsub to remove > from the start of each line since our IDs file doesn't contain that
character:
[uzi@quince-srv2 ~/linuxTutorial]$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' data/test.fa | awk -F"\t" '{gsub("^>","",$0);print $0}'
blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
blah_C2 ACTTTATATATT
blah_C3 ACTTATATATATATA
blah_C4 ACTTATATATATATA
blah_C5 ACTTTATATATT
Now we load the IDs.txt file in the BEGIN block, store the IDs in memory, and in the stream, if the first
field ($1) matches an ID stored in memory, we output the formatted record:
[uzi@quince-srv2 ~/linuxTutorial]$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' data/test.fa | awk -F"\t" 'BEGIN{while((getline k < "data/IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
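For comparison, the same extraction can be sketched in a single awk call. This swaps in a different technique from the pipeline above: instead of getline it uses the common NR==FNR idiom, which is true only while awk reads the first of two files. extract_by_id is a hypothetical helper name, and the two temporary files are small stand-ins for data/IDs.txt and data/test.fa.

```shell
# extract_by_id: hypothetical helper. Usage: extract_by_id IDS_FILE FASTA_FILE
# While awk reads the first file (NR==FNR) it only stores the IDs; for the
# second file it switches printing on or off at every header line.
extract_by_id() {
    awk 'NR==FNR {want[$1]=1; next}
         /^>/    {keep = (substr($1, 2) in want)}
         keep' "$1" "$2"
}

# Small stand-ins for data/IDs.txt and data/test.fa
ids=$(mktemp); fa=$(mktemp)
printf 'blah_C1\nblah_C5\n' > "$ids"
printf '>blah_C1\nACTG\nTTGA\n>blah_C2\nACTT\n>blah_C5\nACTTTA\n' > "$fa"
extract_by_id "$ids" "$fa"
rm -f "$ids" "$fa"
```

Unlike the linearizing version, this keeps the original line wrapping of each sequence. It also assumes the IDs file is not empty, since NR==FNR cannot tell the two files apart otherwise.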
With Bioawk it is much simpler, as you don't have to linearize the FASTA file: the record boundaries are the
complete sequence boundaries and not lines:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -cfastx 'BEGIN{while((getline k <"data/IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' data/test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT
Bioawk can also take other input formats that you specify with -c, with the field names as follows (you can
use the two names in each column pair interchangeably):
bed: $1:$chrom $2:$start $3:$end $4:$name $5:$score $6:$strand $7:$thickstart $8:$thickend $9:$rgb $10:$blockcount $11:$blocksizes $12:$blockstarts
sam: $1:$qname $2:$flag $3:$rname $4:$pos $5:$mapq $6:$cigar $7:$rnext $8:$pnext $9:$tlen $10:$seq $11:$qual
Exercise 2: Alignment Statistics for Metagenomics/Population Genomics
For this exercise we will use a C. difficile Ribotype 078 reference database that comprises 61 contigs. Even
though it is a single genome for which we have obtained the samples, the workflow given below remains
similar for metagenomic samples when you have complete genomes instead of contigs in the reference
database (and so I use the nomenclature: genomes/contigs). Before we analyze our samples, we can do some
quality control checks on our raw sequences using FastQC. Running the following command will generate a
M120_S2_L001_R1_001_fastqc folder with an html page fastqc_report.html inside. You can load it
up in your browser to assess your data through graphs and summary tables.
[uzi@quince-srv2 ~/linuxTutorial]$ fastqc data/M120_*R1*.fastq
Started analysis of M120_S2_L001_R1_001.fastq
Approx 5% complete for M120_S2_L001_R1_001.fastq
Approx 10% complete for M120_S2_L001_R1_001.fastq
Approx 15% complete for M120_S2_L001_R1_001.fastq
Approx 20% complete for M120_S2_L001_R1_001.fastq
Approx 25% complete for M120_S2_L001_R1_001.fastq
Approx 30% complete for M120_S2_L001_R1_001.fastq
Approx 35% complete for M120_S2_L001_R1_001.fastq
Approx 40% complete for M120_S2_L001_R1_001.fastq
Approx 45% complete for M120_S2_L001_R1_001.fastq
Approx 50% complete for M120_S2_L001_R1_001.fastq
Approx 55% complete for M120_S2_L001_R1_001.fastq
Approx 60% complete for M120_S2_L001_R1_001.fastq
Approx 65% complete for M120_S2_L001_R1_001.fastq
Approx 70% complete for M120_S2_L001_R1_001.fastq
Approx 75% complete for M120_S2_L001_R1_001.fastq
Approx 80% complete for M120_S2_L001_R1_001.fastq
Approx 85% complete for M120_S2_L001_R1_001.fastq
Approx 90% complete for M120_S2_L001_R1_001.fastq
Approx 95% complete for M120_S2_L001_R1_001.fastq
Approx 100% complete for M120_S2_L001_R1_001.fastq
Analysis complete for M120_S2_L001_R1_001.fastq
[uzi@quince-srv2 ~/linuxTutorial]$
For example, here is the file generated for the above M120_S2_L001_R1_001.fastq file:
Alternatively, you can also try my Shell utilities for QC as well as Shell wrappers for EMBOSS utilities.
Next we index our reference database file. Indexing speeds up alignment, allowing the aligner to quickly find
short, near-exact matches to use as seeds for subsequent full alignments.
[uzi@quince-srv2 ~/linuxTutorial/data]$ bwa index Cdiff078.fa
Use BWA-MEM to align paired-end sequences. Briefly, the algorithm works by seeding alignments with
maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm
(SW). From the BWA doc, it is suggested that for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads,
assembly contigs and BAC sequences, BWA-MEM is usually the preferred algorithm. For short
sequences, BWA-backtrack may be better. BWA-SW may have better sensitivity when alignment gaps
are frequent.
[uzi@quince-srv2 ~/linuxTutorial]$ bwa mem data/Cdiff078.fa data/M120_*R1*.fastq data/M120_*R2*.fastq > aln-pe.sam
We have generated a SAM file (aln-pe.sam) which consists of two types of lines: headers and alignments.
Headers begin with @, and provide metadata regarding the entire alignment file. Alignments begin with any
character except @, and describe a single alignment of a sequence read against the reference genome. Note
that each read in a FASTQ file may align to multiple regions within a reference genome, and an individual read
can therefore result in multiple alignments. In the SAM format, each of these alignments is reported on a
separate line. Also, each alignment has 11 mandatory fields, followed by a variable number of optional fields.
Each of the fields is described in the table below:
Col  Field  Description
1    QNAME  Query template/pair NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition/coordinate of clipped sequence
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  extended CIGAR string
7    MRNM   Mate Reference sequence NaMe (= if same as RNAME)
8    MPOS   1-based Mate POSition
9    TLEN   inferred Template LENgth (insert size)
10   SEQ    query SEQuence on the same strand as the reference
11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
where FLAG is defined as:
Flag    Chr  Description
0x0001  p    the read is paired in sequencing
0x0002  P    the read is mapped in a proper pair
0x0004  u    the query sequence itself is unmapped
0x0008  U    the mate is unmapped
0x0010  r    strand of the query (1 for reverse)
0x0020  R    strand of the mate
0x0040  1    the read is the first read in a pair
0x0080  2    the read is the second read in a pair
0x0100  s    the alignment is not primary
0x0200  f    the read fails platform/vendor quality checks
0x0400  d    the read is either a PCR or an optical duplicate
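A decimal FLAG value is the sum of the bits in the table above, so each bit can be tested with a bitwise AND. The sketch below does this with shell arithmetic. decode_flag and the short labels it prints are hypothetical names chosen for this illustration; the decimal masks are the hexadecimal values from the table.

```shell
# decode_flag: hypothetical helper that lists which FLAG bits are set in a
# decimal value. Masks are the table values in decimal (0x0001=1 ... 0x0400=1024).
decode_flag() {
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped 16:reverse \
                32:mate_reverse 64:first_in_pair 128:second_in_pair \
                256:not_primary 512:qc_fail 1024:duplicate; do
        mask=${pair%%:*}
        name=${pair#*:}
        # A bit is set when the bitwise AND with its mask is non-zero.
        if [ $(( $1 & mask )) -ne 0 ]; then
            echo "$name"
        fi
    done
}

decode_flag 77   # 77 = 64 + 8 + 4 + 1
```

For 77 this lists the paired, unmapped, mate-unmapped and first-in-pair bits, i.e. the first read of a pair in which neither read aligned.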
Since the flags are given in decimal representation in the SAM file, you can use this link to check which flag is
set. We are going to use SAMtools, which provides various tools for manipulating alignments in the SAM/BAM
format. The SAM (Sequence Alignment/Map) format (BAM is just the binary form of SAM) is currently the de
facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput
metagenomic whole-genome shotgun sequencing data, you will have to deal with SAM/BAM files. See what
SAMtools has to offer:
We can then use a program SAMstat to get statistics on our aln-pe.sam file:
[uzi@quince-srv2 ~/linuxTutorial]$ samstat aln-pe.sam
Running the above code will generate an aln-pe.sam.samstat.html file which you can open in your browser (be
patient, it takes a bit of time to load). Plots such as "Reads Length Distributions" and "Base Quality
Distributions" may be of interest to you:
Now we convert the SAM file to the binary BAM file:
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view -h -b -S aln-pe.sam > aln-pe.bam
Extract only those sequences that were mapped against the reference database. Use the -F 4 switch.
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view -b -F 4 aln-pe.bam > aln-pe.mapped.bam
Generate a file lengths.genome that contains two entries per row: genome identifier and the corresponding
genome length:
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view -H aln-pe.mapped.bam | perl -ne 'if ($_ =~ m/^\@SQ/) { print $_ }' | perl -ne 'if ($_ =~ m/SN:(.+)\s+LN:(\d+)/) { print $1, "\t", $2, "\n"}' > lengths.genome
[uzi@quince-srv2 ~/linuxTutorial]$ cat lengths.genome
Cdiff078_C01	9165
Cdiff078_C02	93786
Cdiff078_C03	752
Cdiff078_C04	5361
Cdiff078_C05	70058
Cdiff078_C06	23538
Cdiff078_C07	98418
Cdiff078_C08	361074
Cdiff078_C09	45183
Cdiff078_C10	141523
Cdiff078_C11	21992
Cdiff078_C12	2353
Cdiff078_C13	133975
Cdiff078_C14	3374
Cdiff078_C15	9744
Cdiff078_C16	25480
Cdiff078_C17	293596
Cdiff078_C18	7057
Cdiff078_C19	73989
Cdiff078_C20	248092
Cdiff078_C21	41937
Cdiff078_C22	65693
Cdiff078_C23	21321
Cdiff078_C24	440055
Cdiff078_C25	210910
Cdiff078_C26	164162
Cdiff078_C27	22782
Cdiff078_C28	201701
Cdiff078_C29	13447
Cdiff078_C30	101704
Cdiff078_C31	146436
Cdiff078_C32	61153
Cdiff078_C33	59640
Cdiff078_C34	193273
Cdiff078_C35	18395
Cdiff078_C36	25573
Cdiff078_C37	61616
Cdiff078_C38	4117
Cdiff078_C39	110461
Cdiff078_C40	125351
Cdiff078_C41	38508
Cdiff078_C42	113221
Cdiff078_C43	500
Cdiff078_C44	547
Cdiff078_C45	613
Cdiff078_C46	649
Cdiff078_C47	666
Cdiff078_C48	783
Cdiff078_C49	872
Cdiff078_C50	872
Cdiff078_C51	879
Cdiff078_C52	921
Cdiff078_C53	955
Cdiff078_C54	1217
Cdiff078_C55	1337
Cdiff078_C56	1445
Cdiff078_C57	2081
Cdiff078_C58	2098
Cdiff078_C59	2512
Cdiff078_C60	2800
Cdiff078_C61	4372
[uzi@quince-srv2 ~/linuxTutorial]$
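The two perl calls can also be sketched as one awk filter, in keeping with the Awk focus of Exercise 1. sq_lengths is a hypothetical helper that reads SAM header text on stdin; the sample lines are written in the style of this reference's @SQ records, and the @PG line is a made-up non-@SQ header used only to show that it is skipped.

```shell
# sq_lengths: hypothetical helper; reads SAM header text on stdin and prints
# "name<TAB>length" for every @SQ line.
sq_lengths() {
    awk -F"\t" '$1 == "@SQ" {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4)   # drop the SN: tag
            if ($i ~ /^LN:/) len  = substr($i, 4)   # drop the LN: tag
        }
        print name "\t" len
    }'
}

printf '@SQ\tSN:Cdiff078_C01\tLN:9165\n@SQ\tSN:Cdiff078_C02\tLN:93786\n@PG\tID:bwa\n' | sq_lengths
```

In the workflow it would sit where the perl commands are, for example samtools view -H aln-pe.mapped.bam | sq_lengths > lengths.genome.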
Sort the BAM file. Many of the downstream analysis programs that use BAM files actually require a sorted BAM
file. -m specifies the maximum memory to use, and can be changed to fit your system.
[uzi@quince-srv2 ~/linuxTutorial]$ samtools sort -m 1000000000 aln-pe.mapped.bam aln-pe.mapped.sorted
We will now use bedtools. It is a very useful suite of programs for working with SAM/BAM, BED, VCF and GFF
files, files that you will encounter many times doing NGS analysis. The -ibam switch takes the sorted BAM file
that we generated earlier, -d reports the depth at each genome position with 1-based coordinates, and -g is
used to provide the genome lengths file we generated earlier. The coverage flags are explained pictorially in
the genomecov man page:
Reference: http://bedtools.readthedocs.org/en/latest/_images/genomecov-glyph.png
[uzi@quince-srv2 ~/linuxTutorial]$ bedtools genomecov -ibam aln-pe.mapped.sorted.bam -d -g lengths.genome > aln-pe.mapped.bam.perbase.cov
Look at the first few entries in the file generated above. The first column is the genome identifier, the
second column is the position on the genome, and the third column is the coverage.
[uzi@quince-srv2 ~/linuxTutorial]$ head aln-pe.mapped.bam.perbase.cov
Cdiff078_C01	1	41
Cdiff078_C01	2	41
Cdiff078_C01	3	42
Cdiff078_C01	4	42
Cdiff078_C01	5	42
Cdiff078_C01	6	44
Cdiff078_C01	7	44
Cdiff078_C01	8	44
Cdiff078_C01	9	44
Cdiff078_C01	10	44
[uzi@quince-srv2 ~/linuxTutorial]$
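A per-base file in this three-column layout is easy to summarise with the array syntax from Exercise 1. As a sketch, the hypothetical helper mean_depth below averages the third column per contig; the input is made-up data in the same layout rather than the real coverage file.

```shell
# mean_depth: hypothetical helper; reads contig<TAB>position<TAB>depth lines
# on stdin and prints "contig,mean depth" for each contig.
mean_depth() {
    awk -F"\t" '{sum[$1] += $3; n[$1]++}
        END {for (c in sum) print c "," sum[c] / n[c]}' | sort
}

printf 'c1\t1\t41\nc1\t2\t43\nc2\t1\t0\nc2\t2\t10\n' | mean_depth
```

The sort at the end is there only because awk does not promise any particular order when looping over array keys.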
Now we will count only those positions where we have >0 coverage.
[uzi@quince-srv2 ~/linuxTutorial]$ awk -F"\t" '$3>0{print $1}' aln-pe.mapped.bam.perbase.cov | sort | uniq -c > aln-pe.mapped.bam.perbase.count
To see what we have done, use the cat command:
[uzi@quince-srv2 ~/linuxTutorial]$ cat aln-pe.mapped.bam.perbase.count
9165 Cdiff078_C01
93786 Cdiff078_C02
752 Cdiff078_C03
5361 Cdiff078_C04
70058 Cdiff078_C05
23538 Cdiff078_C06
98418 Cdiff078_C07
333224 Cdiff078_C08
44803 Cdiff078_C09
141523 Cdiff078_C10
21969 Cdiff078_C11
2292 Cdiff078_C12
133974 Cdiff078_C13
1762 Cdiff078_C14
50 Cdiff078_C15
10232 Cdiff078_C16
293440 Cdiff078_C17
7057 Cdiff078_C18
73989 Cdiff078_C19
248092 Cdiff078_C20
41937 Cdiff078_C21
65447 Cdiff078_C22
21321 Cdiff078_C23
439123 Cdiff078_C24
210910 Cdiff078_C25
164162 Cdiff078_C26
22782 Cdiff078_C27
201701 Cdiff078_C28
13447 Cdiff078_C29
98510 Cdiff078_C30
146261 Cdiff078_C31
61153 Cdiff078_C32
44523 Cdiff078_C33
193180 Cdiff078_C34
18395 Cdiff078_C35
25573 Cdiff078_C36
61616 Cdiff078_C37
4117 Cdiff078_C38
62897 Cdiff078_C39
125351 Cdiff078_C40
38508 Cdiff078_C41
113221 Cdiff078_C42
442 Cdiff078_C43
649 Cdiff078_C46
663 Cdiff078_C47
766 Cdiff078_C48
580 Cdiff078_C51
1110 Cdiff078_C54
1445 Cdiff078_C56
2512 Cdiff078_C59
2800 Cdiff078_C60
[uzi@quince-srv2 ~/linuxTutorial]$
We will now use the above file with lengths.genome to calculate the proportions of genomes/contigs
covered using the following one-liner. It reads lengths.genome line by line, assigns the genome identifier to
myArray[0] and its length to myArray[1]. It then searches for the identifier in
aln-pe.mapped.bam.perbase.count, extracts the base count, and uses bc to calculate the proportions.
[uzi@quince-srv2 ~/linuxTutorial]$ while IFS=$'\t' read -r -a myArray; do echo -e "${myArray[0]},$( echo "scale=5;0"$(awk -v pattern="${myArray[0]}" '$2==pattern{print $1}' aln-pe.mapped.bam.perbase.count)"/"${myArray[1]} | bc )"; done < lengths.genome > aln-pe.mapped.bam.genomeproportion
[uzi@quince-srv2 ~/linuxTutorial]$ cat aln-pe.mapped.bam.genomeproportion
Cdiff078_C01,1.00000
Cdiff078_C02,1.00000
Cdiff078_C03,1.00000
Cdiff078_C04,1.00000
Cdiff078_C05,1.00000
Cdiff078_C06,1.00000
Cdiff078_C07,1.00000
Cdiff078_C08,.92286
Cdiff078_C09,.99158
Cdiff078_C10,1.00000
Cdiff078_C11,.99895
Cdiff078_C12,.97407
Cdiff078_C13,.99999
Cdiff078_C14,.52222
Cdiff078_C15,.00513
Cdiff078_C16,.40156
Cdiff078_C17,.99946
Cdiff078_C18,1.00000
Cdiff078_C19,1.00000
Cdiff078_C20,1.00000
Cdiff078_C21,1.00000
Cdiff078_C22,.99625
Cdiff078_C23,1.00000
Cdiff078_C24,.99788
Cdiff078_C25,1.00000
Cdiff078_C26,1.00000
Cdiff078_C27,1.00000
Cdiff078_C28,1.00000
Cdiff078_C29,1.00000
Cdiff078_C30,.96859
Cdiff078_C31,.99880
Cdiff078_C32,1.00000
Cdiff078_C33,.74652
Cdiff078_C34,.99951
Cdiff078_C35,1.00000
Cdiff078_C36,1.00000
Cdiff078_C37,1.00000
Cdiff078_C38,1.00000
Cdiff078_C39,.56940
Cdiff078_C40,1.00000
Cdiff078_C41,1.00000
Cdiff078_C42,1.00000
Cdiff078_C43,.88400
Cdiff078_C44,0
Cdiff078_C45,0
Cdiff078_C46,1.00000
Cdiff078_C47,.99549
Cdiff078_C48,.97828
Cdiff078_C49,0
Cdiff078_C50,0
Cdiff078_C51,.65984
Cdiff078_C52,0
Cdiff078_C53,0
Cdiff078_C54,.91207
Cdiff078_C55,0
Cdiff078_C56,1.00000
Cdiff078_C57,0
Cdiff078_C58,0
Cdiff078_C59,1.00000
Cdiff078_C60,1.00000
Cdiff078_C61,0
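The same proportions can be sketched in a single awk call that reads the count file first and the lengths file second, which avoids starting awk and bc once per contig. covered_fraction is a hypothetical helper, and the temporary files hold three rows copied from the listings above. The formatting differs slightly from the bc version: printf rounds to five decimals and always prints a leading zero.

```shell
# covered_fraction: hypothetical helper.
# Usage: covered_fraction COUNT_FILE LENGTHS_FILE
# COUNT_FILE has "count name" rows (as written by uniq -c) and LENGTHS_FILE
# has "name<TAB>length" rows. Contigs missing from COUNT_FILE get 0.
covered_fraction() {
    awk 'NR==FNR {covered[$2] = $1; next}
         {printf "%s,%.5f\n", $1, covered[$1] / $2}' "$1" "$2"
}

counts=$(mktemp); lengths=$(mktemp)
printf '   9165 Cdiff078_C01\n     50 Cdiff078_C15\n' > "$counts"
printf 'Cdiff078_C01\t9165\nCdiff078_C15\t9744\nCdiff078_C44\t547\n' > "$lengths"
covered_fraction "$counts" "$lengths"
rm -f "$counts" "$lengths"
```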
We have a total of 61 genomes/contigs in the reference database. To see how many genomes/contigs we
recovered, we will use the following one-liner:
[uzi@quince-srv2 ~/linuxTutorial]$ awk -F "," '{sum+=$NF} END{print "Total
Cdiff078_C25,57.4446
Cdiff078_C26,59.296
Cdiff078_C40,66.0074
Cdiff078_C27,59.4391
Cdiff078_C41,67.5941
Cdiff078_C28,59.8319
Cdiff078_C42,69.4415
Cdiff078_C29,60.961
Cdiff078_C43,4.812
Cdiff078_C46,29.3837
Cdiff078_C60,62.1336
Cdiff078_C47,7.95946
Cdiff078_C48,15.3436
[uzi@quince-srv2 ~/linuxTutorial]$
Sort the original BAM file:
[uzi@quince-srv2 ~/linuxTutorial]$ samtools sort -m 1000000000 aln-pe.bam aln-pe.sorted
Now we will check alignment statistics using the Picard tools. Note that the awk statement given below is used
to transpose the original table and you can do without it.
[uzi@quince-srv2 ~/linuxTutorial]$ java -jar $(which CollectAlignmentSummaryMetrics.jar) INPUT=aln-pe.sorted.bam OUTPUT=aln-pe.sorted.alignment_stats.txt REFERENCE_SEQUENCE=data/Cdiff078.fa
[uzi@quince-srv2 ~/linuxTutorial]$ grep -vi -e "^#" -e "^$" aln-pe.sorted.alignment_stats.txt | awk -F"\t" '{ for (i=1; i<=NF; i++) {a[NR,i] = $i}}NF>p{p=NF}END{for(j=1;j<=p;j++){str=a[1,j];for(i=2; i<=NR; i++){str=str"\t"a[i,j];} print str}}'
CATEGORY	FIRST_OF_PAIR	SECOND_OF_PAIR	PAIR
TOTAL_READS	425271	425038	850309
PF_READS	425271	425038	850309
PCT_PF_READS
PF_NOISE_READS
PF_READS_ALIGNED	407011	405258	812269
PCT_PF_READS_ALIGNED	0.957063	0.953463	0.955263
PF_ALIGNED_BASES	119451610	118113100	237564710
PF_HQ_ALIGNED_READS	401018	399295	800313
PF_HQ_ALIGNED_BASES	118606615	117274833	235881448
PF_HQ_ALIGNED_Q20_BASES	116971078	111640501	228611579
PF_HQ_MEDIAN_MISMATCHES
PF_MISMATCH_RATE	0.002359	0.007186	0.004759
PF_HQ_ERROR_RATE	0.002269	0.007065	0.004653
PF_INDEL_RATE	0.000124	0.00013	0.000127
MEAN_READ_LENGTH	299.093366	298.832657	298.963048
READS_ALIGNED_IN_PAIRS	404714	404545	809259
PCT_READS_ALIGNED_IN_PAIRS	0.994356	0.998241	0.996294
BAD_CYCLES
STRAND_BALANCE	0.500072	0.500484	0.500278
PCT_CHIMERAS	0.014823	0.014668	0.014746
PCT_ADAPTER	0.000285	0.000261	0.000273
SAMPLE
LIBRARY
READ_GROUP
[uzi@quince-srv2 ~/linuxTutorial]$
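The transposing awk statement is easier to follow when spread over several lines. The sketch below is the same idea with longer names: transpose is a hypothetical helper, and the three input rows are a cut-down stand-in for the Picard table.

```shell
# transpose: hypothetical helper; reads a tab-separated table on stdin and
# prints it with rows and columns swapped.
transpose() {
    awk -F"\t" '{
        for (i = 1; i <= NF; i++) cell[NR, i] = $i   # store every cell by (row, column)
        if (NF > ncol) ncol = NF                     # remember the widest row
    }
    END {
        for (j = 1; j <= ncol; j++) {                # each input column becomes a row
            line = cell[1, j]
            for (i = 2; i <= NR; i++) line = line "\t" cell[i, j]
            print line
        }
    }'
}

printf 'CATEGORY\tTOTAL_READS\nFIRST_OF_PAIR\t425271\nPAIR\t850309\n' | transpose
```

Everything is held in memory until END, which is fine for a small metrics table but not for very large files.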
The detailed description of these summary metrics is given here. From this link, PF_MISMATCH_RATE,
PF_HQ_ERROR_RATE, and PF_INDEL_RATE are of interest to us. As can be seen, the error rates are quite
low and we can proceed with the analysis. Next we would like to calculate GC bias. For this purpose, we will
index the aln-pe.mapped.sorted.bam file.
[uzi@quince-srv2 ~/linuxTutorial]$ samtools index aln-pe.mapped.sorted.bam
[uzi@quince-srv2 ~/linuxTutorial]$ for i in $(samtools view -H aln-pe.mapped.sorted.bam | awk -F"\t" '/@SQ/{gsub("^SN:","",$2);print $2}'); do samtools view -b aln-pe.mapped.sorted.bam $i > aln-pe.mapped.sorted.$i.bam; java -Xmx2g -jar $(which CollectGcBiasMetrics.jar) R=data/Cdiff078.fa I=aln-pe.mapped.sorted.$i.bam O=aln-pe.mapped.sorted.${i}_GCBias.txt CHART=aln-pe.mapped.sorted.${i}_GCBias.pdf ASSUME_SORTED=true; done
In the above one-liner, CollectGcBiasMetrics.jar will generate a GC bias plot for each contig, and they will
look like these:
Now collate all the txt files together:
[uzi@quince-srv2 ~/linuxTutorial]$ for i in $(ls *_GCBias.txt); do awk -v k="$i" '!/^#/ && !/^$/ && !/^GC/ && !/?/{print k"\t"$1"\t"$5}' $i; done | perl -ane '$r{$F[0].":".$F[1]}=$F[2];unless($F[0]~~@s){push @s,$F[0];}unless($F[1]~~@m){push @m,$F[1];}END{print "Contigs\t".join("\t",@s)."\n";for($i=0;$i<@m;$i++){print $m[$i];for($j=0;$j<@s;$j++){(not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}print "\n";}}' | sed '1s/aln-pe\.mapped\.sorted\.//g;1s/_GCBias\.txt//g' > aln-pe.mapped.sorted.bam.gcbias
Do not get intimidated by the perl one-liner in the above statement. I have extracted it from my
GENERATEtable.sh script. If the data stream is of the form [Contig]\t[Feature]\t[Value], then you
can pipe the stream to GENERATEtable.sh to obtain a Contig X Feature table:
$ cat test.tsv
contig1	F1	12.2
contig1	F2	34.2
contig1	F3	45.2
contig2	F2	56.3
contig2	F3	56.2
contig3	F1	45.4
contig3	F2	56.3
contig4	F1	23.5
contig5	F1	24.5
$ cat GENERATEtable.sh
#!/bin/bash
less <&0| \
perl -ane '$r{$F[0].":".$F[1]}=$F[2];
    unless($F[0]~~@s){
        push @s,$F[0];}
    unless($F[1]~~@m){
        push @m,$F[1];}
    END{
        print "Contigs\t".join("\t",@s)."\n";
        for($i=0;$i<@m;$i++){
            print $m[$i];
            for($j=0;$j<@s;$j++){
                (not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}
            print "\n";}}'
$ cat test.tsv | ./GENERATEtable.sh
Contigs	contig1	contig2	contig3	contig4	contig5
F1	12.2	0	45.4	23.5	24.5
F2	34.2	56.3	56.3	0	0
F3	45.2	56.2	0	0	0
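If you prefer to stay within Awk, the same reshaping can be sketched with the array syntax from Exercise 1. long_to_wide is a hypothetical counterpart written for this illustration, not the author's script. It remembers contigs and features in the order they first appear and prints 0 for missing combinations, as the perl version does.

```shell
# long_to_wide: hypothetical awk counterpart of the perl script; reads
# contig<TAB>feature<TAB>value lines on stdin and prints one row per feature
# and one column per contig, with 0 for missing combinations.
long_to_wide() {
    awk -F"\t" '{
        if (!($1 in seen_c)) { seen_c[$1] = 1; contig[++nc] = $1 }    # first-seen order
        if (!($2 in seen_f)) { seen_f[$2] = 1; feature[++nf] = $2 }
        value[$1, $2] = $3
    }
    END {
        line = "Contigs"
        for (j = 1; j <= nc; j++) line = line "\t" contig[j]
        print line
        for (i = 1; i <= nf; i++) {
            line = feature[i]
            for (j = 1; j <= nc; j++)
                line = line "\t" (((contig[j], feature[i]) in value) ? value[contig[j], feature[i]] : 0)
            print line
        }
    }'
}

printf 'contig1\tF1\t12.2\ncontig1\tF2\t34.2\ncontig2\tF2\t56.3\n' | long_to_wide
```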
Now take a look at the generated table:
[uzi@quince-srv2 ~/linuxTutorial]$ head aln-pe.mapped.sorted.bam.gcbias
Contigs	Cdiff078_C01	Cdiff078_C02	Cdiff078_C03	Cdiff078_C04	Cdiff078_C05	Cdiff078_C06	Cdiff078_C07	Cdiff078_C08	Cdiff078_C09	Cdiff078_C10	Cdiff078_C11	Cdiff078_C12	Cdiff078_C13	Cdiff078_C14	Cdiff078_C15	Cdiff078_C16	Cdiff078_C17	Cdiff078_C18	Cdiff078_C19	Cdiff078_C20	Cdiff078_C21	Cdiff078_C22	Cdiff078_C23	Cdiff078_C24	Cdiff078_C25	Cdiff078_C26	Cdiff078_C27	Cdiff078_C28	Cdiff078_C29	Cdiff078_C30	Cdiff078_C31	Cdiff078_C32	Cdiff078_C33	Cdiff078_C34	Cdiff078_C35	Cdiff078_C36	Cdiff078_C37	Cdiff078_C38	Cdiff078_C39	Cdiff078_C40	Cdiff078_C41	Cdiff078_C42	Cdiff078_C43	Cdiff078_C46	Cdiff078_C47	Cdiff078_C48	Cdiff078_C51	Cdiff078_C54	Cdiff078_C56	Cdiff078_C59	Cdiff078_C60
...
You can then use the following R code to generate a GC vs Coverage table which shows that at very GC,
coverages go down (note that these are the smoothed values across all genomes/contigs):
Now we calculate mean quality score by cycle:
[uzi@quince-srv2 ~/linuxTutorial]$ java -Xmx2g -jar $(which MeanQualityByCycle.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=aln-pe.mapped.sorted.mqc.txt CHART_OUTPUT=aln-pe.mapped.sorted.mqc.pdf
We also calculate quality score distribution:
[uzi@quince-srv2 ~/linuxTutorial]$ java -Xmx2g -jar $(which QualityScoreDistribution.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=aln-pe.mapped.sorted.qsd.txt CHART_OUTPUT=aln-pe.mapped.sorted.qsd.pdf
Another useful tool is Qualimap, which offers multi-sample BAM QC.
To use it, we need to generate an input.txt file which contains a listing of the BAM files we want to compare.
To save time, I am only considering 3 files:
[uzi@quince-srv2 ~/linuxTutorial]$ awk '{split($0,k,".");print k[4]"\t"$0}' <(ls *.sorted.Cdiff078_C4*.bam | head -3) > input.txt
[uzi@quince-srv2 ~/linuxTutorial]$ cat input.txt
Cdiff078_C40	aln-pe.mapped.sorted.Cdiff078_C40.bam
Cdiff078_C41	aln-pe.mapped.sorted.Cdiff078_C41.bam
Cdiff078_C42	aln-pe.mapped.sorted.Cdiff078_C42.bam
77
TTAAAGTTAAACTTGTCATATTCATATCTGATTTTTCTACTAGATTCCTTTAAGTTATCCGAACATGAAGCAAGTAATT
TATCCTTAATTAAATTATAGACTTTACTTTCTTTATCAGATAAATCTTTAGCTTTTCCAATACCAGATATAGTAGGAAT
AATTGCATAGTGGTCTGTAACTTTAGATGAATTAAAAATAGACTTAAAGTTTGATTCATTGATTTTAAAATCTTCTTCA
AGTCCTTCTAATAATTCTTTCATAGTATTAACCATATCATTGGTTAAATACCTGCTATCCGTTC
CCCCCGGGGGGGGGGFGFGGGGGGGGGGGGGCEGGGGGFGGGGEFEEGGFEGGGDFFGFGGGFFFGGGGGGFFFGGGGF
EFFEBFGGGC>FGGFBFGGGFDF,@DEEFCFGGGGGGGGCEF,?
EGFGGGFDFGGFGGFDCDFFGFFDFFFBFBFFGFFGEFFAAFCEFFFFF5@09*2?
EE@A*>@AEF5@):=>E;EB**9:495*
AS:i:0 XS:i:0
M01808:26:000000000-A6KTK:1:1101:10136:1113
*
77
CTATTGGAACAAGTGGGGAACTGCAGTCGCCTAACAGAGAATATATTCGTTATCGAATTACATTATCTACTCAAGACAC
CAGTAGAACTCCTAAACTTCTTGAAATACAACTACATGATATACCAAAACCTCCTTATGAGAGACTTGGATTTGCAAGA
CCAGTTGTGTTGGATACTAACGGGGCTTGGGAAGCAGTGTTAGAAAATGCCTTTGATATTGTAGTAACAAGTGAAGTAA
ATGGCGCTGATATTCTGGAGTTTAAACTGCCATTTCATGATTCCAAGCGAGAGACATTAGACA
CCCCCGGGGEGGF
Extracting mapped reads with headers:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c sam -H '!and($flag,4)' aln-pe.sam | less
@SQ	SN:Cdiff078_C01	LN:9165
@SQ	SN:Cdiff078_C02	LN:93786
@SQ	SN:Cdiff078_C03	LN:752
@SQ	SN:Cdiff078_C04	LN:5361
@SQ	SN:Cdiff078_C05	LN:70058
@SQ	SN:Cdiff078_C06	LN:23538
@SQ	SN:Cdiff078_C07	LN:98418
@SQ	SN:Cdiff078_C08	LN:361074
@SQ	SN:Cdiff078_C09	LN:45183
@SQ	SN:Cdiff078_C10	LN:141523
@SQ	SN:Cdiff078_C11	LN:21992
@SQ	SN:Cdiff078_C12	LN:2353
@SQ	SN:Cdiff078_C13	LN:133975
@SQ	SN:Cdiff078_C14	LN:3374
@SQ	SN:Cdiff078_C15	LN:9744
@SQ	SN:Cdiff078_C16	LN:25480
:
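The filter !and($flag,4) keeps a read only when bit 4 (read unmapped) of the SAM FLAG is not set. As a rough sketch of the same test in plain awk, without bioawk's named columns or the and() function, the bit can be checked with integer arithmetic on column 2. The function name keep_mapped and the toy records are made up for illustration and are not taken from aln-pe.sam.

```shell
# keep_mapped is a hypothetical helper: it reads SAM text on standard input,
# passes header lines through, and drops records whose FLAG has bit 4 set.
keep_mapped() {
    awk -F'\t' '
        /^@/ { print; next }      # header lines start with @
        int($2 / 4) % 2 == 0      # FLAG is column 2; bit 4 unset means mapped
    '
}

# Toy input: one header line, one mapped read (FLAG 99), one unmapped read (FLAG 77).
printf '@SQ\tSN:Cdiff078_C01\tLN:9165\nread1\t99\tCdiff078_C01\nread2\t77\t*\n' | keep_mapped
```

FLAG 77 is 1 + 4 + 8 + 64, so bit 4 is set and the record is dropped, while FLAG 99 is 1 + 2 + 32 + 64 and the record is kept.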
Create FASTA from BAM (uses revcomp if FLAG & 16):
[uzi@quince-srv2 ~/linuxTutorial]$ samtools view aln-pe.bam | bioawk -c sam '{
s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}' | less
>M01808:26:000000000-A6KTK:1:1101:19201:1002
NAAAAGAACTGGCAATTGAAAATAATATACCTGTATATCAACCAGTAAAGGCTAGAGATAAAGAATTTATAGATACAAT
TAAATCTTTAAATCCAGATGTAATAGTAGTTGTAGCTTTTGGACAGATACTTCCAAAAGGAATATTAGAGATTCCTAAG
TTTGGATGTATAAATGTTCATGTTTCTTTACTTCCAAAATATAGAGGTGCGGCACCTATAAATTGGGTAATAATAAATG
GTGAAGAAAAGACTGGTGTTACAACTATGTATATGGATGAAGGTCTAGATACTGGA
>M01808:26:000000000-A6KTK:1:1101:19201:1002
NCCAGTATCTAGACCTTCATCCATATACATAGTTGTAACACCAGTCTTTTCTTCACCATTTATTATTACCCAATTTATA
GGTGCCGCACCTCTATATTTTGGAAGTAAAGAAACATGAACATTTATACATCCAAACTTAGGAATCTCTAATATTCCTT
TTGGAAGTATCTGTCCAAAAGCTACAACTACTATTACATCTGGATTTAAAGATTTAATTGTATCTATAAATTCTTTATC
TCTAGCCTTTACTGGTTGATATACAGGTATATTATTTTCAATTGCCAGTTCTTTTA
>M01808:26:000000000-A6KTK:1:1101:12506:1003
NAAAGATATTATTTTTAGCCCTGGTGTTGTACCTGCTGTTGCTATTTTAGTAAGAATATTAACTAATTCTAATGAAGGC
GTGATAATTCAAAAGCCAGTGTATTACCCATTTGAAGCTAAGGTAAAGAGTAATAATAGGGAAGTTGTAAACAATCCTC
TAATATATGAAAATGGGACTTATAGAATGGATTATGATGATTTGGAAGAAAAAGCTAAGTGTAGCAACAATAAAGTACT
GATACTTTGTAGCCCTCACAATCCTGTTGGAAGAGTTTGGAGAGAAGATGAATTAAAAAAGGTT
>M01808:26:000000000-A6KTK:1:1101:12506:1003
NAGATTAAATGTTTTACTTGGAGCTATACATGTAACTATTTTATCCTTGTACTCTGGGCATAATGACTGTAAAGGAGTA
TGTTTAAATCCTTTTCTAATTAAATCAGAATGTATCTCATCAGCTATTATCCATAGGTCATATTTTTTACATATTTCTA
CAACCTTTTTTAATTCATCTTCTCTCCAAACTCTTCCAACAGGATTGTGAGGGCTACAAAGTATCAGTACTTTATTGTT
GCTACACTTAGCTTTTTCTTCCAAATCATCATAATCCATTCTATAAGTCCCATTTTCATATATT
:
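FLAG bit 16 marks a read aligned to the reverse strand. SAM/BAM stores such a read reverse-complemented, so the one-liner flips it back with revcomp(). Here is a sketch of what a reverse complement does, written in plain awk. The helper name rc is hypothetical, and it only complements A, C, G and T in either case; any other symbol, such as N, is left unchanged, which may differ from how bioawk treats ambiguity codes.

```shell
# rc is a hypothetical helper: reverse-complement each input line.
rc() {
    awk '
        BEGIN {
            comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C"
            comp["a"] = "t"; comp["t"] = "a"; comp["c"] = "g"; comp["g"] = "c"
        }
        {
            out = ""
            # Walk the sequence from the last base to the first.
            for (i = length($0); i >= 1; i--) {
                b = substr($0, i, 1)
                out = out ((b in comp) ? comp[b] : b)
            }
            print out
        }
    '
}

# First 12 bases of the first read shown above.
echo "NAAAAGAACTGG" | rc
```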
Get %GC content from the reference FASTA file (gc() reports it as a fraction between 0 and 1):
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c fastx '{ print ">"$name; print gc($seq) }' \
    data/Cdiff078.fa | less
>Cdiff078_C01
0.28096
>Cdiff078_C02
0.307669
>Cdiff078_C03
0.514628
>Cdiff078_C04
0.26898
>Cdiff078_C05
0.291059
>Cdiff078_C06
0.286006
>Cdiff078_C07
0.282794
>Cdiff078_C08
0.289484
:
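To make clear what the GC value represents, here is a minimal plain-awk sketch that computes the fraction of G and C bases per FASTA record, joining wrapped sequence lines first. The function name gc_fraction and the toy records are assumptions for illustration; it is not bioawk's implementation and prints name and value on one tab-separated line.

```shell
# gc_fraction is a hypothetical helper: FASTA on standard input,
# one "name<TAB>fraction" line per record on standard output.
gc_fraction() {
    awk '
        function report(   n, tmp, gc) {
            n = length(seq)
            tmp = seq
            gc = gsub(/[GCgc]/, "", tmp)   # gsub returns how many G/C bases it removed
            if (n > 0) print name "\t" (gc / n)
        }
        /^>/ { if (seen) report(); name = substr($0, 2); seq = ""; seen = 1; next }
        { seq = seq $0 }                   # join wrapped sequence lines
        END { if (seen) report() }
    '
}

# Toy records, not taken from data/Cdiff078.fa.
printf '>toy1\nGGCC\nAATT\n>toy2\nATAT\n' | gc_fraction
```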
Get the mean Phred quality score from a FASTQ file:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -c fastx '{ print ">"$name; print meanqual($qual) }' \
    data/M120_S2_L001_R1_001.fastq | less
>M01808:26:000000000-A6KTK:1:1101:19201:1002
37.3788
>M01808:26:000000000-A6KTK:1:1101:12506:1003
36.9867
>M01808:26:000000000-A6KTK:1:1101:19794:1003
37.1694
>M01808:26:000000000-A6KTK:1:1101:20543:1021
37.01
>M01808:26:000000000-A6KTK:1:1101:14616:1037
33.9133
>M01808:26:000000000-A6KTK:1:1101:10885:1044
35.9502
:
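The mean quality is the average of the per-base Phred scores encoded in the quality string. The sketch below does this in plain awk under two assumptions that you should check against your data: the FASTQ uses exactly four lines per record, and qualities are encoded as Phred+33. The function name mean_qual and the toy record are made up for illustration.

```shell
# mean_qual is a hypothetical helper: 4-line FASTQ on standard input,
# one "name<TAB>mean" line per read on standard output.
mean_qual() {
    awk '
        BEGIN {
            # Lookup table from printable character to its ASCII code.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 1 { name = substr($1, 2) }   # header line without the leading @
        NR % 4 == 0 {                          # quality line
            total = 0
            n = length($0)
            for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - 33
            if (n > 0) print name "\t" (total / n)
        }
    '
}

# Toy read: I encodes 40, 5 encodes 20, ! encodes 0 under Phred+33.
printf '@toy\nACGT\n+\nII5!\n' | mean_qual
```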
To see how many sequences are short (less than 1000 bp):
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -cfastx 'BEGIN{ s = 0} {if (length($seq) < 1000) s += 1}
END {print "Shorter sequences", s}' data/Cdiff078.fa
Shorter sequences	12
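For comparison, the same count can be sketched in plain awk, where the record length has to be accumulated by hand because a FASTA sequence may span several lines. The function name count_short and the toy records are hypothetical, and the threshold is passed as an argument rather than fixed at 1000.

```shell
# count_short is a hypothetical helper.
# $1: length threshold in bp; FASTA is read from standard input.
count_short() {
    awk -v limit="$1" '
        /^>/ { if (seen && len < limit) nshort++; len = 0; seen = 1; next }
        { len += length($0) }          # add up wrapped sequence lines
        END {
            if (seen && len < limit) nshort++
            print "Shorter sequences", nshort + 0
        }
    '
}

# Toy records of length 6, 10 and 1, with a threshold of 8 bp.
printf '>a\nACGT\nAC\n>b\nACGTACGTAC\n>c\nA\n' | count_short 8
```

This is where bioawk's $seq saves work: it already holds the whole sequence of the record.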
You can count sequences very effectively with Bioawk, because NR now stores the number of records:
[uzi@quince-srv2 ~/linuxTutorial]$ bioawk -cfastx 'END{print NR}' \
    data/630_S4_L001_R1_001.fastq
329396
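In plain awk, NR counts lines rather than sequence records, so the equivalent count needs a division. A minimal sketch, assuming a FASTQ file with exactly four lines per read (the function name count_fastq is made up for illustration):

```shell
# count_fastq is a hypothetical helper: 4-line FASTQ on standard input.
count_fastq() {
    awk 'END { print NR / 4 }'
}

# Two toy reads, eight lines in total.
printf '@r1\nAC\n+\nII\n@r2\nGT\n+\nII\n' | count_fastq
```

A non-integer result would indicate that the file does not follow the four-line layout.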
Further Reading
In the context of the exercises, it will be helpful if you could read through the following online
tutorials, though it is not essential:
Bash tutorial (https://jack.logicalsystems.it/homepage/techinfo/Guida-Bash.txt)
Awk one-liners (http://www.pement.org/awk/awk1line.txt)
Sed one-liners (http://sed.sourceforge.net/sed1line.txt)
Perl one-liners (http://www.catonmat.net/download/perl1line.txt)
VI tutorial (http://www.nanocontact.cz/~trunec/education/unix/vi-tutor.txt)
Last Updated by Dr Umer Zeeshan Ijaz on 23/01/2015.