Linux Command Line Exercises - Linux+CSC Quick Reference
In these instructions, the leading "$" in the command examples denotes the command prompt and should
not be typed.
Some command lines are too long to fit on one line in printed form. These are indicated by a backslash “\”
at the end of the line. It should not be included when typing the command on a single line. For example
$ example command \
continues \
and continues
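As a small illustration of the continuation rule: if you do press Enter directly after the backslash, the shell also accepts the command split over several lines. The sketch below uses placeholder words, not anything from the exercise files.

```shell
# One echo command written across three lines; the shell joins the lines
# before running it, so echo sees a single argument list.
joined=$(echo first part \
  second part \
  third part)
echo "$joined"
```

The printed result is the same as if everything had been typed on one line.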
0 Download and unpack the exercise files (do this the first time only):
Go to the workshop home page (www.csc.fi -> click the workshop name link in the right column, scroll
down, download the tar archive from the link and save it as …)
Open a terminal, cd to the folder where you downloaded the archive, unzip and untar the file:
$ cd inputs
$ ls -l
$ cd ..
Etc.
Go to the result subdirectory (which itself has subdirectories). The pdb-files have a line with "TITLE
somename". Below are some commands that you can try in order to find out which file is about caffeine.
What are the other structures about?
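One possible way to approach this is sketched below with grep. The directory layout, file names and the second title are made-up stand-ins so that the example runs anywhere; only the "TITLE" keyword and the caffeine example come from the exercise text.

```shell
# Scratch directory with two tiny stand-in pdb files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir result1 result2
printf 'TITLE     caffeine\nATOM      1\n' > result1/out_1.pdb
printf 'TITLE     aspirin\nATOM      1\n'  > result2/out_2.pdb   # made-up title

grep TITLE */*.pdb                  # print every TITLE line, prefixed by its file name
hit=$(grep -il caffeine */*.pdb)    # -i ignores case, -l prints only matching file names
echo "caffeine is in: $hit"
```

The first grep answers "what are the structures about", the second narrows the search to one keyword.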
3) Create subfolders
There are only three result* directories. Create two new ones, result4 and result5, for output files 4
and 5, and move the output files (out_4.pdb, out_5.pdb) into those directories.
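A minimal sketch of this step, run in a scratch directory with empty stand-in files (the real out_4.pdb and out_5.pdb live in the exercise tree, and their exact location is an assumption left out here):

```shell
# Work in a scratch directory so nothing in the real exercise tree is touched.
workdir=$(mktemp -d)
cd "$workdir"
touch out_4.pdb out_5.pdb      # empty stand-ins for the real output files

mkdir result4 result5          # mkdir accepts several directory names at once
mv out_4.pdb result4/          # move each file into its own directory
mv out_5.pdb result5/

ls result4 result5             # check the result
```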
Go to the moving-around directory where the original tar file is (dirs.tar.gz) with cd.
search and pressing "n" proceeds to the next occurrence of the keyword. You can also scroll the screen
with arrow keys when needed. Exit from the man page with "q".
/sort and press Enter (note: you need to give this command while in the man page, not at the
command prompt)
Press n as many times as needed until you find the flag for sorting by file size. You can also use some
other keyword to find that (e.g. size).
$ ls -S
4) Additional flags
Search for the flag that will reverse the sort order (i.e. print the largest file last). You can give the flags to
the ls command together (e.g. ls -la instead of ls -l -a).
Search for the flag that will show the file size in "human readable format" i.e. kB/MB/GB instead of
bytes (the default).
Tip: you can also search for the meaning of a flag directly, e.g. with /-S, which looks for occurrences of
"-S" (or, even better, "-S " with a trailing space), if you want to know what that flag does.
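To see how combined flags behave, the sketch below creates three files of known size in a scratch directory (the file names are made up) and lists them with -l, -S, -r and -h given together.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'x' > small.txt                   # 1 byte
printf '%s' 'xxxxxxxxxx' > medium.txt    # 10 bytes
head -c 2048 /dev/zero > large.txt       # 2048 bytes

ls -lSrh                       # long format, sorted by size, reversed, human-readable sizes
last=$(ls -Sr | tail -n 1)     # with -S and -r together the largest file comes last
echo "largest file, listed last: $last"
```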
3 Using wildcards
Metadata: commands in this exercise: ls, cp
Metadata: targeting many files with wildcards
The Linux shell supports wildcards, and many commands support regular expressions, to match files or
strings that differ only in some controlled ways.
$ ls
List only those files that have an "a" in the name and end in ".pdb"
$ ls *a*.pdb
List those files that have a name with seven characters and end in ".pdb".
Introduction to Linux: Exercises 4/10
$ ls ???????.pdb
2) Limit listing of the output files containing a range of numbers
$ ls */*.pdb
In this command the first "*" tells ls to look at all subdirectories in the current directory, and the second
"*" matches any string. Now limit the list to only those out-files that have a number 2-5 in their name:
$ ls */out_[2-5].pdb
What is the difference to this command?
$ ls */*[2-5].pdb
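The difference between the two patterns can be tried out safely on made-up files. In the sketch below all directory and file names are hypothetical; echo is used instead of ls only to capture the matches on one line.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir result1 result2
touch result1/out_2.pdb result1/in_3.pdb result2/out_7.pdb result2/model15.pdb

strict=$(echo */out_[2-5].pdb)   # name must be exactly out_ + one digit 2-5 + .pdb
loose=$(echo */*[2-5].pdb)       # any name whose last character before .pdb is 2-5
echo "strict: $strict"
echo "loose:  $loose"
```

The second pattern also picks up in_3.pdb and model15.pdb, because "*" in front of the bracket matches any prefix.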
Change into the sub-directory linux-exercises/backupscript. In that directory you should have
the following files, which are the solutions to the exercises below (so don’t open them yet):
enhancedbackupscript.sh simplebackupscript.sh
Starting from our befriendly.sh example, create a script that copies all files with a suffix
(e.g., test.dat) from your home directory automatically to a directory in /tmp/homebackup that is
first created by the same script. Use wildcards for that. Try to add some verbosity to the script by
using the echo command, e.g.,
Using the possibility to store the output of a command in a local variable, create a directory-name that
includes the current date:
destination=/tmp/homebackup_$(date +%Y-%m-%d)
And rewrite the script to create dedicated backups that are distinguishable by this date. Hint: You can
then use the variable in connection with the mkdir command simply by $destination.
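A minimal sketch of such a dated backup script is given below. It is not the provided solution file. To keep it safe to run anywhere, it copies from a scratch directory holding made-up files instead of the real home directory, and the .dat suffix is an assumption taken from the test.dat example.

```shell
#!/bin/bash
# Stand-in for the home directory; for the real exercise set source_dir="$HOME".
source_dir=$(mktemp -d)
touch "$source_dir/test.dat" "$source_dir/other.dat" "$source_dir/notes.txt"

# Store the output of the date command in the directory name.
destination=/tmp/homebackup_$(date +%Y-%m-%d)

echo "Creating backup directory $destination"
mkdir -p "$destination"        # -p: no error if the directory already exists

echo "Copying .dat files from $source_dir"
cp "$source_dir"/*.dat "$destination"/

echo "Backup now contains:"
ls "$destination"
```

Because the date is evaluated once and kept in $destination, every later command refers to the same directory.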
Change into the linux-exercises/chem directory. In that directory you should now have these files:
dimer.log      : Gaussian quantum chemistry geometry optimization log file for a water dimer
dimer_scan.log : Gaussian log file for relaxed potential energy scan for stretching water dimer distance
freq.log       : Gaussian log file for a frequency calculation
cp2k.out       : cp2k calculation ascii output file
cp2k.xyz       : xyz format molecular structure file of liquid water
The part in the log file where the energy has converged is shown below. The final energy is printed on
the shadowed line.
Programs often print out messages in case something goes wrong or the user has chosen questionable
options. Is there anything in cp2k.out that we need to worry about?
4) Look at how the convergence criteria develop. Which of these is satisfied
last? (dimer.log)
Item                    Value      Threshold   Converged?
Maximum Force           0.057661   0.000450    NO
RMS Force               0.022508   0.000300    NO
Maximum Displacement    0.218107   0.001800    NO
RMS Displacement        0.108003   0.001200    NO
Try some of these, what do they do?
Are there errors or warnings? Was the preceding geometry optimization successful, i.e. is the structure a
minimum on the potential energy surface? (Hint: were forces and displacements converged?)
A line that specifies the coordinates for an oxygen atom looks like this:
$ grep O cp2k.xyz
How many lines was that?
$ grep O cp2k.xyz | wc
What does wc (short for word count) print out? Can we give it some flags? (try man wc, or google for it)
How many atoms in total? (we know it's only hydrogen and oxygen atoms, i.e. O and H)
$ wc -l cp2k.xyz
What if your structure also had osmium atoms (Os)? How would you change your commands?
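One possible answer to the osmium question is sketched below on a tiny made-up xyz file (the coordinates and the file name sample.xyz are hypothetical). It compares a plain grep for "O" with a whole-word match, and uses grep's -c flag to count matching lines directly instead of piping to wc.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Made-up xyz file: atom count, comment line, then one atom per line.
cat > sample.xyz <<'EOF'
4
made-up example
O   0.000  0.000  0.000
H   0.757  0.586  0.000
H  -0.757  0.586  0.000
Os  2.000  2.000  2.000
EOF

loose=$(grep -c O sample.xyz)     # counts every line containing a capital O, Os included
strict=$(grep -cw O sample.xyz)   # -w: O must be a whole word, so Os is not counted
echo "grep O: $loose lines, grep -w O: $strict lines"
```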
The part in the output file that shows timing is like this:
*******************************************************************************
ENSEMBLE TYPE = NVE
STEP NUMBER = 1
TIME [fs] = 0.500000
CONSERVED QUANTITY [hartree] = -0.880615921354E+04
INSTANTANEOUS AVERAGES
CPU TIME [s] = 290.20 290.20
ENERGY DRIFT PER ATOM [K] = -0.810224161417E+02 0.000000000000E+00
POTENTIAL ENERGY[hartree] = -0.880838552803E+04 -0.880838552803E+04
KINETIC ENERGY [hartree] = 0.222631448393E+01 0.222631448393E+01
TEMPERATURE [K] = 305.326 305.326
*******************************************************************************
You could try this to get the timing:
$ more times
Start gnuplot with gnuplot and give
Explanation: in gnuplot the command is "plot", followed by the name of the file that has the data to
be plotted (as it is a string, it needs to be quoted). "using" tells gnuplot which columns of that
file to use: "0" means the line number (counted from zero: first line=0, second line=1,…; this will
be the x-axis), ":" separates the x column from the y column, and "5" means that the 5th column will
be the y-axis. "with" is followed by what to draw at the coordinates, here "points", i.e. some
symbols.
The part in the output file that has this information looks like this:
*******************************************************************************
ENSEMBLE TYPE = NVE
STEP NUMBER = 1
TIME [fs] = 0.500000
CONSERVED QUANTITY [hartree] = -0.880615921354E+04
INSTANTANEOUS AVERAGES
CPU TIME [s] = 290.20 290.20
ENERGY DRIFT PER ATOM [K] = -0.810224161417E+02 0.000000000000E+00
POTENTIAL ENERGY[hartree] = -0.880838552803E+04 -0.880838552803E+04
KINETIC ENERGY [hartree] = 0.222631448393E+01 0.222631448393E+01
TEMPERATURE [K] = 305.326 305.326
*******************************************************************************
First we want to grep all lines with the temperature:
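A sketch of this step on a made-up two-step excerpt in the same layout as the block above (the file names sample.out and temps, and the second step's numbers, are hypothetical):

```shell
workdir=$(mktemp -d)
cd "$workdir"
cat > sample.out <<'EOF'
 STEP NUMBER                   =                    1
 TEMPERATURE [K]               =              305.326             305.326
 STEP NUMBER                   =                    2
 TEMPERATURE [K]               =              301.112             303.219
EOF

# Keep only the temperature lines. Split on whitespace, the instantaneous value
# is field 4 and the running average is field 5 ("TEMPERATURE", "[K]", "=" come first).
grep TEMPERATURE sample.out | awk '{print $4, $5}' > temps
cat temps
```

In gnuplot, plot "temps" using 0:1 with points followed by replot "temps" using 0:2 with points would then be one way to draw both series.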
How to plot both the instantaneous temperature and the cumulative average? (hint: in gnuplot if you
use replot instead of plot, the previous plot is retained)
The x-coordinate is the first number, i.e. the second column of the file. You can sort the file
numerically (-n) according to the column (-k) you want.
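As an illustration, the sketch below sorts three made-up atom lines by their x-coordinate. The file name atoms.txt and the numbers are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
cat > atoms.txt <<'EOF'
O 10.5 0.0 0.0
H 2.25 1.0 0.0
H -3.0 0.5 0.0
EOF

# -k 2,2 restricts the sort key to the second column, -n compares it as a number.
sort -n -k 2,2 atoms.txt > sorted.txt
cat sorted.txt
```

Without -n the values would be compared as text, so 10.5 would sort before 2.25.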
Often it is necessary to change files slightly to use them in different analysis programs. This exercise
simulates some typical changes you need to make. For example, NGS data from different sources may
come in different formats, and using them together requires converting one or the other. This exercise
shows an example of how to accomplish that.
$ wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3
3. Change chromosome names (1st column) from format chr1, chr2, .. to format 1, 2,...
This command "cuts" i.e. prints everything from the 4th character on each line, i.e. cuts away the
first three characters. This could also be done with sed. The following command would replace
the first occurrence of "chr" on each line with nothing (what is between the second and third
slash), i.e. remove those.
$ sed s/chr// tmp_nomir > tmp_noscr
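The cut and sed approaches described above can be compared on a couple of made-up lines. In the sketch below the file names and column values are hypothetical; only the chr prefix format comes from the exercise.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Two made-up tab-separated lines with chr-prefixed chromosome names.
printf 'chr1\t.\tmiRNA\t17369\nchr10\t.\tmiRNA\t30366\n' > sample_chr

cut -c4- sample_chr > with_cut        # keep everything from the 4th character onwards
sed 's/chr//' sample_chr > with_sed   # delete the first "chr" on each line

cmp with_cut with_sed && echo "both commands give the same result"
cat with_sed
```

The two only agree when every line starts with "chr": cut removes the first three characters whatever they are, while sed removes only the text "chr".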
'ID=MIMAT0027618;Alias=MIMAT0027618;\
Name=hsa-miR-6859-5p;Derives_from=MI0022705'.
Change the first item to format 'gene_id "MI0006363_1"'
Note that here we already print out the first " around the gene_id. We'll print the second " at the
next stage.
5. Leave out the last three columns i.e. Alias, Name and Derives entries and add the " after the
gene_id.
First we tell awk to use ; as the field separator. $1 now matches everything up to the first ; (i.e.
up to the end of the gene_id code). Getting the " in place is a bit tricky. As " has a special meaning
in awk (it delimits strings, so it is not just a character), we need to escape it with \ so that it
means the literal character ", and then enclose that in a pair of " characters to make it a string.
An alternative way to do this in two steps is to use cut to leave out everything after the first
occurrence of ; and then print the trailing " with awk (as above).
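Both variants are sketched below on one made-up line. The attribute string reuses the example ID shown earlier; the other columns and the file names step4, step5 and step5_alt are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Made-up line after step 4: the opening " is already there, the closing one is missing.
printf '%s\n' '1 . miRNA 17369 17391 . - . gene_id "MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705' > step4

# -F";" sets the field separator; "\"" is a one-character awk string holding a literal ".
awk -F";" '{print $1 "\""}' step4 > step5

# Two-step alternative: cut keeps the part before the first ; and awk appends the ".
cut -d";" -f1 step4 | awk '{print $0 "\""}' > step5_alt

cat step5
```

The whole awk program sits inside single quotes so that the shell passes the backslash and double quotes to awk unchanged.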
6. Sort the file by chromosome and by miRNA start position (4th column). Make sure to sort the
chromosomes in numerical order (-n), not in alphabetical order (i.e. 1,2,3... not 1,10,11..)
Extra task. Make another file that only has entries from chromosome 2
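A sketch of the two-key sort and of the extra task, on four made-up lines (file names and positions are hypothetical):

```shell
workdir=$(mktemp -d)
cd "$workdir"
cat > sample <<'EOF'
10 . miRNA 500 520
2 . miRNA 300 320
1 . miRNA 900 920
1 . miRNA 100 120
EOF

# First key: column 1 as a number; second key: column 4 as a number.
sort -k1,1n -k4,4n sample > sorted
cat sorted

# Extra task: keep only the lines whose first column is exactly 2.
awk '$1 == "2"' sorted > chr2_only
cat chr2_only
```

Writing each key as -kN,N keeps the comparison inside that one column, so the start position only decides the order between entries on the same chromosome.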
It’s possible to do the above in a single command line, i.e. passing the output from the previous command
as input to the next, but usually it is safer to use temporary files (until you know each of the steps works).
Here is a one-liner that does steps 1-6:
CSC Quick Reference to Unix commands and CSC Computing Environment 2016-06-02 1/2