"Advanced" Linux
Loops and Ifs
for
We are using BLAST to learn about loops. And we will be using your favorite genes. First, we will build multiple databases with for
loops and blast each databases.
Go to folder LinuxAdvance/forloops
and copy three genome assemblies fasta files from /home/genhons1/PhylogenomicMaterials/genomes/
into the folder.
for f in *.fasta
do makeblastdb -dbtype nucl -in $f
done
Let’s see what’s going on here. The first line uses a wild card and so it’s basically for f in A.fasta B.fasta C.fasta
. This will create a variable called f
. A.fasta
will be assigned to f
for the first iteration, B.fasta
will be the next, and so on and so forth.
Move on to the next line. do
states the things you are going to do for an iteration. And $f
is calling the variable f
. So, for the first iteration, we are running makeblastdb -dbtype nucl A.fasta
. So on and so forth.
Note: you can have multiple lines after do
and the for
loop will run through all of them. For example: (Do not run)
for f in *.fasta
do echo making database for $f
makeblastdb -dbtype nucl -in $f
done
How about you try to run blast through all databases now? Copy a fasta file with genes of interest (can be /home/genhons1/DennyMaterials/Gemlquery.fasta
or something else that you are interested in) into the folder and write a code to complete the task!
Hint
You can run something like this. (replaceblastn
with tblastn
if you have a protein query.)
for f in *.fasta do blastn -db $f -query [<yourquery>] -outfmt 6 -out blastn-$f.txt done
Did you get an issue with your query fasta being treated as a database? There are at least three strategies to deal with it. The first one is to avoid for
assigning it your variable f
. For example, you can rename it, put it into a different folder or use some clever scheme in your wildcard to exclude the query.
Or, you can use if
(I’m still proposing using the former, since it will be cleaner).
Click to learn more!
for
loops are also useful if you want to generate a series of number. Try the followings:
for n in {1..20} do echo $n done
if else
I don’t not use if
else
a lot, but I think it’s worth mentioning.
Using our former task as example, we want to avoid your query. Therefore, we need a if
clause saying “if f
is query, skip it”, or “if f
is not query, do some stuff”.
Let’s start with the second strategy with an example.
x=somestring
if [ $x != somestring ]
then echo $x is not somestring
fi
In this example, we first assign “somestring” to x
. The second line ask if x
is not somestring (!=
is not equal). Now, the []
pass a true to if
, so the if
clause continue to run things from then
to fi
. You can try assign something else to x
and see what happens.
Note: You need the space between [
and $x
and between somestring
and ]
.
Now, let’s try the first strategy.
x=somestring
if [ $x == somestring ]
then echo
else echo $x is not somestring
fi
Here, there’s an else clause to say “if the things in []
is false, in this case: x
is somestring (==
is equal), run some echo $x is not somestring
.
You can see the solution is not elegant. You still need to run something when []
is true (I just gave it an echo
). So, avoid this solution, but else
is useful when you want to run something dichotomous.
Click to learn more!
There's alsoelif
so you can process multiple if else together. Learn it yourself later!
Try to write a code with for
and if
to avoid using query fasta!
Hint
for f in *.fasta do if [ $f != [<yourquery>] ] then blastn -db $f -query [<yourquery>] -outfmt 6 -out tblastn-$f.txt fi done
while
The third strategy is to use while
loops.
The most common usage of while
loops is to read a file and it will go through each line and assign them to variables.
Let’s go to LinuxAdvance/whileloops
, take a look at testfile
and run this:
while read l
do echo $l is a letter
done < testfile
Could you figure out what while does?
Hint
The first line says that while you are reading a line from a file, you assign it tol
and do the things between do
and done
.
The last line uses <
to indicate which file you are reading.
Now, your turn, copy the same three genome assemblies from your friends, and the fasta with genes of interest into the folder. Then, write a code to complete the task.
Note: use nano
to create the files that you are reading with while
.
Hint
It should be something like this:while read l do makeblastdb -dbtype nucl $l blastn -db $l -query [<yourquery>] -outfmt 6 -out tblastn-$l.txt done < [<yourfastalist>]Also see that I combined makeblastdb and blast so I don't need two loops!
Click to learn more!
Remember that you can combine the variables with a string and so, you don't need to add.fasta
in your fasta list!
while read l do makeblastdb -dbtype nucl $l blastn -db $l.fasta -query [<yourquery>] -outfmt 6 -out tblastn-$l.txt done < [<yourfastalist>]This way, your output file name won't have the ugly
.fasta
and be so much cleaner!
Click to learn more!
If you type something likewhile read A B CIt will treat your read in file like a table and put the three columns into variable
A
, B
and C
!
Advance sed and grep
sed with variables
To make a combined database with multiple genomes, we technically only need to concatenate all the genome files and then use makeblastdb
. However, it would be hard to know which contig or scaffold is from which sample. So, we are going back to sed
to label these contigs/scaffolds. And although we do not necessarily need complicated coding to accomplish it, it’s a good place to introduce you regular expressions. Just bear with me.
If we look at our assemblies, each sequence have a sequence name start with >
. The easiest way to add sample information is to replace >
with >[<sample name>]
.
Note that because you are using ''
in sed
and that will suppress the meaning of $
for calling variables, so you’ll need to use ''
to address it. Basically, if you want to replace variable X
with variable Y
, you should code
sed 's/'$X'/'$Y'/g' [<somefile>]
Now, you should have all the things you need to figure it out!
Let’s go to LinuxAdvance/sedblast
, copy a few assembled genomes there and try it out! (Don’t stop at making database but also do the blast part.)
Hint
I'm skipping themakeblastdb
and blast
for f in *.fasta do sed 's/>/>'$f'/g' $f >> compiled.fasta doneYou will get something that's sort of usable with this script.
But it's ugly. you will have
.fasta
inside your sequences. What can you do? Well, just pipe it to another sed
to remove it (and maybe replace it with something sensible!) Also, if you already have
compiled.fasta
, it will just be appended continously. So, make sure you remove it first.
Take a look at the blast results what did you find?
sed with regular expression (Regex)
Start and End of a line
Normally you won’t have this problem, but what if there’s an additional >
in the middle of the sequence name? That will also add in the sample name that you don’t need right?
So, we can use ^
to specify that >
needs to be at the beginning. Like:
for f in *.fasta
do sed 's/^>/>'$f'/g' $f > temp
mv temp $f
done
Click to learn more!
This is quite inelegant, but you can't just>
to $f
because it will mess up the file and sed
won't be able to work with it.Instead you can use
-i
to write the original file and it won't make any issues.
Sometimes you also want to add something to the end of a sequence. You can then use $
to specify.
For example: (Don’t run)
sed 's/A$/B/g' somefile
This will replace every A
at end of a line with B
.
^
and $
are parts of regular expression (Regex). They specify certain search patterns. Using it you can do some more specific or fuzzy searches.
I hope that you are interested in Regex now. And we will learn some other useful Regex!
Wild Cards
Let’s head to LinuxAdvance/sedregex
, take a look at the file ECMspecies
and learn about more Regex.
First we will learn .
. .
means any single character. Let’s see what happen if we run these two scripts:
sed 's/./XXX/' ECMspecies
sed 's/../XXX/' ECMspecies
Notice that I didn’t add g
at the end of the sed
expression, so it will only replace the first match.
*
is often used with .
. What it does is matching the preceding element zero or more times (It’s different from bash
’s wildcard.)
So what does this do? (Think before you run it!)
sed 's/ .*/XXX/g' ECMspecies
Let’s try to add feminine
after genus names that end with a
, add masculine
after genus names that end with us
. Also, remove the numbers and species. You’ll need to run sed
twice. Remember pipes |
? (Ignore Tuber
)
Hint
sed 's/a .*/a feminine/g' ECMspecies | sed 's/us .*/us masculine/g'
Bracket Expressions
Although .
is useful, we can’t do more nuanced edits. For example, it can not do something like only editting a
and c
in the file. But, we can use bracket expressions []
.
How bracket expression works is that you place everything that you want to match inside the bracket. Such as,
sed 's/[ac]/XXX/' ECMspecies
Notice that, again, I did not use g
. Take a look at the results. What happened?
We can also use -
to replace a series of characters or numbers. For example,
sed 's/[a-c]/XXX/' ECMspecies
sed 's/[0-3]/XXX/' ECMspecies
What just happened then? Look at the Boletus
line. Found anything?
You can also specify the amount of times of these brackets need to appear with \{\}
. (\
s provide the “regular expression meanings” to {
and }
). Let’s look at how it works without brackets first.
sed 's/l\{2\}/XXX/' ECMspecies
What does this do?
You can replace the l
with brackets. Like,
sed 's/[ac]\{2\}/XXX/' ECMspecies
You can also specify a range of number of characters. Like:
sed 's/[lsu]\{1,\}/XXX/' ECMspecies
sed 's/[lsu]\{1,3\}/XXX/' ECMspecies
What does each of these do?
Now, you should know how to replace the numbers only. Let’s try to replace them with N
!
Hint
sed 's/[0-9]\{1,\}/N/' ECMspecies
Lastly, you should also know that you can add ^
at the beginning of the brackets to indicate “not”.
For example,
sed 's/[^0-9]/X/g' ECMspecies
What does it do?
Marked Subexpression
Now, let’s say we want move the numbers to the beginning of the line, we can use marked subexpression \(\)
. (Like \{
and \}
, \
s provide the “regular expression meanings” to (
and )
.)
Run the followings:
echo ABCDEFGH | sed 's/^\(.*\)\(F\)/\2\1/'
What does it do? Let’s ignore the \(\)
first. And so it will look like:
echo ABCDEFGH | sed 's/^.*F/\2\1/'
^
indicates the match needs to start from the beginning. After .*
we find F
and we replace them with \2
and \1
.
\1
and \2
are defined by \(\)
. \1
corresponds to the first and \2
correspond to the second.
If it makes sense, try to move the numbers to the beginning of lines in ECMspecies
so your output would be something like this:
114 Laccaria species
Hint
sed 's/\(.*\) \([0-9]\{1,\}\)/\2 \1/' ECMspecies
grep
grep
also supports the usage of Regex. Are you able to find all the genera with species number equal or higher than 10 but lower than 100? (10~99)
Hint
grep ' [0-9]\{2\} ' ECMspecies
Click to learn more!
The Regex we used was POSIX basic regex. There's also a thing called POSIX extended basic regex. You can use-E
to enable it. You don't need \
for {}
and ()
if you are using the extended version. And there are a few other functions that you can use. Learn more here!
Dealing with Tables
cut
After that tangent, let’s go back to our Blast results. The format of outfmt is a table, in tab-separated values (tsv
). Note: there’s also comma-separated values (csv
).
Go to LinuxAdvance/tables
and copy one of your Blast result into the folder. The option that I use the most is -f
(field). Google to see how to use cut
to extract the E-value.
Hint
cut -f 11 [<yourblastresults>]
You can also cut
out a few columns. Let’s try the query sequence id, database sequence id, and the start and end for both query and databases. (Google it!)
Hint
cut -f 1,2,7-10 [<yourblastresults>]
Click to learn more
csv
is often better than tsv
because handling tabs is not the easiest thing to do. You can use -d
to let cut
to recognize ,
instead of tabs as delimiter.
sort
We used sort
before. However, it sorts by lines. We can use -k
to specify certain columns to sort
. sort -k 1,10 [<yourfile>]
would be sorting all 1~10 columns.
Normally we only want one column, so we generally code -k 1,1
.
Let’s try to sort the identity column.
Hint
sort -k 3,3 [<yourblastresults>]
However, you might notice that the sort doesn’t make sense at all. It’s sorting by characters not numbers.
To solve this issue, we can add n
before k
.
sort -nk 3,3 [<yourblastresults>]
It’s also possible to reverse the order
sort -nrk 3,3 [<yourblastresults>]
You can also sort by more than one column. By adding more -k
s.
sort -k 1,1 -nrk 3,3 [<yourblastresults>]
Let’s see if you can sort the file by query id, reverse order of database id and a reverse numeric order for the identity.
Hint
sort -k 1,1 -rk 2,2 -nrk 3,3 [<yourblastresults>]
Click to learn more!
Likecut
, you can let sort
to recognize ,
instead of tabs as delimiter, but with -t
.
awk
Variables in awk
awk
is a beast. To me, it’s a computational language just like R, C and python, but simpler.
We won’t be able to cover everything about awk
. Neither I know much about it, but we’ll just go through a few to show case the power of awk
.
Let’s run this and see what happens
awk '{print $3, $1}' [<yourblastresults>]
As you can see, it prints out a line with the third and first fields linked by a space. The fields were indicated by $
(Note: $0
is the whole line). The space is a delimiter introduced by ,
. See what happen if you don’t include ,
!
OK, but what if we want to use something else as delimiter?
We can set the output field separater OFS
, with -v
.
awk -v 'OFS=\t' '{print $3, $1}' [<yourblastresults>]
The space is now replaced with a tab, which is coded as \t
.
If you want specifically give a special delimiter between two fields, you can replace ,
with your delimiter and "
. For example:
awk '{print $3 "xxx" $1}' [<yourblastresults>]
You may have notice that awk
treat things in ""
as a string and just glue everything together.
Now, your turn. Try to print out the database sequence ids and start and end of the database sequences of each line using a csv
format. And instead comma, let’s place -
between the start and end.
Hint
awk -v 'OFS=,' '{print $2, $9 "-" $10}' [<yourblastresults>]
Do multiple things per line
Run this code and compare it with the results from our first awk
.
awk '{print "column3 = " $3; print "column3 = " $1}' [<yourblastresults>]
What does ;
do?
Calculations
You can do calculations in awk as well. For example, you can add up a couple numbers:
awk '{print $9 + $10 - 3}' [<yourblastresults>]
Here, because 3
is a number, you don’t need to add ""
around it.
There are also some functions you can use such as square root.
awk '{print sqrt($9 + $10 - 3)}' [<yourblastresults>]
Let’s calculate the length of each database sequence of each hit (start to end). (There’s no absolute function in awk
. The easy way is to use square and then square root: sqrt(X^2)
).
Hint
awk '{print sqrt(($9 - $10)^2) + 1}' [<yourblastresults>]
Click to learn more!
You should notice that Blast tells you the position of the start and the position of the end. Therefore,end - start
is not accurate. Instead, you'll need to add 1.
Math is important!!!
Begin and End
Notice that awk
does stuff in {}
line by line. But sometimes, we want to do things at the beginning or the end of reading a file.
For example, we can print out some stuff to the output file.
awk 'BEGIN{print "start!"}{print $3, $1}END{done!}' [<yourblastresults>]
You can imagine, if we do some calculation etc. It could be quite powerful.
Like, we can calculate how many lines there are in the file.
awk 'BEGIN{i = 0}{print $3, $1; i = i + 1}END{print "We have " i " lines!"}' [<yourblastresults>]
First, we assign 0
to a variable i
at the beginning. Then we read lines, print out and do the things after ;
, which is assign i+1
to i
. (;
indicates the first job is done, and move on to the next part in the same line. Just like the new lines in Linux). And at the end, we print out i
.
Click to learn more!
We actually don't need to do this to count the lines. This is just an example.if else
You can also use if
and else
in awk
.
If we want to print out all the lines of queryA
hits but replace others with XXX
, we can do something like this
awk '{if($1=="queryA"){print $0}else{print "XXX"}}' [<yourblastresults>]
Here, awk
runs the stuff inside the first {}
when reading each line. And it meets an if
clause, asking if the first field is queryA
. If yes, run the things in the second {}
; if no, run the things in the third {}
.
Let’s see if you can use it to figure out the directions of the hits! (You can use >
, <
and do calculations inside the if
clause too!)
Hint
awk '{if($10>$9){print "+"}else{print "-"}}' [<yourblastresults>]
If clause also can use some operators, such as AND &&
and OR ||
. Guess what happens if you run the code below: (It might not work with your result.)
awk '{if($9>100 && $10<3000){print $0}}' [<yourblastresults>]
Play with it to get the idea.
Click to learn more!
awk
also support sed
like expression (something like /ABC/
). But I don't use it often since it seems like you can just use if else clauses instead.
Line Numbers
There are some native variables in awk
. I’ll only introduce Number of Records NR
.
You can think NR
as the line numbers. Let’s see what happen if we run this:
awk -v 'OFS=\t' '{print NR, $0}' [<yourblastresults>]
What did it do?
Using NR
, could you print out the 3rd to 5th lines of your blast results?
Hint
awk '{if(NR>2 && NR<6){print $0}}' [<yourblastresults>]You can replace
NR>2
with NR>=3
which could be more intuitive.
Lastly, let’s try to write a code to turn fastq files to fasta files! Subset a fastq file with head -n 40
to your current directory.
Remember fastqs are in four line blocks. The first line is the sequence name with @
at the beginning. And the second line is the sequence. Here, %
will be useful. It outputs remainder from a division. For example, 9 % 3 == 0
and 6 % 5 == 1
.
Also, you don’t need to do it solely with awk
. Be creative!
Hint
awk '{if(NR%4==1 || NR%4==2){print}}' [<yourfastqfile>] | sed 's/@/>/g'
Click to learn more!
We talked aboutbioawk
as well, it feels very similar to awk
, but syntax is different. It's not native in Linux, so I don't use it, but it seems pretty powerful too.