Intro to Linux
Please press “Hint” only if necessary.
Press “Click to learn more!” even if you completed the tasks without hints.
Review
This should be quite simple if you have done the exercises before.
Directory Navigation
What’s the Present Working Directory? (i.e. which directory are you in?)
Hint
pwd
Change Directory to DennyMaterials/LinuxIntro/Review
.
Hint
You can use either an absolute path or a relative path.
cd DennyMaterials/LinuxIntro/Review
OR
cd [<pwd>]/DennyMaterials/LinuxIntro/Review
(replace [<pwd>] with your pwd output, which already starts with a /.)
Go to the parent directory.
Hint
cd ..
Go back to the directory you were in (Review
).
Hint
cd -
Click to learn more!
You can use cd -
to hop back and forth between two directories.
Go to the home directory.
Hint
cd
OR
cd ~
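To see the navigation commands above working together, here is a minimal sketch. It assumes a throwaway directory made with mktemp (not part of the workshop materials), so you can run it anywhere without touching your own files.

```shell
# Scratch area; the path names below are only for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/LinuxIntro/Review"

cd "$demo/LinuxIntro/Review"   # go in using an absolute path
here=$(pwd)                    # remember where we are

cd ..                          # up to the parent directory
parent=$(pwd)

cd - > /dev/null               # hop back (cd - also prints the path, so we hide it)
back=$(pwd)

echo "started in: $here"
echo "parent was: $parent"
echo "back in:    $back"
```

The first and last lines printed should be identical, because cd - returns to the directory you were in just before.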
File Exploration
First, go to directory LinuxIntro/Review/FileExploration
. (I’m not writing out DennyMaterials
from now on.)
Then, LiSt the files.
Hint
ls
When was file1
last modified, and which is the largest file?
Hint
ls -lh
-l
is an option, providing the details of the files and -h
is a second option, providing human-readable file size etc. We merged the two options together into
-lh
(This only works with single character options.)
Click to learn more!
You can sort the files by time or size!
ls -lht
ls -lhS
You might find that some of the “files” are directories. But how to tell which one is which?
Look at the first column of your ls -l
results. If it starts with a d
, it’s a directory!
Other options include using tab completion (my personal go to), or ls -p
, if there’s a /
after the name, that’s a directory.
Now, what’s in directory"X"
, where “X” can be anything?
Hint
ls directory*
*
is one of the wildcards, it can be anything at any length! However, the syntax is not the same in every situation. We will explore other usage of wildcards later!
Click to learn more!
Check Hint about wildcards!
What’s in directoryRecipes/FiveSpiceSeaweed
?
Hint
cat directoryRecipes/FiveSpiceSeaweed
OR
less directoryRecipes/FiveSpiceSeaweed
To leave less, press q. There are a lot more choices!
Super Basic File and Directory Handling
Go to directory LinuxIntro/Review/FileDirHandle
.
Make a file named fileX
with 1+1=3
as its content with nano
.
Hint
nano fileX
# type in 1+1=3
# press Control+X
# press Y, then Enter to confirm the file name
Let’s ReMove fileX
.
Hint
rm fileX
Use echo
and >
to create fileX
instead.
Hint
echo 1+1=3 > fileX
echo
will just print out exactly what you typed in, and >
redirects that output into fileX
.
Make a directory called Knowledge
and make two subdirectories in Knowledge
called Facts
and Believes
.
Hint
mkdir Knowledge
mkdir Knowledge/Facts
mkdir Knowledge/Believes
Click to learn more!
You can make all directories at once!
mkdir -p Knowledge/Facts Knowledge/Believes
-p
stands for "parents": it creates any missing parent directories along the way. This way you can create the parent and child directories at the same time.
Make fileY
and fileZ
in LinuxIntro/Review/FileDirHandle
with anything inside.
ConCATenate fileY
and fileZ
and append them to the end of fileX
Hint
cat fileX fileY fileZ > temp
mv temp fileX
You can't simply run
cat fileX fileY fileZ > fileX
That is because > empties fileX before cat gets a chance to read it, so its original content is lost. The temp file solution is not elegant either, so check
Click to learn more!
.
Click to learn more!
You can use >>
to append things to a file, super useful!
cat fileY fileZ >> fileX
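Here is a small sketch of the difference between > and >>, assuming a scratch directory and made-up file contents (none of this comes from the workshop files).

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory, safe to delete later

echo 1+1=3 > fileX    # > creates fileX (or empties it first if it exists)
echo 2+2=5 > fileY
echo 3+3=7 > fileZ

cat fileY fileZ >> fileX   # >> adds to the end and keeps the old content
cat fileX                  # three lines: 1+1=3, 2+2=5, 3+3=7

echo oops > fileY          # a second > throws away the old content of fileY
cat fileY                  # only: oops
```

So > is for starting a file from scratch, and >> is for adding to one you want to keep.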
CoPy fileX
into Knowledge/Facts
and rename it to Math
Hint
cp fileX Knowledge/Facts
mv Knowledge/Facts/fileX Knowledge/Facts/Math
OR
cp fileX Knowledge/Facts/Math
CoPy Knowledge/Facts/Math
into current directory without renaming
Hint
cp Knowledge/Facts/Math .
Add they're definitely correct
to the end of fileX
without using nano
.
Hint
echo "they're definitely correct" >> fileX
Notice that an apostrophe on its own would open a single-quoted string, so we wrap the text in double quotes to stop that.
ReMove Knowledge/Facts/Math
Hint
rm Knowledge/Facts/Math
Note: this is dangerous. The files you
rm
are gone forever. You can't get them back from the recycle bin!
ReMove Knowledge
Hint
rm -r Knowledge
“New” stuff!
Compressions & Archives
In bioinformatics we often deal with large files, and so we need to compress the files. Also, there will be times you want to package a folder into an archive, so it’s easier to move around. Here you’ll learn about how to handle these files!
Compression
The most common compression method is gzip
. You might be quite familiar with zip
, but that’s mostly used in Windows.
First, go to LinuxIntro/NewStuff/CompArch
, make two files called mouth
and pants
, and try to compress them (You can add any content you want into the two files.)
Since we are learning, let’s get some help.
gzip -h
It prints a short help message that tells you how to use this command:
Usage: gzip [OPTION]... [FILE]...
Right now, we don’t need [OPTION]
(usually it’s like -x
or --xxx
and the manual tells you what they are below.) So, here we only need to replace [FILE]
with file names.
Hint
gzip mouth pants
After zipping them, list all files. What did you find?
Now, let’s unzip them with gunzip
.
Hint
gunzip *.gz
If you want to do this, make sure that there are no other gz files in the directory.
What if we want to keep the original file when compressing?
Go to check the manual again to see what you can do!
This time, we will use:
man gzip
This opens the detailed manual with less
. You can scroll up and down with your mouse and press q
to leave. (Check out Advanced usage
.)
Now, let’s complete the task!
Hint
gzip -c mouth > mouth.gz
gzip -c pants > pants.gz
Check out
-c
in OPTIONS
as well.
Click to learn more!
In other versions of gzip
you can use -k
or --keep
. Unfortunately, it's not available here.
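A minimal sketch of keeping the original while compressing, assuming a scratch directory and a made-up file. gzip -c writes the compressed data to standard output instead of replacing the file, and > saves that output under a name you choose. gunzip -c does the reverse without touching the compressed file.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
echo "I have a big mouth" > mouth

gzip -c mouth > mouth.gz   # compressed copy; mouth itself stays
ls                         # both mouth and mouth.gz are listed

gunzip -c mouth.gz         # print the original text; mouth.gz stays too
```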
Fun Fact
Compressibility can be used to estimate the complexity and relatedness of genomes. See Chen et al. 2000
Archive
We will use tar
to archive.
Let’s make a folder called stuffs
and put mouth
and pants
in it. We will archive stuffs
into stuffs.tar
Now, use --help
to get some help. (--help
is usually the same as -h
, but some programs accept only one of the two.)
tar --help
You will see that tar
is a lot more complicated, it can do a lot of things! Check out the Examples
in the beginning!
With all the information, let’s try to archive!
Hint
tar -cf stuffs.tar stuffs
Click to learn more!
As mentioned before, -cf
is the same as -c -f
, usually you can stack these one-character options together.
People often use
-v
to make the process verbose. It will tell you which file it is archiving. Could be a good practice!
tar -cvf stuffs.tar stuffs
List the files, what did you see?
Now, remove the stuffs
folder, and extract the tar file.
(Use tar --help
or man tar
for information!)
Hint
rm -r stuffs
tar -x -v -f stuffs.tar stuffs
OR
rm -r stuffs
tar -xvf stuffs.tar stuffs
Archive + Compress
You can archive and compress at the same time with tar
!
We will archive and compress stuffs
into stuffs.tar.gz
Same drill: use man
or --help
for help and figure out how to do it! (check out Compression options
)
Hint
tar -czvf stuffs.tar.gz stuffs
Compare the file size between compressed and uncompressed archives.
Remember how to do it?
Hint
ls -lh stuffs.tar*
See contents in tar without unpacking
Sometimes an archive is so huge, say 200 GB. You know extracting it will crash your poor laptop, but you want to know what’s inside. What should you do?
Let’s look at Examples
in tar
manual again!
Hint
tar -tf stuffs.tar
Now, we want to extract mouth
only. Because it’s hard to find this information in the manual, why don’t you Google it? (No, Googling is not cheating! Also, discuss how to Google rather than how to write the code.)
Hint
Remember that you need to provide the full path
tar -xzf stuffs.tar.gz stuffs/mouth
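The listing and single-file extraction steps can be sketched end to end like this. It assumes a scratch directory and made-up file contents; only the names stuffs, mouth and pants follow the exercise.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
mkdir stuffs
echo teeth > stuffs/mouth
echo pockets > stuffs/pants

tar -czf stuffs.tar.gz stuffs   # c = create, z = gzip, f = archive file name
rm -r stuffs                    # pretend we only have the archive left

tar -tzf stuffs.tar.gz                 # t = list the contents, nothing is extracted
tar -xzf stuffs.tar.gz stuffs/mouth    # x = extract, here only one member
ls stuffs                              # only mouth came back
```

The path given to tar -x has to match what tar -t prints, which is why the listing step is useful first.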
Read compressed files without extraction
You can actually read compressed files without extracting them into new files! But, why do we want to do that? Because we can directly feed it into other commands with pipes |
, which saves disk space and computational time (We will talk about it later).
Let’s run the following and see what you get.
echo nonsense > temp.txt
gzip temp.txt
zcat temp.txt.gz
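As a small preview of why this is handy, the sketch below counts the lines of a compressed file without ever writing the uncompressed version to disk. It assumes a made-up file called reads.txt, and it uses gunzip -c, which does the same job as zcat.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'read1\nread2\nread3\n' > reads.txt
gzip reads.txt                    # leaves only reads.txt.gz

# Decompress to standard output and count lines on the fly.
nlines=$(gunzip -c reads.txt.gz | wc -l)
echo "lines: $nlines"
ls                                # still only reads.txt.gz
```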
Head and Tail
You know that head
spits out the first few lines of a file already. It could be handy if you want to take a quick glance at a file. But do you know head
can do more than that?
For example, you can specify how many lines you want it to spit out.
Same drill, go to LinuxIntro/NewStuff/HeadTailAndWordCount
and find out what the first three lines of HeadsAndTails.txt
are.
Hint
head -n 3 HeadsAndTails.txt
On the other hand, when you are dealing with error messages you may want to look at the last few lines of the file. Then, we can use tail
.
Now, let’s try to find out the last two lines of HeadsAndTails.txt
. (The syntax is the same).
Hint
tail -n 2 HeadsAndTails.txt
Sometimes, you want to remove the first few lines. And you can use tail
to do that! (You can use manual or Google)
Hint
tail -n +2 HeadsAndTails.txt
Click to learn more!
head
should be able to do similar things, too, but it depends on the version that you have: My laptop isn't capable of that :(.
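Here is a quick sketch of the three uses side by side, assuming a made-up five-line file so the output is easy to check by eye. Note that the number after + is the line to start at, not the number of lines to drop.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'line%s\n' 1 2 3 4 5 > five.txt   # five lines: line1 ... line5

head -n 3 five.txt    # first three lines: line1 line2 line3
tail -n 2 five.txt    # last two lines:    line4 line5
tail -n +2 five.txt   # start AT line 2, so only line1 is dropped
```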
Counting Characters, Words and Lines
Sometimes you need to count how many raw reads you got from a sequencing service, or you want to count how many nucleotides there are in a gene. Then, we can use wc
.
Stay in LinuxIntro/NewStuff/HeadTailAndWordCount
and run
wc HeadsAndTails.txt
You can see that there are four columns: the first three are numbers and the last one is your file name.
But what does each number mean? Try to find it out yourself!
You can also use options to specify which number you want to see. Look at the manual and play with the four options!
Click to learn more!
You may notice -m
and -c
are nearly the same. That's because -m
counts characters and -c
counts bytes, and normally one character is one byte. But in some encodings (UTF-8, for example), certain characters need two or more bytes to store, which makes the difference.
Note: It’s also not that simple to count nucleotides, but this could be a starting point.
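One reason counting nucleotides is not that simple: wc -c also counts the newline at the end of every line, and a FASTA file has a header line as well. The sketch below is one possible workaround, assuming a tiny made-up FASTA file called gene.fa. It uses two commands not covered in this section: grep -v (explained later) drops the header, and tr -d simply deletes the given character.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf '>gene1\nACGT\nAC\n' > gene.fa   # header + 6 nucleotides on two lines

wc -c gene.fa   # 15 bytes: header, sequence AND three newline characters

# Drop the header line, delete the newlines, then count what is left.
grep -v '^>' gene.fa | tr -d '\n' | wc -c   # 6
```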
Pipes |
Pipe |
is one of the most powerful tools in Linux, in my opinion.
Before we learn about pipe, let’s revisit wc
and ls
. Go to LinuxIntro/NewStuff/pipe
and figure out a way to count files in the folder! (You will need to direct output from ls
to a file and use wc
on it.)
Hint
ls -l > listoffiles
wc -l listoffiles
That gives you 8 files, but there should be 7 only.
Since we did not use |
, we will get one more file counted due to the temporary file. To avoid this, we can simply not make the temporary file with |
. Run:
ls -l | wc -l
What |
is doing here is keeping the “standard output” (STDOUT) from ls
in memory and feeding it directly as “standard input” (STDIN) to the next step. So, not only does it make your script shorter, it also saves a lot of computational time!
Let’s use |
again to get the 3rd to 8th lines of BadGuy
! (Think about which commands we used to get specific lines of a file.)
Hint
head -n 8 BadGuy | tail -n 6
You need to do a little arithmetic: lines 3 to 8 are 8 - 3 + 1 = 6 lines.
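Both pipe exercises can be tried on a made-up file. This sketch assumes a scratch directory and a hypothetical ten-line file called lines.txt. Lines 3 to 8 are 8 - 3 + 1 = 6 lines, so after head keeps the first eight lines, tail has to keep the last six.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 10 > lines.txt

ls | wc -l    # count entries without creating a temporary file: 1

head -n 8 lines.txt | tail -n 6   # line3 ... line8
```

Plain ls is used for counting here because ls -l adds a "total" line at the top, which would be counted too.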
Same and Different
Diff
Say today your lab is sharing a dataset and you made some edits, but you forgot what they are exactly. We can use diff
to solve this issue.
Head to LinuxIntro/NewStuff/DiffSort
and compare AfricanCountriesA
and AfricanCountriesB
diff AfricanCountriesA AfricanCountriesB
Try to figure out what the results mean!
Note: This is a very simple dataset so you can infer what the command is doing by looking at the dataset. Now, what if you have larger datasets and you want to know what the command is doing without Googling or asking? (It’s a useful skill!)
Comm
diff
can tell you where there’s a difference, but sometimes you want to know what’s the same. To do so, we can use comm
. Let’s run the following and see what happens.
comm AfricanCountriesA AfricanCountriesB
What are the three columns?
We can use the options to suppress certain columns. Play with the three options -1
, -2
and -3
and try to figure out what’s going on.
You can also combine the options into -12
, -13
and -23
.
Back to the original problem, how to get the lines common to both AfricanCountriesA
and AfricanCountriesB
?
Hint
comm -12 AfricanCountriesA AfricanCountriesB
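Here is a small sketch of the three comm columns, assuming two made-up lists that are already sorted. comm expects sorted input, so if your own files are not sorted, sort them first.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'Chad\nEgypt\nKenya\n' > listA   # made-up, sorted
printf 'Chad\nGhana\nKenya\n' > listB

comm listA listB       # col 1: only in listA, col 2: only in listB, col 3: in both
comm -12 listA listB   # hide columns 1 and 2 -> lines common to both: Chad, Kenya
comm -23 listA listB   # hide columns 2 and 3 -> lines only in listA: Egypt
```

In other words, the numbers name the columns you want to hide, not the ones you want to keep.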
Sort and Unique
Sort
sort
can be useful to organize things. Let’s sort AfricanCountriesA
with
sort AfricanCountriesA
How are the countries sorted?
Now, let’s sort numbers
. How are the numbers sorted?
Try to figure out how to sort
it numerically! (Note that sort -h
sort of “freezes” your terminal. That is because sort does not treat -h as a help option and simply waits for input from the keyboard. Don’t worry: press Ctrl+C to get out and use --help
or man
instead.)
Hint
sort -n numbers
Unique
uniq
can be used to remove duplicated items and is often used with sort
. Why? Run:
uniq numbers
You’ll see that it only removes a 23
but not the duplicated 2
. That is because it only compares the two lines next to each other. So sort
can solve this issue! Try again (remember to use |
)!
Hint
sort -n numbers | uniq
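The same behaviour can be reproduced with a tiny made-up file, which makes it easier to see why sort has to come first. This is only a sketch; the file nums is hypothetical and not the workshop's numbers file.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf '23\n23\n2\n10\n2\n' > nums   # made-up numbers with duplicates

uniq nums             # 23 2 10 2  (the two 2s are not adjacent, so both stay)
sort nums | uniq      # 10 2 23    (text order: "10" sorts before "2")
sort -n nums | uniq   # 2 10 23    (numeric order)
```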
Basic grep and sed
grep
grep
is a tool that you can use to search for a pattern in files. There are many uses, but we are going to learn the basic ones first.
First let’s find anything with Africa
in AfricanCountriesA
.
grep Africa AfricanCountriesA
You should see two countries. Now, let’s do the inverse: find everything without Africa
. (Check -v
in the manual.)
Hint
grep -v Africa AfricanCountriesA
We can also count how many hits without Africa
we got. (There are multiple ways.)
Hint
grep -cv Africa AfricanCountriesA
OR
grep -v Africa AfricanCountriesA | wc -l
Lastly, check out -o
. What does it do? And let’s search for “c”.
grep -o c AfricanCountriesA
It just prints out a bunch of “c”. Seems useless, right? But there are actually some uses for it. Check the manual and discuss what use there might be!
Hint
It prints out every exact hit (including multiple hits within a line).
You can pipe it to
wc -l
and get a count of the hits!
grep -o c AfricanCountriesA | wc -l
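The difference between counting lines and counting hits is easy to miss, so here is a minimal sketch on a made-up four-line file (the file countries is hypothetical, not the workshop's AfricanCountriesA).

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'South Africa\nChad\nCentral African Republic\nMorocco\n' > countries

grep Africa countries         # lines containing Africa (2 lines)
grep -v Africa countries      # lines NOT containing Africa: Chad, Morocco
grep -c c countries           # number of LINES with a lowercase c: 3
grep -o c countries | wc -l   # number of lowercase c's in total: 5
```

grep is case sensitive by default, which is why the capital C in Chad and Central is not counted.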
sed
sed
is another command with a whole lot of uses. However, its syntax is quite strange, so practicing it will help you master this useful command. We’re covering the two most common uses: substitute and delete lines.
We’ll start with “delete lines”, beginning with deleting lines at certain positions. Again, the sed
manual is not the best. So let’s Google to see how to delete the third to fifth lines from AfricanCountriesA
.
Hint
sed '3,5d' AfricanCountriesA
Click to learn more
It's good practice to wrap the sed command in ' (single quotes):
sed '3,5d' AfricanCountriesA
That is because sed commands often contain characters that the shell would otherwise interpret in its own way. The single quotes suppress those interpretations.
We can also remove lines with a pattern. Same drill, Google again and try to remove lines with Africa
from AfricanCountriesA
.
Hint
sed '/Africa/d' AfricanCountriesA
Remember how to do the same thing with grep?
Click to learn more
The idea behind all this craziness is that d
means delete, and any line matching what comes before it will be deleted. If it's a number, it deletes the lines at those positions (3,5 means lines 3 to 5). If it's something like /XXX/
, it does a grep
-like search.
Now, let’s move on to “substitute”. I’ll help you with this. Instead of d
, we will use s
.
sed 's/a/x/' AfricanCountriesA
It should replace the things between the first /
pair (a
) with the things between the second /
pair x
. But, what did you find?
It only replaces the first a
in each line, right?
We can add g
at the end of the sed
command for global replacement:
sed 's/a/x/g' AfricanCountriesA
You can also replace multiple characters and special characters. How about replacing “an” with “@$!”?
Hint
sed 's/an/@$!/g' AfricanCountriesA
Click to learn more 1!
You can replace g
with a number N, and it will replace only the Nth occurrence of the query in each line. E.g.
sed 's/an/@$!/2' AfricanCountriesA
will replace the second "an" in every line of AfricanCountriesA.
Click to learn more 2!
Sometimes you may want to replace /
with something else, but /
is already used as the separator! If this is the case, you can use another character (anything except
\
) as the separator instead of /
! E.g.
sed 's@/@X@g' yourfile
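All the sed variations from this part can be compared on one made-up file. This is just a sketch; the file words and its contents are invented so that each result is easy to predict.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'banana\nSouth Africa\nhavana\n/home/me\n' > words

sed '2d' words          # delete line 2
sed '/Africa/d' words   # delete lines matching Africa (same result here)
sed 's/a/x/' words      # first a per line:  banana -> bxnana
sed 's/a/x/g' words     # every a:           banana -> bxnxnx
sed 's/a/x/2' words     # second a per line: banana -> banxna
sed 's@/@X@g' words     # @ as separator:    /home/me -> XhomeXme
```

None of these commands change the file itself: sed prints the edited text to the screen, and words stays as it was.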
It might look like s
and d
have completely different syntax, but there’s actually a rationale behind them. We won’t touch on it in this workshop, but you are welcome to discuss it with others or me.
Both grep
and sed
also allow super-useful fuzzy matching with regular expressions. We will explore them tomorrow!
Good practice and miscellaneous things you should know
Variables
Variables are useful when you need to run the same code with different arguments. Let’s say we want to run cat
and echo
multiple times. We can use a variable, let’s say file
, and also assign a value AfricanCountriesA
to file
.
file=AfricanCountriesA
cat $file
echo $file
We first use =
to assign the variable. Then, we call the variable with $
. This way, we will be able to see the contents of AfricanCountriesA
and the file name itself.
We can further assign different values to file
. Such as,
file=AfricanCountriesB
cat $file
echo $file
You can also combine the variables with other strings to create new strings. They will just glue together. For example,
char=B
cat AfricanCountries$char
It might not seem very useful right now, but if you have complicated code, variables mean you only need to write the code once instead of multiple times. We will also talk about other uses soon (tomorrow!)
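Here is a small sketch of how variables glue together with other text, assuming made-up file names. When the variable is followed directly by letters, digits or an underscore, wrap its name in curly braces so the shell knows where the name ends. Putting double quotes around the whole thing is a good habit in case a value ever contains spaces.

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
echo "list A" > CountriesA.txt    # made-up files
echo "list B" > CountriesB.txt

char=B
echo "Countries$char"        # prints CountriesB
echo "${char}_list"          # braces needed: $char_list would be a different variable
cat "Countries${char}.txt"   # prints: list B

char=A                       # change one value, reuse the same command
cat "Countries${char}.txt"   # prints: list A
```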
Tabs Tabs Tabs!
OK I should have said this before, but if you’re stuck, just press tab
!
Some more details: a typical Linux command line starts with a command, followed by options, arguments, etc. If you press tab
when you are at the command part, it will search in your PATH
(a special environment variable that’s already defined) and try to find a command that fits to fill in. If you press tab
when you are at the options/arguments part, it will search in your present working directory and try to find a file (including directories) that fits to fill in.
If there is more than one hit, it won’t complete the name. But then, you can press tab
another time and it will give you a list of hits.
Don’t use Word to store your codes
Word messes up everything! To prove it, TYPE the following in Word, then paste it into the terminal and run it.
echo "stupid" > MicrosoftWord
What did you find?
OK, so it’s not that Word is stupid. It’s too smart: it automatically changes some characters for writers (for example, it turns straight quotes into curly quotes). However, that’s not good for bioinformatics.
So, I suggest you use a text editor or Visual Studio Code to store your code.
(Visual Studio Code is quite useful. It understands some computer languages and will highlight things to help you read your code. We will explore it together.)
Space Oddity
Do you know how to read a file Coinvasion workshop.txt
with cat
?
The space between Coinvasion
and workshop
makes your system think they are two different files!
No worries, there’s a solution. Just add \
before the space like this:
cat Coinvasion\ workshop.txt
\
, the “backslash”, is a special character that can suppress another special character’s function (or give a character a special function). We will explore it later. (The next section has some examples, too.)
My recommendation is to avoid spaces in your file naming system (you probably already know that). Alternatives include using upper cases, hyphens, underscores… Such as,
CoinvasionWorkshop.txt
Coinvasion-workshop.txt
Coinvasion_workshop.txt
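If you do end up with a space in a file name, quoting is another way out besides the backslash. A minimal sketch, assuming a scratch directory and a made-up file content:

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
echo hello > "Coinvasion workshop.txt"   # a file name with a space

cat Coinvasion\ workshop.txt      # backslash: the space is part of the name
cat "Coinvasion workshop.txt"     # quotes do the same job

file="Coinvasion workshop.txt"
cat "$file"                       # quote variables too, or the name splits in two
```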
New Line and Carriage Return
Sometimes code behaves weirdly. For example, why do txt files sometimes have blank lines when loaded into R? This is especially common when you save an Excel file as a txt file.
So, a new line is actually a character \n
in Linux (and Unix) (and a tab is \t
). However, in Windows it’s \r\n
, and older Mac OS used \r
. Office Excel follows the Windows convention, but R will interpret both \r
and \n
as new lines. I recommend changing everything to \n
.
Fun Fact
Here, \n
means New line and \r
means carriage Return. That's because on a typewriter you first need to return the carriage (\r)
and then make a line break (\n).
Do you know how to remove \r
? (Just think about it and press Hint.)
Hint
sed 's/\r//g' file
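To check whether a file really contains \r, and to remove it, here is a minimal sketch assuming a made-up Windows-style file. It uses tr -d, which simply deletes the given character, as an alternative to the sed command in the Hint (that one relies on sed understanding \r, which GNU sed on Linux does).

```shell
demo=$(mktemp -d) && cd "$demo"   # scratch directory
printf 'a\r\nb\r\n' > windows.txt   # two lines with Windows line endings

wc -c < windows.txt                   # 6 bytes: a \r \n b \r \n
tr -d '\r' < windows.txt > unix.txt   # delete every carriage return
wc -c < unix.txt                      # 4 bytes: a \n b \n
```

The byte count dropping by one per line is a quick way to confirm that the carriage returns are gone.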