band

Knowing your dataset

Queries

To start to BLAST on local computers, we will first use two of my favorite genes, encoding DNA polymerase alpha subunit B, and Clathrin heavy chain, as examples. These genes are supposed to be single copied so they would be easier to deal with than many of genes of your interest.
I’ve took the coding sequences (CDSs; basically the things translated into proteins) from Amanita phalloides (a distantly related Amanita species) and stored them in a fasta file: DennyMaterials/BLAST/query.fasta. So let’s head to DennyMaterials/BLAST and look at the contents of query.fasta.


Databases

You should have already played with the genomes already. Copy the genome assembly (likely in ~/11666_SizeFiltered.fasta or ~/busco/11666_SizeFiltered.fasta) into the ~/DennyMaterials/BLAST directory. Also, we want to generate a protein database as well. Normally, it would be come from a genome annotation, but we will use BUSCOs instead.
So, we need to put all the BUSCOs into a file, and we will only use single-copied ones. (They should be in ~/busco/home/genhons*/11666_buscoOut/run_basidiomycota_odb10/busco_sequences/single_copy_busco_sequences). Figure out a way to do that with the commands we learned! (Remember cat?)

Hint To copy the genome, you do something like
cp [<path to genome>] .  
To make the protein database, you do something like
cat ~/busco/home/genhons*/11666_buscoOut/run_basidiomycota_odb10/busco_sequences/single_copy_busco_sequences/* > busco.fasta  
If your busco failed let us know.


Making databases

Making databases for Blast is quite straightforward. You only need makeblastdb, -dbtype and -in. (You might notice that we there’s only one hyphen before these multiple-character options, but normally there should be two. This is simply because different developers use different syntax. So, you’ll need to look at tutorials/manuals to learn these softwares.)

I will help you with the genome database. (Remember to load module first by running module load ncbi-blast-2.12.0+)

makeblastdb -dbtype nucl -in [<yourfasta>]

You want to replace [<yourfasta>] with your genome fasta file. And by -dbtype nucl you specify that it’s a nucleotide database. Check what’s in the directory!

Now, you also know how to make a protein database! (Use makeblastdb -help to get help.)

Click to learn more!
makeblastdb -dbtype prot -in [<yourfasta>]  
Click to learn more! It is worth noting that you can change output database name with -out. It is useful when you want to build a database for a fasta from a different folder. For example,
makeblastdb -dbtype nucl -in somedir/somefasta -out XXX  


BLAST!!

We will start with blastn, which is using a nucleotide query to search in a nucleotide database. Let’s try

blastn -db [<databasename>] -query query.fasta -out blastn.txt

[<databasename>] is your fasta filename if you didn’t use the -out option.

Now, take a look at blastn.txt. What did you see? Why is that? (Talk!!!)

How about we translate the genes into protein using Expasy, save them into fasta format and use the protein sequences to blast? (Which blast should you use?)

Hint
tblastn -db [<databasename>] -query [<proteinfasta>] -out tblastn.txt  


What did you find? Why does it perform differently from blastn?

Okay, that is all good, but the output format isn’t the best: it’s hard to extract information. I like to use -outfmt 6 instead. It gives you a “tab-delimited file”, which is basically a table but instead of grids, it uses tabs and new lines.

Hint
tblastn -db [<databasename>] -query [<proteinfasta>] -outfmt 6 -out tblastn-fmt6.txt  


Take a look at your output. What does each column mean?
Why does each query have multiple hits?

Now, let’s try to use our protein database.

Hint
blastp -db [<databasename>] -query [<proteinfasta>] -outfmt 6 -out blastp-fmt6.txt  


How’s the performance of it? When will it be more or less favored than a nucleotide database?

Get sequence

Often times we want to get the sequences instead of just having the hit range. There are multiple ways to do it. Here, we will use the Blast suite.

Let’s add an option -parse_seqids. To your makeblastdb.

makeblastdb -dbtype nucl -in [<yourfasta>] -parse_seqids

Adding this option will provide some more information to the database allowing later sequence extraction.

Then you can run:

blastdbcmd -db [<yourfasta>] -dbtype nucl -entry [<hitsequenceid>] -range [<hitsequencerange>] -strand [<plus or minus>]

Note that [<hitsequencerange>] should be typed like 3-21 and [<plus or minus>] depends on if you want the forward or reverse strand of the sequence.

You can also extract multiple genes at once. Such as,

blastdbcmd -db [<yourfasta>] -dbtype nucl -entry_batch [<yourregionsofinterest>]

yourregionsofinterest should be a file name including an entry, a strand direction and a range in each line, separated by a space. For example, seqid minus 3-21.


Extract Genes for Phylogenetics

Now, we want to extract genes to make a phylogeny. This way, we can compare our samples to the samples from previous studies. We have prepared the A. muscaria sequences from Geml et al. 2008 and Vargas et al. 2008, so you only need to extract the same four genes from your genome.

Head to DennyMaterials, make a directory called phyloBLAST and copy the query (/home/genhons1/DennyMaterials/Gemlquery.fasta) to the folder.

Next, copy your genome assembly to the folder, too. Everyone has a different sample. You can figure out the one for which you are responsible by looking into the files in ~/Data/Illumina/. And copy the corresponding assembly fasta file from /home/genhons1/PhylogenomicMaterials/genomes/ into phyloBLAST.

Use makeblastdb to create a database (remember to add -parse_seqids to allow sequences retrieval) and blastn to search for the four genes from Gemlquery.fasta in your genome.

Hint
makeblastdb -dbtype nucl -in [<yourgenomefasta>] -parse_seqids  
blastn -db [<yourgenomefasta>] -query Gemlquery.fasta -outfmt 6 -out blastn-Geml.txt  


Take a quick look at the results. Do you see multiple lines with the same subject ID for the top hit? It shouldn’t. But btub and tef1a are protein coding genes, why aren’t they fragments? Also, are there any query sequence without any hits?

Use blastdbcmd to extract the top hit sequence for each query sequence. Like below: (I’m showing one at a time so that it’s not too complicated.)

blastdbcmd -db [<databasename>] -dbtype nucl -entry [<subjectid>] -range [<subjectrange>] -strand [<plus or minus>] > [<genename>].fasta

[<subjectid>] is in the second column of your blast results. [<subjectrange>] is start-end of the subject (Note the start needs to be lower than end). And [<plus or minus>] is decided by if your blast hit is forward or reverse (if the “subject start” in your result is lower than “subject end”, it needs to be plus, otherwise it should be minus.)

Now, you should get four [<genename>].fasta. Rename the sequences with the sample names underscroll yourname (you can use sed, nano…; look like >[<samplename>]_[<yourname>])

Lastly, copy each fasta into corresponding /home/genhons1/StudentBlastResults/[<genename>] and change the file name into your sample name, underscroll gene name, underscroll your name. Remember to not copy into the file and then change the name: IT WILL MESS THINGS UP! It should look like this.

cp [<genename>].fasta /home/genhons1/StudentBlastResults/[<genename>]/[<samplename>]_[<genename>]_[<yourname>].fasta


See also