Thursday, May 1, 2008

Setting up Standalone BLAST Software in Linux


Installing and running the stand-alone BLAST software on Linux.


Stand-alone BLAST is a local installation of the NCBI BLAST suite of programs. NCBI provides binaries for various platforms. The programs are the same as the NCBI BLAST programs, except that they run on your local machine.

The local version is useful when you have a large set of sequences to BLAST: the searches are not affected by Internet speed or traffic, and they can be automated.

Stand-alone BLAST can be downloaded from the NCBI FTP site (the link is in the toolbar at the bottom of the NCBI main page: “FTP Site -> Blast -> executables -> Latest”).

The file should be transferred in binary mode. Filenames have the following form:

program-version-architecture-os.extension

Remember to choose the appropriate architecture (32-bit or 64-bit). Download the file and extract the contents of the gzip'ed tar archive. The ‘.gz’ file extension indicates that the file has been compressed with gzip (a standard Unix compression utility). The ‘.tar’ extension indicates that the file is a tape archive created with tar (a standard Unix archiving tool).
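
As a small illustration of the naming scheme, the sketch below splits the archive name used in this post into its four parts with plain shell parameter expansion. It also prints the machine type reported by uname -m, which you can compare against the architecture part of the file name. The variable names are only for illustration, and the script assumes the name ends in .tar.gz.

```shell
#!/bin/sh
# Split a name of the form program-version-architecture-os.extension.
# Assumption: the extension is .tar.gz and no field contains a '-'.
archive="blast-2.2.18-ia32-linux.tar.gz"

base="${archive%.tar.gz}"    # blast-2.2.18-ia32-linux
program="${base%%-*}"        # text before the first '-'
rest="${base#*-}"            # drop "program-"
version="${rest%%-*}"        # text before the next '-'
rest="${rest#*-}"            # drop "version-"
arch="${rest%%-*}"           # architecture field
os="${rest#*-}"              # whatever is left is the OS

echo "program=$program version=$version arch=$arch os=$os"

# Machine type of the system you are on (for example i686 or x86_64).
echo "this machine: $(uname -m)"
```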

To uncompress the file with ‘gunzip’ and extract the files from the archive into the current working directory, use the commands given below.

jk@jk:~/Desktop$ gunzip blast-2.2.18-ia32-linux.tar.gz # uncompress

jk@jk:~/Desktop$ tar -xpf blast-2.2.18-ia32-linux.tar # extract

For more information on the options, see the manual pages (man tar, man gunzip).
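
The two steps can also be combined into one. This is only a sketch, assuming a tar that supports the -z option (GNU tar does). The helper names extract_archive and list_archive are made up for this example and are not part of BLAST.

```shell
#!/bin/sh
# extract_archive ARCHIVE [DEST]
# Uncompress and extract a .tar.gz in one step; -z filters through gzip,
# -p keeps the file permissions, -C switches to DEST before extracting.
extract_archive() {
    archive=$1
    dest=${2:-.}
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xzpf "$archive" -C "$dest"
}

# list_archive ARCHIVE
# Show what the archive contains without extracting anything.
list_archive() {
    tar -tzf "$1"
}
```

With these defined, extract_archive blast-2.2.18-ia32-linux.tar.gz would unpack the archive into the current directory, and list_archive on the same file lets you check the contents first.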

When you go into the extracted directory you will see three other directories (bin, data, doc). The doc directory contains the README files for each program. The data directory contains the scoring matrices. The bin directory contains all the executables for running the various BLAST searches.

How to execute bl2seq (BLAST two sequence):

Bl2seq performs a comparison between two sequences using either the blastn or blastp algorithm. With these two programs, both sequences must be of the same type: either both nucleotide or both protein.

The input files to any BLAST program should always be in FASTA format.

For example:
>gi|229673|pdb|1ALC| Alpha-Lactalbumin
KQFTKCELSQNLYDIDGYGRIALPELICTMFHTSGYDTQAIVENDESTEYGLFQISNALWCKSSQSPQSR
NICDITCDKFLDDDITDDIMCAKKILDIKGIDYWIAHKALCTEKLEQWLCEKE
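
Before running BLAST it can help to sanity-check an input file. The sketch below writes the example record above to a temporary file and then checks three things with standard Unix tools: that the file starts with a '>' header, how many records it holds, and how many residues the sequence has. The variable names are illustrative only.

```shell
#!/bin/sh
# Write the example record to a temporary file.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>gi|229673|pdb|1ALC| Alpha-Lactalbumin
KQFTKCELSQNLYDIDGYGRIALPELICTMFHTSGYDTQAIVENDESTEYGLFQISNALWCKSSQSPQSR
NICDITCDKFLDDDITDDIMCAKKILDIKGIDYWIAHKALCTEKLEQWLCEKE
EOF

# A FASTA file must begin with a '>' header line.
first_char=$(head -n 1 "$fasta" | cut -c 1)

# Each record has exactly one header line, so counting headers counts records.
n_records=$(grep -c '^>' "$fasta")

# Residues: drop the header lines, remove newlines, count what is left.
n_residues=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "first character: $first_char"
echo "records: $n_records"
echo "residues: $n_residues"

rm -f "$fasta"
```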

Syntax:

jk@jk:~/Desktop/blast-2.2.18/bin$ ./bl2seq - # displays all options

You can choose the options you need. The mandatory options are -p, -i and -j. The other options can be set explicitly, or else the program will use the default value.

jk@jk:~/Desktop/blast-2.2.18/bin$ ./bl2seq -p blastp -e 0.01 -i file1 -j file2 # blastp compares two protein sequences
-i First sequence [File In]
-j Second sequence [File In]
-p Program name: blastp, blastn, blastx, tblastn, tblastx. For blastx the first sequence should be nucleotide; for tblastn the second sequence should be nucleotide.
-e E-value (optional)

The two input files (file1, file2) should be in the current working directory (/blast-2.2.18/bin) for the above syntax to work. If they are not, give the appropriate path. If you have multiple FASTA sequences to compare, you can automate the above command using shell scripts.
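
One way such a script could look is sketched below. It compares one reference file against any number of query files, using the same bl2seq options as above, and writes one report per query next to the query file. The function name compare_to_reference, the BL2SEQ variable and the _vs_ref.txt report naming are assumptions made for this example.

```shell
#!/bin/sh
# Path to the bl2seq binary; set BL2SEQ beforehand to use another location.
BL2SEQ="${BL2SEQ:-./bl2seq}"

# compare_to_reference REF QUERY...
# Runs blastp between REF and each QUERY file, one report per query.
compare_to_reference() {
    ref=$1
    shift
    for query in "$@"; do
        # Skip names that do not exist (for example an unmatched glob).
        [ -f "$query" ] || continue
        name=$(basename "$query")
        report="$(dirname "$query")/${name%.*}_vs_ref.txt"
        # bl2seq prints the alignment to standard output; save it.
        "$BL2SEQ" -p blastp -e 0.01 -i "$ref" -j "$query" > "$report" \
            || echo "bl2seq failed for $query" >&2
    done
}
```

A call such as compare_to_reference file1 queries/*.fasta would then produce one report for every FASTA file in a (hypothetical) queries directory.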

How to execute Blastall:

Blastall is the most commonly used tool. It can run all the BLAST programs: blastp, blastn, blastx, tblastn and tblastx. Unlike bl2seq, blastall is used when you have multiple FASTA sequences as input queries to be searched against the appropriate protein or nucleotide database.
You can download a protein or nucleotide database from Swiss-Prot or NCBI. For example, to download the human chr22,

go to NCBI-> FTP site-> RefSeq-> H_sapiens-> H_sapiens ->chr22.

Note:

FASTA-formatted files cannot be used directly as a database by the BLAST programs. You need to prepare the FASTA file for BLAST with formatdb. This indexes the entries in the FASTA file and enables BLAST to run much faster.
Uncompress the database. It will look like the one below if it is a protein sequence database. A multiple-sequence input query to blastall will look similar to this.

>gi|86438068|gb|AAI12638.1| HGD protein [Bos taurus]
MTELKYISGFGNECASEDPRCPGALPEGQNNPQVCPYNLYAEQLSGSAFTCPRSTNKRSWLYRILPSVSH
KPFEFIDQGHITHNWD
>gi|116283875|gb|AAH44758.1| Hgd protein [Mus musculus]
MSVLQRILAVQVPCPKDSWLYRILPSVSHKPFESIDQGHVTHNWDEVGPDPNQLRWKPFEIPKASEKKVD
FVSGLYTLCGAGDIKSNNGLAVHIFLCNSSMENRCFYNSDGDFLIVPQKGKLLIYTEFGKMSLQPNEICV
>gi|116283724|gb|AAH24369.1| Hgd protein [Mus musculus]
MSVLQRILAVQVPCPKDSWLYRILPSVSHKPFESIDQGHVTHNWDEVGPDPNQLRWKPFEIPKASEKKVD



Formatdb:


jk@jk:~/Desktop/blast-2.2.18/bin$ ./formatdb - # displays all options

jk@jk:~/Desktop/blast-2.2.18/bin$ ./formatdb -i <database file> -o T -p T

-i Input file(s) for formatting (this parameter must be set) [File In]
-p Type of file: T - protein, F - nucleotide (default = T)
-o Parse options: T - True: parse SeqId and create indexes. F - False: do not parse SeqId (default = F)

The input database should be in the current working directory (/blast-2.2.18/bin) for the above syntax to work. If it is not, give the appropriate path.

After running formatdb you will see seven index and data files alongside the original input file. All seven files are required for blastall to run. Make sure the original database file and the generated files are kept in the same directory. Check the contents of formatdb.log for error messages.
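
The formatting step and the check on the generated files can be wrapped in a short script. This is a sketch only: the helper names format_database and count_generated_files and the FORMATDB variable are invented for this example, and the count assumes formatdb writes its files next to the input under the same base name.

```shell
#!/bin/sh
# Path to the formatdb binary; set FORMATDB beforehand to use another location.
FORMATDB="${FORMATDB:-./formatdb}"

# format_database FILE TYPE
# TYPE is "protein" (passed as -p T) or "nucleotide" (passed as -p F).
format_database() {
    db=$1
    case $2 in
        protein)    is_protein=T ;;
        nucleotide) is_protein=F ;;
        *) echo "type must be protein or nucleotide" >&2
           return 2 ;;
    esac
    "$FORMATDB" -i "$db" -p "$is_protein" -o T
}

# count_generated_files FILE
# Counts the files named FILE.<something> that sit next to FILE.
count_generated_files() {
    count=0
    for f in "$1".*; do
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

After format_database on a protein database, count_generated_files on the same path should report the seven files mentioned above. A smaller number is a hint to look at formatdb.log.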

Executing Blastall:

jk@jk:~/Desktop/blast-2.2.18/bin$ ./blastall -i <query file> -p blastp -d <database> -o <output file>

-p Program Name [String] Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".
-d Database [String] (default = nr). The database specified must first be formatted with formatdb.
-i Query File [File In]
-o BLAST report Output File [File Out]

The query file and the formatted database should be in the current working directory (/blast-2.2.18/bin) for the above syntax to work. If they are not, give the appropriate paths.

The output file will contain the BLAST output for all the input query sequences.
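
For repeated searches the blastall call can be wrapped in a small function. The sketch below uses only the four options described above. The names run_blastp, count_queries and the BLASTALL variable are assumptions for this example, not part of the BLAST package.

```shell
#!/bin/sh
# Path to the blastall binary; set BLASTALL beforehand to use another location.
BLASTALL="${BLASTALL:-./blastall}"

# count_queries FILE
# Number of query sequences in a FASTA file (one '>' header per sequence).
count_queries() {
    grep -c '^>' "$1"
}

# run_blastp QUERY DATABASE REPORT
# Searches every sequence in QUERY against a database prepared with formatdb.
run_blastp() {
    query=$1
    db=$2
    report=$3
    # Refuse to start if the query file is missing or empty.
    if [ ! -s "$query" ]; then
        echo "query file is missing or empty: $query" >&2
        return 1
    fi
    "$BLASTALL" -p blastp -i "$query" -d "$db" -o "$report"
}
```

Calling count_queries on the query file before run_blastp tells you how many results to expect in the report.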