The computational toolkit of modern biology. From sequence searching to phylogenetic trees, and the programming languages and databases that make it all possible.
BLAST
BLAST (Basic Local Alignment Search Tool) is the most widely used bioinformatics program. Published by Altschul et al. in 1990, it searches sequence databases to find regions of similarity between a query sequence and database sequences. BLAST has been cited over 100,000 times โ among the most cited papers in all of science.
How BLAST Works
Seeding โ BLAST breaks the query into short words (default: 11-mers for DNA, 3-mers for protein). It looks for exact or near-exact matches of these words in the database. This initial step is fast because it uses a lookup table.
Extension โ from each seed hit, BLAST extends the alignment in both directions using dynamic programming. It stops when the score drops below a threshold (X-drop). This step is more expensive but only applied to promising seeds.
Scoring โ alignments are scored using substitution matrices. For proteins: BLOSUM62 (most common), PAM250 (distant homologs). For DNA: match/mismatch scores (+2/-3 default). Gap penalties: linear or affine (open + extend).
E-value โ Expected value. The number of alignments with this score or better that would be expected by chance in a database of this size. E-value < 1e-5 is usually significant for homology. E-value depends on database size and query length โ not just the alignment score.
Heuristic โ BLAST is not guaranteed to find the optimal alignment (unlike Smith-Waterman). It trades sensitivity for speed. For exhaustive searches, use Smith-Waterman (slow) or DIAMOND/MMseqs2 (fast heuristic alternatives).
BLAST Programs
blastn โ nucleotide query vs nucleotide database. For DNA/RNA sequence similarity. Megablast (default on NCBI) is optimized for highly similar sequences.
blastp โ protein query vs protein database. The classic protein search. Uses BLOSUM62 by default.
blastx โ translated nucleotide query (all 6 reading frames) vs protein database. For finding protein homologs from a DNA query (e.g., EST or RNA-seq contigs).
tblastn โ protein query vs translated nucleotide database. For finding unannotated genes in genome assemblies.
tblastx โ translated nucleotide vs translated nucleotide. Slowest, most sensitive for divergent homologs. Rarely used due to computational cost.
PSI-BLAST โ Position-Specific Iterated BLAST. Builds a position-specific scoring matrix (PSSM) from initial hits, then searches again with the PSSM. Each iteration finds more distant homologs. Excellent for detecting remote homology. Risk: profile corruption from inclusion of false positives.
Query sequence
→
Seed words (k-mers)
→
Extend hits
→
Score + E-value
→
Alignments
Running BLAST
NCBI web BLAST โ blast.ncbi.nlm.nih.gov. Free, searches all GenBank sequences. Results in minutes. Best for individual queries.
Command-line BLAST+ โ makeblastdb -in ref.fasta -dbtype nucl then blastn -query reads.fasta -db ref -outfmt 6 -evalue 1e-10. Install via conda or download from NCBI. Essential for batch processing.
DIAMOND โ 100-10,000x faster than BLAST for protein searches. Uses double indexing and spaced seeds. Standard for metagenomic protein annotation. diamond blastp -d nr -q query.faa -o results.tsv
MMseqs2 โ even faster than DIAMOND for clustering and searching. Used in the AlphaFold pipeline for building MSAs. By Martin Steinegger (Seoul National University).
🔍BLAST is like Google, but for DNA and proteins! You give it a sequence of letters (like ATCGATCG) and it searches through BILLIONS of known sequences to find matches. "Hey, this gene in your mystery bacteria looks 95% similar to a gene in E. coli!" It's the most-used tool in all of biology.
Multiple Sequence Alignment
Multiple sequence alignment (MSA) simultaneously aligns three or more sequences to identify conserved regions, functional motifs, and evolutionary relationships. MSA is the starting point for phylogenetics, structure prediction, and functional annotation.
Tools
Clustal Omega โ the successor to ClustalW. Uses guide-tree-based progressive alignment with HMM profiles. Handles thousands of sequences. The most recognized MSA tool name, though not always the most accurate. Web server at EBI.
MUSCLE โ Multiple Sequence Comparison by Log-Expectation. Iterative alignment: initial progressive alignment, then iterative refinement by tree bisection. MUSCLE5 (2021) by Robert Edgar uses ensemble methods and achieves high accuracy. Fast, widely used.
MAFFT โ Multiple Alignment using Fast Fourier Transform. Very fast for large datasets. Multiple strategies: L-INS-i (most accurate, O(n^3)), FFT-NS-2 (fast, O(n^2)), G-INS-i (global, accurate). Recommended for >1,000 sequences.
T-Coffee โ combines information from pairwise alignments (libraries) to build a more consistent MSA. Slower but often more accurate than progressive-only methods. Structural T-Coffee (3D-Coffee, Expresso) uses structure information when available.
HMMER โ profile HMM-based alignment and search. Not a traditional MSA tool but builds probabilistic models (profile HMMs) from alignments. Used by Pfam to define protein family profiles. hmmsearch finds distant homologs missed by BLAST.
Alignment Visualization
Jalview โ Java-based MSA editor and viewer. Coloring by conservation, physicochemical properties, or custom schemes. Tree display, annotations, web service integration. The standard MSA viewer.
AliView โ lightweight, fast MSA viewer. Handles very large alignments (millions of sequences). Editing capabilities. Cross-platform.
MView โ command-line tool for reformatting and coloring MSA output. Useful for converting between formats (FASTA, CLUSTAL, PIR, MSF).
Alignment Quality
MSA accuracy depends heavily on sequence identity. Above 40% identity: most tools produce similar, reliable alignments. Between 20-40% ("twilight zone"): alignment becomes difficult and tool choice matters significantly. Below 20%: alignment is unreliable; structure-based alignment is preferred.
Benchmarks โ BAliBASE, PREFAB, SABmark provide reference alignments (often structure-based) for evaluating MSA accuracy. MAFFT L-INS-i and MUSCLE5 consistently rank highest.
Phylogenetics reconstructs evolutionary relationships between sequences (or organisms) in the form of a tree. The tree topology shows which sequences share common ancestors, and branch lengths represent evolutionary distance (usually measured in substitutions per site).
Sequences
→
MSA
→
Trim
→
Model selection
→
Build tree
→
Phylogeny
Tree-Building Methods
Distance-based: Neighbor-Joining (NJ) โ compute a distance matrix from the MSA, then agglomeratively build the tree by joining closest neighbors. Fast (O(n^3)), good for large datasets. Does not explicitly model evolution. Tool: MEGA, RapidNJ.
Maximum Likelihood (ML) โ find the tree topology and branch lengths that maximize the probability of observing the data (MSA) under a substitution model. Statistically rigorous. Computationally expensive. Current gold standard. Tools: IQ-TREE 2 (fastest, automatic model selection via ModelFinder), RAxML-NG (optimized for large datasets, MPI-parallelized), PhyML (classic, user-friendly).
Bayesian inference โ explores tree space using MCMC (Markov Chain Monte Carlo). Produces a posterior distribution of trees with branch support (posterior probabilities). Computationally most expensive but provides the most information (uncertainty quantification). Tools: MrBayes, BEAST 2 (adds molecular clock, divergence dating, phylogeography).
Maximum Parsimony โ find the tree that requires the fewest evolutionary changes. Simple conceptually but statistically inconsistent (can converge on the wrong tree for unequal rates โ long-branch attraction). Largely superseded by ML and Bayesian methods. Tool: PAUP*, TNT.
Substitution Models
DNA โ JC69 (simplest: equal rates, equal frequencies), K2P/K80 (transition/transversion bias), HKY85 (unequal base frequencies + ti/tv bias), GTR (General Time Reversible: most complex, 6 rate parameters). Plus gamma (+G) for rate variation among sites and invariable sites (+I).
Protein โ WAG, LG, JTT, Dayhoff. Empirical matrices derived from large protein alignments. LG is the current default in most tools. ModelFinder (IQ-TREE) automatically selects the best model using BIC.
Codon models โ model DNA evolution at the codon level, distinguishing synonymous (dS) from non-synonymous (dN) substitutions. dN/dS (omega): omega < 1 = purifying selection, omega = 1 = neutral, omega > 1 = positive selection. Tools: PAML (codeml), HyPhy.
Branch Support
Bootstrap โ resample alignment columns with replacement (typically 1,000 replicates), rebuild tree from each resampled dataset, count how often each branch appears. Values >70% are generally considered well-supported. Ultrafast bootstrap (UFBoot2 in IQ-TREE) is 10-40x faster with comparable accuracy.
aLRT (approximate Likelihood Ratio Test) โ faster alternative to bootstrap. Tests whether the ML tree topology for a branch is significantly better than the NNI (nearest-neighbor interchange) alternative. SH-aLRT values >80% indicate good support.
Bayesian posterior probability โ proportion of MCMC samples containing a given branch. Values >0.95 indicate strong support. Can be overconfident compared to bootstrap, especially with short alignments.
Tree Visualization
FigTree โ the classic phylogenetic tree viewer. Reads Newick/Nexus format. Publication-quality figures. By Andrew Rambaut (BEAST developer).
iTOL (Interactive Tree Of Life) โ web-based tool for tree annotation and visualization. Add metadata (heatmaps, bar charts, colored ranges) to tree branches. Excellent for large trees. itol.embl.de.
ggtree โ R package for phylogenetic tree visualization using ggplot2 grammar. Extremely flexible. Integrates with tidyverse for data manipulation.
🌳Phylogenetics is building a family tree โ but for ALL living things! By comparing DNA sequences, scientists can figure out which species are cousins, which share a great-great-grandparent, and when they split apart. Humans and chimps are like siblings (99% same DNA). Humans and bananas are like very distant cousins (60% same DNA)!
R / Bioconductor
R is the dominant programming language for statistical bioinformatics. Bioconductor is a curated repository of R packages for biological data analysis, with over 2,200 packages covering genomics, proteomics, flow cytometry, imaging, and more.
Core Bioconductor Packages
DESeq2 โ differential gene expression analysis for RNA-seq count data. Uses negative binomial distribution, shrinkage estimation of dispersions, and Wald test. The gold standard for bulk RNA-seq DE analysis. Michael Love, Simon Anders, Wolfgang Huber.
edgeR โ alternative to DESeq2. Uses empirical Bayes estimation. Slightly different statistical framework; similar results in practice. Good for small sample sizes.
limma โ linear models for microarrays and RNA-seq (with voom transformation). Empirical Bayes moderation of standard errors. Fast, flexible, handles complex experimental designs. Gordon Smyth (WEHI).
GenomicRanges โ the foundational Bioconductor data structure for genomic intervals. GRanges objects represent chromosome regions (seqnames, ranges, strand). Operations: findOverlaps, reduce, intersect, setdiff. Underpins most Bioconductor genomics packages.
Seurat โ single-cell RNA-seq analysis. Normalization, feature selection, PCA, UMAP, clustering, differential expression, integration of multiple datasets. By Rahul Satija (New York Genome Center). The most widely used scRNA-seq package (though technically not on Bioconductor).
clusterProfiler โ gene set enrichment analysis. GO (Gene Ontology) enrichment, KEGG pathway enrichment, GSEA (Gene Set Enrichment Analysis). Publication-quality dot plots and enrichment maps. Guangchuang Yu.
VariantAnnotation โ read, filter, and annotate VCF files in R. Integration with GenomicRanges and BSgenome for variant-level analysis.
ComplexHeatmap โ publication-quality heatmaps with dendrograms, annotations, and split panels. By Zuguang Gu. The best heatmap package in any language.
Python is the dominant language for bioinformatics pipelines, data wrangling, machine learning, and tool development. Biopython is the core library for biological computation in Python.
Biopython
Seq objects โ represent biological sequences with methods for complement, reverse complement, translation, and transcription.
SeqIO โ read and write sequence files (FASTA, GenBank, FASTQ, Swiss-Prot, and 20+ formats). SeqIO.parse("seqs.fasta", "fasta") returns an iterator of SeqRecord objects.
AlignIO โ read and write multiple sequence alignments (Clustal, FASTA, Stockholm, Nexus, PHYLIP).
BLAST wrappers โ run BLAST from Python: NcbiblastpCommandline for local BLAST, NCBIWWW.qblast for remote NCBI BLAST.
PDB โ parse PDB/mmCIF structure files. PDBParser and MMCIFParser for reading. Calculate distances, angles, contacts. DSSP integration for secondary structure assignment.
Phylo โ read, draw, and manipulate phylogenetic trees (Newick, Nexus, PhyloXML). Basic tree visualization with matplotlib.
Key Python Libraries
pysam โ Python wrapper for htslib/SAMtools. Read/write BAM/CRAM/VCF files. Pileup, fetch, count operations. Essential for working with aligned sequencing data.
pyVCF / cyvcf2 โ parse VCF files. cyvcf2 (Cython-based) is much faster. Iterate over variants, filter, extract genotypes.
Scanpy โ single-cell analysis in Python. Equivalent of Seurat. AnnData object structure, preprocessing, clustering, DE, trajectory inference, visualization. By Fabian Theis lab (Helmholtz Munich).
scikit-learn โ machine learning for classification, clustering, regression, dimensionality reduction. Used in variant effect prediction, protein function prediction, drug response modeling.
PyTorch / TensorFlow โ deep learning frameworks. Used for AlphaFold, ESM (evolutionary scale modeling), DeepVariant, and countless biological ML models.
pandas โ data manipulation (DataFrames). Essential for any tabular bioinformatics data. Read CSV, merge datasets, filter, aggregate.
matplotlib / seaborn โ plotting. Volcano plots, heatmaps, scatter plots, genome browsers. Seaborn adds statistical visualization on top of matplotlib.
Python Example
# Parse FASTA, calculate GC content
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
for record in SeqIO.parse("genome.fasta", "fasta"):
gc = gc_fraction(record.seq) * 100
print(f"{record.id}: {len(record.seq)} bp, GC={gc:.1f}%")
Biological Databases
Sequence Databases
GenBank (NCBI) โ primary nucleotide sequence database. Part of the International Nucleotide Sequence Database Collaboration (INSDC) with ENA (Europe) and DDBJ (Japan). ~250 million sequences. All publicly available DNA/RNA.
UniProt โ the universal protein resource. Swiss-Prot (manually curated, ~570K entries) + TrEMBL (automatically annotated, ~250M entries). Protein function, domains, PTMs, interactions, variants, structure. The first place to look up any protein.
RefSeq (NCBI) โ curated, non-redundant reference sequences for genomes, transcripts, and proteins. NM_ (mRNA), NP_ (protein), NC_ (chromosome), XM_/XP_ (predicted). Higher quality than GenBank for reference purposes.
Ensembl โ genome browser and annotation database (EMBL-EBI). Gene models, regulatory elements, variation, comparative genomics. BioMart for bulk data access. REST API for programmatic access.
Functional Databases
Gene Ontology (GO) โ structured vocabulary for gene/protein function. Three ontologies: Biological Process, Molecular Function, Cellular Component. Used for enrichment analysis (which functions are over-represented in a gene list?).
KEGG โ Kyoto Encyclopedia of Genes and Genomes. Metabolic pathways, signaling pathways, disease pathways. Pathway maps link genes to biochemical reactions. Widely used for pathway enrichment analysis.
Reactome โ curated pathway database (human focus). Higher curation quality than KEGG for signaling and disease pathways. Open-source, peer-reviewed.
InterPro โ integrates protein family/domain databases (Pfam, PROSITE, PRINTS, SMART, CDD, etc.) into a single resource. Assign domains and predict function from sequence. ~47,000 entries covering ~90% of UniProt.
Structure Databases
PDB โ experimentally determined 3D structures. 220,000+ entries. Searchable by sequence, structure similarity, ligand, resolution.
AlphaFold DB โ 200+ million predicted protein structures. Per-residue confidence scores. Freely downloadable.
PDBe โ European PDB mirror with additional tools: PDBe-KB (knowledge base aggregating functional annotations onto structures), FunPDBe (community data deposition).
Clinical / Disease Databases
ClinVar โ variants and their clinical significance. Essential for clinical genomics interpretation.
OMIM โ Online Mendelian Inheritance in Man. Catalog of human genes and genetic disorders. Authoritative descriptions of genotype-phenotype relationships.
gnomAD โ population allele frequencies from 76K+ genomes. Variant constraint metrics (pLI, LOEUF) measure gene intolerance to loss-of-function.
PharmGKB โ pharmacogenomics database. Gene-drug-disease relationships. Drug dosing guidelines based on genotype (e.g., warfarin dosing by CYP2C9/VKORC1).
Snakemake โ Python-based workflow manager. Rules define input-output relationships; Snakemake resolves the DAG (Directed Acyclic Graph) and handles parallelism. Conda integration for reproducible environments. Popular in academic bioinformatics.
WDL (Workflow Description Language) โ developed by the Broad Institute. Used by GATK best-practices pipelines. Runs on Cromwell (local, HPC, cloud). Terra platform (Broad) provides a cloud execution environment.
Common Pipelines
nf-core/rnaseq โ FASTQ → FastQC → Trim Galore → STAR/HISAT2 alignment → featureCounts/Salmon quantification → MultiQC report. One command for a complete RNA-seq analysis.
Platform for learning bioinformatics through problem solving. 280+ problems from string algorithms to population genetics. Solve in any language. The Project Euler of biology.