Genomics & Sequencing

Reading the book of life at scale. From Sanger sequencing to nanopore real-time reads, and the computational tools that turn raw data into biological insight.

History of Sequencing

Sanger Sequencing (1977)

Frederick Sanger developed chain-termination sequencing using dideoxynucleotides (ddNTPs). Each ddNTP (ddATP, ddCTP, ddGTP, ddTTP) lacks the 3'-OH needed for elongation, so incorporating one terminates the chain at that position. Because incorporation is random, the reaction yields fragments ending at every position. Gel electrophoresis separates fragments by size, revealing the sequence. Sanger received his second Nobel Prize for this (1980). Automated capillary electrophoresis Sanger sequencing (Applied Biosystems 3730) can read ~800-1000 bp per read with ~99.99% accuracy. Still the gold standard for confirming short sequences.

The Human Genome Project (1990-2003)

International effort to sequence the entire human genome. Cost: ~$2.7 billion. Took 13 years. Used BAC-by-BAC (bacterial artificial chromosome) clone-based Sanger sequencing.
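Draft published in 2001 (Nature, Science); the sequence was declared essentially complete in 2003, although roughly 8% of the genome (mostly repetitive regions) remained unresolved. Revealed ~20,000-25,000 protein-coding genes (fewer than expected), vast amounts of repetitive DNA, and extensive conserved non-coding regions.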
Craig Venter's parallel whole-genome shotgun approach (Celera Genomics) demonstrated that shotgun assembly could work for large genomes, leading to current sequencing strategies.

Next-Generation Sequencing Revolution (2005-present)

Massively parallel sequencing technologies dropped the cost of a human genome from $2.7 billion to under $200 (2024). Key platforms: 454 (pyrosequencing, 2005, discontinued), Illumina (sequencing by synthesis, 2006, dominant), Ion Torrent (semiconductor, 2010), PacBio (single-molecule, 2011), Oxford Nanopore (nanopore, 2014).

1977 — Sanger Sequencing

Chain termination. First genomes. ~800 bp reads.

2005 — Next-Gen (454, then Illumina)

Sequencing by synthesis. Short reads, massive throughput.

2014 — Long-Read (Nanopore + PacBio)

Single-molecule. 10-100 kb reads. Resolves repeats.

2022 — T2T Complete Genome

First truly complete human genome. +200 Mbp resolved.

T2T Consortium (2022)

The Telomere-to-Telomere (T2T) Consortium completed the first truly complete human genome sequence — including centromeres, telomeres, and ribosomal DNA arrays that were missing from the original Human Genome Project assembly. Added ~200 million base pairs (8% of the genome) previously unresolved. Used a combination of PacBio HiFi and Oxford Nanopore ultra-long reads.

📖 Imagine your DNA is a book with 3 billion letters. Sequencing is reading that whole book! The first time we read a human genome took 13 years and cost $2.7 billion. Now we can do it in a day for about $200. Some machines are so small they fit in your pocket and plug into a laptop via USB!

Illumina Sequencing

Illumina dominates the sequencing market (~80% of all sequencing data worldwide). Their technology, sequencing by synthesis (SBS), generates massive amounts of short, accurate reads at low cost.

How It Works

Library preparation — fragment DNA to ~300-600 bp, ligate adapters to both ends. Adapters contain sequences for flow cell binding, amplification primers, and sample-specific barcodes (indexes) for multiplexing.
Cluster generation — library fragments bind to a glass flow cell coated with complementary oligos. Bridge amplification creates ~1,000-copy clusters from each fragment. Each cluster generates enough signal for detection.
Sequencing by synthesis — fluorescently labeled, reversibly terminated nucleotides are added one at a time. After each incorporation, the flow cell is imaged (all clusters simultaneously). The terminator and fluorophore are cleaved, and the next cycle begins. Typically 150-300 cycles per read.
Data output — FASTQ files containing reads + base quality scores (Phred scale: Q30 = 99.9% accuracy, Q40 = 99.99%). Paired-end sequencing: both ends of each fragment are read, providing pairs of reads ~300-600 bp apart.
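The Phred scale in the data output step follows Q = -10 · log10(P), where P is the probability that a base call is wrong. A minimal sketch of decoding a FASTQ quality line, assuming the common Phred+33 ASCII encoding; the function names and the example record are made up for illustration:

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(qual_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - offset for ch in qual_line]


def parse_fastq_record(lines: list[str]) -> dict:
    """Parse one four-line FASTQ record (header, sequence, '+', qualities)."""
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return {"name": header[1:], "seq": seq, "quals": decode_qualities(qual)}


# Hypothetical record, made up for illustration.
record = parse_fastq_record(["@read1", "ACGT", "+", "II5?"])
expected_errors = sum(phred_to_error_prob(q) for q in record["quals"])
print(record["quals"])            # [40, 40, 20, 30]
print(round(expected_errors, 4))  # 0.0112
```

Summing the per-base error probabilities gives the expected number of errors in a read, which is one common basis for quality filtering.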

Fragment DNA → Ligate adapters → Cluster on flow cell → SBS + image → FASTQ output

Platforms

NovaSeq X Plus — flagship. Up to 16 Tb per run, 52 billion reads, using two 25B flow cells run in parallel. Cost: ~$200 per human genome at 30x coverage. For population-scale sequencing.
NextSeq 2000 — mid-range. 360 Gb per run. Flexible for exomes, targeted panels, RNA-seq, and small genomes.
MiSeq — benchtop. Up to 15 Gb per run. Longest Illumina reads (2x300 bp). Popular for amplicon sequencing, metagenomics, and small projects.
iSeq 100 — smallest. 1.2 Gb per run. For infectious disease panels, targeted sequencing, and education.
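The per-run yields listed for these platforms translate into sequencing depth through simple arithmetic: mean coverage = total bases sequenced / genome size. A rough sketch, assuming a human genome of about 3.1 Gb and ignoring duplicates, low-quality reads, and uneven coverage (the constant and function names are illustrative):

```python
HUMAN_GENOME_BP = 3.1e9  # assumed approximate haploid human genome size


def mean_coverage(total_bases: float, genome_size: float = HUMAN_GENOME_BP) -> float:
    """Average number of times each genome position is covered."""
    return total_bases / genome_size


def genomes_per_run(run_yield_bases: float, target_coverage: float,
                    genome_size: float = HUMAN_GENOME_BP) -> int:
    """Whole genomes that fit in one run at the target coverage (idealized)."""
    return int(run_yield_bases // (genome_size * target_coverage))


# Yields taken from the platform list above.
print(genomes_per_run(16e12, 30))     # NovaSeq X Plus, 16 Tb: 172
print(genomes_per_run(360e9, 30))     # NextSeq 2000, 360 Gb: 3
print(round(mean_coverage(15e9), 1))  # MiSeq, 15 Gb on a human genome: 4.8
```

Real projects plan for fewer genomes per run than this upper bound, since some yield is always lost to filtering and index imbalance.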

Strengths & Limitations

Strengths — lowest cost per base, highest accuracy (~99.9% per base), enormous throughput, mature ecosystem of analysis tools.
Limitations — short reads (150-300 bp) struggle with repetitive regions, structural variants, and de novo assembly of complex genomes. PCR amplification can introduce biases (GC bias, duplicate reads). Library prep takes hours.

Oxford Nanopore Technology

Oxford Nanopore Technologies (ONT) sequences DNA or RNA by threading individual molecules through protein nanopores and measuring the resulting electrical current changes. No amplification, no fluorescence — direct, single-molecule, real-time sequencing.

How It Works

Nanopore — a modified CsgG protein pore (from E. coli curli secretion system) is embedded in an electrically resistant polymer membrane. A voltage is applied across the membrane, driving ions through the pore and generating a measurable current (~100-150 pA).
Motor protein — a helicase-type motor protein (attached to the DNA during library prep) ratchets the strand through the pore one nucleotide at a time (~450 bases/second). Without the motor, translocation would be too fast for accurate base-calling.
Signal processing — each 5-mer of nucleotides occupying the pore constriction produces a characteristic current level. A neural network (base-caller) converts the raw current signal into nucleotide sequence. Latest basecallers (Dorado, SUP mode) achieve ~99% single-read accuracy; consensus accuracy exceeds 99.9%.
Direct RNA sequencing — native RNA can be sequenced directly (no reverse transcription). Detects RNA modifications (m6A, pseudouridine) in situ. The only platform that sequences RNA without converting to cDNA.
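The idea that each k-mer in the pore produces a characteristic current level can be illustrated with a toy lookup model. This is only a sketch: the current levels are invented and evenly spaced, the signal has one clean sample per k-mer, and decoding is a nearest-level lookup. Production basecallers such as Dorado use neural networks on noisy, variable-speed signal, which this does not attempt to reproduce.

```python
from itertools import product

K = 5  # k-mer size, following the 5-mer description above
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

# Hypothetical current levels in pA. Real pore models are measured empirically
# and neighbouring k-mers often have overlapping levels.
LEVEL_PA = {kmer: 60.0 + 0.05 * i for i, kmer in enumerate(KMERS)}


def sequence_to_signal(seq: str) -> list[float]:
    """One idealized current sample per k-mer as the strand steps through the pore."""
    return [LEVEL_PA[seq[i:i + K]] for i in range(len(seq) - K + 1)]


def signal_to_sequence(signal: list[float]) -> str:
    """Toy 'basecaller': pick the k-mer with the nearest level, then stitch k-mers."""
    kmers = [min(KMERS, key=lambda k: abs(LEVEL_PA[k] - x)) for x in signal]
    return kmers[0] + "".join(k[-1] for k in kmers[1:])


truth = "GATTACAGGC"
signal = sequence_to_signal(truth)
noisy = [x + 0.01 for x in signal]  # small offset, below half the level spacing
print(signal_to_sequence(noisy) == truth)  # True
```

The toy only works because the invented levels are well separated; with realistic overlap between k-mer levels, context across many samples is needed, which is why learned models are used.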

Platforms

MinION — pocket-sized USB device. 512 nanopores. Up to 50 Gb per flow cell. ~$1,000 starter pack. Used in field sequencing (Ebola outbreak in Guinea, International Space Station, Antarctic expeditions).
GridION — benchtop, runs 5 MinION flow cells simultaneously. Up to 250 Gb per run.
PromethION — high-throughput. 24 or 48 flow cells (PromethION 24 and PromethION 48). Up to 12 Tb per run. Competitive with Illumina for large projects at lower capital cost. ~$200-300 per human genome.
Flongle — disposable, low-cost adapter for MinION/GridION. 126 nanopores. ~2 Gb. For small, targeted experiments and rapid diagnostics.

Ultra-Long Reads

Nanopore's headline capability: reads exceeding 1 million bases (current record: >4 Mb). Ultra-long reads span entire structural variants, centromeric repeats, and segmental duplications that are unresolvable with short reads. Critical for the T2T complete human genome assembly. Library prep using high-molecular-weight DNA extraction (phenol-chloroform, agarose plug) is key to achieving ultra-long reads.

PacBio (Pacific Biosciences)

PacBio uses Single-Molecule Real-Time (SMRT) sequencing. A DNA polymerase, fixed at the bottom of a nanophotonic well (zero-mode waveguide, ZMW), synthesizes a complementary strand while fluorescent nucleotide incorporations are observed in real time.

How It Works

Zero-Mode Waveguides (ZMWs) — nanoscale holes (~70 nm diameter) in a metal film on glass. Light cannot propagate through (below the diffraction limit), creating an observation volume of ~20 zeptoliters. Only the nucleotide being incorporated by the polymerase at the bottom is illuminated, eliminating background from free-floating nucleotides.
SMRTbell library — DNA is ligated into circular templates (SMRTbell: hairpin adapters on both ends). The polymerase reads around the circle multiple times, generating "subreads" that are averaged into a high-accuracy consensus (HiFi reads).
HiFi reads — introduced 2019. Circular consensus sequencing (CCS) of ~15-20 kb inserts, with the polymerase reading the insert 3-10 times. Consensus accuracy: >99.9% (Q30+). Read length: 10-25 kb. The best of both worlds: long and accurate.
Continuous Long Reads (CLR) — older mode. Single pass, up to 50-100 kb reads. Lower per-read accuracy (~85-90%) but very long. Useful when length matters more than accuracy. Being phased out in favor of HiFi.
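The way repeated passes around a SMRTbell turn error-prone subreads into an accurate HiFi read can be sketched as a per-position majority vote. This is a deliberate simplification: it assumes the subreads are already aligned and of equal length with substitution errors only, whereas the real CCS algorithm must handle insertions and deletions and uses a probabilistic model. The example passes are made up.

```python
from collections import Counter


def majority_consensus(subreads: list[str]) -> str:
    """Column-wise majority vote over pre-aligned, equal-length subreads."""
    if len({len(s) for s in subreads}) != 1:
        raise ValueError("subreads must be non-empty and of equal length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


# Five hypothetical passes over the same insert, each with at most one error.
truth = "ACGTACGTAC"
passes = [
    "ACGTACGTAC",
    "ACGAACGTAC",  # error at position 3
    "ACGTACGTTC",  # error at position 8
    "TCGTACGTAC",  # error at position 0
    "ACGTACGTAC",
]
print(majority_consensus(passes) == truth)  # True
```

Because errors in different passes rarely fall on the same position, each one is outvoted, which is the intuition behind consensus accuracy exceeding single-pass accuracy.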

Platforms

Revio — current flagship (2023). 25 million ZMWs per SMRT Cell (vs 8M on Sequel IIe). Up to 90 Gb HiFi per SMRT Cell, 360 Gb per run (4 cells). Under $1,000 per human genome at 30x coverage, according to PacBio's launch figures. Dramatic cost reduction over Sequel IIe.
Sequel IIe — predecessor. 8 million ZMWs. Up to 30 Gb HiFi per SMRT Cell. Being replaced by Revio.
Onso — PacBio's short-read sequencer (2023). Sequencing by binding (SBB) — detects nucleotide identity without incorporating (no scarring). Competes with Illumina on accuracy and cost for short-read applications.

Applications

De novo genome assembly — HiFi reads produce near-complete assemblies with N50 >10 Mb. Combined with Hi-C data, achieve chromosome-scale scaffolds. The current method of choice for reference-quality genomes.
Structural variant detection — long reads span SVs (deletions, insertions, inversions, translocations) that are invisible to short reads. Particularly important for understanding cancer genomes and rare diseases.
Epigenetics — SMRT sequencing detects DNA methylation (5mC, 6mA) directly during sequencing by measuring polymerase kinetics (IPD: inter-pulse duration). No bisulfite conversion needed.
Isoform sequencing (Iso-Seq) — full-length cDNA sequencing. Captures complete transcript isoforms without assembly. Reveals alternative splicing, novel genes, and fusion transcripts.

Genome Assembly

Genome assembly reconstructs the complete genomic sequence from short or long sequencing reads. It is a computational puzzle: overlap reads, resolve repeats, and produce contiguous sequences (contigs) that ideally span entire chromosomes.

Approaches

De novo assembly — assemble without a reference genome. Required for new species, highly divergent genomes, and capturing structural variation. More computationally expensive.
Reference-guided assembly — align reads to a reference genome and call variants (differences). Faster and easier, but biased toward the reference and misses novel sequences.

Assembly Algorithms

Overlap-Layout-Consensus (OLC) — find all pairwise overlaps between reads, build an overlap graph, find a path through the graph (layout), derive the consensus sequence. Used for long reads. Tools: Canu, Flye, Hifiasm.
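De Bruijn graph — decompose reads into k-mers (subsequences of length k), then build a graph whose nodes are (k-1)-mers and whose edges are k-mers, each linking its prefix to its suffix. An Eulerian path, which uses every edge exactly once, spells out the assembled sequence. Memory-efficient, handles massive numbers of short reads. Tools: SPAdes, MEGAHIT, Velvet.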
String graph — similar to OLC but uses irreducible overlaps. More memory-efficient for long reads. Used by Hifiasm (the current best HiFi assembler).
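The de Bruijn approach can be sketched on a tiny example. The code assumes error-free reads, a single strand, and a sequence with no repeated (k-1)-mers, so that exactly one Eulerian path exists; real assemblers such as SPAdes add error correction, graph simplification, and repeat handling. All names and reads here are illustrative.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in sorted(nodes)
                  if len(graph.get(n, [])) - in_deg[n] == 1), min(nodes))
    remaining = {n: list(targets) for n, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads: list[str], k: int) -> str:
    """Spell the sequence along the Eulerian path: first node plus last base of each next node."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GCGTGC", "GTGCA"]  # hypothetical overlapping reads
print(assemble(reads, k=4))  # ATGGCGTGCA
```

Choosing k is the main trade-off: larger k resolves more repeats but requires longer error-free overlaps between reads.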

Assembly Quality Metrics

N50 — the contig length such that contigs of this length or longer together cover at least 50% of the total assembly length. Higher values indicate a more contiguous assembly. A human genome assembly from HiFi reads typically has contig N50 of 30-50 Mb.
BUSCO — Benchmarking Universal Single-Copy Orthologs. Measures completeness by checking for expected conserved genes. A complete assembly has >95% BUSCO genes as complete single-copy.
QV (Quality Value) — consensus accuracy. QV50 = 1 error per 100,000 bases. QV60 = 1 error per 1 million bases. Merqury tool calculates QV from k-mer analysis.
Phasing — diploid organisms have two copies of each chromosome. Phased assemblies separate maternal and paternal haplotypes. Trio binning (using parental short reads to assign long reads to haplotypes) or Hi-C-based phasing. Hifiasm can produce partially phased assemblies without parental data.
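Two of these metrics reduce to short calculations. N50 is found by sorting contigs from longest to shortest and accumulating lengths until half the assembly is covered; QV is a Phred-scaled error rate, QV = -10 · log10(error rate). A minimal sketch with made-up contig lengths (the function names are illustrative, and this is not how Merqury estimates the error rate itself):

```python
import math


def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def qv_to_error_rate(qv: float) -> float:
    """Per-base error rate implied by a consensus quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Consensus quality value implied by a per-base error rate."""
    return -10 * math.log10(error_rate)


print(n50([80, 70, 50, 40, 30, 20]))    # 70 (80 + 70 = 150 >= 145)
print(round(1 / qv_to_error_rate(50)))  # 100000 bases per error
print(round(error_rate_to_qv(1e-6)))    # 60
```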

Read Alignment

Read alignment (mapping) places sequencing reads onto a reference genome, finding the position and orientation where each read best matches. It is the first step in most resequencing analyses (variant calling, gene expression quantification, ChIP-seq peak calling).

Short-Read Aligners

BWA-MEM2 — the gold standard for short reads. Uses a Burrows-Wheeler Transform (BWT) index of the reference for fast seed finding, then extends seeds with Smith-Waterman alignment. Handles paired-end reads, split reads, supplementary alignments. Successor to BWA-MEM.
Bowtie2 — fast short-read aligner using FM-index. Good for smaller genomes and ChIP-seq. Slightly faster than BWA for single-end reads.
STAR — splice-aware aligner for RNA-seq. Maps reads across exon-exon junctions. Two-pass mode: first pass discovers novel splice sites, second pass uses them. Standard for RNA-seq alignment.
HISAT2 — successor to TopHat/TopHat2. Uses a graph-based index incorporating known SNPs and splice sites. Memory-efficient alternative to STAR.
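The BWT and FM-index used by BWA and Bowtie2 allow exact matches to be found by "backward search", processing the pattern from its last character to its first. A small sketch of the idea, using a naive suffix sort and uncompressed counting; real aligners use sampled, compressed indexes and then extend these exact seeds with gapped alignment. The reference string and function names are illustrative.

```python
def build_index(reference: str):
    """Suffix array, BWT string, and C table (count of characters smaller than c)."""
    text = reference + "$"  # sentinel sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    return suffix_array, bwt, c_table


def occ(bwt: str, ch: str, i: int) -> int:
    """Occurrences of ch in bwt[:i] (a real FM-index precomputes this)."""
    return bwt[:i].count(ch)


def find_exact(pattern: str, suffix_array, bwt, c_table) -> list[int]:
    """Backward search: narrow the suffix-array interval one character at a time."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_index("GATTACAGATTACA")
print(find_exact("TTAC", *index))  # [2, 9]
print(find_exact("GGG", *index))   # []
```

The search cost depends on the pattern length rather than the reference length, which is what makes this family of indexes practical for genome-sized references.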

Long-Read Aligners

Minimap2 — the universal long-read aligner. Handles PacBio HiFi, PacBio CLR, Oxford Nanopore, and even short reads. Uses minimizer-based seeding. Extremely fast. Also used for all-vs-all overlap finding in assembly. By Heng Li (author of BWA, SAMtools).
Winnowmap2 — specialized for mapping to repetitive regions (centromeres, segmental duplications). Uses weighted minimizers to improve mapping in regions where Minimap2 struggles.
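Minimizer-based seeding keeps only a small, position-stable subset of k-mers: in every window of w consecutive k-mers, the smallest one is selected. A simplified sketch using plain lexicographic order on one strand; Minimap2 itself orders k-mers by a hash and considers both strands, and Winnowmap2 additionally down-weights frequent k-mers, neither of which is modelled here.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """(position, k-mer) pairs selected as the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by taking the leftmost occurrence.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


reference = "GATTACAGATTACA"  # hypothetical sequence
for position, kmer in minimizers(reference, k=3, w=4):
    print(position, kmer)
# 1 ATT
# 4 ACA
# 6 AGA
# 8 ATT
# 11 ACA
```

Here 12 k-mers are reduced to 5 seeds. Two sequences that share a long enough exact stretch select the same minimizers within it, so matching minimizers serve as anchors for alignment.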

Output Format: SAM/BAM/CRAM

SAM — Sequence Alignment/Map format. Tab-delimited text. Each line: read name, flags, reference position, mapping quality, CIGAR string (alignment operations), sequence, quality. Human-readable but large.
BAM — compressed binary SAM. Standard working format. Indexed (.bai) for random access. Tools: SAMtools, Picard.
CRAM — reference-based compression. 30-60% smaller than BAM. Becoming standard for archival storage (ENA, SRA). Requires the reference genome for decompression.
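A SAM alignment line can be unpacked with a few lines of code. The sketch reads the mandatory tab-separated fields, decodes two bits of the FLAG field (0x4 = unmapped, 0x10 = reverse strand), and uses the CIGAR string to compute where the alignment ends on the reference. The example line is made up; for real work a library such as pysam or the SAMtools command line is the usual choice.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '5S20M2D10M' into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def parse_sam_line(line: str) -> dict:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
    ref_span = sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
    return {
        "read_name": fields[0],
        "reference": fields[2],
        "start": pos,                    # 1-based leftmost position
        "end": pos + ref_span - 1,       # last reference base covered
        "mapq": int(fields[4]),
        "cigar": parse_cigar(cigar),
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
        "sequence": fields[9],
    }


# Hypothetical alignment: 5 soft-clipped bases, 20 matches, a 2 bp deletion, 10 matches.
line = "\t".join(["read1", "16", "chr1", "100", "60", "5S20M2D10M",
                  "*", "0", "0", "A" * 35, "I" * 35])
alignment = parse_sam_line(line)
print(alignment["end"], alignment["is_reverse"])  # 131 True
```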

Variant Calling

Variant calling identifies positions where a sample's genome differs from the reference. Variants range from single-nucleotide changes (SNVs) to large structural rearrangements (SVs). Accurate variant calling is essential for clinical genomics, population genetics, and understanding disease.

Variant Types

SNV (Single Nucleotide Variant) — single base change. ~4-5 million SNVs per human genome relative to the reference. SNPs (Single Nucleotide Polymorphisms) are common SNVs (>1% allele frequency in the population).
Indel — insertion or deletion of 1-50 bp. ~500,000-1,000,000 per human genome. Harder to call accurately than SNVs, especially in homopolymer regions.
Structural Variant (SV) — larger events (>50 bp): deletions, duplications, insertions, inversions, translocations. ~20,000-30,000 SVs per human genome. Responsible for more base-pair differences between individuals than SNVs. Long reads dramatically improve SV detection.
Copy Number Variant (CNV) — duplications or deletions of large segments (1 kb to megabases). Detected by read-depth analysis or array CGH. Clinically significant: many diseases involve CNVs (e.g., 22q11 deletion = DiGeorge syndrome).
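The size-based categories above can be expressed as a small classifier over the REF and ALT alleles of a VCF-style record. This sketch follows the thresholds given in this section (indels up to 50 bp, structural variants above 50 bp) and handles only simple sequence alleles; symbolic alleles such as <DEL>, breakends, and copy-number calls are out of scope. The "MNV" label for equal-length multi-base substitutions is an addition not covered in the list above.

```python
def classify_variant(ref: str, alt: str, sv_threshold_bp: int = 50) -> tuple[str, int]:
    """Return (variant class, size in bp) for a simple REF/ALT allele pair."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if len(ref) == 1 and len(alt) == 1:
        return ("SNV", 1)
    if len(ref) == len(alt):
        # Multi-nucleotide substitution; labelled MNV here as an assumption.
        return ("MNV", len(ref))
    size = abs(len(alt) - len(ref))
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    if size > sv_threshold_bp:
        return (f"SV ({kind})", size)
    return (f"indel ({kind})", size)


print(classify_variant("A", "G"))             # ('SNV', 1)
print(classify_variant("AT", "A"))            # ('indel (deletion)', 1)
print(classify_variant("A", "A" + "T" * 60))  # ('SV (insertion)', 60)
```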

Variant Callers

GATK HaplotypeCaller — the Broad Institute's standard pipeline. Local de novo assembly of haplotypes in active regions. Produces GVCF files for joint genotyping across many samples. Gold standard for short-read germline variant calling.
DeepVariant — Google's deep-learning variant caller. Converts read pileups into images, classifies with a CNN. Matches or exceeds GATK accuracy, especially for indels. Won the PrecisionFDA Truth Challenge.
Mutect2 — GATK's somatic variant caller (tumor vs normal). Detects low-frequency mutations in cancer samples. Handles tumor heterogeneity and contamination.
Strelka2 — fast somatic and germline variant caller from Illumina. Good sensitivity for SNVs and indels.
Sniffles2 / SVIM / cuteSV — structural variant callers for long reads. Use split reads, supplementary alignments, and read-depth to detect SVs. Sniffles2 is the current standard for nanopore/PacBio SV calling.
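At their core, germline callers ask which genotype best explains the reads observed at a site. A textbook-style sketch of that question, assuming a diploid sample, independent reads, and a single fixed sequencing error rate. It is not the algorithm used by GATK HaplotypeCaller, DeepVariant, or any other tool named above, all of which model far more (base and mapping qualities, local haplotypes, priors).

```python
import math


def genotype_log_likelihoods(ref_reads: int, alt_reads: int,
                             error_rate: float = 0.01) -> dict[str, float]:
    """Log-likelihood of the read counts under each diploid genotype."""
    result = {}
    for genotype, alt_alleles in (("0/0", 0), ("0/1", 1), ("1/1", 2)):
        alt_fraction = alt_alleles / 2
        # Probability that a single read shows the ALT base under this genotype.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        # The binomial coefficient is the same for all genotypes, so it is omitted.
        result[genotype] = alt_reads * math.log(p_alt) + ref_reads * math.log(1 - p_alt)
    return result


def call_genotype(ref_reads: int, alt_reads: int, error_rate: float = 0.01) -> str:
    """Maximum-likelihood genotype (no prior applied)."""
    likelihoods = genotype_log_likelihoods(ref_reads, alt_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(30, 0))   # 0/0
print(call_genotype(14, 16))  # 0/1
print(call_genotype(1, 29))   # 1/1
```

Somatic calling is harder precisely because this diploid assumption breaks down: tumor variants can sit at any allele fraction, which is the case Mutect2 and Strelka2 are built for.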

Variant Annotation

VEP (Variant Effect Predictor) — Ensembl's tool. Annotates variants with gene impact (missense, nonsense, splice), allele frequency (gnomAD), pathogenicity predictions (SIFT, PolyPhen-2, CADD), and clinical significance (ClinVar).
ClinVar — NCBI database of variants and their clinical significance (pathogenic, likely pathogenic, VUS, likely benign, benign). >2 million variant submissions. Essential for clinical genomics.
gnomAD — Genome Aggregation Database. Allele frequencies from ~76,000 genomes and ~730,000 exomes (v4 release). If a variant is common in gnomAD (>1%), it is unlikely to be a rare disease-causing mutation.

Single-Cell Genomics

Bulk sequencing averages signals across millions of cells, masking cell-to-cell heterogeneity. Single-cell technologies sequence individual cells, revealing cell types, states, developmental trajectories, and rare populations that are invisible in bulk data.

scRNA-seq

10x Genomics Chromium — dominant platform. Microfluidic droplets encapsulate individual cells with barcoded gel beads. Each cell gets a unique barcode; all transcripts from that cell share the barcode. Typically profiles 1,000-10,000 cells per experiment, detecting 1,000-5,000 genes per cell. 3' or 5' capture.
Smart-seq3 — plate-based, full-length transcript sequencing. Higher sensitivity per cell (~8,000 genes) but lower throughput (~hundreds of cells). Better for detecting isoforms and rare transcripts.
Analysis pipeline — demultiplex → align (Cell Ranger / STARsolo) → count matrix (genes x cells) → quality filtering → normalization → dimensionality reduction (PCA, UMAP, t-SNE) → clustering → marker gene identification → cell type annotation. Tools: Scanpy (Python), Seurat (R).
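The quality filtering and normalization steps of this pipeline can be sketched on a tiny count matrix. The example stores the matrix as cells x genes (the transpose of the genes x cells layout mentioned above), drops cells with too few detected genes, scales each cell to a common total, and applies log(1 + x). The 10,000-count target and the gene threshold are common conventions assumed here, and the counts are invented; Scanpy and Seurat provide tested versions of these steps.

```python
import math


def detected_genes(cell_counts: list[int]) -> int:
    """Number of genes with at least one count in this cell."""
    return sum(1 for count in cell_counts if count > 0)


def filter_cells(matrix: list[list[int]], min_genes: int) -> list[list[int]]:
    """Keep cells in which at least min_genes genes were detected."""
    return [cell for cell in matrix if detected_genes(cell) >= min_genes]


def normalize_log1p(cell_counts: list[int], target_sum: float = 10_000) -> list[float]:
    """Scale a cell to target_sum total counts, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0] * len(cell_counts)
    return [math.log1p(count * target_sum / total) for count in cell_counts]


# Hypothetical counts: 3 cells x 4 genes.
counts = [
    [10, 0, 90, 0],
    [5, 5, 40, 50],
    [0, 0, 3, 0],   # only one gene detected, likely a low-quality cell
]
kept = filter_cells(counts, min_genes=2)
normalized = [normalize_log1p(cell) for cell in kept]
print(len(kept))                      # 2
print(round(normalized[0][0], 3))     # 6.909, i.e. log(1 + 1000)
```

Scaling to a common total removes differences in sequencing depth between cells, so that downstream steps such as PCA and clustering compare expression profiles rather than library sizes.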

Spatial Transcriptomics

Visium (10x Genomics) — capture spots (~55 µm, ~1-10 cells each) on tissue sections. Not truly single-cell but preserves spatial context. Genome-wide. Being superseded by Visium HD (~2 µm spots, near-single-cell resolution).
MERFISH / seqFISH+ — imaging-based, fluorescence in situ hybridization with combinatorial barcoding. True single-cell, subcellular resolution. 100-10,000 genes per experiment. Xiaowei Zhuang (MERFISH, Harvard).
Slide-seq / Stereo-seq — bead-based spatial capture. Stereo-seq (BGI): subcellular resolution (220 nm spot size), genome-wide, on large tissue areas.

Single-Cell Multiomics

CITE-seq — simultaneous measurement of mRNA and surface proteins (via antibody-conjugated DNA tags) in the same cell. Combines transcriptomic and proteomic data.
Multiome (10x) — simultaneous scATAC-seq (chromatin accessibility) and scRNA-seq from the same cell. Links regulatory elements to gene expression.
scATAC-seq — assay for transposase-accessible chromatin in single cells. Identifies active regulatory elements (promoters, enhancers) per cell.

Human Cell Atlas

An international effort to create a comprehensive reference atlas of all human cell types. Using single-cell and spatial transcriptomics to catalog every cell type in the human body, their molecular profiles, locations, and states. Led by Aviv Regev and Sarah Teichmann. As of 2025, data from >100 million cells across most human organs have been generated. The atlas will transform our understanding of development, health, and disease.

🔬 Your body has about 37 trillion cells, and they're not all the same: brain cells, blood cells, and muscle cells are all different! Single-cell sequencing lets scientists see which DNA instructions are switched on in ONE cell at a time. It's like going from reading one class's average test score to reading every student's individual answer sheet!

Resources

Sequence Read Archive (SRA)

Galaxy

Biostars

HBC Training (Harvard)

1000 Genomes Project

gnomAD