Molecular Biology

The molecular machinery of life. How DNA stores information, RNA carries messages, and proteins do the work — from replication to translation.

DNA Structure

Deoxyribonucleic acid (DNA) is the molecule that stores genetic information in all cellular life and many viruses. Its double-helix structure was proposed by James Watson and Francis Crick in 1953, building on X-ray diffraction data from Rosalind Franklin and Maurice Wilkins.

The Double Helix

Two antiparallel strands — each strand is a polymer of nucleotides. The strands run in opposite directions: one 5' to 3', the other 3' to 5'. The 5' end has a free phosphate group; the 3' end has a free hydroxyl group.
Sugar-phosphate backbone — alternating deoxyribose sugar and phosphate groups form the structural backbone. Negatively charged due to phosphate groups (which is why DNA migrates toward the positive electrode in gel electrophoresis).
Base pairing — adenine (A) pairs with thymine (T) via 2 hydrogen bonds; guanine (G) pairs with cytosine (C) via 3 hydrogen bonds. This complementarity means one strand determines the sequence of the other. Chargaff's rules: %A = %T and %G = %C.
Major and minor grooves — the helical twist creates two grooves of different widths. Proteins (transcription factors, restriction enzymes) read the DNA sequence by contacting bases through these grooves without unwinding the helix.
B-form DNA — the standard right-handed helix. About 10 base pairs per turn (closer to 10.5 in solution), ~3.4 nm pitch, ~2 nm diameter. Other forms: A-DNA (right-handed, wider and more compact, favored under dehydrating conditions), Z-DNA (left-handed, zigzag backbone, found in GC-rich regions).
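The pairing rules above can be made concrete with a short sketch. It derives the partner strand of a sequence and checks Chargaff's rules on the resulting duplex. The sequence and the function names are illustrative assumptions, not taken from any library.

```python
# Watson-Crick pairing from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'.

    The strands are antiparallel, so the input is read backwards
    while each base is swapped for its partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def base_fractions(strand: str) -> dict:
    """Fraction of each base counted across BOTH strands of the duplex."""
    duplex = strand.upper() + reverse_complement(strand)
    return {base: duplex.count(base) / len(duplex) for base in "ATGC"}


top = "ATGCGTTAC"  # hypothetical sequence, 5' to 3'
bottom = reverse_complement(top)
fractions = base_fractions(top)

print(bottom)  # GTAACGCAT
print(fractions["A"] == fractions["T"], fractions["G"] == fractions["C"])
```

Chargaff's rules hold here by construction: every A on one strand contributes a T on the other, so the equalities apply to double-stranded DNA and not necessarily to a single strand on its own.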

Figure: Double helix (antiparallel strands; A-T pairs with 2 H-bonds, G-C pairs with 3 H-bonds). Base pairing (5' → 3' coding strand opposite the 3' → 5' template strand).

Genome Organization

Prokaryotes — single circular chromosome (E. coli: 4.6 million bp, ~4,300 genes). Often have small circular plasmids carrying antibiotic resistance or other accessory genes. DNA in the nucleoid, not membrane-bound.
Eukaryotes — multiple linear chromosomes (human: ~3.2 billion bp per haploid genome, carried on 23 chromosome pairs in diploid cells; ~20,000 protein-coding genes). DNA is wrapped around histone proteins to form nucleosomes, which pack into chromatin. Chromatin condenses into visible chromosomes during cell division.
Non-coding DNA — in humans, only ~1.5% of the genome encodes proteins. The rest includes regulatory elements (promoters, enhancers, silencers), introns, transposable elements (45% of human genome), and sequences of unknown function.

🧬 DNA is like a recipe book for building YOU. It has 3 billion letters that spell out instructions for everything — your eye color, how tall you are, even how your brain works! The letters are A, T, G, and C, and they always pair up: A with T, G with C. It's shaped like a twisted ladder called a double helix.

RNA

Ribonucleic acid (RNA) is chemically similar to DNA but differs in three key ways: it uses ribose sugar (with a 2'-OH group) instead of deoxyribose, it contains uracil (U) instead of thymine (T), and it is usually single-stranded. RNA is far more versatile than initially appreciated.

Types of RNA

mRNA (messenger RNA) — carries the protein-coding sequence from DNA to ribosomes. In eukaryotes: 5' cap, 5' UTR, coding sequence, 3' UTR, poly-A tail. Half-life ranges from minutes (bacteria) to hours/days (eukaryotes).
tRNA (transfer RNA) — small (~76 nucleotides), cloverleaf structure. Carries amino acids to the ribosome. The anticodon loop base-pairs with mRNA codons. Aminoacyl-tRNA synthetases charge each tRNA with its cognate amino acid. 61 sense codons, but only ~45 tRNA types due to wobble base-pairing at the third codon position.
rRNA (ribosomal RNA) — the catalytic and structural core of the ribosome. Makes up ~80% of total cellular RNA. In humans: 28S, 18S, 5.8S, 5S. The peptidyl transferase activity (peptide bond formation) is catalyzed by rRNA, making the ribosome a ribozyme.
miRNA (microRNA) — small (~22 nt) regulatory RNAs. Base-pair with 3' UTR of target mRNAs to repress translation or trigger degradation. ~2,600 human miRNAs regulate ~60% of protein-coding genes. Discovered by Victor Ambros and Gary Ruvkun (Nobel Prize 2024).
siRNA (small interfering RNA) — double-stranded, ~21 nt. Triggers mRNA degradation via RISC (RNA-Induced Silencing Complex). Basis for RNA interference (RNAi) — Craig Mello and Andrew Fire, Nobel Prize 2006. Used in gene knockdown experiments and therapeutics (patisiran for hereditary ATTR amyloidosis).
lncRNA (long non-coding RNA) — >200 nt, no protein-coding capacity. Diverse functions: X-chromosome inactivation (Xist), chromatin remodeling, transcription regulation. Thousands identified but most poorly characterized.
snRNA / snoRNA — small nuclear and small nucleolar RNAs. snRNAs are components of the spliceosome (mRNA splicing). snoRNAs guide chemical modifications of rRNA.

The Central Dogma

First stated by Francis Crick in 1958 and restated in 1970. The central dogma describes the flow of genetic information: DNA is replicated, DNA is transcribed into RNA, and RNA is translated into protein. Information flows from nucleic acid to protein, but not from protein back to nucleic acid.

Diagram: DNA (storage) → transcription → RNA (messenger) → translation → protein (function). Side branches: replication (DNA → DNA) and reverse transcription (RNA → DNA, retroviruses).

The Flow

DNA → DNA — replication. DNA polymerase copies the genome before cell division. Replication is semi-conservative: each new double helix contains one old strand and one new strand (demonstrated by the Meselson-Stahl experiment, 1958).
DNA → RNA — transcription. RNA polymerase reads the template strand (3' to 5') and synthesizes mRNA (5' to 3'). The mRNA sequence matches the coding strand (sense strand) except U replaces T.
RNA → Protein — translation. Ribosomes read mRNA codons (three nucleotides at a time) and assemble amino acids into polypeptide chains, guided by tRNAs.
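The transcription step can be sketched in a few lines. The example below builds an mRNA from a template strand written 3' to 5' and confirms that the result matches the coding strand with U in place of T. The sequences and the function name are hypothetical.

```python
# Pairing used during transcription: the RNA base placed opposite each
# DNA template base (A in the template pairs with U in the RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5' to 3') for a template strand written 3' to 5'.

    Because the template is already written in the direction the
    polymerase reads it, the output is a position-by-position complement.
    """
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


coding = "ATGGCCTAA"    # hypothetical coding (sense) strand, 5' to 3'
template = "TACCGGATT"  # its complement, written 3' to 5'

mrna = transcribe(template)
print(mrna)  # AUGGCCUAA
print(mrna == coding.replace("T", "U"))  # True
```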

Exceptions and Extensions

Reverse transcription — RNA → DNA. Retroviruses (HIV) use reverse transcriptase to copy their RNA genome into DNA, which integrates into the host genome. Also used by telomerase and retrotransposons.
RNA replication — RNA → RNA. RNA viruses (influenza, SARS-CoV-2) use RNA-dependent RNA polymerase (RdRp) to replicate their genomes.
RNA editing — the mRNA sequence can be altered after transcription. ADAR enzymes convert A to inosine (read as G) in double-stranded RNA regions. APOBEC edits C to U. Changes the protein product without changing the gene.
Epigenetics — heritable changes in gene expression without DNA sequence changes. DNA methylation (CpG islands), histone modifications (acetylation, methylation), and chromatin remodeling control which genes are active.

DNA Replication

DNA replication is the process of copying the entire genome before cell division. In E. coli, replication proceeds at ~1,000 nucleotides/second with an error rate of ~1 per 10^9 bases (after proofreading and mismatch repair). Human cells replicate 6.4 billion base pairs in ~8 hours using ~30,000 replication origins.
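A back-of-the-envelope calculation shows what these figures imply. The sketch assumes that each origin launches two forks moving in opposite directions and, for the human case, that all origins fire at the start of S phase. Neither assumption is stated above, and real origins fire at staggered times, so the human number is only a rough lower bound on fork speed.

```python
# Figures quoted in the text.
ECOLI_GENOME_BP = 4.6e6       # E. coli chromosome
ECOLI_FORK_RATE = 1_000       # nucleotides per second per fork
HUMAN_DIPLOID_BP = 6.4e9
HUMAN_ORIGINS = 30_000
S_PHASE_HOURS = 8

# Assumption (not in the text): bidirectional replication, two forks per origin.
FORKS_PER_ORIGIN = 2

# E. coli: one origin, two forks sharing the circular chromosome.
ecoli_minutes = ECOLI_GENOME_BP / (FORKS_PER_ORIGIN * ECOLI_FORK_RATE) / 60

# Human: average stretch of DNA each origin is responsible for.
bp_per_origin = HUMAN_DIPLOID_BP / HUMAN_ORIGINS

# Slowest fork speed that would still finish within S phase if every
# origin fired at time zero (a simplifying assumption).
min_human_fork_rate = bp_per_origin / (FORKS_PER_ORIGIN * S_PHASE_HOURS * 3600)

print(f"E. coli chromosome: about {ecoli_minutes:.0f} minutes")
print(f"Human: about {bp_per_origin:,.0f} bp per origin")
print(f"Human: at least {min_human_fork_rate:.1f} nt/s per fork")
```

Under these assumptions a single E. coli origin needs roughly 38 minutes to copy the chromosome, while human cells manage a genome over a thousand times larger in a few hours by using many origins in parallel.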

Key Steps

Initiation — an initiator (the origin recognition complex, ORC, in eukaryotes; the DnaA protein in E. coli) binds to replication origins and recruits helicase. Replication origins are AT-rich (easier to separate because A-T has only 2 hydrogen bonds).
Unwinding — helicase (DnaB in E. coli, MCM2-7 in eukaryotes) unwinds the double helix. Single-strand binding proteins (SSB) stabilize the separated strands. Topoisomerase relieves torsional stress ahead of the replication fork.
Priming — primase synthesizes a short RNA primer (~10 nt) because DNA polymerase cannot start de novo — it can only extend an existing 3'-OH. Primers are later removed and replaced with DNA.
Elongation — DNA polymerase III (E. coli) or Pol epsilon/delta (eukaryotes) synthesizes new DNA 5' to 3'. The leading strand is synthesized continuously. The lagging strand is synthesized discontinuously as Okazaki fragments (~1000-2000 nt in bacteria, ~100-200 nt in eukaryotes).
Proofreading — DNA polymerase has 3' to 5' exonuclease activity. If a mismatched base is incorporated, the polymerase backs up, removes it, and inserts the correct one. Reduces error rate from ~10^-5 to ~10^-7.
Ligation — DNA ligase joins Okazaki fragments after primer removal. Seals the phosphodiester backbone.
Mismatch repair — post-replication repair system (MutS/MutL/MutH in E. coli) detects and corrects remaining mismatches. Reduces final error rate to ~10^-9 to 10^-10.
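To see what each fidelity layer buys, the error rates above can be turned into expected mistakes per round of replication. This is a rough sketch that applies the quoted rates to the 6.4 billion bp diploid human genome and takes the upper end (10^-9) of the final range. The rates quoted in the steps are general approximations, so the numbers are illustrative.

```python
GENOME_BP = 6.4e9  # diploid human genome, as quoted in the text

# Approximate per-base error rates after each layer of correction.
ERROR_RATES = {
    "polymerase alone": 1e-5,
    "after proofreading": 1e-7,
    "after mismatch repair": 1e-9,
}

# Expected number of uncorrected errors per genome copy at each stage.
expected_errors = {stage: rate * GENOME_BP for stage, rate in ERROR_RATES.items()}

for stage, errors in expected_errors.items():
    print(f"{stage}: about {errors:,.1f} errors per replication")
```

With these figures, proofreading and mismatch repair together cut the expected count from tens of thousands of errors per replication to a handful.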

Telomeres

Linear chromosomes have an "end-replication problem": the lagging strand cannot be fully replicated at the chromosome ends. Telomeres — repetitive sequences (TTAGGG in humans, repeated 1,000-2,000 times) — protect chromosome ends from erosion. Telomerase (a reverse transcriptase with an RNA template) extends telomeres in stem cells and germ cells. In somatic cells, telomeres shorten with each division (~50-100 bp per division), contributing to cellular aging (Hayflick limit). Cancer cells typically reactivate telomerase for unlimited replication.
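The telomere figures above bound how many divisions a cell lineage could sustain without telomerase. The sketch below computes that range from the quoted repeat counts and loss rates. It assumes, as a simplification, that the whole telomere can be consumed. Cells stop dividing before that point, so these are upper bounds.

```python
REPEAT_UNIT = "TTAGGG"                # human telomere repeat
REPEAT_COUNT = (1_000, 2_000)         # range quoted in the text
LOSS_BP_PER_DIVISION = (50, 100)      # range quoted in the text

# Telomere length in base pairs at the low and high end of the range.
shortest_bp = REPEAT_COUNT[0] * len(REPEAT_UNIT)
longest_bp = REPEAT_COUNT[1] * len(REPEAT_UNIT)

# Worst case: short telomere, fast loss. Best case: long telomere, slow loss.
fewest_divisions = shortest_bp // LOSS_BP_PER_DIVISION[1]
most_divisions = longest_bp // LOSS_BP_PER_DIVISION[0]

print(f"Telomere length: {shortest_bp:,} to {longest_bp:,} bp")
print(f"Divisions until fully eroded: {fewest_divisions} to {most_divisions}")
```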

Transcription

Transcription is the synthesis of RNA from a DNA template by RNA polymerase. The enzyme reads the template strand 3' to 5' and synthesizes the RNA transcript 5' to 3'. Unlike DNA replication, transcription does not require a primer.

In Prokaryotes

One RNA polymerase — a multi-subunit enzyme (alpha2-beta-beta'-omega, ~400 kDa) handles all transcription: mRNA, tRNA, and rRNA.
Sigma factor — associates with the core enzyme to form the holoenzyme. Sigma recognizes the promoter (specifically the -10 and -35 elements: TATAAT and TTGACA consensus). Different sigma factors direct transcription of different gene sets (sigma-70 for housekeeping, sigma-32 for heat shock).
No processing — mRNA is translated while still being transcribed (coupled transcription-translation). No 5' cap, no poly-A tail, no introns. Polycistronic mRNAs encode multiple proteins (operons like lac, trp).
Termination — Rho-independent: a GC-rich hairpin followed by a U-rich region causes the polymerase to stall and release. Rho-dependent: Rho helicase catches up to a paused polymerase and unwinds the RNA-DNA hybrid.
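Promoter recognition by sigma can be approximated as a pattern search. The sketch below scans a sequence for a -35 element followed by a -10 element, tolerating a limited number of mismatches to each consensus. The 15 to 19 bp spacer window and the one-mismatch tolerance are assumptions chosen for illustration, and the test sequence is synthetic. Real promoter prediction uses weighted scoring rather than a hard cutoff.

```python
MINUS_35 = "TTGACA"  # consensus from the text
MINUS_10 = "TATAAT"  # consensus from the text


def mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, max_mismatch=1, spacers=range(15, 20)):
    """Return (start of -35 element, start of -10 element) index pairs.

    max_mismatch and spacers are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        if mismatches(seq[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap  # candidate start of the -10 element
            if j + 6 <= len(seq) and mismatches(seq[j:j + 6], MINUS_10) <= max_mismatch:
                hits.append((i, j))
    return hits


# Synthetic test sequence: both consensus elements separated by 17 bp.
sequence = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GCA"
print(find_promoters(sequence))  # [(2, 25)]
```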

In Eukaryotes

Three RNA polymerases — Pol I (rRNA: 28S, 18S, 5.8S), Pol II (mRNA, miRNA, most snRNAs), Pol III (tRNA, 5S rRNA, U6 snRNA). Pol II transcribes all protein-coding genes, so it is the main target of gene-specific transcription factors.
Promoter elements — TATA box (~25 bp upstream), Inr (initiator), BRE, DPE. General transcription factors (TFIIA, B, D, E, F, H) assemble on the promoter to form the preinitiation complex (PIC). TFIID (containing TBP) recognizes the TATA box.
Mediator complex — ~30 subunits, bridges transcription factors and Pol II. Integrates signals from activators and repressors bound to enhancers (which can be 100 kb or more away, connected via DNA looping).
mRNA processing — (1) 5' capping: 7-methylguanosine cap added co-transcriptionally, protects from degradation and aids ribosome binding. (2) Splicing: introns removed by the spliceosome (5 snRNPs: U1, U2, U4, U5, U6). (3) 3' polyadenylation: poly-A signal (AAUAAA) triggers cleavage and addition of ~200 A residues by poly-A polymerase.
Alternative splicing — different combinations of exons can be included, producing multiple protein isoforms from one gene. The human genome's ~20,000 genes produce ~100,000+ distinct mRNAs. The Dscam gene in Drosophila can produce 38,016 isoforms.
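The Dscam figure is a product of independent choices. The sketch below assumes the commonly cited sizes of the four mutually exclusive exon clusters (12, 48, 33, and 2 alternatives for exons 4, 6, 9, and 17). Those cluster sizes are not given in the text above and are included here as an assumption.

```python
import math

# Assumed number of alternative exons in each Dscam cluster.
# One exon is chosen from each cluster per mature mRNA.
DSCAM_CLUSTERS = {
    "exon 4": 12,
    "exon 6": 48,
    "exon 9": 33,
    "exon 17": 2,
}

# Independent choices multiply.
isoforms = math.prod(DSCAM_CLUSTERS.values())
print(isoforms)  # 38016
```

The same multiplication explains why a modest number of genes can yield a much larger number of distinct mRNAs: each additional alternative exon cluster multiplies the count instead of adding to it.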

Translation

Translation is the process of synthesizing a polypeptide chain from an mRNA template. It occurs on ribosomes — large ribonucleoprotein complexes (2.5-4 MDa) composed of two subunits. In E. coli, translation elongation proceeds at ~20 amino acids per second.

The Ribosome

Prokaryotic (70S) — 30S small subunit (16S rRNA + 21 proteins) and 50S large subunit (23S + 5S rRNA + 31 proteins). Targeted by many antibiotics (chloramphenicol, erythromycin, tetracycline, streptomycin).
Eukaryotic (80S) — 40S small subunit (18S rRNA + 33 proteins) and 60S large subunit (28S + 5.8S + 5S rRNA + 47 proteins). Larger and more complex than prokaryotic ribosomes.
Three binding sites — A (aminoacyl: incoming charged tRNA), P (peptidyl: tRNA holding the growing chain), E (exit: deacylated tRNA leaving). During each elongation cycle, tRNAs move A → P → E.

Steps

Initiation — small subunit binds mRNA at the start codon (AUG). In prokaryotes: Shine-Dalgarno sequence (AGGAGG, ~8 nt upstream of AUG) base-pairs with 16S rRNA. In eukaryotes: 40S subunit binds the 5' cap and scans for the first AUG (Kozak consensus: GCC(A/G)CCAUGG). Initiator tRNA (Met-tRNAi) enters the P site. Large subunit joins.
Elongation — (1) Aminoacyl-tRNA (charged with correct amino acid) enters the A site, guided by EF-Tu (prokaryotes) or eEF1A (eukaryotes) and GTP hydrolysis. Codon-anticodon match verified. (2) Peptidyl transferase (23S rRNA in the prokaryotic 50S subunit, 28S rRNA in the eukaryotic 60S subunit) catalyzes peptide bond formation between the amino acid in the A site and the peptide chain in the P site. (3) EF-G/eEF2 + GTP catalyze translocation: ribosome advances one codon, moving tRNAs from A→P and P→E.
Termination — a stop codon (UAA, UAG, or UGA) enters the A site. No tRNA recognizes stop codons. Release factors (RF1/RF2 in prokaryotes, eRF1 in eukaryotes) enter the A site, trigger hydrolysis of the peptidyl-tRNA bond, releasing the completed polypeptide. Ribosome recycling factor disassembles the ribosome for reuse.
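The three stages can be mimicked with a small decoder: find the first AUG, read one codon at a time, and stop at a stop codon. This is a simplified sketch that uses the standard genetic code and ignores Shine-Dalgarno or Kozak context. The mRNA and the function name are hypothetical.

```python
from itertools import product

# Standard genetic code. Codons are enumerated with bases in the order
# U, C, A, G (first base varies slowest); "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")  # initiation: locate the start codon
    if start == -1:
        return ""
    peptide = []
    # Elongation: advance one codon (three nucleotides) per step.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # termination: stop codon reached
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("GGAUGGCCUGGUAAGG"))  # MAW
```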

Post-Translational Modifications

Folding — chaperones (Hsp70, Hsp60/GroEL, Hsp90) assist proper protein folding. Misfolded proteins are targeted for proteasomal degradation via ubiquitination.
Phosphorylation — kinases add phosphate groups to serine, threonine, or tyrosine. The most common signaling modification. ~30% of human proteins are phosphorylated at any time.
Glycosylation — addition of sugar chains. N-linked (to asparagine, in the ER) or O-linked (to serine/threonine, in the Golgi). Critical for cell surface proteins and secreted proteins.
Proteolytic cleavage — removal of signal peptides (for secretion), propeptide cleavage (e.g., insulin from proinsulin), polyprotein processing (viral proteins).

The Genetic Code

The genetic code maps 64 codons (4^3 triplets) to 20 amino acids and 3 stop signals. It was deciphered between 1961 and 1966, chiefly by Marshall Nirenberg and Har Gobind Khorana, who shared the 1968 Nobel Prize with Robert Holley (recognized for determining the structure of tRNA). The code is nearly universal across all life, from bacteria to humans, with minor variations in mitochondria and some organisms.

Properties

Degenerate (redundant) — most amino acids are encoded by 2-6 codons. Leucine, serine, and arginine each have 6 codons; methionine and tryptophan have only 1 each. The degeneracy is mostly in the third ("wobble") position, where different bases can encode the same amino acid.
Non-overlapping — codons are read sequentially, three nucleotides at a time, with no shared nucleotides between adjacent codons.
Comma-free — there are no spacers between codons. The reading frame is set by the start codon (AUG) and maintained until a stop codon.
Start codon — AUG (methionine). Also encodes methionine internally. In prokaryotes, the start Met is formylated (fMet). In rare cases, GUG or UUG can serve as start codons in bacteria.
Stop codons — UAA ("ochre"), UAG ("amber"), UGA ("opal"). Normally no tRNA recognizes them. In some contexts, UGA is recoded as selenocysteine (the "21st amino acid"), and UAG encodes pyrrolysine in some archaea and bacteria.
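The degeneracy of the code can be checked by counting codons per amino acid. The sketch below builds the standard code table and tallies it. One-letter amino acid codes are used, with "*" for stop.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid (or stop).
codons_per_aa = Counter(CODON_TABLE.values())

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = len(CODON_TABLE) - len(stop_codons)
six_fold = sorted(aa for aa, n in codons_per_aa.items() if n == 6)
single = sorted(aa for aa, n in codons_per_aa.items() if n == 1)

print(sense_codons, stop_codons)  # 61 ['UAA', 'UAG', 'UGA']
print(six_fold)                   # ['L', 'R', 'S']
print(single)                     # ['M', 'W']
```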

Codon Usage Bias

Although the code is degenerate, organisms preferentially use certain codons over synonymous alternatives. E. coli strongly favors certain codons (e.g., CGU for arginine over CGG), matching the abundance of corresponding tRNAs. Codon optimization — adjusting a gene's codons to match the host's preference — is essential for heterologous protein expression. The BioNTech/Pfizer COVID-19 vaccine used codon-optimized mRNA with N1-methylpseudouridine to enhance expression and reduce innate immune activation.
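Codon usage is measured by counting. The sketch below tallies in-frame codons in a coding sequence and reports the relative use of the six arginine codons. The sequence is a made-up toy example, not real E. coli data, and the function names are illustrative.

```python
from collections import Counter

ARGININE_CODONS = ("CGU", "CGC", "CGA", "CGG", "AGA", "AGG")


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def arginine_usage(cds: str) -> dict:
    """Fraction of arginine residues encoded by each synonymous codon."""
    counts = codon_counts(cds)
    total = sum(counts[codon] for codon in ARGININE_CODONS)
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in ARGININE_CODONS}


toy_cds = "AUGCGUCGUCGGCGUUAA"  # hypothetical: AUG CGU CGU CGG CGU UAA
usage = arginine_usage(toy_cds)
print(usage["CGU"], usage["CGG"])  # 0.75 0.25
```

Codon optimization tools apply the same idea in reverse: given a host's usage table, they replace rare codons with preferred synonyms while leaving the encoded protein unchanged.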

Proteins

Proteins are the functional workhorses of the cell. Built from 20 standard amino acids linked by peptide bonds, they fold into specific 3D structures that determine their function. The human body contains an estimated 80,000-400,000 distinct proteins.

Amino Acid Properties

Nonpolar (hydrophobic) — glycine, alanine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan. Tend to be buried in the protein interior.
Polar uncharged — serine, threonine, cysteine, tyrosine, asparagine, glutamine. Participate in hydrogen bonds. Cysteine can form disulfide bonds (S-S).
Positively charged (basic) — lysine, arginine, histidine. Found on protein surfaces, interact with negatively charged DNA/RNA.
Negatively charged (acidic) — aspartate, glutamate. Metal-binding sites, catalytic residues in many enzymes.
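The four groups above translate directly into a lookup table. The sketch below classifies each residue of a peptide (one-letter codes) and gives a crude net charge. It counts histidine as fully positive, following the grouping above, although histidine is only partly protonated at physiological pH. The peptide is a hypothetical example.

```python
from collections import Counter

# One-letter codes for the groups listed in the text.
CLASSES = {
    "nonpolar": "GAVLIPFMW",
    "polar uncharged": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}
RESIDUE_CLASS = {aa: name for name, letters in CLASSES.items() for aa in letters}


def class_composition(peptide: str) -> Counter:
    """Count how many residues fall into each side-chain class."""
    return Counter(RESIDUE_CLASS[aa] for aa in peptide.upper())


def crude_net_charge(peptide: str) -> int:
    """Positive minus negative residues (ignores termini and real pKa values)."""
    composition = class_composition(peptide)
    return composition["positive"] - composition["negative"]


composition = class_composition("MKDEL")
print(dict(composition))
print(crude_net_charge("MKDEL"))  # -1
```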

Protein Functions

Enzymes — biological catalysts. Accelerate reactions by 10^6 to 10^17-fold. Substrate-specific. Regulated by allosteric effectors, covalent modification, and gene expression. Examples: DNA polymerase, ATP synthase, lactase.
Structural — collagen (connective tissue, most abundant human protein), keratin (hair, nails), actin and tubulin (cytoskeleton).
Transport — hemoglobin (oxygen in blood), transferrin (iron), aquaporins (water channels), ABC transporters (membrane transport).
Signaling — insulin (hormone), receptor tyrosine kinases, G-proteins, transcription factors.
Immune — antibodies (immunoglobulins), T-cell receptors, complement proteins, MHC molecules.
Motor — myosin (muscle contraction), kinesin (cargo transport along microtubules), dynein (flagellar motion, intracellular transport).

⚙️ Proteins are tiny machines inside your body. DNA is the blueprint, but proteins do the actual work — they digest food, fight germs, carry oxygen, and build muscles. Your body has over 80,000 different kinds of protein machines, each folded into a special shape that lets it do its job!

Resources

Molecular Biology of the Cell (Alberts et al.)

MIT 7.013: Introductory Biology

Khan Academy: AP Biology

RCSB Protein Data Bank

GenBank (NCBI)

Nature Scitable: Molecular Biology