🧬 Biotech Institute
Educational Resources

Protein Structure

From amino acid chains to the three-dimensional machines of life. How proteins fold, how we determine and predict their structures, and why structure matters for drug design.

Primary Structure

The primary structure is the linear sequence of amino acids in a polypeptide chain. It is encoded by the gene and determines all higher levels of structure. The sequence is written from the N-terminus (free amino group) to the C-terminus (free carboxyl group), matching the direction of translation.

Amino Acids

20 standard amino acids — each has an amino group (-NH2), a carboxyl group (-COOH), a hydrogen atom, and a distinctive side chain (R group) bonded to the central alpha-carbon. The R group determines the amino acid's chemical properties.
Peptide bond — a covalent bond between the carboxyl group of one amino acid and the amino group of the next, releasing water (condensation reaction). The peptide bond is planar (partial double-bond character) and almost always in the trans configuration. Free rotation around the N-C-alpha (phi) and C-alpha-C (psi) bonds allows the chain to fold.
Sequence determines structure — Anfinsen's dogma (Nobel Prize 1972): the amino acid sequence contains all the information needed for a protein to fold into its native 3D structure (at least for small, single-domain proteins). Demonstrated by denaturing and refolding ribonuclease A in vitro.
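
The phi and psi angles mentioned above are dihedral (torsion) angles defined by four consecutive backbone atoms: phi by C(i-1), N, C-alpha, C and psi by N, C-alpha, C, N(i+1). As a minimal sketch in plain Python, the function below computes a signed dihedral from four 3D points. The function name and the toy coordinates in the usage line are illustrative assumptions, not taken from any particular library or structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Remove the components along the central bond so that v and w
    # lie in the plane perpendicular to it.
    s0 = _dot(b0, b1)
    s2 = _dot(b2, b1)
    v = (b0[0] - s0 * b1[0], b0[1] - s0 * b1[1], b0[2] - s0 * b1[2])
    w = (b2[0] - s2 * b1[0], b2[1] - s2 * b1[1], b2[2] - s2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (hypothetical): the fourth point is rotated a quarter turn.
angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Applied to the atoms C-alpha, C, N, C-alpha across a peptide bond, the same function would give a value near 180 degrees for the usual trans configuration.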

Sequence Analysis

Sequence alignment — comparing sequences to find conserved regions (evolutionary constraints on structure/function). Tools: BLAST, Clustal Omega, MUSCLE.
Conservation — positions critical for structure or function are conserved across species. Mutations at these sites are typically deleterious. The ratio of non-synonymous to synonymous substitutions (dN/dS) measures selective pressure.
Domains — independently folding units within a protein (typically 50-300 residues). Many proteins are multi-domain. Domain databases: Pfam (protein families), InterPro, SMART.
Motifs — short, conserved sequence patterns associated with specific functions. Examples: KDEL (ER retention), NLS (nuclear localization), CAAX (prenylation signal). Database: PROSITE.
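
Short motifs like those listed above can be searched for with regular expressions. The sketch below is a simplified illustration only: the patterns are rough approximations chosen for this example (for instance, "aliphatic" is reduced to A, V, I, L, and the NLS pattern covers only a basic monopartite form), and the test sequence is made up. Curated PROSITE patterns are more precise.

```python
import re

# Simplified, illustrative patterns (assumptions, not PROSITE entries).
MOTIFS = {
    "KDEL (ER retention)": re.compile(r"KDEL$"),             # C-terminal KDEL
    "CAAX (prenylation)": re.compile(r"C[AVIL]{2}[A-Z]$"),   # C, two aliphatic, any, at C-terminus
    "NLS (monopartite, simplified)": re.compile(r"K[KR][A-Z][KR]"),
}


def scan_motifs(sequence):
    """Return (motif name, 1-based start position, matched text) for each hit."""
    sequence = sequence.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start() + 1, match.group()))
    return hits


# Hypothetical sequence containing a basic cluster and a C-terminal CAAX box.
hits = scan_motifs("MSTPKKKRKVEDPGGCVLS")
```

A match only flags a candidate site; whether the motif is functional depends on its structural context and accessibility.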

📏 Think of a protein like a really long LEGO chain. The primary structure is just the order of the LEGO bricks. But here's the cool part — once you connect them all in the right order, the chain automatically folds up into an amazing 3D shape, like origami! The shape is what gives the protein its superpowers.

Secondary Structure

Secondary structure refers to local, regular folding patterns stabilized by hydrogen bonds between backbone atoms (C=O and N-H groups). Linus Pauling and Robert Corey predicted the two major secondary structures in 1951, before the first protein crystal structure was solved.

Primary — Amino acid sequence (1D chain)

Secondary — α-helices and β-sheets (local folding)

Tertiary — Full 3D fold of one chain

Quaternary — Multi-subunit assembly

Alpha Helix

Right-handed coil with 3.6 residues per turn, 5.4 angstrom pitch. Hydrogen bond between C=O of residue i and N-H of residue i+4. Side chains project outward.
Common in globular proteins and membrane proteins (transmembrane helices). Alpha-keratin (hair, nails) is almost entirely alpha-helical.
Amphipathic helices have hydrophobic residues on one face and polar residues on the other — important at membrane interfaces and in coiled-coil interactions.
Proline is a helix-breaker (its ring constrains phi angle). Glycine destabilizes helices (too flexible, loses conformational entropy on folding).
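
The helix parameters above imply a rise of about 1.5 angstrom and a rotation of about 100 degrees per residue. The following sketch derives both and uses the rotation to check whether hydrophobic residues cluster on one face of a helix, as in an amphipathic helix. The hydrophobic residue set, the clustering score, and the example sequences are simplifying assumptions made for illustration.

```python
import math

RESIDUES_PER_TURN = 3.6   # from the text
PITCH_ANGSTROM = 5.4      # from the text

RISE_PER_RESIDUE = PITCH_ANGSTROM / RESIDUES_PER_TURN   # about 1.5 angstrom
ANGLE_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN           # about 100 degrees

HYDROPHOBIC = frozenset("AILMFVW")  # simplified set (assumption)


def residues_to_span(thickness_angstrom):
    """Smallest number of helical residues whose total rise covers the thickness."""
    return math.ceil(round(thickness_angstrom / RISE_PER_RESIDUE, 6))


def hydrophobic_clustering(sequence):
    """Mean resultant length of hydrophobic positions on a helical wheel.

    Values near 1 mean the hydrophobic residues point the same way;
    values near 0 mean they are spread evenly around the helix axis.
    """
    angles = [math.radians(i * ANGLE_PER_RESIDUE)
              for i, aa in enumerate(sequence.upper()) if aa in HYDROPHOBIC]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    return math.hypot(x, y) / len(angles)


# Hypothetical peptides: one amphipathic pattern, one uniformly hydrophobic.
amphipathic = hydrophobic_clustering("LKKLLKKLKKKL")
uniform = hydrophobic_clustering("L" * 18)
```

With an assumed hydrophobic membrane thickness of 30 angstrom, `residues_to_span(30)` gives 20 residues, in line with the typical length of a transmembrane helix.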

Beta Sheet

Extended strands (beta strands) arranged side by side, connected by hydrogen bonds between strands. Parallel (strands run same direction, weaker H-bonds) or antiparallel (opposite directions, straight H-bonds, more stable).
R groups alternate above and below the sheet plane. Beta sheets can be flat or twisted (most natural beta sheets have a right-handed twist).
Common in structural proteins: silk fibroin is almost entirely beta sheet. Also prominent in immunoglobulin domains, beta barrels (porins), and beta propellers.
Connected by turns and loops. Beta turns are tight 4-residue turns with a hydrogen bond between residue i C=O and residue i+3 N-H. Glycine and proline are common in turns.

Other Elements

3₁₀ helix — tighter helix with 3 residues per turn, H-bond i to i+3. Less common, often at helix ends.
Polyproline II helix — left-handed helix with 3 residues per turn, no intra-chain H-bonds. Found in collagen (triple helix of three polyproline II chains) and proline-rich motifs involved in signaling (SH3 domain binding).
Loops and coils — non-regular regions connecting secondary structure elements. Often on the protein surface, forming binding sites and active sites. More variable in sequence than helices/sheets.

Prediction

Secondary structure can be predicted from sequence with ~80-85% accuracy. Methods: PSIPRED (neural network using PSI-BLAST profiles) and JPred; GOR and Chou-Fasman are early methods with lower accuracy. Modern methods use deep learning on multiple sequence alignments. DSSP (dictionary of secondary structure) is not a predictor: it assigns secondary structure from 3D coordinates and provides the reference labels against which predictions are scored.

Tertiary Structure

The tertiary structure is the complete 3D arrangement of all atoms in a single polypeptide chain. It is usually the biologically active conformation, and it is determined by interactions among the side chains, the backbone, and the solvent (usually water).

Hydrophobic: dominant force (nonpolar core)

H-bonds: backbone + side chain (secondary structure)

Disulfide: covalent S-S (extracellular)

Stabilizing Forces

Hydrophobic effect — the dominant force in protein folding. Nonpolar side chains cluster in the protein interior, away from water. This maximizes the entropy of surrounding water molecules (which would form ordered cages around exposed hydrophobic surfaces). The hydrophobic core is tightly packed, with packing density similar to crystalline solids.
Hydrogen bonds — between polar side chains, between side chains and backbone, and between the protein and water. Backbone H-bonds stabilize secondary structure. Side-chain H-bonds in the interior are particularly stabilizing because they're in a low-dielectric environment.
Van der Waals forces — weak but numerous. Arise from transient dipoles in closely packed atoms. Optimal at ~3.5 angstrom. Contribute significantly to the stability of the tightly packed hydrophobic core.
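Electrostatic interactions — salt bridges between oppositely charged side chains (Lys/Arg with Asp/Glu). Important on the protein surface. Ion pairs in the interior are rare because burying a charge carries a large desolvation cost; where they do occur, their net contribution to stability depends on context.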
Disulfide bonds — covalent S-S bonds between cysteine residues. Form in the oxidizing environment of the ER (extracellular and secreted proteins). Rare in cytoplasmic proteins (reducing environment). Stabilize proteins that must function in harsh extracellular environments.

Structural Motifs

Helix-turn-helix (HTH) — DNA-binding motif. Recognition helix fits in the major groove. Found in homeodomains and many prokaryotic transcription factors.
Zinc finger — ~30 residues, stabilized by Zn2+ coordinated by Cys/His residues. DNA-binding motif in many transcription factors (Cys2His2 zinc fingers). Also: RING fingers (E3 ubiquitin ligases), zinc ribbons.
Coiled coil — two or more alpha helices wound around each other (supercoil). Heptad repeat (abcdefg) with hydrophobic residues at positions a and d. Found in: leucine zippers (transcription factors), myosin (motor), keratin, tropomyosin.
Beta barrel — beta strands arranged in a closed barrel. Found in membrane proteins (porins, outer membrane proteins), green fluorescent protein (GFP), lipocalins.
TIM barrel — (beta/alpha)8 barrel. Eight parallel beta strands surrounded by eight alpha helices. One of the most common folds (~10% of all enzymes). Named after triosephosphate isomerase. Active site at the C-terminal end of the beta strands.
Rossmann fold — beta-alpha-beta motif that binds nucleotides (NAD, FAD). Found in dehydrogenases and many metabolic enzymes. Characterized by a GXGXXG motif.
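
The heptad repeat of a coiled coil can be checked directly against a sequence. As a rough sketch, the code below tries all seven possible registers and reports the one that places the most hydrophobic residues at positions a and d. The hydrophobic residue set, the scoring rule, and the test sequence are assumptions made for illustration; dedicated coiled-coil predictors use position-specific statistics instead.

```python
HYDROPHOBIC = frozenset("AILMFVW")  # simplified set (assumption)


def heptad_score(sequence, offset):
    """Fraction of a and d positions that are hydrophobic.

    `offset` is the index of a residue assigned to heptad position a.
    """
    core = [aa for i, aa in enumerate(sequence.upper())
            if (i - offset) % 7 in (0, 3)]   # 0 = position a, 3 = position d
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)


def best_register(sequence):
    """Return (offset, score) for the register with the highest score."""
    return max(((offset, heptad_score(sequence, offset)) for offset in range(7)),
               key=lambda pair: pair[1])


# Hypothetical sequence: two linker residues, then four ideal heptads.
offset, score = best_register("GS" + "LEELKKK" * 4)
```

For this made-up sequence the best register starts at index 2, the first leucine after the linker, with every a and d position hydrophobic.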

Quaternary Structure

Quaternary structure describes the arrangement of multiple polypeptide chains (subunits) into a multi-subunit complex. Not all proteins have quaternary structure — it applies only to multi-chain assemblies. Subunits are held together by the same non-covalent forces as tertiary structure, plus occasional inter-chain disulfide bonds.

Examples

Hemoglobin — alpha2-beta2 tetramer. Cooperative oxygen binding (sigmoidal binding curve): when one subunit binds O2, it increases the affinity of the others (T-state to R-state transition). The classic example of allosteric regulation.
Antibodies — two heavy chains + two light chains, held by disulfide bonds. Y-shaped molecule. Variable domains at the tips of the Fab arms bind antigen; the constant Fc region mediates effector functions. Five classes: IgG, IgM, IgA, IgD, IgE.
Proteasome — 26S complex (~2.5 MDa), degrades ubiquitin-tagged proteins. 20S core (four stacked rings of 7 subunits each, barrel-shaped) + 19S regulatory caps. ATP-dependent unfoldase + protease.
Ribosome — two subunits (30S + 50S in bacteria, 40S + 60S in eukaryotes). RNA-protein complex. The peptidyl transferase center in the large subunit catalyzes peptide bond formation.
ATP synthase — rotary motor enzyme. F0 portion in the membrane (proton channel), F1 portion in the matrix (catalytic subunits). Proton gradient drives rotation of the c-ring, causing conformational changes in F1 that synthesize ATP from ADP + Pi. Peter Mitchell (Nobel 1978), Paul Boyer and John Walker (Nobel 1997).
Viral capsids — symmetric assemblies of coat proteins. Icosahedral symmetry (most common): 60 subunits minimum, often multiples (T=1, T=3, T=4). Self-assembling. TMV: helical rod of 2,130 identical coat protein subunits.
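
The T numbers quoted for icosahedral capsids follow from the triangulation-number formula of Caspar-Klug theory (not named in the text above, added here as background): T = h² + hk + k² for non-negative integers h and k, with 60T subunits in the shell. A minimal sketch:

```python
def triangulation_number(h, k):
    """Caspar-Klug triangulation number for lattice steps h and k."""
    if h < 0 or k < 0 or (h == 0 and k == 0):
        raise ValueError("h and k must be non-negative and not both zero")
    return h * h + h * k + k * k


def capsid_subunits(h, k):
    """Number of coat protein subunits in an icosahedral shell."""
    return 60 * triangulation_number(h, k)


# The three cases mentioned in the text: T=1, T=3, T=4.
examples = {(h, k): capsid_subunits(h, k) for h, k in [(1, 0), (1, 1), (2, 0)]}
```

Only certain T values are possible (1, 3, 4, 7, ...), which is why capsid sizes come in discrete steps.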

Symmetry

Homo-oligomers — identical subunits. Dimers (C2 symmetry), trimers (C3), tetramers (D2 = three perpendicular C2 axes), hexamers. Most common: dimers and tetramers.
Hetero-oligomers — different subunits. Hemoglobin (alpha2-beta2), antibodies (H2L2). Allows functional specialization within a complex.
Allostery — binding at one site affects activity at another site on the same or a different subunit. Monod-Wyman-Changeux (MWC) model: concerted transition between R and T states. Koshland-Nemethy-Filmer (KNF) model: sequential changes. Critical for enzyme regulation and signal transduction.
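
The sigmoidal oxygen-binding curve of hemoglobin is often summarized with the empirical Hill equation, Y = pO2^n / (P50^n + pO2^n). This is a phenomenological fit, not the MWC or KNF model itself. The sketch below compares a cooperative binder with a non-cooperative one; the parameter values are approximate textbook figures used here as assumptions.

```python
def hill_saturation(p_o2, p50, n):
    """Fractional saturation from the Hill equation.

    p_o2 and p50 share the same pressure unit; n is the Hill coefficient.
    """
    return p_o2 ** n / (p50 ** n + p_o2 ** n)


# Approximate textbook parameters (assumptions), pressures in mmHg.
HEMOGLOBIN = {"p50": 26.0, "n": 2.8}   # cooperative, sigmoidal curve
MYOGLOBIN = {"p50": 2.8, "n": 1.0}     # non-cooperative, hyperbolic curve

for p_o2 in (20, 40, 100):
    hb = hill_saturation(p_o2, **HEMOGLOBIN)
    mb = hill_saturation(p_o2, **MYOGLOBIN)
    print(f"pO2={p_o2:>3} mmHg  hemoglobin={hb:.2f}  myoglobin={mb:.2f}")
```

A Hill coefficient above 1 indicates positive cooperativity: saturation changes steeply over a narrow pressure range, which lets hemoglobin load oxygen in the lungs and release it in tissues.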

Protein Folding

How a linear polypeptide chain reaches its native 3D structure is one of the grand challenges in biology. Levinthal's paradox (1969): a 100-residue protein has ~3^100 possible conformations. If it sampled one per picosecond, it would take longer than the age of the universe. Yet most proteins fold in milliseconds to seconds. The search is not random — folding follows an energy funnel.
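
The arithmetic behind Levinthal's paradox is easy to reproduce. The sketch below uses the numbers from the paragraph above (3 conformations per residue, 100 residues, one conformation sampled per picosecond); the age of the universe, about 13.8 billion years, is an added assumption for the comparison.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9   # approximate value (assumption)


def exhaustive_search_years(residues=100, states_per_residue=3,
                            samples_per_second=1e12):
    """Years needed to visit every conformation once at the given rate."""
    conformations = states_per_residue ** residues
    seconds = conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


years = exhaustive_search_years()
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"{years:.2e} years, about {ratio:.1e} times the age of the universe")
```

The result is on the order of 10^28 years, so even a much faster sampling rate would not rescue a purely random search.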

The Folding Funnel

The energy landscape of protein folding is a funnel: many high-energy unfolded states at the top, one (or a few) low-energy native states at the bottom. The protein doesn't search all conformations — it slides down the funnel, progressively forming native-like contacts.
Local secondary structures form first (nanoseconds to microseconds), then collapse into a compact "molten globule" intermediate (microseconds), then final native contacts form (milliseconds to seconds).
Some proteins fold in two-state kinetics (directly from unfolded to native, no detectable intermediate). Larger proteins often fold through intermediates and may require chaperones.

Chaperones

Hsp70 / DnaK — binds exposed hydrophobic patches on nascent or misfolded proteins, preventing aggregation. ATP-driven. Works with co-chaperones (Hsp40/DnaJ as J-domain co-chaperone, nucleotide exchange factors).
Hsp60 / GroEL-GroES — barrel-shaped "Anfinsen cage." Encapsulates a single protein (~60 kDa max) in a hydrophilic cavity, giving it time to fold in isolation. GroEL (bacterial) is a double-ring of 14 subunits; GroES caps one ring. 7 ATP hydrolyzed per folding cycle.
Hsp90 — specialized chaperone for signaling proteins (steroid receptors, kinases). Late-stage folding and stability. Target of geldanamycin, a natural-product inhibitor whose derivatives have been tested as anti-cancer agents; blocking Hsp90 destabilizes oncogenic client proteins.
Trigger factor — ribosome-associated chaperone in bacteria. First chaperone to interact with a nascent polypeptide as it emerges from the ribosome exit tunnel.

Misfolding and Disease

Alzheimer's disease — amyloid-beta peptide (A-beta, 40-42 residues) aggregates into amyloid fibrils and plaques in the brain. Cross-beta sheet structure. Whether plaques are the cause or consequence of neurodegeneration is debated (amyloid cascade hypothesis vs alternatives).
Parkinson's disease — alpha-synuclein aggregates into Lewy bodies. Normally intrinsically disordered; misfolding leads to fibril formation.
Prion diseases — PrP^Sc (misfolded prion protein) converts normal PrP^C into the misfolded form (template-directed misfolding). Causes CJD, BSE, kuru. Transmissible protein-only agent — no nucleic acid required. Stanley Prusiner, Nobel Prize 1997.
Amyloidosis — systemic diseases caused by deposition of misfolded protein fibrils in organs. AL amyloidosis (immunoglobulin light chains), transthyretin (TTR) amyloidosis. Tafamidis stabilizes TTR tetramer, preventing dissociation and misfolding.
Cystic fibrosis — most common mutation (deltaF508) causes misfolding of CFTR protein, which is retained in the ER and degraded. Lumacaftor + ivacaftor partially correct folding and channel function.

AlphaFold

AlphaFold, developed by DeepMind (Google), is an AI system that predicts protein 3D structures from amino acid sequences, often with near-experimental accuracy. It is widely regarded as having largely solved the single-chain "protein structure prediction problem," which had been an open challenge for 50 years.

AlphaFold2 (2020)

Won CASP14 (Critical Assessment of protein Structure Prediction) with a median GDT score of 92.4 (out of 100), far ahead of all competitors. For most targets, predictions were within the accuracy of experimental structures.
Architecture: Evoformer module processes multiple sequence alignments (MSAs) and pairwise residue features through attention-based neural network layers. Structure module converts features into 3D coordinates.
Trained on ~170,000 experimental structures from the PDB plus UniRef sequence databases.
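Demis Hassabis and John Jumper shared half of the 2024 Nobel Prize in Chemistry for protein structure prediction with AlphaFold2; the other half went to David Baker for computational protein design.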
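
The GDT score quoted for CASP14 is GDT_TS: the average, over distance cutoffs of 1, 2, 4, and 8 angstrom, of the percentage of C-alpha atoms in the model that lie within that cutoff of their experimental positions. The sketch below assumes the per-residue distances have already been computed from a single superposition, which is a simplification (the official evaluation searches for the best superposition at each cutoff). The distances in the usage line are made up.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS (0-100) from per-residue C-alpha deviations in angstrom."""
    if not distances:
        raise ValueError("at least one distance is required")
    total = len(distances)
    fractions = [sum(d <= cutoff for d in distances) / total
                 for cutoff in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical deviations for a five-residue model.
score = gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0])
```

Because the loosest cutoff is 8 angstrom, a model can score well on GDT_TS while still having local errors; a score above 90 implies that most residues are within about 1 to 2 angstrom.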

AlphaFold Protein Structure Database

DeepMind and EMBL-EBI released predicted structures for over 200 million proteins — essentially every known protein sequence in UniProt.
Freely available at alphafold.ebi.ac.uk. Each prediction includes per-residue confidence scores (pLDDT). High confidence (>90 pLDDT) regions are typically accurate; low confidence (<50) regions are often disordered or poorly predicted.
Transformed structural biology: researchers can now start with a predicted structure instead of spending months/years on experimental determination. Drug target identification, protein engineering, and evolutionary analysis all benefit.
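
In PDB-format files from the AlphaFold database, the per-residue pLDDT is stored in the B-factor column (columns 61-66 of each ATOM record). The sketch below reads it from C-alpha atoms and groups residues into confidence bands. The >90 and <50 thresholds come from the text above; the intermediate cutoff at 70 follows the database's usual color scheme and is an addition here. Function names are illustrative.

```python
def plddt_per_residue(pdb_text):
    """Map (chain ID, residue number) to pLDDT, read from C-alpha ATOM records."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 22 the chain, 23-26 the residue number.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            residue_number = int(line[22:26])
            scores[(chain_id, residue_number)] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Translate a pLDDT value into a named confidence band."""
    if plddt > 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


def summarize(scores):
    """Count residues in each confidence band."""
    counts = {}
    for value in scores.values():
        band = confidence_band(value)
        counts[band] = counts.get(band, 0) + 1
    return counts
```

Given the text of a downloaded model file, `summarize(plddt_per_residue(text))` shows at a glance how much of the chain is predicted confidently. Long stretches in the "very low" band often correspond to disordered regions.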

AlphaFold3 (2024)

Extends beyond proteins: predicts structures of protein-ligand, protein-DNA, protein-RNA, and protein-protein complexes.
Uses a diffusion-based architecture instead of AlphaFold2's direct coordinate prediction. Can model covalent modifications, ions, and small molecule ligands.
AlphaFold Server allows free predictions through a web interface (non-commercial use).

Limitations

Static structures only — does not predict dynamics, conformational ensembles, or allosteric transitions.
Struggles with proteins that have few homologous sequences (orphan proteins), intrinsically disordered regions, and some multi-chain complexes.
Does not predict the effects of mutations, post-translational modifications, or ligand binding on structure (though fine-tuned variants are being developed).
Not a substitute for experimental structures when high accuracy is critical (e.g., drug binding sites where sub-angstrom precision matters).

🤖 AlphaFold is like an AI that solves puzzles. Scientists spent 50 years trying to figure out how proteins fold into their 3D shapes. Then Google's AI cracked it! Now it has predicted the shape of almost every protein known to science — over 200 million of them. The scientists who built it won the Nobel Prize!

The Protein Data Bank (PDB)

The PDB (rcsb.org) is the global repository for experimentally determined 3D structures of biological macromolecules. Established in 1971 at Brookhaven National Laboratory with 7 structures. As of 2025, it contains over 220,000 structures.

Experimental Methods

X-ray crystallography — the dominant method (~85% of PDB structures). Grow protein crystals, expose to X-rays, interpret the diffraction pattern to calculate electron density. Resolution: typically 1.5-3.0 angstrom. Requires crystals (sometimes impossible for membrane proteins, flexible proteins, large complexes).
Cryo-electron microscopy (cryo-EM) — flash-freeze protein in vitreous ice, image with electron microscope, computationally reconstruct 3D structure from thousands of 2D particle images. No crystals needed. "Resolution revolution" since ~2013 (direct electron detectors). Now routinely achieves 2-4 angstrom. Dominant for large complexes (ribosomes, viruses, membrane protein complexes). Jacques Dubochet, Joachim Frank, Richard Henderson: Nobel Prize 2017.
NMR spectroscopy — determines structure in solution (no crystals, no freezing). Limited to smaller proteins (~40 kDa). Provides information about dynamics, flexibility, and interactions. ~7% of PDB structures.
Neutron diffraction — locates hydrogen atoms (invisible to X-rays). Important for understanding enzyme mechanisms and hydrogen bonding. Requires large crystals and neutron sources (rare).

Using the PDB

Search — by protein name, gene, organism, sequence (BLAST), structure similarity, ligand, author, or method. Advanced search supports complex queries.
Visualization — built-in Mol* viewer renders structures in the browser. Also: PyMOL (most popular, open-source + commercial), UCSF ChimeraX (free, excellent for cryo-EM), VMD (molecular dynamics).
PDB file format — legacy format (ATOM/HETATM records, 80-character lines). Being superseded by PDBx/mmCIF (machine-readable, extensible). All new depositions use mmCIF.
Key identifiers — 4-character PDB ID (e.g., 1HHO for hemoglobin). UniProt accession maps proteins to all their PDB structures. SCOP/CATH classify protein folds.
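
The legacy PDB format is column-based, so ATOM and HETATM records can be read by slicing fixed character positions. The following sketch parses the main fields and rebuilds each chain's one-letter sequence from its C-alpha atoms. It is a teaching example under simplifying assumptions (no handling of alternate locations, insertion codes, or multiple models); for real work a maintained parser or the mmCIF format is the safer choice.

```python
from typing import NamedTuple

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line):
    """Slice one ATOM/HETATM record by its fixed columns (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        residue=line[17:20].strip(),
        chain=line[21],
        residue_number=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        element=line[76:78].strip(),
    )


def parse_atoms(pdb_text):
    """Parse every ATOM and HETATM record in a PDB-format string."""
    return [parse_atom_line(line) for line in pdb_text.splitlines()
            if line.startswith(("ATOM  ", "HETATM"))]


def chain_sequences(atoms):
    """One-letter sequence per chain, taken from C-alpha atoms of standard residues."""
    letters = {}
    for atom in atoms:
        if atom.name == "CA" and atom.residue in THREE_TO_ONE:
            letters.setdefault(atom.chain, []).append(THREE_TO_ONE[atom.residue])
    return {chain: "".join(seq) for chain, seq in letters.items()}
```

Residues missing from the experimental model do not appear in ATOM records, so a sequence rebuilt this way can be shorter than the UniProt sequence for the same protein.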

Drug Targets

Protein structure is central to modern drug discovery. Understanding the 3D shape of a target protein, especially its binding sites, enables rational drug design. The large majority of approved drugs act on protein targets (mostly enzymes, receptors, ion channels, and transporters).

Structure-Based Drug Design

Target identification — identify a protein whose dysfunction causes disease. Validate that modulating it (inhibition, activation, degradation) has therapeutic benefit. Genomics, proteomics, and CRISPR screens help identify targets.
Structure determination — solve the target's 3D structure, ideally with a ligand bound (holo structure). Identify the binding site, key residues, and interactions. AlphaFold predictions increasingly used as starting models.
Virtual screening — computationally dock millions of compounds into the binding site. Score by predicted binding affinity. Reduces the chemical space from billions to thousands for experimental testing. Tools: AutoDock Vina, Glide (Schrodinger), GOLD.
Lead optimization — iteratively improve hits through medicinal chemistry. Co-crystal structures with improved compounds reveal structure-activity relationships (SAR). Optimize potency, selectivity, solubility, metabolic stability, and toxicity.

Notable Drug-Target Successes

HIV protease inhibitors — saquinavir, ritonavir, lopinavir. Designed from the crystal structure of HIV-1 protease (homodimeric aspartyl protease). Transformed HIV from a death sentence to a manageable chronic disease. One of the first triumphs of structure-based drug design (1990s).
Imatinib (Gleevec) — BCR-ABL kinase inhibitor for chronic myeloid leukemia (CML). Crystal structure revealed an inactive kinase conformation that imatinib stabilizes. 5-year survival improved from 30% to >90%. Paradigm-shifting targeted therapy.
Nirmatrelvir (Paxlovid) — SARS-CoV-2 main protease (Mpro) inhibitor. Designed using the crystal structure of Mpro. Pfizer went from target to EUA in ~18 months. Demonstrates the power of structural biology in pandemic response.
Osimertinib (Tagrisso) — third-generation EGFR inhibitor for non-small-cell lung cancer. Designed to target the T790M resistance mutation while sparing wild-type EGFR. Structure-guided to exploit a cysteine residue (Cys797) for covalent binding.

Emerging Approaches

PROTACs — Proteolysis-Targeting Chimeras. Bifunctional molecules that bind both the target protein and an E3 ubiquitin ligase, inducing target degradation. Can target "undruggable" proteins that lack enzymatic active sites. ARV-110 (androgen receptor degrader) in clinical trials.
Molecular glues — small molecules that stabilize protein-protein interactions to induce degradation or loss of function. Thalidomide/lenalidomide: bind CRBN (E3 ligase), recruit neosubstrates (Ikaros, Aiolos) for degradation. Serendipitous discovery explained by structural biology.
AI-driven drug design — diffusion models generate novel molecules optimized for target binding. AlphaFold structures used as inputs. Companies: Recursion, Insilico Medicine, Isomorphic Labs (DeepMind spin-off). Early clinical candidates entering trials.
Antibody engineering — computational design of antibodies and nanobodies using structural data. CDR loop modeling, affinity maturation in silico. AlphaFold-Multimer and RoseTTAFold for antibody-antigen complex prediction.

Resources

RCSB Protein Data Bank

AlphaFold Protein Structure Database

PyMOL

UCSF ChimeraX

PDB-101

Foldit