Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology


English, 1137 pages, 2018


Table of Contents:
Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology......Page 1
Table of Contents......Page 7
Preface......Page 17
Introduction: Bioanalytics - a Science in its Own Right......Page 21
Part I: Protein Analytics......Page 27
1.1 Properties of Proteins......Page 29
1.2 Protein Localization and Purification Strategy......Page 32
1.3 Homogenization and Cell Disruption......Page 33
1.4 Precipitation......Page 35
1.5 Centrifugation......Page 37
1.5.2 Centrifugation Techniques......Page 38
1.6 Removal of Salts and Hydrophilic Contaminants......Page 41
1.7 Concentration......Page 43
1.8.1 Properties of Detergents......Page 44
1.8.2 Removal of Detergents......Page 46
Further Reading......Page 48
Chapter 2: Protein Determination......Page 49
2.1 Quantitative Determination by Staining Tests......Page 51
2.1.2 Lowry Assay......Page 52
2.1.3 Bicinchoninic Acid Assay (BCA Assay)......Page 53
2.2 Spectroscopic Methods......Page 54
2.2.1 Measurements in the UV Range......Page 55
2.3 Radioactive Labeling of Peptides and Proteins......Page 57
Further Reading......Page 59
3.1 The Driving Force behind Chemical Reactions......Page 61
3.2 Rate of Chemical Reactions......Page 62
3.4 Enzymes as Catalysts......Page 63
3.6 Michaelis-Menten Theory......Page 64
3.7 Determination of Km and Vmax......Page 65
3.8.1 Competitive Inhibitors......Page 66
3.9 Test System Set-up......Page 67
3.9.3 Detection System......Page 68
3.9.6 Selecting the Buffer Substance and the Ionic Strength......Page 69
3.9.8 Substrate Concentration......Page 70
Further Reading......Page 71
Chapter 4: Microcalorimetry......Page 73
4.1 Differential Scanning Calorimetry (DSC)......Page 74
4.2.1 Ligand Binding to Proteins......Page 80
4.2.2 Binding of Molecules to Membranes: Insertion and Peripheral Binding......Page 84
4.3 Pressure Perturbation Calorimetry (PPC)......Page 87
Further Reading......Page 88
5.1.1 Antibodies and Immune Defense......Page 89
5.1.3 Properties of Antibodies......Page 90
5.1.4 Functional Structure of IgG......Page 92
5.1.5 Antigen Interaction at the Combining Site......Page 93
5.1.6 Handling of Antibodies......Page 94
5.2 Antigens......Page 95
5.3 Antigen-Antibody Reaction......Page 97
5.3.1 Immunoagglutination......Page 98
5.3.2 Immunoprecipitation......Page 99
5.3.3 Immune Binding......Page 110
5.4 Complement Fixation......Page 120
5.5 Methods in Cellular Immunology......Page 121
5.6 Alteration of Biological Functions......Page 123
5.7.1 Types of Antibodies......Page 124
5.7.2 New Antibody Techniques (Antibody Engineering)......Page 125
5.7.3 Optimized Monoclonal Antibody Constructs with Effector Functions for Therapeutic Application......Page 128
Further Reading......Page 132
Chapter 6: Chemical Modification of Proteins and Protein Complexes......Page 133
6.1 Chemical Modification of Protein Functional Groups......Page 134
6.2.1 Investigation with Naturally Occurring Proteins......Page 142
6.2.2 Investigation of Recombinant and Mutated Proteins......Page 146
6.3.2 Photoaffinity Labeling......Page 147
Further Reading......Page 155
Chapter 7: Spectroscopy......Page 157
7.1.1 Physical Principles of Optical Spectroscopic Techniques......Page 158
7.1.2 Interaction of Light with Matter......Page 159
7.1.3 Absorption Measurement and the Lambert-Beer Law......Page 166
7.1.4 Photometer......Page 169
7.1.5 Time-Resolved Spectroscopy......Page 170
7.2.1 Basic Principles......Page 172
7.2.2 Chromoproteins......Page 173
7.3.1 Basic Principles of Fluorescence Spectroscopy......Page 180
7.3.2 Fluorescence: Emission and Action Spectra......Page 182
7.3.3 Fluorescence Studies using Intrinsic and Extrinsic Probes......Page 183
7.3.4 Green Fluorescent Protein (GFP) as a Unique Fluorescent Probe......Page 184
7.3.5 Quantum Dots as Fluorescence Labels......Page 185
7.3.7 Förster Resonance Energy Transfer (FRET)......Page 186
7.3.8 Frequent Mistakes in Fluorescence Spectroscopy: "The Seven Sins of Fluorescence Measurements"......Page 187
7.4.1 Basic Principles of IR Spectroscopy......Page 189
7.4.2 Molecular Vibrations......Page 190
7.4.3 Technical Aspects of Infrared Spectroscopy......Page 191
7.4.4 Infrared Spectra of Proteins......Page 194
7.5.1 Basic Principles of Raman Spectroscopy......Page 197
7.5.2 Raman Experiments......Page 198
7.5.3 Resonance Raman Spectroscopy......Page 199
7.6 Single Molecule Spectroscopy......Page 200
7.7.1 Linear Dichroism......Page 201
7.7.2 Optical Rotation Dispersion and Circular Dichroism......Page 204
Further Reading......Page 206
8.1 Steps on the Road to Microscopy - from Simple Lenses to High Resolution Microscopes......Page 207
8.2 Modern Applications......Page 208
8.3 Basic Physical Principles......Page 209
8.4 Detection Methods......Page 215
8.5 Sample Preparation......Page 221
8.6 Special Fluorescence Microscopic Analysis......Page 223
Further Reading......Page 231
9.1 Proteolytic Enzymes......Page 233
9.2 Strategy......Page 234
9.4 Cleavage of Disulfide Bonds and Alkylation......Page 235
9.5.1 Proteases......Page 236
9.5.2 Conditions for Proteolysis......Page 241
9.6 Chemical Fragmentation......Page 242
9.7 Summary......Page 243
Further Reading......Page 244
10.1 Instrumentation......Page 245
10.2 Fundamental Terms and Concepts in Chromatography......Page 246
10.3 Biophysical Properties of Peptides and Proteins......Page 250
10.4 Chromatographic Separation Modes for Peptides and Proteins......Page 251
10.4.2 High-Performance Reversed-Phase Chromatography (HP-RPC)......Page 253
10.4.3 High-Performance Normal-Phase Chromatography (HP-NPC)......Page 254
10.4.4 High-Performance Hydrophilic Interaction Chromatography (HP-HILIC)......Page 255
10.4.6 High-Performance Hydrophobic Interaction Chromatography (HP-HIC)......Page 256
10.4.7 High-Performance Ion Exchange Chromatography (HP-IEX)......Page 258
10.4.8 High-Performance Affinity Chromatography (HP-AC)......Page 259
10.5.1 Development of an Analytical Method......Page 260
10.5.2 Scaling up to Preparative Chromatography......Page 262
10.5.3 Fractionation......Page 263
10.6.1 Purification of Peptides and Proteins by MD-HPLC Methods......Page 264
10.6.3 Strategies for MD-HPLC Methods......Page 265
10.6.4 Design of an Effective MD-HPLC Scheme......Page 266
Further Reading......Page 268
Chapter 11: Electrophoretic Techniques......Page 269
11.1 Historical Review......Page 270
11.2 Theoretical Fundamentals......Page 271
11.3 Equipment and Procedures of Gel Electrophoreses......Page 274
11.3.1 Sample Preparation......Page 275
11.3.2 Gel Media for Electrophoresis......Page 276
11.3.3 Detection and Quantification of the Separated Proteins......Page 277
11.3.4 Zone Electrophoresis......Page 279
11.3.5 Porosity Gradient Gels......Page 280
11.3.7 Disc Electrophoresis......Page 281
11.3.9 SDS Polyacrylamide Gel Electrophoresis......Page 283
11.3.10 Cationic Detergent Electrophoresis......Page 284
11.3.12 Isoelectric Focusing......Page 285
11.4.1 Electroelution from Gels......Page 289
11.4.2 Preparative Zone Electrophoresis......Page 290
11.4.3 Preparative Isoelectric Focusing......Page 291
11.5 Free Flow Electrophoresis......Page 292
11.6 High-Resolution Two-Dimensional Electrophoresis......Page 293
11.6.2 Prefractionation......Page 294
11.6.3 First Dimension: IEF in IPG Strips......Page 295
11.6.6 Difference Gel Electrophoresis (DIGE)......Page 296
11.7.1 Blot Systems......Page 298
Further Reading......Page 299
12.1 Historical Overview......Page 301
12.2 Capillary Electrophoresis Setup......Page 302
12.3.1 Sample Injection......Page 303
12.3.2 The Engine: Electroosmotic Flow (EOF)......Page 304
12.3.4 Detection Methods......Page 305
12.4.1 Capillary Zone Electrophoresis (CZE)......Page 307
12.4.2 Affinity Capillary Electrophoresis (ACE)......Page 311
12.4.3 Micellar Electrokinetic Chromatography (MEKC)......Page 312
12.4.4 Capillary Electrochromatography (CEC)......Page 314
12.4.5 Chiral Separations......Page 315
12.4.6 Capillary Gel Electrophoresis (CGE)......Page 316
12.4.7 Capillary Isoelectric Focusing (CIEF)......Page 317
12.4.8 Isotachophoresis (ITP)......Page 319
12.5.2 Online Sample Concentration......Page 321
12.5.3 Fractionation......Page 322
12.6 Outlook......Page 323
Further Reading......Page 325
Chapter 13: Amino Acid Analysis......Page 327
13.1.1 Acidic Hydrolysis......Page 328
13.3.1 Post-Column Derivatization......Page 329
13.3.2 Pre-column Derivatization......Page 331
13.4 Amino Acid Analysis using Mass Spectrometry......Page 335
13.5 Summary......Page 336
Further Reading......Page 337
Chapter 14: Protein Sequence Analysis......Page 339
14.1.1 Reactions of the Edman Degradation......Page 341
14.1.2 Identification of the Amino Acids......Page 342
14.1.3 Quality of Edman Degradation: the Repetitive Yield......Page 343
14.1.4 Instrumentation......Page 345
14.1.5 Problems of Amino Acid Sequence Analysis......Page 348
14.2.1 Chemical Degradation Methods......Page 351
14.2.3 Degradation of Polypeptides with Carboxypeptidases......Page 353
Further Reading......Page 354
Chapter 15: Mass Spectrometry......Page 355
15.1.1 Matrix Assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS)......Page 356
15.1.2 Electrospray Ionization (ESI)......Page 361
15.2 Mass Analyzer......Page 367
15.2.1 Time-of-Flight Analyzers (TOF)......Page 369
15.2.2 Quadrupole Analyzer......Page 371
15.2.3 Electric Ion Traps......Page 374
15.2.4 Magnetic Ion Trap......Page 375
15.2.5 Orbital Ion Trap......Page 376
15.2.6 Hybrid Instruments......Page 377
15.3 Ion Detectors......Page 381
15.3.1 Secondary Electron Multiplier (SEV)......Page 382
15.4.1 Collision Induced Dissociation (CID)......Page 383
15.4.2 Prompt and Metastable Decay (ISD, PSD)......Page 384
15.4.4 Generation of Free Radicals (ECD, HECD, ETD)......Page 386
15.5.2 Influence of Isotopy......Page 388
15.5.4 Determination of the Number of Charges......Page 391
15.5.7 Problems......Page 392
15.6.1 Identification......Page 394
15.6.3 Structure Elucidation......Page 395
15.7.1 LC-MS......Page 401
15.7.2 LC-MS/MS......Page 402
15.8 Quantification......Page 404
Further Reading......Page 405
16.1.1 Principle of Two-Hybrid Systems......Page 407
16.1.3 Construction of Bait and Prey Proteins......Page 408
16.1.5 AD Fusion Proteins and cDNA Libraries......Page 411
16.1.6 Carrying out a Y2H Screen......Page 412
16.1.7 Other Modifications and Extensions of the Two-Hybrid-Technology......Page 417
16.1.8 Biochemical and Functional Analysis of Interactions......Page 419
16.2 TAP-Tagging and Purification of Protein Complexes......Page 420
16.3 Analyzing Interactions In Vitro: GST-Pulldown......Page 423
16.4 Co-immunoprecipitation......Page 424
16.5 Far-Western......Page 425
16.6 Surface Plasmon Resonance Spectroscopy......Page 426
16.7.1 Introduction......Page 428
16.7.3 Methods of FRET Measurements......Page 429
16.7.4 Fluorescent Probes for FRET......Page 432
16.7.5 Alternative Tools for Probing Protein-Protein Interactions: LINC and STET......Page 434
16.8 Analytical Ultracentrifugation......Page 435
16.8.1 Principles of Instrumentation......Page 436
16.8.2 Basics of Centrifugation......Page 437
16.8.3 Sedimentation Velocity Experiments......Page 438
16.8.4 Sedimentation-Diffusion Equilibrium Experiments......Page 441
Further Reading......Page 442
Chapter 17: Biosensors......Page 445
17.2.1 Concept of Biosensors......Page 446
17.2.2 Construction and Function of Biosensors......Page 447
17.2.3 Cell Sensors......Page 451
17.2.4 Immunosensors......Page 452
17.3 Biomimetic Sensors......Page 453
17.4 From Glucose Enzyme Electrodes to Electronic DNA Biochips......Page 454
Further Reading......Page 455
Part II: 3D Structure Determination......Page 457
18.1 NMR Spectroscopy of Biomolecules......Page 459
18.1.1 Theory of NMR Spectroscopy......Page 460
18.1.2 One-Dimensional NMR Spectroscopy......Page 464
18.1.3 Two-Dimensional NMR Spectroscopy......Page 469
18.1.4 Three-Dimensional NMR Spectroscopy......Page 475
18.1.5 Resonance Assignment......Page 478
18.1.6 Protein Structure Determination......Page 483
18.1.7 Protein Structures and more - an Overview......Page 488
18.2 EPR Spectroscopy of Biological Systems......Page 492
18.2.1 Basics of EPR Spectroscopy......Page 493
18.2.2 cw-EPR Spectroscopy......Page 494
18.2.4 Electron Spin Nuclear Spin Coupling (Hyperfine Coupling)......Page 495
18.2.5 g and Hyperfine Anisotropy......Page 496
18.2.6 Electron Spin-Electron Spin Coupling......Page 498
18.2.7 Pulsed EPR Experiments......Page 499
18.2.8 Further Examples of EPR Applications......Page 505
18.2.10 Comparison EPR/NMR......Page 507
Further Reading......Page 508
Chapter 19: Electron Microscopy......Page 511
19.1 Transmission Electron Microscopy - Instrumentation......Page 513
19.2.1 Native Samples in Ice......Page 514
19.2.2 Negative Staining......Page 516
19.2.3 Metal Coating by Evaporation......Page 517
19.3.1 Resolution of a Transmission Electron Microscope......Page 518
19.3.2 Interactions of the Electron Beam with the Object......Page 519
19.3.4 Electron Microscopy with a Phase Plate......Page 521
19.3.5 Imaging Procedure for Frozen-Hydrated Specimens......Page 522
19.3.6 Recording Images - Cameras and the Impact of Electrons......Page 523
19.4.1 Pixel Size......Page 524
19.4.2 Fourier Transformation......Page 525
19.4.3 Analysis of the Contrast Transfer Function and Object Features......Page 527
19.4.4 Improving the Signal-to-Noise Ratio......Page 530
19.4.5 Principal Component Analysis and Classification......Page 532
19.5 Three-Dimensional Electron Microscopy......Page 534
19.5.1 Three-Dimensional Reconstruction of Single Particles......Page 535
19.5.2 Three-Dimensional Reconstruction of Regularly Arrayed Macromolecular Complexes......Page 537
19.5.3 Electron Tomography of Individual Objects......Page 538
19.6.1 Hybrid Approach: Combination of EM and X-Ray Data......Page 540
19.6.3 Identifying Protein Complexes in Cellular Tomograms......Page 541
19.7 Perspectives of Electron Microscopy......Page 542
Further Reading......Page 543
20.1 Introduction......Page 545
20.2 Principle of the Atomic Force Microscope......Page 546
20.3 Interaction between Tip and Sample......Page 547
20.5 Mapping Biological Macromolecules......Page 548
20.6 Force Spectroscopy of Single Molecules......Page 550
20.7 Detection of Functional States and Interactions of Individual Proteins......Page 552
Further Reading......Page 553
Chapter 21: X-Ray Structure Analysis......Page 555
21.1 X-Ray Crystallography......Page 556
21.1.1 Crystallization......Page 557
21.1.2 Crystals and X-Ray Diffraction......Page 559
21.1.3 The Phase Problem......Page 564
21.1.4 Model Building and Structure Refinement......Page 568
21.2 Small Angle X-Ray Scattering (SAXS)......Page 569
21.2.1 Machine Setup......Page 570
21.2.2 Theory......Page 571
21.2.3 Data Analysis......Page 573
21.3.1 Machine Setup and Theory......Page 575
Acknowledgement......Page 576
Further Reading......Page 577
Part III: Peptides, Carbohydrates, and Lipids......Page 579
22.1 Concept of Peptide Synthesis......Page 581
22.2 Purity of Synthetic Peptides......Page 586
22.3 Characterization and Identity of Synthetic Peptides......Page 588
22.4 Characterization of the Structure of Synthetic Peptides......Page 590
22.5 Analytics of Peptide Libraries......Page 593
Further Reading......Page 595
Chapter 23: Carbohydrate Analysis......Page 597
23.1.1 The Series of D-Sugars......Page 598
23.1.2 Stereochemistry of D-Glucose......Page 599
23.1.5 The Glycosidic Bond......Page 600
23.2 Protein Glycosylation......Page 605
23.2.2 Structure of the O-Glycans......Page 606
23.3 Analysis of Protein Glycosylation......Page 607
23.3.1 Analysis on the Basis of the Intact Glycoprotein......Page 608
23.3.2 Mass Spectrometric Analysis on the Basis of Glycopeptides......Page 614
23.3.3 Release and Isolation of the N-Glycan Pool......Page 616
23.3.4 Analysis of Individual N-Glycans......Page 625
23.4 Genome, Proteome, Glycome......Page 636
23.5 Final Considerations......Page 637
Further Reading......Page 638
24.1 Structure and Classification of Lipids......Page 639
24.2 Extraction of Lipids from Biological Sources......Page 641
24.2.2 Solid Phase Extraction......Page 642
24.3.1 Chromatographic Methods......Page 644
24.3.3 Immunoassays......Page 648
24.3.5 Combining Different Analytical Systems......Page 649
24.4.1 Whole Lipid Extracts......Page 652
24.4.2 Fatty Acids......Page 653
24.4.3 Nonpolar Neutral Lipids......Page 654
24.4.4 Polar Ester Lipids......Page 656
24.4.5 Lipid Hormones and Intracellular Signaling Molecules......Page 659
24.5 Lipid Vitamins......Page 664
24.6 Lipidome Analysis......Page 666
24.7 Perspectives......Page 668
Further Reading......Page 670
25.1.1 Phosphorylation......Page 671
25.1.2 Acetylation......Page 672
25.2 Strategies for the Analysis of Phosphorylated and Acetylated Proteins and Peptides......Page 673
25.3 Separation and Enrichment of Phosphorylated and Acetylated Proteins and Peptides......Page 675
25.4.1 Detection by Enzymatic, Radioactive, Immunochemical, and Fluorescence Based Methods......Page 677
25.5 Localization and Identification of Post-translationally Modified Amino Acids......Page 679
25.5.2 Localization of Phosphorylated and Acetylated Amino Acids by Tandem Mass Spectrometry......Page 680
25.6 Quantitative Analysis of Post-translational Modifications......Page 685
Further Reading......Page 687
Part IV: Nucleic Acid Analytics......Page 689
26.1.1 Phenolic Purification of Nucleic Acids......Page 691
26.1.2 Gel Filtration......Page 692
26.1.3 Precipitation of Nucleic Acids with Ethanol......Page 693
26.1.4 Determination of the Nucleic Acid Concentration......Page 694
26.2 Isolation of Genomic DNA......Page 695
26.3.1 Isolation of Plasmid DNA from Bacteria......Page 696
26.4.1 Isolation of Phage DNA......Page 700
26.4.2 Isolation of Eukaryotic Viral DNA......Page 701
26.6 Isolation of RNA......Page 702
26.6.1 Isolation of Cytoplasmic RNA......Page 703
26.6.2 Isolation of Poly(A) RNA......Page 704
26.7 Isolation of Nucleic Acids using Magnetic Particles......Page 705
Further Reading......Page 706
27.1.1 Principle of Restriction Analyses......Page 707
27.1.3 Restriction Enzymes......Page 708
27.1.4 In Vitro Restriction and Applications......Page 711
27.2 Electrophoresis......Page 716
27.2.1 Gel Electrophoresis of DNA......Page 717
27.2.2 Gel Electrophoresis of RNA......Page 723
27.2.3 Pulsed-Field Gel Electrophoresis (PFGE)......Page 724
27.2.4 Two-Dimensional Gel Electrophoresis......Page 726
27.2.5 Capillary Gel Electrophoresis......Page 727
27.3.1 Fluorescent Dyes......Page 728
27.4.2 Choice of Membrane......Page 730
27.4.3 Southern Blotting......Page 731
27.4.4 Northern Blotting......Page 732
27.4.6 Colony and Plaque Hybridization......Page 733
27.5.3 Purification using Electroelution......Page 734
27.6.1 Principles of the Synthesis of Oligonucleotides......Page 735
27.6.2 Investigation of the Purity and Characterization of Oligonucleotides......Page 737
27.6.3 Mass Spectrometric Investigation of Oligonucleotides......Page 738
27.6.4 IP-RP-HPLC-MS Investigation of a Phosphorothioate Oligonucleotide......Page 740
Further Reading......Page 743
Chapter 28: Techniques for the Hybridization and Detection of Nucleic Acids......Page 745
28.1 Basic Principles of Hybridization......Page 746
28.1.1 Principle and Practice of Hybridization......Page 747
28.1.2 Specificity of the Hybridization and Stringency......Page 748
28.1.3 Hybridization Methods......Page 749
28.2 Probes for Nucleic Acid Analysis......Page 755
28.2.1 DNA Probes......Page 756
28.2.2 RNA Probes......Page 757
28.2.4 LNA Probes......Page 758
28.3.1 Labeling Positions......Page 759
28.3.2 Enzymatic Labeling......Page 761
28.3.4 Chemical Labeling......Page 763
28.4.2 Radioactive Systems......Page 764
28.4.3 Non-radioactive Systems......Page 765
28.5 Amplification Systems......Page 776
28.5.2 Target-Specific Signal Amplification......Page 777
28.5.3 Signal Amplification......Page 778
Further Reading......Page 779
29.1 Possibilities of PCR......Page 781
29.2.1 Instruments......Page 782
29.2.2 Amplification of DNA......Page 784
29.2.3 Amplification of RNA (RT-PCR)......Page 787
29.2.5 Quantitative PCR......Page 789
29.3.1 Nested PCR......Page 792
29.3.4 Multiplex PCR......Page 793
29.3.7 Homogeneous PCR Detection Procedures......Page 794
29.3.10 Other Approaches......Page 795
29.4.1 Avoiding Contamination......Page 796
29.4.2 Decontamination......Page 797
29.5.1 Detection of Infectious Diseases......Page 798
29.5.2 Detection of Genetic Defects......Page 799
29.5.3 The Human Genome Project......Page 802
29.6.3 Helicase-Dependent Amplification (HDA)......Page 803
29.6.4 Ligase Chain Reaction (LCR)......Page 805
29.6.5 Qβ Amplification......Page 806
Further Reading......Page 808
Chapter 30: DNA Sequencing......Page 811
30.1 Gel-Supported DNA Sequencing Methods......Page 812
30.1.1 Sequencing according to Sanger: The Dideoxy Method......Page 815
30.1.2 Labeling Techniques and Methods of Verification......Page 822
30.1.3 Chemical Cleavage according to Maxam and Gilbert......Page 826
30.2 Gel-Free DNA Sequencing Methods - The Next Generation......Page 832
30.2.1 Sequencing by Synthesis......Page 833
30.2.2 Single Molecule Sequencing......Page 839
Further Reading......Page 841
Chapter 31: Analysis of Epigenetic Modifications......Page 843
31.1 Overview of the Methods to Detect DNA-Modifications......Page 844
31.2.1 Amplification and Sequencing of Bisulfite-Treated DNA......Page 845
31.2.2 Restriction Analysis after Bisulfite PCR......Page 846
31.2.3 Methylation Specific PCR......Page 848
31.3 DNA Analysis with Methylation Specific Restriction Enzymes......Page 849
31.4 Methylation Analysis by Methylcytosine-Binding Proteins......Page 851
31.5 Methylation Analysis by Methylcytosine-Specific Antibodies......Page 852
31.6 Methylation Analysis by DNA Hydrolysis and Nearest Neighbor-Assays......Page 853
31.8 Chromosome Interaction Analyses......Page 854
Further Reading......Page 855
32.1.1 Basic Features for DNA-Protein Recognition: Double-Helical Structures......Page 857
32.1.2 DNA Curvature......Page 858
32.1.3 DNA Topology......Page 859
32.2 DNA-Binding Motifs......Page 861
32.3.2 Gel Electrophoresis......Page 862
32.3.3 Determination of Dissociation Constants......Page 865
32.3.4 Analysis of DNA-Protein Complex Dynamics......Page 866
32.4 DNA Footprint Analysis......Page 867
32.4.2 Primer Extension Reaction for DNA Analysis......Page 869
32.4.3 Hydrolysis Methods......Page 870
32.4.4 Chemical Reagents for the Modification of DNA-Protein Complexes......Page 872
32.4.5 Interference Conditions......Page 874
32.4.6 Chemical Nucleases......Page 875
32.4.7 Genome-Wide DNA-Protein Interactions......Page 876
32.5.2 Fluorophores and Labeling Procedures......Page 877
32.5.3 Fluorescence Resonance Energy Transfer (FRET)......Page 878
32.5.5 Surface Plasmon Resonance (SPR)......Page 879
32.5.6 Scanning Force Microscopy (SFM)......Page 880
32.5.7 Optical Tweezers......Page 881
32.6.1 Functional Diversity of RNA......Page 882
32.6.3 Dynamics of RNA-Protein Interactions......Page 883
32.7 Characteristic RNA-Binding Motifs......Page 885
32.8 Special Methods for the Analysis of RNA-Protein Complexes......Page 886
32.8.2 Labeling Methods......Page 887
32.8.4 Customary RNases......Page 888
32.8.5 Chemical Modification of RNA-Protein Complexes......Page 889
32.8.6 Chemical Crosslinking......Page 892
32.8.8 Genome-Wide Identification of Transcription Start Sites (TSS)......Page 893
32.9.1 Tri-hybrid Method......Page 894
32.9.2 Aptamers and the SELEX Procedure......Page 895
Further Reading......Page 896
Part V: Functional and Systems Analytics......Page 899
33.1 Sequence Analysis and Bioinformatics......Page 901
33.2 Sequence: An Abstraction for Biomolecules......Page 902
33.3 Internet Databases and Services......Page 903
33.3.1 Sequence Retrieval from Public Databases......Page 904
33.3.2 Data Contents and File Format......Page 905
33.4.1 EMBOSS......Page 907
33.6 Sequence Patterns......Page 908
33.6.1 Transcription Factor Binding Sites......Page 910
33.6.2 Identification of Coding Regions......Page 911
33.6.3 Protein Localization......Page 912
33.7.1 Identity, Similarity, Homology......Page 913
33.7.2 Optimal Sequence Alignment......Page 914
33.7.4 Profile-Based Sensitive Database Search: PSI-BLAST......Page 916
33.8 Multiple Alignment and Consensus Sequences......Page 917
33.9 Structure Prediction......Page 918
33.10 Outlook......Page 919
34.1.1 Overview......Page 921
34.1.2 Nuclease S1 Analysis of RNA......Page 922
34.1.3 Ribonuclease-Protection Assay (RPA)......Page 924
34.1.4 Primer Extension Assay......Page 927
34.1.5 Northern Blot and Dot- and Slot-Blot......Page 928
34.1.6 Reverse Transcription Polymerase Chain Reaction (RT-PCR and RT-qPCR)......Page 930
34.2.1 Nuclear-run-on Assay......Page 931
34.2.2 Labeling of Nascent RNA with 5-Fluoro-uridine (FUrd)......Page 932
34.3.1 Components of an In Vitro Transcription Assay......Page 933
34.3.3 Template DNA and Detection of In Vitro Transcripts......Page 934
34.4.1 Vectors for Analysis of Gene-Regulatory cis-Elements......Page 937
34.4.2 Transfer of DNA into Mammalian Cells......Page 938
34.4.3 Analysis of Reporter Gene Expression......Page 940
Further Reading......Page 942
35.1.1 Labeling Strategy......Page 943
35.1.3 Labeling of DNA Probes......Page 944
35.1.4 In Situ Hybridization......Page 945
35.2.1 FISH Analysis of Genomic DNA......Page 946
35.2.2 Comparative Genomic Hybridization (CGH)......Page 947
Further Reading......Page 950
36.1.1 Recombination......Page 951
36.1.2 Genetic Markers......Page 953
36.1.3 Linkage Analysis - the Generation of Genetic Maps......Page 955
36.1.4 Genetic Map of the Human Genome......Page 957
36.2.1 Restriction Mapping of Whole Genomes......Page 958
36.2.2 Mapping of Recombinant Clones......Page 960
36.2.3 Generation of a Physical Map......Page 961
36.2.4 Identification and Isolation of Genes......Page 963
36.2.5 Transcription Maps of the Human Genome......Page 965
36.3 Integration of Genome Maps......Page 966
Further Reading......Page 968
Chapter 37: DNA-Microarray Technology......Page 971
37.1.1 Transcriptome Analysis......Page 972
37.1.3 RNA Structure and Functionality......Page 973
37.2.2 Methylation Studies......Page 974
37.2.3 DNA Sequencing......Page 975
37.2.5 Protein-DNA Interactions......Page 977
37.3.1 DNA Synthesis......Page 978
37.3.3 On-Chip Protein Expression......Page 979
37.4.1 Barcode Identification......Page 980
37.4.2 A Universal Microarray Platform......Page 981
37.5.2 Beyond Nucleic Acids......Page 982
Further Reading......Page 983
Chapter 38: The Use of Oligonucleotides as Tools in Cell Biology......Page 985
38.1.1 Mechanisms of Antisense Oligonucleotides......Page 986
38.1.2 Triplex-Forming Oligonucleotides......Page 987
38.1.3 Modifications of Oligonucleotides to Decrease their Susceptibility to Nucleases......Page 988
38.1.5 Antisense Oligonucleotides as Therapeutics......Page 990
38.2.1 Discovery and Classification of Ribozymes......Page 991
38.2.2 Use of Ribozymes......Page 992
38.3.1 Basics of RNA Interference......Page 993
38.3.2 RNA Interference Mediated by Expression Vectors......Page 994
38.3.3 Uses of RNA Interference......Page 995
38.3.4 microRNAs......Page 996
38.4.1 Selection of Aptamers......Page 997
38.4.2 Uses of Aptamers......Page 999
38.5 Genome Editing with CRISPR/Cas9......Page 1000
38.6 Outlook......Page 1001
Further Reading......Page 1002
39.1 General Aspects in Proteome Analysis......Page 1003
39.2 Definition of Starting Conditions and Project Planning......Page 1005
39.3 Sample Preparation for Proteome Analysis......Page 1006
39.4.1 Two-Dimensional-Gel-Based Proteomics......Page 1008
39.4.3 Top-Down Proteomics using Isotope Labels......Page 1012
39.4.5 Concepts in Intact Protein Mass Spectrometry......Page 1013
39.5.2 Bottom-Up Proteomics......Page 1024
39.5.4 Bottom-Up Proteomic Strategies......Page 1026
39.5.5 Peptide Quantification......Page 1027
39.5.6 Data Dependent Analysis (DDA)......Page 1028
39.5.7 Selected Reaction Monitoring......Page 1029
39.5.8 SWATH-MS......Page 1036
39.5.10 Extensions......Page 1038
39.6.1 Stable Isotope Label in Top-Down Proteomics......Page 1039
39.6.2 Stable Isotope Labeling in Bottom-Up Proteomics......Page 1045
Further Reading......Page 1047
Chapter 40: Metabolomics and Peptidomics......Page 1049
40.1 Systems Biology and Metabolomics......Page 1051
40.2 Technological Platforms for Metabolomics......Page 1052
40.3 Metabolomic Profiling......Page 1053
40.4 Peptidomics......Page 1054
40.5 Metabolomics - Knowledge Mining......Page 1055
40.6 Data Mining......Page 1056
Further Reading......Page 1058
41.1 Protein Microarrays......Page 1059
41.1.1 Sensitivity Increase through Miniaturization - Ambient Analyte Assay......Page 1060
41.1.2 From DNA to Protein Microarrays......Page 1061
41.1.3 Application of Protein Microarrays......Page 1063
Further Reading......Page 1065
42.1 Chemical Biology - Innovative Chemical Approaches to Study Biological Phenomena......Page 1067
42.2 Chemical Genetics - Small Organic Molecules for the Modulation of Protein Function......Page 1069
42.2.1 Study of Protein Functions with Small Organic Molecules......Page 1070
42.2.2 Forward and Reverse Chemical Genetics......Page 1072
42.2.3 The Bump-and-Hole Approach of Chemical Genetics......Page 1073
42.2.4 Identification of Kinase Substrates with ASKA Technology......Page 1076
42.2.5 Switching Biological Systems on and off with Small Organic Molecules......Page 1077
42.3.1 Analysis of Lipid-Modified Proteins......Page 1078
42.3.3 Conditional Protein Splicing......Page 1080
Further Reading......Page 1081
43.1 Antibody Based Toponome Analysis using Imaging Cycler Microscopy (ICM)......Page 1083
43.1.1 Concept of the Protein Toponome......Page 1084
43.1.2 Imaging Cycler Robots: Foundation of a Toponome Reading Technology......Page 1085
Acknowledgements......Page 1089
43.2.2 Mass Spectrometric Pixel Images......Page 1090
43.2.3 Achievable Spatial Resolution......Page 1091
43.2.5 Lateral Resolution and Analytical Limit of Detection......Page 1093
43.2.7 Accurate MALDI Mass Spectrometry Imaging......Page 1094
43.2.8 Identification and Characterization of Analytes......Page 1095
Further Reading......Page 1096
Appendix 1: Amino Acids and Posttranslational Modifications......Page 1099
Appendix 2: Symbols and Abbreviations......Page 1101
Appendix 3: Standard Amino Acids (three and one letter code)......Page 1107
Appendix 4: Nucleic Acid Bases......Page 1109
Index......Page 1111
End User License Agreement......Page 1137


Edited by Friedrich Lottspeich and Joachim W. Engels

Bioanalytics


Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology

Editors

Dr. phil. Dr. med. habil. Friedrich Lottspeich
retired from MPI for Biochemistry
Peter-Dörfler-Straße 4a
82131 Stockdorf
Germany

Prof. Dr. Joachim Engels
Goethe Universität, OCCB FB14
Max-von-Laue-Straße 7
60438 Frankfurt
Germany

Cover credit: Background picture - fotolia_science photo. Circles from left to right: 1st circle - fotolia_T-flex; 2nd circle - Adam Design; 3rd circle - fotolia_fibroblasts; 4th circle - picture kindly provided by Dr. Ficner.

All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details, or other items may inadvertently be inaccurate.

Library of Congress Card No.: applied for

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet.

© 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form - by photoprinting, microfilm, or any other means - nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

Print ISBN: 978-3-527-33919-8
ePDF ISBN: 978-3-527-69444-0
ePub ISBN: 978-3-527-69446-4
Mobi ISBN: 978-3-527-69447-1

Cover Design: Adam Design
Typesetting: Thomson Digital
Printing and Binding
Printed on acid-free paper

Table of Contents

Preface

Introduction: Bioanalytics – a Science in its Own Right

Part I Protein Analytics

1 Protein Purification
1.1 Properties of Proteins
1.2 Protein Localization and Purification Strategy
1.3 Homogenization and Cell Disruption
1.4 Precipitation
1.5 Centrifugation
1.5.1 Basic Principles
1.5.2 Centrifugation Techniques
1.6 Removal of Salts and Hydrophilic Contaminants
1.7 Concentration
1.8 Detergents and their Removal
1.8.1 Properties of Detergents
1.8.2 Removal of Detergents
1.9 Sample Preparation for Proteome Analysis
Further Reading

2 Protein Determination
2.1 Quantitative Determination by Staining Tests
2.1.1 Biuret Assay
2.1.2 Lowry Assay
2.1.3 Bicinchoninic Acid Assay (BCA Assay)
2.1.4 Bradford Assay
2.2 Spectroscopic Methods
2.2.1 Measurements in the UV Range
2.2.2 Fluorescence Method
2.3 Radioactive Labeling of Peptides and Proteins
2.3.1 Iodinations
Further Reading

3 Enzyme Activity Testing
3.1 The Driving Force behind Chemical Reactions
3.2 Rate of Chemical Reactions
3.3 Catalysts
3.4 Enzymes as Catalysts
3.5 Rate of Enzyme-Controlled Reactions
3.6 Michaelis–Menten Theory
3.7 Determination of Km and Vmax
3.8 Inhibitors
3.8.1 Competitive Inhibitors
3.8.2 Non-competitive Inhibitors
3.9 Test System Set-up
3.9.1 Analysis of the Physiological Function
3.9.2 Selecting the Substrates
3.9.3 Detection System
3.9.4 Time Dependence
3.9.5 pH Value
3.9.6 Selecting the Buffer Substance and the Ionic Strength
3.9.7 Temperature
3.9.8 Substrate Concentration
3.9.9 Controls
Further Reading

4 Microcalorimetry
4.1 Differential Scanning Calorimetry (DSC)
4.2 Isothermal Titration Calorimetry (ITC)
4.2.1 Ligand Binding to Proteins
4.2.2 Binding of Molecules to Membranes: Insertion and Peripheral Binding
4.3 Pressure Perturbation Calorimetry (PPC)
Further Reading

5 Immunological Techniques
5.1 Antibodies
5.1.1 Antibodies and Immune Defense
5.1.2 Antibodies as Reagents
5.1.3 Properties of Antibodies
5.1.4 Functional Structure of IgG
5.1.5 Antigen Interaction at the Combining Site
5.1.6 Handling of Antibodies
5.2 Antigens
5.3 Antigen–Antibody Reaction
5.3.1 Immunoagglutination
5.3.2 Immunoprecipitation
5.3.3 Immune Binding
5.4 Complement Fixation
5.5 Methods in Cellular Immunology
5.6 Alteration of Biological Functions
5.7 Production of Antibodies
5.7.1 Types of Antibodies
5.7.2 New Antibody Techniques (Antibody Engineering)
5.7.3 Optimized Monoclonal Antibody Constructs with Effector Functions for Therapeutic Application
5.8 Outlook: Future Expansion of the Binding Concepts
Dedication
Further Reading

6 Chemical Modification of Proteins and Protein Complexes
6.1 Chemical Modification of Protein Functional Groups
6.2 Modification as a Means to Introduce Reporter Groups
6.2.1 Investigation with Naturally Occurring Proteins
6.2.2 Investigation of Recombinant and Mutated Proteins
6.3 Protein Crosslinking for the Analysis of Protein Interaction
6.3.1 Bifunctional Reagents
6.3.2 Photoaffinity Labeling
Further Reading

7 Spectroscopy
7.1 Physical Principles and Measuring Techniques
7.1.1 Physical Principles of Optical Spectroscopic Techniques
7.1.2 Interaction of Light with Matter
7.1.3 Absorption Measurement and the Lambert–Beer Law
7.1.4 Photometer
7.1.5 Time-Resolved Spectroscopy
7.2 UV/VIS/NIR Spectroscopy
7.2.1 Basic Principles
7.2.2 Chromoproteins
7.3 Fluorescence Spectroscopy
7.3.1 Basic Principles of Fluorescence Spectroscopy
7.3.2 Fluorescence: Emission and Action Spectra
7.3.3 Fluorescence Studies using Intrinsic and Extrinsic Probes
7.3.4 Green Fluorescent Protein (GFP) as a Unique Fluorescent Probe
7.3.5 Quantum Dots as Fluorescence Labels
7.3.6 Special Fluorescence Techniques: FRAP, FLIM, FCS, TIRF
7.3.7 Förster Resonance Energy Transfer (FRET)
7.3.8 Frequent Mistakes in Fluorescence Spectroscopy: "The Seven Sins of Fluorescence Measurements"
7.4 Infrared Spectroscopy
7.4.1 Basic Principles of IR Spectroscopy
7.4.2 Molecular Vibrations
7.4.3 Technical Aspects of Infrared Spectroscopy
7.4.4 Infrared Spectra of Proteins
7.5 Raman Spectroscopy
7.5.1 Basic Principles of Raman Spectroscopy
7.5.2 Raman Experiments
7.5.3 Resonance Raman Spectroscopy
7.6 Single Molecule Spectroscopy
7.7 Methods using Polarized Light
7.7.1 Linear Dichroism
7.7.2 Optical Rotation Dispersion and Circular Dichroism
Further Reading

8 Light Microscopy Techniques – Imaging
8.1 Steps on the Road to Microscopy – from Simple Lenses to High Resolution Microscopes
8.2 Modern Applications
8.3 Basic Physical Principles
8.4 Detection Methods
8.5 Sample Preparation
8.6 Special Fluorescence Microscopic Analysis
Further Reading

9 Cleavage of Proteins
9.1 Proteolytic Enzymes
9.2 Strategy
9.3 Denaturation of Proteins
9.4 Cleavage of Disulfide Bonds and Alkylation
9.5 Enzymatic Fragmentation
9.5.1 Proteases
9.5.2 Conditions for Proteolysis
9.6 Chemical Fragmentation
9.7 Summary
Further Reading

10 Chromatographic Separation Methods
10.1 Instrumentation
10.2 Fundamental Terms and Concepts in Chromatography
10.3 Biophysical Properties of Peptides and Proteins
10.4 Chromatographic Separation Modes for Peptides and Proteins
10.4.1 High-Performance Size Exclusion Chromatography
10.4.2 High-Performance Reversed-Phase Chromatography (HP-RPC)
10.4.3 High-Performance Normal-Phase Chromatography (HP-NPC)
10.4.4 High-Performance Hydrophilic Interaction Chromatography (HP-HILIC)
10.4.5 High-Performance Aqueous Normal Phase Chromatography (HP-ANPC)
10.4.6 High-Performance Hydrophobic Interaction Chromatography (HP-HIC)
10.4.7 High-Performance Ion Exchange Chromatography (HP-IEX)
10.4.8 High-Performance Affinity Chromatography (HP-AC)
10.5 Method Development from Analytical to Preparative Scale Illustrated for HP-RPC
10.5.1 Development of an Analytical Method
10.5.2 Scaling up to Preparative Chromatography
10.5.3 Fractionation
10.5.4 Analysis of Fractionations
10.6 Multidimensional HPLC
10.6.1 Purification of Peptides and Proteins by MD-HPLC Methods
10.6.2 Fractionation of Complex Peptide and Protein Mixtures by MD-HPLC
10.6.3 Strategies for MD-HPLC Methods
10.6.4 Design of an Effective MD-HPLC Scheme
10.7 Final Remarks
Further Reading

11 Electrophoretic Techniques
11.1 Historical Review
11.2 Theoretical Fundamentals
11.3 Equipment and Procedures of Gel Electrophoreses
11.3.1 Sample Preparation
11.3.2 Gel Media for Electrophoresis
11.3.3 Detection and Quantification of the Separated Proteins
11.3.4 Zone Electrophoresis
11.3.5 Porosity Gradient Gels
11.3.6 Buffer Systems
11.3.7 Disc Electrophoresis
11.3.8 Acidic Native Electrophoresis
11.3.9 SDS Polyacrylamide Gel Electrophoresis
11.3.10 Cationic Detergent Electrophoresis
11.3.11 Blue Native Polyacrylamide Gel Electrophoresis
11.3.12 Isoelectric Focusing
11.4 Preparative Techniques
11.4.1 Electroelution from Gels
11.4.2 Preparative Zone Electrophoresis
11.4.3 Preparative Isoelectric Focusing
11.5 Free Flow Electrophoresis
11.6 High-Resolution Two-Dimensional Electrophoresis
11.6.1 Sample Preparation
11.6.2 Prefractionation
11.6.3 First Dimension: IEF in IPG Strips
11.6.4 Second Dimension: SDS Polyacrylamide Gel Electrophoresis
11.6.5 Detection and Identification of Proteins
11.6.6 Difference Gel Electrophoresis (DIGE)
11.7 Electroblotting
11.7.1 Blot Systems
11.7.2 Transfer Buffers
11.7.3 Blot Membranes
Further Reading

12 Capillary Electrophoresis
12.1 Historical Overview
12.2 Capillary Electrophoresis Setup
12.3 Basic Principles of Capillary Electrophoresis
12.3.1 Sample Injection
12.3.2 The Engine: Electroosmotic Flow (EOF)
12.3.3 Joule Heating
12.3.4 Detection Methods
12.4 Capillary Electrophoresis Methods
12.4.1 Capillary Zone Electrophoresis (CZE)
12.4.2 Affinity Capillary Electrophoresis (ACE)
12.4.3 Micellar Electrokinetic Chromatography (MEKC)
12.4.4 Capillary Electrochromatography (CEC)
12.4.5 Chiral Separations
12.4.6 Capillary Gel Electrophoresis (CGE)
12.4.7 Capillary Isoelectric Focusing (CIEF)
12.4.8 Isotachophoresis (ITP)
12.5 Special Techniques
12.5.1 Sample Concentration
12.5.2 Online Sample Concentration
12.5.3 Fractionation
12.5.4 Microchip Electrophoresis
12.6 Outlook
Further Reading

13 Amino Acid Analysis
13.1 Sample Preparation
13.1.1 Acidic Hydrolysis
13.1.2 Alkaline Hydrolysis
13.1.3 Enzymatic Hydrolysis
13.2 Free Amino Acids
13.3 Liquid Chromatography with Optical Detection Systems
13.3.1 Post-Column Derivatization
13.3.2 Pre-column Derivatization
13.4 Amino Acid Analysis using Mass Spectrometry
13.5 Summary
Further Reading

14 Protein Sequence Analysis
14.1 N-Terminal Sequence Analysis: The Edman Degradation
14.1.1 Reactions of the Edman Degradation
14.1.2 Identification of the Amino Acids
14.1.3 Quality of Edman Degradation: the Repetitive Yield
14.1.4 Instrumentation
14.1.5 Problems of Amino Acid Sequence Analysis
14.1.6 State of the Art
14.2 C-Terminal Sequence Analysis
14.2.1 Chemical Degradation Methods
14.2.2 Peptide Quantities and Quality of the Chemical Degradation
14.2.3 Degradation of Polypeptides with Carboxypeptidases
Further Reading

15 Mass Spectrometry
15.1 Ionization Methods
15.1.1 Matrix Assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS)
15.1.2 Electrospray Ionization (ESI)
15.2 Mass Analyzer
15.2.1 Time-of-Flight Analyzers (TOF)
15.2.2 Quadrupole Analyzer
15.2.3 Electric Ion Traps
15.2.4 Magnetic Ion Trap
15.2.5 Orbital Ion Trap
15.2.6 Hybrid Instruments
15.3 Ion Detectors
15.3.1 Secondary Electron Multiplier (SEV)
15.3.2 Faraday Cup
15.4 Fragmentation Techniques
15.4.1 Collision Induced Dissociation (CID)
15.4.2 Prompt and Metastable Decay (ISD, PSD)
15.4.3 Photon-Induced Dissociation (PID, IRMPD)
15.4.4 Generation of Free Radicals (ECD, HECD, ETD)
15.5 Mass Determination
15.5.1 Calculation of Mass
15.5.2 Influence of Isotopy
15.5.3 Calibration
15.5.4 Determination of the Number of Charges
15.5.5 Signal Processing and Analysis
15.5.6 Derivation of the Mass
15.5.7 Problems
15.6 Identification, Detection, and Structure Elucidation
15.6.1 Identification
15.6.2 Verification
15.6.3 Structure Elucidation
15.7 LC-MS and LC-MS/MS
15.7.1 LC-MS
15.7.2 LC-MS/MS
15.7.3 Ion Mobility Spectrometry (IMS)
15.8 Quantification
Further Reading

16 Protein–Protein Interactions
16.1 The Two-Hybrid System
16.1.1 Principle of Two-Hybrid Systems
16.1.2 Elements of the Two-Hybrid System
16.1.3 Construction of Bait and Prey Proteins
16.1.4 Which Bait Proteins can be used in a Y2H Screen?
16.1.5 AD Fusion Proteins and cDNA Libraries
16.1.6 Carrying out a Y2H Screen
16.1.7 Other Modifications and Extensions of the Two-Hybrid Technology
16.1.8 Biochemical and Functional Analysis of Interactions
16.2 TAP-Tagging and Purification of Protein Complexes
16.3 Analyzing Interactions In Vitro: GST-Pulldown
16.4 Co-immunoprecipitation
16.5 Far-Western
16.6 Surface Plasmon Resonance Spectroscopy
16.7 Fluorescence Resonance Energy Transfer (FRET)
16.7.1 Introduction
16.7.2 Key Physical Principles of FRET
16.7.3 Methods of FRET Measurements
16.7.4 Fluorescent Probes for FRET
16.7.5 Alternative Tools for Probing Protein–Protein Interactions: LINC and STET
16.8 Analytical Ultracentrifugation
16.8.1 Principles of Instrumentation
16.8.2 Basics of Centrifugation
16.8.3 Sedimentation Velocity Experiments
16.8.4 Sedimentation–Diffusion Equilibrium Experiments
Further Reading

17 Biosensors
17.1 Dry Chemistry: Test Strips for Detecting and Monitoring Diabetes
17.2 Biosensors
17.2.1 Concept of Biosensors
17.2.2 Construction and Function of Biosensors
17.2.3 Cell Sensors
17.2.4 Immunosensors
17.3 Biomimetic Sensors
17.4 From Glucose Enzyme Electrodes to Electronic DNA Biochips
17.5 Resume: Biosensor or not Biosensor is no Longer the Question
Further Reading

Part II 3D Structure Determination

18 Magnetic Resonance Spectroscopy of Biomolecules
18.1 NMR Spectroscopy of Biomolecules
18.1.1 Theory of NMR Spectroscopy
18.1.2 One-Dimensional NMR Spectroscopy
18.1.3 Two-Dimensional NMR Spectroscopy
18.1.4 Three-Dimensional NMR Spectroscopy
18.1.5 Resonance Assignment
18.1.6 Protein Structure Determination
18.1.7 Protein Structures and more – an Overview
18.2 EPR Spectroscopy of Biological Systems
18.2.1 Basics of EPR Spectroscopy
18.2.2 cw-EPR Spectroscopy
18.2.3 g-Value
18.2.4 Electron Spin Nuclear Spin Coupling (Hyperfine Coupling)
18.2.5 g and Hyperfine Anisotropy
18.2.6 Electron Spin–Electron Spin Coupling
18.2.7 Pulsed EPR Experiments
18.2.8 Further Examples of EPR Applications
18.2.9 General Remarks on the Significance of EPR Spectra
18.2.10 Comparison EPR/NMR
Acknowledgements
Further Reading

19 Electron Microscopy
19.1 Transmission Electron Microscopy – Instrumentation
19.2 Approaches to Preparation
19.2.1 Native Samples in Ice
19.2.2 Negative Staining
19.2.3 Metal Coating by Evaporation
19.2.4 Labeling of Proteins
19.3 Imaging Process in the Electron Microscope
19.3.1 Resolution of a Transmission Electron Microscope
19.3.2 Interactions of the Electron Beam with the Object
19.3.3 Phase Contrast in Transmission Electron Microscopy
19.3.4 Electron Microscopy with a Phase Plate
19.3.5 Imaging Procedure for Frozen-Hydrated Specimens
19.3.6 Recording Images – Cameras and the Impact of Electrons
19.4 Image Analysis and Processing of Electron Micrographs
19.4.1 Pixel Size
19.4.2 Fourier Transformation
19.4.3 Analysis of the Contrast Transfer Function and Object Features
19.4.4 Improving the Signal-to-Noise Ratio
19.4.5 Principal Component Analysis and Classification
19.5 Three-Dimensional Electron Microscopy
19.5.1 Three-Dimensional Reconstruction of Single Particles
19.5.2 Three-Dimensional Reconstruction of Regularly Arrayed Macromolecular Complexes
19.5.3 Electron Tomography of Individual Objects
19.6 Analysis of Complex 3D Data Sets
19.6.1 Hybrid Approach: Combination of EM and X-Ray Data
19.6.2 Segmenting Tomograms and Visualization
19.6.3 Identifying Protein Complexes in Cellular Tomograms
19.7 Perspectives of Electron Microscopy
Further Reading

20 Atomic Force Microscopy
20.1 Introduction
20.2 Principle of the Atomic Force Microscope
20.3 Interaction between Tip and Sample
20.4 Preparation Procedures
20.5 Mapping Biological Macromolecules
20.6 Force Spectroscopy of Single Molecules
20.7 Detection of Functional States and Interactions of Individual Proteins
Further Reading

21 X-Ray Structure Analysis
21.1 X-Ray Crystallography
21.1.1 Crystallization
21.1.2 Crystals and X-Ray Diffraction
21.1.3 The Phase Problem
21.1.4 Model Building and Structure Refinement
21.2 Small Angle X-Ray Scattering (SAXS)
21.2.1 Machine Setup
21.2.2 Theory
21.2.3 Data Analysis
21.3 X-Ray Free Electron LASER (XFEL)
21.3.1 Machine Setup and Theory
Acknowledgement
Further Reading

Part III Peptides, Carbohydrates, and Lipids

22 Analytics of Synthetic Peptides
22.1 Concept of Peptide Synthesis
22.2 Purity of Synthetic Peptides
22.3 Characterization and Identity of Synthetic Peptides
22.4 Characterization of the Structure of Synthetic Peptides
22.5 Analytics of Peptide Libraries
Further Reading

23 Carbohydrate Analysis
23.1 General Stereochemical Basics
23.1.1 The Series of D-Sugars
23.1.2 Stereochemistry of D-Glucose
23.1.3 Important Monosaccharide Building Blocks
23.1.4 The Series of L-Sugars
23.1.5 The Glycosidic Bond
23.2 Protein Glycosylation
23.2.1 Structure of the N-Glycans
23.2.2 Structure of the O-Glycans
23.3 Analysis of Protein Glycosylation
23.3.1 Analysis on the Basis of the Intact Glycoprotein
23.3.2 Mass Spectrometric Analysis on the Basis of Glycopeptides
23.3.3 Release and Isolation of the N-Glycan Pool
23.3.4 Analysis of Individual N-Glycans
23.4 Genome, Proteome, Glycome
23.5 Final Considerations
Further Reading

24 Lipid Analysis
24.1 Structure and Classification of Lipids
24.2 Extraction of Lipids from Biological Sources
24.2.1 Liquid Phase Extraction
24.2.2 Solid Phase Extraction
24.3 Methods for Lipid Analysis
24.3.1 Chromatographic Methods
24.3.2 Mass Spectrometry
24.3.3 Immunoassays
24.3.4 Further Methods in Lipid Analysis
24.3.5 Combining Different Analytical Systems
24.4 Analysis of Selected Lipid Classes
24.4.1 Whole Lipid Extracts
24.4.2 Fatty Acids
24.4.3 Nonpolar Neutral Lipids
24.4.4 Polar Ester Lipids
24.4.5 Lipid Hormones and Intracellular Signaling Molecules
24.5 Lipid Vitamins
24.6 Lipidome Analysis
24.7 Perspectives
Further Reading

25 Analysis of Post-translational Modifications: Phosphorylation and Acetylation of Proteins
25.1 Functional Relevance of Phosphorylation and Acetylation
25.1.1 Phosphorylation
25.1.2 Acetylation
25.2 Strategies for the Analysis of Phosphorylated and Acetylated Proteins and Peptides
25.3 Separation and Enrichment of Phosphorylated and Acetylated Proteins and Peptides
25.4 Detection of Phosphorylated and Acetylated Proteins and Peptides
25.4.1 Detection by Enzymatic, Radioactive, Immunochemical, and Fluorescence Based Methods
25.4.2 Detection of Phosphorylated and Acetylated Proteins by Mass Spectrometry
25.5 Localization and Identification of Post-translationally Modified Amino Acids
25.5.1 Localization of Phosphorylated and Acetylated Amino Acids by Edman Degradation
25.5.2 Localization of Phosphorylated and Acetylated Amino Acids by Tandem Mass Spectrometry
25.6 Quantitative Analysis of Post-translational Modifications
25.7 Future of Post-translational Modification Analysis
Further Reading

Part IV Nucleic Acid Analytics

26 Isolation and Purification of Nucleic Acids
26.1 Purification and Determination of Nucleic Acid Concentration
26.1.1 Phenolic Purification of Nucleic Acids
26.1.2 Gel Filtration
26.1.3 Precipitation of Nucleic Acids with Ethanol
26.1.4 Determination of the Nucleic Acid Concentration
26.2 Isolation of Genomic DNA
26.3 Isolation of Low Molecular Weight DNA
26.3.1 Isolation of Plasmid DNA from Bacteria
26.3.2 Isolation of Eukaryotic Low Molecular Weight DNA
26.4 Isolation of Viral DNA
26.4.1 Isolation of Phage DNA
26.4.2 Isolation of Eukaryotic Viral DNA
26.5 Isolation of Single-Stranded DNA
26.5.1 Isolation of M13 Phage DNA
26.5.2 Separation of Single- and Double-Stranded DNA
26.6 Isolation of RNA
26.6.1 Isolation of Cytoplasmic RNA
26.6.2 Isolation of Poly(A) RNA
26.6.3 Isolation of Small RNA
26.7 Isolation of Nucleic Acids using Magnetic Particles
26.8 Lab-on-a-chip
Further Reading

27 Analysis of Nucleic Acids
27.1 Restriction Analysis
27.1.1 Principle of Restriction Analyses
27.1.2 Historical Overview
27.1.3 Restriction Enzymes
27.1.4 In Vitro Restriction and Applications
27.2 Electrophoresis
27.2.1 Gel Electrophoresis of DNA
27.2.2 Gel Electrophoresis of RNA
27.2.3 Pulsed-Field Gel Electrophoresis (PFGE)
27.2.4 Two-Dimensional Gel Electrophoresis
27.2.5 Capillary Gel Electrophoresis
27.3 Staining Methods
27.3.1 Fluorescent Dyes
27.3.2 Silver Staining
27.4 Nucleic Acid Blotting
27.4.1 Nucleic Acid Blotting Methods
27.4.2 Choice of Membrane
27.4.3 Southern Blotting
27.4.4 Northern Blotting
27.4.5 Dot- and Slot-Blotting
27.4.6 Colony and Plaque Hybridization
27.5 Isolation of Nucleic Acid Fragments
27.5.1 Purification using Glass Beads
27.5.2 Purification using Gel Filtration or Reversed Phase
27.5.3 Purification using Electroelution
27.5.4 Other Methods
27.6 LC-MS of Oligonucleotides
27.6.1 Principles of the Synthesis of Oligonucleotides
27.6.2 Investigation of the Purity and Characterization of Oligonucleotides
27.6.3 Mass Spectrometric Investigation of Oligonucleotides
27.6.4 IP-RP-HPLC-MS Investigation of a Phosphorothioate Oligonucleotide
Further Reading

28 Techniques for the Hybridization and Detection of Nucleic Acids
28.1 Basic Principles of Hybridization
28.1.1 Principle and Practice of Hybridization
28.1.2 Specificity of the Hybridization and Stringency
28.1.3 Hybridization Methods
28.2 Probes for Nucleic Acid Analysis
28.2.1 DNA Probes
28.2.2 RNA Probes
28.2.3 PNA Probes
28.2.4 LNA Probes
28.3 Methods of Labeling
28.3.1 Labeling Positions
28.3.2 Enzymatic Labeling
28.3.3 Photochemical Labeling Reactions
28.3.4 Chemical Labeling
28.4 Detection Systems
28.4.1 Staining Methods
28.4.2 Radioactive Systems
28.4.3 Non-radioactive Systems
28.5 Amplification Systems
28.5.1 Target Amplification
28.5.2 Target-Specific Signal Amplification
28.5.3 Signal Amplification
Further Reading

29 Polymerase Chain Reaction
29.1 Possibilities of PCR
29.2 Basics
29.2.1 Instruments
29.2.2 Amplification of DNA
29.2.3 Amplification of RNA (RT-PCR)
29.2.4 Optimizing the Reaction
29.2.5 Quantitative PCR
29.3 Special PCR Techniques
29.3.1 Nested PCR
29.3.2 Asymmetric PCR
29.3.3 Use of Degenerate Primers
29.3.4 Multiplex PCR
29.3.5 Cycle Sequencing
29.3.6 In Vitro Mutagenesis
29.3.7 Homogeneous PCR Detection Procedures
29.3.8 Quantitative Amplification Procedures
29.3.9 In Situ PCR
29.3.10 Other Approaches
29.4 Contamination Problems
29.4.1 Avoiding Contamination
29.4.2 Decontamination
29.5 Applications
29.5.1 Detection of Infectious Diseases
29.5.2 Detection of Genetic Defects
29.5.3 The Human Genome Project
29.6 Alternative Amplification Procedures
29.6.1 Nucleic Acid Sequence-Based Amplification (NASBA)
29.6.2 Strand Displacement Amplification (SDA)
29.6.3 Helicase-Dependent Amplification (HDA)
29.6.4 Ligase Chain Reaction (LCR)
29.6.5 Qβ Amplification
29.6.6 Branched DNA Amplification (bDNA)
29.7 Prospects
Further Reading

30 DNA Sequencing
30.1 Gel-Supported DNA Sequencing Methods
30.1.1 Sequencing according to Sanger: The Dideoxy Method
30.1.2 Labeling Techniques and Methods of Verification
30.1.3 Chemical Cleavage according to Maxam and Gilbert
30.2 Gel-Free DNA Sequencing Methods – The Next Generation
30.2.1 Sequencing by Synthesis
30.2.2 Single Molecule Sequencing
Further Reading

31 Analysis of Epigenetic Modifications
31.1 Overview of the Methods to Detect DNA Modifications
31.2 Methylation Analysis with the Bisulfite Method
31.2.1 Amplification and Sequencing of Bisulfite-Treated DNA
31.2.2 Restriction Analysis after Bisulfite PCR
31.2.3 Methylation Specific PCR
31.3 DNA Analysis with Methylation Specific Restriction Enzymes
31.4 Methylation Analysis by Methylcytosine-Binding Proteins
31.5 Methylation Analysis by Methylcytosine-Specific Antibodies
31.6 Methylation Analysis by DNA Hydrolysis and Nearest Neighbor Assays
31.7 Analysis of Epigenetic Modifications of Chromatin
31.8 Chromosome Interaction Analyses
31.9 Outlook
Further Reading

32 Protein–Nucleic Acid Interactions
32.1 DNA–Protein Interactions
32.1.1 Basic Features for DNA–Protein Recognition: Double-Helical Structures
32.1.2 DNA Curvature
32.1.3 DNA Topology
32.2 DNA-Binding Motifs
32.3 Special Analytical Methods
32.3.1 Filter Binding
32.3.2 Gel Electrophoresis
32.3.3 Determination of Dissociation Constants
32.3.4 Analysis of DNA–Protein Complex Dynamics
32.4 DNA Footprint Analysis
32.4.1 DNA Labeling
32.4.2 Primer Extension Reaction for DNA Analysis
32.4.3 Hydrolysis Methods
32.4.4 Chemical Reagents for the Modification of DNA–Protein Complexes
32.4.5 Interference Conditions
32.4.6 Chemical Nucleases
32.4.7 Genome-Wide DNA–Protein Interactions
32.5 Physical Analysis Methods
32.5.1 Fluorescence Methods
32.5.2 Fluorophores and Labeling Procedures
32.5.3 Fluorescence Resonance Energy Transfer (FRET)
32.5.4 Molecular Beacons
32.5.5 Surface Plasmon Resonance (SPR)
32.5.6 Scanning Force Microscopy (SFM)
32.5.7 Optical Tweezers
32.5.8 Fluorescence Correlation Spectroscopy (FCS)
32.6 RNA–Protein Interactions
32.6.1 Functional Diversity of RNA
32.6.2 RNA Secondary Structure Parameters and Unusual Base Pairs
32.6.3 Dynamics of RNA–Protein Interactions
32.7 Characteristic RNA-Binding Motifs
32.8 Special Methods for the Analysis of RNA–Protein Complexes
32.8.1 Limited Enzymatic Hydrolyses
32.8.2 Labeling Methods
32.8.3 Primer Extension Analysis of RNA
32.8.4 Customary RNases
32.8.5 Chemical Modification of RNA–Protein Complexes
32.8.6 Chemical Crosslinking
32.8.7 Incorporation of Photoreactive Nucleotides
32.8.8 Genome-Wide Identification of Transcription Start Sites (TSS)
32.9 Genetic Methods
32.9.1 Tri-hybrid Method
32.9.2 Aptamers and the Selex Procedure
32.9.3 Directed Mutations within Binding Domains
Further Reading

Part V Functional and Systems Analytics

33 Sequence Data Analysis
33.1 Sequence Analysis and Bioinformatics
33.2 Sequence: An Abstraction for Biomolecules
33.3 Internet Databases and Services
33.3.1 Sequence Retrieval from Public Databases
33.3.2 Data Contents and File Format
33.3.3 Nucleotide Sequence Management in the Laboratory
33.4 Sequence Analysis on the Web
33.4.1 EMBOSS
33.5 Sequence Composition
33.6 Sequence Patterns
33.6.1 Transcription Factor Binding Sites
33.6.2 Identification of Coding Regions
33.6.3 Protein Localization
33.7 Homology
33.7.1 Identity, Similarity, Homology
33.7.2 Optimal Sequence Alignment
33.7.3 Alignment for Fast Database Searches: BLAST
33.7.4 Profile-Based Sensitive Database Search: PSI-BLAST
33.7.5 Homology Threshold
33.8 Multiple Alignment and Consensus Sequences
33.9 Structure Prediction
33.10 Outlook

34 Analysis of Promoter Strength and Nascent RNA Synthesis
34.1 Methods for the Analysis of RNA Transcripts
34.1.1 Overview
34.1.2 Nuclease S1 Analysis of RNA
34.1.3 Ribonuclease-Protection Assay (RPA)
34.1.4 Primer Extension Assay
34.1.5 Northern Blot and Dot- and Slot-Blot
34.1.6 Reverse Transcription Polymerase Chain Reaction (RT-PCR and RT-qPCR)
34.2 Analysis of RNA Synthesis In Vivo
34.2.1 Nuclear-run-on Assay
34.2.2 Labeling of Nascent RNA with 5-Fluoro-uridine (FUrd)
34.3 In Vitro Transcription in Cell-Free Extracts
34.3.1 Components of an In Vitro Transcription Assay
34.3.2 Generation of Transcription-Competent Cell Extracts and Protein Fractions
34.3.3 Template DNA and Detection of In Vitro Transcripts
34.4 In Vivo Analysis of Promoter Activity in Mammalian Cells
34.4.1 Vectors for Analysis of Gene-Regulatory cis-Elements
34.4.2 Transfer of DNA into Mammalian Cells
34.4.3 Analysis of Reporter Gene Expression
Further Reading

35 Fluorescent In Situ Hybridization in Molecular Cytogenetics
35.1 Methods of Fluorescent DNA Hybridization
35.1.1 Labeling Strategy
35.1.2 DNA Probes
35.1.3 Labeling of DNA Probes
35.1.4 In Situ Hybridization
35.1.5 Evaluation of Fluorescent Hybridization Signals
35.2 Application: FISH and CGH
35.2.1 FISH Analysis of Genomic DNA
35.2.2 Comparative Genomic Hybridization (CGH)
Further Reading

36 Physical and Genetic Mapping of Genomes
36.1 Genetic Mapping: Localization of Genetic Markers within the Genome
36.1.1 Recombination
36.1.2 Genetic Markers
36.1.3 Linkage Analysis – the Generation of Genetic Maps
36.1.4 Genetic Map of the Human Genome
36.1.5 Genetic Mapping of Disease Genes
36.2 Physical Mapping
36.2.1 Restriction Mapping of Whole Genomes
36.2.2 Mapping of Recombinant Clones
36.2.3 Generation of a Physical Map
36.2.4 Identification and Isolation of Genes
36.2.5 Transcription Maps of the Human Genome
36.2.6 Genes and Hereditary Disease – Search for Mutations
36.3 Integration of Genome Maps
36.4 The Human Genome
Further Reading

37 DNA-Microarray Technology
37.1 RNA Analyses
37.1.1 Transcriptome Analysis
37.1.2 RNA Splicing
37.1.3 RNA Structure and Functionality
37.2 DNA Analyses
37.2.1 Genotyping
37.2.2 Methylation Studies
37.2.3 DNA Sequencing
37.2.4 Comparative Genomic Hybridization (CGH)
37.2.5 Protein–DNA Interactions
37.3 Molecule Synthesis
37.3.1 DNA Synthesis
37.3.2 RNA Production
37.3.3 On-Chip Protein Expression
37.4 Other Approaches
37.4.1 Barcode Identification
37.4.2 A Universal Microarray Platform
37.5 New Avenues
37.5.1 Structural Analyses
37.5.2 Beyond Nucleic Acids
Further Reading

38 The Use of Oligonucleotides as Tools in Cell Biology
38.1 Antisense Oligonucleotides
38.1.1 Mechanisms of Antisense Oligonucleotides
38.1.2 Triplex-Forming Oligonucleotides
38.1.3 Modifications of Oligonucleotides to Decrease their Susceptibility to Nucleases
38.1.4 Use of Antisense Oligonucleotides in Cell Culture and in Animal Models
38.1.5 Antisense Oligonucleotides as Therapeutics
38.2 Ribozymes
38.2.1 Discovery and Classification of Ribozymes
38.2.2 Use of Ribozymes
38.3 RNA Interference and MicroRNAs
38.3.1 Basics of RNA Interference
38.3.2 RNA Interference Mediated by Expression Vectors
38.3.3 Uses of RNA Interference
38.3.4 microRNAs
38.4 Aptamers: High-Affinity RNA- and DNA-Oligonucleotides
38.4.1 Selection of Aptamers
38.4.2 Uses of Aptamers
38.5 Genome Editing with CRISPR/Cas9
38.6 Outlook
Further Reading

39 Proteome Analysis
39.1 General Aspects in Proteome Analysis
39.2 Definition of Starting Conditions and Project Planning
39.3 Sample Preparation for Proteome Analysis
39.4 Protein Based Quantitative Proteome Analysis (Top-Down Proteomics)
39.4.1 Two-Dimensional-Gel-Based Proteomics
39.4.2 Two-Dimensional Differential Gel Electrophoresis (2D DIGE)
39.4.3 Top-Down Proteomics using Isotope Labels
39.4.4 Top-Down Proteomics using Intact Protein Mass Spectrometry
39.4.5 Concepts in Intact Protein Mass Spectrometry
39.5 Peptide Based Quantitative Proteome Analysis (Bottom-Up Proteomics)
39.5.1 Introduction
39.5.2 Bottom-Up Proteomics
39.5.3 Complexity of the Proteome
39.5.4 Bottom-Up Proteomic Strategies
39.5.5 Peptide Quantification
39.5.6 Data Dependent Analysis (DDA)
39.5.7 Selected Reaction Monitoring
39.5.8 SWATH-MS
39.5.9 Summary
39.5.10 Extensions
39.6 Stable Isotope Labeling in Quantitative Proteomics
39.6.1 Stable Isotope Label in Top-Down Proteomics
39.6.2 Stable Isotope Labeling in Bottom-Up Proteomics
Further Reading

40 Metabolomics and Peptidomics
40.1 Systems Biology and Metabolomics
40.2 Technological Platforms for Metabolomics
40.3 Metabolomic Profiling
40.4 Peptidomics
40.5 Metabolomics – Knowledge Mining
40.6 Data Mining
40.7 Fields of Application
40.8 Outlook
Further Reading

41 Interactomics – Systematic Protein–Protein Interactions
41.1 Protein Microarrays
41.1.1 Sensitivity Increase through Miniaturization – Ambient Analyte Assay
41.1.2 From DNA to Protein Microarrays
41.1.3 Application of Protein Microarrays
Further Reading

42 Chemical Biology
42.1 Chemical Biology – Innovative Chemical Approaches to Study Biological Phenomena
42.2 Chemical Genetics – Small Organic Molecules for the Modulation of Protein Function
42.2.1 Study of Protein Functions with Small Organic Molecules
42.2.2 Forward and Reverse Chemical Genetics
42.2.3 The Bump-and-Hole Approach of Chemical Genetics
42.2.4 Identification of Kinase Substrates with ASKA Technology
42.2.5 Switching Biological Systems on and off with Small Organic Molecules
42.3 Expressed Protein Ligation – Symbiosis of Chemistry and Biology for the Study of Protein Functions
42.3.1 Analysis of Lipid-Modified Proteins
42.3.2 Analysis of Phosphorylated Proteins
42.3.3 Conditional Protein Splicing
Further Reading

43 Toponome Analysis
43.1 "Life is Spatial" – Antibody Based Toponome Analysis using Imaging Cycler Microscopy (ICM)
43.1.1 Concept of the Protein Toponome
43.1.2 Imaging Cycler Robots: Fundament of a Toponome Reading Technology
43.1.3 Summary and Outlook
Acknowledgements
43.2 Mass Spectrometry Imaging
43.2.1 Analytical Microprobes
43.2.2 Mass Spectrometric Pixel Images
43.2.3 Achievable Spatial Resolution
43.2.4 SIMS, ME-SIMS, and Cluster SIMS Imaging: Enhancing the Mass Range
43.2.5 Lateral Resolution and Analytical Limit of Detection
43.2.6 Coarse Screening by MS Imaging
43.2.7 Accurate MALDI Mass Spectrometry Imaging
43.2.8 Identification and Characterization of Analytes
Further Reading

1067 1068 1068 1069 1070

1057 1057

Appendix 1: Amino Acids and Posttranslational Modifications

1073

1057 1058

Appendix 2: Symbols and Abbreviations

1075

Appendix 3: Standard Amino Acids (three and one letter code)

1081

Appendix 4: Nucleic Acid Bases

1083

Index

1085

1051

1059 1063 1063 1064 1064 1064

1065 1067

Preface

This is a book about methods. You may ask: Why do we need a dedicated book about methods, and why should I buy it? We can offer at least two good answers.

The first answer is of a theoretical nature: the method determines the quality of the scientific findings gained with it. Only by understanding a method, its strengths and, more importantly, its weaknesses, is it possible to estimate the general applicability of an observation or hypothesis. The development or improvement of a method is therefore a means to expand and improve the "tentative truth" generated by experimental science. Great value has been placed on describing the material critically and with clarity, to enable the reader to engage with it and gain a thorough understanding. This is, in our opinion, the most important reason why methods must be offered for classroom study. A deep and broad knowledge of methods is just as important for ongoing experimental work as it is for understanding past experiments.

The second answer is the intent – hopefully successful – of this book to make getting to know and understand these methods clear and straightforward, so as to make it an irreplaceable tool for both students and teachers. Our intent results from our conviction, backed by our experience, that today every individual, whether student, teacher, or scientist, is hopelessly overwhelmed by the large number of different techniques currently in use in the biological sciences. At the same time, using these techniques is imperative. We proudly undertook this intellectual enterprise to describe these techniques as completely as possible and in an up-to-date manner. To the best of our knowledge, no English-language textbook exists that is dedicated to these same goals with the same level of coverage.
One might wonder why the most apparent reason to publish this book has not been mentioned: namely, using this book to learn, or hope to learn, methods that are needed directly for ongoing experimental work. We wish to make two things clear. First, this is not a "cook book". After digesting a chapter the reader will not be able to go to his or her laboratory bench and apply what has just been read like a recipe – for that, it will first be necessary to work through the literature relevant to the topic covered. The reader should, however, be in a position – at least this is our goal and wish – to optimize his or her approach through the overview and insights acquired. Second, this book does not see itself as competition for existing laboratory manuals for diverse techniques, such as protein determination or PCR. Rather, the intent is to use carefully coordinated and complete descriptions of the methods, with frequent cross-referencing of other chapters, either in the text or in a flanking box, to illustrate the connections between apparently unrelated techniques and to show their mutual dependencies. We believe that the reader will profit from these lessons by gaining a sense of orientation and will understand the relationships between different techniques better, or possibly appreciate them for the first time. We do not wish to conceal the fact that for us, the editors, certain methodological relationships only became clear in the course of working through some of the manuscripts. As such, this book intends to provide coverage at a higher level than any single method manual or a simple collection of methods could.

What is the actual content of this book? The book is titled Bioanalytics, which indicates that it is about analytical methods in the biological sciences. This must be qualified. What are the biological sciences? Biochemistry, or also molecular genetics, cell and developmental biology, or even medicine? In any case, molecular biology would be included. The matter gets more complicated when one considers that modern medicine or cell biology is unimaginable without molecular biology. This book cannot satisfy all the needs of these sciences. Moreover, not all analytical methods are contained within it, only those that involve biological macromolecules and their modifications.
Macromolecules are most often proteins, but also include carbohydrates, lipids, and nucleic acids such as DNA and RNA. Special methods for the analysis of small-molecule metabolites are likewise not included. On occasion, we have crossed over the boundaries we have set for ourselves. For example, methods for the preparation of DNA and RNA are presented, simply because they are so closely and necessarily associated with the subsequent analytical techniques. In addition, many techniques, such as electrophoresis or chromatography, can be used at both analytical and preparative scales. For other techniques it is not easy to distinguish between preparation and analysis if one does not wish to follow the traditional division between the two based solely on the amount of material involved. Is the identification of interaction partners using the two-hybrid system an analytical method, when the final step is based on the labor-intensive construction of the corresponding clones – that is to say, on a method that, at first, does not have anything to do with investigating the interaction? Similar is the case of site-specific mutation of genes for the investigation of gene function, which first requires the construction (and not the analysis) of the mutated sequences in vitro. On the other hand, we intentionally omitted the description of a few techniques that are clearly preparative. The synthesis of oligonucleotides – a clearly preparative technique – and the cloning of DNA were omitted. The latter is, despite being a requirement or goal of a large number of analytical methods, not an analytical method itself. In this case our decision was easy, since there are already numerous good introductions and manuals about cloning DNA. In summary, the book describes the analytical methods of protein and nucleic acid (bio)chemistry, molecular biology, and, to a certain degree, modern cytogenetics. In this context, "molecular biology" means those parts of molecular genetics and biochemistry that involve the structure and function of nucleic acids. Methods of (classical) genetics, as well as traditional cell biology, are therefore rarely, if ever, included. We wish to emphasize that chapters that directly relate to the function of proteins and nucleic acids have been collected into a special section of the book, the "Systematic Analysis of Function". We have gone along with the shift in paradigm from traditional bioanalytics to holistic analysis approaches.
In this section many topics are addressed – even though they are sometimes not entirely mature – which are on the cutting edge of science. We are aware of the fact that this area is subject to rapid change, and a few aspects could, in the near future, perhaps appear too optimistic or pessimistic. However, we believe that discussion of the most modern techniques and strategies at this point in time covers fascinating aspects and will hopefully prove inspiring. The increasing availability of DNA and protein sequences of many organisms is, on the one hand, the critical foundation for this systematic analysis of function and, on the other hand, makes high-throughput analysis and the analysis of the resulting data increasingly important. Information gained from the genome, proteome, and metabolome is compared with in silico analysis, which factors in the localization of and interactions between biomolecules and unites everything into complex networks. The long-term goal of completely understanding the system can surely only be reached by the incorporation of further areas of expertise that are not yet an accepted component of bioanalytics. Bioanalysts must become, and are becoming, a kind of systems biologist: more interdisciplinary, and more successful through close cooperation with experts in the fields of informatics, systems theory, biotechnology, and cell biology.

Who is this book addressed to? What has already been said provides a hint: primarily biologists, chemists, pharmacists, physicians, and biophysicists. For some (biologists, chemists) the book will be interesting because it describes methods of their own discipline. For others (e.g., pharmacists, physicians, and biophysicists) the book is relevant because they can find the background and fundamentals for much of the knowledge they encounter in their own discipline. Beyond these groups, this book is dedicated to interested readers who would like to know more about the subject matter. The material covered presumes that the user has taken at least an introductory course in the fundamentals of biochemistry or molecular genetics/gene technology, ideally both, or is in the process of doing so. We can imagine that this book would be an ideal supplement to such a course. It can and should especially be consulted when involved in experimental activities. This book is intended to be of equal value to students, teachers, and workers in these fields of science.

The organization of the material proved to be one of the most difficult aspects of putting this book together. It is almost impossible to treat the techniques used in such complex fields accurately in the two dimensions paper offers without simultaneously compromising the didactic intentions of the book. We had a choice of two approaches: a more theoretical and intellectually stringent approach, or a more practically oriented one. The theoretical approach would have been to divide the methods exclusively according to type, for example chromatography, electrophoresis, centrifugation, and so on. Under each type of method its use would be divided according to objective and by the differing types of starting material. This approach is more logical, but harder to comprehend and unrelated to actual practice. The more practical presentation begins with the concrete problem or question and describes the method that answers the question best. This is more intuitive, but inevitably leads to redundancies. A complete, "multidimensional" understanding of the material is only possible after the entire book has been absorbed. The approach in this book, for the most part, follows the second, practically oriented, approach.
When possible, such as in the section "Protein Analysis", the methods were grouped and presented according to the topic addressed. This includes the fundamentals of instrumental techniques, knowledge that is required for the complete understanding of other sections. We approached the problem of redundancy by cross-referencing the first instance in which a method is described. Sometimes we left redundancies in place for didactic reasons. We leave it to our readers to determine whether our choices represent the optimal solution to the problem of structuring the subject matter. An overview of the presented methods and their relationships can be found on the inside back cover. This flowchart should – particularly for readers new to the topic – illustrate how one can employ the analytical approaches, from splitting open the cells down to the molecular dimensions. In the diagram, the natural turbulences of the flow are deliberately sacrificed for the sake of clarity. Hopefully, the expert reader will forgive us!

At this point we would like to explain a convention in this book that is not in general use: the use of the terms in vitro and in vivo. To avoid misunderstandings, we explain here that we use these terms as molecular biologists usually understand them: in vitro for "cell free" and in vivo for "in living cells" (in situ translates literally as "in place" and is used and understood as such). In contrast, pharmacologists and physicians often use the term in vivo to refer to experiments in animals and lack suitable terminology to distinguish between experiments conducted in cell culture and those done in test tubes. In cases where the meaning may be unclear, we have used the precise term: "cell free", "in living cells", or "in animal experiments".

This first edition in English appears some 18 years after the initial publication of Bioanalytik in German. We are happy that we can finally follow the repeated wish of the scientific community and use English as the lingua franca of the biological sciences. The sustained interest in this book within the German-speaking community has led to our desire to make it available to a wider international audience. To maintain the same length as the original book, despite the addition of new chapters (calorimetry, sensors, and chemical biology), we have shortened or removed other chapters. The goal was to favor current methods and to reduce method descriptions of a more historical nature. It was sometimes hard to sacrifice cherished memories to better accommodate the current Zeitgeist. We would be grateful to our readers if they would point out any inaccuracies or deficiencies in our presentation that we may have overlooked.

As might be expected, this book involved a great deal of work, but it was also a great deal of fun to write! We wish to thank our authors, who through their conscientious and diligent work and their cooperation have been a pleasure to work with. Last but not least, we would like to thank our publisher Wiley-VCH and its dedicated team, in particular Waltraud Wüst, and the copyeditor John Rhodes, who, with remarkable enthusiasm and tenacity, were our consistent sources of support during the realization of this book.

Joachim W. Engels and Friedrich Lottspeich Munich and Frankfurt, January 2018

Introduction: Bioanalytics − a Science in its Own Right

In 1975, two publications by O'Farrell and Klose aroused the interest of biochemists. In their work, they showed spectacular images with thousands of neatly separated proteins – the first 2D electropherograms. At that time, a vision emerged in the minds of some protein biochemists: it might be possible to identify complex functional relationships, and ultimately to understand processes in the cell, by analyzing these protein patterns. For this purpose, however, the separated proteins had to be characterized and analyzed – a task with which the analytics of that time was hopelessly overtaxed. Completely new methods had to be developed, existing ones drastically improved, and the synergies between protein chemistry, molecular biology, genome analysis, and data processing had to be recognized and exploited, so that today's proteome analysis is on the threshold of realizing that utopian vision of more than 40 years ago.

In 1995, an international consortium with strong support from Jim Watson (HUGO, the Human Genome Organization) decided to sequence the human genome. Even though the scientific community was initially divided on the benefits of this endeavor, those involved showed that it is possible to accomplish such a huge undertaking via international cooperation, completing it even ahead of schedule. The competition between commercial and academic participants certainly contributed to the success. Craig Venter is remembered by many for his appearance and his bold claims regarding shotgun sequencing. The major sequencing groups came from the US and Britain, such as the Sanger Institute in Cambridge, England, the Whitehead Institute in Cambridge, Massachusetts, and the Genome Sequencing Center in St. Louis. With the publication in the journal Nature in October 2004, the gold standard of the human genome was completed. The biggest surprise was that the actual number of genes is much lower than expected.
With only about 21,000 genes, humans are nowhere near the top of the list with regard to gene number, being surpassed, for example, by parsley.

I.1 Paradigm Shift in Biochemistry: From Protein Chemistry to Systems Biology

The human genome project has had a fundamental impact on the entire life sciences. We now know that it is technically possible to perform fully automated high-throughput analysis in bioanalytics, and to process the enormous amounts of data that it generates. The results of the genome projects showed that predominantly data-driven research can provide fundamental insights about biology. All of this initiated a profound change from the classical, target- and function-oriented approach to biological questions to a systems-level, holistic perspective.

I.1.1 Classical Approach

Following the classical approach of the pre-genomic era, the starting point for almost every biochemical investigation was (and still is) the observation of a biological phenomenon (e.g., the alteration of a phenotype, the appearance or disappearance of an enzymatic activity, the transmission of a signal, etc.). Next, an attempt is made to correlate this biological phenomenon with one or a few molecular structures, most often proteins. Once a protein has been isolated that plays a crucial role in the observed biological context, its molecular structure, including its posttranslational modifications, has to be elucidated using state-of-the-art protein chemistry, so that finally the gene corresponding to this protein can be "fished". Thus, the whole arsenal of bioanalytics has to be used for the accurate analysis of an important protein. Molecular biological techniques facilitated and accelerated the analysis and validation enormously, and provided hints on the expression behavior of any proteins that were found. Physical methods such as X-ray crystallography, NMR, and electron microscopy allowed deep insights into the molecular


structures that sometimes even led to an understanding of biological processes at the molecular level. However, it was quickly recognized that biological effects are rarely explained by the action of a single protein, but are often due to the sequential actions of different biomolecules. Therefore, it was an essential step in the elucidation of reaction pathways to find interaction partners of the proteins under scrutiny. When they were found, the same laborious analysis was carried out on them. It is easy to see that this iterative process was quite time-consuming, so that the elucidation of a biological pathway usually took several years. Despite its slowness, the classical approach was incredibly successful. Virtually all our current knowledge of biological processes has been gained using this strategy. Nevertheless, it has some basic limitations, in that it is extremely difficult to elucidate network-like structures or transient interactions and to gain a complete insight into the more complex reaction processes of biological systems. Another principal limitation is that the data it yields are rarely quantitative and usually reflect a rather artificial situation. This is inherent in the strategy itself, in which the complex biological system is successively broken down into modules and subunits, moving further and further away from the biological in vivo situation. During the many separation and analysis steps some of the initial material is inevitably lost, which will affect different proteins in different and unpredictable ways. Thus, it becomes virtually impossible to make quantitative statements, which are extremely important for the mathematical modeling of reaction processes.

I.1.2 Holistic Strategy

Encouraged by the success of the human genome project, conceptually new ways of answering biological questions began to be conceived. Instead of analytically dissecting a biological situation and then selectively analyzing the smallest units, the idea was born to view and examine the biological system as a whole (holistic; Greek holos, whole). The same approach is used very successfully, for example in physics, by deliberately disturbing a defined system and observing and analyzing the reaction of the system. This so-called perturbation analysis (Latin perturbare, to disturb) has the enormous advantage that the response of the system can be monitored without any bias, and any observed changes should be directly or indirectly due to the perturbation. This strategy is ideal for highly complex systems. It is amenable to network-like, transient and, above all, unexpected relationships, and, being based on the whole system, is also very close to the real biological situation. However, to fully exploit the benefits of this strategy, the observed changes must be quantitatively measured. Due to the multitude of components in a biological system, this can be a challenge for high-throughput analytics, data processing, and advanced computing. Nevertheless, the methodological developments in bioanalytics and bioinformatics, driven and motivated by genome analysis, have reached a level that has made this kind of holistic analysis of a biological system feasible. It is seen as an essential enabling technology for systems biology, which aims to mathematically describe complex biological processes.

I.2 Methods Enable Progress

Just as two-dimensional gel electrophoresis, DNA sequencing, and the polymerase chain reaction opened up hitherto unthinkable levels of knowledge about biological relationships and at the same time spurred the development of their respective fields, methodological developments regularly are at the roots of truly significant advances in science. In the last decades, the life sciences have developed rapidly and revolutionized the understanding of biological relationships. The speed of this development is closely correlated with the development of separation and analysis methods, as shown in the table below. It is almost impossible to imagine modern biochemistry without one or more of these fundamental methodological achievements.

Milestones of bioanalytical methodology

1828  urea synthesis
1866  Mendelian laws
1873  microscopy
1890  crystallization
1894  lock-and-key principle
1906  chromatography
1907  peptide synthesis
1923  ultracentrifugation
1926  crystallization of urease
1930  electrophoresis
1935  phase contrast microscopy
1937  scanning electron microscopy
1941  partition chromatography
1944  EPR/ESR spectroscopy
1946  radioisotope labeling
1946  NMR spectroscopy
1950  protein sequence analysis
1953  gas chromatography
1953  DNA double helix
1959  analytical ultracentrifugation
1959  PAGE
1960  hybridization of nucleic acids
1960  X-ray structure analysis
1963  solid-phase peptide synthesis
1966  isoelectric focusing
1967  automated sequence analysis
1972  restriction analysis
1974  gene cloning
1974  HPLC of proteins
1975  2D electrophoresis
1975  Southern blotting
1975  monoclonal antibodies
1976  DNA sequence analysis
1981  site-specific mutagenesis
1981  capillary electrophoresis
1982  transgenic animals
1982  scanning tunneling microscopy
1983  automated oligonucleotide synthesis
1985  CAT assay
1986  polymerase chain reaction
1986  atomic force microscopy
1987  MALDI MS
1987  ESI MS
1988  combinatorial chemistry
1990  phosphorimager
1990  cryo-electron microscopy
1991  yeast two-hybrid system
1993  FISH
1993  differential display
1995  proteome analysis
1995  DNA chip
1996  yeast genome sequence
1998  RNA interference
1999  STED microscopy
2004  human genome sequence
2012  CRISPR/Cas

First, separation methods were developed and their application significantly improved. Starting from the simplest separation procedures, extraction and precipitation, the conditions were created to obtain purified and homogeneous compounds via much more effective methods such as electrophoresis and chromatography. The preparation of pure substances in turn exerted an enormous development pressure on the analytical methods. It soon turned out that biomacromolecules have much more complex structures than the hitherto known small molecules. New methods had to be developed, and old ones adapted to the new requirements. To effect a real breakthrough, the methods had to be implemented instrumentally and the instruments had to become commercially available. Since the 1950s both methods and equipment have been developed at an enormous pace. Today, they are sometimes up to 10,000 times faster and more sensitive than when they were introduced. Thanks to state-of-the-art microprocessor controls, the space requirements of the devices are also orders of magnitude lower than those of their ancestors, and their handling has similarly become easier thanks to software-assisted user guidance. While each of these tools may be quite expensive on its own, their higher throughput has led, in effect, to a tremendous cost reduction. This highly dynamic phase of method development persists to this day. To cite one example, mass spectrometry entered biology and biochemistry, thereby enabling completely new strategies for answering biological questions, such as proteome analysis. Another important example is the success story of bioinformatics, which is used, inter alia, in the analysis of gene or protein databases and which undoubtedly has enormous potential for deployment and development. The advancement of ever-higher resolution light microscopy (near-field scanning optical microscopy and confocal 4Pi microscopy) now allows molecules to be observed in action in the cell. The well-known passage from the Bible, "because you have seen me, you have believed", is also applicable to the scientist. All of this clearly shows that we are at the beginning of a phase of transition in which analytics not only has the task of confirming the data of others as an auxiliary science, but can formulate and answer questions of its own accord as a separate, relatively complex area of expertise. Thus, analytics is changing more and more from a purely retrospective to a diagnostic and prospective science. Typical for modern analytics is the interplay of a wide range of individual processes, in which each method is limited in itself, but whose concerted action produces synergisms that can yield answers of astounding and new quality. However, in order to make this synergy possible, a scientist needs to obtain a fundamental knowledge of the areas of application, possibilities, and limits of the various techniques.

I.2.1 Protein Analysis

Proteins, as the carriers of biological function, normally must be isolated from a relatively large amount of starting material and separated from a myriad of other proteins. A purification strategy that is optimized for good yield while at the same time preserving biological activity is of utmost importance. The purification of the protein itself is still one of the greatest challenges in bioanalytics. It is often time-consuming and demands from the experimenter a substantial knowledge of the separation methods and of the properties of proteins. Purification is usually accompanied by spectroscopic, immunological, and enzymological assays that identify and quantify proteins among a large number of very similar substances, allowing the purification process to be followed and assessed through its various steps. Thorough knowledge of classical protein determination methods and enzymatic activity tests is essential, since these methods often depend on the specific properties of the protein to be measured and can be significantly influenced by contaminating substances. Once a protein is isolated, the next step is to obtain as much information as possible about its primary structure, the sequence of its amino acid building blocks. For this purpose, the isolated protein is analyzed directly by sequence analysis, amino acid analysis, and mass spectrometry. Often, the identity of the protein can be ascertained at this stage by a database query. If the protein is unknown or needs to be analyzed more closely, for example to determine posttranslational modifications, it is broken down enzymatically or chemically into small fragments. These fragments are usually separated by chromatography and some of them are fully analyzed. The determination of the full amino acid sequence of a protein with protein-chemical methods alone is difficult, laborious, and expensive, and is usually restricted to the quality control of recombinant therapeutic proteins.
In other cases, a few easily accessible partial sequences are usually sufficient. These partial sequences are used for the preparation of synthetic peptides, which are used in turn to generate monospecific antibodies, or for oligonucleotide probes. These probes are used to isolate the gene of interest, ultimately leading to the DNA sequence through DNA analysis, which is orders of magnitude faster and simpler than protein sequence analysis. The DNA sequence is then translated into the complete amino acid sequence of the protein. However, posttranslational modifications are not detected in this detour via the DNA sequence. Since they play a decisive role in determining the properties and functions of proteins, they must be subsequently analyzed, with all the available high-resolution techniques, on the purified protein. These modifications can – as in the case of glycosylations – be very complex, and their structure elucidation is very demanding. Even if one knows the primary structure of a protein, has determined its posttranslational modifications, and can make certain statements about its folding (secondary structure), one will rarely understand the mechanism of its biological function at the molecular level. To achieve this, a high-resolution spatial structure, obtained by X-ray structure analysis, NMR, or electron microscopy, must be known. Also, the analysis of different complexes (e.g., between an enzyme and an inhibitor) can yield detailed insight into the molecular mechanisms of protein action. Because of the high material requirements, these investigations generally take place via the detour of the overexpression of recombinant genes. Once the entire primary structure, the posttranslational modifications, and possibly even the spatial structure have been elucidated, the function of a protein often still remains in the dark. Building on an intensive analysis of molecular interaction data, functional analysis is then used to deduce the functional properties from the structures of the substances studied.
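The translation of a DNA sequence into the amino acid sequence of the encoded protein, mentioned above, is computationally a simple codon lookup in the standard genetic code. A minimal sketch (the codon table here is abbreviated to the codons used in the example; a real implementation would carry all 64 codons):

```python
# Sketch: translating an open reading frame (5'->3', frame 0) into
# one-letter amino acid code. Abbreviated standard-code table for
# illustration only; '*' marks a stop codon.
CODON_TABLE = {
    "ATG": "M", "AAA": "K", "GAA": "E", "TTC": "F",
    "GGT": "G", "TGG": "W", "TAA": "*",
}

def translate(orf: str) -> str:
    """Translate a DNA open reading frame, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):       # step through complete codons
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAAAGAATTCGGTTGGTAA"))  # -> MKEFGW
```

Note that this is exactly the "detour" the text describes: the lookup recovers the primary structure, but no posttranslational modification is visible at this level.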

I.2.2 Molecular Biology Throughout their development, the methods of biochemistry and molecular biology have mutually enriched and supplemented each other. While molecular biology was initially synonymous with cloning, it has long since become an independent discipline with its own goals, methods, and results. In all molecular biological approaches, whether in basic research or in diagnostic, therapeutic, or industrial applications, the experimenter deals with nucleic acids. Naturally occurring nucleic acids exhibit a variety of forms; they can be double- or single-stranded, circular or linear, of high molecular weight or short and compact, “naked” or associated with proteins. Depending on the organism, the form of the nucleic acid, and the purpose of the analysis, a suitable method for their isolation is chosen, followed by analytical methods for checking their integrity, purity, shape, and length. Knowledge of these properties is a prerequisite for any subsequent use and analysis of DNA and RNA. A first approximation to the analysis of DNA structure is provided by restriction endonuclease cleavage; it was this tool that enabled the birth of molecular biology about 50 years ago. Restriction endonuclease cleavage is also the prerequisite for cloning, that is, the amplification and isolation of uniform individual DNA fragments. It is followed by a variety of biochemical analysis methods, most notably DNA sequencing and a variety of hybridization techniques that can identify, localize, and

quantify a particular nucleic acid within a large, heterogeneous set of different nucleic acid molecules. The roughly thirty-year-old, Nobel Prize-winning polymerase chain reaction (PCR) has revolutionized the possibilities of analyzing nucleic acids with a principle that is as ingenious as it is simple. The smallest amounts of DNA and RNA can be detected, quantified, and amplified without cloning. In PCR applications, the imagination of the researcher seems almost unlimited. Because of its high sensitivity, however, the method also contains sources of error, which necessitate special caution by the user. Its evolution into a miniaturized, fast, and cost-effective standardized method is a good example of the lab-on-a-chip of the future. Of course, PCR has also found its way into the sequencing of nucleic acids, one of the classic domains of molecular biology. Nucleic acid sequencing was the basis for the highly sophisticated, international human genome project. Other model organisms were also sequenced in this context: by 2010, about 250 eukaryotes and 4000 bacteria and viruses had been sequenced, thanks in particular to modern, massively parallel sequencing methods. Many compare the human genome project with the manned flight to the moon (although it did not require similar amounts of money; the budget averaged a mere $200 million a year for ten years). Like similarly ambitious goals, it has led to significant technical innovations. The methods developed within the human genome project also had a major impact on biotechnology-related fields, such as medicine, agriculture, and environmental protection. An analytical approach intertwined with the goals of the human genome project is the mapping of specific chromosomal regions, which is done through genetic linkage analysis, cytogenetics, and other physical methods. Mapping is done for genes (i.e., “functional units”) or for DNA loci that literally exist only as sequence units.
A new approach has emerged with positional cloning, which used to be called reverse genetics. This “reversed” approach to traditional genetics (first gene, then function = phenotype) has already proved its worth in some cases. The most important diseases, such as diabetes, cancer, heart attack, depression, and metabolic diseases, are each influenced by a multitude of genetic and environmental factors. Although two unrelated humans carry about 99.9% identical gene sequences, the remaining 0.1% can be crucial for the success of a therapy. Finding the differences in the gene sequences responsible for these risks offers a great opportunity to understand the complex causes and processes of disease. It will be interesting to find out which base exchanges at which positions contribute to whether an individual can tolerate a drug and whether that drug also shows the desired effect. It is precisely this connection between sequence on the one hand and effect or function on the other that is the focus of functional gene diagnostics. Array diagnostics and siRNA analysis have proven to be very potent tools: whereas the former detects the presence of an mRNA by hybridization, the latter can establish the connection between RNA and protein. The result is a high-resolution map of the human genome. siRNAs are small double-stranded RNAs (20–27-mers) that can recognize and switch off complementary mRNAs. Since there are only about 21,000 genes in humans, high-throughput siRNA analysis is possible and all genes of an organism can be analyzed. For example, all genes of the nematode Caenorhabditis elegans have been RNAi-inhibited, leading to the first complete functional gene mapping.

Chemical biology, as a discipline of chemistry at the interface with biology, attempts to find small organic molecules for the modulation of protein interactions. As with the siRNA approach, cellular functions can be analyzed, but small organic molecules have the advantage of generating fast responses that are spatially and temporally reversible. Thus, it may be possible to find small molecules for all cellular targets, which would illuminate physiological correlations and ultimately help to open up new therapeutic applications.

The analysis of the linear structure of DNA is completed by determining DNA modifications, especially base methylation. These modifications influence the structure of DNA and its association with proteins and affect a variety of biological processes. Base methylation is of particular importance for modulating gene activity: humans, with their comparatively small number of genes, have the ability to regulate transcription by methylating the base cytosine. This phenomenon, known as epigenetics, is responsible for the differential expression of genes in different cells. Because the specific modifications of the genomic DNA are lost in cloning or PCR amplification, their detection must be done directly on genomic DNA; this requires methods with high sensitivity and resolution.
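The strict base pairing that the siRNA and antisense approaches described above rely on can be sketched in a few lines. A minimal illustration (the sequence is invented for demonstration, not a real gene):

```python
# An siRNA guide strand pairs with its complementary mRNA stretch.
# Reverse complement of an RNA sequence (Watson-Crick pairing A-U, G-C):
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement_rna(seq):
    """Return the reverse complement of an RNA sequence."""
    return "".join(PAIR[base] for base in reversed(seq))

mrna_target = "AUGGCUAAGC"                   # hypothetical mRNA stretch
guide = reverse_complement_rna(mrna_target)  # the strand that would pair with it
```

Applying the function twice returns the original sequence, reflecting the symmetry of complementarity that makes an siRNA specific for its target mRNA.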

I.2.3 Bioinformatics Even before the human genome project and other sequencing projects had produced a myriad of data, a trend from wet labs to net labs had been growing over the past thirty years; that is, the activity of some researchers shifted increasingly from the lab bench to computer-related work. Initially, this was limited to simple homology comparisons of nucleic acids or proteins, in order to elucidate relationships or to obtain clues about the function of unknown genes. Added to this are mathematical simulations, pattern recognition and search strategies for structural and functional elements, and algorithms for weighting and evaluating the data. Databases familiar to the molecular biologist today include not only sequences but also three-dimensional structures. It is remarkable and pleasing that one has free and sometimes interactive access to this vast amount of data and its processing over the internet. This networked information structure and its management is the basis of today’s bioinformatics.
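At their most basic, the "simple homology comparisons" mentioned above begin with a percent-identity count between two sequences. A minimal sketch (the sequences are made up for illustration):

```python
def percent_identity(a, b):
    """Ungapped percent identity of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# One mismatch in seven positions:
ident = percent_identity("GATTACA", "GACTACA")
```

Real homology searches, of course, use alignment algorithms that handle gaps and weighted substitution scores (e.g., BLAST); the count above is only the conceptual core.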

I.2.4 Functional Analysis We have already seen how bioinformatics opens up systematic functional analysis. In this section, we will cover investigations of the interactions of proteins with each other or with nucleic acids. Researchers looked at protein–DNA interactions early in the history of molecular biology, after it became clear that genetic trans-factors are mostly DNA-binding proteins. The binding site can be characterized very precisely with so-called footprint methods. In vivo footprints also allow the occupancy state of a genetic cis element to be correlated with a defined process, for example active transcription or replication. This can provide information about the mechanism of activation and also about the protein’s function in the cell.


Interactions between biomacromolecules can also be detected by biochemical and immunological methods, such as affinity chromatography or cross-linking methods, affinity (far-Western) blots, immunoprecipitation, and ultracentrifugation analysis. In these approaches, an unknown partner that interacts with a given protein usually needs to be identified subsequently by protein chemical methods. In genetic approaches this is easier, because the interacting partner is expressed from a cDNA that has already been cloned. An intelligent genetic technique developed for this purpose is the two-hybrid technique, which can also be used to study interactions between proteins and RNA. It should be kept in mind, however, that the physiological significance of the identified interactions, however plausible they may appear, must be demonstrated separately. Protein–DNA, protein–RNA, and protein–protein interactions initiate a number of processes in the cell, including the expression of certain genes as opposed to all genes. The activity of genes expressed only in specific cell types or under specific conditions can be measured by a variety of methods, such as differential display, which amounts to a 1:1 comparison of expressed RNA species. Having found genes that undergo differential expression, their cis and trans elements (in other words, the promoter and enhancer elements and the necessary transactivator proteins that effect this regulation) can be determined. For this purpose, functional in vitro and in vivo tests are carried out. Even though all the aforementioned analyses provide a solid insight into the specific expression of a gene and its regulation, the actual function of the gene, its phenotype, remains unknown. This is a consequence of the era of reverse genetics, in which it has become comparatively easy to sequence DNA and determine “open reading frames”. Correlating an open reading frame or transcription unit with a phenotype is more difficult.
Doing so requires a disruption of the expression of the gene of interest. This disruption can be introduced externally, for example by gene modification, that is, by mutagenizing the region of interest. Until about 25 years ago, site-specific mutagenesis was only possible in vivo in microorganisms, by the application of genetic recombination techniques. Since then, various techniques have been optimized to the point that it is possible to introduce genes modified in vitro into higher cells or organisms and to replace the endogenous gene. However, a disruption of the gene or of the gene function can also be achieved by other methods. In this respect, methods of translational regulation have proven particularly useful. Replacing earlier antisense or antigene techniques, in which oligonucleotides complementary to certain regions are introduced into the cell and inhibit the expression of the gene, the advent of RNAi in 1998 ushered in a new era. By an appropriate choice of complementary RNA, any mRNA can be switched off. It is important to note that this represents not a gene knock-out but a knock-down. Thus, crucial genes can be downregulated without killing the organism. Instead of down-regulating, the amount of gene product can also be increased by overexpression, which is significant for agricultural production through transgenic plants and animals. The latter methods (gene modification, the antisense and RNAi techniques, and overexpression) have been widely used in medicine and agriculture. The reasons are as manifold as they are obvious. Transgenic animals or plants can increase agricultural yields. Clinical expression cloning can open up new possibilities for combating malignant cells that, lacking the expression of certain surface antigens, are not recognized by the body’s own immune system. The antisense and RNAi techniques can be employed to suppress the activation of undesired genes, for example oncogenes.

Along these lines, another oligonucleotide-based method with great potential in molecular biology is the CRISPR/Cas9 technology. In prokaryotes, Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and their associated cas genes serve as an adaptive immune system that protects them from infection by bacteriophages. The bacterial system has been adapted for use in eukaryotic cells and can now be used for precise genome engineering. Compared to the antisense and RNAi approaches discussed above, the CRISPR/Cas9 technology is extremely efficient and easy to use. In addition, the CRISPR/Cas9 system enables the full knockout of the target gene, whereas the alternative methods may only result in a partial knockdown of gene expression.

However, because an organism is an infinitely more complex system than a controlled in vitro system or a single cell, the desired effect is not always achieved. By way of example, it should be remembered that some of the therapeutic successes of these techniques had nothing to do with nucleic acid hybridization in vivo but, as was later recognized, rather with a local, nonspecific activation of the immune system due to the lack of methyl groups on the CpG dinucleotides of the oligonucleotides used, or with other protein-mediated effects. Such incidents, and other possibly less harmless complications, in our eyes lend more weight to the already existing duty of researchers and users to pay close attention to what is happening, and what can happen, in their work. To that end, a solid knowledge of the available analytical methods and of the interpretation of biological correlations is one of several prerequisites. This book aspires to (also) make a contribution towards that goal.

Part I

Protein Analytics

1 Protein Purification

Friedrich Lottspeich, Peter-Dörfler-Straße 4a, 82131 Stockdorf, Germany

Investigation of the structure and function of proteins has kept scientists busy for over 200 years. In 1777 the French chemist Pierre J. Macquer subsumed under the term Albumins all substances that showed the peculiar phenomenon of changing from a liquid to a solid state upon warming. Chicken egg white, casein, and the blood component globulin already belonged to this class of substances. As early as 1787 (i.e., about the time of the French Revolution) the purification of egg white-like (coagulating) substances from plants was reported. In the early nineteenth century many proteins, like albumin, fibrin, or casein, were purified and analyzed. It soon became apparent that these compounds were considerably more complicated than other organic molecules known at that time. The word protein was most probably introduced by the Swedish chemist Jöns J. von Berzelius in about 1838 and was then published by the Dutch chemist Gerardus J. Mulder. Mulder also suggested a chemical formula, which at that time was regarded as universally valid for all egg white-like materials. The homogeneity and purity of these purified proteins did not, of course, correspond to today’s standards. However, it became clear that proteins are different and distinct molecules. At that time purification could succeed only with very simple steps, such as extraction for enrichment, acidification for precipitation, and spontaneous crystallization. Already in 1889 Hofmeister had obtained chicken albumin in crystalline form. Although Sumner could crystallize enzymatically active urease as early as 1926, the structure and architecture of proteins remained unknown until the middle of the twentieth century. Only the development of efficient purification methods, which allowed single proteins to be isolated from complicated mixtures, accompanied by a revolution in the analysis techniques for purified proteins, made today’s understanding of protein structures possible.
This chapter describes fundamental purification methods and also touches on how they can be used systematically and strategically. It is extremely difficult to look at this subject in general terms, because the physical and chemical properties of single proteins may be very different. However, this structural diversity, which in the end determines also the function of the various proteins, is biologically very meaningful and necessary. Proteins – the real tools and building materials of a cell – have to exercise a plethora of different functions.

1.1 Properties of Proteins Size of Proteins The size of proteins can be very different. From small polypeptides, like insulin, which consists of 51 amino acids, up to very big multifunctional proteins, for example, to the apolipoprotein B, a cholesterol-transporting protein which contains more than 4600 amino acid residues, with a molecular mass of more than 500 000 Dalton (500 kDa). Many proteins are composed of oligomers from the same or different protein chains and have molecule masses up to some millions Daltons. Quite in general it is to be expected that, the greater a protein is, the more Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels.  2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

The molar mass (M) – often wrongly called molecular weight – is not a mass but is defined as the mass of a substance divided by the amount of substance: M = m/n = N_A · m_M. The unit is g mol⁻¹.

The absolute molecular mass (m_M) is the molar mass divided by the number of molecules in one mole (the Avogadro constant, N_A): m_M = M/N_A. The unit is g. The relative molecular mass (M_r) is defined as the mass of one molecule normalized to the mass of ¹²C (carbon-12), which by definition is equal to 12: M_r = 12 · m(molecule)/m(¹²C). It is dimensionless, but it has been given the “unit” Dalton (Da) (formerly atomic mass unit).
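The three quantities defined above can be related numerically in a few lines. A minimal sketch (the 50 kDa example protein is an assumption for illustration, not from the text):

```python
# Conversions between molar mass M (g/mol), absolute molecular mass m_M (g),
# and relative molecular mass M_r (dimensionless, "Da"), as defined above.
N_A = 6.02214076e23   # Avogadro constant, mol^-1
U   = 1.66053907e-24  # atomic mass unit (1 Da) in g

def absolute_mass(molar_mass_g_per_mol):
    """m_M = M / N_A, in grams per molecule."""
    return molar_mass_g_per_mol / N_A

def relative_mass(molecule_mass_g):
    """M_r = m(molecule) / u, dimensionless ("Da")."""
    return molecule_mass_g / U

M   = 50_000.0            # a hypothetical 50 kDa protein
m_M = absolute_mass(M)    # mass of a single molecule, ~8.3e-20 g
M_r = relative_mass(m_M)  # numerically ~50 000 "Da"
```

Because u · N_A ≈ 1 g mol⁻¹, M_r comes out numerically equal to the molar mass in g mol⁻¹, which is why "kDa" and "kg mol⁻¹" are used interchangeably in practice.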


Figure 1.1 Separation methods of proteins and peptides. The separation capacity (i.e., the maximal number of compounds that can be separated in a single analysis) of the various separation methods is very different for different molecular masses of the analyte. Abbreviations: SEC, size exclusion chromatography; HIC, hydrophobic interaction chromatography; IEC, ion exchange chromatography; RPC, reversed phase chromatography; CE, capillary electrophoresis.

Dalton (Da), named after the researcher John Dalton (1766–1844), is a non-SI mass unit. One Dalton is equivalent to the atomic mass unit (u = 1/12 of the mass of ¹²C) and corresponds roughly to the mass of one hydrogen atom (1.66 × 10⁻²⁴ g). In biochemistry the unit kDa (1 kilodalton = 1000 Da) is very often used.

Chromatographic Separation Techniques, Chapter 10

Proteome Analysis, Chapter 39

Detergents, Section 1.8

difficult its isolation and purification will be. The reason lies in the separation procedures, which show very low efficiencies for large molecules. Figure 1.1 plots the separation capacity (the maximum number of analytes that can be separated under optimal conditions) of the individual separation techniques against the molecular mass of the analytes. It is evident that for small molecules like amino acids or peptides some chromatographic procedures can clearly distinguish more than 50 analytes in a single analysis. In the region of proteins (Mr > 10⁴) one recognizes that, among the chromatographic techniques, actually only ion exchange chromatography is able to separate more complicated mixtures efficiently. In the molecular mass range of proteins, electrophoretic methods are by far more efficient. That is why in proteome analysis (e.g., the analysis of all proteins of a cell), where several thousand proteins have to be separated, electrophoretic procedures (linear and two-dimensional) are used very often. The figure also shows that almost no efficient separation procedures exist for very large species, for example protein complexes with molecular masses greater than 150 kDa, or organelles. The separation efficiency of a method is, however, not always the relevant parameter in a protein purification. If selective purification steps are available, the separation capacity is no longer significant and the selectivity becomes the crucial issue. Consequently, an affinity purification, which is based on the specific binding of a substance to an affinity matrix (for example, an immunoprecipitation or an antibody affinity chromatography), has a quite low separation capacity of 1 but an extremely high selectivity. Due to this high selectivity, a protein can easily be isolated even from a complex mixture in a one-step procedure.
With the most common purification techniques, electrophoresis and chromatography, the analytes must be present in dissolved form. Thus, the solubility of the protein in aqueous buffer media is a further important parameter when planning a protein purification. Many intracellular proteins located in the cytosol (e.g., enzymes) are readily soluble, while structure-forming proteins like cytoskeletal proteins or membrane proteins are most often much less soluble. Especially difficult to handle in aqueous solutions are hydrophobic integral membrane proteins, which are usually surrounded by lipid membranes. Without the presence of detergents, such proteins will aggregate and precipitate during the purification.


Available Quantity The quantity available in the raw material plays a crucial role in determining the effort that must be invested in a protein purification. A protein intended for isolation may be present as only a few copies per cell (e.g., transcription factors) or a few thousand copies (e.g., many receptors). On the other hand, abundant proteins (e.g., enzymes) can constitute percentage shares of the total protein of a cell. Overexpressed proteins are often present in clearly higher quantities (>50% of the protein in a cell), as are some proteins in body fluids (e.g., albumin in plasma, >60%). Purification is usually much simpler when larger quantities of a protein are available. Especially for the isolation of rare proteins, different sources of raw material should be checked for their content of the protein of interest.

Acid/Base Properties Proteins have certain acidic or basic properties because of their amino acid composition, properties that are exploited in separations by ion exchange chromatography and electrophoresis. The net charge of a protein depends on the pH of the surrounding solution: at low pH it is positive, at high pH negative, and at the isoelectric point it is zero, because positive and negative charges compensate at this pH.

Biological Activity The purification of a protein is often complicated by the fact that a particular protein can frequently be detected and localized among the various other proteins only through its biological activity and location. Hence, the preservation of this biological activity must be taken into account at every stage of the isolation. Usually the biological activity is based on a specific molecular and spatial structure. If this structure is destroyed, one speaks of denaturation, which is often irreversible. To avoid denaturation, some procedures must be excluded in practice. The biological activity is often stable to different extents under different environmental conditions.
Too high or too low buffer concentrations, temperature extremes, contact with artificial surfaces such as glass, or missing cofactors can change the biological characteristics of proteins. Some of these changes are reversible: small proteins in particular are often able, after denaturation and loss of activity, to renature under certain conditions and regain the biologically active form. For larger proteins this is rarely the case and often succeeds only with poor yield. Measurement of the biological (e.g., enzymatic) activity makes it possible to monitor the purification of a protein: with increasing purity, a higher specific activity is measured. In addition, the biological activity itself can be utilized for the purification of the protein. The activity often goes hand in hand with binding to other molecules, such as enzyme–substrate, enzyme–cofactor, receptor–ligand, or antibody–antigen pairs. This binding is very specific and can be used to design affinity purifications, which are characterized by high enrichment factors and may achieve an efficiency that is difficult to obtain by other techniques.

Stability When proteins are extracted from their biological environment, their stability is often markedly impaired. They may be degraded by proteases (proteolytic enzymes) or associate into insoluble aggregates, which almost always leads to an irreversible loss of biological activity. For these reasons, protease inhibitors are often added in the first steps of an isolation, and the purification is carried out quickly and generally at low temperatures. Considering the diversity of the characteristics of proteins, it immediately becomes obvious that a protein separation cannot be performed according to a single schematic protocol.
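Monitoring a purification via specific activity, as described above, amounts to simple bookkeeping: specific activity is total activity divided by total protein, enrichment is the fold increase relative to the crude extract, and yield is the fraction of starting activity retained. A minimal sketch with invented numbers (illustrative, not measured data):

```python
# Specific activity = total enzyme activity / total protein (U/mg).
# Enrichment and yield are computed relative to the starting extract.
def purification_table(steps):
    """steps: list of (name, total_activity_U, total_protein_mg)."""
    _, u0, mg0 = steps[0]
    rows = []
    for name, u, mg in steps:
        spec = u / mg  # U/mg
        rows.append({
            "step": name,
            "specific_activity": spec,
            "enrichment": spec / (u0 / mg0),  # fold purification
            "yield_percent": 100.0 * u / u0,
        })
    return rows

rows = purification_table([
    ("crude extract",    10_000, 5000),  # 2 U/mg
    ("ammonium sulfate",  8_000, 1000),  # 8 U/mg
    ("affinity column",   6_000,   30),  # 200 U/mg
])
```

In this invented example the affinity step gives a 100-fold enrichment at 60% yield, illustrating the usual trade-off between purity and recovery.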
For a successful isolation strategy a realistic judgement of the behavior of a protein in different separation and purification methods, a minimal understanding of the solubility and charge properties of the protein to be purified, and a clear vision of why the protein is to be purified are absolutely necessary.
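The pH dependence of the net charge described under Acid/Base Properties can be modeled with the Henderson–Hasselbalch equation. A minimal, illustrative sketch using commonly quoted average pKa values (the values, the residue counts, and the helper names are assumptions, not the book's method; real proteins deviate):

```python
# Approximate net charge of a protein from its counts of ionizable groups.
PKA_BASIC  = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC = {"Cterm": 3.1, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.5}

def net_charge(counts, ph):
    """Sum of fractional charges of all ionizable groups at a given pH."""
    q = 0.0
    for group, pka in PKA_BASIC.items():   # protonated form is positive
        q += counts.get(group, 0) / (1.0 + 10.0 ** (ph - pka))
    for group, pka in PKA_ACIDIC.items():  # deprotonated form is negative
        q -= counts.get(group, 0) / (1.0 + 10.0 ** (pka - ph))
    return q

def isoelectric_point(counts, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection: net charge decreases monotonically with pH."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_charge(counts, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# A hypothetical protein: 10 Lys, 5 Arg, 2 His, 12 Asp, 8 Glu, plus termini.
counts = {"Nterm": 1, "Cterm": 1, "K": 10, "R": 5, "H": 2, "D": 12, "E": 8}
pi = isoelectric_point(counts)
```

The sketch reproduces the behavior stated in the text: positive net charge at low pH, negative at high pH, and zero at the isoelectric point, the pH at which the protein would not migrate in an electric field.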

Goal of a Protein Purification Above all, the first steps of a purification procedure, the level of purity to be aimed at, and the analytics to be used depend strongly on the intention behind purifying a certain protein. Thus, far higher demands on purity must be made for the isolation of a protein for therapeutic purposes (e.g., insulin, growth hormones, or blood coagulation inhibitors) than for a protein used in the laboratory for structural investigations. In many cases one wants to isolate a protein only to make an unequivocal identification or to clarify some amino acid sequence segments. For such purposes a tiny amount of protein

Enzyme Activity Testing, Chapter 3

Immune Binding, Section 5.3 Protein Purification with Tags, Section 16.2-16.4

Protein Degradation, Chapter 9


(usually in the microgram range) is sufficient. With the sequence information one is able to identify the protein in protein databases, or it provides the information needed to produce oligonucleotides with which to isolate the gene corresponding to the protein. The protein can then be expressed in a host organism in much larger quantities (up to gram quantities) than were present in the original source (heterologous expression). Many of the subsequent investigations are then carried out not with the material from the natural source but with the recombinant protein. New strategic approaches to the analysis of biological questions, such as proteomics and other subtractive approaches, require completely new types of sample preparation and protein isolation. Here it is essential not to change the quantitative relations of the individual proteins. A major advantage of these new strategies is that the preservation of the biological activity is no longer so important. Although each protein purification is to be regarded as a unique case, one can still find, especially for the first purification steps, some general rules and procedures that have already been applied frequently in successful isolations; they will be discussed in detail below.

1.2 Protein Localization and Purification Strategy The first step in any protein purification aims to bring the protein of interest into solution and to remove all particulate and insoluble material. Figure 1.2 shows a scheme for proteins of different localization and solubility. For the purification of a soluble extracellular protein, cells and other insoluble components must be removed to obtain a homogeneous solution, which can then be subjected to the purification methods discussed in the following sections (precipitation, centrifugation, chromatography, electrophoresis, etc.). Sources of extracellular proteins are, for example, culture supernatants of microorganisms, plant and animal cell culture media, or body fluids such as milk, blood, urine, and cerebrospinal fluid. Often, extracellular proteins are present only in relatively low concentrations and demand an efficient concentration step early on. To isolate an intracellular protein, the cells must be disrupted in a manner that releases the soluble contents of the cell and keeps the protein of interest intact. Cell disruption methods differ mainly according to cell type and amount of cells.

Membrane Proteins and other Insoluble Proteins Membrane-associated proteins are usually purified after isolation of the relevant membrane fraction. For this purpose, peripheral membrane proteins that are bound loosely to membranes are separated under relatively mild conditions, such as high pH, EDTA addition, or low concentrations of a non-ionic detergent. This fraction of peripheral membrane proteins can often then be treated like soluble proteins. Integral membrane proteins, which aggregate outside their membrane via hydrophobic amino acid sequence regions and become insoluble, can only be released from the membrane by using high detergent

Figure 1.2 Purification scheme for different proteins. According to localization and solubility different purification steps are necessary before any subsequent selective and highly efficient purification steps.


concentrations. At present, they probably pose the greatest challenge to isolation and purification techniques. Proteins that are insoluble in normal aqueous buffers are in general structural proteins (e.g., elastin); in addition, they are sometimes crosslinked via post-translationally attached modifications (e.g., functional groups). Here a first and highly efficient purification step is to remove all soluble proteins. Further steps are usually possible only under conditions that destroy the native structure of the proteins. Further processing is often carried out by cleaving the crosslinks of the denatured proteins and using chaotropic reagents (e.g., urea) or detergents.

Recombinant Proteins A special situation occurs in the production of recombinant proteins. A rather simple purification is possible when recombinant proteins are expressed in inclusion bodies. These are dense aggregates of the recombinant product, which are present in a non-native state and are insoluble because the protein concentration is too high, because the expressed protein cannot be folded correctly in the host environment, or because the formation of the (correct) disulfide bonds is not possible in the reducing environment inside the host. After a simple purification by differential centrifugation (Section 1.5.2), in which the other insoluble cell components are removed, the recombinant protein is obtained in rather pure form. However, it still needs to be converted into the biologically active state by renaturation. When the expression of recombinant proteins does not result in inclusion bodies, the protein is present in a soluble state inside or outside the cell, depending on the vector. Here, further purification is similar to that of naturally occurring proteins, but with the advantage that the protein to be isolated is already present in relatively large amounts.
Recombinant proteins can be easily isolated by using specific marker structures (tags). Typical examples are fusion proteins, in which the coding regions for a tag structure and the desired protein are ligated at the DNA level and expressed as a single protein. Such fusion proteins can often be isolated in rather pure form in a one-step procedure by applying an affinity chromatography specific for the tag structure. Examples are GST fusion proteins purified with antibodies against GST, or biotinylated proteins purified on avidin columns. Another frequently used tag structure is a stretch of multiple histidine residues, attached to the N- or C-terminal end of the protein chain, which is easy to isolate by immobilized metal affinity chromatography (IMAC).

1.3 Homogenization and Cell Disruption

To purify biological components of intact tissues, the complex cell associations must be disrupted in a first step by homogenization. The result is a mixture of intact and disrupted cells, cell organelles, membrane fragments, and small chemical compounds derived from the cytoplasm and from damaged subcellular compartments. Since the cellular components are transferred to a non-physiological environment, the homogenization media should meet several basic requirements:

 protection of the cells from osmotic bursting,
 protection from proteases,
 protection of the biological activity (function),
 prevention of aggregation,
 minimal destruction of organelles,
 no interference with biological analyses and functional tests.

Normally this is done by isotonic buffers at neutral pH. Often, a cocktail of protease inhibitors is added (Table 1.1). If you want to isolate intracellular organelles, such as mitochondria, nuclei, microsomes, and so on, or intracellular proteins, the still intact cells have to be disrupted. This is accomplished by mechanical destruction of the cell wall. This procedure releases heat of friction and therefore has to be carried out with cooling. The technical realization of the disruption process varies depending on the starting material and location of the target protein (Table 1.2).

Protein Interaction, Section 16.2–16.4

Immobilized Metal Affinity Chromatography, Section 10.4.8

Part I: Protein Analytics

Table 1.1 Protease inhibitors.

Substance                                  Concentration   Inhibitor of
Phenylmethylsulfonyl fluoride (PMSF)       0.1–1 mM        Serine proteases
Aprotinin                                  0.01–0.3 μM     Serine proteases
ε-Amino-n-caproic acid                     2–5 mM          Serine proteases
Antipain                                   70 μM           Cysteine proteases
Leupeptin                                  1 μM            Cysteine proteases
Pepstatin A                                1 μM            Aspartate proteases
Ethylenediaminetetraacetic acid (EDTA)     0.5–1.5 mM      Metalloproteases
For very sensitive cells (e.g., leukocytes, ciliates) repeated pipetting of the cell suspension or pressing it through a sieve is sufficient to achieve a disintegration by surface shear forces. For the slightly more stable animal cells, the shear forces are generated with a glass pestle in a glass tube (Dounce homogenizer). These methods are not suitable for plant and bacterial cells.

 Cells that have no cell wall and are not associated (e.g., isolated blood cells) can be broken osmolytically by being placed in a hypotonic environment (e.g., in distilled water). The water penetrates into the cells and causes them to burst.
 In cells with cell walls (bacteria, yeasts) the cell walls must be treated enzymatically (e.g., with lysozyme) before an osmolytic digestion can succeed. Such treatment is very gentle and is therefore particularly suitable for the isolation of nuclei and other organelles.
 For bacteria, repeated freezing and thawing is often used as a disruption method. The change of aggregate state deforms the cell membrane so that it breaks and the intracellular content is released.
 Microorganisms and yeasts can be dried at 20–30 °C in a thin layer for two to three days. This leads to destruction of the cell membrane. The dried cells are then ground in a mortar and can be stored at 4 °C, if necessary also for longer periods. Soluble proteins can be extracted from the dry powder with an aqueous buffer in a few hours.
 With cold, water-miscible organic solvents (acetone, –15 °C, ten-times volume) cells can be quickly dehydrated; the lipids are extracted into the organic phase and the cell walls are thus destroyed. After centrifugation, the proteins remain in the precipitate, from which they can be recovered by extraction with aqueous solvents.
 With stable cells such as plant cells, bacteria, and yeasts, a mortar and pestle can be used for cell disruption, although larger organelles (chloroplasts) may be damaged. The addition of an abrasive (sea sand, glass beads) facilitates the disruption.
 For larger quantities, rotating-knife homogenization can be used. The tissue is cut by a rapidly rotating knife. As this produces considerable heat, a means of cooling should be present. For small objects such as bacteria and yeasts, the efficiency of the disruption process is significantly improved by the addition of fine glass beads.
 Vibration cell mills are used for a relatively harsh disruption of bacteria.
These are lockable steel vessels in which the cells are vigorously shaken with glass beads (diameter 0.1–0.5 mm). Again, the heat generated must be dissipated. Cell organelles can be damaged by this disruption method. Rapid changes in pressure break cells and organelles in a very efficient manner. For this purpose, strong pressure changes are produced in a suspension of the cell material with ultrasonic waves in the frequency range 10–40 kHz applied through a metal rod. Since much heat is released in this method, only relatively small volumes and short sound pulses with a maximal duration of 10 s should be applied. DNA is fragmented under these conditions. In a further disruption method that is particularly suitable for microorganisms, up to 50 ml of a cell suspension are pressed through a narrow opening (1.3

[Table: physical properties of cell compartments (nuclei, plasma membrane, Golgi apparatus, mitochondria, lysosomes, peroxisomes, microsomes/endoplasmic reticulum, ribosomes, soluble proteins), listing diameter (μm), density (g ml⁻¹), sedimentation coefficient (S), and typical pelleting conditions from 1500g/15 min up to 150 000g/40 min]
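Pelleting conditions such as those above are specified as relative centrifugal force (RCF, in multiples of g). RCF depends on the rotation radius and the rotor speed via the standard relation RCF = 1.118 × 10⁻⁵ · r · rpm² (r in cm). A minimal sketch of the conversion in both directions; the 8 cm radius is an assumed example value, not taken from the text:

```python
import math

RCF_CONST = 1.118e-5  # empirical constant for r in cm and speed in rpm

def rcf(rpm, radius_cm):
    """Relative centrifugal force (multiples of g) at a given rotor speed."""
    return RCF_CONST * radius_cm * rpm ** 2

def rpm_for_rcf(target_rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a target RCF."""
    return math.sqrt(target_rcf / (RCF_CONST * radius_cm))

# Example with an assumed rotation radius of 8 cm
print(f"{rcf(4000, 8.0):.0f} x g at 4000 rpm")        # about 1431 x g
print(f"{rpm_for_rcf(2000, 8.0):.0f} rpm for 2000g")  # about 4729 rpm
```

With this, a protocol step such as "2000g for 15 min" can be translated into the rpm setting of whatever rotor is at hand.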

Differential centrifugation is used not only for the enrichment of particles but also for concentration. Thus, for example, the cells from one liter of bacterial cell culture can be pelleted by centrifugation for 15 min at 2000g and then resuspended in a smaller volume.

Zonal Centrifugation
If the sedimentation rates of molecules do not differ sufficiently, the viscosity and density of the medium can be used to generate selectivity. In zonal centrifugation a preformed shallow density gradient, mostly of sucrose, is used and the sample is layered on top of the gradient (see below). The particles, which at the beginning of the centrifugation – in contrast to differential centrifugation – are present in a narrow zone, are now separated according to their sedimentation velocity. In addition to minimizing convection, the density gradient also has the effect that, with increasing density and viscosity, those faster particles are slowed down that would otherwise sediment ever faster owing to the RCF increasing with distance from the rotor axis. This gives an approximately constant sedimentation rate of the particles. Zonal centrifugation, which is usually carried out at relatively low speeds with swinging-bucket or vertical

Figure 1.5 Density and sedimentation coefficients of some cell compartments. The figure shows the distribution of different cell components in terms of their density and their sedimentation coefficients. ER, endoplasmic reticulum.


rotors, is an incomplete sedimentation: the maximum density of the medium must not exceed the lowest density of the particles, and the centrifugation is stopped before the particles pellet.

Isopycnic Centrifugation
The previously discussed techniques of differential and zonal centrifugation are especially suitable for the separation of particles that differ in size. They are not well suited for particles of similar size but different densities. For these cases, isopycnic centrifugation (also known as sedimentation equilibrium centrifugation) is used. Here centrifugation is performed for long periods at high speed in a density gradient until equilibrium is reached. According to Stokes’ equation, particles remain floating when their density equals the density of the surrounding medium (v = 0). Particles in the upper part of the centrifuge tube sediment until they reach this floating state and cannot sediment further, because the layer below has a greater density. The particles in the lower region rise accordingly up to the equilibrium position. In this type of centrifugation, the gradient density must exceed the density of all particles to be centrifuged.

Density Gradient
To generate the density gradient, which can be continuous or discontinuous (stepped), various media are used that have proven appropriate for the different application areas:

 CsCl solutions can be prepared with densities up to 1.9 g ml⁻¹. They are of very low viscosity, but have the drawback of high ionic strength, which can dissociate some biological materials (chromatin, ribosomes). In addition, CsCl solutions have high osmolality, which makes them unsuitable for osmotically sensitive particles such as cells. CsCl gradients are particularly suitable for the separation of nucleic acids.
 Sucrose is often used for the separation of subcellular organelles in zonal centrifugation. The inexpensive and easy to prepare solutions are nonionic and relatively inert to biological materials. The low density of isotonic sucrose solutions (

[Table 2.1: substances interfering with the individual protein determination methods – for example detergents (>0.5% Triton X-100, >0.1% SDS, sodium deoxycholate); UV methods are disturbed by pigments, phenolic compounds, and organic cofactors]

2 Protein determination

Additional Methods, not Described in Detail
In addition to the methods described in this chapter, the titrimetric determination of nitrogen by the Kjeldahl method and the ninhydrin assay can be used for the quantification of proteins after acidic, thermal degradation of the proteins. Neither method is addressed in detail, as both require a great deal of effort, but they are discussed briefly here for the sake of completeness. In the Kjeldahl method, organic compounds are oxidized under defined conditions by heating with concentrated sulfuric acid and a catalyst (heavy metals, selenium); carbon and hydrogen are converted to CO2 and H2O, and an equivalent amount of NH3 is formed from the nitrogen. The NH3 obtained is bound by H2SO4 as (NH4)2SO4 (wet ashing). After the addition of NaOH, ammonia is released, transferred to a distillation apparatus, and quantified by titration. The nitrogen content of proteins is approximately 16%. Therefore, multiplying the determined N-content by 6.25 gives the amount of protein. Obviously, any non-protein nitrogen must have been removed beforehand.
The color assay with ninhydrin is used as a detection method for free amino groups. For this purpose, the protein must first be hydrolyzed into its free amino acids, for example by boiling in 6% sulfuric acid at 100 °C (12–15 h) in fused glass vessels in the absence of oxygen. The ninhydrin reagent is added to the protein hydrolyzate and the resulting purple–blue solution is measured spectrophotometrically at a wavelength of 570 nm. In most cases, L-leucine is used as a standard to generate the calibration curve. However, the color intensities resulting from the different amino acids of the protein are not identical. This is one of several sources of error in the ninhydrin method.
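The nitrogen-to-protein back-calculation at the end of the Kjeldahl procedure is a one-liner; the sketch below uses an illustrative nitrogen mass, not measured data:

```python
# Proteins contain ~16% nitrogen, so protein mass = N mass / 0.16 = N mass * 6.25
NITROGEN_TO_PROTEIN = 6.25

def protein_from_nitrogen(mg_nitrogen):
    """Protein mass (mg) from the titrated nitrogen mass (mg), Kjeldahl factor 6.25."""
    return mg_nitrogen * NITROGEN_TO_PROTEIN

print(protein_from_nitrogen(3.2))  # 3.2 mg N correspond to 20.0 mg protein
```

For proteins whose actual nitrogen content deviates appreciably from 16%, a protein-specific factor should replace 6.25.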


Ninhydrin Assay, Section 13.3.1

2.1 Quantitative Determination by Staining Tests

Protein samples often consist of a complex mixture of different proteins. The quantitative determination of the protein content of such crude protein solutions is usually based on the color reactions of functional groups of proteins with dye-forming reagents. The intensity of the dye correlates directly with the concentration of the reacting groups and can be accurately measured with a spectrophotometer. The basics of spectroscopy (Lambert–Beer law, etc.) and the appropriate equipment are described in detail in Chapter 7. Several variants of the four staining methods described below can be found in the literature; however, they are all based on the same principles.

Spectral Absorption Coefficients
Each staining method can only be used in a certain concentration range. Within this range, the absorbance measured (at a defined wavelength) depends linearly on the protein concentration. The spectral absorption coefficient is determined graphically as the slope in the plot of absorbance versus concentration (abscissa). By default, the absorbance value is related to the path length of the cuvette (in cm) and the concentration of the dissolved protein in micrograms per milliliter. Alternatively, with a known molecular weight of the protein, the concentration unit moles of dissolved protein per liter may be used. This yields a molar spectral absorption coefficient (formerly molar extinction coefficient) with the units liters per mole of dissolved protein per cm. The requirements of the staining methods presented (protein concentration ranges, sample volumes) and the approximate resulting spectral absorption coefficients (ml final volume per microgram of dissolved protein per cm), with bovine serum albumin as a standard, are shown in Table 2.2 as an overview.
Only approximate values are presented in the table because, even under apparently identical conditions, spectral absorption coefficients between 2.3 and 3.2 ml final volume per microgram of dissolved protein per cm can be found in the literature (e.g., solely for the biuret assay)! This is caused by the complexity of influencing factors, such as the purity of the chemicals and the water used.

Relative Deviations of the Staining Methods
Even in the ideal case, in which non-protein interference with the assays can be excluded, the methods presented yield values for one and the same protein that, apart from a few exceptions, deviate from each other by at least 5–20%. The difference is even more dramatic for the quantification of crude protein solutions. It is extremely important

Multiple determinations should be performed in all cases. Triplicate measurements are usually realized and the mean value calculated. The samples are generally measured at the same wavelength against a so-called blank approach, which consists of the same ingredients and volumes as the respective color assay but the protein solution is replaced by distilled water.
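The graphical determination of the spectral absorption coefficient described above (the slope of absorbance versus concentration) amounts to a least-squares fit through the calibration points. A sketch with invented calibration values, chosen to mimic the Lowry coefficient of Table 2.2 rather than taken from any measurement:

```python
# Least-squares slope of a calibration line (absorbance vs concentration).
# The data points are illustrative, not measured values.
concs = [10, 20, 30, 40, 50]                  # ug protein per ml final volume
absorbances = [0.17, 0.35, 0.50, 0.69, 0.85]  # e.g. A650, 1 cm path length

n = len(concs)
mean_x = sum(concs) / n
mean_y = sum(absorbances) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(concs, absorbances))
         / sum((x - mean_x) ** 2 for x in concs))
print(f"spectral absorption coefficient = {slope:.4f} ml per ug protein per cm")
```

An unknown sample's concentration then follows as absorbance divided by this slope, provided the reading lies within the linear range of the method.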

Physical Principles of Spectroscopy, Section 7.1

It is very important for all staining methods to specify what the volume (ml) stands for. Several solutions have to be combined in different volumes with the protein sample, depending on the method. The volume specified should always stand for the final volume of the test approach after performing the assay and not the volume of the protein solution used.


Table 2.2 Overview of the most common staining methods for protein determination.

Method                                        Approx. sample      Limit of detection   Spectral absorption coefficienta)                   Measured at
                                              volume required (ml) (μg protein ml⁻¹)   (ml final volume per μg dissolved protein per cm)
Biuret assay                                  1                    1–10                2.3 × 10⁻⁴                                          A550
Lowry assay (modified according to Hartree)   1                    0.1–1               1.7 × 10⁻²                                          A650
Bicinchoninic acid assay                      0.1                  0.1–1               1.5 × 10⁻²                                          A562
Bradford assay                                0.1                  0.05–0.5            4.0 × 10⁻²                                          A595

a) With the standard protein bovine serum albumin.

when reporting the specific activities of enzymes, antibodies, or lectins, expressed as biological activity per mg of protein, not only to state under which test conditions (e.g., substrate, pH, temperature) the activity was determined but also which method was used for the protein determination.

2.1.1 Biuret Assay

Figure 2.1 The colored protein Cu2+ complex that occurs in the biuret reaction.

This protein determination method is based on a color reaction of dissolved biuret (carbamylurea) and copper sulfate in an alkaline, aqueous environment (biuret reaction). The result is a red-violet complex between Cu2+ ions and two molecules of biuret. The reaction is typical of compounds with at least two CO–NH groups (peptide bonds) and can therefore be used for the colorimetric detection of peptides and proteins (Figure 2.1). If tyrosine residues are present, they also contribute significantly to dye formation by complexing copper ions. Thus, the detection depends mainly, and in a protein-independent manner, on the peptide bonds, and to a protein-dependent extent on the tyrosine residues. The spectral absorption coefficient given in Table 2.2 was determined at 550 nm. Alternatively, the color intensity can be measured at 540 nm. Both wavelengths are close to the absorption maximum of the color complex, which varies slightly from protein to protein. The biuret assay is the least sensitive of the color assays (Table 2.2). The protein sample or standard sample is mixed with four parts of biuret reagent and allowed to stand for 20 min at room temperature. Then the color intensity is measured directly in a spectrophotometer. Ammonium ions as well as weakly reducing and strongly oxidizing agents are especially disturbing (Table 2.1). However, small amounts of sodium dodecyl sulfate (SDS) or other detergents are tolerable. If the solution has to be diluted because of high absorbance, this must be done with the original sample solution and not with the final solution after color formation; the color reaction must then be repeated. This ensures that, owing to the concentration-dependent equilibrium, the amount of copper ions required for complete saturation of the complex-forming groups is present.
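With the approximate coefficient from Table 2.2 (2.3 × 10⁻⁴ ml per μg protein per cm, BSA standard), a measured A550 converts into a concentration via the Lambert–Beer relationship. This is a rough sketch only; a calibration curve recorded with the actual protein of interest remains preferable:

```python
EPS_BIURET = 2.3e-4  # approximate Table 2.2 coefficient (BSA standard), ml per ug per cm

def biuret_concentration(a550, path_cm=1.0):
    """Protein concentration (ug per ml of final assay volume) from a biuret A550 reading."""
    if a550 > 1.0:
        # outside the linear range: dilute the original sample and repeat the color reaction
        raise ValueError("A550 > 1.0 - dilute the sample solution and redo the assay")
    return a550 / (EPS_BIURET * path_cm)

print(f"{biuret_concentration(0.46):.0f} ug/ml")  # -> 2000 ug/ml
```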

2.1.2 Lowry Assay

Lowry et al. published a method for the quantitative analysis of proteins in 1951. The method is a combination of the biuret reaction with the Folin–Ciocalteu phenol reagent and is referred to as the Lowry assay. The copper–protein complex mentioned above is formed in alkaline solution. This supports the reduction of molybdate and tungstate, which are used in the form of their heteropoly acids (Folin–Ciocalteu phenol reagent), primarily by tyrosine and tryptophan and, to a lesser extent, by cysteine, cystine, and histidine residues of the protein. Presumably, Cu2+ of the copper–protein complex is reduced to Cu+, which subsequently reacts with the Folin–Ciocalteu phenol reagent. Owing to the additional color reaction, the sensitivity increases enormously compared with the pure biuret assay. The resulting deep blue color is measured at a wavelength of 750, 650, or 540 nm.


Various modifications of the Lowry assay are described in the literature. Most aim to reduce the relatively high susceptibility of the Lowry method to interference. The data presented in Tables 2.1 and 2.2 were obtained using the version published by Hartree (1972). At the same sensitivity, the modified method extends the linear range by 30–40% compared with the conventional Lowry assay, to about 0.1–1.0 mg ml⁻¹ (Table 2.1). The method shows no problems with precipitating salts and uses only three stock solutions, which also have better storage stability, instead of the five stock solutions of the original Lowry assay. In this variant, three reagents (parts A : B : C = 0.9 : 0.1 : 3.0) are added successively to one part of protein sample (1.0): A (carbonate/NaOH solution), B (alkaline CuSO4 solution), and C (diluted Folin–Ciocalteu reagent). After the addition of A and again after the addition of C, the mixture is heated to 50 °C for 10 min. Overall, the Lowry assay according to Hartree takes about 30 min. Any necessary dilutions must, as already explained for the biuret assay, be performed with the protein solution. The Lowry method is affected by a wide range of non-protein substances (Table 2.1). In particular, the usual additives for enzyme purification, such as EDTA, ammonium sulfate, or Triton X-100, are not compatible with the Lowry assay. Compared with the biuret assay, protein-specific factors contribute more strongly to dye formation – in particular the individual content, depending on the protein, of tyrosine, tryptophan, cysteine, cystine, and histidine. Moreover, the staining is relatively unstable: the samples should be measured within 60 min of the last reaction step.

2.1.3 Bicinchoninic Acid Assay (BCA Assay)

Smith and colleagues published a highly regarded alternative to the Lowry assay in 1985 that combines the biuret reaction with bicinchoninic acid (BCA) as the detection system. Hitherto, BCA had been used for the detection of other copper-reducing compounds, such as glucose or uric acid. Twenty parts of a freshly prepared bicinchoninic acid/copper sulfate solution are added to one part of sample and incubated for 30 min at 37 °C. Similar to the Lowry assay, the method is based on the reduction of Cu2+ to Cu+. BCA forms a color complex specifically with Cu+ (Figure 2.2). This allows a sensitive, colorimetric detection of proteins at a wavelength of 562 nm (the absorption maximum of the complex). Comparisons with the Lowry assay showed that cysteine, cystine, tyrosine, tryptophan, and the peptide bond reduce Cu2+ to Cu+ and therefore enable color formation with BCA. The intensity of the color formation and the redox behavior of the groups involved depend on, among other things, the temperature. Thus, the BCA assay can be run at different temperatures to obtain the desired sensitivity. The BCA and Lowry assays are in good agreement for the determination of the concentrations of standard proteins, such as bovine serum albumin, chymotrypsin, or immunoglobulin G. Significant deviations of almost 100%, however, were determined with avidin, a glycoprotein from chicken egg-white. The mechanism of the BCA assay is thus similar in principle to that of the Lowry assay, but the two are by no means equivalent. The advantages of the BCA assay over the Lowry assay are its simpler implementation, the ability to tune the sensitivity, and the good stability over time of the color complex formed. The disadvantage is the higher price of the assay, due to the high price of the sodium salt of bicinchoninic acid. The sensitivity of the BCA assay is in the range of the Lowry assay as modified by Hartree (Table 2.2). The breakdown

Figure 2.2 The bicinchoninic acid assay: a combination of the biuret reaction with the selective bicinchoninic acid complexation of Cu+.


Figure 2.3 Coomassie Brilliant Blue G250 (as sulfonate), the reagent of the Bradford assay.

susceptibility of the BCA assay is also quite high. In addition to the substances listed in Table 2.1, further compounds interfere with the assay, such as small amounts of ascorbic acid, dithiothreitol, or glutathione, as well as other complexing and reducing agents.

2.1.4 Bradford Assay

Protein Detection in Electrophoresis Gel, Section 11.3.3

In contrast to the staining methods described so far, no copper ions are involved in this assay. It is named after M.M. Bradford and was published in 1976. It is based on blue acid dyes known as Coomassie Brilliant Blue; in many cases, Coomassie Brilliant Blue G 250 (Figure 2.3) is used. The absorption maximum of Coomassie Brilliant Blue G 250 shifts from 465 to 595 nm in the presence of proteins in an acidic environment. The reason for this is probably the stabilization of the dye in its unprotonated, anionic sulfonate form by complex formation between dye and protein. The dye binds fairly nonspecifically to cationic and nonpolar, hydrophobic side chains of proteins. The interactions with arginine are most important, and less so those with lysine, histidine, tryptophan, tyrosine, and phenylalanine. The Bradford assay is also used for staining proteins in electrophoresis gels. It is approximately a factor of two more sensitive than the Lowry or BCA assay (Table 2.2) and is thus the most sensitive quantitative staining assay. It is also the simplest assay: the stock solution, consisting of dye, ethanol, and phosphoric acid, is added to the sample solution in a ratio of between 20 : 1 and 50 : 1 and, after 10 min at room temperature, the absorbance can be measured at 595 nm. Another advantage is that several substances that interfere with the Lowry or BCA assay do not affect the result of the Bradford assay (Table 2.1); this especially concerns the tolerance to reducing agents! On the other hand, all substances that affect the absorption maximum of Coomassie Brilliant Blue are disturbing. This is sometimes difficult to estimate beforehand, owing to the lack of specificity of the interactions. The biggest disadvantage of the Bradford assay is that equal amounts of different standard proteins can yield significantly different absorption coefficients. Thus, the protein-to-protein variability of this color assay is considerable and greater than that of the three other, more complex staining methods.

2.2 Spectroscopic Methods

Spectroscopic Bases and Measurement Techniques, Section 7.1

Spectroscopic methods are less sensitive and require higher concentrations of protein than colorimetric methods. They should therefore be used with pure or highly purified protein solutions. The spectral absorption or emission properties of the proteins are measured at a defined wavelength in an optical path. For this purpose, the protein solution (sample solution) is placed in a quartz cuvette (optical path length in the cell: usually 1 cm). The spectrophotometer is first zeroed with pure, protein-free solvent in the same quartz cuvette as a reference. Subsequently, the value measured for the sample solution is converted, either based on literature tables or on a calibration curve, into the corresponding protein concentration in mg ml⁻¹. The latter


Table 2.3 Overview of the most common spectroscopic protein determination methods.

Method                                               Protein component on which the      Limit of detection   Dependence on          Susceptibility
                                                     determination is essentially based  (μg protein ml⁻¹)    protein composition    to interference
Photometry: A280                                     Tryptophan, tyrosine                20–3000              Strong                 Low
Photometry: A205                                     Peptide bonds                       1–100                Weak                   High
Fluorimetry: excitation 280 nm, emission 320–350 nm  Tryptophan (tyrosine)               5–50                 Strong                 Low
is recommended because of the interference effects of buffer substances, pH values, instrument inaccuracies, and so on. Ideally, a calibration curve is obtained with the pure protein of interest as a standard. An overview of the spectroscopic methods discussed in detail below is given in Table 2.3.

2.2.1 Measurements in the UV Range

Absorption Measurement at 280 nm (A280)
Warburg and Christian measured the protein concentration of cell extract solutions of different purification degrees at a wavelength of 280 nm (A280) in the early 1940s. The aromatic amino acids tryptophan and tyrosine, and to a lesser extent phenylalanine, absorb at this wavelength (Table 2.4). Since larger amounts of nucleic acids and nucleotides are present in protein solutions – this is generally the case after disruption of cells – the values measured at 280 nm had to be corrected, because the nucleic acid bases also absorb at this wavelength. Warburg and Christian therefore determined a second value at 260 nm (A260), which was correlated with A280 according to the following formula:

Protein concentration (mg ml⁻¹) = 1.55·A280 − 0.76·A260   (2.1)

This relationship can be used up to 20% (w/v) nucleic acids in solution or an A280/A260 ratio of less than 0.6. The A280 measurement alone is sufficient for protein solutions with only a low content of nucleic acids. In accordance with the molar spectral absorption coefficients (ε) (Table 2.4), the A280 method is based essentially on tryptophan, which has an absorption maximum at 279 nm. The two other aromatic amino acids contribute relatively little to the A280 value. Since the content of aromatic amino acids can vary from protein to protein, the corresponding A280 values also vary. At a concentration of 10 mg ml⁻¹ (A1%), the A280 value of most proteins is between 0.4 and 1.5. However, there are extreme exceptions, where A1% is 0.0 (parvalbumin) or 2.65 (lysozyme). An ideal standard protein should have the same content of aromatic amino acids as the protein measured, or should be identical with it. Unfortunately, this is extremely rarely realizable in practice. The A280 method can be used for protein concentrations between 20 and 3000 μg ml⁻¹. It is an easy and fast method and is far less disturbed by parallel absorption of non-protein substances

Table 2.4 Molar spectral absorption coefficients (ε) at 280 nm and absorption maxima of the aromatic amino acids.a)

Amino acid       ε × 10⁻³ (l mol⁻¹ cm⁻¹) at 280 nm   Absorption maxima (nm)
Tryptophan       5.559                                219, 279
Tyrosine         1.197                                193, 222, 275
Phenylalanine    0.0007                               188, 206, 257

a) For aqueous solutions at pH 7.1.

The absorbance value determined (sample or standard) should not exceed 1.0. At a value greater than 1.0, the linear dependency of spectral absorption on the concentration is no longer given. The emission value for fluorescence measurements should not exceed 0.5. If necessary, the sample solution must be diluted and the dilution factor has to be taken into consideration when determining the concentration.
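Equation 2.1, together with the dilution rule from the margin note, can be wrapped into a small helper; the absorbance values in the example are invented for illustration:

```python
def protein_conc(a280, a260, dilution_factor=1.0):
    """Protein concentration (mg/ml) by Warburg and Christian (Eq. 2.1)."""
    if a280 > 1.0 or a260 > 1.0:
        # above 1.0 the linear dependence on concentration no longer holds
        raise ValueError("absorbance exceeds 1.0: dilute the sample and remeasure")
    return (1.55 * a280 - 0.76 * a260) * dilution_factor

print(f"{protein_conc(0.75, 0.40):.4f} mg/ml")                     # 0.8585 mg/ml
print(f"{protein_conc(0.50, 0.20, dilution_factor=10):.2f} mg/ml")  # sample diluted 1:10
```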

Table 2.5 Maximum concentrations of disturbing additives allowed by the A205 and A280 methods. The additives are often used in protein chemistry.a)

Additive                                   A205 method    A280 method
Ammonium sulfate                           9% (w/v)       >50% (w/v)
Brij 35                                    1% (v/v)       1% (v/v)
Dithiothreitol (DTT)                       0.1 mM         3 mM
Ethylenediaminetetraacetic acid (EDTA)     0.2 mM         30 mM
Glycerol                                   5% (v/v)       40% (v/v)
Urea                                       1 M
KCl                                        50 mM          100 mM
2-Mercaptoethanol                          1 M
NaOH                                       25 mM          >1 M
Phosphate buffer                           50 mM          1 M
Sucrose                                    0.5 M          2 M
SDS (sodium dodecyl sulfate)               0.1% (w/v)     0.1% (w/v)
Trichloroacetic acid (TCA)

N↓, the vector points along the +z-axis. The x- and y-components are uniformly distributed on the surface of the double precession cone and do not produce a macroscopic magnetization.

N↑/N↓ = exp(ΔE/kT)   (18.6)

where k is the Boltzmann constant and T is the absolute temperature. Because the energy difference between both levels is orders of magnitude smaller than the thermal energy (kT), both levels are almost equally occupied. For example, for protons at 300 K and a magnetic field of 18.8 T (800 MHz) the excess population in the lower energy level amounts to only 1.3 in 10 000 particles (i.e., N↓ = 0.99987 × N↑). This is the main reason why NMR spectroscopy is so insensitive compared with other spectroscopic methods. Even this tiny difference suffices to give rise to a macroscopic bulk magnetization M0, which results from the combined magnetic moments of the individual nuclear spins. The macroscopic magnetization of spin ½ particles is then given by:

M0 = N γ²ħ²B0 I(I + 1)/(3kT) = N γ²ħ²B0/(4kT)   for I = ½   (18.7)
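The "excess of 1.3 in 10 000" quoted above follows directly from the Boltzmann factor exp(−ΔE/kT) with ΔE = hν. A quick numerical check for protons at 800 MHz and 300 K:

```python
import math

h  = 6.62607015e-34   # Planck constant (J s)
kB = 1.380649e-23     # Boltzmann constant (J/K)

nu = 800e6   # proton Larmor frequency at 18.8 T (Hz)
T  = 300.0   # sample temperature (K)

ratio = math.exp(-h * nu / (kB * T))   # population ratio N_down / N_up
excess = 1e4 * (1.0 - ratio)           # excess spins per 10 000 in the lower level
print(f"N_down/N_up = {ratio:.5f}")              # -> 0.99987
print(f"excess per 10 000 spins: {excess:.1f}")  # -> 1.3
```

Because the exponent scales with B0, the observable signal grows with the field strength, consistent with Eq. (18.7) and the ongoing development of stronger magnets.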

This shows that the magnitude of the equilibrium magnetization depends on the magnetic field strength B0, the number of spins N, and the temperature T of the sample. Varying each of these parameters can enhance the observable signal (which is the reason for the ongoing development of magnets with increasing field strengths). Importantly, the spectrometer records the time development of exactly this macroscopic bulk magnetization. Moreover, the classical theory of NMR spectroscopy normally considers this macroscopic magnetization rather than individual spins, because its behavior is more descriptive. In equilibrium, magnetization M0 exists only along the axis of the main field (by convention the z-direction, i.e., Mz = M0). The transverse x- and y-components of the magnetic moments are uniformly distributed and add up to zero (Mx = My = 0) (Figure 18.3).

The Bloch Equations
According to the Bloch equation, the change of the magnetization vector M over time results from the interaction of the magnetization with an effective magnetic field Beff:

dM/dt = γ (M × Beff),   Beff = (B0 + ω0/γ) + B1 = B1   (18.8)

where the term B0 + ω0/γ vanishes in the rotating frame introduced below.

To simplify the mathematical treatment of the magnetization vector, we introduce a rotating coordinate system that precesses about the z-axis with the Larmor frequency of the nuclei (ω0 = γB0). In this coordinate system all nuclear spins rotating with the Larmor frequency appear stationary. This concept should be familiar to us, because we all live in a rotating coordinate system – the Earth. A person standing on the equator moves, for an observer in space, at a speed of approximately 1700 km h⁻¹. Suppose this person throws a ball straight up into the air and watches it fall down again: for him or her the ball moves on a simple straight vertical path, whereas for our observer in space the ball moves on a complicated trajectory. In the rotating coordinate system the contribution of the main magnetic field B0 to Beff cancels for nuclei with Larmor frequency ω0. Moreover, if only the main magnetic field B0 is applied, then Beff is zero and the magnetization vector becomes time-independent. However, if an additional field B1 is applied perpendicular to the main magnetic field B0, then Beff = B1. The magnetization vector will now precess around the axis of the B1 field if its frequency matches the Larmor frequency of the nuclei (ωrf = ω0, the resonance condition). Thus, the B1 field induces a rotation of the magnetization vector Mz from the equilibrium position into the transverse plane (cross product). Physically, the B1 field is nothing other than a

18 Magnetic Resonance Spectroscopy of Biomolecules


short radio frequency pulse. For example, a so-called excitation pulse or 90°-pulse completely converts z- into y-magnetization if the B1 field along the x-axis is of appropriate strength and duration (Figure 18.4). How can we imagine the origin of the y-magnetization for individual spins after a 90°-pulse? (i) Both energy levels are equally populated because Mz = 0. (ii) The magnetic dipoles of the individual spins are no longer uniformly distributed around the z-axis; instead, a small fraction precesses bundled about the z-axis. It is this state of phase coherence that gives rise to the macroscopic y-magnetization (Figure 18.5).

Relaxation The Bloch equation shown above is incomplete because it predicts an infinitely long precession of the magnetization vector once the sample has been excited. However, the transverse magnetization after the 90°-pulse corresponds to a non-equilibrium state from which the system returns to its thermodynamic equilibrium after a short while. Therefore, Bloch defined two different relaxation time constants, denoted the longitudinal (T1) and transverse (T2) relaxation time constants. Assuming that the respective relaxation processes follow first-order rate equations, the Bloch equations are modified to:

dMx,y/dt = γ (M × B)x,y − Mx,y/T2     (18.9)

dMz/dt = γ (M × B)z + (M0 − Mz)/T1     (18.10)

Figure 18.4 Effect of a 90°-pulse on z-magnetization. A pulse about the x-axis (bold wavy arrow) rotates the equilibrium z-magnetization (grey arrow) by 90° counterclockwise around the x-axis and creates −y-magnetization (grey arrow).
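On resonance in the rotating frame, the modified Bloch equations reduce to simple exponential relaxation; a short sketch with assumed time constants (T1 = 1 s, T2 = 0.1 s, typical orders of magnitude for protons in solution):

```python
import math

def mz(t, M0=1.0, T1=1.0):
    # longitudinal recovery after a 90-degree pulse (Mz(0) = 0), from Eq. (18.10)
    return M0 * (1.0 - math.exp(-t / T1))

def mxy(t, Mxy0=1.0, T2=0.1):
    # transverse decay, from Eq. (18.9) on resonance
    return Mxy0 * math.exp(-t / T2)

# with the assumed constants, Mz has essentially recovered after 5*T1,
# while Mxy has decayed to below 1% after 5*T2
print(round(mz(5.0), 3), round(mxy(0.5), 3))
```

This asymmetry (fast loss of transverse magnetization, slower recovery of longitudinal magnetization) is exactly why a relaxation delay is needed between repeated scans.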

The equations indicate that due to relaxation the transverse components (Mx, My) decay to their equilibrium value of zero, whereas the longitudinal magnetization Mz approaches its equilibrium value M0. In high-resolution NMR spectroscopy, the T1 relaxation time constants for protons are in the range of one to several seconds. Usually, T2 is similar to T1 (small molecules); for large molecules like proteins, however, T2 is much smaller than T1. This fact underlies the well-known size limitation of NMR spectroscopy for proteins, because the enhanced T2 relaxation in large proteins decreases the resolution and sensitivity of their NMR spectra (Section 18.1.2, Spectral Parameters). Notably, relaxation effects can often be ignored during the radiofrequency pulses, because relaxation times are long compared to commonly used pulse lengths (10–50 μs). Relaxation is caused by different time-dependent interactions (e.g., dipolar couplings) between the spins and their environment (T1) and between the spins themselves (T2). Historically, T1 is also referred to as the spin–lattice and T2 as the spin–spin relaxation time constant. The relaxation time constants depend on several factors, including the Larmor frequency (or magnetic field strength) and the molecular mobility of the molecule in solution. The latter is characterized by the rotational correlation time, τc. We will discuss the measurement of relaxation times in more detail later during the analysis of protein dynamics (Section 18.1.7, Determination of Protein Dynamics).

Pulsed Fourier Transformation Spectroscopy Modern NMR spectrometers operate with a technique called pulsed Fourier transformation NMR (FT-NMR) spectroscopy. This technique replaced older NMR methods (e.g., continuous-wave NMR spectroscopy) because it greatly improved the sensitivity and resolution, and also facilitated the development of multidimensional NMR methods (Sections 18.1.3 and 18.1.4).
In pulsed FT-NMR spectroscopy all nuclei are excited simultaneously through a radio frequency pulse. The radio transmitter works at a fixed frequency ν0 and would therefore excite only nuclei with this Larmor frequency (resonance condition!). However, the pulse duration is inversely proportional to the frequency bandwidth (and thus the energy of the radiation). If therefore a very short pulse (a few microseconds) is emitted, then the pulse is less “frequency selective.” It contains a broad excitation band around ν0 and excites the Larmor frequencies of all nuclear spins in the sample at once. Strictly speaking, the flip angle through which the pulse rotates the bulk magnetization depends on the offset (or distance) of the Larmor frequencies from the transmitter frequency. For nuclei that are off-resonance, the effective field Beff is not collinear with the B1 field as in the resonant case. As a result, the flip angle for off-resonance nuclei decreases with increasing offset. However, the projection of the transverse magnetization on the y-axis depends on the sine

Figure 18.5 Illustration of transverse magnetization. The identical number of nuclear spins (grey arrows) in both energy levels shows that they are equally populated. Some nuclear spins precess bundled (or in phase) about the direction of the B0 field. Their magnetic moments add up to the macroscopic My-magnetization (bold arrow).


Part II: 3D Structure Determination

of the flip angle. For example, even for an 80° flip angle for far off-resonance nuclei the projection is 98.5% of that of a 90°-pulse – a more than acceptable result for NMR spectroscopy. After this excitation pulse, the different nuclei precess with their different Larmor frequencies about the z-axis. According to Maxwell's equations, a rotating magnetic moment creates a changing magnetic field that induces a current in a wire coil. In the NMR spectrometer, a sensitive receiver coil records this small oscillating current. Through T2 relaxation the induced current decays over time, which is why the recorded data is called the free induction decay (FID). Because the current is detected in a time-dependent manner (and not frequency-selectively), the acquired signal is a superposition of all frequencies that have been excited by the pulse. The mathematical operation of Fourier transformation converts this time-domain signal into the frequency domain, the spectrum. In analogy to other spectroscopic methods one could imagine the resonance phenomenon differently: in the resonant case nuclei absorb the radio radiation if the frequency of the radio pulse matches their Larmor frequency. After the pulse all excited nuclear spins simultaneously emit the absorbed radio radiation, which is then detected. Therefore, the pulsed Fourier transformation method is often compared to the tuning of a bell. In principle, one could determine the individual tones that make up the sound of the bell in the fashion of a “continuous-wave experiment”: the bell is sequentially excited with all sonic frequencies through a loudspeaker, from the lowest tones to the limit of ultrasound, and the reaction of the bell is measured with a microphone. This procedure is extremely cumbersome, and every bell founder knows that it can be done quicker; one simply takes a hammer and strikes! The sound of the bell contains all tones at once, and every human being can analyze the sound with his or her ears (an ingeniously “constructed” biological tool for Fourier transformation). Note, however, that no frequency-dependent detection occurs on modern NMR spectrometers.
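The conversion of a decaying time-domain FID into a frequency-domain spectrum can be sketched in a few lines (the two resonance frequencies, the relaxation time, and the sampling rate are toy values; a real spectrometer uses an FFT and quadrature detection rather than this naive transform):

```python
import cmath
import math

# simulate a free induction decay: two toy resonances with T2 decay
sr = 500.0                     # sampling rate in Hz (assumed value)
n = 512                        # number of acquired points
freqs = [50.0, 120.0]          # offsets of the two toy resonances in Hz
T2 = 0.2                       # transverse relaxation time in s (assumed)
fid = [sum(math.cos(2 * math.pi * f * k / sr) for f in freqs)
       * math.exp(-k / sr / T2) for k in range(n)]

# naive discrete Fourier transform; magnitude spectrum up to Nyquist
spectrum = [abs(sum(fid[k] * cmath.exp(-2j * math.pi * m * k / n)
                    for k in range(n))) for m in range(n // 2)]

# the two largest peaks sit at the bins closest to the input frequencies
peaks = sorted(range(len(spectrum)), key=spectrum.__getitem__)[-2:]
print(sorted(round(p * sr / n) for p in peaks))
```

Both frequencies are recovered from a single "strike", exactly as in the bell analogy; the decay constant T2 sets how broad the recovered lines are.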

18.1.2 One-Dimensional NMR Spectroscopy The 1D Experiment With these basic theoretical principles we are able to understand the simplest variant of NMR spectroscopy – the one-dimensional (1D) experiment (Figure 18.6). Each 1D NMR experiment consists of two parts: preparation and detection. During the preparation the spin system is brought to a defined state; during the detection the response to the preparation is recorded. Figure 18.6 Schematic illustration of the pulse sequence for a one-dimensional NMR experiment. A 1D experiment consists of two parts, the preparation and the detection. In the simplest case, the preparation consists of a single 90° pulse (black bar). Subsequently, the response of the spin system (FID) to this pulse is recorded during the detection period.

Preparation of the spin system consists in the simplest case of a short, hard excitation pulse (ca. 10 μs) that creates transverse magnetization (compare Figure 18.4). The resulting FID is recorded and saved during the detection period. After a short waiting time (the relaxation delay) that allows the magnetization to return to its equilibrium value through T1 relaxation, the experiment can be repeated multiple times. The individual data of the measurement are then added together, increasing the signal-to-noise ratio. Multiplication of the FID data with window functions can enhance either the sensitivity or the resolution of the spectrum. Additionally, this operation suppresses artifacts that arise from the subsequent Fourier transformation, which converts the FID (time-domain) into the NMR spectrum (frequency domain). Spectral Parameters We will discuss the different NMR spectral parameters (chemical shift, scalar couplings and line width) on the basis of the simple 1D NMR spectrum of ethanol


(Figure 18.7). This spectrum contains three signals (or peaks) originating from the methyl (CH3) protons, the methylene (CH2) protons, and the hydroxyl (OH) proton. Because the protons of the methyl group and of the methylene group, respectively, are each equivalent to each other, they each give rise to only one peak. The two peaks appear as so-called multiplets because their signals are split into several lines by scalar coupling. The integral over the respective multiplets yields the number of protons that give rise to these signals. For ethanol one obtains a ratio of the integrals of 3 : 2 : 1 corresponding to the number of protons that contribute to the respective signals. Chemical Shift In a molecule the electrons surrounding the nucleus create a weak magnetic field and shield the nucleus slightly from the main field. This shielding depends on the specific chemical environment (i.e., the structure of the molecule) and influences the Larmor frequencies of the nuclei. The effect is called the chemical shift and is one of the fundamental parameters in NMR spectroscopy, because it determines the distinct positions of individual signals in the NMR spectrum. The chemical shift δ of a signal in ppm (parts per million) is defined as:

δ = (ωsignal − ωreference) / ωreference × 10^6 ppm     (18.11)
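Equation (18.11) amounts to a simple unit conversion; a sketch with hypothetical frequencies illustrating that the ppm value, unlike the offset in Hertz, is independent of the field strength:

```python
def chemical_shift_ppm(f_signal_hz, f_reference_hz):
    # Eq. (18.11): chemical shift in ppm relative to a reference signal
    return (f_signal_hz - f_reference_hz) / f_reference_hz * 1e6

# hypothetical example: a signal 1300 Hz downfield of the TMS reference on a
# 500 MHz spectrometer, and the same signal (now 1560 Hz away) at 600 MHz
print(round(chemical_shift_ppm(500.0e6 + 1300.0, 500.0e6), 3))  # -> 2.6
print(round(chemical_shift_ppm(600.0e6 + 1560.0, 600.0e6), 3))  # -> 2.6
```

The same resonance thus appears at the same ppm value on both instruments, which is why chemical shifts are tabulated in ppm rather than Hertz.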

The frequencies are given in ppm instead of Hertz because the former unit is independent of the magnetic field strength. The common reference frequency (ωreference) on which the chemical shift is based is the signal of the methyl groups of tetramethylsilane (TMS). By definition, it has a chemical shift of 0 ppm. For aqueous solutions of proteins and nucleic acids the preferred standards are the methyl signals of 2,2-dimethyl-2-silapentane-5-sulfonic acid (DSS) or trimethylsilylpropanoic acid (TSP). By convention, the chemical shift is plotted on the x-axis of an NMR spectrum from right to left. One often encounters expressions like “a signal appears at high field” (i.e., at low ppm values) or “downfield shift” (i.e., shift towards higher ppm values). These expressions originate from the time when NMR spectra were acquired at constant transmitter frequency through variation of the magnetic field (continuous wave technique). The position of a peak in the NMR spectrum provides substantial information about the origin of the respective signal. Many chemical or functional groups display specific chemical shifts (Figure 18.8); for example, the chemical shift of the hydroxyl group of ethanol differs from that of the methyl group (Figure 18.7). For proteins, the chemical shift alone suffices to distinguish between the signals from the HN, Hα, aromatic, and aliphatic protons (Figure 18.12 below). The chemical shift further contains information about the secondary and tertiary structure of a protein, which is very valuable in different stages of structure determination (Section 18.1.6, Determination of the Secondary Structure).

Scalar Coupling In the 1D spectrum of ethanol the signals of the methylene and methyl protons (Figure 18.7) appear as multiplets. This line splitting results from the scalar coupling (or indirect coupling) between the protons, and is mediated through the electrons in the atomic bonds connecting the nuclei. Besides the nuclear Overhauser effect (NOE), scalar coupling is the most important mechanism in multidimensional NMR spectroscopy (Section 18.1.3) by which magnetization is transferred between nuclei. Line splitting arises due to the different orientations of a spin ½ to the external magnetic field. Each of the two methylene protons can adopt either a parallel or antiparallel orientation, which corresponds to two different magnetic quantum numbers m. The protons of the methyl group, which are scalar coupled to the methylene protons, therefore “experience” four possible combinations (↑↑, ↑↓, ↓↑, and ↓↓). The orientations ↑↑ and ↓↓ marginally enhance or attenuate, respectively, the external magnetic field and thus shift the resonance frequency of the methyl protons (Figure 18.9). This gives rise to two lines lying symmetrically left and right of the actual resonance frequency. The orientations ↑↓ and ↓↑ are equivalent and compensate their respective enhancements or attenuations of the main field; thus, an unshifted line with twice the intensity results. The generated splitting pattern for the methyl group is called a triplet.

Figure 18.7 One-dimensional 1H NMR spectrum of ethanol. The signal of the methylene group is split into four lines with an intensity ratio of 1 : 3 : 3 : 1 (quartet); that of the methyl group into three lines with a ratio of 3 : 6 : 3. According to the number of hydrogen atoms, one obtains an intensity ratio (= integrals of the signals) of 2 : 3 for the methylene and methyl signals. The hydroxyl proton rapidly exchanges with hydroxyl protons from other ethanol molecules. Therefore, its signal at 2.6 ppm is much broader relative to the other protons. For the same reason, no coupling occurs between the hydroxyl proton and the methylene protons. Hence, neither the hydroxyl signal is split nor does the hydroxyl proton contribute to an additional splitting of the methylene protons.

Figure 18.8 (a) Typical proton chemical shift ranges of different chemical groups. Source: adapted from Bruker Almanac, 1993. (b) Typical proton chemical shift ranges of individual amino acids in proteins. Source: adapted from Wishart, D.S., Sykes, B.D., and Richards, F.M. (1991) J. Mol. Biol., 222, 311–333. (With permission Copyright  1991 Published by Elsevier Ltd.)


If two nuclei with spin quantum numbers I and S couple with each other, then the resonance of I splits into 2S + 1 lines and the resonance of S into 2I + 1 lines. If the coupling partners of S consist of several identical I-nuclei, then the resonance S splits into 2nI + 1 lines, where n is the number of identical coupling partners (and vice versa).
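The 2nI + 1 multiplicity rule can be turned into a small helper; for spin-½ coupling partners the line intensities follow Pascal's triangle, as discussed for ethanol (a sketch valid for weakly coupled spin systems only):

```python
from math import comb

def multiplet(n_equivalent, I=0.5):
    # number of lines when coupling to n identical spin-I nuclei: 2nI + 1
    lines = int(2 * n_equivalent * I + 1)
    # for spin-1/2 partners the relative intensities are binomial coefficients
    intensities = ([comb(n_equivalent, k) for k in range(lines)]
                   if I == 0.5 else None)
    return lines, intensities

print(multiplet(2))  # CH3 of ethanol, coupled to 2 methylene protons -> (3, [1, 2, 1])
print(multiplet(3))  # CH2 of ethanol, coupled to 3 methyl protons -> (4, [1, 3, 3, 1])
```

This reproduces the triplet (1 : 2 : 1) and quartet (1 : 3 : 3 : 1) patterns of the ethanol spectrum.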

Figure 18.9 Origin of the triplet: In an AX2 spin system the coupling with two identical nuclei X causes the resonance of nucleus A to split into a triplet. Each of the two X-nuclei can orient itself either parallel or antiparallel to the external magnetic field giving rise to four possible orientations. A parallel orientation enhances, while an antiparallel orientation attenuates, the external magnetic field. Therefore, the line associated with the ↑↑ orientation shifts upfield and the respective line for the ↓↓ orientation shifts downfield. The orientations ↑↓ and ↓↑ are indistinguishable and the respective lines appear at the actual resonance frequency. The single lines of the triplet have an intensity ratio of 1 : 2 : 1.

Figure 18.10 Typical values of different coupling constants (in Hz) in the protein backbone. Only coupling constants larger than 5 Hz are considered. Direct CC- and CN-couplings are shown in black; direct CH- and NH-couplings in grey. Black or grey dashed lines highlight indirect CC- and CN-couplings, or indirect CH- and NH-couplings, respectively. Source: adapted from Bystrov, V.F. (1976) Prog. NMR Spectrosc., 10, 41–81. (With permission Copyright © 1976 Published by Elsevier B.V.)

The number of individual spin combinations determines the intensity of these lines and follows a binomial distribution that can be illustrated in a Pascal's triangle. For ethanol, two methylene protons (two I-nuclei) couple with three methyl protons (three S-nuclei). Thus, the signal of the methyl group splits into 2 × 2 × ½ + 1 = 3 lines (triplet), and that of the methylene group into 2 × 3 × ½ + 1 = 4 lines (quartet). The lines of the triplet have intensity ratios of 1 : 2 : 1; the lines of the quartet of 1 : 3 : 3 : 1. The separation of the lines (in Hz) in a multiplet corresponds to the coupling constant J, which is independent of the applied magnetic field strength. In general, only couplings over one bond (1J), two bonds (2J), and three bonds (3J, or vicinal coupling) are observed (Figure 18.10). An important aspect of vicinal coupling constants is that their magnitude depends on the torsion angle between the two protons. The semiempirical Karplus relationship describes this dependence (Figure 18.11):

J(ϕ) = A cos²(ϕ − 60°) − B cos(ϕ − 60°) + C     (18.12)

in which A, B, and C are empirically determined constants that are different for each type of torsion angle (e.g., the ϕ, ψ, and χ angles in proteins). For example, protein structure determination exploits the information about the molecular geometry contained within the 3J(HN–Hα) coupling constant to restrain the torsion angle ϕ (HN-N-Cα-Hα).

Line Width The line widths of NMR peaks provide direct evidence about the lifetimes of the respective resonances. The longer the lifetime of a resonance, the narrower the line width of its peak (and vice versa). The resonance lifetimes are mainly determined by T2 relaxation and chemical exchange. As mentioned earlier, the short T2 relaxation time constants of large molecules (proteins > 50 kDa) cause broad line widths and thus result in peaks with low intensity. Similarly, chemical exchange during de- and reprotonation reactions can reduce the lifetime of a proton. For example, the hydroxyl proton of ethanol (Figure 18.7) exchanges with protons of the solvent (in this case other ethanol molecules) and thus possesses a broader line.
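The Karplus relationship (18.12) is easy to evaluate numerically. In this sketch the constants A, B, and C are commonly quoted literature values for the 3J(HN–Hα) coupling, not values taken from this chapter:

```python
import math

def karplus_3j_hnha(phi_deg, A=6.4, B=1.4, C=1.9):
    # Eq. (18.12): 3J(HN-Ha) as a function of the backbone torsion angle phi;
    # the default A, B, C are frequently cited empirical values (assumed here)
    theta = math.radians(phi_deg - 60.0)
    return A * math.cos(theta) ** 2 - B * math.cos(theta) + C

# an alpha-helical phi of about -60 deg gives a small coupling,
# a beta-strand phi of about -120 deg a large one:
print(round(karplus_3j_hnha(-60.0), 1), round(karplus_3j_hnha(-120.0), 1))
```

The resulting values (roughly 4 Hz for helical and close to 10 Hz for extended conformations) illustrate why 3J(HN–Hα) is such a useful secondary-structure indicator; note, however, that the cos² dependence means several angles can produce the same coupling.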

Figure 18.11 (a) Relationship between the 3J(HN–Hα) coupling constant and the ϕ angle (Karplus relationship). The torsion angle between COi and COi−1 (b) is plotted against the coupling constant 3J(HN–Hα). The plot shows that at least two angles correspond to a given coupling constant. The index i in the Newman projection (b) denotes the relative position of the amino acids to each other in the protein sequence.


Figure 18.12 1D 1 H NMR spectrum of a 14 kDa β-sheet protein with an immunoglobulin fold at 30 °C. The characteristic chemical shift of each proton type facilitates their identification in the different spectral regions. At the left end, at 11 ppm are the peaks of the tryptophan indole protons. The peaks from 10 to 6 ppm are assigned to the amide protons of the protein backbone and of the asparagine and glutamine side chains. Between 7.5 and 6.5 ppm are the aromatic protons, followed by the Hα protons from 5.5 to 3.5 ppm. To the right-hand end of the spectrum (2D) NMR spectroscopy.

Heteronuclear NMR Experiments Besides protons, biomolecules contain other magnetically active nuclei (the so-called heteronuclei). Multi-dimensional NMR experiments for structure determination rely particularly on the magnetically active isotopes of carbon (13 C) and nitrogen (15 N). Due to the low natural abundances and small gyromagnetic ratios (Table 18.1), the relative sensitivities of 13 C and 15 N are low compared with protons. Therefore, to increase the sensitivity of heteronuclear experiments, two general strategies exist:

 First, recombinant expression of proteins in bacteria allows for the production of isotopically enriched proteins. To this end, bacteria are cultivated in minimal media, which contain 15NH4Cl as the sole nitrogen source. For the enrichment of 13 C, the sole carbon source is 13 C-glucose. In this manner, singly-labeled (15 N or 13 C) or even doubly-labeled (13 C, 15 N) samples are produced. Additionally, one can obtain deuterated proteins if D2O instead of H2O is used as the solvent for the culture medium.
 Second, the signal-to-noise ratio of an NMR experiment depends among other factors on the gyromagnetic ratios of the excited and detected nuclei. Therefore, the direct excitation/detection of heteronuclei is less sensitive than the excitation/detection of protons. Thus, experiments in general rely on the transfer of the large magnetization of protons to an attached heteronucleus (and vice versa). This achieves an optimal signal-to-noise ratio with only minor magnetization losses. For historical reasons, such experiments are called inverse heteronuclear experiments.

Figure 18.17 (a) Representative 2D NOESY spectrum of a 115 amino acid long protein. The very strong water signal in the center of the spectrum was removed during Fourier transformation. Grey rectangles in (b) schematically highlight the different spectral regions of this spectrum. For each rectangle the observable NOE signals in the respective region are given. The water line is shown as a grey bar.

HSQC – Heteronuclear Single Quantum Coherence The HSQC (heteronuclear single-quantum coherence) experiment constitutes the most important experiment that transfers


Figure 18.18 Pulse sequence of the HSQC experiment. Narrow black bars represent 90° pulses; broad black bars 180° pulses. The upper line displays pulses on the proton (1 H) frequency; the lower line on the nitrogen (15 N) frequency.

Figure 18.19 HSQC spectrum of severin DS111M at 32 °C and pH 7.0. The assignment for each peak is shown as single letter code (the numbers refer to the position in the protein sequence). The nitrogen frequency is plotted on the x-axis; the proton frequency on the y-axis.


magnetization to a heteronucleus and back (Figure 18.18). The HSQC correlates the frequency (ω1) of a heteronucleus (13 C or 15 N) with that of the directly bound proton (ω2). For example, in a two-dimensional 15N-HSQC each peak represents one proton bound to one nitrogen atom, that is, the spectrum consists essentially of all the amide signals (HN–N) of the protein backbone. Additionally, peaks arise for the aromatic, nitrogen-bound protons of the tryptophan and histidine side chains, and for the side chain amide groups of asparagine and glutamine, respectively (Figure 18.19). In the latter case, two peaks appear at the same nitrogen frequency because two amide protons are bound to the same side chain nitrogen. Under favorable conditions, the nitrogen-bound protons of the side chains of arginine and lysine are also visible. The advantage provided by the additional nitrogen dimension of the HSQC experiment is that it resolves amide proton resonances that often overlap in 1D and homonuclear 2D spectra of larger proteins. Compared with a homonuclear spectrum, the HSQC has of course no diagonal because it correlates completely different types of nuclei during the t1 and t2 times.
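As a rough consistency check during the analysis of a 15N-HSQC, the number of expected backbone amide peaks can be estimated from the sequence. A simplified helper, added here as an illustration (it assumes that prolines lack an amide proton, that the N-terminal residue is usually not observed, and it ignores all side-chain peaks; the sequence is invented):

```python
def expected_backbone_hsqc_peaks(sequence):
    # every residue except the N-terminus contributes one HN-N peak;
    # proline has no amide proton and is therefore skipped (simplification)
    return sum(1 for aa in sequence[1:] if aa != 'P')

print(expected_backbone_hsqc_peaks("MKVLAPGTPW"))  # -> 7
```

If significantly fewer peaks than expected are visible, this often points to exchange broadening or an unfolded, aggregated, or heterogeneous sample.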

18.1.4 Three-Dimensional NMR Spectroscopy The modularity of NMR experiments opens up the possibility for multidimensional NMR simply by introducing further dimensions. For example, we can create a 3D experiment through replacement of the acquisition time after the first mixing period of the 2D experiment (Figure 18.13) with another indirect evolution time and subsequent second mixing period (Figure 18.20). In four-dimensional NMR, a third indirect time follows as well as an additional mixing period. The different indirect times are each incremented individually. The direct data acquisition forms the end of each multidimensional experiment. We start our discussion of 3D NMR with the pulse sequences that are combinations of two 2D experiments because they are conceptually easier. Later, we will describe the so-called triple-resonance experiments that correlate three different types of nuclei (1 H, 13 C, 15 N).

The NOESY-HSQC and TOCSY-HSQC Experiments We have mentioned above that spectral overlap limits the application of 2D spectra (NOESY or TOCSY) for larger proteins. Due to the dispersion of the peaks in a cube instead of a plane, the introduction of a third dimension can greatly resolve this overlap. In general, a heteronuclear coordinate like 15 N or 13 C constitutes the third (vertical) dimension of this cube, because the wider frequency range of the heteronucleus provides a better signal resolution than an additional proton dimension. We can create such a 3D experiment simply by combining the pulse sequences for a 2D NOESY and a 2D HSQC, in which the HSQC experiment follows directly after the NOESY experiment instead of the data acquisition. The created experiment is called a 13 C- or 15 N-NOESY-HSQC. In an analogous way, we can convert a 2D TOCSY experiment into a 3D TOCSY-HSQC by combining a 2D TOCSY with a 2D HSQC. The 15 N-NOESY-HSQC and 15 N-TOCSY-HSQC represent the basic experiments for the sequence-specific assignment of medium-sized proteins (10–15 kDa). The respective 13 C variants are very useful in assigning the side chains and in identifying NOE signals between the side chain protons. The HCCH-TOCSY and HCCH-COSY Experiments The HCCH-TOCSY and HCCH-COSY experiments are alternatives to the 15 N-TOCSY-HSQC, whose sensitivity strongly decreases for larger proteins. Both experiments transfer magnetization exclusively through scalar J-couplings between nuclei. For example, initially the magnetization transfer takes place from the Hα proton to the Cα nucleus (Figure 18.21). From there the magnetization transfer continues to the next carbon nucleus of the side chain, or in the case of the HCCH-TOCSY to all carbon nuclei of the side chain. Because the 1JCC-coupling is about 35 Hz, the

Figure 18.20 Schematic illustration of a 3D experiment showing the NOESY-TOCSY experiment as an example. Compared with the 2D experiment, a 3D experiment contains additional evolution times and mixing periods. The mixing period of the NOESY transfer step consists of two pulses with a delay time (τm) in between; the mixing period of the TOCSY transfer step consists of a complicated series of pulses called the MLEV mixing sequence (after its inventor Malcolm Levitt).


Figure 18.21 Slices of different planes from an HCCH-TOCSY experiment of reduced DsbA from Escherichia coli. All correlation signals of the amino acid Leu185 are shown. The 13 C chemical shift of the peaks is given next to the corresponding plane; the 1 H chemical shift is plotted horizontally along the lowest plane. Individual proton assignments are given next to the respective peak; carbon assignments are next to the slices of the respective plane. Collectively, these peaks produce the typical spin system pattern for a leucine residue also seen in a 2D TOCSY.

mixing time is shorter than in the homonuclear case (nJHH < 10 Hz), making the experiment more sensitive for larger proteins. The time duration for the magnetization transfer is calculated according to tmix = 1/(2J). The data acquisition follows after the final magnetization transfer from each carbon nucleus to the directly bound proton. In general, the appearance of an HCCH-TOCSY (or HCCH-COSY) spectrum is identical to that of a 13 C-TOCSY-HSQC spectrum. Again, characteristic spin system patterns similar to the 2D TOCSY and 2D COSY facilitate the identification of the amino acid type.

Triple-Resonance Experiments For proteins larger than 15 kDa, spectral crowding affects even the 3D NOESY-HSQC and TOCSY-HSQC spectra, thus complicating the protein backbone assignment (Section 18.1.5, Analysis of Heteronuclear 3D Spectra). The sequential assignment of large proteins therefore relies on triple-resonance experiments due to their simple appearance: for each amino acid only a few peaks appear, often only one. Therefore, the problem of overlapping peaks occurs less frequently in triple-resonance spectra, which enables the sequential assignment of proteins up to 30 kDa. However, for certain nuclei the chemical shift of one residue may accidentally match that of a different residue. This so-called degeneracy is especially common for the Cα nuclei. Identifying the correct connection between amino acids with degenerate chemical shifts is, therefore, one of the main obstacles for sequential assignment via triple-resonance experiments (Section 18.1.5, Sequential Assignment from Triple-Resonance Spectra). Because triple-resonance experiments correlate three different nuclei with each other, they require more expensive doubly-labeled (13 C, 15 N) or triply-labeled (2 H, 13 C, 15 N) protein samples.
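The transfer delay tmix = 1/(2J) quoted for the HCCH experiments translates directly into pulse-sequence timings; a trivial sketch (the J values used are the typical magnitudes quoted in the text and in Figure 18.10):

```python
def transfer_delay_ms(J_hz):
    # optimal transfer delay t = 1/(2J), as given in the text, in milliseconds
    return 1.0 / (2.0 * J_hz) * 1e3

print(round(transfer_delay_ms(35.0), 1))  # 1J(CC) ~ 35 Hz -> 14.3 ms
print(round(transfer_delay_ms(92.0), 1))  # 1J(HN) ~ 92 Hz -> 5.4 ms
```

The comparison makes the sensitivity argument concrete: the larger the coupling, the shorter the delay, and the less transverse relaxation occurs during the transfer.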
A further advantage of triple-resonance experiments is their high sensitivity due to an efficient magnetization transfer through the strong 1J- and 2J-couplings (Figure 18.10) between the nuclei (i.e., directly via the covalent atom bonds). The required times for the transfer are, therefore, comparatively short, so that relaxation losses are substantially decreased relative to a TOCSY experiment. This high sensitivity is another reason why signal assignment is possible for proteins up to a molecular weight of 30 kDa. For even larger proteins the sensitivity of triple-resonance experiments decreases, mostly due to short transverse relaxation times. Especially the Hα protons enhance the relaxation of the Cα and HN nuclei through dipolar interactions. Reducing the number of the aliphatic protons through protein deuteration greatly attenuates this


type of relaxation and extends the molecular weight of proteins that can be studied by NMR to 50 kDa and beyond (Section 18.1.7, Structure and Dynamics of High Molecular Weight Systems and Membrane Proteins).

Nomenclature of Triple-Resonance Experiments A large variety of triple-resonance experiments exists, of which Figure 18.22 shows the most important representatives. Even though their nomenclature sounds cryptic, it is very descriptive. The name of the experiment specifies the detected nuclei, the magnetization transfer pathway, and the appearance of the spectrum. To this end, all nuclei through which magnetization is transferred are listed in a row. For example, the experiment denoted HNCO detects three nuclei (1 H, 15 N, 13 C) with the following flow of magnetization: HN(i) → N(i) → C′(i−1) (Figure 18.22). Brackets in experiment names mark nuclei that serve only as “relay stations” and whose frequencies remain undetected. For example, the HN(CA)CO detects the same types of nuclei as the HNCO; however, the magnetization transfer differs: HN(i) → N(i) → Cα(i) → C′(i) (Figure 18.22). The Cα nucleus only relays magnetization from nitrogen to the carbonyl (C′) carbon, but its chemical shift is not recorded. Note that in both experiment types the magnetization follows the same pathway back to the amide proton for acquisition (“out-and-back” transfer). The appearance of the two spectra is similar. For each amino acid residue one peak arises that correlates the amide proton and nitrogen with a nitrogen-bound carbonyl carbon. However, while the HNCO spectrum shows the correlation with the C′ of the preceding residue (residue i−1), the HN(CA)CO mainly shows the correlation with the intraresidual C′ (residue i).

The HNCA Experiment The HNCA constitutes one of the simplest and most useful examples of a triple-resonance experiment. The magnetization transfer pathway is given by HN(i) → N(i)(t1) → Cα(i)/(i−1)(t2); t1 and t2 specify the indirect dimensions during which the chemical shifts of the heteronuclei are encoded (Figure 18.20). The HNCA utilizes an “out-and-back” transfer to detect the amide proton chemical shift in t3. In all cases the magnetization transfer occurs through strong J-couplings (1JHN = 92–95 Hz, 1JNC = 11 Hz). Because the 2JNC

Figure 18.22 Overview of the most important triple-resonance experiments. Only the chemical shifts of the nuclei colored dark grey are detected during the experiment. Nuclei colored light grey serve as transmitters for the magnetization and remain undetected. Arrows mark the magnetization transfer pathway and direction. Under each experiment name all observable correlations are listed, in which subscripts i and i−1 denote the positions of the amino acid residues relative to each other. Even though the HCCH-TOCSY is not a triple-resonance experiment, it is included here due to its complementarity with the triple-resonance experiments for the assignment process.


Part II: 3D Structure Determination

coupling constant between the nitrogen nucleus and the Cα nucleus of the preceding amino acid is only marginally smaller (7 Hz) than the 1JNCα-coupling (11 Hz, Figure 18.10), magnetization transfer from nitrogen occurs to the Cα nuclei of both the amino acid it is part of and the preceding one. Therefore, for each amino acid two peaks arise in the HNCA spectrum: one intra- and one interresidual correlation. The related HN(CO)CA only shows the correlation to the preceding residue. In principle, the intra- and interresidual correlations of the HNCA enable the sequential assignment of the protein backbone. In practice, however, one requires additional experiments to identify the cross-peaks of the preceding residue and to resolve chemical shift degeneracies (Section 18.1.5, Sequential Assignment from Triple-Resonance Spectra).
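The naming convention described earlier (nuclei listed along the transfer pathway, bracketed nuclei acting as undetected relays) can be illustrated with a small sketch. The following Python toy parser is not part of any NMR software; the token list and greedy matching are simplifications that happen to work for the standard experiment names discussed here:

```python
# Toy parser for triple-resonance experiment names: nuclei are listed along
# the magnetization pathway, and bracketed nuclei are relays whose chemical
# shift is not recorded. Token set and matching are illustrative assumptions.

TOKENS = ("CA", "CB", "CO", "HA", "HB", "N", "H")  # longest tokens first

def parse_experiment(name):
    """Return (nucleus, is_detected) pairs along the transfer pathway."""
    path, i, relay = [], 0, False
    while i < len(name):
        if name[i] == "(":        # relay nuclei start: frequencies not recorded
            relay, i = True, i + 1
            continue
        if name[i] == ")":        # back to detected nuclei
            relay, i = False, i + 1
            continue
        for tok in TOKENS:
            if name.startswith(tok, i):
                path.append((tok, not relay))
                i += len(tok)
                break
        else:
            raise ValueError(f"cannot parse {name!r} at position {i}")
    return path

print(parse_experiment("HNCO"))      # [('H', True), ('N', True), ('CO', True)]
print(parse_experiment("HN(CA)CO"))  # [('H', True), ('N', True), ('CA', False), ('CO', True)]
```

Applied to "HN(CA)CO", the parser reproduces the statement in the text: the Cα nucleus lies on the pathway but is marked as undetected.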

18.1.5 Resonance Assignment

Sequential Assignment of Homonuclear Spectra

NMR spectra contain all the necessary information about proton distances and torsion angles to calculate the structure of a protein or nucleic acid, which constitutes one possible aim of biomolecular NMR. To this end, it is necessary to assign each peak observed in the spectrum to the respective proton in the protein. Due to the large number of peaks in COSY, TOCSY, and NOESY spectra, this requires a simple and universal analysis method. Kurt Wüthrich (Nobel Prize in Chemistry 2002) and colleagues developed a method that achieves exactly this – the sequence-specific assignment. This method exploits the distance information contained in the NOESY spectrum. Because of their direct proximity, an amino acid i + 1 displays specific contacts with amino acid i (in each case i denotes the relative position of the amino acids to each other). For example, due to the molecular geometry, the distances of the amide proton of residue i + 1 to the Hα, Hβ, or Hγ protons of amino acid i are almost always less than 5 Å (Figure 18.23). Therefore, at the amide proton chemical shift of amino acid i + 1 (horizontal frequency axis) cross-peaks arise at the chemical shifts (vertical frequency axis) of the respective protons of amino acid i. These interresidual peaks between neighboring amino acids are also called sequential peaks. The method of sequence-specific peak assignment requires a distinction between the inter- and intraresidual peaks for a given amide proton chemical shift. A simple comparison of the 2D NOESY spectrum overlaid with the 2D TOCSY spectrum of the same sample facilitates this distinction (Figure 18.23). While the sequential cross-peaks provide information about the connectivity to the preceding amino acid, the characteristic pattern of the intraresidual cross-peaks determines the amino acid type.
Prolines interrupt this chain of sequential connectivities because they lack an amide proton and therefore show no peak in the amide region of the spectrum. However, proline residues that adopt the more frequent trans-conformation (Figure 18.24) give rise to sequential Hα(i−1)–Hδ(i) cross-peaks at the Hα chemical shift or HN(i−1)–Hδ(i) cross-peaks at the amide proton chemical shift of the preceding residue. Another problem, which arises especially for larger proteins, is that the large number of peaks often results in peak overlap, preventing an unambiguous continuation of the sequential assignment at some positions in the sequence.

Figure 18.23 Overlay of a 2D NOESY (black circles) and a 2D TOCSY spectrum (grey circles) illustrating the schematic peak pattern of two neighboring amino acids. The interresidual, sequential NOESY peaks (filled black circles) of the succeeding amino acid i + 1 form the basis for the sequence-specific assignment. Arrows in the dipeptide on the left-hand side mark the respective protons that give rise to the sequential cross-peaks. For completeness, dashed, open circles signify the intraresidual peaks of amino acid i + 1 and the sequential peaks of amino acid i−1 at the amide proton chemical shift of amino acid i, respectively. For clarity, the symmetrical peaks occurring below the diagonal are omitted.


Figure 18.24 Conformation of the cis/trans-isomers of the peptide bond for an amino acid X and a proline residue. The Cα atoms of both residues as well as the connecting bonds that define the torsion angle ω are emphasized.

The first step of the sequence-specific assignment is to identify individual amino acids that serve as starting points for the assignment. Initially, the search is restricted to amino acids such as glycine, alanine, valine, or isoleucine because their characteristic peak patterns differ from those of other amino acid types. For example, glycine possesses two Hα protons. Detection of these two Hα peaks at the amide proton chemical shift and the appearance of the corresponding Hα1–Hα2 cross-peak provide definite evidence for glycine (Figure 18.25). The characteristic double peak row of the methyl groups at 0–1.5 ppm (Figure 18.25) signifies valine, leucine, and isoleucine. In addition to the information provided by the amide region of the spectrum, the aliphatic region contains various diagnostic cross-peaks, especially for prolines, that help to validate the spin systems. The next step is to identify the specific sequential contacts in the 2D NOESY spectrum and thus to determine the preceding residue for each of these distinct amino acids. Then, one determines the preceding amino acid for every newly identified residue in an iterative manner. Thus, the initially identified dipeptide is extended into an oligopeptide chain. The information about the amino acid types contained in these fragments enables one to place the fragments into the protein sequence – that is, to assign each spin system to the respective amino acid.

Figure 18.25 Schematic illustration of the characteristic peak patterns of (a) valine and (b) glycine. Both residues serve as starting points for the sequence-specific assignment due to their easily recognizable and distinctive patterns in the TOCSY spectrum. The left-hand side shows the structures of the respective amino acids together with the proton designations and typical chemical shift values. For clarity, the symmetrical peaks occurring below the diagonal are omitted.


Analysis of Heteronuclear 3D Spectra

The method of sequence-specific assignment explained above for homonuclear 2D TOCSY and NOESY spectra is also appropriate for the respective heteronuclear 3D spectra, the 15N-NOESY-HSQC and 15N-TOCSY-HSQC. Every 15N-plane of these spectra contains the NOESY and TOCSY peaks of an amide proton bound to its nitrogen. Again, a superposition of the respective 3D TOCSY- and 3D NOESY-HSQC spectra facilitates the distinction between intra- and interresidual correlations. The 15N-plane (of the 3D 15N-NOESY-HSQC) is, therefore, a kind of sub-spectrum of the respective 2D spectrum (2D NOESY). A major difference from the 2D spectra is the frequency range sampled in the acquisition dimension of 15N-edited 3D NOESY and TOCSY spectra. The experiments select only correlations to protons bound to 15N nuclei. Therefore, every 3D 15N-NOESY-HSQC and 15N-TOCSY-HSQC only contains frequencies between 12 and 5 ppm in the acquisition dimension; the side chain region beyond the water signal on the high-field side of the spectrum is empty.

Selective Amino Acid Labeling

15N-labeling of selected amino acids constitutes an alternative way to determine the amino acid type. To this end, recombinant Escherichia coli cells are cultivated in a minimal medium containing all 20 naturally occurring amino acids. This medium composition greatly suppresses the cellular de novo synthesis of amino acids because the cells take up the amino acids directly from the medium. Addition of a commercially available 1-15N-L-amino acid to the medium achieves the selective labeling of this residue type. All other amino acids are added to the medium as "normal" 14N-amino acids. In particular, those unlabeled amino acids that can be metabolically synthesized from the selectively labeled amino acid need to be added in excess (to suppress the so-called scrambling). An HSQC spectrum acquired on a selectively labeled protein shows only the signals of the labeled amino acid type.
Therefore, the labeling of different amino acid types facilitates a quick assignment of the NMR signals and further validates the assignment obtained from TOCSY and NOESY spectra. In general, selective amino acid labeling is less expensive than the production of doubly-labeled proteins because it requires smaller amounts of protein and the 15N-labeled amino acids are moderately priced (the actual price depends on the specific amino acid type). However, to assign all residues in a protein, several differently labeled protein samples have to be made, which increases the workload compared to the production of a single, uniformly labeled sample. Thus, the production of multiple samples can easily offset the previous savings.

Sequential Assignment from Triple-Resonance Spectra

To illustrate the sequential assignment with triple-resonance experiments, we will restrict ourselves to the most popular

Figure 18.26 Schematic illustration of the sequence-specific assignment for an HNCA spectrum. The 1H–15N projection of the 3D spectrum shows one peak for each amino acid (grey peaks). At each amide proton chemical shift two cross-peaks exist, one originating from the correlation with the Cα nucleus of the residue itself and one from the correlation with the preceding residue (blue). Via the blue sequential signals, one can "walk" stepwise through the amino acid sequence (blue arrows).


representatives. The HNCA spectrum will serve again as an example to explain the general appearance of triple-resonance spectra and the assignment strategy. We have already shown in Section 18.1.4 (Triple-Resonance Experiments) that the HNCA spectrum consists of three frequency axes (1H, 15N, and 13C). Its 1HN–15N projection looks like an HSQC spectrum, and every peak in this projection corresponds to a single residue. As mentioned earlier, two cross-peaks appear in the 13Cα dimension at the chemical shift of each amide proton: one intraresidual correlation and one interresidual correlation to the 13Cα nucleus of the preceding amino acid (Figure 18.26). In principle, this sequential information of the cross-peaks suffices to "walk" through the complete sequence (Figure 18.27). The "sequential walk" requires a clear distinction between intraresidual and sequential cross-peaks. In general, the intraresidual peaks tend to be slightly more intense than the interresidual peaks (due to the more efficient magnetization transfer via the 1JNCα coupling). This tendency is, however, not fully reliable, because processes like relaxation also influence the peak intensities. In contrast, targeted experiments like the HN(CO)CA force the magnetization through the carbonyl carbon and suppress the intraresidual pathway, while the spectrum otherwise looks identical to the HNCA. Thus, one obtains exclusively the sequential cross-peaks and resolves the ambiguity that is present in the HNCA.
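The logic of the "sequential walk" can be illustrated with a toy matching procedure. This Python sketch assumes idealized peak lists: each spin system carries an intraresidual Cα shift and the sequential Cα shift of its predecessor, the shift values and the matching tolerance are invented, and the overlap and degeneracy problems discussed in the text are ignored:

```python
# Toy "sequential walk": link spin system k to a successor whose sequential
# (i-1) Calpha shift matches the intraresidual Calpha shift of k. All numbers
# are invented for illustration; real data require degeneracy handling.

TOL = 0.05  # matching tolerance in ppm (assumption)

def sequential_walk(systems, start):
    """Chain spin systems by matching Calpha(intra) of the current system
    to Calpha(sequential) of a unique successor."""
    chain = [start]
    current = start
    while True:
        ca_intra = systems[current][0]
        successors = [k for k, (_, ca_seq) in systems.items()
                      if k not in chain and abs(ca_seq - ca_intra) < TOL]
        if len(successors) != 1:   # stop at ambiguities or at the chain end
            break
        current = successors[0]
        chain.append(current)
    return chain

# (intraresidual Calpha, sequential Calpha of the preceding residue), in ppm
systems = {
    "A": (52.3, 58.1),   # predecessor not in this peak list
    "B": (61.0, 52.3),   # sequential shift matches A -> B follows A
    "C": (55.4, 61.0),   # C follows B
}
print(sequential_walk(systems, "A"))  # ['A', 'B', 'C']
```

The break condition mirrors the practical situation: when two candidate successors match within the tolerance (chemical shift degeneracy), the walk cannot continue unambiguously and additional experiments are needed.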

Figure 18.27 Slices from an HNCA (black contours) and a CBCA(CO)NH spectrum (blue contours) of huMIF. Each stripe corresponds to one amino acid from Phe18 to Lys32 at the respective amide proton (x-axis) and nitrogen chemical shifts (z-axis, not shown). The superposition of both spectra illustrates the “sequential walk” in which black horizontal lines indicate the sequential connectivities.


Proline residues and chemical shift degeneracies interrupt the assignment procedure (similar to the assignment of homonuclear spectra, Section 18.1.5, Sequential Assignment of Homonuclear Spectra). Proline residues lack an amide proton and therefore produce no cross-peaks, while chemical shift degeneracies prevent an unambiguous assignment of the preceding residues because several possibilities exist. As outlined below, other types of triple-resonance experiments correlating the amide group with other carbon nuclei can resolve the chemical shift degeneracy; however, proline residues will always interrupt the sequential assignment in amide proton-detected experiments. Combination of the HN(CA)CO with the HNCO experiment constitutes an independent alternative to validate the connectivities found in the HNCA. The two experiments correlate the amide proton with the intraresidual carbonyl carbon (HN(CA)CO) or with that of the preceding residue (HNCO); that is, the sequential assignment is established through C′ instead of Cα connectivities. The superposition of both spectra results in a pattern analogous to the HNCA spectrum. Furthermore, the HNCO spectrum, which is the most sensitive triple-resonance experiment, can be used to resolve accidental signal degeneracies in the HSQC projection. In proteins, each amide proton–nitrogen pair is covalently attached to only one C′. Therefore, one observes one cross-peak per amide proton frequency. However, if one finds two cross-peaks at the frequency of an amide proton, then this means that the signals of two amino acids are degenerate in the HSQC projection. Two further pairs of experiments provide independent assignment strategies. The CBCANH and CBCA(CO)NH experiments give rise to intra- and interresidual correlations of the amide group with the Cα and Cβ resonances, respectively. The closely related HBHA(CBCACO)NH and HBHA(CBCA)NH experiments provide the corresponding correlations with the Hα and Hβ resonances.
The Cα and Cβ chemical shifts obtained from the first two experiments are useful to narrow down the amino acid type. Because of their distinctive values, the Cα and Cβ chemical shifts enable a preliminary identification of alanine, glycine, isoleucine, proline, serine, threonine, and valine residues (Figure 18.28). If the resonances of different amino acids are degenerate in the 15N-HSQC projection, then the HCACO spectrum provides a simple alternative to differentiate these residues. This experiment establishes the correlations between the Hα, Cα, and C′ frequencies of an amino acid and thus allows for the continuation of the assignment.

Figure 18.28 Typical ranges of Cα and Cβ chemical shifts of the 20 amino acids. Source: adapted from: Cavanagh, J., Fairbrother, W.J., Palmer, A.G. III, and Skelton, N. (1996) Protein NMR Spectroscopy, Academic Press.


While triple-resonance spectra provide the connectivities between individual spin systems (and thus the sequential assignment), they yield only limited information about the side chains. The side chain assignment therefore relies on the HCCH-TOCSY and HCCH-COSY experiments, and starts from the known frequencies of Cα (obtained from the HNCA) or Hα (obtained from, for example, the 15N-TOCSY-HSQC or the HCACO).

Summary: The general strategy to sequentially assign a protein by triple-resonance experiments requires the acquisition of several independent spectra. First, one establishes the sequential connectivity between the spin systems through at least two different nuclei (Cα, Cβ, or C′). This strategy minimizes complications of the assignment by chemical shift degeneracies. Second, one restricts the amino acid type through the chemical shift information provided by CBCA(CO)NH or CBCANH spectra. Third, one assigns the side chains with HCCH-TOCSY and 13C-NOESY-HSQC experiments (Section 18.1.4, The NOESY-HSQC and TOCSY-HSQC Experiments). Taking all this information together, one can finally place the identified spin systems into the protein sequence.

18.1.6 Protein Structure Determination

Constraints for the Structure Calculation

Up to now, we have described methods to identify and assign the observable NMR signals to the respective amino acids. Once the assignment of the resonances to individual nuclei has been completed, the next step is to extract structure-defining data from the NMR spectra. Most important are 2D NOESY, 3D 15N-NOESY-HSQC, and 3D 13C-NOESY-HSQC spectra (Section 18.1.4, The NOESY-HSQC and TOCSY-HSQC Experiments), as they provide a wealth of proton–proton distances. For a medium-sized protein (ca. 120 amino acids) one usually observes more than 1000 NOE contacts, which have to be assigned to specific protons on the basis of the previously established sequence-specific resonance assignment. Particularly important are the non-sequential NOEs that define the three-dimensional structure. Medium-range NOEs (fewer than four amino acids separate the residues involved in the NOE) provide information about the local backbone conformation of the protein and serve to determine secondary structural elements. Long-range NOEs (five or more amino acids separate the involved residues) define the relative orientation of the secondary structural elements to each other and are thus the essential parameters for the determination of the tertiary structure. Because the NOE signal intensity I depends on the distance r between two nuclei i and j according to:

I_NOE(i,j) ∝ 1 / r_ij^6    (18.13)

the internuclear distances can be obtained through integration of the NOE signal. Alternatively, it is possible to estimate their intensity qualitatively. In both cases, however, the signal intensities have to be calibrated against an NOE signal of known distance (e.g., against known distances in secondary structural elements). Depending on their signal intensities, NOEs are classified into different distance groups with fixed distance boundaries (Table 18.2).

Table 18.2 Relationship between the intensity of an NOE signal and the respective proton distance.

NOE intensity    Distance (Å)    Upper limit (Å)
Strong           2.4             2.7
Medium           3.0             3.3
Weak             4.4             5.0
Very weak        5.2             6.0

During structure calculations NOE distances act like elastic springs between the atoms. If an atomic distance exceeds the upper limit, then this NOE violation is penalized with an energy term that depends on the extent of the violation. This energy term forces the respective protons closer together in the next iteration of the structure calculation. Because various effects can reduce the NOE intensity in the NMR spectrum irrespective of the atomic distance, usually no lower limits are used for the calculation; only the "normal" repulsion between the atoms due to their van der Waals radii acts.
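The calibration via the r^-6 dependence of Equation (18.13), the binning into the distance classes of Table 18.2, and the "elastic spring" penalty described in the table legend can be sketched numerically. The reference intensity, the test intensities, and the force constant below are invented for illustration:

```python
# Sketch of NOE calibration and flat-bottom restraint penalty. The reference
# NOE is assumed to correspond to a known distance of 2.5 A; intensities and
# the force constant k are invented illustration values.

def noe_distance(intensity, i_ref, r_ref=2.5):
    """Distance estimate from Eq. (18.13): I ~ r^-6, so r = r_ref * (I_ref/I)^(1/6)."""
    return r_ref * (i_ref / intensity) ** (1.0 / 6.0)

def distance_class(r):
    """Distance classes with upper limits taken from Table 18.2."""
    for label, upper in (("strong", 2.7), ("medium", 3.3),
                         ("weak", 5.0), ("very weak", 6.0)):
        if r <= upper:
            return label, upper
    return "unrestrained", None

def restraint_energy(r, upper, k=1.0):
    """Flat-bottom penalty: zero below the upper limit, harmonic above it."""
    return k * (r - upper) ** 2 if r > upper else 0.0

i_ref = 1.0  # intensity of the calibration NOE (distance 2.5 A)
for i in (1.2, 0.33, 0.02):
    r = noe_distance(i, i_ref)
    label, upper = distance_class(r)
    print(f"I = {i:5.2f} -> r = {r:.2f} A ({label})")
```

Note that the penalty is one-sided, mirroring the text: distances below the bound cost nothing, because reduced NOE intensities need not imply a larger distance.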


In addition to distance restraints, structure calculations utilize the 3J(HN–Hα) coupling constants provided by COSY or HNCA-J spectra (a variant of the HNCA). As described in Section 18.1.2 (Spectral Parameters), 3J(HN–Hα) coupling constants restrain the ϕ torsion angles of the protein backbone through the Karplus relationship (Figure 18.11). Residual dipolar couplings (RDCs) represent a rather new class of NMR restraints that also includes chemical shift anisotropy and cross-correlated relaxation. In isotropic solution, rotational diffusion averages anisotropic interactions such as the dipolar coupling between two nuclear spins to zero. Dipolar couplings therefore do not result in line splitting in isotropic solution, but act as a relaxation mechanism. However, if the molecular tumbling is anisotropic and certain orientations are preferred or avoided (i.e., a molecular alignment is present), dipolar couplings average incompletely, resulting in scaled-down values (relative to the static value) of the dipolar interaction. The corresponding couplings are called residual dipolar couplings. In contrast to NOEs and scalar couplings, which yield distance information about the local geometry, orientational restraints such as RDCs provide long-range restraints for the whole molecule. For example, RDCs determine the relative orientation of secondary structural elements or of individual protein domains to each other. Additionally, RDCs provide dynamic information about slow internal motions of a bond vector. The use of RDCs in structure calculations is, however, more complicated than for NOEs or J-couplings. All RDCs refer to the same reference frame that is fixed to the molecule – the anisotropic alignment tensor σ.
Thus, RDCs depend on the orientation of the internuclear vector relative to the molecular reference frame, which is described by two angles (θ, ϕ) (Figure 18.29); θ defines the angle between the internuclear vector and the z-axis of the principal axis frame, and ϕ is the corresponding azimuthal angle. The principal axis frame is defined as the molecular frame in which the alignment tensor σ is diagonal (i.e., with principal values |σz| > |σy| > |σx|). The dipolar coupling DA,B between two nuclei A and B is given by:

D(A,B) = 0.75 Dmax(A,B) [(3cos²θ − 1) σz + sin²θ cos 2ϕ (σx − σy)]    (18.14)

Figure 18.29 The orientation of an N–H bond vector i relative to the alignment tensor σ determines the magnitude and sign of the residual dipolar coupling. The alignment tensor is fixed to the protein; however, its magnitude and orientation depend on the orienting medium (such as bicelles or Pf1).
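Equation (18.14) is easy to evaluate numerically. In the following Python sketch the tensor principal values and the maximal coupling are invented placeholders (for an N–H pair the static dipolar coupling is on the order of 20 kHz); the point is only to show how orientation controls the size and sign of the RDC:

```python
import math

# Numerical sketch of Eq. (18.14): RDC of a bond vector as a function of its
# polar angles (theta, phi) in the principal axis frame of the alignment
# tensor. Tensor values and d_max below are illustrative assumptions.

def rdc(theta, phi, s_x, s_y, s_z, d_max):
    """D = 0.75 * d_max * [(3cos^2(theta) - 1) s_z + sin^2(theta) cos(2 phi) (s_x - s_y)]"""
    return 0.75 * d_max * ((3 * math.cos(theta) ** 2 - 1) * s_z
                           + math.sin(theta) ** 2 * math.cos(2 * phi) * (s_x - s_y))

# A vector along z (theta = 0) feels only the axial term; a vector at the
# magic angle with an axially symmetric tensor (s_x = s_y) gives zero RDC.
print(rdc(0.0, 0.0, s_x=1e-4, s_y=2e-4, s_z=-3e-4, d_max=21700))   # Hz
print(rdc(math.acos((1 / 3) ** 0.5), 0.0, 1e-4, 1e-4, -2e-4, 21700))
```

Because a measured RDC is compatible with a whole cone of orientations, several alignment media (and thus several tensors) are typically combined in practice to lift this degeneracy.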

The magnitude of RDCs scales with the degree of alignment. Experimentally, one prefers a weak alignment of the proteins to obtain RDCs with magnitudes of a few Hertz. When the alignment is weak (corresponding to the alignment of about 1 out of 1000 molecules), the spectra remain comparable to those in isotropic solution – that is, the number of lines does not increase strongly and the NMR signals remain sharp. Several methods exist to achieve a weak alignment of a protein. Originally, particles were studied that align spontaneously in the magnetic field (through anisotropic magnetic susceptibility tensors). Meanwhile, researchers have turned to systems that display liquid-crystalline behavior in the magnetic field even when strongly diluted:

- bicelles (flat micelles consisting of two types of lipids with different acyl chain lengths);
- rod-like virus particles (tobacco mosaic virus or bacteriophage Pf1);
- poly(ethylene glycol)/alcohol mixtures;
- purple membranes of halophilic bacteria.

In addition, mechanically stretched or compressed polyacrylamide gels also achieve a partial alignment of the solute (protein or nucleic acid). To determine RDCs, one mostly records the same heteronuclear experiments that are also used for the measurement of scalar couplings.

Determination of the Secondary Structure

Regular secondary structural elements like α-helices and β-sheets are regions with a defined conformation of the protein backbone, that is, with well-defined torsion angles and fixed proton–proton distances. Therefore, diagnostic NOE signal patterns and 3J(HN–Hα) coupling constants allow for a distinction between the most important secondary structural elements (Figure 18.30). Characteristic for an α-helix are strong NOE signals between the amide protons HN(i)–HN(i+1) and HN(i)–HN(i+2), and also the very specific NOEs between Hα(i) and HN(i+3) (Figure 18.30). 3J(HN–Hα) coupling constants of less than 5 Hz additionally confirm the existence of an α-helix. In contrast, for β-sheets a specific pattern of Hα(i)–Hα(j) and HN(i)–HN(j) NOEs arises between the two parallel or antiparallel strands of the sheet (i and j denote amino acids in different strands of the sheet). The extended structure of the protein backbone is further evident from strong Hα(i)–HN(i+1) NOEs and 3J(HN–Hα) coupling constants of more than 8 Hz for several consecutive residues.
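The link between the 3J(HN–Hα) thresholds quoted above and the backbone torsion angle ϕ is the Karplus relationship (Section 18.1.2). The Python sketch below uses one common set of Karplus coefficients (roughly A = 6.4, B = −1.4, C = 1.9 Hz); treat the numbers as illustrative rather than definitive:

```python
import math

# Karplus relationship for 3J(HN-Halpha) as a function of the backbone torsion
# angle phi. Coefficients are one common parameterization, used here only to
# illustrate the <5 Hz (helix) and >8 Hz (sheet) criteria from the text.

def j_hn_ha(phi_deg, a=6.4, b=-1.4, c=1.9):
    """3J(HN-Halpha) = A cos^2(phi - 60) + B cos(phi - 60) + C, phi in degrees."""
    x = math.radians(phi_deg - 60.0)
    return a * math.cos(x) ** 2 + b * math.cos(x) + c

# alpha-helical phi (~ -60 deg) gives a small coupling, extended/beta phi
# (~ -120 deg) a large one:
print(round(j_hn_ha(-60.0), 1), "Hz (helical phi)")
print(round(j_hn_ha(-120.0), 1), "Hz (extended phi)")
```

With these coefficients a helical ϕ of −60° yields about 4 Hz and an extended ϕ of −120° about 10 Hz, consistent with the diagnostic thresholds in the text.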


Figure 18.30 Characteristic NOEs and typical 3J(HN–Hα) coupling constants for three regular secondary structures. The plot schematically illustrates typical intensities for the HN(i)–HN(i+1), Hα(i)–HN(i+1), HN(i)–HN(i+2), Hα(i)–HN(i+2), Hα(i)–HN(i+3), Hα(i)–Hβ(i+3), and Hα(i)–HN(i+4) cross-peaks. The height of the rectangles reflects the peak intensity.

We will use the DS111M domain of severin to illustrate the identification of secondary structures. This second domain of severin plays a central role in the polymerization and depolymerization of filamentous actin, a constituent of the cytoskeleton. Severin DS111M consists of 114 amino acids and contains three α-helices, evident from Hα(i)–HN(i+3) connectivities (Figure 18.31). In addition, strong sequential Hα(i)–HN(i+1) contacts and missing HN(i)–HN(i+1) NOEs indicate the existence of five β-strands. The 3J(HN–Hα) coupling constants (>8 Hz) provide further support for a β-structure in these regions. Moreover, particularly strong Hα(i)–Hα(j) contacts between two strands facilitate the identification of associated strands within the β-sheet (Figure 18.32). These short distances (∼2.2 Å) are easily recognized in 2D NOESY or 3D 13C-NOESY-HSQC spectra. Because the Hα resonances have chemical shifts close to that of water, it is advisable to record those NOESY spectra on samples dissolved in D2O in order to reduce the interference from the water signal. Besides the direct structural information provided by distances, orientational restraints, and torsion angles, hydrogen/deuterium (H/D) exchange rates of amide protons yield indirect structural information. Hydrogen bonds, which generally stabilize secondary structures, strongly attenuate the H/D exchange. Thus, slowly exchanging amide protons are often an indication of the existence of secondary structure (Figure 18.31). In addition, one obtains clues about the possible position of a residue in the protein structure. Amide protons in the interior of the protein possess reduced exchange rates with the solvent due to their reduced accessibility, while amide protons at the surface display higher exchange rates. To determine the exchange rate experimentally, the protein sample is freeze-dried and subsequently dissolved in pure D2O.
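Extracting an exchange rate from such an experiment amounts to following the decay of a peak intensity over time, I(t) = I0·exp(−k·t), and fitting k. The Python sketch below does this with a simple log-linear least-squares fit; the time points and intensities are invented toy data:

```python
import math

# Sketch of H/D exchange-rate extraction: after dissolution in D2O the peak
# intensity of an amide proton decays as I(t) = I0 * exp(-k t). A linear fit
# of ln(I) versus t yields k. Data points below are invented.

def exchange_rate(times, intensities):
    """Least-squares slope of ln(I) vs t; returns the rate constant k = -slope."""
    n = len(times)
    logs = [math.log(i) for i in intensities]
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    return -slope

times = [0.0, 1.0, 2.0, 4.0]             # hours after dissolution in D2O
intensities = [1.00, 0.61, 0.37, 0.14]   # roughly exp(-0.5 t)
print(f"k = {exchange_rate(times, intensities):.2f} per hour")
```

For slowly exchanging, hydrogen-bonded amide protons the decay extends over much longer times, which is why their signals can remain visible for months.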
Over time, those signals disappear that are associated with rapidly exchanging amide protons. Conversely, signals of slowly exchanging amide protons remain visible in the spectrum for longer times (sometimes up to several months). These amide protons localize almost exclusively to regions of the protein with regular secondary structure (Figure 18.31). The chemical shifts of protein residues provide further evidence for secondary structure. Relative to the disordered state (the random-coil shift), the chemical shift changes if an amino acid resides in a secondary structural element. This difference is called the secondary chemical shift and reflects the uniformity of the chemical environment in regular secondary structure. For example, Cα or C′ carbons experience downfield shifts (i.e., larger ppm values) in α-helices and


Figure 18.31 Overview of the sequential and short-range NOEs of DS111M. The height of the bars reflects the NOE peak intensity. The 3J(HN–Hα) coupling constants are given below the amino acid sequence. Filled and open circles denote residues with slow and intermediate amide proton exchange rates, respectively. Above the amino acid sequence the secondary structure is shown schematically (arrow = β-sheet, spring = α-helix).

Figure 18.32 The complete network of NOE cross-peaks within the five-stranded β-sheet of severin DS111M. The four β-strands β1, β2, β3, and β4 run antiparallel to each other, while β-strands β4 and β5 run parallel. Arrows mark the NOE cross-peaks that occur within and between the backbones of the five β-strands. Solid arrows denote sequential Hα(i)–HN(i+1) cross-peaks, while thick and thin double-headed arrows and dashed arrows denote interstrand HN(i)–HN(j), Hα(i)–Hα(j), and Hα(i)–HN(j) cross-peaks, respectively. Additionally, thin dotted lines indicate hydrogen bonds.


upfield shifts (i.e., smaller ppm values) in β-sheets (and vice versa for Hα and Cβ). The method by which secondary structures are identified from secondary chemical shifts is known as the chemical shift index. Therefore, already in the early stages of a structure determination the methods described above facilitate the identification of regions with secondary structure in the protein. However, the relative spatial positions of these elements to each other, as well as the global fold of the protein, remain unknown.

Calculation of the Tertiary Structure

A computer-assisted structure calculation is used to convert the geometric data (distances and angles obtained from the analysis of the NMR spectra) into a three-dimensional structure of the biomolecule. However, the NMR data provide only a limited number of distance and angle restraints between atom pairs (preferentially the protons), which alone are insufficient to determine all atom positions. Fortunately, this is only of minor importance because the bond geometries of many chemical groups are well known from X-ray diffraction experiments. This molecular information is contained in the so-called force field, which also includes further general atomic parameters such as van der Waals radii and electrostatic (partial) charges. While NMR data are obtained directly from the target molecule and within error limits are considered "real," force field parameters are derived from measurements on reference molecules. It is assumed that the force field parameters depend only on the chemical but not on the spatial structure. Therefore, the force field constitutes a reasonable model for the real molecule. Because different possibilities exist to extract force field parameters from experimental reference data, several different force fields were developed for specific applications. Even though structure calculations depend on a force field, the experimental NMR data should determine the result irrespective of the choice of the force field.
Additionally, compared to pure molecular dynamics simulations, NMR structure calculations use simplified force fields. For example, the solvent is generally considered only in the form of a fixed dielectric constant. The most important degrees of freedom that determine the 3D structure of a protein are the rotations about the N–Cα bond (torsion angle ϕ) and the Cα–C′ bond (torsion angle ψ). To obtain reliable structures, the NMR data, and not the force field, should accurately define these angles. Especially for loops at the surface, the N- and C-termini, or even the amino acid side chains this is often impossible due to their enhanced mobility. Thus, those regions do not assume a single defined structure, but rather fluctuate between different conformations. Flexible regions of biomolecules are therefore best described by an ensemble of conformations, and a single static protein structure may not necessarily be the best representation of the microscopic reality. Practical experience shows that the number of distance restraints (NOEs) is more important than the accuracy with which the distances are determined. Thus, the distance classification introduced in Table 18.2 is sufficiently accurate for structure calculation. In principle, two different methods exist to calculate the structure of a protein in solution (which can also be combined):

- The distance geometry method creates matrices from the NMR data and the force field that contain distance bounds for all atom pairs. Through mathematical optimization methods, Cartesian coordinates are calculated for all atoms that fulfill the distance bounds reasonably well. However, the solution to this problem is ambiguous, and one can calculate many independent structures that all agree reasonably well with the NMR data. Because distance geometry takes the covalent geometry only insufficiently into account, all distance geometry structures require further refinement.
- Simulated annealing is a molecular dynamics (MD) method. Newton's equation of motion states that a force acting on an atom either accelerates or retards it. By numerically solving this equation, MD simulates the physical motion of atoms. NMR data are included as constraints and guarantee that the protein only adopts conformations that agree with the experimentally determined data. Starting from a template structure, a simulation period at high temperature allows the protein to find a structure that is compatible with the NMR data and the force field. Because at this stage both the force field and the experimental constraints are weak compared to the thermal energy, the simulated molecule can undergo large conformational changes. The subsequent simulated annealing protocol decreases the simulation temperature and increases the impact of the force field and the experimental NMR constraints. In this way fluctuations in the structure are reduced until, finally, a 3D structure


Part II: 3D Structure Determination

Figure 18.33 (a) Stereo image of the 20 lowest-energy structures of severin DS111M. Only the heavy atoms of the protein backbone (N, Cα, C´ , and O) are shown. (b) Stereo image of the ribbon model of DS111M.

with minimal energy is obtained. Because the result of simulated annealing can depend on the starting structure, it is necessary to perform multiple calculations with different template structures. NMR structure calculations do not provide a single structure but a family of conformers, which occupy a relatively narrow conformational space. The root-mean-square deviation (RMSD) expresses the variability within this structure family, in which small deviations indicate a narrow conformational space. In general, one determines the RMSD for each structure of this family relative to an average structure (which has to be calculated in advance). Alternatively, one can determine the RMSD by pairwise comparison of two structures of the family and calculating the mean of these deviations. The RMSD differs for individual parts of the protein structure because regions lacking a defined secondary structure show larger deviations due to few NMR constraints. Similarly, flexible regions with an increased internal mobility also give rise to higher RMSD values. Relaxation measurements (Section 18.1.7, Determination of Protein Dynamics) can clarify if the variability within the structural family is due to insufficient NMR restraints or is caused by local dynamics. Figure 18.33 shows the result of a structure calculation for the protein severin DS111M. The individual conformers of the NMR ensemble are almost superimposable apart from a few poorly defined regions (the N-terminus and α-helix 2 at the C-terminus). Figure 18.33b displays DS111M as a ribbon model to enhance the presentation of the β-strands and α-helices. The orientation of the ribbon model is identical to the ensemble in panel (a). These structures were calculated with 1011 distance bounds and 55 ϕ torsion angles, and clearly demonstrate the necessity for a large number of constraints to obtain a well-defined structure.
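The RMSD calculation relative to a precomputed average structure can be sketched as follows, assuming the conformers have already been superimposed (a real implementation would first perform a least-squares fit, e.g., with the Kabsch algorithm):

```python
import math

def mean_structure(ensemble):
    """Coordinate-wise average over conformers.
    ensemble: list of conformers, each a list of (x, y, z) atom tuples."""
    n_models = len(ensemble)
    n_atoms = len(ensemble[0])
    mean = []
    for a in range(n_atoms):
        atom_coords = [conf[a] for conf in ensemble]
        mean.append(tuple(sum(c[i] for c in atom_coords) / n_models
                          for i in range(3)))
    return mean

def rmsd(conf, ref):
    """Root-mean-square deviation between two superimposed conformers."""
    sq = sum((x - u) ** 2 + (y - v) ** 2 + (z - w) ** 2
             for (x, y, z), (u, v, w) in zip(conf, ref))
    return math.sqrt(sq / len(conf))

def ensemble_rmsd(ensemble):
    """Mean RMSD of each conformer to the average structure."""
    ref = mean_structure(ensemble)
    values = [rmsd(conf, ref) for conf in ensemble]
    return sum(values) / len(values)
```

The pairwise variant mentioned in the text would instead average `rmsd` over all conformer pairs; restricting the atom list to well-defined regions reproduces the observation that flexible termini inflate the RMSD.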

18.1.7 Protein Structures and more — an Overview

NMR spectroscopy constitutes a versatile method with the possibility to investigate multiple chemical and biophysical problems apart from the mere determination of protein structures. The

18 Magnetic Resonance Spectroscopy of Biomolecules

determination of a protein structure is thus not the end but rather the start of a structural and biophysical characterization of a protein. Knowledge of the three-dimensional structure allows for a meaningful approach towards functional aspects such as protein dynamics, interactions with other molecules (e.g., proteins, DNA, or ligands), catalysis mechanisms, hydration, and protein folding. An extensive description of these techniques is beyond the scope of this chapter. The following subsections therefore give only a short overview of possible experiments and applications. The bibliography at the end of the chapter contains a selection of important textbooks and review articles that explain individual subjects more precisely and extensively.

Speeding-up NMR Spectroscopy
Due to its low sensitivity, NMR spectroscopy is a rather slow and time-consuming method. Even though simple 1D experiments last only a few seconds to minutes, the measurement time for more complex spectra increases dramatically with the dimensionality of the experiment. Two-dimensional experiments typically last tens of minutes to hours; 3D experiments can take up to a few days to finish. Therefore, the acquisition of higher-dimensional spectra (4D or 5D), in which each indirect time dimension is incremented independently (as described above), is impractical. Mainly two factors determine the length of an experiment: the relaxation delay (Section 18.1.2, The 1D Experiment) and the number of increments in the indirect dimension(s) (Section 18.1.3, General Scheme of a 2D Experiment). During the relaxation delay (ca. 1–5 s) the magnetization recovers to its equilibrium value through T1 relaxation. Hence, this delay determines how fast individual experiments (e.g., with different t1 increments) can be repeated. SOFAST- and BEST-type NMR experiments are specifically designed to selectively excite only a subset of the spins (usually the amide protons).
The unperturbed protons enhance the longitudinal relaxation of the excited spins through dipole–dipole interactions. As a result one can reduce the relaxation delay to a few hundred milliseconds and thus increase the repetition rate five- to ten-fold. The number of increments in the indirect dimension determines the achievable resolution in this dimension; high-resolution spectra therefore require more measurement time. Two different approaches exist to reduce the number of increments while maintaining the same resolution (or to increase the resolution in the same experiment time). In normal experiments the indirect time is incremented in constantly spaced intervals Δtn. In contrast, fewer randomly spaced increments in the indirect dimension (ca. 30% relative to conventional sampling schemes) are used in the so-called non-uniform sampling method. Mathematical approaches (maximum entropy or compressed sensing), which are distinct from Fourier transformation, convert the sparsely sampled indirect time-domain data into a conventional spectrum. The second approach relies on the simultaneous incrementation of at least two indirect time domains to generate a two-dimensional projection of the n-dimensional spectrum. Imagine the three-dimensional cube of an HNCA spectrum with 1H, 15N, and 13C on the x-, y-, and z-axes, respectively. Because of the co-incrementation of the 13C and 15N dimensions, the resulting projection will intersect the 13C–15N plane at a certain angle with respect to the z-axis. Thus, in the projection the signals on the y-axis are combinations of the 13C and 15N frequencies, while frequencies on the unaffected x-axis correspond to the amide protons. Even though we are unable to imagine a four-dimensional object, mathematically it is very simple to transfer the described method to 4D experiments (and higher). Suitable projection reconstruction methods facilitate the creation of the n-dimensional spectrum from a limited number of projections.
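The bookkeeping behind non-uniform sampling can be sketched as picking a random subset of roughly 30% of the indirect-dimension increments; this is an illustration only, since real schedules are usually weighted toward early increments where the signal is strongest, and the time estimate below ignores hypercomplex acquisition:

```python
import random

def nus_schedule(n_increments, fraction=0.3, seed=0):
    """Return a sorted random subset of indirect-time increment indices."""
    rng = random.Random(seed)
    n_keep = max(1, round(n_increments * fraction))
    return sorted(rng.sample(range(n_increments), n_keep))

def measurement_time(n_increments, n_scans, recycle_delay, acquisition_time):
    """Rough experiment duration in seconds: one FID per scan per increment."""
    return n_increments * n_scans * (recycle_delay + acquisition_time)

# With 256 increments, keeping ~30% shortens the experiment accordingly:
t_uniform = measurement_time(256, 8, 1.0, 0.1)
t_nus = measurement_time(len(nus_schedule(256)), 8, 1.0, 0.1)
```

The shortened recycle delays of SOFAST/BEST experiments enter the same estimate through the `recycle_delay` parameter, which is why the two acceleration strategies combine multiplicatively.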
Alternatively, in automated projection spectroscopy (APSY) the chemical shifts for each peak are directly calculated from the projections without any reconstruction of the spectrum. For example, with projection spectroscopy one can acquire 6D experiments in three to four days.

Determination of Protein Dynamics
Proteins are not rigid, static entities, but rather exist as ensembles of conformations. The internal motions, which give rise to transitions between different structural states, are collectively referred to as protein dynamics and occur on a wide range of time scales (ps–s). Protein dynamics play crucial roles in protein function, particularly in the interaction with binding partners, in enzyme catalysis, and in allosteric regulation. Luckily, many NMR spectroscopic parameters depend both on the mobility of the whole molecule (translational or rotational motions) and on internal motions (transitions between different conformations or bond rotations), providing insight into a wide variety of dynamic


processes ranging from fast fluctuations (lasting picoseconds) to slower conformational changes (lasting microseconds or more). The measurement of relaxation parameters is possible for different nuclei. While 15N relaxation data of the amide group provide information about the backbone flexibility of each amino acid, relaxation measurements of side chain groups (especially the methyl groups of valine, isoleucine, and leucine) determine their respective mobility. The 15N relaxation measurements have the advantage that they can be performed on inexpensive, 15N-labeled protein. In addition, a simplified model can be used to analyze 15N relaxation data, because it is primarily the directly bound proton that determines the relaxation of the 15N spin. In this model three parameters characterize the dynamic behavior: the rotational correlation time and the amplitude and time scale of local motions. The correlation time τc describes the stochastic motion of the whole molecule in the form of rotational diffusion, which for proteins is on the order of several nanoseconds. The 15N relaxation measurements have been particularly successful in the identification and characterization of excited protein states that have populations as low as a few percent but are important for protein function and misfolding.

Thermodynamics and Kinetics of Protein–Ligand Complexes
NMR spectroscopy can elucidate many aspects of the interaction of proteins with small ligands, polypeptides and other proteins, or nucleic acids (DNA and RNA). It characterizes the dynamic, kinetic, and thermodynamic properties of protein–ligand complexes. Even without precise structural information about the ligand, it is possible to identify the residues of a protein involved in binding a ligand. To this end, 15N- or 13C-labeled protein is titrated with an NMR-invisible, unlabeled ligand and an HSQC is recorded at each titration point.
Due to ligand binding, some peaks display changes in chemical shift and line width, which allow for the identification of the involved residues. Furthermore, analysis of the chemical shift changes as a function of ligand concentration facilitates the determination of the respective dissociation constant. Detailed binding information can also be obtained when the protein cannot be labeled with 13C/15N. Under suitable conditions (weak binding of a small ligand to a large receptor), so-called transfer NOEs yield the receptor-bound conformation of a ligand even though the receptor is too large for NMR spectroscopy. Additionally, one can detect an interaction between a ligand and a receptor through saturation transfer, which results in intensity changes for the NMR peaks of the ligand. While the ligand is in large excess, the receptor is essentially invisible due to its dilution by a few orders of magnitude. This technique is particularly useful to screen ligands for certain target receptors. Additionally, NMR spectroscopy allows for the determination of translational diffusion, which depends on the size and shape of the molecule. Therefore, one can determine whether a protein is a monomer or a dimer, or whether it forms a complex with other proteins.

Protein Folding and Misfolding
NMR spectroscopic techniques also allow atomic-level insight into the folding and misfolding of proteins. In combination with sub-zero temperatures (down to −15 °C) or high pressures (up to 2 kbar) it is possible to determine the 3D structure of partially folded equilibrium intermediates, and to analyze the kinetics of protein folding pathways. In addition, through combinations of H/D exchange and other NMR methods one can follow the formation of hydrogen bonds within a stable secondary structure during the folding process. Because H/D exchange and folding are competitive reactions, one can determine the rate of secondary structure formation from the exchange rate of individual protons.
Alternatively, a so-called quenched-flow apparatus permits H/D exchange reactions only during certain time periods of the folding process. Thus, a time-resolved picture of the formation of secondary and tertiary structures becomes accessible. Furthermore, 15N relaxation dispersion as well as real-time NMR methods offer unique insight into the processes of folding and misfolding of proteins.

Intrinsically Disordered Proteins
Intrinsically disordered (also called natively unstructured) proteins (IDPs) lack a well-defined tertiary structure, but still fulfill critical biological functions as part of signal transduction cascades, in spliceosomes, or in cancer-associated processes. The degree of disorder ranges from completely unfolded proteins to mostly folded proteins that contain disordered regions of 30 residues or more. IDPs possess a very distinct amino acid composition with a high abundance of proline, polar, and charged amino acids (serine, glycine, glutamate, arginine). At the same time, they are depleted in the hydrophobic and aromatic amino acids that usually form the hydrophobic core of globular


proteins. Due to their dynamic nature, only NMR is capable of providing structural information about IDPs with atomic resolution. The low sequence complexity and the absence of stable structure result in similar chemical environments for each residue. Therefore, IDPs display a narrow chemical shift dispersion (especially for the amide protons) that results in severe signal overlap. Chemical shift degeneracy of Cα and Cβ further limits the application of standard triple-resonance experiments, because for each amino acid type the chemical shifts are close to the random-coil shift (Section 18.1.4, Triple-Resonance Experiments). Because the 15N (and 13C´) dimensions provide the highest dispersion in IDPs, sequential connectivities are established with 3D experiments that correlate two nitrogen spins (e.g., HNN or (H)CANNH). Alternative approaches rely on carbon detection, correlating the C´ spins with the directly bound nitrogen. For IDPs, the so-called NCO experiments provide improved resolution when compared to 15N-HSQCs, and also allow for the detection of proline residues. Due to the lower gyromagnetic ratio, however, carbon-detected experiments are significantly less sensitive than those that use proton detection. The disordered state imparts IDPs with favorable relaxation properties. Therefore, IDPs are amenable to high-dimensional 5D–7D experiments, which provide optimal resolution. In addition, the longer transverse relaxation times of IDPs enable the characterization of proteins significantly larger than 30 kDa. Even without deuteration it was possible to sequentially assign the two microtubule-associated proteins tau and MAP2c, both of which are larger than 45 kDa. IDPs exist as ensembles of rapidly interconverting structures and therefore provide only a few NOE distance restraints (often only short-range).
The secondary chemical shifts (Section 18.1.6, Determination of the Secondary Structure) obtained from the sequential assignment allow for the identification of transient secondary structure elements. Often these elements become stabilized upon binding to protein interaction partners. Long-range distance restraints to describe the structural ensemble can be obtained from RDCs and PREs (paramagnetic relaxation enhancements). To measure PREs, a paramagnetic spin label (e.g., a nitroxide) is attached to the IDP. The spin label enhances the relaxation of nearby residues (within ca. 25 Å), resulting in line broadening and thus in an intensity decrease. Normalizing the reduced intensity by the intensity of the respective residue in the absence of the spin label allows for the calculation of the distance between the nucleus and the spin label, which is of course an ensemble average. Sophisticated computer programs can utilize all of this structural information together with data from small-angle X-ray scattering to calculate ensemble structures.

Structure and Dynamics of High Molecular Weight Systems and Membrane Proteins
To extend the application of heteronuclear experiments to even larger proteins or protein complexes of several hundred kilodaltons, Kurt Wüthrich and colleagues developed a technique called TROSY (transverse relaxation-optimized spectroscopy) at the end of the last century. The TROSY technique achieves the mutual compensation of two relaxation mechanisms (the dipolar interaction and chemical shift anisotropy). Especially for large proteins, the resulting narrow line widths reduce the signal overlap and improve the sensitivity of the experiments. Due to its modularity, TROSY can be combined with standard 3D and triple-resonance experiments (e.g., NOESY-HSQC or HNCA) to enable a conventional sequential assignment. With this approach the structure of the 81 kDa protein malate synthase G was determined.
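The PRE intensity-ratio-to-distance conversion mentioned above can be sketched as follows. The sketch assumes the commonly quoted point-dipole relation Γ2 = K·J(τc)/r⁶ with the nitroxide constant K = 1.23 × 10⁻³² cm⁶ s⁻² and a simplified intensity model Ipara/Idia = R2·exp(−Γ2·t)/(R2 + Γ2); all parameter values are illustrative assumptions, not prescriptions:

```python
import math

K = 1.23e-32  # cm^6 s^-2, electron-1H dipolar constant for nitroxides (assumed)

def j_factor(tau_c, omega_h):
    """Spectral density factor 4*tau_c + 3*tau_c/(1 + (omega_h*tau_c)^2)."""
    return 4.0 * tau_c + 3.0 * tau_c / (1.0 + (omega_h * tau_c) ** 2)

def gamma2_from_distance(r_angstrom, tau_c, omega_h):
    """PRE rate (s^-1) for an electron-proton distance r (in angstrom)."""
    r_cm = r_angstrom * 1e-8
    return K * j_factor(tau_c, omega_h) / r_cm ** 6

def intensity_ratio(gamma2, r2_dia, t_transfer):
    """Simplified peak-intensity ratio I_para/I_dia."""
    return r2_dia * math.exp(-gamma2 * t_transfer) / (r2_dia + gamma2)

def distance_from_ratio(ratio, r2_dia, t_transfer, tau_c, omega_h):
    """Invert the intensity model by bisection on gamma2, then convert to r."""
    lo, hi = 1e-6, 1e6  # bracket for gamma2 in s^-1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if intensity_ratio(mid, r2_dia, t_transfer) > ratio:
            lo = mid  # too little broadening -> gamma2 must be larger
        else:
            hi = mid
    gamma2 = 0.5 * (lo + hi)
    r_cm = (K * j_factor(tau_c, omega_h) / gamma2) ** (1.0 / 6.0)
    return r_cm * 1e8  # back to angstrom
```

With the illustrative parameters τc = 4 ns, a 600 MHz proton frequency, R2 = 10 s⁻¹, and t = 10 ms, an intensity ratio of about 0.3 corresponds to roughly 15 Å; because of the r⁻⁶ dependence, distances beyond ~25 Å leave the intensity essentially unchanged, as stated in the text.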
In addition, the structure of α-helical membrane proteins, which are solubilized in detergent micelles, bicelles, or nanodiscs, can be determined using TROSY techniques. This facilitates pharmacological studies providing detailed insight into the interaction of drugs with membrane receptors. Large protein complexes give rise to vast numbers of peaks with a great potential for chemical shift degeneracies. Therefore, to minimize spectral crowding, TROSY experiments are restricted to the analysis of an isotopically labeled subunit in an otherwise unlabeled complex, or to complexes consisting of symmetrical subunits. To circumvent the problems associated with the multitude of signals in large systems, Lewis Kay and colleagues developed the methyl-TROSY approach. This method combines the advantages of the sharp lines from the TROSY with the limited number of signals provided by selective amino acid labeling (Section 18.1.5, Selective Amino Acid Labeling). Hence, TROSY-type relaxation experiments on protonated methyl groups (isoleucine, leucine, valine, alanine, methionine, or threonine) in an otherwise perdeuterated protein allowed for the functional analysis of the gating mechanism in the proteasome. For very large proteins (>200 kDa) the magnetization transfer through scalar coupling (also used in the TROSY-HSQC) becomes ineffective. For those proteins, the CRIPT and CRINEPT


techniques achieve an efficient magnetization transfer via cross-relaxation. Solid-state NMR represents an alternative to solution NMR methods for the study of high molecular weight systems. As the name implies, proteins are not analyzed in solution but as solid powders or microcrystals. "Magic-angle" spinning of the solid protein in specially designed rotors at several thousand Hertz results in line narrowing of the resonances. Theoretically no size limitation exists; in practice, however, the complexity of the spectra for large proteins restricts their analysis. Solid-state NMR mainly relies on the detection of 13C and 15N spins; efforts are underway to directly detect protons at very high spinning speeds (≥60 kHz). Due to the low sensitivity of the respective nuclei, solid-state experiments are often restricted to two dimensions. However, the development of dynamic nuclear polarization (DNP) techniques to enhance the sensitivity holds the promise of higher-dimensional experiments.

In-cell NMR Spectroscopy
For structural characterization proteins are highly purified, whereas in the cell the protein coexists with membranes, other cellular components, and thousands of other proteins at very high concentration (200–300 g l⁻¹). Therefore, it is of great interest to analyze the structure and dynamics of proteins in a cellular context. To address this need, NMR spectroscopy was applied to intact cells, so-called in-cell NMR. Initially, in-cell NMR methods were developed for bacterial cells, and the structures of small proteins could be solved. More recently the focus has also shifted to mammalian cells. The simplest approach for in-cell NMR is to cultivate the cells in media supplemented with isotopically labeled amino acids and to overexpress the target protein. However, metabolites often produce strong background signals and thus decrease the contrast in the resulting spectra.
To obtain clear spectra, protein delivery systems were developed that introduce isotopically labeled proteins (obtained from heterologous bacterial expression) into unlabeled cells. Good delivery efficiencies might be obtained by electroporation or Lipofectamine transfection. Currently, in-cell NMR measurements are limited to only a few hours due to acidification of the buffer and the resulting stress for the cells. Yet, this time enables the measurement of 2D HSQC spectra that can provide information about intracellular protein–protein interactions, post-translational modification, or protein dynamics. To circumvent some of the problems associated with mammalian cells, it may sometimes be more advantageous to work with cell lysates or cytoplasmic extracts. These extracts are easy to make and the reaction conditions can be better controlled, for example, when post-translational modifications are analyzed. Microinjection of proteins into large Xenopus laevis oocytes constitutes another alternative.

18.2 EPR Spectroscopy of Biological Systems

Olav Schiemann and Gregor Hagelüken
University of Bonn, Institute of Physical and Theoretical Chemistry, Wegelerstr. 12, 53115 Bonn, Germany

Electron paramagnetic resonance (EPR) or electron spin resonance (ESR) is a spectroscopic method that is used to obtain information on the chemical nature, structure, dynamics, and local environment of paramagnetic centers. Such centers are defined by having one or more unpaired electrons. In biological macromolecules these can be metal ions (e.g., Cu(II), Fe(III), Mn(II), Mo(V)), metal clusters (e.g., iron–sulfur or manganese clusters), or organic radicals. Organic radicals are formed, inter alia, as intermediates in electron transfer reactions of proteins (e.g., semiquinone anion, thiyl, or tyrosyl radicals) or they are induced by radiation damage in DNA molecules (e.g., sugar or base radicals) (Figure 18.34). Frequently, these centers are involved in catalytic cycles or in biologically relevant reactions. Diamagnetic biomolecules, in which all electrons are paired, can be made accessible to EPR spectroscopy by spin-labeling techniques. In particular, nitroxides can be site-specifically and covalently linked to biomolecules. This site-directed spin-labeling approach together with EPR-based distance measurements between spin labels can be used to study conformational changes, to obtain coarse-grained structures of a whole biomolecule, or to localize paramagnetic centers. Similar to NMR spectroscopy, which was described in the previous section, EPR spectroscopy is a magnetic resonance technique, in which the normally degenerate ms = ±½ levels of an electron spin s = ½ are split by an externally applied magnetic field, and transitions between


Figure 18.34 Examples of paramagnetic centers in biological systems. The wavy lines and –R represent the peptide chains, oligonucleotides, or cofactors. (a) Cu(II) in plastocyanin, (b) Fe4S4 clusters in ferredoxins, (c) Mo(V) in dimethyl sulfoxide reductases, (d) 3´-sugar radical in γ-irradiated DNA, (e) thymyl radical in γ-irradiated DNA, (f) benzosemiquinone anion radical, (g) tyrosyl radical in photosystem II, (h) thiyl radical in ribonucleotide reductases, (i) 1-oxyl-2,2,5,5-tetramethylpyrroline-3-acetylene (TPA), (j) (1-oxyl-2,2,5,5-tetramethylpyrroline-3-methyl) methanethiosulfonate (MTSL; nitroxide spin label for proteins), (k) 4,4-dimethyl-oxazolidine-3-oxyl-based nitroxide spin label for membranes.

these levels are induced by microwaves. The first continuous-wave (cw) EPR experiment was performed in 1944 by E.K. Zavoisky and the first pulsed EPR experiment in 1961 by W.B. Mims. The high technical requirements for pulsed EPR experiments, however, meant that pulsed EPR spectrometers only became commercially available in the late 1980s, about two decades later than for NMR. Since then, and in conjunction with the ongoing development of high frequency/high-field EPR spectrometers, computer-based EPR simulation programs, and quantum chemical methods for the translation of the EPR parameters into structural data, EPR spectroscopy has become increasingly important.

18.2.1 Basics of EPR Spectroscopy

At this point, the physical principles of EPR parameters will be briefly outlined. For a more in-depth treatment, a quantum mechanical description is inevitable; the reader is referred to the references in the "Further Reading" section at the end of this chapter.

Electron Spin and Resonance Condition
Since the Stern–Gerlach experiment it has been known that an unpaired electron has a quantum mechanical angular momentum, the so-called electron spin s. The length of the vector s is given by:

|s| = ħ√(s(s + 1))    (18.15)


Here s = ½ is the spin quantum number and ħ is Planck's constant divided by 2π. As with any charged particle with an intrinsic angular momentum, the electron spin is linked with a magnetic moment μe, which is oriented along the electron spin vector, but opposite to it due to its negative charge:

μe = −ge (e/2me) s    (18.16)

Here, e is the elementary charge, me the rest mass of the electron, and ge = 2.0023 is the g-factor of the free electron. In an external magnetic field B0, there are only two possible orientations of the spin, parallel or anti-parallel to the external field. The component of the electron spin in the direction of the magnetic field B0 (usually defined as the z-direction) is:

sz = ms ħ    (18.17)

The magnetic quantum number ms can take values of +½ and −½ for s = ½. Due to Heisenberg's uncertainty relation, the sx and sy components cannot be determined simultaneously with sz. From the orientation of the electron spin along the z-axis follows a corresponding orientation of the magnetic moment μe. Its z-component μe,z is given by:

μe,z = −ge μB ms    (18.18)

where μB is the Bohr magneton, which itself is given by:

μB = eħ/(2me) = 9.274 × 10⁻²⁴ J T⁻¹    (18.19)

Thus, the energy E of a magnetic moment that is oriented along the z-axis of the external magnetic field is:

E = −μe,z B0 = ge μB ms B0    (18.20)

While, without a magnetic field, the two spin orientations ms = −½ and ms = +½ are energetically degenerate, they are split by applying an external magnetic field (Zeeman splitting). The energies E of the two orientations are given by:

E(ms = +½) = +½ ge μB B0    (18.21)

E(ms = −½) = −½ ge μB B0    (18.22)

The energy difference ΔE between the two energy levels is proportional to the magnitude of B0 (Figure 18.35):

ΔE = ge μB B0    (18.23)

In a macroscopic sample with N spins, more spins are present in the low-energy ms = −½ spin state than in the ms = +½ state. The ratio of the occupation numbers N(ms = +½)/N(ms = −½) is described by the Boltzmann distribution:

N(ms = +½)/N(ms = −½) = exp(−ΔE/kT) = exp(−ge μB B0/kT)    (18.24)

Figure 18.35 (a) Zeeman splitting for an electron spin in a magnetic field B0. (b) The absorption line, which is obtained when the resonance condition is satisfied, and (c) the first derivative obtained by the field modulation of the absorption line.

For T = 300 K, B0 = 340 mT, and k = 1.3806 × 10⁻²³ J K⁻¹ (Boltzmann constant), a ratio of the occupation numbers of 0.999 can be calculated from this equation (NMR: 0.99999). Thus, at room temperature, the lower-energy ms = −½ state is populated slightly more. Using electromagnetic radiation, spins can be excited from the ms = −½ level into the ms = +½ level when the energy of the radiation is equal to ΔE and thereby satisfies the resonance condition:

hν = ΔE = ge μB B0    (resonance condition)    (18.25)
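These numbers are easy to check with the constants quoted in the text (a small verification sketch; the constants are the rounded values used above):

```python
import math

g_e = 2.0023       # g-factor of the free electron
mu_B = 9.274e-24   # Bohr magneton, J/T (Eq. 18.19)
k = 1.3806e-23     # Boltzmann constant, J/K
h = 6.62607e-34    # Planck constant, J s

B0 = 0.340         # external field, T (340 mT)
T = 300.0          # temperature, K

delta_E = g_e * mu_B * B0             # Zeeman splitting, Eq. (18.23)
ratio = math.exp(-delta_E / (k * T))  # occupation ratio, Eq. (18.24)
nu = delta_E / h                      # resonance frequency, Eq. (18.25)

# ratio comes out near 0.9985 (the text rounds this to 0.999), and
# nu is close to 9.5 GHz, i.e., the X-band frequency used in cw-EPR.
```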

Here, as in NMR spectroscopy, the magnetic field component of the electromagnetic radiation interacts with the magnetic moment of the electron spin.

18.2.2 cw-EPR Spectroscopy

In cw-EPR spectrometers, microwaves at a constant frequency are continuously applied to the sample and the external magnetic field is swept. The experiment is set up such that at


some point during this process the resonance condition is met. Most commonly, cw-EPR spectrometers operating at a microwave frequency of about 9.5 GHz (X-band) are used, so that for a radical with g ≈ 2 the resonance absorption occurs at a magnetic field of about 340 mT. Since the ratio of the occupation numbers is close to one, the absorption signal is of relatively low intensity. In cw-EPR spectrometers, the sensitivity is increased by adding a small modulated magnetic field component to the external magnetic field, which enables the lock-in detector to filter out noise. This field modulation is also the reason that the absorption signal is obtained in the form of its first derivative (Figure 18.35). In addition, the sensitivity can be further improved by making measurements at low temperatures and stronger magnetic fields (increase of the occupation ratio). To indicate the position of the line independently of the magnetic field and the microwave frequency, the position of the absorption line is expressed as a g-value (similar to the chemical shift in NMR spectroscopy):

g = hν/(μB B0) = 7.144775 × 10⁻² · ν/B0    (ν in MHz, B0 in mT)    (18.26)
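As a quick consistency sketch, both forms of this g-value expression can be evaluated for a typical X-band measurement (9500 MHz at 340 mT; the constants are the rounded values used in this section):

```python
h = 6.62607e-34   # Planck constant, J s
mu_B = 9.274e-24  # Bohr magneton, J/T

def g_value(nu_hz, b0_tesla):
    """g = h*nu / (mu_B * B0), directly from the resonance condition."""
    return h * nu_hz / (mu_B * b0_tesla)

def g_value_practical(nu_mhz, b0_mt):
    """Same quantity using the numerical prefactor of Eq. 18.26
    (nu in MHz, B0 in mT)."""
    return 7.144775e-2 * nu_mhz / b0_mt

g1 = g_value(9.5e9, 0.340)
g2 = g_value_practical(9500.0, 340.0)
# Both give g close to 2, as expected for an organic radical at X-band.
```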

18.2.3 g-Value


Figure 18.36 Splitting scheme for an unpaired electron (blue), which is strongly coupled to a 1 H nucleus (I = ½, black); Aiso is assumed to be positive. The dashed lines show the transitions, which are allowed by the EPR selection rules (Δms = ±1 and ΔmI = 0). Thus, in the EPR spectrum two lines, centered around giso and separated by Aiso, would be observed.

For a free electron only one line would be observed, with g = ge = 2.0023. However, if the unpaired electron resides in a molecule, deviations from ge occur due to spin–orbit coupling. These deviations are characteristic for the electronic state, the bonding situation, and the geometry of the particular molecule. For organic radicals, these deviations are usually small, since their orbital magnetic moment is nearly zero. The nitroxide TPA (Figure 18.34i), for instance, has a g-value of 2.006. For transition metal ions, however, significantly larger deviations can be observed; Fe+ in an MgO matrix, for example, has a g-value of 4.1304. The g-values can thus be used to characterize and distinguish different paramagnetic species. A quantitative calculation, or a translation of observed g-values into structural parameters with, for example, density functional theory (DFT) methods, is, however, still complicated, especially for transition metal ions, and is the subject of current research.

18.2.4 Electron Spin–Nuclear Spin Coupling (Hyperfine Coupling)

In addition to the spin–orbit coupling, the magnetic moment of the electron spin can couple to the magnetic moments of nuclei (Figure 18.36) if the nuclear spin I is greater than zero (e.g., 1H, 14N, 31P, 13C, 17O, 33S, 55Mn, 95Mo, 57Fe, 51V). The coupling of the electron and nuclear magnetic moments leads to a splitting of the absorption line into M = 2NI + 1 lines (multiplicity rule). Here, N is the number of equivalent magnetic nuclei and I is their nuclear spin quantum number. The magnitude of the splitting, called the hyperfine coupling constant Aiso, depends linearly on the magnetic moment μI of the coupling nucleus and the spin density |ψ(r = 0)|² of the unpaired electron at the nucleus (Fermi contact interaction):

Aiso ∝ μI |ψ(r = 0)|²    (18.27)

Since only s-orbitals have a non-zero probability density at the nucleus, while the unpaired electron in most radicals resides in a p-, π-, or d-orbital, the question arises as to why such radicals show a hyperfine coupling in the first place. The reason is spin polarization, a mechanism that generates spin density in low-lying s-orbitals (Figure 18.37). If an EPR spectrum of a radical is obtained with resolved hyperfine structure, the spin density distribution can be determined from the hyperfine coupling constants, and thus statements regarding the nature and structure of the radical can be made. An example is the cw-EPR spectrum of the nitroxide TPA (Figure 18.38a). As in all alkyl nitroxides, the unpaired electron resides in a π-orbital between the nitrogen and oxygen atoms. At both nuclei, spin density is generated through spin polarization. However, since the most abundant oxygen isotope, 16O, has a nuclear spin of I = 0, no line splitting is induced by this nucleus. In contrast, nature's most abundant nitrogen isotope, 14N, has a nuclear spin of I = 1, so the absorption line is split into a triplet with a hyperfine coupling constant

Figure 18.37 Mechanism of spin polarization: the unpaired electron with α-spin in the p-orbital has its highest probability at a relatively large distance from the nucleus. According to Hund's rule, the s-electron whose spin (α) is parallel to that of the unpaired electron in the p-orbital is also preferentially far away from the nucleus. Due to the Pauli principle, the second s-electron, close to the nucleus, must then be antiparallel to the first s-electron (β). In this way, excess β-spin density is induced at the nucleus, since the probability of finding α- and β-s-electrons at the nucleus is no longer equal.


Part II: 3D Structure Determination

Figure 18.38 (a) cw-X-band EPR spectrum of TPA in liquid solution. (b) Model for an electron transfer chain. The reduction of Co(III) to Co(II) proceeds by an intramolecular electron transfer: the electron is first transferred from the Cr(II) to the pyrazine and only then to the Co(III). The pyrazine radical could be detected by EPR spectroscopy on the basis of its hyperfine signature and g-value. Often cw-EPR spectra are shown without an x-axis and only a scale bar is given. Source: Spiecker, H. and Wieghardt, K. (1977) Inorg. Chem., 16, 1290–1294. With permission, Copyright © 1977, American Chemical Society.

Aiso = 1.4 mT = 39.2 MHz. The additionally observed small lines on the low- and high-field side of each 14N line are caused by 13C hyperfine coupling to one of the directly bonded carbon nuclei (I = ½) or to one of the carbon nuclei of the methyl groups. This hyperfine coupling of 0.6 mT splits each of the three lines into a doublet. However, due to the low natural abundance of 13C (1.1%), the intensity of this triplet of doublets is low, so that the 14N triplet dominates the EPR spectrum. In principle, each of the three 14N lines would have to be split by six 13C nuclei; however, the probability that multiple 13C isotopes occur within one molecule is so small that these couplings are not observed. Although the question of the spin density distribution may appear merely academic for a nitroxide, it is important for cofactors in electron transfer proteins. For example, to understand electron transfer mechanisms in proteins, it is crucial to know whether the transported electron can reside on a cofactor in an electron transfer pathway, or which part of a cofactor acts as the electron acceptor or electron donor (Figure 18.38b).
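The multiplicity rule M = 2NI + 1 and the TPA line pattern can be illustrated with a short numerical sketch (hypothetical helper code, not part of the original text; the X-band center field of 339 mT is an assumed value, and relative intensities are ignored):

```python
from itertools import product

def stick_positions(center_mT, couplings):
    """Stick-spectrum line positions for isotropic hyperfine couplings.

    couplings: list of (A_mT, I, N) tuples -- coupling constant in mT,
    nuclear spin quantum number I, and number N of equivalent nuclei.
    Returns the sorted, distinct field positions in mT (intensities ignored).
    """
    positions = {center_mT}
    for a, i, n in couplings:
        # each of the N equivalent nuclei contributes m_I in {-I, ..., +I}
        m_values = [-i + k for k in range(int(2 * i) + 1)]
        positions = {round(pos + a * sum(ms), 6)
                     for pos in positions
                     for ms in product(m_values, repeat=n)}
    return sorted(positions)

# 14N (I = 1, Aiso = 1.4 mT) triplet of TPA around an assumed 339 mT
triplet = stick_positions(339.0, [(1.4, 1, 1)])
print(triplet)       # [337.6, 339.0, 340.4] -> M = 2NI + 1 = 3 lines

# one 13C (I = 1/2, 0.6 mT) additionally splits each line into a doublet
satellites = stick_positions(339.0, [(1.4, 1, 1), (0.6, 0.5, 1)])
print(len(satellites))   # 6 lines: a triplet of doublets
```

For two equivalent I = 1 nuclei the same function returns 2·2·1 + 1 = 5 positions, reproducing the multiplicity rule for N > 1.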

18.2.5 g and Hyperfine Anisotropy

The considerations in the preceding sections are based on radicals in liquid solution, that is, the radicals rotate so fast compared to the EPR time scale (more precisely, compared to the size of the anisotropy of the interaction) that they do not have a fixed preferential orientation with respect to the external magnetic field B0. Spectra acquired under such conditions are called isotropic. If the sample is frozen, a powder, or a single crystal, each molecule has a fixed orientation with respect to B0 and the EPR spectra are characterized by orientation-dependent (anisotropic) contributions. Such contributions can be found for g, the hyperfine coupling, and the coupling between unpaired electrons (Section 18.2.6). Although the occurrence of these anisotropic contributions in an EPR spectrum renders its analysis more difficult, they also offer more detailed information about the nature and structure (angles and distances) of the paramagnetic system. For this reason, as well as to increase the sensitivity and to extend the lifetime of short-lived radicals, most EPR experiments on biological systems are performed in frozen solution at very low temperatures (down to, for example, 3.5 K).

g Anisotropy If the unpaired electron resides in an s-orbital, for which, due to its spherical symmetry, the three spatial directions x, y, and z are equivalent, then gx = gy = gz. In such a spherically symmetric case, and in the absence of hyperfine interactions, one observes a single line at giso even in the solid state. Anisotropy in g occurs when the unpaired electron resides in an orbital of the molecule that is not spherical. In the axially symmetric case (two equivalent directions in space), a spectrum as shown in Figure 18.39a with two canonical g-values (g⊥ and g∥) is obtained. In the orthorhombic case all three g-values are different (gx ≠ gy ≠ gz) and one obtains a spectrum similar to the one shown in

18 Magnetic Resonance Spectroscopy of Biomolecules

Figure 18.39b. The indices x, y, and z stand for the three canonical directions in space, meaning that g changes with the orientation of the molecule with respect to B0. In liquid solution, however, when the molecule rotates rapidly, the anisotropic components of g are averaged out such that:

giso = (gx + gy + gz)/3   (18.28)

or, in the axial case:

giso = (2g⊥ + g∥)/3   (18.29)

Figure 18.39 (a) Axial cw-X-band EPR spectrum. The first derivative is given at the bottom and, for clarity, the absorption spectrum at the top. The absorbance at g = gz = g∥ results from molecules that are oriented with g∥ parallel to B0, while at gx = gy = g⊥ only those molecules absorb that are oriented with g⊥ parallel to B0. Between these two points, only those molecules contribute to the absorption that have neither g∥ nor g⊥ parallel to B0. An example of an axially symmetric molecule with resolved g∥ = 2.00 and g⊥ = 5.67 is the high-spin Fe(III) porphyrin of cytochrome P450. Source: from Woggon, W.-D. et al. (1998) Angew. Chem., 110, 3191–3195. With permission, Copyright © 1998 WILEY-VCH Verlag GmbH, Weinheim, Germany. (b) Orthorhombic cw-X-band EPR spectra. A system with orthorhombic symmetry and g-values at g1 = 2.196, g2 = 2.145, and g3 = 2.010 obtained from the Ni center in the [NiFe] hydrogenase. Source: Foerster, S. et al. (2005) J. Biol. Inorg. Chem., 10, 51–62. With permission, Copyright © 2004, SBIC. The assignment of x, y, and z to the corresponding g-values is possible by measurements on single crystals.
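Equations 18.28 and 18.29 amount to simple averages of the principal g-values; a minimal numerical check (the g-values are those quoted in the Figure 18.39 examples):

```python
def g_iso(gx, gy, gz):
    """Isotropic g-value as the average of the principal values, Eq. (18.28)."""
    return (gx + gy + gz) / 3

def g_iso_axial(g_perp, g_par):
    """Axial special case, Eq. (18.29): gx = gy = g_perp, gz = g_par."""
    return (2 * g_perp + g_par) / 3

# orthorhombic Ni center of the [NiFe] hydrogenase (Figure 18.39b)
print(round(g_iso(2.196, 2.145, 2.010), 3))    # 2.117

# axial high-spin Fe(III) porphyrin of cytochrome P450 (Figure 18.39a)
print(round(g_iso_axial(5.67, 2.00), 2))       # 4.45
```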

High Field/High-Frequency EPR Spectrometer In many cases, especially for organic radicals, the three g-values are not separated at a magnetic field of 340 mT/microwave frequency of 9.5 GHz (X-band) because of the relatively small g-anisotropy. The g-anisotropy can, however, be resolved by using higher frequencies/fields (Figure 18.40). This exploits the fact that the separation of the g-values (measured in magnetic field units) increases with the magnetic field, whereas the size of the line splitting due to the hyperfine coupling is independent of the magnetic field. Commercially available EPR spectrometers working at higher frequencies are those operating at 36 GHz (Q-band)/1.3 T, 95 GHz (W-band)/

Figure 18.40 Magnetic-field-dependent splitting of gx, gy, and gz, shown for the cw-EPR spectra of a semiquinone anion radical in X-band (9.5 GHz/0.34 T) and G-band (180 GHz/6.4 T). The g-values themselves do not change; only the distance (in mT) between them increases.


3.4 T, or 260 GHz/9.2 T. Spectrometers with even higher frequencies (360 GHz, 640 GHz, and in the THz range) are also used, but are technically very demanding and not yet commercially available. However, the technical effort is worthwhile, not only for the resolution of the g-values, but also because of the following advantages:

- Superimposed spectra of different radicals can be separated by their different g-values.
- The sensitivity of the spectrometer increases at higher magnetic field/frequency (due to the larger population difference), which means that smaller sample amounts are needed (e.g., at 180 GHz, 0.1 μl of a 1 mM Mn(II) solution is sufficient).
- If the same sample is measured at different microwave frequencies, the magnetic-field-independent hyperfine splitting can be separated from the magnetic-field-dependent g-splitting (in magnetic field units).

Figure 18.41 Illustration of the distance vector r and the angle θ. A and B can be either an electron and a nucleus or two electrons (Section 18.2.6).
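These points can be estimated with a few lines of Python (a sketch; the comparison of g = 2.0023 and g = 2.0060 mirrors the free-electron and TPA values quoted earlier, and the temperature of 10 K is an assumed measurement condition):

```python
import math

H = 6.62607015e-34       # Planck constant, J s
MU_B = 9.2740100783e-24  # Bohr magneton, J/T
K_B = 1.380649e-23       # Boltzmann constant, J/K

def resonance_field(nu_hz, g):
    """Resonance field B0 = h*nu / (g*mu_B) in tesla."""
    return H * nu_hz / (g * MU_B)

def polarization(nu_hz, temp_K):
    """Relative electron spin population difference, tanh(h*nu / 2kT)."""
    return math.tanh(H * nu_hz / (2 * K_B * temp_K))

for nu in (9.5e9, 95e9, 180e9):   # X-, W-, and G-band
    dB = abs(resonance_field(nu, 2.0023) - resonance_field(nu, 2.0060))
    print(f"{nu/1e9:5.1f} GHz: B0 = {resonance_field(nu, 2.0023):.3f} T, "
          f"g-splitting = {dB*1e3:.2f} mT, "
          f"polarization at 10 K = {polarization(nu, 10):.3f}")
```

The g-splitting (in mT) grows linearly with the frequency, while the Boltzmann polarization, and with it the sensitivity, also increases at higher fields.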

Hyperfine Anisotropy An anisotropic hyperfine coupling component Ai, observed in a spectrum, is the sum of the isotropic hyperfine coupling constant Aiso and an anisotropic, purely dipolar component Ai,dip. The subscript i stands for one of the three spatial directions x, y, or z, and means that the hyperfine coupling varies with the spatial direction, that is, with the orientation of the radical relative to B0:

Ai = Aiso + Ai,dip   (18.30)

The anisotropic portion of the hyperfine coupling arises from the dipole–dipole coupling between the magnetic moments of the electron and the nucleus. It depends both on the distance r between the electron and the nucleus and on the angle θ between the distance vector r and the external magnetic field B0 (Figure 18.41). Depending on the symmetry of the paramagnetic center, three special cases of the hyperfine coupling are distinguished: spherical symmetry with Ax = Ay = Az, axial with A⊥ (= Ax = Ay) and A∥ (= Az), and orthorhombic with Ax ≠ Ay ≠ Az (Figure 18.42). In liquid solution the dipolar hyperfine coupling components cancel each other out, so that:

Aiso = (Ax + Ay + Az)/3   (18.31)

or, in the axial case:

Aiso = (2A⊥ + A∥)/3   (18.32)

18.2.6 Electron Spin–Electron Spin Coupling

If, within a biomolecule, two unpaired electrons A and B are located at a fixed distance r from each other, a coupling νAB between these two electrons can occur. As for the hyperfine coupling, this electron–electron coupling is the sum of an isotropic and an anisotropic component:

νAB = J + D   (18.33)

The isotropic part, J, is the exchange interaction, which is non-zero when the wave functions of both electrons overlap (r < 10 Å) or if the two electrons interact via a conjugated bridge

Figure 18.42 Hyperfine and g anisotropy for the example of the cw-G (180 GHz/6.4 T)-band EPR spectrum of TPA in frozen solution. The Ax and Ay hyperfine couplings are not resolved.


(super exchange). It can be observed in liquid solution and allows one to draw conclusions as to whether the two electron spins are aligned parallel (ferromagnetic) or antiparallel (antiferromagnetic) to each other. The anisotropic component D of the coupling is based on the interaction between the magnetic dipoles of the two electrons and is orientation-dependent. It can only be observed when the biomolecules have a fixed orientation with respect to B0 (frozen solution, powder, or single crystal); in liquid solution, the anisotropic component is averaged out:

D = νdip (1 − 3 cos²θ)   (18.34)

νdip = (μ0 μB² gA gB)/(4πh rAB³)   (18.35)

where: θ is the angle between the distance vector r and the external magnetic field B0 (Figure 18.41); gA and gB are the g-values of the two electrons; μ0 is the vacuum permeability; μB is the Bohr magneton. The prefactor νdip in Equation 18.35 is therefore a constant for a given distance rAB. In anisotropic cw-EPR spectra, the electron–electron dipolar coupling manifests itself as an additional line splitting if it is larger than the line width. If the dipolar coupling constant νdip can be determined from a spectrum, the distance between the two unpaired electrons can be calculated. Using cw-EPR spectra, distances of up to 20 Å can be determined in this way. At greater distances, the dipolar coupling is obscured by the intrinsic linewidth and pulsed methods can be used (see also Section 18.2.7). Such distance measurements are important for determining the arrangement of paramagnetic centers in biological macromolecules (Figure 18.43). Similarly, the distance between two subunits of a protein can be measured by site-directed spin labeling of the subunits with two nitroxides and subsequent determination of the dipolar coupling between the two spin labels. By varying the spin label positions and measuring the distances between the respective pairs, it is possible to deduce the spatial arrangement of the subunits. In the same way one can examine whether the binding of ligands (small organic molecules, metal ions, proteins, RNA, DNA) leads to global structural changes.
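Equation 18.35 can be sketched numerically for two g ≈ 2 electrons (an illustration with CODATA constants; the 6.8 MHz Pake frequency used below anticipates the DNA example discussed in Section 18.2.7):

```python
import math

MU_0 = 1.25663706212e-6   # vacuum permeability, N/A^2
H = 6.62607015e-34        # Planck constant, J s
MU_B = 9.2740100783e-24   # Bohr magneton, J/T

def nu_dip_hz(r_m, gA=2.0023, gB=2.0023):
    """Dipolar coupling constant of Eq. (18.35) in Hz for a distance r in m."""
    return MU_0 * gA * gB * MU_B**2 / (4 * math.pi * H * r_m**3)

def distance_m(nu_hz, gA=2.0023, gB=2.0023):
    """Invert Eq. (18.35): distance from the theta = 90 deg Pake frequency."""
    return (MU_0 * gA * gB * MU_B**2 / (4 * math.pi * H * nu_hz)) ** (1 / 3)

# for two g ~ 2 electrons the coupling is ~52 MHz at 1 nm separation
print(round(nu_dip_hz(1e-9) / 1e6, 1))      # 52.0 (MHz)

# a 6.8 MHz line read off a Pake pattern corresponds to roughly 19.7 A
print(round(distance_m(6.8e6) * 1e10, 1))   # 19.7 (Angstrom)
```

The 1/r³ dependence explains why cw line splittings vanish into the linewidth beyond about 20 Å: at 2 nm the coupling is already below 7 MHz (≈0.25 mT).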

18.2.7 Pulsed EPR Experiments

With cw-EPR methods, strongly coupled nuclei in the vicinity of the unpaired electron can be characterized. Pulsed EPR methods additionally allow the resolution of weaker couplings to more distant nuclei or distant unpaired electrons, which in the cw spectrum are hidden underneath the broad spectral lines. Other advantages of pulsed experiments are that the pulse sequences allow the selection of those contributions to the spectrum that one would like to analyze and that multidimensional experiments are possible. In particular, with experiments such as ESEEM (electron spin echo envelope modulation) and pulse ENDOR

Figure 18.43 Example of a biological system with electron–electron dipolar coupling. (a) Arrangement of the binuclear CuA center and the Mn(II) ion in cytochrome c oxidase from Paracoccus denitrificans. (b) cw-W-band (94 GHz/3.3 T) EPR spectra of the Mn(II) center. The Mn(II) ion (S = 5/2 and I = 5/2) leads to a W-band cw-EPR spectrum with six lines. In this case, the CuA center has been switched into the diamagnetic S = 0 state, so that it cannot be detected by EPR spectroscopy. If the sample conditions are then adjusted such that the CuA center turns into the S = ½ state, a cw-EPR spectrum is observed in which each manganese line is split into a doublet. From the magnitude of the splitting (2.2 mT) the dipolar coupling constant was determined to be 3.36 mT, and with this a CuA–Mn distance of 9.4 Å could be calculated. The paramagnetic CuA center itself was not detected directly. Source: Käß, F. et al. (2000) J. Phys. Chem. B, 104, 5362–5371. With permission, Copyright © 2000, American Chemical Society.


(electron nuclear double resonance), detailed conclusions about the structural environment of the unpaired electron can be drawn (within a radius of up to 10 Å). With PELDOR (pulsed electron–electron double resonance), even distances of up to 8 nm between two electron spins can be measured.

Larmor frequency: the frequency with which an electron or nuclear magnetic moment precesses about the z-axis, the direction of an externally applied magnetic field. The name commemorates the Irish physicist Sir Joseph Larmor.

Figure 18.44 (a) Illustration of the spin orientation relative to B0. (b) Illustration of the magnetization M (big arrow). The black arrows represent the magnetic moments μe,z of the electron spins.

Basics The basics of pulsed EPR spectroscopy are comparable to those of pulsed NMR spectroscopy (Section 18.2.1). For EPR, these considerations must simply be transferred to the electron spin. In Section 18.2.1 it was shown that, for a given electron spin, the length of the s-vector and its z-component are defined, while the x and y components are undefined. Note that the magnetic moment of the spins is thus not exactly parallel to B0 (Figure 18.44a) – even if we say so in the following. However, the presence of the magnetic field leads to the electron spins precessing on cones parallel and anti-parallel to the B0 magnetic field axis (Figure 18.44b). The magnetic-field dependent frequency of this precession is known as the Larmor frequency. Since, according to the Boltzmann distribution, more spins are oriented with their magnetic moments μe parallel to B0, the sample experiences a macroscopic magnetization M parallel to B0.

Pulses To generate microwave pulses, the microwave radiation is turned on and, after a short period of time, quickly turned off again. This corresponds to a rectangular microwave pulse. In EPR spectroscopy, such pulses typically have lengths in the range of nanoseconds (NMR: microseconds). Because of the very short pulse lengths, the pulses do not have a single microwave frequency (as the microwaves in the cw experiment do), but cover a certain range of frequencies (Heisenberg uncertainty principle). It follows that, at a constant magnetic field, a larger part of the EPR spectrum can be excited. In the following, 90°- and 180°-pulses designate pulses that rotate the electron spins by 90° or 180°. Here, all pulses are irradiated along the x-axis and the magnetization is rotated clockwise. Both the pulse width and the pulse amplitude affect the magnitude of the rotation angle. Thus, a 180° pulse has either twice the length or twice the microwave field amplitude (i.e., four times the power) of a 90° pulse.

If a 90° pulse is applied to a sample, this pulse interacts with the spins and rotates M from the +z- to the +y-axis (Figure 18.45). In Section 18.2.1 it was stated that spins can only align parallel or antiparallel to the field. This raises the question of how the individual spins must be aligned to cause a magnetization in the +y direction. This can be illustrated in the following way: the 90° pulse induces an equal population of spins on both energy levels, making the individual magnetic moments along the z-axis add up to zero. Simultaneously, the spins are no longer uniformly distributed on the two precession cones but form spin packets in the direction of the +y-axis. Since the spins in these packets have the same orientation and speed, this phenomenon is called phase coherence (Section 18.1.1 and Figure 18.5). The spin packets are not static along the +y-axis, but precess at their Larmor frequency around the z-axis. In contrast to the cw experiment, in which the absorption of the microwave is detected, this precession of the spins induces a current, that is, a signal, in the detector, which can then be recorded.

Relaxation In general, relaxation means the return from the excited state to the ground state. Due to random spin–spin interactions, the phase coherence described above is lost with time, so that the spins are again evenly distributed in the two precession cones. This loss of phase coherence is called T2 relaxation. For electron spins, T2 is in the range of nano- to microseconds. At the same time, due to energy exchange processes with the environment, the spins return to the Boltzmann equilibrium distribution between the two energy states. This relaxation is called T1 relaxation; for electron spins it lies in the range of micro- to milliseconds.

Spin Echoes For technical reasons, for most samples no free induction decay (FID, Section 18.1.2) can be observed after the 90° pulse in EPR spectroscopy. Instead, as a


workaround, a so-called spin echo is detected. Below, a simple two-pulse sequence is described that generates such a spin echo (Figure 18.46). At a time τ after the 90° pulse, the spin packets have traveled different distances from the y-axis due to their differing rotational velocities (Figure 18.45c). The different rotation speeds are the result of the slightly different magnetic environments, and thus Larmor frequencies, of the spins. If the sample is now irradiated with a 180° pulse, the spins (Figure 18.45d) are rotated by 180° around the x-axis. Since they maintain their direction and speed of rotation, the spins start to re-phase (Figure 18.45e) and form a magnetization, that is, a spin echo, along the y-axis after the time 2τ (Figure 18.45f). In honor of its discoverer Erwin Hahn, this echo is called the Hahn echo. A different spin echo can be generated with a three-pulse sequence (Figure 18.47). With a 90° pulse, the magnetization is rotated into the x/y plane; after a short time τ, a second 90° pulse follows. By this second 90° pulse, the x/y magnetization is stored along the z-axis. With a third 90° pulse after a time interval T, the magnetization is detected as a so-called stimulated echo.

ESEEM – Electron Spin Echo Envelope Modulation If the time interval τ of the pulse sequence shown in Figure 18.46 is gradually increased, and the amplitude of the Hahn echo is detected at each step, one observes an oscillating echo amplitude, which is known as electron spin echo envelope modulation (ESEEM). The cause of this modulation is transitions between the nuclear spin levels. A plot of the echo amplitude versus τ yields a time trace from which the corresponding oscillation frequencies can be obtained by Fourier transformation. From these frequencies, weak couplings of nuclei, which are too small to be observed by cw-EPR, can be determined, and thus more detailed information about the structure of the environment of the paramagnetic center can be obtained.
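The step from time trace to frequencies can be mimicked with synthetic data (an illustration only: the two modulation frequencies are borrowed from the ubisemiquinone example discussed in this section, while the modulation depths and the T2 of 2 μs are invented):

```python
import numpy as np

# hypothetical two-pulse ESEEM time trace: a T2 decay envelope whose
# echo amplitude is modulated by two nuclear frequencies
dt = 8e-9                                  # tau increment: 8 ns
tau = np.arange(4096) * dt
envelope = np.exp(-tau / 2e-6)             # assumed T2 of 2 us
trace = envelope * (1 + 0.2 * np.cos(2 * np.pi * 2.32e6 * tau)
                      + 0.1 * np.cos(2 * np.pi * 5.2e6 * tau))

# in practice the decay is removed by fitting a baseline; here the
# envelope is known exactly, so it can simply be divided out
modulation = trace / envelope - 1
spectrum = np.abs(np.fft.rfft(modulation))
freqs = np.fft.rfftfreq(len(modulation), d=dt)

peak = freqs[np.argmax(spectrum)]
print(f"strongest modulation frequency: {peak / 1e6:.2f} MHz")  # ~2.32 MHz
```

The Fourier transform recovers both nuclear frequencies as peaks; the deeper 0.2 modulation dominates the spectrum.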
A disadvantage of this pulse sequence is that, due to the rapid T2 relaxation, the line width is very large, resulting in overlapping lines. In addition, higher harmonic frequencies occur, and the observed redundant sum and difference frequencies make it difficult to interpret the spectra. These problems can be avoided by using the stimulated-echo-based three-pulse ESEEM. In this experiment, the time interval T in Figure 18.47 is gradually increased and the echo amplitude of the stimulated echo is recorded as a function of T. Because the amplitude of the stimulated echo decays with the slower relaxation time T1, a longer time window for monitoring the oscillation is available and thus the obtained frequency lines are narrower than with the two-pulse ESEEM. An example of a three-pulse ESEEM is an experiment on the electron transfer protein bo3 ubiquinol oxidase from Escherichia coli. During the electron transfer reaction, the ubiquinone is converted into a ubisemiquinone anion radical (UB•), which enabled an EPR investigation into how the UB• is structurally bound. In Figure 18.48 the corresponding three-pulse ESEEM is shown in both the time and the frequency domain. In the Fourier-transformed spectrum there are four lines, at 0.95, 2.32, 3.27, and 5.2 MHz, which were assigned to a 14N nucleus near the UB•. The occurrence of these four lines can be explained as follows: since 14N has a nuclear spin of I = 1, each of the two electron spin states ms = ±½ splits into three levels due to the 14N hyperfine coupling (Figure 18.49). If the nuclear Zeeman interaction (the interaction between the magnetic moment μI of the nuclear spin and B0) is half as large as the hyperfine coupling, then both interactions cancel out in one of the two ms levels. The energy separation in this ms level is then determined by the quadrupole interaction that occurs for nuclei with I > ½. This quadrupole interaction is defined by the quadrupole coupling constant Q and the asymmetry parameter η.
Both parameters are very sensitive to the distribution of charge in a molecule and to the molecular structure. Between the three nuclear spin levels in each of the two ms levels there are three nuclear spin transitions. The three transitions ν0, ν+, and ν− are particularly intense in three-pulse ESEEM. The double-quantum transition νdq from the other ms level, however, is of low intensity and broad, and the two single-quantum transitions νsq1 and νsq2 are often not detected. From the frequencies belonging to the four lines, the isotropic 14N hyperfine coupling constant, the 14N quadrupole coupling constant, and the asymmetry parameter η could be calculated. With these data and from the two-dimensional spectrum (Figure 18.51b below) and

Figure 18.45 (a) Magnetization M along B0; x, y, and z are Cartesian coordinates and B0 is oriented along z; (b) M after the 90° pulse; (c) dephasing during the time interval τ; (d) inversion of the spins by the 180° pulse; (e) refocusing of the spins in the time interval τ following the 180° pulse; (f) Hahn-echo at time 2τ after the 90° pulse.


Figure 18.46 Two-pulse sequence to generate a Hahn echo (HE).

Figure 18.47 Three-pulse sequence for generating a stimulated echo (SE).

Figure 18.48 Three-pulse ESEEM spectrum and structural formula of UB• in the QH-binding pocket of the ubiquinol oxidase. (a) Time domain spectrum and (b) frequency domain spectrum after Fourier transformation of (a). Source: reproduced with permission from Grimaldi, S. et al. (2001) Biochemistry, 40, 1037–1043. Copyright © 2003, American Chemical Society.

Figure 18.49 Splitting scheme for a coupled spin system with S = ½ and I = 1, for the case that the hyperfine interaction Aiso and the nuclear Zeeman interaction νN cancel each other out (Aiso = 2νN).

isotope labeling it was concluded that the UB• is bound only with the C1 carbonyl group via a strong hydrogen bond to a nitrogen atom in the amino acid backbone while the other carbonyl group is not bound to the protein. This finding of an asymmetric binding of UB• helped in understanding the directional electron transfer in this protein.

Figure 18.50 HYSCORE pulse sequence.

HYSCORE – Hyperfine Sublevel Correlation Experiment The hyperfine sublevel correlation experiment (HYSCORE) is a two-dimensional cross-correlation experiment, which can be used to identify nuclear spin transitions (lines) belonging to the same nucleus. The experiment is carried out such that, between the last two 90° pulses of a stimulated echo sequence, a 180° mixing pulse is introduced (Figure 18.50). Then, both the time interval t1 and the time interval t2 are varied. After a two-dimensional Fourier transform a two-dimensional frequency domain spectrum is obtained. In this spectrum, cross correlations occur between those peaks that belong to nuclear spin transitions in different ms levels but from the same nucleus. In the case of ubiquinol oxidase, HYSCORE was used to determine whether the four lines in the three-pulse ESEEM (Figure 18.48) originate from the nitrogen nucleus. The


Figure 18.51 (a) Theoretical HYSCORE spectrum. (b) HYSCORE spectrum of UB• in the QH-binding pocket of bo3 ubiquinol oxidase from Escherichia coli. For a better overview only one of the four quadrants is shown. Source: reproduced with permission from Grimaldi, S. et al. (2001) Biochemistry, 40, 1037–1043. Copyright © 2003, American Chemical Society.

corresponding HYSCORE spectrum is shown in Figure 18.51b. One can clearly see the cross-correlations between the four lines, indicating that they can indeed be assigned to a single nitrogen nucleus.

ENDOR – Electron Nuclear Double Resonance Electron nuclear double resonance (ENDOR) experiments can be used to observe weak and strong hyperfine couplings of nearby and distant nuclei. There are several ENDOR variants, which can be carried out both as cw and as pulse experiments. In a Davies ENDOR experiment (Figure 18.52a), the magnetization is inverted with a 180° pulse (from +z to −z), and after a mixing time T the magnetization is detected with a 90°-τ-180°-τ echo sequence. During the time T, a 180° radio-frequency pulse is applied (a 180° pulse for the nuclear spins) and the echo amplitude is measured as a function of the radio frequency. If nuclear spin transitions are induced by the radio frequency, then lines occur at the respective radio frequencies in the ENDOR spectrum. Another pulsed ENDOR experiment, Mims ENDOR (Figure 18.52b), is based on the stimulated echo sequence, wherein the radio-frequency pulse is irradiated after the second 90° pulse. This pulse sequence has a higher sensitivity for small hyperfine couplings. Both pulse sequences are static with respect to the time axis, that is, no time interval is changed during the experiments; only the frequency of the radio wave is changed while monitoring the amplitude of the detected echo. Figure 18.53b shows a Mims ENDOR spectrum of a 31P nucleus in a phospholipid. Amongst others, ENDOR experiments are interesting for three reasons:

1. The resolution is very good, which is why even small hyperfine couplings and hyperfine anisotropies can be resolved.
2. The number of peaks is reduced compared to cw-EPR spectra. If N magnetically inequivalent 1H nuclei contribute to a cw-EPR spectrum, the number of lines is 2^N, while the corresponding ENDOR spectrum contains only 2N peaks.
In the case of N magnetically equivalent 1H nuclei, a cw-EPR spectrum shows N + 1 lines, whereas an ENDOR spectrum shows only two lines.
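This line-counting argument can be verified combinatorially (a sketch with hypothetical coupling constants in mT):

```python
from itertools import product

def cw_line_count(proton_couplings_mT):
    """Distinct cw-EPR line positions for a set of 1H (I = 1/2) couplings."""
    positions = {round(sum(a * m for a, m in zip(proton_couplings_mT, signs)), 9)
                 for signs in product((-0.5, 0.5),
                                      repeat=len(proton_couplings_mT))}
    return len(positions)

def endor_peak_count(proton_couplings_mT):
    """Each distinct 1H coupling contributes one pair of ENDOR lines."""
    return 2 * len(set(proton_couplings_mT))

inequiv = [0.3, 0.7, 1.9]          # three magnetically inequivalent protons
print(cw_line_count(inequiv))      # 2**3 = 8 cw lines
print(endor_peak_count(inequiv))   # 2*3 = 6 ENDOR peaks

equiv = [0.5, 0.5, 0.5]            # three equivalent protons
print(cw_line_count(equiv))        # N + 1 = 4 cw lines
print(endor_peak_count(equiv))     # only 2 ENDOR lines
```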

Figure 18.52 ENDOR-pulse sequence: (a) Davies sequence and (b) Mims sequence. RF stands for radio frequency and MW for microwave frequency.

Figure 18.53 (a) Structural formula of the phospholipid and (b) Mims ENDOR spectrum of the 31P nucleus with a simulation of the spectrum (below); ν(31P) is the free Larmor frequency of the 31P nucleus (6 MHz). From the splitting of 33 kHz and the line width, the distance between the 31P nucleus and the electron spin was determined to be 1 nm. This corresponds to an extended conformation of the lipid. The small signal-to-noise ratio indicates that the distance of 1 nm is at the upper distance limit of the method. Source: adapted from Zänker, P.P., Jeschke, G., and Goldfarb, D. (2005) J. Chem. Phys., 122, 024515. With permission, Copyright © 2005 American Institute of Physics.


3. In ESEEM experiments, the echo modulation depth tends to go to zero at high microwave frequencies/magnetic fields. In contrast, in ENDOR the nuclei are even better separated at high microwave frequencies/magnetic fields (based on their nuclear Larmor frequencies).

Figure 18.54 Four-pulse PELDOR sequence; νA and νB denote the two different microwave frequencies.

PELDOR – Pulsed Electron Double Resonance With the pulsed electron double resonance (PELDOR) sequence (Figure 18.54), the dipolar coupling between two unpaired electrons A and B can be selectively detected. Here, the detection sequence 90°-τ-180°-t-180° is applied to electron A using a microwave frequency νA. This leads to a refocused spin echo, which can be detected at the time interval t − τ after the last pulse. Within the time interval T, a pump pulse with the microwave frequency νB is applied, which inverts the spin of electron B in the same molecule. When the two unpaired electrons are coupled, the inversion of electron B results in a change of the magnetic field at electron A and thus a change in the echo amplitude. Moving the pump pulse within the time interval T leads to an oscillation of the observed echo amplitude. The frequency of this oscillation is the frequency of the electron–electron coupling νAB = J + D; J can usually be neglected for distances rAB > 1.5 nm, which in turn makes νAB depend only on D. This dipolar frequency depends only on the distance between the two electrons and on the orientation term (1 − 3 cos²θ) (Section 18.2.6). Since the measurement is performed in frozen solution and usually without orienting the sample, all orientations θ are detected. In the frequency domain this yields the so-called dipolar Pake pattern, from which the frequency for θ = 90° can be read off and with which the distance rAB can be calculated directly. With this pulse sequence, distances of up to 160 Å between two spin centers can be measured. An example is shown in Figure 18.55, where the time domain spectra for DNA molecules labeled with two nitroxides are presented. The figure also shows the frequency domain spectrum obtained by Fourier transformation (termed Pake spectrum or Pake pattern). The intense line at 6.8 MHz corresponds to the orientation θ = 90°. From this value a distance rAB of 19.5 Å can be

Figure 18.55 (a) Reaction scheme for the covalent attachment of nitroxides to DNA or RNA; (b) PELDOR – time domain spectra for a series of five DNA duplexes in which the distance between the two nitroxides increases from 1 to 5; (c) Fourier transform PELDOR spectrum (Pake spectrum) of DNA 1; (d) correlation of PELDOR distances and molecular dynamics simulations for a number of DNA and RNA duplexes. Source: Schiemann, O. et al. (2004) J. Am. Chem. Soc., 126, 5722–5731. With permission, Copyright © 2004, American Chemical Society.

18 Magnetic Resonance Spectroscopy of Biomolecules


calculated (Section 18.2.6), which matches the theoretically expected distance of 19.3 Å very well.

Comparison between PELDOR and FRET
Another spectroscopic method that can be used to measure distances in the nanometer range is FRET (fluorescence resonance energy transfer). A comparison between FRET and PELDOR shows that the two methods are complementary:

FRET, Section 16.7

- FRET provides distances of biomolecules in liquid solution. PELDOR, on the other hand, is performed in frozen solution.
- For FRET measurements a single molecule is sufficient, while concentrations in the micromolar range are required for PELDOR.
- FRET can observe distance changes in a time-resolved manner. PELDOR observes frozen distance distributions.
- In a PELDOR experiment, the coupling mechanism between the two spin centers can be resolved and the size of J can be determined. In FRET experiments, the mechanism of fluorescence quenching is not always clear; often, reference measurements are needed.
- Calculation of the distance from the Pake spectrum is parameter-free. For the analysis of FRET measurements, assumptions about the orientation parameter κ must be made.
- Different labels are used for FRET and PELDOR. Large chromophores are frequently used in FRET; these are attached to the biomolecules via very flexible linkers. EPR labels are small and can be attached close to the surface, sometimes by rigid linkers (e.g., for DNA or RNA). It is thus easier to correlate the measured distances to the structure of the biomolecule. On the other hand, rigid linkers have a higher propensity to change the structure of the molecule under investigation. In reality, care should be taken in both cases to avoid the induction of structural changes.
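The distance readout from the Pake pattern described above can be sketched numerically. A minimal sketch, assuming the point-dipole relation ν⊥ ≈ 52.04 MHz · (nm/r)³ commonly quoted for two nitroxide spins with free-electron g-values (the function name is illustrative):

```python
def distance_from_pake(nu_perp_mhz):
    """Point-dipole estimate of the spin-spin distance r (in nm) from
    the perpendicular (theta = 90 deg) singularity of a Pake pattern.

    Assumes free-electron g-values for both spins and negligible
    exchange coupling J, i.e. nu_perp = 52.04 MHz * (1 nm / r)**3.
    """
    return (52.04 / nu_perp_mhz) ** (1.0 / 3.0)

# The 6.8 MHz line of the DNA example corresponds to roughly 2 nm,
# close to the ~19.5 Angstrom distance quoted in the text.
print(f"r = {distance_from_pake(6.8):.2f} nm")  # → r = 1.97 nm
```

Deviations at the few-percent level are expected, since the exact prefactor depends on the g-values of the two spins.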

18.2.8 Further Examples of EPR Applications

In the previous sections several examples were presented for the determination of structural elements by EPR methods. This section describes three examples for the determination of binding constants, pH values, and mobilities.

Quantification of Spin Sites and Binding Constants
The intensity of the EPR signal depends, inter alia, on the number of electron spins in a sample. However, since many other factors also influence the signal intensity, the spin number cannot be determined directly from the signal intensity; it can only be determined by comparison with a reference sample. For this, the unknown sample and the reference sample must be measured under exactly identical conditions. For technical reasons this is difficult, and the error of such experiments is therefore normally about 15%. Nevertheless, in this way the number of spins per biomolecule can be determined if the concentration of the biomolecule is known. In addition, if the EPR spectra of a protein-bound and a protein-free paramagnetic center differ, the number of binding sites and the dissociation constants of the binding sites can be determined. An example of this approach is described below, namely the binding of Mn(II) to a catalytically active RNA, the minimal hammerhead ribozyme (Figure 18.56).

Figure 18.56 (a) Secondary structure of the minimal hammerhead ribozyme. The arrow indicates the cleavage site in the phosphodiester backbone. (b) Binding isotherm obtained from the EPR titration. The open circles are the experimental data, while the solid line is the fit using the formula given in the graph. Source: Kisseleva, N. et al. (2005) RNA, 11, 1–6. With permission, Copyright © 2005 by the RNA Society.
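The fit in panel (b) can be sketched generically. A minimal sketch, assuming a simple one-site isotherm f = [Mn]free/(Kd + [Mn]free) (the actual formula used in the original analysis may differ) and illustrative, noise-free data:

```python
def fraction_bound(mn_free_uM, kd_uM):
    """One-site binding isotherm: fraction of Mn(II) sites occupied."""
    return mn_free_uM / (kd_uM + mn_free_uM)

def fit_kd(free_conc, bound_frac, kd_grid):
    """Least-squares estimate of Kd over a grid of candidate values."""
    def sse(kd):
        return sum((fraction_bound(c, kd) - f) ** 2
                   for c, f in zip(free_conc, bound_frac))
    return min(kd_grid, key=sse)

# Synthetic, noise-free titration generated with Kd = 4 uM,
# the value reported in the text for the hammerhead ribozyme:
free = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
bound = [fraction_bound(c, 4.0) for c in free]
kd = fit_kd(free, bound, [0.1 * k for k in range(1, 201)])
print(f"Kd = {kd:.1f} uM")  # → Kd = 4.0 uM
```

With real data, the grid search would simply be replaced by a nonlinear least-squares fit that also reports confidence intervals.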


Part II: 3D Structure Determination

Figure 18.57 (a) Structure of the protonated/unprotonated nitroxide. In the protonated nitroxide, the mesomeric form I is energetically unfavorable due to the repulsion between the two positive charges. Thus form II predominates, in which the electron spin is located on the oxygen. (b) L-band (1.3 GHz/40 mT) cw-EPR spectrum of the nitroxide at three different pH values. The difference in the 14N hyperfine coupling can be clearly seen. The pKa value of the nitroxide is 4.6; therefore, at pH = pKa = 4.6 both forms are present in a 1 : 1 ratio. Source: Sotgiu, A. et al. (1998) Phys. Med. Biol., 43, 1921–1930. With permission, Copyright © 1998 IOP.
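The pH readout implied by this caption follows a Henderson–Hasselbalch relation. A minimal sketch, assuming the intensity ratio R of the deprotonated to the protonated form has already been extracted from the two-component spectrum (the function name is illustrative):

```python
import math

def ph_from_ratio(r_deprot_to_prot, pKa=4.6):
    """Henderson-Hasselbalch: pH from the concentration ratio of the
    deprotonated to the protonated nitroxide form. The default pKa of
    4.6 is the value quoted in Figure 18.57 for this nitroxide."""
    return pKa + math.log10(r_deprot_to_prot)

print(ph_from_ratio(1.0))   # 1:1 ratio → pH = pKa = 4.6
print(ph_from_ratio(10.0))  # mostly deprotonated → pH 5.6
```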

When small quantities of Mn(II) are titrated into a buffer solution containing the minimal hammerhead ribozyme, a much smaller cw-EPR signal is obtained than when the same amount of Mn(II) is titrated into a buffer solution without ribozyme. The reason is that Mn(II) bound to the ribozyme yields very broad EPR lines, such that the EPR spectrum of the Mn(II)/ribozyme complex cannot be observed at room temperature. Thus, the concentration of bound Mn(II) can be calculated from the signal intensity originating from the Mn(II) free in solution. Plotting the ratio of bound Mn(II) against the amount of free Mn(II) gives a binding isotherm. In this way, the dissociation constant for ribozyme-bound Mn(II) was determined to be 4 μM.

Local pH Values
Nitroxides that contain an amino group in or on the ring system are often used as pH probes. The principle of such pH sensors is based on the protonation of the amino group in acidic solution, whereby a positive charge is produced close to the nitroxide. The positive charge causes a shift of the spin density from the nitrogen to the oxygen atom, and thus a reduction of the 14N hyperfine coupling and of the g-tensor (Figure 18.57a). Depending on the pH, the cw-EPR spectrum is then a superposition of the spectra of the protonated and the deprotonated nitroxide. From the intensity ratio of the two spectra, the concentration ratio of the two forms, and thus the pH, can be obtained. If such a nitroxide is bound to a biomolecule, the pH can be measured in the local environment of the biomolecule. A similar dependency of A and g on the polarity can be used to distinguish the membrane interior from the membrane surface.

Mobility
Nitroxide spin labels are also frequently used to obtain information on the mobility of biomolecules. This method makes use of the fact that the spectrum of the nitroxide changes depending on its rotational freedom.
If the nitroxide can rotate freely, an isotropic three-line spectrum is obtained at X-band (Figure 18.58, upper spectrum). If the rotation is completely frozen, an anisotropic spectrum is observed, as shown in Figure 18.58 (bottom) (compare also with the nitroxide at 180 GHz, Figure 18.42). Depending on the degree of rotational freedom of the nitroxide, the spectrum changes gradually from isotropic to anisotropic. A measure of the rotational freedom is the rotational correlation time τrot, which can easily be calculated from the EPR line shapes and intensities:

τrot = 6.5 × 10⁻¹⁰ · ΔB · (√(h0/h1) − 1)    (18.36)

Figure 18.58 Influence of the rotational correlation time on the cw-X-band EPR spectrum of the nitroxide Tempol. In the case of free rotation, the three hyperfine lines of the nitroxide are split by Aiso = 1.7 mT. If the rotation is completely frozen, the spectrum is dominated by the anisotropic hyperfine coupling constant Az of 3.7 mT. Source: Weber, S., Wolff, T., and von Bünau, G. (1996) J. Colloid Interface Sci., 184, 163–169. With permission, Copyright © 1996 Academic Press.

Here, h0 is the intensity of the central line, h1 is the intensity of the low-field line, and ΔB is the width of the central line in tesla. The tesla is the unit of the external magnetic field B (strictly, the magnetic induction); 1 T = 10⁴ gauss. If a nitroxide is covalently bound to a biomolecule, the rotational freedom of the label is limited by that of the biomolecule. The measured rotational correlation time of the nitroxide is thus a measure of the mobility of the biomolecule. However, it is difficult to separate the τrot value of the biomolecule from the measured τrot value, which contains a residual contribution from the mobility of the label itself, independent of the host biomolecule. It is simpler to determine


Figure 18.59 Binding of the TAT protein to the nitroxide-labeled HIV TAR RNA. (a) Reaction scheme for the labeling of RNA with a nitroxide (top) and the secondary structure of the TAR RNA with the nitroxide located at the blue uridine (below). (b) cw-X-band EPR spectra of TAR RNA spin-labeled at U23 without (black) and with (blue) the bound TAT protein. The broadening of the spectrum and the decrease in the intensity of the low-field line for the TAR–TAT complex can be clearly seen. By analysis of multiple spectra, in which the nitroxide is bound to different positions on the RNA, statements about the influence of the dynamics of the RNA on the binding of the protein could be made. Source: Edwards et al. (2002) Chem. Biol., 9, 699–706. With permission, Copyright © 2002 Cell Press. Published by Elsevier Ltd.

relative differences. The binding of ligands to RNA can be followed in this way by EPR spectroscopy (Figure 18.59).
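Equation 18.36 translates directly into code. A minimal sketch; the prefactor and the unit convention for ΔB are taken as printed in the text, but conventions differ between sources, so treat the absolute value with care (the input numbers are illustrative, not from a real measurement):

```python
import math

def tau_rot(delta_B, h0, h1):
    """Rotational correlation time from a cw-EPR nitroxide spectrum
    (Eq. 18.36): tau_rot = 6.5e-10 * delta_B * (sqrt(h0/h1) - 1),
    with h0 the central-line intensity, h1 the low-field-line
    intensity, and delta_B the central line width."""
    return 6.5e-10 * delta_B * (math.sqrt(h0 / h1) - 1.0)

# Illustrative line-shape values (hypothetical):
print(f"{tau_rot(2.0e-4, 1.44, 1.0):.2e}")
```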

18.2.9 General Remarks on the Significance of EPR Spectra

In most cases it is difficult to derive a clear statement or structure from a single EPR spectrum. To obtain reliable results, one can proceed as follows:

- vary the sample conditions (e.g., temperature, solvent);
- modify the biomolecule biochemically (e.g., protein mutants, spin label positions, isotope labeling);
- record cw-EPR spectra at different microwave frequencies to resolve g-values and spectra of various radicals; this also allows the hyperfine coupling contribution to be separated from g-tensor contributions;
- combine several pulse-EPR/ENDOR methods to select and assign individual spectral hyperfine contributions;
- simulate the EPR spectra to obtain the EPR parameters;
- translate the EPR parameters into structural data by comparison with results from quantum chemical methods (such as DFT), model systems, and literature data;
- combine with further spectroscopic methods.

18.2.10 Comparison of EPR and NMR

As a final point, the two complementary magnetic resonance methods EPR and NMR are compared.

- NMR spectroscopy investigates the spin of magnetic nuclei in predominantly diamagnetic samples, and because protons, nitrogen, or carbon nuclei occur everywhere in a biomolecule, the method can elucidate the overall structure of the investigated biomolecule at the atomic level. EPR spectroscopy, on the other hand, detects the spin of unpaired electrons (paramagnetic samples) and observes only the local environment of the spin center.
- This local restriction of EPR also means that there is no size restriction for the biomolecule to be studied, while NMR is currently limited to biomolecules with a mass of roughly 80 kDa.
- The sensitivity of EPR spectroscopy is higher (nanomoles) than that of NMR spectroscopy (millimoles). This is due to the larger magnetic moment of the electron (μe/μH = −1838), which makes the Boltzmann population difference larger for EPR (1.1 × 10⁻³ compared to 1.1 × 10⁻⁶ at 3.4 T and T = 300 K).
- Due to the larger magnetic moment, the relaxation processes are faster in EPR. This is why the time scale for pulsed EPR experiments is nano- to microseconds, as opposed to milliseconds to seconds for NMR. This, and the need to use microwaves, also means that the technical requirements are higher for EPR.
- The faster relaxation leads to broader lines in EPR (MHz versus Hz).
- The electron–nucleus spin coupling in EPR is larger than the nuclear–nuclear spin coupling in NMR, due to the greater magnetic moment of the electron (MHz versus Hz). For this reason, larger distances between the unpaired electron and a nucleus (up to 10 Å), or between the electron and another unpaired electron (up to 80 Å), can be determined by EPR.
- The theoretical effort needed for the simulation of EPR spectra is significantly larger than for high-resolution liquid-state NMR spectra, due to the anisotropic contributions and the very fast relaxation.
- The translation of EPR (and also NMR) parameters into structural conclusions using quantum chemical calculation methods (such as DFT) is still complicated. This is particularly significant for transition metals, for which often only trends are obtained.
- Both NMR and EPR can be carried out either in liquid buffer solutions or in frozen solutions/powders; that is, no single crystals are needed for either method.
- Both spectroscopic methods can provide insight into the dynamics of biomolecules.

Acknowledgements
We thank Yaser NejatyJahromy for careful reading of the manuscript.

Further Reading

Section 18.1
Bax, A. (2003) Weak alignment offers new NMR opportunities to study protein structure and dynamics. Protein Sci., 12, 1–16.
Blumenthal, L.M. (1970) Theory and Application of Distance Geometry, Chelsea, Bronx, New York.
Cavanagh, J., Fairbrother, W.J., Palmer, A.G. III, and Skelton, N.J. (1996) Protein NMR Spectroscopy, Academic Press.
Creighton, T.E. (ed.) (1992) Protein Folding, Freeman, New York.
Croasmun, W.R. and Carlson, R.M. (eds) (1994) Two-Dimensional NMR Spectroscopy, VCH-Verlagsgesellschaft, Weinheim.
Derome, A.E. (1987) Modern NMR Techniques for Chemistry Research, Pergamon, Oxford.
Dingley, A.J., Cordier, F., and Grzesiek, S. (2001) An introduction to hydrogen bond scalar couplings. Concepts Magn. Reson., 13, 103–127.
Ernst, R.R. (1992) Kernresonanz-Fourier-Transformationsspektroskopie (Nobel-Vortrag). Angew. Chem., 104, 817–952.
Ernst, R.R., Bodenhausen, G., and Wokaun, A. (1987) Principles of Nuclear Magnetic Resonance in One and Two Dimensions, Clarendon Press, Oxford.
Evans, J.N.S. (1995) Biomolecular NMR Spectroscopy, Oxford University Press.
Fernandez, C. and Wider, G. (2003) TROSY in NMR studies of the structure and function of large biological macromolecules. Curr. Opin. Struct. Biol., 13, 570–580.
Friebolin, H. (1998) Ein- und Zweidimensionale NMR-Spektroskopie, VCH-Verlagsgesellschaft, Weinheim.
Goldman, M. (1988) Quantum Description of High-Resolution NMR in Liquids, Clarendon Press.
Karplus, M. and Petsko, G.A. (1990) Molecular dynamics simulations in biology. Nature, 347, 631–639.
Pain, R.H. (ed.) (1994) Mechanisms of Protein Folding, Oxford University Press, Oxford.
van de Ven, F.J.M. (1995) Multidimensional NMR in Liquids, VCH-Verlagsgesellschaft, Weinheim.
Van Gunsteren, W.F. and Berendsen, H.J.C. (1990) Moleküldynamik-Computersimulationen: Methodik, Anwendungen und Perspektiven in der Chemie. Angew. Chem., 102, 1020.
Weltner, W. (1983) Magnetic Atoms and Molecules, Dover Publications, New York.
Wüthrich, K. (1986) NMR of Proteins and Nucleic Acids, John Wiley & Sons, Inc., New York.

Section 18.2
Atherton, N.M. (1993) Principles of Electron Spin Resonance, Ellis Horwood, New York.
Berliner, L.J. (ed.) (1998) Spin Labeling: The Next Millennium, Biological Magnetic Resonance, vol. 14, Kluwer Publishing, Amsterdam.

Dikanov, S.A. and Tsvetkov, Y.D. (1992) Electron Spin Echo Envelope Modulation (ESEEM) Spectroscopy, CRC Press, Boca Raton, FL.
Eaton, G.R., Eaton, S.S., and Berliner, L.J. (eds) (2000) Distance Measurements, Biological Magnetic Resonance, vol. 19, Kluwer Publishing, Amsterdam.
Kaupp, M., Bühl, M., and Malkin, V.G. (eds) (2004) Calculation of NMR and EPR Parameters, Wiley-VCH Verlag GmbH, Weinheim.
Misra, S.K. (ed.) (2011) Multifrequency Electron Paramagnetic Resonance, Wiley-VCH Verlag GmbH, Weinheim.
Poole, C.P. (1983) Electron Spin Resonance – A Comprehensive Treatise on Experimental Techniques, Wiley-Interscience, New York.
Schweiger, A. and Jeschke, G. (2001) Principles of Pulse Electron Paramagnetic Resonance, Oxford University Press, Oxford.
Weil, J.A., Bolton, J.R., and Wertz, J.E. (1994) Electron Paramagnetic Resonance: Elementary Theory and Practical Applications, Wiley-Interscience, New York.


19 Electron Microscopy
Harald Engelhardt
Max-Planck-Institut für Biochemie, Am Klopferspitz 18, 82152 Martinsried, Germany

Modern microscopic techniques produce images of small organisms, tissues, single cells, organelles, membranes, macromolecular assemblies, isolated macromolecules, and of small molecules down to atoms (Figure 19.1). The beginnings of microscopy date back to the seventeenth century, when Antoni van Leeuwenhoek (1632–1723) in the Netherlands and Robert Hooke (1635–1703) in England built their first simple instruments and initiated the development of light microscopy. The scientific investigation of optical systems and of their resolution limit by Ernst Abbe (1840–1905) in Germany, the improvement of microscopic illumination by August Köhler (1866–1948), the development of better glass materials, and Robert Koch's famous microbiological investigations led to a flourishing of microscopy in science. In the 1930s, the Dutch physicist Frits Zernike (1888–1966) devised the phase contrast microscope, which allowed


Light Microscopy, Chapters 7, 8

Figure 19.1 Biological structures and suitable microscopy techniques. The resolution limits of the human eye, light microscopes, and the transmission electron microscope are indicated. Fluorescence microscopes show the position of fluorophores, while the optical near-field microscopies (e.g., SNOM), stimulated emission depletion (STED) microscopy, photoactivated localization microscopy (PALM), and stochastic optical reconstruction microscopy (STORM) overcome the classical resolution limit of light microscopes. Scanning electron and scanning probe microscopies provide surface structure information; light and transmission electron microscopy allow three-dimensional imaging.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.



Atomic Force Microscopy, Chapter 20

X-Ray Structure Analysis, Chapter 21
NMR Spectroscopy, Chapter 18

researchers to investigate unstained cells and tissue of low contrast, and whose basic principle also applies to electron microscopy. Köhler also introduced the first fluorescence microscope (1908), an instrument that gained greater significance in cytological research after the development of stronger light sources. Based on the increasing success of molecular genetics and the fusion of the green fluorescent protein to other macromolecules, it became possible to identify and localize cellular protein complexes, and this success enormously stimulated the field of cell biology in the second half of the 1990s. At this time, fluorescence microscopy experienced another revolutionary breakthrough by overcoming the classical resolution limit dictated by the refraction law and the wavelength of light. The new techniques allow the detection of single fluorescence markers in cells with a spatial accuracy of about 30 nm or better. Several instrumental solutions were introduced, such as STED (stimulated emission depletion microscopy), invented by Stefan Hell (∗1962), PALM (photoactivated localization microscopy), and STORM (stochastic optical reconstruction microscopy). The latter two stimulate fluorescence labels locally and calculate the center of the fluorescence signal afterwards. None of these fluorescence microscopes provides genuine structural information; yet they locate labeled molecules and display their distribution with unprecedented spatial precision (super-resolution). The scanning probe microscopies (SPM) in the optical near field (scanning near-field optical microscopy, SNOM; scanning near-field infrared microscopy, SNIM) are also techniques that detect objects of "submicroscopic" dimensions and can even record spectroscopic information in the nanometer range, independently of the wavelength of the light. The field of "nano-optics" has important applications in material science.
The "apertureless" microscopes make use of scanning force microscopy (SFM; atomic force microscopy, AFM), which was introduced by Gerd Binnig (∗1947) in 1986, only a few years after he and Heinrich Rohrer (1933–2013) had invented the scanning tunneling microscope (STM), a novel type of microscope for imaging surfaces. Several variants of SFM are also useful for biological structure research and are complementary to applications in light and electron microscopy. When Louis de Broglie (1892–1987) realized that electrons can be understood as waves with wavelengths far below one nanometer, he laid the theoretical basis for a microscope operating with electrons. Max Knoll (1897–1969), Ernst Ruska (1906–1988), and Bodo von Borries (1905–1956) developed the first transmission electron microscope (TEM) in 1931, and only two years later they obtained images with much higher resolution than light microscopy. Today it is possible to resolve single atoms in radiation-resistant objects such as alloys. Manfred von Ardenne (1907–1997) built the first scanning electron microscope (SEM) in 1937, an instrument that produces impressive surface images of high depth of field. Biological objects such as proteins are very radiation-sensitive and require preparative and technical efforts before imaging with quasi-atomic resolution (0.3 nm) becomes possible. The two-dimensional (2D) crystalline bacteriorhodopsin from halobacteria and the light-harvesting antenna complex LHC-II of the photosynthetic membranes from chloroplasts were the first biological specimens whose three-dimensional (3D) atomic structures were solved by means of electron microscopy (electron crystallography). In addition to X-ray crystallography and NMR spectroscopy, electron microscopy is thus the third method with which the spatial structure of macromolecules can be determined. Electrons interact particularly strongly with the object and, in contrast to the other methods, also provide images of single molecules.
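De Broglie's relation mentioned above can be checked with a few lines. A minimal sketch, using rounded physical constants and the relativistic momentum correction (the function name is illustrative):

```python
import math

H = 6.626e-34    # Planck constant, J s
M0 = 9.109e-31   # electron rest mass, kg
E0 = 1.602e-19   # elementary charge, C
C = 2.998e8      # speed of light, m/s

def electron_wavelength_pm(accel_volts):
    """Relativistic de Broglie wavelength (in pm) of an electron
    accelerated through accel_volts."""
    e_kin = E0 * accel_volts
    p = math.sqrt(2.0 * M0 * e_kin * (1.0 + e_kin / (2.0 * M0 * C**2)))
    return H / p * 1e12

# At the 1e5 V acceleration voltages typical for TEM, the wavelength
# is a few picometers, i.e. far below atomic dimensions:
print(f"{electron_wavelength_pm(1e5):.2f} pm")  # → 3.70 pm
```

The practically achievable resolution is much worse than this wavelength because lens aberrations, not diffraction, are the limiting factor in electron optics.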
Automated data acquisition and novel electron detectors now enable us to collect data on single protein complexes of sufficient quantity and quality to achieve quasi-atomic resolution. Single-particle electron microscopy is indeed the only way to resolve the 3D structure of macromolecular complexes that are too big for NMR spectroscopy, too flexible for X-ray crystallography, and do not form crystals. Wolfgang Baumeister (∗1946) and his team developed an approach that made the 3D structure of native, shock-frozen (vitrified) cells accessible to TEM at macromolecular resolution. The 3D reconstruction of the cytoskeleton in intact cells of the slime mold Dictyostelium discoideum signaled the breakthrough of cryo-electron tomography (CET) in 2002. Meanwhile, CET has granted unprecedented insights into the macromolecular organization of microorganisms and eukaryotic cells and has opened up new and exciting perspectives for structure research in situ. In 2014 Baumeister's laboratory introduced a new tool (the phase plate) for phase-contrast enhancement in electron microscopy that has the potential to become standard equipment. Cryo-electron tomography now integrates


molecular and cellular structure research, which before were separate biological disciplines.

19.1 Transmission Electron Microscopy – Instrumentation

TEM received its name from the imaging mode: the electrons that are transmitted through the object are used for imaging, similar to the function of light in optical microscopy. The basic construction of electron and optical microscopes is indeed similar, except that electromagnetic lenses are used instead of glass lenses and electron microscopes are much bigger (Figure 19.2). The lenses consist of iron-sheathed coils that create a magnetic field upon current flow and direct the field towards the inner space. Electromagnetic lenses are always converging lenses. Correcting spherical and chromatic aberrations is thus challenging; yet, in recent years, correction systems have improved image formation, in high-resolution electron microscopes in particular. The electron source is a cathode that emits electrons. An anode with a potential difference to the cathode of 10⁵ V or more accelerates these electrons. The condenser lens (often two lenses in succession) focuses the electron beam in the object plane, where the beam transmits the object and is expanded by the objective lens. The focal length of the objective lens can be adjusted by varying the lens current, so that changing the magnification does not require exchanging objective lenses. One or more projection lenses between the object and the imaging plane expand the beam further and increase the magnification. The image is rendered visible on a fluorescent screen (which may be inspected with a binocular loupe at tenfold magnification) and recorded on film plates in the conventional way or, nowadays, by means of special digital cameras. The microscopes allow magnifications between 100-fold and about 10⁶-fold; however, for macromolecules and biological structures a primary magnification of 100 000-fold is sufficient. The electron microscope column must be evacuated so that the electrons interact only with object atoms and not with gas molecules.
Microscopy in a vacuum necessitates preparation of the biological objects (fixation, dehydration, embedding in plastic material or freezing for

Figure 19.2 Beam path in the light microscope (LM) and in the transmission electron microscope (TEM). The principal system of lenses is the same in both microscopes, the TEM usually contains a second condenser and projection lenses in addition. Condenser and objective apertures limit the illumination and blank out strongly diffracted regions of the beam, thereby enhancing the contrast. The image is observed in the image plane by eye (LM) or on a fluorescence screen (TEM) and can be recorded by a camera. An external energy filter separates electrons of different energy and enables electron energy loss spectroscopy (EELS) analysis and electron spectroscopic imaging (ESI) (see also Figures 19.9 and 19.12). X-Rays that are created by interaction of beam electrons with object electrons are either shielded or can be recorded by a detector.


cryo-electron microscopy). Otherwise the specimens would dehydrate in the microscope and become damaged, and the vacuum would break down. The specimens are placed on copper grids (sometimes made of gold or nickel) with meshes of 30–100 μm (Figure 19.3). The grids are covered by a thin (5–10 nm) carbon film, which supports isolated macromolecules or ultrathin sections. These films are deposited on surfaces of freshly broken mica by heat evaporation of graphite in a vacuum chamber. The film is then floated onto the surface of water and transferred to the grids. For cryo-electron microscopy these films contain holes of 0.5–2 μm diameter. Comparable grids are commercially available ready for use (Lacey® and Quantifoil® grids; Figure 19.3b). Very recently, gold grids with regularly perforated (1 μm) gold membranes instead of carbon films were introduced; these are extremely stable in the electron beam and work well for high-resolution imaging. The holes are filled with water, which becomes a thin ice film after freezing and thus contains the biological objects (macromolecules, cellular components, organelles, or cells). Carbon-coated grids are hydrophobic and have to be made hydrophilic prior to the application of material dissolved or suspended in water. Ionized gas molecules – produced in evacuated chambers by glow discharge (plasma cleaner) – render the carbon film temporarily wettable. As a side effect, the plasma destroys impurities and cleans the grid surface. The grids are mounted on an object holder and inserted into the microscope (Figure 19.3a). Frozen specimens that are examined in the cold (−180 °C or less; cryo-electron microscopy, see Sections 19.2.1 and 19.3.3) must be continuously cooled with liquid nitrogen in a controlled manner to avoid recrystallization of the amorphous ice above −150 °C.
The object holder can be tilted around one axis or around different axes (the second ideally set perpendicular to the first), and the specimen can be inspected from different projection angles, which is necessary for 3D reconstruction in electron tomography (Section 19.5).

Figure 19.3 Object holder, grids, and plunger for biological cryosamples. (a) Object holder with a mechanical shield to protect the frozen sample from contaminations. The holder can be rotated around its longitudinal axis by about ±70° to record different projections of an individual object for 3D reconstruction (tilt series). (b) Grids (diameter 3 mm) of various designs for TEM. They are usually covered with a 5–10 nm thick carbon film for small biological specimens. (c) Carbon films with holes for cryosamples. A thin ice film containing the biological structures spans the holes. Bar indicates 20 μm. (d) Plunger for vitrification of biological samples for cryo-electron microscopy. Tweezers hold the grid and inject it into a liquid cryogen (ethane) cooled by liquid nitrogen to about −180 °C. The high cooling rate vitrifies the water and prevents the growth of ice crystals. The biological structures are preserved in a close-to-live state.

19.2 Approaches to Preparation

Biological specimens must be sufficiently thin so that the strongly interacting electrons can pass through them. Intact cells or tissue preparations are usually chemically fixed, dehydrated, and embedded in special resins (Epon®) or in material that polymerizes in the cold (e.g., Lowicryl®). Ultrathin sections (100 nm) are cut from these polymer blocks by means of ultramicrotomes. Fixation, staining, and thin sectioning have been the standard procedure in cell biology for about five decades, and we owe most of our knowledge of cell architecture to this technique. We will not go into further detail here, since shock-frozen and untreated biological samples are currently replacing chemically fixed and stained specimens. The cryotechniques preserve the native structure and organization of macromolecules in the cellular context and open up new perspectives in cytological research (Figure 19.4). Section 19.2.1 describes the preparation of intact cells in amorphous ice and various thinning approaches for cryo-electron microscopy. Electron microscopy of isolated cell wall fragments, membranes, proteins, and other macromolecular complexes does not require thin sectioning; these objects are already thin enough. In addition to cryopreparation, Sections 19.2.2–19.2.4 describe common contrasting and labeling procedures that are used for a quick inspection of soluble macromolecules or for special applications.

19.2.1 Native Samples in Ice

If the inner structure of molecules, of macromolecular assemblies, or of intact cells is to be investigated, the object itself must be imaged and not the distribution of staining material. Any chemical fixation and contrast enhancement is therefore avoided, and the native object is examined in aqueous solution that has been physically "fixed" by rapid freezing. Thin objects are directly applied to the grid, blotted to remove excess water until only a thin film is left, and shock-frozen by plunging the grid into a cryogen (liquid ethane or ethane–propane; Figure 19.3d). The high cooling rate (10⁵ K s⁻¹) prevents the water from forming ice crystals that would destroy the biological object. The ice remains amorphous; it is vitrified in a similar way as cold molten glass, as Jacques Dubochet (∗1941) showed in 1981. To control conditions such as temperature

19 Electron Microscopy

489

Figure 19.4 Preparation of biological samples. Cells and tissues are either treated by chemical fixation, embedding, and ultrathin sectioning for electron microscopy or, for enhanced structural integrity, by high-pressure freezing and freeze substitution. A close-to-live preparation is accomplished by vitrification, cryosectioning, or focused ion beam (FIB) milling. Thinner samples (cells, organelles, or isolated macromolecular complexes) are either thinned in the FIB or directly imaged in the TEM. Procedures for vitrified specimens are shaded in gray. The methods of freeze fracturing and negative staining are indicated but not shown in full.

or light intensity, one can use a plunger with an integrated incubation chamber for automated blotting. Isolated protein complexes may be re-suspended in solutions of glucose or similar compounds (trehalose, tannic acid) if they have to be stabilized. The frozen grid is transferred to the microscope sample holder under liquid nitrogen and is imaged at −180 °C or below (Section 19.3.3). Since the contrast of protein (specific density 1.4 g cm⁻³) in ice (1 g cm⁻³) is low, it is advantageous to use grids with carbon films containing holes in order not to obliterate the weak contrast with other material (Figure 19.3c). Cryopreparation of isolated protein complexes is now standard and has replaced negatively stained specimens for 3D structure determination. Some eukaryotic cells, bacteria, archaea, and viruses as well as macromolecular complexes can be vitrified on grids and in most cases also directly inspected in a microscope (Figure 19.4). However, eukaryotic and multicellular specimens are often too thick to be vitrified by simple plunging: the heat transfer in the center of bigger samples (thickness >10 μm) is too slow to prevent the crystallization of water. Solutions under pressure, however, crystallize less quickly, and the slower cooling rate then suffices to vitrify samples 200–300 μm thick. High-pressure freezing instruments generate up to 0.2 GPa (2000 atm) at liquid nitrogen temperature. Moreover, the cell suspensions can be supplemented with anti-freeze compounds such as dextran, which is osmotically inert and reduces the ability of water to form ice crystals. The drawback, however, is that these additives mask polysaccharides of the cell surface. Cryofixed cells or tissue may be freeze-fractured, freeze-etched, and metal-coated (replica technique; Figure 19.4) to investigate the cell surface or fracture faces. Alternatively, they can be fixed at about −100 °C, dehydrated, stained, and embedded in material that polymerizes in the cold.
In cases where chemical fixation should be avoided, vitrified cells can be left untreated and sectioned in a cryo-ultramicrotome. The cell material is frozen in a small copper tube, which is mounted on the cryo-ultramicrotome in an atmosphere of liquid nitrogen (−150 °C). The tube is trimmed so that the ice block is free

Vitrification

Glass (vitrum) is an amorphous material that does not form crystals when it solidifies after melting. Water molecules, on the other hand, form ice crystals whose particular structure and density depend on the temperature and pressure during the freezing process. Currently, we know of 19 different ice forms. Very fast freezing to temperatures below −140 °C at normal pressure lets water assume an amorphous, vitrified state with a density of 0.94 g cm⁻³ (low-density amorphous ice, LDA). The other types of amorphous ice have a rather high density (1.17 and 1.26 g cm⁻³) and cannot be generated from liquid water. LDA is more similar to liquid water than any other form of ice.

490

Part II: 3D Structure Determination

Figure 19.5 Ultrathin sections of Mycobacterium smegmatis. Preparation by (a) conventional fixation, dehydration, and embedding in epoxide resin (Epon), (b) high-pressure freezing, freeze substitution, and embedding in Lowicryl, and (c) high-pressure freezing and cryomicrotomy without any chemical treatment, (d) a higher magnification of (c). The lipid bilayer is only visible in cryosections; these are compressed in the cutting direction so that the originally round cross section of the bacterial cell becomes oval. Scale bars indicate 100 nm (a)–(c) and 50 nm (d). Source: parts (a) and (b) courtesy of Christopher Bleck, Basel, Switzerland.

from the surrounding copper and thin sections can be produced. Cryosectioning is not a routine approach, and it causes up to 30% compression of the sample in the sectioning direction. Despite several artifacts, cryosections provide insight into details of the cellular architecture that are usually lost during dehydration and plastic embedding (Figure 19.5). A newer development for thinning frozen biological material is focused ion beam (FIB) micromachining. The vitrified samples are mounted in a scanning electron microscope (SEM) that is also equipped with an ion gun (gallium). The focused ion beam removes material from the surface of the object until it is thin enough for imaging in the TEM (300–500 nm). To save time and gallium, specimens intended for ion milling should not be thicker than 5–10 μm. Appropriately thinned specimens are free of cutting artifacts and are not deformed. An alternative approach ablates thin layers of biological material and images the surface of the sample block by SEM. When this procedure is repeated many times, the series of images, aligned and stacked consecutively, produces a 3D volume of the object. This technique is called FIB-SEM or “slice and view” and can be applied to cells and cell assemblies. Another approach uses an integrated microtome instead of an ion beam to smooth the surface; this is particularly applied to large embedded specimens such as tissues (mouse brain).

19.2.2 Negative Staining

Negative staining with heavy metal salts is a very simple and quick method to image isolated proteins, fibrillar assemblies, membranes, and similar objects at a resolution of about 2 nm. Negative staining is readily used to inspect preparations in terms of purity and homogeneity and to get a first impression of the object structure. A droplet of the sample (2–5 μl) is applied to a carbon-coated grid that has been made hydrophilic by glow discharge. After 15–60 s most of the liquid is blotted off, and the grid is washed with pure water, buffer, or salt solution (10 mM) and stained with a heavy metal salt. Common stains are 2% (w/v) solutions of, among others, uranyl acetate, phosphotungstic acid, or ammonium molybdate. The compounds differ in contrast, radiation sensitivity, the applicable pH range, and their ionic characteristics. The metal salt covers the surface and fills holes and indentations of the macromolecules, and it is the distribution of the metal that is finally imaged (Figure 19.6). The metal coat is much more radiation-resistant than the biological material and, under moderate irradiation, preserves the spatial structure after drying. Negatively stained specimens may be stored for weeks to months. However, negative staining is usually not suitable for whole cells and bigger objects.


Figure 19.6 Schematic illustration of different contrasting methods for imaging macromolecules in the transmission electron microscope. Contrasting with a heavy metal leads to different density distributions in the EM image and provides structural information on individual components of a sample. Only images of ice-embedded native specimens show contrast originating from the object itself.

19.2.3 Metal Coating by Evaporation

A common method to investigate intact cells or tissues in the microscope is to freeze them, cut or fracture them in vacuum, sublime ice from the surface by freeze etching (at −80 °C), and contrast the surface by evaporating heavy metal onto it at an angle of 30–60°. The metal coat (1–2 nm Pt/C or other) is stabilized by 10–20 nm of carbon (deposited at 90°). Treatment with aggressive acid solutions removes the biological material, and only the remaining replica is inspected in the microscope. The grain size of the evaporated metal limits the resolution to about 2 nm, but complexes of macromolecules in membranes and cellular surfaces are detectable. Freeze fracturing was a common method in cytology and was used to obtain spatial information on cellular surfaces. The approach lost its importance with the introduction of cryo-electron tomography in structural research (Section 19.5.3). Direct metal evaporation onto membranes or isolated protein assemblies that are adsorbed on carbon-coated grids is also possible. Since air-drying would destroy the unprotected biological specimens, they are freeze-dried to gently remove water and then contrasted in the cold, as in the freeze-etching approach. The metal is only deposited on the surface pointing towards the evaporation source, and it renders the other regions invisible (Figure 19.6). The orientation of membranes or regular objects (2D crystals or S-layers) can thus be evaluated. Images of unidirectionally metal-coated objects must be interpreted differently from negatively stained samples, since the density distribution resembles a landscape with hills and valleys that is transformed into a pattern of light and shadow (“metal shadowing”). The gray values of the image correspond to the first derivative of the surface function of the object, so that the original function, the surface relief, can be recovered by mathematical integration.
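The derivative–integral relation between shadowing signal and surface relief can be illustrated numerically. The sketch below is a hypothetical 1D example (the surface profile, grid, and tolerance are invented for illustration): it differentiates a surface function to mimic the gray values of a unidirectionally shadowed image and then recovers the relief, up to an unknown additive constant, by trapezoidal integration.

```python
import numpy as np

# Hypothetical 1D surface relief h(x) with "hills and valleys" (arbitrary units).
x = np.linspace(0.0, 10.0, 200)
dx = x[1] - x[0]
relief = np.sin(x) + 0.3 * x

# Unidirectional metal shadowing: the image gray values correspond to the
# first derivative of the surface function.
gray = np.gradient(relief, dx)

# Integrate the gray values (trapezoidal rule) to recover the relief; the
# additive constant is free and is fixed here only for comparison.
recovered = np.concatenate(([0.0], np.cumsum(0.5 * (gray[1:] + gray[:-1])) * dx))
recovered += relief[0]

print(np.allclose(recovered, relief, atol=0.02))
```

The reconstruction matches the original profile to within the discretization error, which is the 1D analogue of recovering a surface relief from a shadowed micrograph.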
However, surface relief reconstructions cannot provide the intrinsic 3D structure of objects – this is the domain of tomographic approaches. Two special metal evaporation approaches, that is, rotational “shadowing” and decoration, are particularly suited to identifying regular structures of macromolecular complexes and fibrillar assemblies (e.g., of actin filaments or flagella). Decoration effects occur if only a limited amount of metal is evaporated so that there is no coherent metal coat. Even if the object


Figure 19.7 Labeling of proteins with antibodies. (a) Localization of the ATP synthase A1 subcomplex in an ultrathin section (after freeze substitution and embedding in Epon) of Ignicoccus hospitalis. The primary antibodies are labeled by secondary ones carrying gold clusters and having been enlarged by silver coating to 30–60 nm in diameter. The antibodies identify the enzyme in the outer membrane of the archaeon. (b) Immunolabeling of α- and (c) β-subunits of the negatively stained 20S proteasome from Thermoplasma acidophilum. The antibodies are clearly visible and identify the α-subunits in the outer and the β-subunits in the inner rings of the enzyme. Scale bars indicate 0.5 μm (a) and 20 nm (b) and (c). Source: part (a) courtesy of Reinhard Rachel, Regensburg, Germany.

Immunological Techniques, Chapter 5

temperature is below −100 °C, the evaporated metal clusters can still diffuse and reach preferred locations on the surface. These sites are now decorated, and the regular arrangement becomes clearly detectable. To obtain pure decoration without shadowing effects it is advisable to deposit the metal at an angle of 90° onto the surface. Noble metals (Ag, Au, Pt) are particularly appropriate for decoration, and different metals often decorate different molecular sites. Metal coating of isolated macromolecules is more complicated than negative staining and less rewarding than electron microscopy of native, vitrified specimens.

19.2.4 Labeling of Proteins

Negative staining and metal coating are non-selective contrasting methods. To identify and localize proteins amongst a wealth of other macromolecules we need specific labels such as monoclonal or polyclonal antibodies or other specific compounds that are coupled to gold clusters to achieve contrast in electron micrographs. Thin sections, and particularly cryosections prepared for immunological purposes (method according to K.T. Tokuyasu), may be labeled with secondary antibodies bearing gold clusters (5–20 nm) that bind to the primary antigen-specific antibody and identify its position with about 20–30 nm accuracy (distance between gold cluster and antigen; Figure 19.7). Subsequent silver coating enhances the size and contrast of small gold clusters. Antibody labeling of isolated macromolecules does not require secondary enhancement by gold since the target protein as well as the antibody – and thus the specific contact regions – are visible with much higher spatial accuracy. Such experiments are particularly suited for identifying the position of subunits in heterooligomeric protein complexes (Figure 19.7).

19.3 Imaging Process in the Electron Microscope

The appropriate interpretation of images of a biological object does not only depend on its preparation and contrasting history but also on the imaging process in the electron microscope. It is thus useful to get some insight into the principles of image formation. It is sufficient to discuss basic physical concepts here; the comprehensive theory is described in specific textbooks.

19.3.1 Resolution of a Transmission Electron Microscope

Louis de Broglie (1892–1987) introduced the relationship between the wavelength λ and the momentum p of a moving mass. His equation can be transformed for electrons at nearly relativistic speed (in the EM about 200 000 km s⁻¹):

  λ = hc / √(2E₀E + E²),  with E = Ub·e₀;  λ ≈ 3.7·Ub^(−0.6)   (19.1)

where Planck’s quantum of action h is 6.63 × 10⁻³⁴ J s, E₀ is the rest energy (8.19 × 10⁻¹⁴ J), and E is the kinetic energy of the electrons. The latter depends on the accelerating voltage Ub (in V) and the elementary charge e₀ (1.60 × 10⁻¹⁹ C) as given in Equation 19.1. The wavelength of electrons above 50 000 eV can be estimated by the empirical approximation in Equation 19.1 (λ in nm). The wavelength amounts to 0.0037 nm for 100 000 eV electrons and is thus smaller than the wavelength of visible light by a factor of 10⁵. Ernst Abbe attributed the resolution limit d of the light microscope to the physical parameters in Equation 19.2, which is also valid for the EM:

  d = λ / (2n sin α)   (19.2)

Here the refractive index n is about 1, and α denotes half of the aperture angle (beam width) of the objective lens. This angle identifies the region of the diffracted beam that the objective lens can capture. Only these electrons transmit information about the object; those that do not interact with the specimen are blind to the object and belong to the reference beam. To obtain images with visible structures of size d (Abbe used a periodic pattern with the characteristic distance d) the lens must at least record a beam of the first diffraction order. Since the diffraction angle Θ is related to the inverse of d (Θ ≈ λ/d), the limiting angle α for the objective should be as large as possible (α → π/2). The term A = n sin α, that is, the numerical aperture, is of the order of 1 for objectives in light microscopes, but only about 0.01 for electron microscopes.

Resolution

Hermann von Helmholtz (1821–1894) introduced an alternative analysis of the imaging process in optical instruments and derived another definition of resolution. The beam, originating from a luminescent object, is diffracted by a lens of finite size that creates a diffraction disc (Airy disc) in the focal plane instead of a distinct diffraction spot. Two object points are still distinguishable if the diffraction maximum of one point falls into the first minimum of the second one (criterion of Lord Rayleigh, 1842–1919). Using this approach for a microscope with aperture A = n sin α and replacing the viewing angle by the corresponding distance d, we obtain the formula d = 0.61·λ/(n sin α). The resolution limits according to Abbe and von Helmholtz differ slightly by a numerical factor, and we also have to take into consideration that the denominator in Equation 19.2 depends on the illumination of the object, that is, parallel or oblique, and can thus achieve values between A and 2A. The determining variables for resolution are the wavelength and the numerical aperture.
Using a small aperture that increases the image contrast leads to reduced resolution; the microscopist apparently sees better but recognizes less.

Owing to the small wavelength, and despite the drawback of a small aperture, electron microscopes reach a physical resolution of 0.2 nm at accelerating voltages of 100–300 kV. High-voltage transmission electron microscopes (HVTEMs), equipment that is used in materials science, are operated at voltages up to 10⁶ V. The corresponding wavelength of the electrons (0.9 pm) thus allows for the resolution of single atoms.
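Equations 19.1 and 19.2 are easy to evaluate numerically. The short sketch below (the function names are my own) reproduces the values quoted in the text: about 0.0037 nm for 100 kV electrons and a resolution limit of roughly 0.2 nm for a numerical aperture of 0.01.

```python
import math

H = 6.63e-34         # Planck constant (J s)
C = 3.0e8            # speed of light (m s^-1)
E0 = 8.19e-14        # electron rest energy (J)
E_CHARGE = 1.60e-19  # elementary charge (C)

def electron_wavelength_nm(Ub):
    """Relativistic de Broglie wavelength (Equation 19.1); Ub in volts."""
    E = Ub * E_CHARGE                                    # kinetic energy (J)
    return H * C / math.sqrt(2 * E0 * E + E * E) * 1e9   # m -> nm

def electron_wavelength_approx_nm(Ub):
    """Empirical approximation from the text, valid above ~50 kV (result in nm)."""
    return 3.7 * Ub ** -0.6

def abbe_limit_nm(wavelength_nm, numerical_aperture):
    """Abbe resolution limit d = lambda / (2 n sin alpha) (Equation 19.2)."""
    return wavelength_nm / (2 * numerical_aperture)

lam = electron_wavelength_nm(100_000)   # ~0.0037 nm at 100 kV
d = abbe_limit_nm(lam, 0.01)            # ~0.19 nm with A ~ 0.01
```

Evaluating the exact formula at 10⁶ V likewise yields the ≈0.9 pm wavelength mentioned for HVTEMs.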

19.3.2 Interactions of the Electron Beam with the Object

Imaging and Information Content of Diffracted Electrons in the TEM

We distinguish between two kinds of interactions that electrons experience while passing through the object. The two kinds contribute to the imaging process in different ways and can be separated in certain microscope models. The electrostatic interaction of a beam electron with an atomic nucleus of the object results in a deflection of the electron path. The deflection is strong if the negatively charged electron comes close to the positively charged nucleus, if the nuclear charge is high (high atomic number elements, staining with heavy metal), and if the velocity of the electron is low (the velocity is a function of the accelerating voltage). The energy of the beam electrons remains constant if the deflection angle is not too high; the electrons experience elastic scattering. The mean scattering angle is about 0.1 rad (≈6°). Part of the electron beam hits the aperture, is excluded from the imaging process, and creates the scattering contrast (Figure 19.8). The drastic suppression of scattered electrons thus creates high contrast in the image but limits the aperture angle and reduces the resolution of structural details (Equation 19.2).

Figure 19.8 Interaction of beam electrons with the nucleus of object atoms. Strongly scattered electrons are shielded by the aperture and give rise to the scattering contrast. Electrons interacting with the object are decelerated by the object potential and show a phase shift of the wave front with respect to the reference beam. Interference of the scattered electrons with the non-interacting reference beam creates the intrinsic phase contrast in the transmission electron microscope (caused by lens aberrations). The scanning transmission electron microscope (STEM) is equipped with an electron detector instead of the objective aperture and records the strongly scattered electrons for imaging. The signal can be used for electron microscopical mass determination of macromolecular structures.


Figure 19.9 Interaction of beam electrons with the electron shell of object atoms. The beam electrons transfer energy to object electrons and this energy loss causes an increase of the wavelength. The beam electrons can be separated according to their energy in an energy filter and used for elemental mapping and analysis. The energy transfer leads to an excitation of object electrons and the emission of X-rays that are also indicative for the elemental composition of the sample. Emitted (secondary) electrons from (and beam electrons reflected by) the object surface are employed for imaging in scanning electron microscopy.

Mass Spectrometry, Chapter 15

Since the interaction between beam electrons and the object is relatively strong, it is necessary to limit the thickness – or the mass density – of the specimen. This is generally the case with protein complexes, biological membranes, and ultrathin sections of embedded cells. In cryo-electron microscopy, voltages of 200 kV are preferred for the examination of isolated macromolecules, since these are usually embedded in a thicker layer of ice. Bigger objects such as intact cells (thickness ≈0.5 μm) in cryo-electron tomography (Section 19.5.3) can only be imaged at higher voltages (300 kV) without massive scattering and shielding of electrons. Interactions of the beam with the electrons of the object have multiple, partly undesirable consequences for imaging. One effect is that the beam electrons are also scattered by Coulomb forces, but the typical scattering angle of 10⁻⁵ rad is much smaller than with electron–nucleus interactions. When accelerated electrons hit the electron shell of atoms they lose energy. The kinetic energy is reduced by ΔE and the wavelength increases correspondingly (Equation 19.1; Figure 19.9); this effect is termed inelastic scattering. The formerly almost coherent electron beam becomes incoherent and shows a spectrum of wavelengths after passing through the object. Since the diffraction depends on the wavelength (Θ = λ/d), the optical system produces differently sized projections of the object and superposes them in the final image (equivalent to the chromatic aberration of glass lenses). The structures become blurred and reduced in contrast, which is particularly problematic for objects with a high mass density, such as large frozen-hydrated macromolecular complexes, viruses, cell organelles, or intact cells. In general, thick ice layers considerably increase the proportion of inelastically scattered electrons.
It is, however, possible to separate electrons of lower energy from the elastically scattered ones by means of an energy filter (Section 19.3.5).

Electron Energy Loss Spectroscopy

The energy loss of the beam electrons (ΔE up to about 2000 eV) correlates with the energy uptake by object atoms and thus contains information on the interacting elements. The spectral analysis of inelastically scattered electrons reveals the elemental composition of the object, and since these electrons can also be used for imaging, we can record images of the element distribution (elemental maps). Spectra are obtained by electron energy loss spectroscopy (EELS) and images by electron spectroscopic imaging (ESI) or electron spectroscopic diffraction (EDI). For this purpose the microscope must be equipped with a magnetic prism that sorts electrons according to their energy and filters out those that are not required for imaging. EELS is usually applied to thin sections of cells or tissues; it is not suitable for single molecules and vitrified specimens.

Mass Determination in the Scanning Transmission Electron Microscope

The strongly elastically scattered electrons that are shielded by the objective aperture also contain information about the biological object, since they were deflected according to the number of protons in the nuclei and the number of the corresponding atoms in the object. These electrons can be utilized for imaging if we replace the aperture with an electron detector (Figure 19.8) in the scanning transmission electron microscope (STEM). Provided the elemental composition of the biological specimen is known, which is the case for proteins to a good approximation, we can interpret the signal intensity as a measure of the mass of the object and determine the molecular mass of protein complexes and other biological structures. The sample must not be embedded in any other material and is investigated in a native, freeze-dried state.
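The principle behind STEM mass determination – the integrated dark-field signal is proportional to the local mass, and averaging over many particles beats the noise – can be sketched as follows. Everything here is simulated: the calibration constant, particle images, and noise level are invented for illustration; in practice the calibration would come from a standard particle of known molecular mass.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed calibration constant (kDa per unit of integrated dark-field signal);
# in a real experiment it is derived from a mass standard.
K_CAL = 2.5e-3

def particle_mass_kda(image, background):
    """Mass estimate: integrate the background-subtracted STEM signal."""
    return K_CAL * np.sum(image - background)

# Simulate 500 images of a hypothetical 700 kDa complex on a flat background,
# spread over a 10 x 10 pixel region, with Gaussian detector noise.
true_mass = 700.0
signal_per_pixel = true_mass / K_CAL / 100
background = 1.0

masses = []
for _ in range(500):
    image = background + signal_per_pixel + rng.normal(0.0, 5.0, size=(10, 10))
    masses.append(particle_mass_kda(image, background))

mean_mass = float(np.mean(masses))  # close to 700 kDa after averaging
```

A single simulated particle carries a noticeable statistical error, but the mean over several hundred particles converges on the true mass – the same reasoning that underlies the ±5% accuracy quoted in the text.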
It is sufficient to record the signal of several hundred (or more) particles for a reasonable statistical analysis with an accuracy of about ±5%. STEM mass determination of isolated proteins is thus not superior to mass spectrometry. However, the method does not depend on the size and structure of the material, and it is possible to determine the mass of large and arbitrarily shaped heterooligomeric and multimeric complexes of macromolecules. Typical tasks for STEM mass determination are the stoichiometric analysis of protein complexes, the mass per length of fibrillar structures such as flagella and microtubules, and the mass per area or per structural feature of membranes, 2D crystals, or other 2D assemblies. Mass mapping is only possible with the STEM technique.

Scanning Electron Microscopy and Analytical Electron Microscopy

Every interaction between beam and object electrons conserves the energy and momentum of the entire system. This means that the energy loss of the beam electrons increases the energy of object electrons. These are excited and occasionally even expelled from the electron shell. This process is accompanied by

emission of electromagnetic radiation, heat, and electrons from the irradiated material. These secondary electrons are used for imaging in the SEM. They originate from the surface of the object and create images of cells and small organisms with an impressive depth of field. The resolving power of the conventional SEM is limited (≈10 nm) and is usually too low for imaging protein complexes. However, low-voltage scanning electron microscopes (LVSEMs) are able to resolve small structures. Electrons that have been removed from the electron shell leave behind a vacancy that is filled by electrons from a higher energy level. The energy difference is emitted as radiation, where vacancies in the K shell entail particularly energy-rich radiation, that is, X-rays. These contain information on the elemental composition of the material, similar to the EELS spectra of the beam electrons. Energy-dispersive X-ray spectrometers (EDSs) detect signals from elements above atomic number 10 (EELS spectra are sensitive down to atomic number 4). X-ray microanalysis is usually applied to ultrathin sections and inorganic samples.

19.3.3 Phase Contrast in Transmission Electron Microscopy

Thin biological specimens that mainly consist of elements of low atomic number (H, C, N, O) behave as weak phase objects in the microscope. The object potential (equivalent to the refractive index in light microscopy) decelerates scattered electrons, and the electron wave leaves the object with a small phase shift Δϕ with respect to the non-interacting reference beam (Figure 19.8). The intensity remains (almost) unmodified. Our eye, digital cameras, and photo-emulsions cannot record phase differences, so the object would actually be invisible. Taking the difference between the weakly phase-shifted wave and the reference wave, we obtain a wave of identical wavelength, smaller amplitude, and a phase shift of about π/2 or λ/4 with respect to the reference. If it were possible to shift the reference wave by π/2 too, so that the phases would be aligned, the waves would interfere and produce a detectable amplitude modulation. In light microscopy, the diffracted and the reference beams are conducted through a glass plate (phase plate) that is of different thickness for the reference beam (the beams are separated from each other in the back focal plane, where the phase plate is located) and creates the desired phase shift. The non-diffracted beam is, moreover, attenuated, which enhances the amplitude modulation, that is, the phase contrast. The situation is more complicated in the TEM. Here, it is sufficient to mention that the spherical aberration of the objective lens (characterized by the parameter CS) generates phase contrast as a function of the scattering angle and that it can be adjusted by varying the focus. The ideal adjustment is a weak underfocus, known as the Scherzer focus after Otto Scherzer (1909–1982), who carried out the theoretical calculations.
However, the phase shift, and thus the contrast, is neither constant nor ideal over the complete scattering range, and it is particularly low under Scherzer focus conditions. The focus-dependent contrast transfer function (CTF) describes the contrast contributions for structural details of size d (corresponding to spatial frequencies related to 1/d; Section 19.4.3). The phase contrast may be strong or weak, become zero, or even change its sign. EM images thus contain more or less completely or correctly transmitted object information as a function of focus. The appropriate interpretation of images therefore always requires analysis, and ideally correction, of the CTF (Section 19.4.3).

19.3.4 Electron Microscopy with a Phase Plate

If there were a phase plate for electron microscopes analogous to the one in light microscopy, one could adjust ideal focus conditions at maximum contrast. The physicist Hans Boersch (1909–1986) had already introduced the basic ideas in 1947, but their realization was hampered by technical problems. Only recently have technical developments changed the situation. The basic principle is to apply an electric potential to the reference beam (or to the scattered one) to create the required phase shift of π/2. This is obtained by an electrostatic potential in the center of the back focal plane for the unscattered electrons (Boersch phase plate) or by a thin carbon film with a small central hole that leaves the reference beam unchanged (Zernike-type phase plate; Figure 19.10a). While both solutions and variants thereof were realized, they did not find their way into everyday applications. In particular, biological EM has not used


Figure 19.10 Phase contrast by phase plates in transmission electron microscopy. (a) Scheme of the optical path in the microscope. The undiffracted (zero) beam is focused in the back focal plane below the object lens. The Zernike type phase plate possesses a central hole so that the zero beam can pass through whereas the diffracted beam crosses the phase plate material (carbon film) and experiences a phase shift due to the positive inner potential of the material. The ideal result is a phase shift of π/2 compared to the zero beam (positive phase contrast). The Volta phase plate is a continuous (carbon) film heated to >200 °C to avoid contamination during irradiation. The focused zero beam induces a negative surface (vacuum) potential (Volta potential) that overcompensates for the positive inner potential of the material and effectively shifts the phase by ideally π/2. The object is imaged with positive phase contrast. (b) Electron microscopic projection of part of a native frozen-hydrated and unstained worm sperm taken without a phase plate and (c) with a Volta phase plate illustrating the increase of contrast. The black dots are gold markers. Bar indicates 200 nm. Source: courtesy of Maryam Khoshouei, Martinsried, Germany.

electrostatic phase plates, and the thin-film Zernike-type phase plate suffers from charging and produces fringing in images, an unavoidable diffraction effect of the sharp edge of the central hole. The effect can be lessened by image processing, but it still affects the image quality. A novel type of thin-film phase plate, the Volta potential phase plate, exploits the electric (surface or vacuum) potential created by the high intensity of the reference beam in a contamination-free carbon film (Figure 19.10). This principle has only recently been discovered, and the formation of the electric potential is not yet fully understood. However, this tool avoids fringing (there is no hole) and contamination (the film is heated), is stable and reusable, and does not require complicated alignments – an important advantage for routine and frequent applications in cryo-EM and cryo-electron tomography. The resulting phase contrast is remarkable, so this technique will likely become a standard application in biological EM of unstained and frozen material (Figure 19.10b).

19.3.5 Imaging Procedure for Frozen-Hydrated Specimens

Cryo-electron microscopy is the most important development in microscopic structure research of biological material. The technique enables us to examine molecules and cellular components at high resolution in a close-to-native state. However, chemically untreated and unstained samples are very radiation-sensitive and are rapidly destroyed upon irradiation. The electrons create molecular radicals that readily react with other molecules and cleave chemical bonds, leading to mass loss and eventually to structural destruction. The low temperature of liquid nitrogen (−196 °C) lessens the reaction rate and renders the biological object six times more resistant to radiation than at room temperature. Cooling with liquid helium (4 K) promises an even higher cryoprotection factor but unfortunately presents us with problems created by the loss of friction in frozen-hydrated specimens. Resolution-limiting radiation damage is indicated by bubble formation in the frozen sample (Figure 19.11). To avoid this effect the total electron dose should not exceed 100 e⁻ Å⁻², whether it is applied to a single projection or to a complete tilt series for 3D reconstruction (Section 19.5). In the latter case, the tolerable electron dose must be shared by all projections and adjustment procedures. Although it is technically possible to keep the dose arbitrarily low for each projection and thus to protect the specimen from any damage, one needs a sufficiently high

19 Electron Microscopy

signal-to-noise ratio for subsequent image analysis and processing (Section 19.5.3). The only way between Scylla and Charybdis here is to use most of the electron dose for data recording and to perform all adjustments at other object sites. In cryo-electron microscopy this process is automated. The microscopist avoids direct inspection of the object on the fluorescent screen and instead records images with a camera. The appropriate procedure is to search the area of interest at low magnification, record a test image at the desired magnification close to the object site, and determine the focus and other parameters offline. The program calculates the actual values for the object site and adjusts the imaging parameters of the microscope accordingly. The micrograph is recorded automatically, and the electron beam is deflected immediately afterwards to minimize irradiation. In cases that require many (hundreds to thousands of) images of single particles, the program selects a new site and thereby scans large areas of the grid. If different projections of a single object are required, the program actuates the sample holder, turns it by a given angle, re-centers the object site, adjusts the focus conditions, records the image, and continues in this way until the series of tilt angles (tilt series) is complete. Thus almost the entire electron dose is available for image recording. Projections of thick, ice-embedded samples contain a significant amount of information originating from inelastically scattered electrons, which attenuates the image contrast considerably (Section 19.3.2). Energy filters operated in zero-loss mode exclude the undesirable electrons that have lost energy; only elastically scattered ones pass, resulting in improved image quality (Figure 19.12). This contrast enhancement is physically independent of the contrast created by phase plates.
The techniques are complementary and they both contribute to optimizing image contrast and quality.
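The dose-sharing argument can be made concrete with a back-of-envelope calculation. Only the 100 e⁻ Å⁻² ceiling comes from the text; the series geometry and the square-root dose dependence of the shot-noise-limited S/N are illustrative assumptions:

```python
# Back-of-envelope dose budget for a tilt series. Only the ceiling of
# 100 electrons per square Angstrom is taken from the text; the series
# geometry below is a hypothetical example.
total_dose = 100.0                # tolerable cumulative dose, e-/A^2
n_projections = 61                # e.g., -60..+60 degrees in 2-degree steps
dose_per_image = total_dose / n_projections   # ~1.64 e-/A^2 per projection

# For shot-noise-limited images the S/N grows with the square root of the
# applied dose, so each tilt image is far noisier than a single projection
# that spends the whole budget at once:
relative_snr = (dose_per_image / total_dose) ** 0.5   # ~0.13
print(dose_per_image, relative_snr)
```

With 61 projections each image receives under 2 e⁻ Å⁻², which is why alignment must rely on high-contrast markers rather than on the faint structures themselves.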

19.3.6 Recording Images – Cameras and the Impact of Electrons

The last step in image formation is the detection of the imaging electrons. In recent years, digital cameras have replaced film plates and are now the usual medium for recording electron micrographs. CCD (charge-coupled device) cameras record electrons indirectly: a scintillator layer converts each incoming electron into a shower of photons, which increases the signal but also broadens it laterally at the expense of high-resolution information. The camera imprints another characteristic on the recorded image, namely the modulation transfer function (MTF), which modulates (attenuates) the signal of structures according to their spatial frequency. Recently, the construction of direct electron detectors brought two important advantages. Firstly, an incoming electron is recorded in the respective pixel of the detector and not spread over several neighboring pixels; that is, the spatial resolution is high. The MTF is considerably improved at higher spatial frequencies, and small structural details can be detected much better. Secondly, the detector is very fast and allows images to be read out in milliseconds, much faster than with CCD cameras. Recording movies (a series of frames) instead of one single (final) image revealed that the objects move because of beam-induced effects. These movements blur images taken by conventional


Figure 19.11 Cryo-electron microscopy of vitrified cells of the archaeon Pyrodictium abyssi imaged with (a) low and (b) high cumulative electron dose. The cells are embedded in ice between the bars of the carbon support. One of the cells contains a protein crystal (enlarged and displayed together with its power spectrum (PS) in the inset). The heavily irradiated image shows clearly attenuated spots in the PS and bubbles (bright) within and outside of the cell. They indicate massive beam damage of the biological material and ice. The cumulative dose must therefore be kept below a critical threshold to prevent detectable beam damage in the object. Source: courtesy of Stephan Nickell, Martinsried, Germany.

Energy filter Electrons of various energy states differ in their frequency and wavelength (“color”). They are refracted differently by the electron lenses and project into slightly different positions in the final image. The projected structures are thus of varying size and reduce the sharpness and contrast upon superposition in the final image. Electromagnetic energy filters disperse the electron beam according to the spectrum of electron energies, and an adjustable aperture selects “monochromatic” electrons of identical wavelength for imaging or analysis. Energy filters correspond to color filters in optical microscopy.


Part II: 3D Structure Determination

Figure 19.12 TEM images of vitrified lipid vesicles ((a) and (b)) and of the frozen-hydrated enzyme complex tripeptidyl peptidase II from Drosophila melanogaster ((c) and (d)). (a) Electron micrograph without energy filtering; elastically and inelastically scattered electrons contribute to the image. (b) Only elastically scattered electrons formed the phase contrast image. Inelastically scattered electrons with lower energy were filtered out. The multiply nested vesicles are now clearly visible. (c) The enzyme complexes are poorly detectable in the original micrograph because of the low contrast and signal-to-noise ratio. (d) The macromolecules in different orientations at higher magnification. A classification of equivalent projections is shown in Figure 19.19. Source: part (b) courtesy of Rudo Grimm, Martinsried, Germany; part (d) courtesy of Beate Rockel, Martinsried, Germany.

cameras and also destroy high-resolution information. The shifts can now be corrected by aligning the frames of a series and summing them into a well-resolved, unblurred image. The new detectors push the quality of suitable single-particle reconstructions to quasi-atomic resolution, a “quantum step” in electron microscopic imaging. The first step in electron microscopy – recording images of the biological sample – is now complete. The second part deals with data analysis and image processing for 3D reconstruction.
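The frame-alignment idea can be sketched with a toy example: estimate each frame's drift from the peak of an FFT-based cross-correlation and sum the shifted-back frames. The synthetic object, the whole-pixel shifts, and the noise level are assumptions for illustration; real software works with subpixel precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "movie": a blocky particle that drifts by whole pixels per frame.
# Real data drift by subpixel amounts; integer shifts keep the sketch short.
obj = np.zeros((64, 64))
obj[24:40, 24:40] = 1.0
true_shifts = [(0, 0), (2, -1), (4, -3), (5, -5)]
frames = [np.roll(obj, s, axis=(0, 1)) + 0.05 * rng.standard_normal(obj.shape)
          for s in true_shifts]

def estimate_shift(ref, img):
    """Integer drift of img relative to ref from the cross-correlation peak."""
    cc = np.fft.ifft2(np.fft.fft2(ref).conj() * np.fft.fft2(img)).real
    iy, ix = np.unravel_index(np.argmax(cc), cc.shape)
    ny, nx = cc.shape
    # Map FFT indices to signed shifts
    return (iy if iy <= ny // 2 else iy - ny,
            ix if ix <= nx // 2 else ix - nx)

# Shift every frame back and sum: the average is unblurred.
aligned = []
for f in frames:
    dy, dx = estimate_shift(frames[0], f)
    aligned.append(np.roll(f, (-dy, -dx), axis=(0, 1)))
average = np.mean(aligned, axis=0)
```

Summing the raw frames would smear the particle along the drift path; summing the aligned frames preserves the sharp edges.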

19.4 Image Analysis and Processing of Electron Micrographs

Electron micrographs contain the recordable signal of the object, albeit obscured by unstructured noise and modulated by the contrast transfer function of the microscope and the signal transfer characteristics of the camera. The smaller the objects and the higher the expected resolution of molecular details, the more noise and hardware effects impair the visibility of structures. Noise and the effects of transfer functions must therefore be determined and ideally separated from the desired signal. This is the business of image analysis and the processing of electron micrographs.

19.4.1 Pixel Size

Digital EM images are usually 2048 × 2048 (“2k”), 4096 × 4096 (“4k”), or more pixels in size. The beneficial primary magnification of the microscope depends on the pixel size of the camera and the desired resolution of the image. Harry Nyquist (1889–1976), Claude E. Shannon (1916–2001), and others described the minimal condition for the resolution of structural details in digital images. A structural element can be regarded as resolved if it is defined by at least two pixels (in one dimension). If Pc is the pixel size of the camera and M the primary magnification of the microscope (including corrections for the optical distance to the camera), the pixel size at the object level is Po = Pc/M and the resolution limit d = 2Po. The pixel should not exceed 1/3 or 1/4 of the desired resolution to compensate for some loss upon interpolations (Equation 19.3) and to cope with the limited sensitivity of cameras (Section 19.3.6):

Po = Pc/M ≤ d/3   (19.3)

The visibility of tiny structures depends on the physical resolution of the microscope (Equation 19.2), on the parameters of image recording, the type and quality of the camera, and on the delicate preparation of the biological object (fixation, staining). The shape of negatively stained protein molecules can be resolved to 2 nm. Parts of this size consist of about 30 amino acids (NAA) with a mass of 3400 Da (MPROT). These values, as well as the domain volume Vd (nm³), can be assessed with Equation 19.4. The factors 3.9 or 7.6 (nm⁻³) and 430 or 840 (Da nm⁻³) derive from the protein density (1.41 g cm⁻³) and the average mass per amino acid (110 Da) in proteins. The average protein density depends on the protein size and increases to about 1.5 g cm⁻³ as the protein size decreases from 30 to 10 kDa. Often, the molecular water layer on proteins is part of the volume determination and reduces the average density to 1.37 g cm⁻³, a value that is commonly applied in calculations:

NAA ≈ 3.9·d³ ≈ 7.6·Vd   and   MPROT ≈ 430·d³ ≈ 840·Vd   (19.4)

Vitrified proteins yield structural information to 0.5 nm or even better after image processing. Thus the secondary structure and, in ideal cases, even the densities of amino acid residues are resolved. Cameras with a typical pixel size of 15–20 μm suggest a primary magnification M of 30 000–100 000. Higher magnifications are usually not necessary and would increase the radiation burden, which is related to M².
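Equations 19.3 and 19.4 are easy to apply in a few lines. The camera pixel of 15 μm and the magnification of 50 000 are example values within the ranges quoted above:

```python
# Object-level pixel size and resolvable detail (Equation 19.3), plus the
# rule-of-thumb size/mass estimates for protein domains (Equation 19.4).
def object_pixel_size(camera_pixel_um, magnification):
    """P_o = P_c / M, returned in nm."""
    return camera_pixel_um * 1e3 / magnification

def protein_estimates(d_nm):
    """Amino acids and mass (Da) of a globular domain of diameter d (nm)."""
    n_aa = 3.9 * d_nm ** 3        # 3.9 nm^-3: from 1.41 g/cm^3, 110 Da/residue
    mass_da = 430.0 * d_nm ** 3   # 430 Da nm^-3
    return n_aa, mass_da

Po = object_pixel_size(15, 50_000)   # example: 15 um camera pixel, M = 50 000
nyquist_limit = 2 * Po               # two pixels per resolved detail
safe_resolution = 3 * Po             # the safer P_o <= d/3 criterion

n_aa, mass = protein_estimates(2.0)  # a 2 nm domain
print(Po, nyquist_limit, n_aa, mass) # 0.3 nm, 0.6 nm, ~31 residues, ~3440 Da
```

The 2 nm domain reproduces the figures quoted in the text: roughly 30 amino acids and 3400 Da.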

19.4.2 Fourier Transformation

Many operations for image analysis and processing are performed not on the real data but on its Fourier transform (FT). It is used, for example, to judge the quality of the contrast transfer function (CTF) and to analyze structural regularities of objects. The FT is also required for improving the signal-to-noise conditions in low-contrast images and for certain steps in averaging and 3D reconstruction. It is helpful to know what characteristics a Fourier-transformed image has, but it is not necessary here to introduce the mathematics developed by Joseph Fourier (1768–1830).

Let us consider a one-dimensional image, that is, a single pixel line of a micrograph, representing the intensity curve of an object. Fourier’s insight was that almost all continuous curves can be represented by the sum of (infinitely many) basic sine functions of varying frequencies (Figure 19.13). To distinguish the sine oscillations along a distance in space (e.g., along the x-direction of an image) from oscillations in time (frequency) we use the term spatial frequency. The inverse or reciprocal of the spatial frequency, the wavelength, identifies a certain distance d in the real image, and the amplitude of the sine wave is related to the density oscillation along d. Mathematically, the sine function has the amplitude 1. To adjust it to the corresponding density value in the image we need to multiply it by a factor. These factors describe the density distribution (i.e., the gray values) of the object’s structure and are known as structure factors. Sine functions with small spatial frequencies (large wavelengths or values of d) belong to large object structures, and functions with high spatial frequencies (small values of d) to small structural details.
Obviously, the macrostructure (e.g., the body of a hedgehog) is characterized by large structure factors and the fine structure (e.g., the spines) by much smaller amplitudes of the corresponding sine functions. If an image is to be analyzed to maximum resolution, the very small structure factors have to be separated from the contributions of superposed high-frequency noise. The amplitudes of the sine waves are, however, not sufficient to describe the curve of a structural density exactly. We also have to know the exact location of the origin of each sine wave. The position of the corresponding density curves defines where the object (the hedgehog) is located and how the substructures (the spines) are arranged. Since the trigonometric function is


Figure 19.13 Fourier analysis (Fourier transformation) of a one-dimensional object (a, curve in black). The superposition of three sine functions (broken lines in red) approximately traces the original structure (continuous line in blue). The object could be exactly reconstructed by (infinitely) many sine functions with increasing spatial frequencies, appropriate amplitudes, and phase shifts. Plotting the amplitudes and phase shifts of all constructive sine functions against the spatial frequency produces the Fourier transform of the original image (b). The transform has basically the same size as the original image, but it contains equivalent Fourier data symmetrically to the center that are related according to Friedel’s law (Georg Friedel, 1865–1933). This relation is clearly defined and so it is sufficient to show only one half of the Fourier transform.

periodic, the largest difference between the origin of a particular sine wave and a common reference point (e.g., the origin of a coordinate system in the center of an image) is the wavelength of the corresponding sine function. This deviation is called the phase shift (−π ≤ Δϕ ≤ π), often (but incorrectly) referred to as phase. The object function is thus fully described by the sum of elementary sine waves with (continuously) increasing frequency νi and the corresponding individual amplitudes ki and phase shifts Δϕi. A description of an image with an infinite number of sine functions is impractical. Since we usually record digital images of n pixels in size (e.g., in the x-direction) we subdivide the object function F into n pixels with n/2 spatial frequencies, that is, basic sine waves. The smallest frequency ν possesses a period of exactly 1, with a wavelength λ = n pixels corresponding to the image size S (S = n pixels, where ν = S/λ). The highest frequency (Nyquist frequency) is characterized by the smallest possible wavelength, λ = 2 pixels. The object function F is the sum of all possible (finite) sine functions (Equation 19.5). The pixel number s indicates a position in the image of size S, and x denotes the relative position (x = s/S). The value F(x) describes the density (gray value) of a pixel at curve position x and is calculated by:

F(x) = Σi ki · sin(2π νi x + Δϕi),   x = s/S   (19.5)

The Fourier analysis of an image is nothing other than the calculation of all the structure factors ki and phase shifts Δϕi that are necessary to scale the sine waves with frequencies νi and to place them with respect to the origin of the coordinate system (the image). Fourier transforms are therefore data (“images”) that consist of two parts: one contains the amplitudes (structure factors), the other the phase shifts of the sine functions, each plotted against the spatial frequency ν. Since ν is related to the reciprocal of distance in real space, Fourier transforms are representations of images in reciprocal (Fourier or frequency) space. The Fourier transform F(x,y) of a two-dimensional image requires two-dimensional sine functions in (x,y), of course, but these are handled in essentially the same way (Figure 19.14).
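Equation 19.5 can be verified numerically: synthesize a pixel line from a few sine waves and read the amplitudes ki and phase shifts Δϕi back from a discrete Fourier transform. The three components are arbitrary example values:

```python
import numpy as np

# Synthesize a 1D density line from three sine waves (Eq. 19.5) and
# recover their amplitudes k_i and phase shifts dphi_i from the FFT.
n = 256
x = np.arange(n) / n                        # relative position x = s/S
components = [(3, 1.0, 0.5), (8, 0.4, -1.2), (21, 0.15, 2.0)]  # (nu, k, dphi)
F = sum(k * np.sin(2 * np.pi * nu * x + dphi) for nu, k, dphi in components)

# One complex Fourier coefficient per spatial frequency; dividing by n/2
# makes the coefficient magnitude equal to the sine amplitude k.
spectrum = np.fft.rfft(F) / (n / 2)

recovered = []
for nu, k, dphi in components:
    amp = np.abs(spectrum[nu])
    # sin(theta) = cos(theta - pi/2) and np.angle follows the cosine
    # convention, so the sine phase shift is angle + pi/2.
    phase = np.angle(spectrum[nu]) + np.pi / 2
    recovered.append((amp, phase))
```

Each recovered (amplitude, phase) pair matches the component that went into the synthesis, exactly as the equation promises for frequencies on the sampling grid.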


Figure 19.14 (a) Examples of two-dimensional sine functions (organized in a 2D lattice) with lower and higher spatial frequencies, varying orientations, and different amplitudes. The right-hand column shows the real images of the functions (bright areas indicate positive, dark areas negative values); the left-hand column shows the corresponding power spectra (PS). The spots in the PS characterize the spatial frequencies and the orientations of the lattices in the x–y-direction. (b) Examples of the synthesis of images through superposition of basic sine waves. The upper image contains the information from the first two images in (a), the other images increasingly more sine functions with different orientations and higher spatial frequencies. (c) Examples of symmetric and antisymmetric (non-periodic) images (column on the left), their Fourier transforms (FT, central columns), and power spectra (column on the right). The FTs consist of a symmetric (real) and an antisymmetric (imaginary) part each (see Equation 19.6). The Fourier data represent negative (dark) and positive values (bright); zero corresponds to a medium gray. The FT of the symmetric image possesses Fourier coefficients unequal to zero only in the real part, the FT of the antisymmetric Yin Yang symbol only in the imaginary part. The image in the middle contains symmetric and antisymmetric features that can be separated by setting one or the other of the two FT parts to zero prior to back-transformation.

Usually, the FT is given in an alternative form, which is obtained by a simple trigonometric operation (Equation 19.6):

ki · sin(2π νi x + Δϕi) = ai · cos(2π νi x) + bi · sin(2π νi x)
with ai = ki · sin(Δϕi) and bi = ki · cos(Δϕi)   (19.6)

The amplitudes and phase shifts can be calculated from the factors ai and bi. The FT of an image is again a two-part representation, now of the factors ai and bi as functions of spatial frequency. This representation has the advantage of possessing analytical properties. Since the cosine is a symmetric function with respect to the origin while the sine is an antisymmetric one, the part of the FT containing the factors ai characterizes the symmetric features and the part with the factors bi the antisymmetric features of the structure. The Fourier transform is a complex mathematical function (and usually written in exponential form); the symmetric part is therefore also named the real part and the antisymmetric one the imaginary part of the FT (Figure 19.14). Image processing systems make use of the fast Fourier transform (FFT), an efficient algorithm for transforming digital (discrete) data of certain dimensions. It is often sufficient to know the structure factors (intensity, power) and their distribution in reciprocal space to get some idea of object characteristics and image quality. These are provided by the power spectrum (PS), that is, the product of the FT with its complex conjugate FT∗. The PS is the transform of the autocorrelation function of an image, the “cross-correlation” of an image with itself (Section 19.4.4). The PS lacks the phase information and contains only the squared structure factors. It corresponds to the light-optical diffractogram that is created by diffraction of coherent light (e.g., of a laser beam) passing through a micrograph of a structure (Figure 19.14). Calculations of Fourier transforms, power spectra, and correlation functions are standard operations in program systems for the analysis, processing, and 3D reconstruction of images.
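The statement that the PS is the transform of the autocorrelation function can be checked numerically on a small random image (a minimal sketch; a real application would use an electron micrograph):

```python
import numpy as np

# Numerical check: the power spectrum (FT times its complex conjugate)
# is the Fourier transform of the image's autocorrelation function.
rng = np.random.default_rng(1)
img = rng.standard_normal((32, 32))

F = np.fft.fft2(img)
power_spectrum = (F * F.conj()).real     # |F|^2, phase information lost

# Back-transforming the power spectrum yields the circular
# autocorrelation of the image ...
auto = np.fft.ifft2(power_spectrum).real

# ... which matches a direct real-space correlation for any chosen shift:
shift = (3, 5)
direct = float(np.sum(img * np.roll(img, (-shift[0], -shift[1]), axis=(0, 1))))
```

The value `auto[3, 5]` agrees with the brute-force sum `direct` to machine precision, which is exactly the correlation theorem exploited by image processing systems.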

19.4.3 Analysis of the Contrast Transfer Function and Object Features

The electron microscopic image is a function of the projected object structure and of the contrast transfer function of the microscope, which depends on the electron optical characteristics and the actual adjustments of the imaging parameters. The object structure (the density function) is convolved with the transfer function during the imaging process. Mathematically, this means that the two functions are multiplied in Fourier space, and hence their effects can be analyzed in the Fourier transforms of images.

Contrast Transfer Function We already know that the diffraction of the electron beam by the object results in positive and negative phase contrast, depending on the diffraction angle, and that regions in between lack contrast, that is, the structure factors of the corresponding spatial frequencies are (close to) zero (see Section 19.3.2). The transfer function is particularly clear in projections of a thin amorphous carbon film. Depending on the focus adjustment, the power spectrum of such an image shows several bright rings separated by dark gaps. The bright regions represent the squares of the structure factors of transferred spatial frequencies. The dark rings denote gaps in information transfer (contrast); the intensity of the corresponding spatial frequencies is very low, or even zero, and thus eliminated. This pattern is known as Thon rings (Figure 19.15). The structure factors of the object that fall into these gaps are eliminated as well, and the corresponding information is thus missing in the electron micrograph. Structure details represented by spatial frequencies just beyond the first gap are imaged with inverted contrast, and so on. Such effects can hardly be identified by eye in projections of biological specimens, but a series of images taken with different focus values illustrates them (Figure 19.15). There is an ideal focus level (moderate defocus) that guarantees a continuous transfer of contrast up to a spatial frequency of 1 nm⁻¹ and that is advantageous for biological electron microscopy of (stained) protein complexes.
This focus level is close to the absolute contrast minimum, and it requires some training to prefer it to the seemingly clear, contrasty but heavily defocused adjustment. If we aim for reconstructions with maximum resolution, we have to correct for the contrast reversal of certain spatial frequencies afterwards. The Fourier coefficients between the first and the second gap (and the third and fourth gap, etc.) are “flipped” by multiplication by −1, and images recorded at different focus levels (producing different CTFs) fill in the information missing in the unavoidable gaps. Two typical imaging aberrations, astigmatism and drift, are easily identified in power spectra (and should encourage the microscopist to discard the affected data). Astigmatism is the effect of different focus states in perpendicular image directions. It stems from distortions of the electromagnetic field in the lenses of the microscope. The Thon rings become ellipses, and hyperbolas close to in-focus (“Gauss focus”) conditions. The gapless transfer of the signal up to a certain spatial frequency (casually referred to as resolution) may therefore change dramatically in different directions of the image. Such micrographs typically exhibit streak artifacts. Objects or specimen holders that drift while the micrograph is being recorded produce images blurred in the direction of movement and Thon rings with partially reduced intensities. It is an essential part of quality control to evaluate the transfer function and to select only suitable electron micrographs for further processing.
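Phase flipping is simple to sketch. The CTF model below is the standard weak-phase approximation with assumed imaging parameters, since the text does not give the formula explicitly:

```python
import numpy as np

# Sketch of "phase flipping". The weak-phase CTF model and the imaging
# parameters are assumptions for illustration, not taken from the text.
wavelength = 1.97e-3           # nm, electron wavelength at 300 kV
defocus = 2000.0               # nm underfocus
cs = 2.0e6                     # nm spherical aberration (2 mm)

k = np.linspace(0.0, 4.0, 512)                     # spatial frequency, nm^-1
chi = (np.pi * wavelength * defocus * k**2
       - 0.5 * np.pi * cs * wavelength**3 * k**4)  # aberration phase chi(k)
ctf = -np.sin(chi)             # oscillates; sign changes = contrast reversal

# Multiply the Fourier coefficients by the sign of the CTF so that every
# transferred band has the same contrast sign; the gaps (CTF ~ 0) stay
# empty and must be filled from images taken at other defocus values.
flipped = ctf * np.sign(ctf)   # equals |ctf|, never negative
```

After flipping, no spatial-frequency band is transferred with inverted contrast, which is the prerequisite for combining images of different defocus into one reconstruction.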
Here, the same amplitudes and phase shifts occur as often as molecules exist in the 2D crystal, and they accumulate to a significant spot (reflex) at corresponding spatial frequencies in the PS. Moreover, the frequencies describing the regular structure must be related to each other, that is, the higher frequencies are always whole-number multiples of the lowest frequency since all the sine waves must have identical relative positions with respect to all molecules in the lattice. Sine functions that do not hit identical reference positions of each molecule with the same phase (and thus with the same intensity) are not suited to describe the structure of the crystal and are blanked out; their structure factors are zero. This is the case for most frequencies and only a few constructive ones are left, which are identified by a number of regularly arrayed diffraction spots in the PS (Figure 19.14). Irregularly distributed data inbetween originate from other, non-crystalline contaminants and noise. The arrangement of diffraction spots yields information on the crystal structure such as the orientation and type of the lattice (i.e., tetragonal, hexagonal, and others), the lattice constant (the periodic distance between molecules in the crystal), and the characteristic angle between the lattice vectors


Figure 19.15 (a) Focus series of a negatively stained protein complex (diameter 12 nm). The images were taken with strong underfocus (maximum contrast), moderate underfocus, in-focus conditions (Gauss focus with minimum contrast), and strong overfocus. The grain pattern is particularly dominant in strong defocus conditions. (b) The corresponding power spectra illustrate the effect of the contrast transfer function, which varies with focus. Dark gaps between the bright Thon rings denote missing structure information. In cases of strong defocus these gaps already occur at low spatial frequencies, that is, with relatively large object structures. The first gaps in the series are located at spatial frequencies of (3.2 nm)⁻¹ and (1.4 nm)⁻¹.

(>60° or to <−60°). Rotational symmetry also reduces the region of missing data from a missing wedge to a missing pyramid (Figure 19.20). Two-dimensional crystals of macromolecules are usually not ideally regular. It is thus profitable to average the unit cells by correlation methods and then to combine the averages in Fourier space. Averaging and interpolation of lattice lines efficiently remove superimposed noise. Biological objects with several or many identical macromolecular units in regular arrangements – for example, capsids of viruses and phages, or helical structures such as flagella, pili, and other protein filaments (Table 19.1) – are suited for averaging approaches. If the geometrical positions of the repetitive units are known, for example, the helical arrangement and the number of units per helical turn, it is possible to define the projection geometry for all units. The great advantage of these macromolecular structures is that, in principle, a single projection is sufficient to calculate the 3D reconstruction, since one micrograph already contains various projections of the macromolecule. The same holds for virus capsids.
In addition, it is of course possible to combine several 3D data sets into a final, better-defined reconstruction (Figure 19.22).
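The concentration of a 2D crystal's signal into discrete diffraction spots is easy to reproduce numerically: the power spectrum of a synthetic lattice of identical motifs peaks at multiples of the reciprocal lattice frequency, even in the presence of noise (all parameters are illustrative):

```python
import numpy as np

# Synthetic "2D crystal": an identical Gaussian motif repeated on a
# square lattice (period 16 pixels), plus noise. Its power spectrum
# concentrates into discrete spots at the reciprocal lattice points.
n, period = 256, 16
y, x = np.mgrid[0:n, 0:n]
motif = np.exp(-(((x % period) - 8) ** 2 + ((y % period) - 8) ** 2) / 6.0)
rng = np.random.default_rng(2)
image = motif + 0.3 * rng.standard_normal((n, n))

power = np.abs(np.fft.fft2(image)) ** 2
power[0, 0] = 0.0            # suppress the mean (zero-frequency) term

# The strongest remaining spot lies at the fundamental lattice frequency,
# n / period = 16 (or a symmetry-related equivalent).
iy, ix = np.unravel_index(np.argmax(power), power.shape)
```

The noise spreads evenly over all spatial frequencies, while the lattice signal piles up in a few reflections; this is exactly why diffraction spots remain visible in power spectra of noisy crystal images.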

Tomography All tomographic reconstruction approaches share the same principles: projection of the object density by transmission imaging, combination of projections from different directions, and calculation of a three-dimensional data cube. The methods are non-invasive and provide insight into an intact object. Essentially, all electron microscopic 3D reconstruction approaches are tomographic. However, it has become customary to speak of tomographic reconstruction for individual, non-crystalline objects (complex protein assemblies or intact cells) that were projected under different angles in a tilt series.

19.5.3 Electron Tomography of Individual Objects

Individual biological objects such as macromolecular assemblies, membranes, amorphous viruses, ultrathin sections of cells, cell organelles, or intact cells require tilt series data for 3D reconstruction. This application is known as electron tomography, in analogy to computer tomography of macroscopic specimens or patients. Three-dimensional reconstructions of single molecules and regular structures, as described in the previous sections, always include averaging steps over identical particles to increase the quality and completeness of the data for the final 3D model. But the reconstruction process itself – that is, filling the Fourier space with data and back-projection or equivalent approaches – is independent of averaging. It is also possible to reconstruct non-repetitive, individual objects (Figure 19.20b). Indeed, the specific feature of electron microscopy is its ability to image individual structures down to the subnanometer range. To do so, (cellular) cryo-electron tomography has to solve two incompatible challenges. On the one hand, we should record many projections over the whole angular range with maximum signal-to-noise ratio to achieve the best resolution of small structures. On the other hand, it is necessary to minimize the


Figure 19.22 Three-dimensional reconstruction of a negatively stained 2D protein crystal, the surface layer (S-layer) from the Gram-positive bacterium Sporosarcina ureae. (a) Averages of the tilt series projections with non-equidistant tilt angles from 0° to 78°. (b)–(g) Horizontal sections through the 3D reconstruction in contour line presentation. The 3D volume contains four unit cells with the lattice constants a = b = 13 nm and a lattice angle of 90°. The sections show the protein density from the outer surface of the layer (b) to the inner surface (g) at positions of +2.6, +1.2, +0.3, −0.3, −1.2, and −2.5 nm with respect to the central plane. (h) and (i) Surface-rendered 3D models showing the inner and the outer surface of the protein layer. For vertical sections through the central protein domain see Figure 19.20.

total electron dose so as not to destroy the very structures that we want to see in the reconstruction! Fractionation of the total dose over all micrographs of the tilt series results in a corresponding loss of S/N ratio, with the consequence that structures may not be detectable among the obscuring noise. The S/N ratio of the individual projections must still be high enough that they can be aligned with each other (Section 19.3.3). The mechanical accuracy of the goniometer is not good enough to avoid image shifts upon tilting, and these unavoidable errors have to be corrected afterwards. Small gold clusters (colloidal gold) that are added to the sample before freezing provide high-contrast anchor points and facilitate the correlation alignment. The S/N of the structures of interest is then not limiting for the alignment and may even be low. The two conditions – the maximally allowed cumulative electron dose and minimal contrast (especially in images taken without a phase plate) – limit the possible number of projections of a tilt series and thus the achievable resolution for an object of thickness D. Cellular objects should not be thicker than 0.5 μm, which means that cryo-electron tomography is primarily restricted to viruses, a number of prokaryotes, and flat areas of eukaryotic cells. However, cryo-microtomy and ablation of material by focused ion beam (FIB) treatment offer ways to investigate sections or subvolumes of bigger biological cells in a close-to-native state (Section 19.2.1). Reconstructions of individual objects cannot exploit averaging to minimize noise, so other criteria are required to separate structures from uncorrelated noise. One approach is nonlinear anisotropic diffusion. This algorithm exploits the moderate variation of voxels belonging to a continuous structure, which differs from the randomly varying, uncorrelated values of noise. The missing wedge of data also creates problems in tomograms (Section 19.5.2).
One effect is that flat (thin) structures such as membranes become blurred or even invisible in the z-direction. Accordingly, the reconstructed top and bottom regions of cells are incomplete. To reduce the region of missing data in Fourier space one can rotate the grid by 90° in the microscope after recording a tilt series and take another series. The missing wedge then becomes a missing pyramid and the data are more complete (Figure 19.20). The final result of a 3D reconstruction is a data cube that contains the density distribution of the reconstructed cell or section. The interpretation of these complex and detail-rich structures requires special approaches for analysis and visualization that allow the researcher to identify and visually enhance distinct structures (Figure 19.23 and Section 19.6).
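The gain from a second tilt axis can be quantified with a small numerical experiment. By the projection theorem, a single-axis series about the y-axis with maximum tilt α samples the Fourier voxels with |kz| ≤ |kx|·tan α; the grid size and the ±60° range are assumptions for illustration:

```python
import numpy as np

# Fraction of Fourier space missing for a single-axis tilt series
# ("missing wedge") versus a dual-axis scheme ("missing pyramid").
# Tilting about the y-axis up to +/-alpha samples all voxels with
# |kz| <= |kx| * tan(alpha); the second series, taken after rotating
# the grid by 90 degrees, additionally samples |kz| <= |ky| * tan(alpha).
alpha = np.deg2rad(60.0)
ax = np.linspace(-1.0, 1.0, 101)
kx, ky, kz = np.meshgrid(ax, ax, ax, indexing="ij")
inside = kx**2 + ky**2 + kz**2 <= 1.0          # evaluate within a sphere

wedge = np.abs(kz) > np.abs(kx) * np.tan(alpha)
pyramid = wedge & (np.abs(kz) > np.abs(ky) * np.tan(alpha))

frac_wedge = wedge[inside].mean()      # ~1/3 for a +/-60 degree series
frac_pyramid = pyramid[inside].mean()  # markedly smaller
print(frac_wedge, frac_pyramid)
```

For a ±60° series about one axis roughly a third of Fourier space is missing; the second, perpendicular series shrinks the unsampled region to the much smaller pyramid around the kz-axis, which is why dual-axis tomograms show membranes running perpendicular to the beam much more completely.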


Figure 19.23 (a) Original projection and reconstructed 3D data of the virus Herpes simplex, (b) of Mycobacterium bovis BCG, and (c) of a part of the eukaryotic cell Dictyostelium discoideum. All the organisms were imaged by cryo-electron tomography without any chemical fixation or staining and reconstructed by filtered back-projection. The images show the 0° projection, one (central) x–y slice of the reconstruction, and the surface-rendered 3D model. The model of the mycobacterium shows the lipid bilayer structure of the inner and outer membranes and material in the periplasmic space, that is, cell wall polymers. (d) Surface model of the tripeptidyl peptidase II from Drosophila melanogaster as obtained from single-particle cryo-electron microscopy; the protein dimers from X-ray structure determination are fitted into the electron density. This hybrid approach allows us to calculate a quasi-atomic model of the entire 40-meric enzyme complex. Source: courtesy of Kay Grünewald (a), Ohad Medalia (c), and Beate Rockel (d), Martinsried, Germany.

19.6 Analysis of Complex 3D Data Sets

19.6.1 Hybrid Approach: Combination of EM and X-Ray Data

Cells contain large and heterogeneous protein complexes that may be purified without losing their integrity but that do not crystallize for X-ray structure determination. Examples are the 26S proteasome, the spliceosome, polysomes, the bacterial flagellar motor, the nuclear pore complex, centriole structures, cilia, the cytoskeleton network, and many others. However, in many cases it is possible to isolate the subunits and to determine their atomic structures individually. These subunits can then be fitted into the density of the intact complex as obtained

19 Electron Microscopy

from single-particle or tomographic reconstructions and can be used to calculate a pseudo-atomic model. This hybrid approach is often the only successful way to get deeper insight into the structure and functional conformations of protein complexes; the enzyme TPP II is a typical example (Figures 19.21 and 19.23). There are two approaches for fitting, namely, rigid body docking and flexible fitting. Rigid body docking is suitable for moderate-resolution structures (1–3 nm), where the best placement and orientation of subunits are determined by correlation methods. Higher resolution structures often reveal local discrepancies between the atomic structure and the reconstructed model that derive from real conformational differences. Molecular dynamics calculations allow adaptation of the atomic structure in such a way that it fits into the structure of the whole complex. Approaches are now being explored to model structures consisting of many parts or structurally unknown components, such as the nuclear pore complex. Here, the combination of biochemical data and information from other sources yields additional criteria to define restricting conditions for modeling.

19.6.2 Segmenting Tomograms and Visualization

Tomograms of cells or cellular segments contain the signatures of many (theoretically all) macromolecules and cellular structures one would like to analyze and identify in 3D presentations. Due to the high protein concentration in living cells (80–400 mg ml⁻¹), creating the phenomenon of macromolecular crowding, it is difficult or even impossible to visually assign density data in tomograms to distinct structures. It is thus necessary to demarcate structures of interest such as membranes, filaments, or macromolecular protein complexes from each other. They are segmented and usually highlighted by colors in 3D models. Figure 19.23 shows EM data and the corresponding 3D models of segmented tomograms from a virus, a bacterial cell, and a segment of a eukaryotic cell. The simplest means of segmentation is to define a gray value that represents the border of a molecular surface and to blank out all voxels below this threshold. This is the common method of rendering 3D models of single molecules (Section 19.6.1), but it is not suitable for complex structural assemblies such as biological cells. Here, we need criteria to identify voxels that are correlated and belong to a coherent structure. There are automated procedures that recognize membranes and filaments even in noisy 3D data, but it will sometimes be necessary to refine segmentations by hand. Segmentation of single protein complexes in the crowded cytoplasm is not possible this way, so we need more powerful approaches.
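The gray-value thresholding described above can be sketched in a few lines. This is a toy illustration with invented names and values, not code from a rendering package:

```python
import numpy as np

def threshold_segment(volume, iso):
    """Keep only voxels at or above the isovalue; the rest are blanked.

    This is the simple surface-rendering criterion described in the text;
    it works for isolated molecules but not for crowded cellular volumes.
    """
    mask = volume >= iso
    segmented = np.where(mask, volume, 0.0)
    return segmented, mask

# toy volume: a dense 'molecule' embedded in weak background density
vol = np.full((8, 8, 8), 0.1)
vol[2:6, 2:6, 2:6] = 1.0
seg, mask = threshold_segment(vol, iso=0.5)
print(int(mask.sum()))  # 4*4*4 = 64 voxels survive the threshold
```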

19.6.3 Identifying Protein Complexes in Cellular Tomograms

If protein complexes are located closely together in cells or if they are part of a supramolecular structure, we use the approach of template matching to identify and localize them in tomograms (Figure 19.24). For this purpose, we need a 3D model of the protein complex of interest. Models from X-ray crystallography or single-particle reconstructions are suitable data. Cross-correlation of the tomogram with the template yields the position and the orientation of the molecules of interest. The correlation is a function of the two 3D data sets, evaluated over six degrees of freedom: the position (x, y, z coordinates) and the three Euler angles. The required computing power is challenging. However, the approach is feasible; experiments have localized and identified different molecules in phantom cells (lipid vesicles) (Figure 19.24). These experiments showed that a resolution of at least 2 nm is necessary to identify the target molecules reliably and to minimize false positive hits. This is a challenging proposition, but the new electron detectors and improved imaging conditions promise fruitful scenarios. The 3D distribution and arrangement of ribosomes in bacterial cells have already been investigated (Figure 19.24), the native structure of polysomes (poly-ribosomes) and inactive pairs of hibernating ribosomes have been studied in situ, and recent research has identified active and inactive conformations of the 26S proteasome in neuronal cells. Each cellular tomogram is a unique structure, but it contains many copies of the same proteins that may be extracted from the 3D data set after localization and identification and that

Macromolecular crowding: The high concentration of macromolecules in cells means that about 30% of the cytoplasmic volume is occupied by protein complexes and other big biological molecules. The characteristic distance between neighboring macromolecules is only 10–20 nm, that is, the size of many protein complexes. Big complexes thus cannot occupy these intermolecular zones – only a rather limited volume remains available for them. There is literally a shortage of space in crowded cells. This situation causes dramatic equilibrium shifts for reactions that increase or decrease the occupied volume (including the surrounding space) of macromolecules. Amongst these processes are isomerization and folding reactions or oligomerization and dissociation of protein complexes. This is the reason why supramolecular complexes and protein assemblies are more stable in cells than in the dilute environment in vitro. Some of these (hypothetical) complexes can thus only be observed in intact cells.

Figure 19.24 (a) Detection of macromolecular complexes in a 3D data set (tomogram) of a cell is performed by cross-correlating the reconstructed data with a 3D template (template matching). The correlation determines the spatial position (x, y, z) and orientation (three Euler angles) of the molecules. The elaborate computation is performed on a parallel computer. (b) Positions and orientations of the protein complexes proteasome (bright) and thermosome (gray) in a “phantom cell” (lipid vesicle). (c) Projection of the vitrified bacterium Spiroplasma melliferum, (d) slice of the tomogram that shows large protein complexes in the cytoplasm, and (e) image of the correlation function obtained from the tomogram and the 70S ribosome template; bright spots reflect the identified positions and orientations of ribosomes in the 3D data set of the cell. (f) 3D model of the cell containing models of the ribosome at the positions and in the orientations determined by template matching. Image size: 600 nm. Source: parts (c)–(f) courtesy of Julio Ortiz, Martinsried, Germany.

may be averaged after classification (subtomogram averaging). The appeal and the scientific potential of this approach lie in the fact that the complexes remain in their natural environment, in a native and untouched state, and that their functional interactions with surrounding proteins can be visualized. The nuclear pore complexes are one example: they are immobilized by freezing at different stages of cargo translocation. By accumulating many subtomograms and sorting them into the right order it was possible to obtain a “movie” of the translocation process.
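The template-matching and subtomogram-averaging steps described above can be sketched as follows, assuming a translational search only; a real implementation additionally scans the three Euler angles and uses a locally normalized correlation. All names and toy values here are our own illustration:

```python
import numpy as np

def locate_template(volume, template):
    """Translational template matching via FFT cross-correlation.

    Correlation theorem: IFFT(F(v) * conj(F(t))) peaks at the shift where
    the template best matches the (zero-mean) volume.
    """
    v = volume - volume.mean()  # zero-mean so flat background does not dominate
    t_pad = np.zeros_like(v)
    t_pad[:template.shape[0], :template.shape[1], :template.shape[2]] = template
    cc = np.fft.ifftn(np.fft.fftn(v) * np.conj(np.fft.fftn(t_pad))).real
    return tuple(int(i) for i in np.unravel_index(np.argmax(cc), cc.shape))

def subtomogram_average(volume, positions, size):
    """Average equally sized, pre-aligned subvolumes to boost the S/N ratio."""
    subs = [volume[z:z+size, y:y+size, x:x+size] for z, y, x in positions]
    return np.mean(subs, axis=0)

# toy tomogram: several noisy copies of a cubic 'complex' at known offsets
rng = np.random.default_rng(1)
tomo = 0.2 * rng.standard_normal((16, 16, 48))
template = np.ones((3, 3, 3))
true_positions = [(5, 6, 2), (5, 6, 20), (5, 6, 38)]
for z, y, x in true_positions:
    tomo[z:z+3, y:y+3, x:x+3] += 1.0

peak = locate_template(tomo, template)
avg = subtomogram_average(tomo, true_positions, size=3)
print(peak in true_positions)   # True: the strongest hit is a real copy
print(float(avg.mean()) > 0.8)  # True: averaging recovers the density of ~1
```

Averaging N aligned copies improves the S/N ratio by roughly the square root of N, which is why classification into homogeneous groups before averaging is essential in practice.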

19.7 Perspectives of Electron Microscopy

About 70 years after its invention in 1931, electron microscopy opened up new perspectives for molecular structural biology and cytology by establishing advanced cryotechniques, 3D reconstruction, and visualization methods. About a decade later, two technical (r)evolutions marked another “quantum step” in electron microscopy: new electron detectors improved the resolution, and a workable phase plate ameliorated the contrast. Two lines of application delineate the fascinating future of electron microscopic structure research. Single-particle analysis of protein complexes deals with many projections, reaching up to millions. With the use of direct electron detectors it is now realistic to aim for the atomic resolution of complexes, especially of those that cannot be tackled by other methods of structure research. While it is possible to obtain atomic models of rigid protein complexes with fewer than

10⁵ single particles, an even more fascinating perspective is to collect as many projections of flexible complexes in various conformational states as possible, to then classify them, and to sort the different structures into a consecutive order of transformation. The result is a quasi time- or process-resolved series obtained from “four-dimensional” electron microscopy. The first examples were the translocation of tRNA between different binding sites in ribosomes and the visualization of the flexibility of the nuclear pore complex in native nuclei. Investigations at higher resolution showed conformational changes of the 26S proteasome that led to an atomic model of functional transitions. The development of cryo-electron tomography paved the way for 3D models of intact cells or sections thereof in a close-to-life state. Even at a moderate resolution of 3–5 nm we can detect intracellular structures that cannot be observed in conventionally fixed and embedded preparations. This immediately shows the significance of CET for the investigation of pathological effects in cells. The improved detectors, the correction of beam-induced movements, and the impressively enhanced image contrast created by a phase plate mark a further quality step. We can realistically expect to interpret cellular tomograms at the level of about 1 nm in the near future. A single tomogram contains the signatures of hundreds of proteins and supramolecular complexes, that is, a wealth of 3D information on the cellular proteome. We assume that many proteins interact in the cytoplasm and temporarily form supramolecular complexes under the conditions of macromolecular crowding in cells. Such complexes tend to dissociate in dilute environments and cannot be isolated as stable structures. Only cryo-electron tomography enables us to visualize macromolecular aggregates in situ and to investigate the structural network of different macromolecules.
Once the proteins in tomograms are identified and their positions and orientations known, it is possible to dock atomic models into the electron densities and to create a pseudo-atomic map of macromolecular structures and their interactions in individual cells. There is still some way to go until medium-sized protein complexes or even small ones can be unambiguously identified, but we have taken the first steps.

Further Reading

Frank, J. (2006) Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell, 2nd edn, Springer, Berlin.
Frank, J. (2006) Three-Dimensional Electron Microscopy of Macromolecular Assemblies: Visualization of Biological Molecules in their Native State, Oxford University Press, Oxford.
Hawkes, P.W. and Spence, J.C.H. (2007) Science of Microscopy, vol. I, Springer, New York.
Reimer, L. and Kohl, H. (2008) Transmission Electron Microscopy, Springer Series in Optical Sciences, vol. 36, Springer, New York.
Williams, D.B. and Carter, C.B. (2009) Transmission Electron Microscopy: A Textbook for Materials Science, 2nd edn, parts 1 to 4, Springer, New York.

A public domain program for the analysis and processing of microscopic images: NIH, ImageJ: Image Processing and Analysis in Java. http://imagej.nih.gov/ij/

20 Atomic Force Microscopy

Daniel J. Müller1 and K. Tanuj Sapra2

1 ETH Zürich, Biosystems Science and Engineering, Mattenstrasse 26, 4058 Basel, Switzerland
2 Department of Biochemistry, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland

20.1 Introduction

The year 1986 ushered in a new era of imaging and manipulation of hard and soft matter at the nanoscale, turning theoretical (or undreamed of) possibilities into practical realities. The invention of a conceptually simple yet very powerful device, the atomic force microscope (AFM), provided a tool to zoom into the molecular scale, thereby revolutionizing nanotechnology. The last three decades have witnessed great strides in AFM technology; it is now routine to map objects with a resolution of up to a few angstroms (Å), manipulate them with high precision, and at the same time quantify their physical, chemical, and biological properties. The AFM belongs to the family of scanning probe microscopes (SPMs), which utilize a sharp probe as a scanning “stick” and a handle to manipulate (e.g., pick, drop, remove) objects at the nanoscale. SPM techniques rely on specific interactions between the probe and the object; in many cases the interactions can be tailored to a specific sample or application. The detection systems in SPM microscopes include optical signals in scanning near-field optical microscopy (SNOM), tunnel currents in the scanning tunneling microscope (STM), ion currents in the scanning ion conductance microscope (SICM), and magnetic interactions in the magnetic force microscope (MFM). This detection versatility has enabled the development of more than 20 different measurement applications for the scanning probe microscopy of inorganic and organic samples. As explained in Section 20.2, the AFM detects interaction forces between a sharp atomic or molecular probe and the object. A major advantage of the AFM is that sensitive biological samples can be investigated in their natural aqueous milieu under defined conditions, for example, specific pH, ion composition, and temperature, which simulate physiological conditions. The signal-to-noise ratio of the AFM compares very favorably with that of optical and electron microscopes.
However, a thorough understanding of the molecular interactions between the AFM cantilever tip and the sample (Section 20.3) and optimum sample preparation (Section 20.4) are imperative prerequisites for achieving a signal with low noise to enable molecular mapping of soft biological matter at sub-nanometer resolution. Because the energy imparted by a cantilever tip while scanning a surface (Section 20.3) is of the order of thermal energy (≈3.5 kBT), individual biomacromolecules can be scrutinized with high precision in vitro or in situ without compromising their structural and functional integrity. In contrast, photons of wavelength 300 nm possess energies of ≈160 kBT, which is sufficient to break covalent bonds of organic molecules and is detrimental to protein structure. Section 20.5 provides a few examples of sub-structural (high-resolution) AFM imaging of biological cells, single proteins, nucleic acid polymers, and sugar chains under native conditions. The fast recording of imaging sequences allows direct observation of the molecular machineries at work, and the dynamics of the macromolecular complexes in a cell.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Imaging

with AFM extends beyond simple topographic mapping; information on the physical and chemical properties of a system can also be obtained by functionalizing the cantilever probe. Of direct relevance to studying biological mechanisms is the monitoring of various biological interactions, for example, cell–cell adhesion, interaction forces between individual receptor–ligand complexes or chemical groups, or even the folding and unfolding of individual proteins (Section 20.6). In addition, the AFM is gaining importance as an analytical tool, for example, to determine the mechanical properties (flexibility, stability) of biological and synthetic polymers, including biological macromolecules. It is becoming increasingly evident that the AFM offers unprecedentedly high spatial resolution unmatched by any other bioanalytical or biophysical approach. Importantly, the potential of scanning probe microscopy in bioanalytical applications is far from being exhausted.
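The photon-versus-thermal-energy comparison made above is easy to verify numerically:

```python
# Energy scales: a 300 nm photon versus the thermal energy k_B*T at room
# temperature. Constant values are the standard CODATA figures.
h = 6.626e-34     # Planck constant, J s
c = 2.998e8       # speed of light, m s^-1
k_B = 1.381e-23   # Boltzmann constant, J K^-1
T = 298.0         # room temperature, K

photon_energy = h * c / 300e-9     # E = h*c/lambda
ratio = photon_energy / (k_B * T)
print(round(ratio))  # on the order of 160 k_B*T -- ample to break covalent bonds
```

The gentle few-kBT energies imparted by a scanning cantilever tip are thus roughly two orders of magnitude below the photon energies used in UV optical imaging.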

20.2 Principle of the Atomic Force Microscope

Inarguably, the most crucial functional element of the AFM is a sharp probe (or tip) made of silicon or silicon nitride (Si3N4) and engineered at the ventral end of a microscopic spring lever (also known as the cantilever); the whole system is designed to have a low spring constant (k ≈ 0.1 N m⁻¹) (cantilevers with higher spring constants are produced for applications in materials science). For the cantilever to function as a force sensor, the dorsal or top surface of the cantilever is usually metal-coated to facilitate and maximize reflection of the laser beam onto a quadrant photodiode (Figure 20.1). The quadrants of the photodetector register a voltage difference, which to a first approximation depends linearly on the applied force. The inclusion of a piezoelectric transducer enables raster scanning of the tip, which can inspect each defined point of the biological surface while detecting the interaction forces at those points (Figure 20.1). The cantilever acts as a force-amplifying arm, that is, upon sensing a force the cantilever deflects and the signal is registered on the position-sensitive photodetector. The interaction forces consist of repulsive and attractive contributions, and are typically in the range 0.01–100 nN (Table 20.1).

Figure 20.1 Schematic representation of the atomic force microscope. An immobilized sample is moved by means of a triaxial (x, y, z) piezoelectric element under a sharp scanning cantilever probe. During this raster movement, the bending of the cantilever spring is measured by a laser beam reflected onto a photodetector. The voltage difference between the upper and lower segments of the photodetector (V = (A + B) – (C + D)) is a direct measure of the cantilever spring deflection, which is used to calibrate the spring constant and quantify the minuscule interaction forces.

Table 20.1 Forces between the AFM tip and object.

Force                         Direction of force      Range
Pauli repulsion               Repulsive               Extremely short (0.2 nm)
Van der Waals interactions    Attractive              Very short (a few nm)
Electrostatic interactions    Attractive/repulsive    Short (nm to μm)
Capillary (probe in water)    Attractive              Long (μm to mm)
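As a numerical illustration of how the photodetector signal translates into force via Hooke's law: the optical-lever sensitivity used below is a hypothetical value; in practice it must be calibrated for every cantilever.

```python
# Cantilever force from measured deflection (Hooke's law F = k*x), with the
# photodiode voltage difference converted via an assumed, hypothetical
# optical-lever sensitivity.
k = 0.1              # spring constant, N/m (soft lever for biological imaging)
sensitivity = 50e-9  # deflection per volt, m/V -- calibration-dependent value

def force_from_voltage(delta_v):
    """F = k * x, with x = sensitivity * (photodiode voltage difference)."""
    deflection = sensitivity * delta_v
    return k * deflection

# a 0.02 V signal then corresponds to 1 nm deflection
f = force_from_voltage(0.02)
print(f)  # ~1e-10 N = 0.1 nN, the gentle imaging force quoted in the text
```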

Several scanning methods have been developed for scanning force microscopy. In the most commonly used scanning method – the contact mode – the cantilever tip is maintained in contact with the sample surface at a constant user-defined force. This is achieved by keeping the cantilever deflection (i.e., the force) constant by continuously monitoring and adjusting the distance between the sample surface and the cantilever tip. To ensure the integrity of the biological samples and that the surface structures are not irreversibly deformed, a scanning force of about 0.1 nN is generally used (see Section 20.3). By recording the height required to keep the contact force constant, the surface features of the sample are mapped point by point, resulting in a surface topography (Figure 20.2a). Besides forces normal to the plane of the sample surface, lateral forces can deform or scratch away a soft biological object from its base during raster scanning in contact mode (Figure 20.2c). These undesirable sample distortions by normal and lateral interactions can be minimized by setting low contact forces (see Section 20.3). Dynamic imaging methods (e.g., Tapping™ mode) offer an alternative approach to reducing the impact of the scanning tip on the object. In the Tapping™ mode, the cantilever spring is driven in a sine-wave oscillation close to its natural resonance frequency. The oscillation ensures that the tip touches the biological object only at the lower end of each cycle (Figure 20.2b), thereby minimizing the contact time and lateral interactions. Dynamic imaging procedures are therefore particularly suited for imaging weakly immobilized biological objects (e.g., cells or fibrillar structures). A crucial criterion for obtaining an unperturbed topography is to maintain the oscillation amplitude of the cantilever by controlling the distance between the cantilever tip and the sample surface.
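The constant-force feedback described for contact mode can be caricatured as a simple integral controller. This is a toy model with invented gain, setpoint, and contact law, not a real AFM controller:

```python
# Toy constant-force contact mode: at each pixel the z-piezo is adjusted
# until the (modeled) cantilever deflection equals the setpoint; the
# recorded z-values then trace the surface topography.
def scan_line(surface, setpoint=1.0, gain=0.5, iters=50):
    """Track a 1D surface profile; returns the recorded z-piezo topography."""
    z = 0.0
    topo = []
    for h in surface:
        for _ in range(iters):          # feedback iterations per pixel
            deflection = h - z          # toy contact model
            z += gain * (deflection - setpoint)
        topo.append(z)
    return topo

surface = [0.0, 0.0, 2.0, 2.0, 5.0, 5.0, 1.0]
topo = scan_line(surface)
# the recorded heights track the surface, offset by the constant setpoint
print([round(t, 3) for t in topo])  # → [-1.0, -1.0, 1.0, 1.0, 4.0, 4.0, 0.0]
```

The same loop structure, with the amplitude error replacing the deflection error, describes the feedback used in dynamic (Tapping) mode.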

Figure 20.2 Imaging with atomic force microscopy. (a) In the contact mode, the tip is in continuous contact with the biological sample and, therefore, follows the surface features of the sample even upon encountering an obstacle, resulting in deflection of the cantilever spring. To protect the sample from excess force, the contact force (measured by the cantilever deflection) is kept constant by changing the z-position of the sample with the aid of a feedback loop. The z-regulation signal is used to map the sample topography. (b) In the dynamic mode, a cantilever is oscillated close to its natural resonance frequency with an amplitude of a few nanometers. Thus, the object is touched only briefly during the downward travel of the cantilever, avoiding lateral interactions. The feedback control loop in this case is used to maintain the oscillation amplitude. (c) Lateral scanning of the AFM tip can deform a soft biological object. This can be prevented by precise adjustment of the contact force, the scan speed, and the feedback parameters. (d) Elasticity is determined by moving the cantilever tip against a soft object (e.g., cells). The ratio of the distance travelled by the tip into the sample (the indentation) to the associated cantilever deflection gives a measure of the sample elasticity.

Topography: Three-dimensional representation of a surface.

20.3 Interaction between Tip and Sample

Akin to its siblings of the SPM family, the AFM can deliver resolution up to atomic dimensions. Although the achievable resolution depends crucially on the tip sharpness and the surface corrugation of the object, the finite radius of the cantilever tip limits the contouring of sharp and delicate features of an object. As shown in Figure 20.3a, the tip can broaden the lateral dimensions of an object. In contrast, structural periodicities with less defined edges can be reproduced with higher accuracy (Figure 20.3b). In all these cases, the measured topography represents a nonlinear superposition of the tip and the object under investigation, the prominence of the resolved details depending on the corrugation of the object and the tip dimensions. Owing to the considerably soft nature of biological objects compared to the cantilever spring, contact forces of ~1 nN exerted by a cantilever tip are sufficient to deform and denature proteins. As explained above (Section 20.2), minimizing vertical and lateral physical forces is key to scrutinizing biological samples reliably at high resolution. It is therefore of utmost importance to understand the different interaction mechanisms between the tip and the surface. Consequently, the selection of suitable cantilevers and imaging procedures, and the optimization of regulating parameters (e.g., feedback loop, scanning

Figure 20.3 Superposition effects between the tip and the sample can distort the AFM topography. (a) Owing to its finite diameter (~10 nm in most cases), a cantilever tip cannot trace very sharp edges. Therefore, the imaged topography (gray line) is a superposition of the tip and the sample surface. (b) In many cases, however, the periodicity of the biological structures can be resolved correctly irrespective of the tip diameter.
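The tip-broadening effect sketched in Figure 20.3 corresponds mathematically to a morphological dilation of the surface by the tip shape. A minimal 1D sketch with toy profiles of our own choosing:

```python
import numpy as np

def imaged_profile(surface, tip):
    """Tip-sample dilation: the recorded height at each point is the highest
    contact the tip (apex at 0, flanks negative) makes with the surface."""
    r = len(tip) // 2
    padded = np.pad(surface, r, constant_values=0.0)
    out = np.empty_like(surface, dtype=float)
    for i in range(len(surface)):
        window = padded[i:i + 2 * r + 1]
        out[i] = np.max(window + tip)   # tip apex height at closest contact
    return out

surface = np.zeros(15); surface[7] = 3.0        # a sharp spike
tip = np.array([-2.0, -0.5, 0.0, -0.5, -2.0])   # blunt, roughly parabolic tip
img = imaged_profile(surface, tip)
print(img)  # the one-pixel spike appears broadened to the tip's width
```

The spike's height is reproduced correctly, but its apparent lateral width grows to that of the tip, which is exactly the superposition artifact described in the caption.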

Corrugation: Waviness of a surface.

speed, image size) are key experimental parameters that require meticulous tuning. For example, the type of ions (i.e., monovalent or divalent) and their concentrations play a decisive role in increasing the resolution of a map; they are to be selected such that a repulsive (~0.05 nN) and long-range (several nm) interaction is generated between the tip and the object. This strategy increases the contact force only marginally, with just a small fraction acting locally on the protein structures, thus preventing sample deformation in most cases. However, this tuning of the interactions is associated with an increase in the gross force of the coupled cantilever–sample system, further suppressing the natural thermal resonance of the cantilever spring (Section 20.5). Advantageously, this reduction in the thermal noise increases the signal-to-noise ratio, resulting in a high resolution of the AFM topography.

20.4 Preparation Procedures

Because the general working principle of the AFM is based on sensing and detecting molecular interaction forces, it is not necessary to metal-coat or fluorescently label biological macromolecules or cells to identify them. The most important and often the only sample preparation step required is to immobilize the biological sample on an atomically flat support. This is a mandatory prerequisite for high-resolution imaging because the corrugation of an atomically flat surface hardly superimposes on the morphology of the soft biological sample and enables precise spatial control of the cantilever tip on the specimen. Sample preparation therefore requires finding a balance between the need to anchor the biological object and the need to minimize the interactions between the object and the support surface, to ensure the mapping of an unaltered, native state (conformation) of the sample (protein). Sample supports used in optical and electron microscopy are also utilized for AFM applications. Examples include glass, muscovite mica, graphite, and metal (such as gold-coated) surfaces. Based on the desired application, each of these sample carriers can be tuned to have unique physical and chemical properties (e.g., surface charge or roughness). For example, mica, which is characterized by a layered crystalline structure, is ideally suited for the immobilization of proteins and nucleic acids. An adhesive tape can be used to peel off crystalline layers from the underlying surface to provide a relatively chemically inert, negatively charged, and atomically flat surface. The most commonly applied immobilization strategies are based either on physical interactions between the biological sample and a chemically inert surface or on the covalent attachment of the sample to a functionalized (reactive) support.
Physical adhesion is mainly achieved by shielding the repulsive electrostatic interactions between the biological object and the sample carrier. Thus, increasing the ion concentration leads to an attractive interaction between the sample and the support surface, which is sufficient to immobilize the biomolecules. For this purpose, the buffer solution containing the biological sample is applied directly onto a freshly split mica surface, the macromolecules allowed to adsorb firmly for a few minutes, and their surfaces mapped with a cantilever tip. For chemical coupling, the sample support is first functionalized with a chemical moiety, for example, glass with silanes or gold with thioalkanes, and in a final step the biomolecules are reacted with the chemically active surface. Because biological structures require water molecules for their structure–function dynamics, drying biological samples and imaging in air should be avoided as far as possible. During the drying process, biological molecules are exposed to tensile forces generated by the surface tension of water often leading to the collapse of the samples. Irrespective of the fact that drying artifacts can be minimized or avoided by certain vacuum sublimation procedures, biological samples, whenever possible, should be prepared and imaged in aqueous solutions to preserve their native structural and functional integrity.

20.5 Mapping Biological Macromolecules

Using the AFM, surface structures and dynamic processes of a range of different biological samples can be observed under native conditions. These include biological macropolymers (e.g., proteins, DNA, RNA, and polysaccharides), supramolecular complexes (e.g., metaphase

chromosomes), as well as bacterial or cellular associations, and tissues of higher organisms. Currently, the highest resolution is achievable on isolated macromolecules immobilized on an atomically flat surface. The technique is capable of providing high-resolution topographies of individual membrane proteins in their native lipid bilayer environment, while at the same time the resolution is high enough to monitor and discern subunit dynamics such as the helix and polypeptide loop movements associated with the opening and closing of membrane channels. Figure 20.4 shows the cytoplasmic surface of the purple membrane of the archaebacterium Halobacterium salinarum. This purple membrane consists of lipids and the seven-transmembrane α-helical protein bacteriorhodopsin, a light-driven proton pump. The AFM topography of the native membrane surface clearly shows the natural crystalline arrangement of bacteriorhodopsin – individual bacteriorhodopsin molecules assemble as trimers, which further form a two-dimensional hexagonal lattice. The close packing allows the maximum density of proton pumps for light-driven energy production. The AFM topography clearly portrays the molecular variability (polypeptide loops and termini) of each bacteriorhodopsin assembly. Nevertheless, the statistical average of the individual particles reveals a representative structural snapshot of bacteriorhodopsin, and the calculated standard deviation is a measure of the conformational variation of the population. In addition, it is useful to compare the averaged AFM topography with the structural information obtained from complementary structural biology methods (e.g., X-ray crystallography, electron microscopy, or NMR). For example, superimposing the mean contour from AFM on the three-dimensional X-ray structure makes it possible to assign the surface details to secondary structures, and to further study the structural dynamics of the protein under different conditions.
The AFM is also used to characterize the stoichiometry and formation of membrane proteins in functional complexes. Figure 20.5a shows the Na+-ion-driven rotors of the FoF1-ATP synthase from Ilyobacter tartaricus; the number and arrangement of the subunits of the functional rotor can be unambiguously recognized. This high-resolution tool provided a straightforward way to show that the cation-fueled (H+ or Na+) FoF1-ATP synthases from different organisms are constructed of different numbers of subunits. An attractive although debatable speculation is whether the different ATP synthases evolved to adapt the number of rotor subunits to the membrane potential of the cell and thus regulate ATP synthesis according to the cell's energy consumption. Thus, elucidating the cellular mechanism that controls rotor stoichiometry will advance our understanding of ATP synthesis. AFM is also applied to observing dynamic processes such as DNA transcription, protein diffusion, conformational changes of individual proteins, and even the emergence of molecular networks. The time required to record an AFM topograph is a decisive factor in determining the temporal resolution of the dynamic process. Depending on the system being investigated and the AFM setup, the recording time is between 1 s and 15 min. The so-called gap junctions, the communication channels of rat liver cells, are shown in Figure 20.5b and c. The individual connexins of the gap junctions show a nearly perfect hexagonal packing. The central pores of the individual hexamers are clearly discernible in the unprocessed topographs. In the presence of a signaling molecule such as calcium in the buffer solution, reversible closing of the channels can be directly observed (Figure 20.5c). The average surface structures of the channels


Figure 20.4 Cytoplasmic surface of the native purple membrane of Halobacterium salinarum. (a) The AFM topograph clearly reveals the assembly of single bacteriorhodopsin molecules into trimers; the trimeric units are further organized in a hexagonal lattice. The membrane topography was recorded in a physiological buffer at room temperature. (b) The diffraction pattern of the topograph extends up to the 11th order (drawn circles), which corresponds to a lateral resolution of 0.49 nm. (c) The average topography (top) and the corresponding standard deviation (bottom) of the bacteriorhodopsin trimer allow correlation with structural data obtained by electron crystallography. Superimposed are the outline of the bacteriorhodopsin molecules as well as the positions of the seven transmembrane α-helices (A–G).

X-Ray Structure Analysis, Chapter 21
Electron Microscopy, Chapter 19
NMR Spectroscopy of Biomolecules, Chapter 18


Part II: 3D Structure Determination

Figure 20.5 Determining protein complex assembly and function by AFM. (a) Na+-driven rotors of the FoF1-ATP synthase from Ilyobacter tartaricus; the high-resolution topograph allows characterization of the stoichiometry and arrangement of the subunits of a functional rotor. (b) The extracellular surface of purified communication channels (gap junctions) from rat liver epithelial cells clearly shows hexameric proteins with an open central channel. The average and the standard deviation (SD) give insight into the structure and its conformational flexibility: while the average profile reveals the channel entrance, the SD map assigns an increased structural flexibility to the central channel. (c) In the presence of 0.5 mM Ca2+ (at neutral pH) the central channel of the hexamer is closed. This mechanism is evident in the average hexamer structure, and the associated SD map shows a loss of flexibility of the channel entrance in the closed state. All topographs were recorded in a physiological buffer at room temperature.

and their standard deviations provide insight into the structural variability and map the flexible regions of the protein. An open communication channel shows structural variability and flexibility; these characteristics are lost upon channel closure. This observed relationship between flexibility and the functional conformational change of the channel is not an exception, and has been found previously for other proteins. Thus, it can be generalized to a large extent that the flexibility of structural features often correlates with their ability to perform functionally related conformational changes. The advent of ultrafast AFM, capable of recording several hundred images per second, is a major breakthrough in AFM development and will allow real-time observation of various dynamic processes under native conditions.

In contrast to the sub-nanometer imaging of individual molecules, a maximum resolution of about 50 nm is attained when imaging cells and tissues by AFM. The lower resolution is partly owing to the flexibility and dynamic motion of living cells, but also to the significant roughness of the cell surface. Although the AFM still provides important physical insight into cellular function, the low resolution of cell surface imaging hinders identification of the observed structures on cells. To overcome this shortcoming, AFMs are often combined with modern light microscopy techniques. The combination makes it possible to correlate the topographic information of a cell with its surface-associated structures (e.g., vesicles or the cytoskeleton, which need to be fluorescently labeled for simultaneous AFM and optical microscopy).

In addition to the determination of structural features, the AFM permits the determination of various other physical parameters, for example, the elasticity of a biological object. The cantilever deflects when the tip is pushed against a sample; this signal is analyzed to characterize the elastic properties of the sample.
Importantly, an unambiguous interpretation of cell elasticity maps requires the identification of the different structural components that contribute to the mechanical stability of the cell. From a morphological standpoint, the cytoskeleton plays a crucial role here and can be directly traced by tip-induced deformation of the flexible cell membrane. Alternatively, it is also useful to analyze the phase changes of an oscillating cantilever during dynamic mapping of a cell. The phase modulation reflects changes in the elastic properties, charge, and roughness of the cell surface.
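The conversion from measured cantilever deflection to force is simply Hooke's law, F = k·x. A minimal sketch with typical, assumed parameter values (the function name and numbers are illustrative) shows why the AFM reaches pN sensitivity:

```python
def deflection_to_force_pN(spring_constant_N_per_m: float, deflection_nm: float) -> float:
    """Hooke's law F = k * x, with x in nanometers, returned in piconewtons."""
    force_N = spring_constant_N_per_m * deflection_nm * 1e-9
    return force_N * 1e12

# A soft cantilever (k = 0.06 N/m, a typical value for biological samples)
# deflected by 1 nm corresponds to a force of about 60 pN.
print(deflection_to_force_pN(0.06, 1.0))
```

With sub-angstrom deflection detection, forces well below 10 pN become measurable, which is the regime of the single-molecule experiments described in Section 20.6.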

20.6 Force Spectroscopy of Single Molecules

The ability of the AFM to detect forces with pN sensitivity helps to characterize the strength of biological and chemical bonds, and the behavior of individual molecules under mechanical stress. An attractive and straightforward application is the measurement of interaction forces between ligands and receptors. This is achieved by first functionalizing the cantilever tip with the protein or the small-molecule ligand, and then using this probe to interrogate the binding partner on a sample support. Using this approach it has been possible to detect specific binding and quantify molecular interactions

20 Atomic Force Microscopy


Table 20.2 Estimated rupture forces of chemical and biological bonds.

Bond                     Rupture force (pN)
Covalent                 1000–2000
Cell–cell interaction    500
Biotin–avidin            200
Antibody–antigen

OH3 > OH4. HPAEC-PAD does not require any derivatization of the monosaccharides and is thus simpler than, and superior to, other methods of monosaccharide analysis (Figure 23.19). For mammalian cell glycoproteins, acid hydrolysis in 2 N trifluoroacetic acid (4 h, 100 °C) with subsequent detection and quantification of the monosaccharides via HPAEC-PAD has proven optimal. Trifluoroacetic acid (TFA) has, compared to HCl or H2SO4, the advantage that it is volatile and is removed during lyophilization or in a SpeedVac. Under the given acid hydrolysis conditions, N-acetylated sugars (such as GlcNAc or GalNAc) are N-deacetylated and detected as the corresponding amino sugars (GlcNH2 or GalNH2). Since N-acetylneuraminic acid decomposes under these hydrolysis conditions, Neu5Ac must be released in a separate, milder hydrolysis or enzymatically (e.g., by means of neuraminidase (sialidase)) and determined separately (see below).

As is obvious from the N-glycan structures (Figures 23.9–23.11), defined monosaccharide molar ratios follow from the regular biantennary, triantennary, and tetraantennary N-glycans with complete sialylation. The same applies to the high-mannose type N-glycans (Table 23.4). In some cases the molar GlcNAc/mannose ratio may even allow a conclusion about the actual structure type: a ratio of GlcNAc/Man < 0.5 generally points towards the presence of high-mannose type N-glycans, whereas at a ratio between 1 and 2 complex type N-glycans may predominantly (or exclusively) be expected. For rhEPO the quotient GlcNAc/Man = 2.0 indicates the presence of complex type N-glycans, among which a high proportion of tetraantennary N-glycans may be expected. The finding of 9 moles of mannose per mole of protein points towards the presence of three glycosylation sites, as complex type N-glycan structures always carry three mannose units per N-glycan. Likewise, the GlcNAc content (18 moles per mole) and the Gal content (14 moles per mole) indicate the presence of three tetraantennary N-glycans.
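The rule of thumb above can be written out explicitly. The thresholds and the rhEPO numbers (18 mol GlcNAc, 9 mol Man) are taken from the text; the function itself is only an illustrative sketch:

```python
def classify_n_glycans(glcnac: float, man: float) -> str:
    """Rough classification of the dominant N-glycan type from the
    molar GlcNAc/mannose ratio (rule of thumb from monosaccharide analysis)."""
    ratio = glcnac / man
    if ratio < 0.5:
        return "high-mannose type"
    if 1.0 <= ratio <= 2.0:
        return "complex type"
    return "ambiguous / mixed"

# rhEPO: 18 mol GlcNAc and 9 mol Man per mole of protein (values from the text)
print(classify_n_glycans(18, 9))   # complex type (ratio 2.0)
print(classify_n_glycans(2, 9))    # a Man9 high-mannose glycan (ratio 0.22)

# Three mannose units per complex-type N-glycan:
# 9 mol Man per mole of protein implies three glycosylation sites.
print(9 // 3)
```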


23 Carbohydrate Analysis

Table 23.4 Molar ratios of monosaccharide components of N-glycans.

                      Fuc   GalNAc   GlcNAc   Gal   Man   Neu5Ac   GlcNAc/Man   Neu5Ac/Gal   Man/Fuc
Complex type
  Tetraantennary      0–1   –        6        4     3     4        2.00         1.00
  Triantennary        0–1   –        5        3     3     3        1.67         1.00
  Biantennary         0–1   –        4        2     3     2        1.33         1.00
High-mannose type
  Man5                –     –        2        –     5     –        0.40
  Man6                –     –        2        –     6     –        0.33
  Man7                –     –        2        –     7     –        0.29
  Man8                –     –        2        –     8     –        0.25
  Man9                –     –        2        –     9     –        0.22
Erythropoietin        3     0.8      18       14    9     11       2.0          0.8          3.0

The fact that approximately 1 mole of GalNAc per mole is detected for rhEPO indicates the presence of a single O-linked glycosylation site. Accordingly, 1 mole of galactose per mole should be attributed to the O-glycan (Section 23.2.2), which means that 13 moles of galactose per mole remain for the three tetraantennary N-glycans. The detection of GalNAc generally indicates the presence of O-glycans, because GalNAc is only in exceptional cases part of N-glycans. In the case of a positive GalNAc finding, the presence of an O-glycan should be confirmed by another method. Conversely, a negative GalNAc result always proves the lack of any O-glycosylation.

Analysis of Sialic Acid

Assessment of the sialylation status is of great interest, especially for biopharmaceutical glycoproteins such as EPO, because it determines the half-life of the glycoprotein in the blood circulation and thus its biological activity. Incompletely sialylated glycoproteins are recognized by receptors in the liver and removed from the blood circulation (clearance), so their biological activity is reduced. The production of therapeutic glycoproteins therefore demands confirmation of intact and consistent glycosylation and sialylation from batch to batch for reasons of pharmacological safety.

In the thiobarbiturate method according to Warren (the so-called Warren test) the sialic acid residues are released from the glycoprotein by acid hydrolysis, oxidized by periodate, and stained by the addition of thiobarbituric acid. The color may likewise be generated by the reaction of neuraminic acid with resorcinol. The resulting colored complex is extracted from the aqueous medium with cyclohexanone or n-butanol/butyl acetate (1 : 5). In both cases the result is quantified via sialic acid standards carried through the reaction. The absorption of the respective colored complex, measured at 549 or 580 nm, respectively, is proportional to the sialic acid content of the sample over a certain concentration range.

Nowadays, these two staining methods have been largely replaced by the simpler and clearer analysis via HPAEC-PAD (see above). The bound sialic acid residues are liberated from the glycoprotein by mild acid hydrolysis (e.g., in 0.1 N H2SO4 or 2 N acetic acid for 0.5–1 h at 80 °C), or by incubation with the enzyme neuraminidase. The actual analysis is then performed via HPAEC-PAD, which enables not only exact quantification using a calibration curve but also a simple differentiation between N-acetylneuraminic acid (Neu5Ac) and N-glycolylneuraminic acid (Neu5Gc) (structural formulas in Figure 23.3). Compared to the N-acetyl group, the N-glycolyl group carries an additional OH group that is acidified by the adjacent CO group. This gives Neu5Gc an enhanced interaction with the anion-exchange matrix and thus a greatly increased retention time. Neu5Gc does not occur in natural human glycoproteins, but it does, for example, in bovine glycoproteins or in monoclonal antibodies derived from mouse cell lines. The Neu5Gc content of human glycoproteins recombinantly expressed in animal cells is very low (in the case of rhEPO (CHO) less than 3%); nevertheless, for biopharmaceutical glycoproteins it is an important quality parameter that needs to be controlled from batch to batch. Under the alkaline conditions of routine HPAEC-PAD, O-acetylated neuraminic acid residues lose their O-acetyl groups (ester hydrolysis). Their detection therefore requires neutral or weakly acidic eluents or (better) mass spectrometric analysis of the corresponding glycopeptides (an example is given below in Figure 23.22c).

The results of a monosaccharide component analysis should not be overinterpreted! In general, the situation will be much more complicated, and usually no information on the nature of the oligosaccharide side chains is obtained from monosaccharide composition analysis.
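Quantification via HPAEC-PAD rests on a calibration curve recorded from sialic acid standards. A minimal sketch of such a linear calibration, with synthetic peak areas and a hypothetical sample (a NumPy least-squares fit; all numbers are illustrative):

```python
import numpy as np

# Hypothetical Neu5Ac standards (pmol injected) and PAD peak areas (arbitrary units)
standards_pmol = np.array([10.0, 25.0, 50.0, 100.0, 200.0])
peak_areas = np.array([0.52, 1.24, 2.49, 5.03, 9.98])  # synthetic, roughly linear

# Least-squares calibration line: area = slope * amount + intercept
slope, intercept = np.polyfit(standards_pmol, peak_areas, 1)

def quantify(area: float) -> float:
    """Convert a sample peak area into pmol Neu5Ac via the calibration line."""
    return (area - intercept) / slope

sample_area = 3.1
print(f"sample contains about {quantify(sample_area):.0f} pmol Neu5Ac")
```

In practice the fit is only valid inside the linear range of the detector, which is why the calibration standards should bracket the expected sample amounts.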


Part III: Peptides, Carbohydrates, and Lipids

23.3.2 Mass Spectrometric Analysis on the Basis of Glycopeptides

Mass Spectrometry, Chapter 15

Figure 23.20 Amino acid sequence of EPO with the GluC cleavage products E1–E13 (N-glycosylation at Asn24, Asn38, and Asn83; O-glycosylation at Ser126; disulfide bond in the two peptides E1 [(1–8)S-S(160–165)] and E5).

Figure 23.21 Total ion current (negative ion mode) of the (glyco)peptides of an endo-GluC digest of EPO (CHO) separated by means of RP-HPLC/ESI-MS. The peaks are detected as singly or doubly charged species. Peaks c, d, g, and i correspond to the glycopeptide fragments of the four glycosylation sites; the observed peak broadening or peak splitting results from the microheterogeneity of glycosylation. Source: according to Kawasaki, N. et al. (2000) Anal. Biochem., 285, 82–91. With permission, Copyright © 2000 Academic Press.

Direct mass spectrometric characterization of a glycoprotein is complicated by the microheterogeneity of glycosylation: multiply and heterogeneously glycosylated proteins (such as EPO) give a broad, unresolved mass peak. On the basis of glycopeptide fragments, however, mass spectrometric methods (MALDI-MS, Section 15.1; ESI-MS, Section 15.2) in combination with HPLC are indispensable. They enable quick insight into the glycan composition per glycosylation site (microheterogeneity) and provide important sequence information via fragmentation experiments (MS/MS) (Section 23.3.4). For this purpose, the glycoprotein to be analyzed is digested with an appropriate protease (e.g., trypsin, chymotrypsin, endoproteinase LysC, or endoproteinase GluC; cf. Section 9.5.1). Direct characterization of the (glyco)peptide pool via MS, however, is limited by a masking effect (ion suppression) that the peptide signals exert on the glycopeptide signals. Therefore, the (glyco)peptide pool is usually separated via RP-HPLC and the individual peaks are analyzed by mass spectrometry, either online (LC/MS) or, after fractionation, offline (MALDI-MS, ESI-MS). MS analysis of glycopeptides plays a particularly important role in the characterization of O-glycans, since the free O-glycans are not readily accessible as such (Section 23.2.2). Trypsin cleaves peptide bonds C-terminal to lysine (K) and arginine (R); if EPO is digested with trypsin, two glycosylation sites (Asn24 and Asn38) remain on one and the same fragment (T5, amino acid residues 21–45). Endoproteinase GluC cleaves peptide bonds C-terminal to glutamate (E) and occasionally aspartate (D); if EPO is digested with endo-GluC, the three N-glycosylation sites and the O-glycosylation site are located on four different fragments (E5, E6, E10, and E12, Figure 23.20).
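The underlying mass bookkeeping can be sketched in a few lines: the average mass of an unmodified peptide follows from standard average residue masses, and a measured mass that cannot be matched to any bare peptide hints at a modification such as glycosylation. The residue masses below are standard average values; the fragment sequence and tolerance are hypothetical, not the actual EPO GluC fragments:

```python
# Average residue masses (Da) of the amino acids (standard values)
AVG_MASS = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.10, 'C': 103.14, 'L': 113.16, 'I': 113.16, 'N': 114.10,
    'D': 115.09, 'Q': 128.13, 'K': 128.17, 'E': 129.12, 'M': 131.19,
    'H': 137.14, 'F': 147.18, 'R': 156.19, 'Y': 163.18, 'W': 186.21,
}
WATER = 18.02  # mass of H2O accounting for the free peptide termini

def peptide_mass(sequence: str) -> float:
    """Average mass of an unmodified peptide."""
    return sum(AVG_MASS[aa] for aa in sequence) + WATER

def is_plain_peptide(observed_mass: float, sequence: str, tol: float = 1.0) -> bool:
    """True if the observed mass matches the bare peptide; a non-matching
    (larger) mass hints at a modification such as glycosylation."""
    return abs(observed_mass - peptide_mass(sequence)) <= tol

# Hypothetical GluC-type fragment ending in glutamate (E)
frag = "APRLICDSE"
print(round(peptide_mass(frag), 1))
print(is_plain_peptide(peptide_mass(frag) + 2204.0, frag))  # glycan-sized shift -> False
```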
The resulting 13 endo-GluC fragments can be separated via RP-HPLC (Figure 23.21) and allow the determination of the site-specific glycosylation (glycosylation per glycosylation site). Mass spectrometric analysis of the (glyco)peptides separated via RP-HPLC (see the example of EPO, Figure 23.21) allows identification of the various glycosylation sites: the bare peptides can be assigned by comparing the experimental masses with the masses calculated from the respective amino acid sequence, while the non-assignable peptide masses point to an existing glycosylation (see the endo-GluC digest of EPO, Table 23.5). The mass spectra of the glycopeptides in particular enable us to deduce the glycosylation per glycosylation site, which allows, via the detected molecular masses, assignment of the site-specific microheterogeneity (see the example of EPO, Figure 23.22). Thus, the mass analysis of



Table 23.5 Theoretical and experimental masses of EPO GluC peptides; fragments E5∗ (Asn24), E6∗ (Asn38), E10∗ (Asn83), and E12∗ (Ser126) are glycosylated. Source: according to Kawasaki, N. et al. (2000) Anal. Biochem., 285, 82–91.

Peak in       Glu-C peptide   Amino acid residue     Average             m/z       m/z
Figure 23.21  number          (cf. Figure 23.20)     theoretical mass    (1-)      (2-)
A             E8              56–62                  729.4               728.4
B             E2              9–13                   602.4               601.5
C             E6∗             38–43
D             E5∗             22–37
E             E1              (1–8)S-S(160–165)      1502.7              1502.1    750.6
F             E3              14–18                  692.1               691.5
G             E12∗            108–136
H             E7              44–55                  1571.8              1571.1
I             E10∗            73–96
J             E9              63–72                  1114.6              1113.9
K             E13             137–159                2835.6                        1417.0
L             E11             97–107                 2211.3              2210.6

Figure 23.22 ESI-MS mass spectra of the four glycopeptides of the GluC digestion of EPO (CHO) from Figure 23.21: peak c (a), peak d (b), peak g (c), and peak i (d). The assigned N-glycan structures are shown in Table 23.6. Source: according to Kawasaki, N. et al. (2000) Anal. Biochem., 285, 82–91. With permission, Copyright © 2000 Academic Press.


Table 23.6 Theoretical and experimental masses of EPO GluC glycopeptides E5 (Asn24), E6 (Asn38), and E10 (Asn83) (peaks c, d, and i of Figure 23.21). Source: according to Kawasaki, N. et al. (2000) Anal. Biochem., 285, 82–91.

Ion in        N-Glycan structure                   Average             m/z       m/z
Figure 23.21  (each with core-fucose)              theoretical mass    (2-)      (3-)

c: E6 (38–43), Asn38
  c1          BiLac1NA2, TriNA2                    3375.2              1686.8
  c3          TriNA3                               3666.5              1831.8
  c4          BiLac2NA2, TriLac1NA2, TetraNA2      3740.6              1869.1
  c6          TriLac1NA3, TetraNA3                 4031.8              2014.5
  c8          TetraNA4                             4323.1              2160.2
  c10         TetraLac1NA4                         4688.4              2343.2    1561.8

d: E5 (22–37), Asn24
  d1          BiNA1                                3750.7              1875.1
  d2          BiNA2                                4042.0              2020.3    1346.2
  d4          BiLac1NA2, TriNA2                    4407.3              2202.7    1468.2
  d6          TriNA3                               4698.6              2348.7    1565.6
  d7          BiLac2NA2, TriLac1NA2, TetraNA2      4772.6              2385.7    1590.7
  d8          TriLac1NA3, TetraNA3                 5063.9                        1687.3
  d10         TetraNA4                             5355.2                        1784.5
  d13         TetraLac1NA4                         5720.5                        1905.7

i: E10 (73–96), Asn83
  i1          BiLac1NA2, TriNA2                    5388.5                        1795.1
  i3          TriNA3                               5679.8                        1892.2
  i4          BiLac2NA2, TriLac1NA2, TetraNA2      5753.9                        1917.6
  i5          TriLac1NA3, TetraNA3                 6045.1                        2014.2
  i7          TetraNA4                             6336.4                        2111.1
  i10         TetraLac1NA4                         6701.7                        2233.4
  i13         TetraLac2NA4                         7067.1                        2355.8

HPLC peak c of Figure 23.21 (GluC cleavage product E6 of EPO, glycosylated at Asn38) reveals 11, peak d (E5 with Asn24) 13, and peak i (E10 with Asn83) 14 different glycostructures, the major masses of which are shown in Table 23.6. Individual ion peaks may, however, cover different glycan isomers that cannot be distinguished in this glycopeptide analysis, but can only be assigned on the basis of the free N-glycans (cf. Section 23.3.5). Only for peak g (E12 with O-glycosylation at Ser126) do the m/z values refer to the structures NeuAc-Gal-GalNAc and NeuAc2-Gal-GalNAc (Figure 23.22c).
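The assignment logic can be pushed one step further: an observed free-glycan mass (or a glycopeptide-minus-peptide mass difference, corrected for the glycosidic linkage) can be matched against candidate monosaccharide compositions. A sketch with standard average residue masses; the target mass below corresponds to a core-fucosylated, disialylated biantennary structure, at most one core fucose is assumed, and the function itself is only illustrative:

```python
from itertools import product

# Average residue masses (Da) of monosaccharides in glycans (standard values)
RESIDUE = {'Hex': 162.14, 'HexNAc': 203.20, 'dHex': 146.14, 'Neu5Ac': 291.26}
WATER = 18.02  # one water for the reducing end of the free glycan

def composition_candidates(glycan_mass, tol=1.0, max_count=8):
    """Enumerate monosaccharide compositions whose summed residue masses
    (plus one water) match the observed free-glycan mass within tolerance.
    At most one deoxyhexose (core fucose) is allowed, as is typical for
    mammalian N-glycans."""
    hits = []
    for hex_, hexnac, dhex, neu in product(range(max_count + 1),
                                           range(max_count + 1),
                                           (0, 1),
                                           range(max_count + 1)):
        mass = (hex_ * RESIDUE['Hex'] + hexnac * RESIDUE['HexNAc']
                + dhex * RESIDUE['dHex'] + neu * RESIDUE['Neu5Ac'] + WATER)
        if abs(mass - glycan_mass) <= tol:
            hits.append({'Hex': hex_, 'HexNAc': hexnac, 'dHex': dhex, 'Neu5Ac': neu})
    return hits

# Core-fucosylated, disialylated biantennary glycan: Hex5 HexNAc4 dHex1 Neu5Ac2
target = 5 * 162.14 + 4 * 203.20 + 146.14 + 2 * 291.26 + 18.02
print(composition_candidates(target, tol=0.5))  # recovers Hex5 HexNAc4 dHex1 Neu5Ac2
```

Note that a composition is not a structure: as the text stresses, isomeric glycans share the same mass and can only be distinguished on the basis of the free N-glycans.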

23.3.3 Release and Isolation of the N-Glycan Pool

The procedure for glycan release depends on the glycoprotein to be analyzed and on the depth of information desired. Depending on the target, different enzymatic as well as chemical methods may be applied. Thus, the asparagine-linked sugar chains of glycoproteins may be released by glycopeptidases (PNGase F, PNGase A) or endoglycosidases (Endo-H, Endo-F), or by heating in anhydrous hydrazine for several hours. The N-glycans separated from the peptide backbone may then be analyzed chromatographically, electrophoretically, by MS, or by a combination of these methods.



Note: For the release of O-glycans there is currently only a single enzyme, O-glycanase from Diplococcus pneumoniae, but because of its high substrate specificity this enzyme cleaves only Galβ1,3GalNAcα-Ser/Thr. In addition, an ideal chemical release process does not exist. However, O-glycans can usually be released by reductive alkali-catalyzed β-elimination (alkaline borohydride reaction); the free glycans are reduced in situ to avoid the formation of degradation products. This reaction is not specific, however, so that in parallel 10–20% of the available N-glycans are also released. In the process, the free glycans lose their reactive reducing ends, which are then no longer available for the easy introduction of a UV or fluorescence label. The reduced O-glycans may, however, be easily identified by mass spectrometry, so that their analysis today is usually carried out by MS.

The enzyme peptide-N4-(N-acetyl-β-glucosaminyl)asparagine amidase (PNGase F from Flavobacterium meningosepticum) cleaves practically all N-glycosidically linked sugar chains (except those bound to the amino or carboxy terminus of a polypeptide). (Also excluded are those N-glycans that carry α1,3-linked (instead of α1,6-linked) fucose at the proximal GlcNAc, as found in plant and insect cells but not in animal cells; such N-glycans are released by PNGase A.) However, the enzymatic process sometimes requires optimized reaction conditions, such as a preceding proteolytic digestion of the glycoprotein (e.g., with trypsin or chymotrypsin). The peptide fragments thus formed are conformationally more flexible and therefore give the enzyme easier access to the individual glycosylation sites. The addition of an appropriate detergent (e.g., Triton X-100, Tween 20, CHAPS, or sodium dodecyl sulfate) likewise facilitates the cleavage of the sugars, as the detergent unfolds the protein and improves the access of the enzyme to the glycosylation sites. PNGase F cleaves the N-glycosidic bond between the proximal GlcNAc (the anchor sugar) and the amino acid asparagine (Figure 23.23). While the N-glycans are released in unaltered form, the anchor amino acid asparagine (Asn) is converted to aspartic acid (Asp). The removal of the sugar residues by PNGase F usually causes a decrease in molecular weight that is evident in SDS-PAGE (see the example of EPO, Figure 23.15). The released sugar chains can be extracted with (freezer-cooled) ethanol (10%).

Phenolic extraction should be repeated until no interphase is detectable, to obtain complete denaturation of the proteins. Since phenol is partly soluble in water, the phenol dissolved in the aqueous phase is removed by extracting the aqueous phase with a chloroform/isoamyl alcohol mixture. For the purification of RNA, the phenol should be buffered in water or low-pH buffers ("acidic phenol").
DNA contaminants are much more soluble in acidic phenol and can thus be removed more efficiently. The nucleic acids purified by phenolic extraction can be precipitated with ethanol as described below (Section 26.1.3).

26.1.2 Gel Filtration

Gel Filtration/Permeation Chromatography, Section 10.4.1

Reversed Phase Chromatography, Section 10.4.2

Gel filtration can also be used for the purification of DNA (and RNA) solutions (the most common materials are Sephadex G50 or G75 and Bio-Gel P2). The purification effect is based on size exclusion, which allows the separation of certain nucleic acid contaminants. Large DNA molecules elute much faster (usually in the void volume of the column) than smaller, low molecular weight contaminants, which become trapped in the pores of the column material and therefore elute later in the chromatography. The nucleic acid-containing solution is loaded onto the gel filtration column and the column is eluted with buffer. The eluate is collected in fractions, which can be tested for nucleic acid content. Gel filtration columns can be self-made (Figure 26.2) from a glass Pasteur pipette, which can be plugged with glass wool or glass beads. When purifying only very small amounts of DNA, the glass pipette should be silanized beforehand, as DNA sticks to glass. Many companies offer suitable gel filtration or purification columns at relatively low cost; loading volume and elution volume are predefined by the supplier. In contrast to regular gel filtration columns, spin columns use centrifugal force to apply the sample and elute the fractions, and are therefore much faster. Instead of gel filtration, the principle of reversed-phase chromatography can also be applied: the nucleic acid solution is loaded onto the column material at low salt concentration and eluted with a high-salt buffer (e.g., Elutip columns).

26 Isolation and Purification of Nucleic Acids


Figure 26.2 Gel filtration for the purification of DNA solutions. (a) Gel filtration columns can be made out of Pasteur pipettes and are filled with equilibrated column material. Depending on the size of the DNA, Sephadex G50, Sephadex G25, or Sephacel materials are used. (b) Owing to the molecular sieve effect, small molecules such as contaminating nucleotides or salts are retained, while large DNA molecules are not retarded and elute first from the column. A typical elution profile is depicted. The fractions containing the DNA can be analyzed by OD determination, ethidium bromide staining, or, when purifying radioactive DNA, by measuring radioactivity. The maxima are closer together when small DNA molecules are purified.

26.1.3 Precipitation of Nucleic Acids with Ethanol

The most common method for the concentration and further purification of nucleic acids is precipitation with ethanol. In the presence of monovalent cations, DNA and RNA form an ethanol-insoluble precipitate that can be collected by centrifugation. Monovalent cations are usually supplied by the addition of sodium acetate or ammonium acetate. Ammonium acetate is used to reduce the co-precipitation of free nucleotides. However, ammonium ions can inhibit the activity of certain enzymes, for example, T4 polynucleotide kinase. For some applications, RNA is precipitated in the presence of lithium chloride. Lithium chloride is soluble in ethanol and is not precipitated together with the nucleic acids. Chloride ions can act as inhibitors of several reactions; precipitation with chloride ions should therefore only be used in certain circumstances. In the laboratory, the nucleic acid solution is adjusted to the desired salt concentration using a more concentrated stock solution of, for example, sodium acetate. A 2.5–3-fold volume of ethanol is added to this solution, which is incubated, depending on the nucleic acid to be precipitated, at room temperature or at −80 °C, and centrifuged (Figure 26.3).

Figure 26.3 Precipitation of nucleic acids with ethanol. (a) To the aqueous nucleic acid solution a 2.5- to 3-fold volume of absolute ethanol (or, in the case of isopropanol, a 0.5–1-fold volume) is added and the nucleic acids are pelleted by centrifugation. A colorless pellet is usually visible on the bottom of the tube. (b) High molecular weight genomic DNA can be precipitated at the phase interface by cautiously overlaying the aqueous phase with ethanol. The DNA can be visualized by winding it onto a sterile rod.

Part IV: Nucleic Acid Analytics

The carrier has to be chosen so that the material does not interfere with the subsequent reactions or applications. For example, tRNA will also be phosphorylated by T4 polynucleotide kinase and should not be used as carrier if the subsequent reaction involves phosphorylation. Glycogen can interfere with DNA–protein complexes.

The precipitated salt can be removed by washing the pellet with 70% ethanol; in contrast to the precipitated DNA, most salts are soluble in 70% ethanol. The nucleic acid pellet is dried briefly and redissolved in buffer or water. Nucleic acids can also be precipitated by the addition of a 0.5–1-fold volume of isopropanol. This protocol is advantageous if the volume of the reaction should be kept to a minimum. However, sodium chloride co-precipitates more readily from isopropanol, and isopropanol is less volatile than ethanol, so the nucleic acid pellet needs to be washed carefully with 70% ethanol.

Precipitation of Small Amounts Using Carrier Material

Very low concentrations of RNA or DNA precipitate inefficiently and are therefore precipitated in the presence of a suitable carrier material. High molecular weight DNA (>150 kb) is not precipitated but purified using dialysis or extraction with 2-butanol. Extremely high molecular weight DNA can be analyzed by embedding the cells or tissue to be analyzed in agarose; all the following steps are then performed in these agarose blocks.

Lysis of Cell Membranes and Protein Degradation

A fundamental step during the isolation of genomic DNA is the proteolysis of cellular proteins by proteinase K; simple extraction of the proteins with phenol is not sufficient. Moreover, genomic DNA is complexed with histones and histone-like proteins, which cannot be removed completely by phenolic extraction. The optimal incubation temperature for proteinase K is between 55 and 65 °C, and its enzymatic performance is optimal at 0.5% sodium dodecyl sulfate (SDS). Incubation of the starting material in proteinase K-containing buffer is in many cases sufficient to disrupt and lyse the cell membranes. The addition of RNase removes contaminating RNA. In many cases, the cells or tissue need to be disrupted mechanically before the protease digestion. Homogenizers disrupt the cells with blades rotating at high frequency; a so-called French press or ball mills can also be used. A French press uses high pressure to disrupt the cells, while ball mills contain rapidly moving small beads made of glass or steel.
If the DNA is obtained from tissue, the tissue is shock-frozen in liquid nitrogen and then pulverized to achieve a homogeneous mixture. The lysis of bacterial cell walls is achieved with lysozyme, while yeast cell walls are disrupted with zymolyase or lyticase, enzymes that specifically degrade yeast cell walls. These enzymes are inactivated during the subsequent proteinase K treatment.

Table 26.2 Enzymes and lysis reagents for isolation of genomic DNA.

Nucleic acid origin                     Lysis by                              Subsequent treatment
Eukaryotic cell cultures                Sodium dodecyl sulfate (SDS)          Proteinase K
Tissue                                  Sodium dodecyl sulfate/proteinase K   Proteinase K
Plants                                  SDS or N-lauroylsarcosine             Proteinase K
Yeast (Saccharomyces cerevisiae;        Zymolyase or lyticase                 Proteinase K
  Schizosaccharomyces pombe)
Bacteria (Escherichia coli)             Lysozyme                              Proteinase K




Purification and Precipitation of Genomic DNA

Proteinase K is inactivated and removed by phenolic extraction. The precipitation of genomic DNA after the addition of ethanol can easily be observed: the DNA precipitates at the interphase between the water and ethanol and can be rolled up as filaments on a sterile rod (Figure 26.3). Genomic DNA should be dried very cautiously to avoid later solubility problems; it can be dissolved by incubation for several hours at 4 °C. Genomic DNA isolated by ethanol precipitation is of sufficient purity for most applications. Phenolic extraction with subsequent ethanol precipitation results in an average molecular size of around 100–150 kb. This size is sufficient for the generation of DNA libraries using bacteriophage λ vectors and for Southern blot analysis. The construction of cosmid libraries requires DNA fragments of at least 200 kb, so extraction with organic solvents cannot be used. Instead, proteinase K and the remaining proteins are denatured with formamide and removed by dialysis using collodion bags. This method avoids shear forces and yields high molecular weight DNA fragments (>200 kb). A fast and easy isolation method is the lysis of cells and denaturation of the proteins with guanidinium hydrochloride, after which the DNA is isolated by ethanol precipitation. This method yields genomic fragments with an average size of 80 kb and can be used for Southern blot or PCR analysis.

Additional Purification Steps

Genomic DNA can be purified further by CsCl density gradient centrifugation, during which RNA contaminants are pelleted and removed completely. CsCl density gradient centrifugation is described in Chapter 1 and in Section 26.3. Commercially available kits use the anion-exchange method for the isolation and purification of genomic DNA. The kits are available with columns containing the anion-exchange material operated by gravity flow, or as spin columns using centrifugal force.
This purification method does not require any organic extraction; however, the DNA is subjected to shear forces, so that very high molecular weight DNA cannot be isolated by this method. The genomic DNA isolated on anion-exchange columns can be used for Southern blots, PCR, and next-generation sequencing. Polysaccharide contaminants can be removed by treatment with CTAB (cetyltrimethylammonium bromide, Figure 26.5). This purification step is essential when genomic DNA is isolated from plants or bacteria, as these contain high levels of polysaccharides. CTAB complexes the polysaccharides and removes the remaining proteins. Upon addition of chloroform/isoamyl alcohol the complexed polysaccharides precipitate at the interphase. An important factor is the NaCl concentration: if the concentration is below 0.5 M, the genomic DNA will also precipitate in the presence of CTAB.

Figure 26.5 CTAB (cetyltrimethylammonium bromide or hexadecyltrimethylammonium bromide). The quaternary ammonium salt acts as a cationic detergent.

26.3 Isolation of Low Molecular Weight DNA

26.3.1 Isolation of Plasmid DNA from Bacteria

Plasmids, that is, extrachromosomal, mostly circular DNA molecules, occur naturally in microorganisms. Plasmids range in size from 2 to more than 200 kb and fulfill various genetic functions. In daily laboratory practice, plasmids consist of defined genetic elements (replication origin, resistance gene, and polylinker; Figure 26.6). These so-called plasmid vectors are essential tools for a huge variety of applications. The methods described below deal exclusively with the isolation of bacterial plasmid vectors. Plasmids are grown in bacteria under antibiotic selection. The plasmids contain at least one selection gene conferring resistance to a certain antibiotic; for example, the bla gene coding for β-lactamase enables bacteria that carry it to grow in ampicillin-containing media. In addition, plasmids contain a bacterial origin of replication for propagation of the plasmid in the bacterium. The kind of origin of replication determines the copy number of a plasmid in the bacterium (Table 26.3).

26

Isolation and Purification of Nucleic Acids

671

Table 26.3 Origins of replication of commonly used plasmid vectors and copy numbers.

Plasmid                   Origin of replication   Resistance gene               Copy number
pBR322 and derivatives    pMB1                    Amp^r, Tet^r                  15–20
pUC                       pMB1                    Amp^r                         500–700
pBluescript               pMB1                    Amp^r                         300–500
pGEM                      pMB1                    Amp^r                         300–400
pVL 1393/1392             ColE1                   Amp^r                         >15
pACYC                     p15A                    Chloramphenicol^r, Tet^r      10–12
pLG338                    pSC101                  Kan^r, Tet^r                  ca. 5

Plasmids are classified as low copy (copy number < 20) and high copy plasmids (copy number > 20). The copy number of a plasmid is a major determinant of its yield from a bacterial culture. Most plasmids contain a mutated version of the pMB1 origin of replication derived from the ColE1 multi-copy plasmids of the Enterobacteriaceae family. The isolation procedure can be divided into growth and lysis of the bacteria, followed by isolation and purification of the plasmid DNA.

Bacterial Culture

For the isolation of plasmids, derivatives of the Escherichia coli strain K12 are used. The strain is considered biologically safe as it lacks pathogenicity genes (e.g., factors relevant for adhesion and invasion, toxins, and certain surface molecules). Not all Escherichia coli K12 strains are equally useful for plasmid production. Good host strains are, for example, DH1, DH5α, and XL1 Blue. Certain strains like HB101 and JM100 express a high amount of endonucleases and carbohydrates that are detrimental to the plasmid isolation procedure. To avoid mutations and unwanted DNA recombination, strains deficient in recombinase A (recA−), like XL1 Blue and its derivatives, are preferred. Bacteria are grown in liquid culture in autoclaved Luria broth (LB; contains yeast extract, Bacto tryptone, and sodium chloride) in the presence of the antibiotic. The amount and quality of the added antibiotic are important for the plasmid yield. Ampicillin is temperature sensitive and should not be added to hot, freshly autoclaved medium. According to good microbiological practice, the broth is inoculated with a single bacterial colony, which is first grown in a small volume of medium and then diluted to the needed volume. According to the culture volume, DNA preparations are classified as "mini-" (1–10 ml), "midi-" (25–100 ml), or "maxi-" (>100 ml) preparations. The yield of low copy plasmids can be increased by the addition of chloramphenicol to the medium (see below).
Chloramphenicol is also used for the isolation of high copy plasmids as it keeps the number of bacteria, and thus the amount of bacterial debris, low. Plasmids containing a ColE1 origin of replication can be amplified selectively relative to the bacterial genome: during the logarithmic growth phase an inhibitor of translation (e.g., chloramphenicol) is added to the bacterial culture. Chloramphenicol inhibits the synthesis of the Rop (repressor of primer) protein, which controls the copy number of the plasmid. Inhibition of this protein results in increased replication of the plasmid (relaxed replication).
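The low/high copy classification above can be written down directly; a trivial sketch (the threshold of 20 copies per bacterium is taken from the text, the function name is our own):

```python
def plasmid_class(copy_number):
    """Classify a plasmid by its copy number per bacterium
    (low copy < 20, high copy > 20, per the text)."""
    return "high copy" if copy_number > 20 else "low copy"

print(plasmid_class(500))  # pUC (500-700 copies): 'high copy'
print(plasmid_class(15))   # pBR322 (15-20 copies): 'low copy'
```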

Lysis of Bacteria

Many different methods are available for the lysis of the plasmid-containing bacteria (Table 26.4). The method of choice depends on the type and use of the plasmid to be isolated.

Table 26.4 Methods used to lyse bacteria.

Method                  Lysis agents        Comment
Alkaline lysis          SDS/NaOH            Quick and easy; the most suitable method for large plasmids and low copy plasmids
Koch (boiling) lysis    Lysozyme/100 °C     Endonuclease A is inactivated completely
Lithium method          LiCl/Triton X-100   Quick and efficient; not suitable for large plasmids (>10 kb)
SDS lysis               Lysozyme/SDS        Frequently used for large plasmids (>15 kb)
Figure 26.6 Composition of a typical plasmid vector for cloning and amplification of DNA fragments. The fragment of interest is cloned into the artificial multiple cloning site (MCS). T7 and T3 are promoters that are recognized specifically by the RNA polymerases of the T7 and T3 bacteriophages and are used for RNA synthesis from the cloned fragments. AmpR denotes a selection gene that renders bacteria containing the vector resistant to ampicillin: the bacteria can grow in ampicillin-containing medium. The origin of replication (ori) is necessary for the autonomous replication of the plasmid. The ori region enables double-stranded replication of the plasmid, whereas a second origin of replication (e.g., from the single-stranded phage f1) permits single-strand replication (ss-ori).

672

Part IV: Nucleic Acid Analytics

Figure 26.7 Principle of the alkaline lysis of bacteria for the isolation of plasmid DNA. (1) The bacteria are lysed using SDS and the DNA is denatured by NaOH. (2) The solution is neutralized by the addition of potassium acetate. Denatured proteins and chromosomal DNA are precipitated together with the potassium salt of dodecyl sulfate. Low molecular weight plasmid DNA remains in solution and renatures. (3) The insoluble complexes are separated by centrifugation and the plasmid DNA can be isolated. Source: adapted from Micklos, D.A. and Freyer, G.A. (1990) DNA Science: A First Course in Recombinant DNA Technology, Cold Spring Harbor Laboratory Press and Carolina Biological Supply Company, Cold Spring Harbor.

It is important that the RNase A does not contain any contaminating DNases. This can be achieved by incubation of the RNase A solution at 95 °C. RNase A is a very stable enzyme that renatures after heat treatment to yield an active enzyme, whereas DNases are permanently inactivated.

The most common method is alkaline lysis (Figure 26.7). The bacterial culture is centrifuged and the pellet resuspended in a buffer containing EDTA. EDTA complexes bivalent cations (Mg2+, Ca2+) that are important for the structural integrity of the bacterial cell walls. The buffer can also contain RNase A to degrade most of the bacterial RNA in this first step. The bacterial suspension is lysed completely by the addition of SDS and NaOH. SDS acts as a detergent, solubilizing the phospholipids and proteins of the bacterial cell walls. Sodium hydroxide denatures proteins as well as chromosomal and plasmid DNA. The time the solution is incubated under alkaline conditions is important for the quality of the plasmid DNA: too long an incubation leads to irreversible denaturation of the plasmid; too short an incubation results in incomplete lysis of the bacteria and low plasmid yield. Completely denatured plasmid DNA can be detected by agarose gel electrophoresis; it has a higher mobility than superhelical plasmid DNA and is stained less intensely by ethidium bromide. The lysate is neutralized with potassium acetate buffer. Potassium dodecyl sulfate has a much lower solubility in water than sodium dodecyl sulfate and precipitates at the high salt concentrations present in the lysate. Denatured proteins, high molecular weight RNA, denatured chromosomal DNA, and cellular debris form insoluble complexes and are co-precipitated with the potassium dodecyl sulfate. The smaller plasmid molecules remain in solution and renature upon neutralization. The insoluble debris is removed by centrifugation and the supernatant can be processed further. For some applications the purity of this DNA solution is sufficient and the plasmid DNA can simply be precipitated with ethanol or isopropanol and washed with 70% ethanol.

This quick and easy method is useful for preparing many plasmids simultaneously and is used to check cloning efficiency: many single bacterial colonies are inoculated into small volumes of medium and successful cloning is checked by digesting the plasmid DNA with restriction enzymes to see whether the plasmid contains the desired insert. Commercially available kits are based on the alkaline lysis principle; the plasmid DNA is purified as described below by anion exchange chromatography before precipitation with ethanol.

Besides alkaline lysis, bacteria can be lysed thermally (boiling lysis). The bacterial cell walls are broken down by addition of lysozyme and the lysed bacteria are heated briefly. The debris is pelleted and the plasmid DNA can be isolated by ethanol precipitation. This method does not completely inactivate the endonuclease A present in some E. coli strains (HB101, endA+); the plasmid DNA should therefore be purified by phenolic extraction prior to precipitation. Other (less common) methods are incubation with the non-ionic detergent Triton X-100 (Figure 26.8) in the presence of lithium chloride, or lysis by SDS and lysozyme. The latter method (without the addition of NaOH) is used when high molecular weight plasmids need to be isolated, since such plasmids cannot be renatured completely after exposure to NaOH.


Figure 26.8 Non-ionic detergent Triton X-100.

Lysozyme: Abundant hydrolase found in saliva and tear fluid. Lysozyme hydrolyzes the 1,4-β-linkages between N-acetylmuramic acid and N-acetyl-D-glucosamine present in bacterial cell walls.

Purification of DNA by Anion Exchange Chromatography

In general, commercially available columns are used for purification by anion exchange. The positive charge is provided by protonated diethylaminoethyl (DEAE) groups. The negatively charged DNA binds to the column material at moderate salt concentrations (750 mM); proteins and degraded RNA do not bind under these conditions. The column is washed with buffer of higher salt concentration (1 M) to elute traces of bound protein or RNA; DNA does not elute under these conditions. The DNA is finally eluted at an even higher salt concentration (1.25 M). The exact buffer conditions depend on the column material and supplier. Table 26.5 gives an overview of the yields expected with anion exchange columns.

Several purification protocols have been developed for the removal of endotoxins prior to purification by anion exchange. The lipopolysaccharides adhering to the bacterial membranes are treated with detergents (n-octyl-β-D-thioglucopyranoside, OSPG) to remove binding proteins. The lipopolysaccharides are then removed on columns loaded with polymyxin B, an antibiotic that binds lipopolysaccharides very efficiently. Ultrapure DNA with very low endotoxin content can be obtained by repeated purification on a CsCl density gradient.

Purification of DNA by Density Gradient Centrifugation

Ultrapure DNA can be obtained in high yield by centrifugation in a CsCl density gradient. Owing to the significantly improved quality of commercially available anion exchange kits, density gradient centrifugation has lost much of its relevance and is therefore only summarized briefly. The isopycnic centrifugation of DNA molecules in a CsCl density gradient is performed in the presence of ethidium bromide. The mechanism and thermodynamic aspects of ethidium bromide intercalation are discussed in Section 27.2. Plasmid DNA and chromosomal DNA

Table 26.5 Approximate DNA yield after anion exchange purification. The yield of high copy plasmids is approx. 2–5 μg ml−1 of culture, that of low copy plasmids approx. 0.1 μg ml−1.

Vector       Plasmid type   Bacterial culture (ml)   Yield (μg)
pUC, pGEM    High copy      25                       50–100
pUC, pGEM    High copy      100                      300–500
pBR322       Low copy       100                      50–100
pBR322       Low copy       500                      100–500
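As a rough planning aid, the per-millilitre yields implied by the rows of Table 26.5 can be turned into a small calculation. This is only a sketch: the ranges below are read from the table rows, and the helper name and target amounts are our own examples.

```python
# Approximate plasmid yield per ml of bacterial culture, derived from
# the rows of Table 26.5 (ug DNA per ml culture).
YIELD_UG_PER_ML = {
    "high copy": (2.0, 5.0),   # pUC, pGEM: 50-100 ug from 25 ml
    "low copy":  (0.2, 1.0),   # pBR322: 100-500 ug from 500 ml
}

def culture_volume_ml(target_ug, plasmid_type):
    """Culture volume needed to obtain target_ug of DNA, using the
    pessimistic (lower) end of the yield range."""
    low, _high = YIELD_UG_PER_ML[plasmid_type]
    return target_ug / low

print(culture_volume_ml(100, "high copy"))  # 50.0 ml
print(culture_volume_ml(100, "low copy"))   # 500.0 ml
```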

Anion Exchange Chromatography, Section 10.4.7

The described purification method can co-purify certain lipopolysaccharides present in almost all Gram-negative bacteria. The presence of these so-called endotoxins is critical when transfecting DNA into sensitive cells or cell lines. Endotoxins can reduce transfection efficiency and can stimulate protein synthesis or activate the innate immune or complement system.


Figure 26.9 Purification of plasmid DNA by CsCl density gradient centrifugation in the presence of ethidium bromide. In vertical rotors, the gradient forms parallel to the axis of the rotor. After centrifugation, when the rotor has stopped, the gradient tips but keeps its layers, so that the plasmid is visible as circular bands. Due to their different densities, superhelical form I and forms II and III (open circular and linear forms) are separated within the gradient. RNA–ethidium complexes are pelleted on the wall of the centrifuge tube. To isolate the plasmid, the sealed tube is vented with a needle and the plasmid band is collected by puncturing the tube with a second cannula just below the band; the DNA-containing solution is aspirated with a syringe. If genomic DNA is present, it is found above form I because of its lower density. For ultrapure DNA, the CsCl density gradient purification is repeated.

can be distinguished by their different densities with intercalated ethidium bromide. Ethidium bromide intercalates into double-stranded DNA, preferentially into linear or nicked plasmid DNA and to a lesser extent into covalently closed circular plasmid DNA. The resulting differences in density are used to separate the different molecular forms of DNA (Figure 26.9). The buoyant density of RNA is higher than the maximal density of CsCl, so RNA–ethidium bromide complexes pellet; RNA separation within a gradient is achieved using Cs2SO4. The ethidium bromide is removed by repeated extraction of the DNA solution with n-butanol, and any remaining traces are removed completely by phenolic extraction. The high concentration of CsCl can be removed by dialysis of the DNA against TE buffer or water; alternatively, the DNA can be diluted to low concentration and precipitated with ethanol.

26.3.2 Isolation of Eukaryotic Low Molecular Weight DNA

Yeast Plasmids

The isolation of ultrapure yeast plasmids is difficult; in practice, total DNA is isolated. Since yeast plasmids contain both a yeast and a bacterial origin of replication, pure yeast plasmids are obtained after re-transformation into E. coli, as contaminating chromosomal yeast DNA cannot replicate in bacteria.

Hirt Extraction

Low molecular weight extrachromosomal DNA, like plasmid or viral DNA, is isolated from cell or tissue cultures using a protocol established by B. Hirt in 1967 for the isolation of polyomavirus DNA from murine cells. The cells are lysed using 0.5% SDS and the lysate is adjusted to 1 M NaCl. The mixture is incubated overnight at 0 °C and centrifuged. The supernatant contains the low molecular weight DNA and can be purified using proteinase K digestion and phenolic extraction.

26.4 Isolation of Viral DNA

26.4.1 Isolation of Phage DNA

Bacteriophage λ and others are widely used as vectors for phage display, as reporter phages, or for cloning of genomic libraries. No other cloning system allows insertion of high molecular weight DNA fragments (10–20 kb) as conveniently, and the system is also well suited to high throughput screening. It may be necessary to isolate and analyze the DNA fragment inserted into the phage genome.


Propagation of Phages

Bacteriophages are propagated in liquid cultures of E. coli; the choice of host strain depends on the bacteriophage strain. The bacteria are grown to log phase in maltose-containing medium, since maltose induces expression of the bacterial receptor (LamB) for bacteriophage λ. The bacteria are harvested and the culture is adjusted to a certain density with a Mg2+-containing buffer (λ-diluent, SM medium). The bacterial cell number is determined photometrically: the absorption of the culture is measured at 600 nm (blank: pure medium); 1 OD corresponds to approx. 8 × 10^8 bacteria per ml. The Mg2+ ions stabilize the phage particles, and Mg2+-containing media are used for phage propagation (NZCYM medium).

For optimal propagation of bacteriophages, the initial ratio of phages to bacteria is important. If the number of phages greatly outweighs the number of bacteria, the bacteria are lysed completely before phage amplification can occur and the yield will be very low. If the bacterial culture is initially infected with too few phages, the bacteria will outgrow the phage infection and the phage yield is also low. The optimal initial ratio needs to be determined for each phage and bacterial strain; a good ratio is found when complete lysis takes more than 8 h. Complete lysis of a bacterial culture is indicated by the sudden clearing of the turbid culture and the appearance of lysed bacterial debris.
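The cell-density and phage-input arithmetic above can be sketched as a minimal calculation. The conversion factor of 8 × 10^8 cells per OD600 per ml comes from the text; the function names and the example ratio of 1 phage per 100 bacteria are hypothetical illustrations, since the optimal ratio must be found empirically.

```python
CELLS_PER_OD_PER_ML = 8e8  # approx. conversion for E. coli at 600 nm (from text)

def cells_per_ml(od600):
    """Estimate bacterial density from a photometric OD600 reading."""
    return od600 * CELLS_PER_OD_PER_ML

def phage_inoculum(od600, volume_ml, phage_per_bacterium=0.01):
    """Number of phage particles for a chosen initial phage:bacteria ratio.
    The ratio 0.01 is only an illustrative default; the optimum must be
    determined for each phage/host pair."""
    return cells_per_ml(od600) * volume_ml * phage_per_bacterium

print(cells_per_ml(0.5))           # 4e8 cells per ml
print(phage_inoculum(0.5, 100.0))  # 4e8 phage particles for the whole culture
```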

Isolation of Phage Particles

To remove bacterial RNA and DNA, RNase and DNase are added to the bacteriophage culture; the phage DNA remains intact, protected by the intact phage capsid. The phage particles are isolated by ultracentrifugation (100 000g). The purity of the phages is sufficient for most applications. Phages form a colorless to light brown pellet that is resuspended in TE. Phage particles are very sensitive to complexing agents that decrease the Mg2+ concentration; resuspension of the pellet in EDTA-containing buffers destabilizes the capsid and facilitates later lysis. In some protocols the particles are precipitated using poly(ethylene glycol) (PEG). If higher purity is desired, the phages can be purified further by CsCl density gradient centrifugation. The isolated and purified phage particles are lysed by proteinase K, which degrades the protein components of the capsid, and the DNA is purified by phenolic extraction or anion exchange chromatography (Section 26.1). Phage DNA is a high molecular weight linear DNA (45–50 kb) that should be handled with care. For many applications, phage DNA can also be obtained with commercial kits and the needed fragments amplified by PCR.

26.4.2 Isolation of Eukaryotic Viral DNA

The diversity of eukaryotic viruses requires several additionally adapted strategies for the isolation of their nucleic acids. Two general purification methods can be distinguished. In infected cells the viral DNA is present as extrachromosomal DNA (e.g., adenoviruses, polyomaviruses, SV40, papillomaviruses, baculoviruses). Such viral DNA can be isolated from the infected cells by Hirt extraction (Section 26.3.2) in high yield and sufficient purity for many applications. Because some viruses contain high molecular weight DNA, the same precautions should be taken as with any other high molecular weight DNA. The Hirt extraction method does not yield highly pure virus DNA, and the viral DNA might be modified differently than in the virus particle (bound proteins, covalent modifications, circular or non-covalently closed DNA).

Highly pure, native viral DNA can be obtained by purification of the viral particles. In most cases, infected cells release newly synthesized virus particles into the medium. The viral particles can be pelleted by ultracentrifugation (approx. 100 000g) and purified using CsCl gradient centrifugation. The viral shell is then lysed in a manner specific to the virus type. Usually, viruses are incubated with proteinase K followed by phenolic extraction. With this method, proteins bound to the viral DNA, like the terminal protein bound to adenoviral DNA or the chromatin-like structures of polyoma or SV40 nucleic acids, are destroyed. In some cases, mild alkaline lysis is sufficient to isolate native viral DNA. Commercial kits use silica-membrane-based matrices or anion exchange chromatography. Here the viral DNA is isolated from cell-free fluids (supernatants, blood plasma), as the methods do not allow the separation of viral and cellular DNA. Using spin columns or 96-well filter plates, blood or other samples can be analyzed by (RT-)PCR for the presence of viral DNA (or RNA), for example, of HBV, HCV, and HIV, on a high throughput basis.


DNA library (genomic or cDNA library): Genomic DNA libraries contain the whole genome of an organism, split into smaller fragments that can be handled and cloned. The genome is fragmented enzymatically and then cloned into suitable vector systems, like the bacteriophage λ genome. Besides genomic libraries, cDNA libraries represent the mRNA spectrum of a cell or organism; the cDNA is generated by reverse transcription of the mRNA.


26.5 Isolation of Single-Stranded DNA

26.5.1 Isolation of M13 Phage DNA

Filamentous phages like M13, f1, or fd possess single-stranded, covalently closed circular DNA (approx. 6.5 kb). Cloning of foreign DNA into the phage genome allows the isolation of single-stranded DNA of the desired sequence in high yield. The phage M13 infects exclusively E. coli strains (e.g., JM109, JM197), entering via the sex pili that are encoded on the F episome. Inside the bacterium the phage genome is converted into the replicative form (RF), a double-stranded version of the phage DNA. Infection of the bacteria with M13 does not result in lysis, as with bacteriophage λ infections, but only in diminished growth rates. The single-stranded version of the M13 genome is obtained by isolating the phage particles; the replicative form can be purified from the bacterial pellet. M13 phages are isolated by poly(ethylene glycol) precipitation or by anion exchange chromatography. Commercial kits are available for high throughput isolation of M13; the purification is based on silica-gel membranes. At high salt concentrations, single-stranded DNA binds to this material with higher affinity than double-stranded DNA or proteins.

26.5.2 Separation of Single- and Double-Stranded DNA

Single-stranded and double-stranded DNA can be separated from a complex mixture by hydroxyapatite chromatography. Hydroxyapatite, a crystalline form of calcium phosphate (Ca5(PO4)3(OH)), binds double-stranded DNA preferentially and single-stranded DNA or RNA with much lower affinity. Double-stranded DNA is bound in phosphate-containing buffer at elevated temperature (60 °C); under these conditions the single-stranded DNA does not bind to the column and is found in the void volume. The double-stranded DNA can then be eluted by increasing the phosphate concentration of the buffer. A drawback of this purification method is the high phosphate content of the obtained nucleic acid fractions, which interferes with nucleic acid precipitation. The fractionated nucleic acids are therefore concentrated first with sec-butanol and then desalted by gel filtration.

26.6 Isolation of RNA

Working with RNA requires even more care than working with DNA. In contrast to DNases, RNases are very stable, need no cofactors, and cannot be inactivated completely by autoclaving. Only ultrapure buffers should be used for the isolation of RNA. RNases can be inactivated by treating the buffers with diethyl pyrocarbonate (DEPC) (Figure 26.10). DEPC inactivates RNases by covalent modification of the histidine residue in the active center of the enzyme. Buffers containing free amino groups cannot be treated with DEPC. DEPC is toxic due to its modifying properties. Excess DEPC needs to be inactivated by autoclaving, as DEPC present during RNA purification will modify the bases of the RNA (carboxyethylation of adenines and, rarely, guanines); DEPC is degraded to carbon dioxide and ethanol. The use of gloves and sterile, RNase-free plastic ware is essential when handling RNA. Glassware should be decontaminated by heat treatment at 300 °C. For many experiments, RNase inhibitors can be added, but these inhibitors can only inactivate low amounts of RNases (Table 26.6). In addition

Figure 26.10 Chemical formula and mechanism of DEPC (diethyl pyrocarbonate) treatment. DEPC inactivates RNases by covalent modification of amino groups and histidines. DEPC degrades to ethanol and carbon dioxide upon heating and autoclaving.


Table 26.6 Frequently used RNase inhibitors.

RNase inhibitor                    Mode of action and comments
RNasin                             Protein from human placenta; forms non-covalent equimolar complexes with RNases; cannot be used under denaturing conditions
Diethyl pyrocarbonate              Covalent modification; used for buffer treatment; needs to be inactivated
Vanadyl ribonucleoside complexes   Transition state analog that binds RNases and inhibits their activity; cannot be used for cell-free translation systems
SDS, sodium deoxycholate           Denaturation
β-Mercaptoethanol                  Reduction
Guanidinium thiocyanate            Used in connection with cell lysis; denatures RNases reversibly
Formaldehyde                       Used in denaturing agarose gels; covalent modification
to these commonly used RNase inhibitors, optimized protein- or antibody-based inhibitors specific for certain RNases are commercially available.

26.6.1 Isolation of Cytoplasmic RNA

In contrast to DNA, which is localized in the nucleus, most RNA molecules are located in the cytoplasm. Cytoplasmic RNA is composed of various RNA species, such as the classical, long-known ribosomal RNA, transfer RNA, and messenger RNA. With recent technologies like deep sequencing and tiling arrays, new RNA species have been identified: a majority of the human genome is transcribed, while only an estimated 2% of these transcripts are translated into proteins. The so-called non-coding RNAs constitute a new group of RNA molecules with various functions, many of which have yet to be discovered. Non-coding RNAs with sizes above 200 nt are classified as long ncRNAs, whereas miRNAs (micro), piRNAs (PIWI-interacting), and siRNAs (small interfering) belong to the group of small ncRNAs.

Due to the very heterogeneous nature of these RNAs, various isolation and purification protocols and commercial kits are available. For some applications, like Northern blots, RT-PCR, or ribonuclease protection assays, the isolation of cytoplasmic RNA is sufficient. Minor contamination of the RNA preparation with genomic DNA can be excluded by the use of proper controls. An RT-PCR reaction, for example, can be performed without reverse transcriptase: the PCR should then be negative, and a positive result even without reverse transcriptase indicates contaminating genomic DNA. The use of intron/exon-spanning primers is also recommended; with such primer pairs, only spliced mRNA yields fragments of the expected size. For some applications it can be useful to enrich or purify the mRNA out of the total RNA.

Cultivated Cells

The plasma membranes of the cells are lysed with a non-ionic detergent (Nonidet P40) while keeping the cell nuclei intact. The nuclei are separated and the proteins in the cytoplasmic fraction are degraded using proteinase K. The RNA can then be purified by phenolic extraction.
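The intron/exon-spanning primer check described above rests on a simple size argument: spliced mRNA lacks the intron, so its PCR product is shorter than the one from genomic DNA. A toy sketch (all numbers and the helper name are hypothetical examples, not values from the text):

```python
def expected_amplicons(exon_span_bp, intron_bp):
    """Expected PCR product sizes for an intron-spanning primer pair:
    cDNA from spliced mRNA gives the short product, genomic DNA the
    long one (exon span plus the intervening intron)."""
    return {"cDNA": exon_span_bp, "genomic DNA": exon_span_bp + intron_bp}

# A hypothetical pair spanning 250 bp of exon sequence and a 900 bp intron:
print(expected_amplicons(250, 900))  # {'cDNA': 250, 'genomic DNA': 1150}
```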
If cells have been transfected with plasmid DNA, the cytoplasmic RNA can be contaminated with episomal DNA, which can be removed by digestion with RNase-free DNases.

Tissue and Cultivated Cells

The nuclease activity of a tissue can be very high; the tissue is therefore frozen immediately in liquid nitrogen. Cells are lysed and proteins completely denatured using the chaotropic salt guanidinium thiocyanate. β-Mercaptoethanol and N-lauryl-sarcosine (Figure 26.11) are added to prevent degradation of the RNA. Cells or tissue are also often lysed using phenol. Since RNases are not completely inactivated by pure phenol, a mixture of acidic phenol : chloroform : isoamyl alcohol (Section 26.1) is used. Most methods and kits combine both reagents for efficient and convenient denaturation of proteins and inactivation of RNases. The RNA can be purified by anion exchange in a similar manner to DNA purification; for RNA, adapted buffer conditions are used to bind and elute the RNA from the column.


Figure 26.11 (a) N-Lauryl-sarcosine and (b) guanidinium thiocyanate.

Contaminating DNA can be removed by digestion (also on-column) with DNases. RNA can also be purified using CsCl density gradient centrifugation: RNA–ethidium bromide complexes pellet due to their higher density and can thus be separated from genomic DNA. If the RNA needs to be recovered as a band, gradients of higher density (using Cs2SO4) must be used (Section 26.3.1).

Most commercial kits are based on silica technology and use solid phase extraction (SPE). RNA (and also DNA) can be bound to filters or columns consisting of silica particles, glass fibers, or glass beads (Section 26.7). The RNA is bound to the silica material in the presence of high-salt chaotropic buffers (in most cases this buffer is provided during lysis of the cells in guanidinium thiocyanate buffer). The RNA is washed and eluted from the matrix with low salt buffers; in many commercially available kits the eluent is RNase-free water. These technologies enable researchers to obtain high quality RNA simply and quickly in high throughput quantities. All RNA isolation kit vendors offer specialized protocols and kits for all kinds of RNA sources and applications.

For some applications, for example, next generation sequencing, it is useful to remove the major part of the ribosomal RNA before sequencing; this reduces material costs caused by unnecessary sequencing of contaminating ribosomal RNA. Ribosomal RNA depletion kits are based on hybridization of the ribosomal RNA to specific oligonucleotide probes. The hybridized rRNA:DNA strands are then bound to beads and removed from the solution, for example, by magnetic separation (Section 26.7).

26.6.2 Isolation of Poly(A) RNA

Nearly all eukaryotic mRNA species carry long adenine-rich regions at their 3′ termini. These poly(A) tails are used to purify mRNA from cytoplasmic RNA. Column or bead material is coupled to short, thymidine-rich single-stranded DNA fragments (oligo(dT)). The poly(A) tails hybridize to the oligo(dT) strands and the mRNA is thereby bound to the column material (Figure 26.12). Contaminating RNAs without poly(A) tails can be easily removed by washing the column.

Figure 26.12 Isolation of poly(A) RNA using an oligo(dT) column. Total RNA (cytoplasmic RNA) is loaded onto the column. RNA with a poly(A) tail is bound to the column by hybridization of the adenines to the oligo(dT) residues, whereas all other molecules are collected in the flow-through. The poly(A) RNA is eluted using conditions that destabilize the dT:rA hybrids.


To ensure optimal hybridization and loading of the column, the starting RNA material needs to be denatured. For optimal yield, the starting material can be applied to the column several times. The poly(A) RNA is bound to the column at high salt concentrations (500 mM NaCl or LiCl) and the purified poly(A) RNA is eluted with water; these conditions destabilize the dT:rA hybrids. Low cost oligo(dT) columns can be prepared by coupling oligonucleotides (dT12–18) to activated column material. For more convenience, commercial kits are available in different formats.

26.6.3 Isolation of Small RNA

In recent years significant research has focused on small non-coding RNAs such as miRNAs, siRNAs, and piRNAs, with sizes below 200 nt. These RNAs are purified from tissues, cells, or extracellular vesicles such as exosomes. Many of the RNA isolation and purification protocols developed for longer RNAs are not optimal for small RNAs; ethanol precipitation, for example, is inefficient for small RNAs, and many protocols need to be adapted. For good recovery of the small RNAs it is important to include an acidic phenol extraction at the beginning of the isolation protocol: only if the tissue, cells, or exosomes are denatured completely using acidic phenol:chloroform:isoamyl alcohol is the yield of small RNAs sufficient. Individual purification protocols depend on the column material and kit used, and specific enrichment of small RNAs can be achieved by combining different separation techniques and buffer conditions. Tailored isolation kits for the purification of small RNAs from different sources are available to the research community.

26.7 Isolation of Nucleic Acids using Magnetic Particles

In recent years, the demands on nucleic acid purification protocols have increased dramatically regarding speed, cost, yield, purity, and format. Many scientific questions require the simultaneous isolation of a huge number of samples, for example for expression profiling or single nucleotide polymorphism (SNP) analysis, so the development of automated high-throughput isolation protocols became mandatory. Certain protocol steps (e.g., centrifugation) cannot be transferred easily to automated liquid-handling systems, and new protocols had to be developed. The isolation of nucleic acids can easily be automated using magnetic particles. Beads with paramagnetic (magnetized only by an external magnetic field) or magnetic properties are used. Applications of this technique are very general and have advantages over conventional separation protocols: the material is not subjected to shear forces, as no centrifugation steps are necessary, and organic reagents are not required. The magnetic beads are loaded with the nucleic acids and brought into an external magnetic field. The beads and bound nucleic acids are retained in the magnetic field while contaminating material is washed away (Figure 26.13). If used manually, the beads are often retained in a column placed in a magnetic field. In automated liquid-handling systems, the magnetic field is usually provided by a magnetic plate on which the 96-well plate containing the beads is placed. For the isolation of DNA, silica-coated magnetic beads are used, as DNA binds to glass surfaces in the presence of chaotropic reagents (Section 26.5). Using solid-phase reversible immobilization (SPRI), DNA is loaded reversibly onto magnetic beads modified with carboxyl groups in the presence of high salt concentrations and poly(ethylene glycol) (PEG); the PEG is important for the binding of the DNA to the bead surface. Streptavidin-coated beads are used for the isolation of very low amounts of mRNA: the beads are coupled with biotinylated oligo(dT) primers to which the mRNA hybridizes and is thereby isolated. This principle of binding biotinylated nucleic acids to streptavidin beads can be applied to a huge variety of isolation methods (e.g., the isolation of DNA-binding proteins).

Figure 26.13 Principle of magnetic bead isolation. The nucleic acids in cell or bacterial lysates are bound specifically to the magnetic particles. By applying a magnetic field the beads are fixed and the contaminants can be washed away. After the washing steps the nucleic acids are eluted from the magnetic beads. All protocol steps can be performed on automated systems. The isolation protocol and the kind of bead depend on the type of nucleic acid to be purified.

26.8 Lab-on-a-chip

Not only have the format, time, and throughput of nucleic acid purification protocols improved significantly in recent decades; with lab-on-a-chip (LOC) systems it is possible to isolate DNA in a miniaturized fashion. LOC devices belong to the microelectromechanical systems (MEMS) and are based on chips between a few square millimeters and a few square centimeters in size; the handled volumes can be as low as one picoliter. The concept is to integrate all techniques, from the isolation of the nucleic acids (from blood or tissue) to their analysis, on the same chip. Such systems are also counted among the micro total analysis systems (μTAS). As with the isolation methods for automated liquid-handling systems, the protocols cannot be based on centrifugation or phenolic extraction. It is also important to achieve a sufficient concentration of the DNA for the subsequent analysis steps. The SPE methods can in part be transferred to chip technology: silica-based isolation methods, in which DNA binds to the solid phase in the presence of chaotropic reagents, are suitable, and SPRI methods are applied as well. Additional materials, such as poly(methyl methacrylate) (PMMA), are used to enlarge the active surface on the chip.

Further Reading

Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Smith, J.A., Seidman, J.G., and Struhl, K. (1987) Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York.
Farrell, R.E. (2010) RNA Methodologies: A Laboratory Guide for Isolation and Characterization, 4th edn, Academic Press, Elsevier.
Glasel, J.A. and Deutscher, M.P. (1995) Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York.
Green, M.R. and Sambrook, J. (2012) Molecular Cloning: A Laboratory Manual, 4th edn, Cold Spring Harbor Laboratory Press, Cold Spring Harbor.
Krieg, P.A. (ed.) (1996) A Laboratory Guide to RNA: Isolation, Analysis and Synthesis, Wiley-Liss, New York.
Kües, U. and Stahl, U. (1989) Replication of plasmids in Gram-negative bacteria. Microbiol. Rev., 53, 491–516.
Levinson, P., Badger, S., Dennis, J., Hathi, P., Davies, M., Bruce, I., and Schimkat, D. (1995) Recent developments of magnetic beads for use in nucleic acid purification. J. Chromatogr. A, 816, 107–111.
Micklos, D.A. and Freyer, G.A. (1990) DNA Science. A First Course in Recombinant DNA Technology, Cold Spring Harbor Laboratory Press and Carolina Biological Supply Company, Cold Spring Harbor.
Perbal, B. (1998) A Practical Guide to Molecular Cloning, John Wiley & Sons, Inc., New York.
Price, C.W., Leslie, D.C., and Landers, J.P. (2009) Nucleic acid extraction techniques and application to the microchip. Lab Chip, 9, 2484–2494.
Tan, S.C. and Yiap, B.C. (2009) DNA, RNA and protein extraction: the past and the present. J. Biomed. Biotechnol., article ID 574396.
Wen, J., Legendre, L.A., Bienvenue, J.M., and Landers, J.P. (2008) Purification of nucleic acids in microfluidic devices. Anal. Chem., 80, 6472–6479.
Zähringer, H. (2012) Old and new ways to RNA. LabTimes, (2), 52–61.

Analysis of Nucleic Acids

Nucleic acids isolated from different sources (different tissues from different organisms, or cell or tissue cultures) initially appear as a compact, high molecular weight bulk of unspecific fragments, especially in the case of genomic DNA, and are hard to analyze in this state. For further processing it is necessary to determine the purity, conformation, and fragment size and, last but not least, the sequence of these nucleic acid fragments. In this chapter we summarize the basic analytical methods available for processing nucleic acids. The methods presented provide a basic characterization and/or are prerequisites for more detailed characterization or manipulation. One example is the conversion of a high molecular weight bulk of nucleic acids into specific molecular fragments by restriction analysis; these fragments can easily be characterized further and manipulated, for example by cloning. Fragments can be separated by gel electrophoresis, visualized by staining, isolated from the gel matrix, or transferred by "blotting" onto a specific carrier material for more specific characterization by "hybridization". Most of these techniques are basic daily routines when working with nucleic acids.

27.1 Restriction Analysis

Ute Wirkner1 and Joachim W. Engels2

1 German Cancer Research Center, Clinical Cooperation Unit Translational Radiation Oncology, Im Neuenheimer Feld 400, 69120 Heidelberg, Germany
2 Goethe University Frankfurt, Institute of Organic Chemistry and Chemical Biology, Department of Biochemistry, Chemistry and Pharmacy, Max-von-Laue-Straße 7, 60438 Frankfurt am Main, Germany

Restriction analysis is used for the characterization, identification, and isolation of double-stranded nucleic acids and is thus a basic tool in nucleic acid analysis. Cloning of DNA molecules is almost unthinkable without restriction analysis. Even though PCR makes it possible to clone DNA without restriction, restriction analysis is still mostly used to prepare the DNA fragments and the vectors for cloning and to identify the resulting cloning product. In addition, for any other kind of DNA manipulation, such as mutagenesis or amplification by PCR, restriction analysis is the tool of choice to identify the desired product. To determine an initial, crude structure of any DNA, from small fragments to whole genomes, establishing a restriction map is a useful step on the way to complete sequencing. Restriction analysis of genomic DNA to detect mutations or restriction fragment length polymorphisms (RFLPs) is used for genetic mapping, to identify and isolate disease genes, or, for example, in forensics to identify individuals.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

27.1.1 Principle of Restriction Analyses

The basis for restriction analysis is the activity of restriction enzymes, which bind and cut double-stranded DNA molecules at specific recognition sequences. These are mainly so-called


Hybridization, Section 28.1
PCR, Chapter 29

type II restriction enzymes. DNA fragments resulting from this activity have a specific length, defined by the positions of the recognition sites, and can be separated according to their size by gel electrophoresis (Section 27.2.1). The analysis of a DNA molecule thus results in a specific band pattern of restriction fragments; by comparison with an appropriate size standard, each fragment can be assigned an approximate size. The detection method depends on the size of the initial DNA molecule(s): if the original molecule is comparatively small, as with most cloning products or vectors (plasmids, lambda phages, cosmids; around 3–50 kb), unspecific detection by staining all nucleic acids in the gel, for example with ethidium bromide, is sufficient (Section 27.3.1). If a certain region within a complex genome is to be analyzed, detection by specific Southern blot hybridization (Section 27.4.3) has to be performed, or the fragment has to be amplified in vitro by PCR before restriction analysis. Thus restriction analysis can be performed on double-stranded DNA of any type and size and is comparatively easy and quick to perform. The large variety of restriction enzymes and corresponding restriction sites further broadens the application spectrum.
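The band pattern produced by a complete digest follows directly from the positions of the recognition sites. The sketch below computes fragment lengths for a linear molecule; the toy sequence and the single-strand string search are illustrative simplifications (a real digest acts on both strands and is affected by methylation).

```python
def digest(seq: str, site: str, cut_offset: int):
    """Fragment lengths from a complete digest of a linear DNA.

    site       recognition sequence on the top strand (e.g. 'GGATCC')
    cut_offset position of the cut within the site (e.g. 1 for G/GATCC)
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    # Fragment lengths are the gaps between consecutive cut positions,
    # bounded by the two molecule ends.
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Two BamHI-style sites (G/GATCC) in a 30 bp toy sequence:
seq = "AAAAGGATCCTTTTTTTTGGATCCAAAAAA"
lengths = digest(seq, "GGATCC", 1)   # -> [5, 14, 11]
```

Sorting such computed lengths gives exactly the "band pattern" one would read off a gel lane next to a size standard.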

27.1.2 Historical Overview

Before restriction enzymes were discovered it was almost impossible to characterize or isolate particular genes or genomic regions. Genomic DNA isolated from cells or tissues is a mass of large, chemically monotonous molecules that in principle only allow separation according to their size. Since a functional unit such as a gene does not exist as a single molecule in a cell but is part of a much larger DNA molecule, specific breakdown of this large molecule is required to separate and isolate a section of interest (e.g., a gene). Even though DNA molecules can be sheared at random sites by mechanical forces, the result is again a heterogeneous, unspecific mixture of DNA molecules from which no defined DNA fragments can be isolated. It was only with the discovery and isolation of restriction enzymes by Arber, Smith, and Nathans in the late 1960s that the opportunity first arose to manipulate DNA in a defined way, namely, to degrade it to specific fragments of defined length, which can be specifically separated and isolated. Hence, a first detailed characterization of DNA became possible and the basis for isolating and amplifying DNA by cloning was laid. With the introduction of restriction enzymes and subsequent cloning, hybridization, and other enzymatic DNA manipulations, DNA shifted from being the least accessible to the most easily manipulable and analyzable macromolecule.

27.1.3 Restriction Enzymes

Restriction enzymes are endonucleases that occur mainly in bacteria but also exist in viruses and eukaryotes. They cleave the phosphodiester bonds of both strands of a DNA molecule by hydrolysis and differ in recognition sequence, cleavage site, organism of origin, and structure. Several thousand restriction enzymes with several hundred different recognition sequences are known.

Biological Function The biological function of restriction enzymes is to protect the organism of origin from infiltration by foreign DNA, for example from phages, by cleaving and inactivating it and thus "restricting" the growth of the phage. The organism's own DNA, and every DNA synthesized in the cell, is protected from this attack by modifications, mostly methylation. This restriction/modification (R/M) system is specific to its organism of origin and constitutes a protection mechanism, a kind of immune system. The host specificity of bacteriophages is based on this system; they can efficiently infect only bacteria that have the same methylation pattern as their bacteria of origin.

Classification of Restriction Enzymes Three main types of restriction enzymes (I, II, and III) are distinguished; their properties and differences are summarized in Table 27.1. Type I restriction enzymes possess restriction as well as methylation activity and have a defined recognition sequence. If both strands of the recognition sequence are methylated, the DNA is not cleaved. If only one strand is methylated, the sequence is recognized and the second strand is methylated. If neither strand is methylated, the sequence is also recognized and the DNA is cleaved around 100 bp away from the recognition sequence. Restriction enzymes that are most

Table 27.1 Classification of restriction enzymes (REases).

Type I: endonuclease and methylase; two-part, asymmetric recognition sequence; cleavage site unspecific, often >1000 bp away from the recognition sequence; requires ATP.
Type II: endonuclease only; recognition sequence of 4–8 bases, mostly palindromic; cleaves within or close to the recognition sequence; does not require ATP.
Type III: endonuclease and methylase; recognition sequence of 5–7 bases, asymmetric; cleaves 5–20 bases in front of the recognition sequence; requires ATP.
frequently used in analytics are usually type II restriction enzymes; most well-known restriction enzymes belong to this type. In contrast to type I and type III restriction enzymes, type II restriction enzymes usually possess only restriction activity and cleave the DNA within the recognition sequence, which results in DNA fragments of defined length with defined ends. Type III restriction enzymes, like type I, have restriction as well as methylation activity. They cleave the DNA at a defined distance from the recognition sequence, so that the resulting fragments have a defined length but variable ends. So-called homing endonucleases, like I-PpoI, have longer recognition sequences (>15 bp) in their native form and initiate the insertion of their own genes, the so-called homing endonuclease genes (HEGs). They are selfish elements that colonize genomes and occur in organisms ranging from bacteria to eukaryotes. These endonucleases are mainly used for gene targeting and are engineered to alter their target-site specificity. Nickases are a small group of nucleases that have double-stranded recognition sequences but cleave only one strand of the DNA.

Nomenclature of Type II Restriction Enzymes (REases) The nomenclature of type II restriction enzymes is based on the organism of origin. For example, the restriction enzyme EcoRI was isolated from a resistance (R) factor of Escherichia coli strain RY13; "I" indicates the first restriction enzyme isolated from this strain. Analogously, BamHI was the first enzyme isolated from Bacillus amyloliquefaciens strain H. The scientific community agreed on this unified nomenclature in 2003; the terms restriction enzyme and restriction endonuclease were declared synonymous and the abbreviation REase was introduced.
Since type II enzymes comprise by far the largest group of restriction enzymes, and since some members deviate from the classical recognition features, the type II group has been divided into subtypes, which are described in Table 27.3. Type II enzymes do not depend on ATP, mostly do not form a complex with the respective methylase, recognize a specific DNA sequence, and cut within or close to the recognition sequence. The resulting DNA fragments have 5´-phosphate and 3´-OH groups.

Recognition Sequences Thousands of type II restriction enzymes with hundreds of recognition sequences have been characterized, and the number is constantly increasing. Comprehensive compilations can be found in regularly updated databases, company catalogues, and books of molecular biological methods. The recognition sequences of these restriction enzymes span 4–8 nucleotides and are most often palindromic. Table 27.2 lists representative examples of restriction enzymes and their recognition sequences; by convention the sequence is given in the 5´ to 3´ direction. The cleavage site is usually located within the recognition sequence, so the resulting restriction fragments have defined ends, which is relevant for cloning, among other applications. There are, however, restriction enzymes like FokI (Table 27.3) whose cleavage site lies a few bases away from the recognition site. As shown in Figure 27.1, the cleavage of DNA with restriction enzymes can result in blunt ends or in cohesive ("sticky") ends. Sticky ends can have either a 5´ or a 3´ overhang, depending on which strand of the DNA forms the overhang. Usually, DNA fragments resulting from restriction enzyme activity carry a 3´-hydroxyl and a 5´-phosphate group.


Table 27.2 Specification of some type II restriction enzymes (type II REases).

Restriction enzyme | Recognition and cleavage site a) | Organism of origin | Isoschizomers
BamHI | G/GATCC | Bacillus amyloliquefaciens H | BstI
BstI | G/GATCC | Bacillus stearothermophilus 1503-4R | BamHI
EcoRI | G/AATTC | Escherichia coli RY13 |
FokI | GGATG(9/13) | Flavobacterium okeanokoites |
HindII | GTPy/PuAC | Haemophilus influenzae Rd | HincII
HindIII | A/AGCTT | Haemophilus influenzae Rd |
HpaII | C/CGG | Haemophilus parainfluenzae | MspI
MspI | C/CGG | Moraxella species | HpaII
NotI | GC/GGCCGC | Nocardia otitidiscaviarum |
SacI | GAGCT/C | Streptomyces achromogenes |
Sau3A | /GATC | Staphylococcus aureus 3A | MboI, NdeII
SmaI | CCC/GGG | Serratia marcescens Sb | XmaI
XmaI | C/CCGGG | Xanthomonas malvacearum | SmaI

a) Py: pyrimidine (C or T); Pu: purine (A or G); N: A, C, G, or T.

The frequency of a recognition sequence depends mainly on its length, but also on its base composition and the composition of the DNA being restricted. Assuming a random composition, a 4 bp recognition sequence statistically occurs approximately every 4^4 bp (256 bp), and a 6 bp or 8 bp recognition sequence every 4^6 bp (4096 bp) or 4^8 bp (65 536 bp), respectively. However, different organisms possess different base compositions of their

Table 27.3 Subtypes of type II restriction enzymes (REases).

Subtype a) | Characteristics | Examples | Recognition and cleavage site
A | asymmetric recognition sequence | FokI; AcrI | GGATG(9/13); CCGC(3/-1)
B | cleavage on both sides of the recognition sequence | BcgI | (10/12)CGANNNNNNTGC(12/10)
C | symmetric or asymmetric recognition sequence; R and M function in one polypeptide | GsuI; HaeIV; BcgI | CTGGAG(16/14); (7/13)GAYNNNNNRTC(14/9); (10/12)CGANNNNNNTGC(12/10)
E | two copies of the recognition sequence; one is cleaved, the other serves as an allosteric effector | EcoRII; NaeI | ↓CCWGG; GCC↓GGC
F | two recognition sequences, both cleaved in coordination | SfiI; SgrAI | GGCCNNNN↓NGGCC; CR↓CCGGYG
G | symmetric or asymmetric recognition sequence; AdoMet-dependent | BsgI; Eco57I | GTGCAG(16/14); CTGAAG(16/14)
H | symmetric or asymmetric recognition sequence; gene structure similar to type I REases | BcgI; AhdI | (10/12)CGANNNNNNTGC(12/10); GACNNN↓NNGTC
M | subtype IIP or IIA; recognize only methylated recognition sequences | DpnI | Gm6A↓TC
P | symmetric recognition and cleavage site | EcoRI; PpuMI; BslI | G↓AATTC; RG↓GWCCY; CCNNNNN↓NNGG
S | asymmetric recognition and cleavage site | FokI; MmeI | GGATG(9/13); TCCRAC(20/18)
T | symmetric or asymmetric recognition sequences; heterodimers | Bpu10I; BslI | CCTNAGC(-5/-2) b); CCNNNNN↓NNGG

a) Not all subtypes are mutually exclusive; for example, BslI belongs to subtypes P and T.
b) The abbreviation means the following cleavage: 5´-CC↓TNAGC-3´ / 3´-GGANT↓CG-5´.


Figure 27.1 DNA ends generated by restriction enzyme cleavage. Depending on the restriction enzyme applied, three kinds of DNA ends occur: cohesive ends (sticky ends) arise, for example, by cleavage with BamHI and SacI, whereby BamHI creates 5´-overhanging and SacI 3´-overhanging ends. Blunt ends are created, for example, by SmaI.
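For a palindromic type II site, the end type shown in Figure 27.1 can be predicted from the cut position alone, since the bottom strand of a palindrome is cut at the mirrored position. A small sketch (the function name is mine, not a standard API):

```python
def end_type(site: str, cut: int) -> str:
    """Classify the ends left by a palindromic type II REase.

    site : palindromic recognition sequence, e.g. 'GGATCC'
    cut  : top-strand cut position, e.g. 1 for G/GATCC

    For a palindrome the bottom strand is cut symmetrically at
    len(site) - cut, so the two cut positions decide the overhang.
    """
    bottom_cut = len(site) - cut
    if cut < bottom_cut:
        return "5'-overhang"   # e.g. BamHI G/GATCC -> GATC overhang
    if cut > bottom_cut:
        return "3'-overhang"   # e.g. SacI GAGCT/C -> AGCT overhang
    return "blunt"             # e.g. SmaI CCC/GGG
```

Applying it to the three enzymes of Figure 27.1 reproduces their end types.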

genomes. The A/T and, accordingly, the G/C content is rarely exactly 50%, and the dinucleotide CpG occurs less frequently in eukaryotes than the other dinucleotides. Consequently, a recognition sequence containing CpG will occur less frequently in eukaryotic genomes than calculated from its length. Restriction enzymes with an 8 bp recognition sequence are applied, for example, to establish restriction maps of whole chromosomes; the resulting very long DNA fragments are separated and detected by pulsed-field gel electrophoresis (Section 27.2.3). The most frequently used restriction enzymes recognize 6 bp sequences, since the lengths of the resulting fragments are convenient for separation and isolation. However, if a partial restriction is to be performed, for instance to establish a genomic library, restriction enzymes with 4 bp recognition sequences are selected.

Isoschizomers Isoschizomers (Table 27.2) are restriction enzymes that have identical recognition sequences but originate from different organisms. The cleavage site may be identical (e.g., BamHI and BstI) or different (e.g., SmaI and XmaI); isoschizomers with different cleavage sites are termed neoschizomers. The enzymes may also differ in their sensitivity towards methylation: HpaII and MspI, for example, have identical recognition sites, but HpaII does not cleave if the second cytosine is modified to 5-methylcytosine (5mC), while MspI cleaves despite this methylation.
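The 4^n rule and its GC-content correction can be made concrete in a few lines. This is a sketch under the simplifying assumption of independent bases, which, as noted above, real genomes violate (e.g., CpG depletion in eukaryotes):

```python
def expected_spacing(site: str, gc: float = 0.5) -> float:
    """Expected distance (bp) between occurrences of a recognition
    sequence in a random genome with the given G+C content.

    Assumes independent, identically distributed bases; A/T and G/C
    are each split evenly within their pair."""
    p = 1.0
    for base in site.upper():
        p *= gc / 2 if base in "GC" else (1 - gc) / 2
    return 1 / p

# With gc = 0.5 every base has probability 0.25, recovering the 4^n rule:
# 4-cutter ~ every 256 bp, 6-cutter ~ 4096 bp, 8-cutter ~ 65 536 bp.
```

In an AT-rich genome (gc < 0.5) an all-GC site such as the NotI sequence becomes rarer than the naive 4^8 estimate, as the text anticipates.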

27.1.4 In Vitro Restriction and Applications

In a restriction reaction, the DNA to be analyzed is incubated with the desired restriction enzyme under defined buffer conditions at a defined temperature for a certain time. The restriction buffer usually contains Tris buffer, MgCl2, and NaCl or KCl, as well as a sulfhydryl reagent (dithiothreitol (DTT), dithioerythritol (DTE), or 2-mercaptoethanol). A divalent cation (mostly Mg2+) is necessary for enzymatic activity, as is the buffer, which provides the correct pH, mostly between pH 7.5 and pH 8. Some restriction enzymes are sensitive towards ions such as Na+ or K+, while others are active over a wide concentration range. Sulfhydryl reagents stabilize the enzyme. The optimal temperature for most restriction enzymes is 37 °C, but depending on the enzyme and its organism of origin it may be higher (e.g., 65 °C for TaqI) or lower (e.g., 25 °C for SmaI).

Complete Restriction For most purposes complete restriction of the DNA is intended. To this end, optimal conditions for the respective restriction enzyme are selected, together with a sufficient amount of enzyme for the DNA to be cleaved.

The amount of restriction enzyme is given in units: one unit of a restriction enzyme is the amount needed to cleave one microgram of substrate DNA under optimal conditions within one hour. As a general rule bacteriophage lambda DNA is used as substrate for this definition.
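Vendors often recommend scaling the enzyme amount by the site density of the substrate relative to the lambda DNA used in the unit definition. The arithmetic can be sketched as follows; the site counts and the two-fold safety margin are illustrative assumptions, not manufacturer values:

```python
# Back-of-the-envelope unit scaling, assuming the required enzyme amount
# scales with the number of sites per microgram relative to lambda DNA
# (the substrate of the unit definition). All numbers are placeholders.

def units_needed(dna_ug, sites_per_ug_sample, sites_per_ug_lambda,
                 hours=1.0, safety_factor=2.0):
    """Units of enzyme for a complete digest in the given time."""
    relative_site_density = sites_per_ug_sample / sites_per_ug_lambda
    return dna_ug * relative_site_density * safety_factor / hours

# 1 ug of a substrate with 3x the site density of lambda, digested for
# 1 h with a 2-fold safety margin -> 6 units:
u = units_needed(1.0, sites_per_ug_sample=15, sites_per_ug_lambda=5)
```

Longer incubation proportionally reduces the units needed in this simple model; in practice, prolonged incubation with excess enzyme risks star activity, so real protocols do not extrapolate freely.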


Incomplete or Partial Restriction For some purposes, like restriction mapping or the preparation of a genomic DNA library, partial restriction is desired, meaning that statistically not all of the restriction sites are cleaved. This is achieved by deliberately suboptimal reaction conditions, such as a lower amount of restriction enzyme, a shorter reaction time, or changed buffer conditions (e.g., a reduced MgCl2 concentration).

Multiple Restriction This involves the restriction of DNA with several restriction enzymes. The DNA may be incubated with the desired restriction enzymes either simultaneously or sequentially; the crucial criterion is the compatibility of the reaction conditions. Multiple restriction is applied, among other purposes, to establish restriction maps.

Restriction Mapping To establish a restriction map, the recognition sequences of one or several restriction enzymes are localized within a DNA molecule. The restriction map is thus a crude physical map of the DNA molecule to be analyzed; the perfect physical map is its complete nucleotide sequence. Consequently, restriction mapping is applied to identify known sequences, for example to verify the successful cloning of a known DNA fragment, or as the first step of projects that aim to determine a complete nucleotide sequence. For this purpose, restriction analysis of DNA fragments integrated in cloning vectors (e.g., plasmids, cosmids, or lambda phages) is performed. Before the introduction of next-generation sequencing, DNA had to be cloned before its nucleotide sequence could be elucidated, usually by Sanger sequencing. Restriction maps of these sequencing clones are established, overlapping clones are identified by comparing their restriction maps, and finally the map of the originally cloned DNA can be assembled. Often it is not necessary to isolate the fragments after the first restriction; instead it is sufficient to compare the fragment pattern of the single digest with that of the double digest to determine the order of the restriction fragments. For this approach it is important to use restriction enzymes that produce at least a few overlapping fragments, so it may be necessary to test several restriction enzymes.
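The comparison of single and double digests can be automated by brute force for small fragment sets: try every ordering of the A and B fragments and keep those whose merged cut positions reproduce the double-digest pattern. The toy example below (a hypothetical 10 kb molecule with one site per enzyme) also illustrates that a linear map is only ever determined up to its mirror image:

```python
from itertools import permutations

def cut_positions(order):
    """Cumulative internal cut positions for an ordered fragment list."""
    pos, cuts = 0, []
    for frag in order[:-1]:
        pos += frag
        cuts.append(pos)
    return cuts

def consistent_maps(frags_a, frags_b, double):
    """Brute-force restriction mapping of a linear molecule: return all
    (order_a, order_b) pairs whose merged cut sites explain the
    double-digest fragment sizes. Feasible only for few fragments."""
    maps = []
    target = sorted(double)
    for oa in set(permutations(frags_a)):
        for ob in set(permutations(frags_b)):
            cuts = sorted(set(cut_positions(oa) + cut_positions(ob)))
            bounds = [0] + cuts + [sum(frags_a)]
            frags = sorted(b - a for a, b in zip(bounds, bounds[1:]))
            if frags == target:
                maps.append((oa, ob))
    return maps

# Enzyme A gives 6 + 4 kb, enzyme B gives 3 + 7 kb,
# the double digest gives 3 + 3 + 4 kb:
maps = consistent_maps([6, 4], [3, 7], [3, 3, 4])
```

The search returns two solutions, the map and its mirror image, which is exactly the ambiguity that additional information (e.g., end labeling, Figure 27.4) resolves.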

Radioactive Systems, Section 28.4.2

Probes for Nucleic Acid Analysis, Section 28.2
Analysis of Epigenetic Modifications, Chapter 31

Combination of Multiple Restriction Enzymes By this method the relative positions of the recognition sequences of different restriction enzymes are determined and, from these, their absolute positions on the analyzed DNA fragment. To do so, the DNA fragment is first restricted with each restriction enzyme in a single reaction and the fragments are analyzed by gel electrophoresis. Ideally, these fragments are isolated, restricted with the second restriction enzyme, and the double restrictions again analyzed by gel electrophoresis. By comparing the lengths of the resulting DNA fragments after single and double restriction, overlapping parts can be identified and the relative order of the fragments determined. This is shown in Figure 27.2 for a 5 kb DNA fragment.

Partial Restriction By this method the order of the recognition sequences of a single restriction enzyme can be identified. The DNA fragment to be analyzed is digested once completely and once incompletely with the same restriction enzyme, and both reactions are analyzed by gel electrophoresis. By comparing the patterns of the resulting restriction fragments, the completely restricted fragments can be allocated to the incompletely restricted fragments and thus their order on the original DNA fragment determined. This method is shown in Figure 27.3 for a 5 kb DNA fragment. In the case of a complex restriction pattern, for example when analyzing a long DNA fragment or using a very frequently cutting enzyme, it is advisable to use the method shown in Figure 27.4: the DNA molecule is labeled at one end before partial digestion, for example by incorporation of a labeled nucleotide. After gel electrophoresis these labeled fragments can be detected selectively (e.g., by autoradiography). The size of a detected fragment then corresponds to the distance of a cleavage site from the labeled end of the DNA molecule.
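For the end-labeling strategy the bookkeeping is trivial: every detected fragment runs from the label to one cut site, so the fragment sizes are the site positions. A minimal sketch with hypothetical sizes:

```python
def sites_from_end_labeling(fragment_sizes, total_length):
    """Infer restriction-site positions (distances from the labeled end)
    from an end-labeled partial digest. Every detected fragment spans
    from the label to one cut site, so its size IS a site position;
    the full-length band is the uncut molecule and carries no site."""
    return sorted(s for s in set(fragment_sizes) if s < total_length)

# Hypothetical data in the style of Figure 27.4: labeled bands of
# 1200, 2500, and 3900 bp from a 5000 bp molecule place three sites
# at those distances from the labeled end.
sites = sites_from_end_labeling([5000, 3900, 2500, 1200], 5000)
```

Unlike the unlabeled partial digest, no combinatorial reasoning is needed; the band sizes read off the map directly.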

Restriction Analysis of Genomic DNA When carrying out restriction analysis of large eukaryotic genomes, the problem arises that too many restriction fragments are generated. After gel electrophoresis no single bands are visible; instead there is a smear of DNA consisting of specific DNA fragments of many sizes. By selecting a suitable hybridization probe, a fragment in the analyzed genome containing DNA complementary to the probe can be detected. This is done by so-called Southern blot analysis (Section 28.4.3), which enables, for instance, the restriction analysis of a gene whose transcript has been cloned as cDNA and can be used as a hybridization probe. There are other objectives for which restriction analysis is helpful, such as the detection of a methylation pattern that is lost by cloning and


Figure 27.2 Restriction mapping by multiple restriction. A 5 kb, linear DNA fragment was cleaved with restriction enzymes A and B in single reactions and in a double reaction. (a) Separation of the restriction fragments in an agarose gel; fragment sizes determined by comparison with the size standard are given. Cleavage with enzyme A results in restriction fragments of 2500 bp (fragment A2500), 1300 bp (A1300), and 1200 bp (A1200); the corresponding nomenclature for the enzyme B fragments and the double-restriction fragments is also shown. Restriction fragments from the single reactions were isolated and cleaved with the respective second restriction enzyme: A fragments with enzyme B and B fragments with enzyme A. (b) Electrophoretic separation of these secondary cleavage products. By comparison of the restriction patterns, overlapping fragments can be identified and, as shown in (c), aligned: the 1900 bp fragment from the double digest is contained in A2500 and B2100; consequently, A2500 and B2100 overlap in this region. In addition, A2500 contains a 600 bp fragment that is also present in B1400, and B1400 contains a 200 bp fragment that overlaps with A1200. After analysis of all fragments the restriction map of the 5 kb DNA fragment can be generated.


Figure 27.3 Restriction mapping by partial digestion. A 5 kb DNA molecule was cleaved both completely and partially with restriction enzyme A. (a) Gel electrophoretic separation of the resulting restriction fragments. By comparing the complete and partial digests, the 5000, 3800, and 3700 bp fragments can be identified as partial-cleavage products, of which the 5000 bp fragment is the original molecule. (b) The 3800 bp fragment can only be composed of the 2500 and 1300 bp fragments, and the 3700 bp fragment of the 2500 and 1200 bp fragments. Accordingly, the restriction map can be established.

DNA sequencing. For other applications, for example, comparing restriction patterns between several individuals, cloning is too laborious and the analysis is done directly on genomic DNA. Alternatively, regions of interest may be amplified by the polymerase chain reaction (PCR) and the amplification products then analyzed by restriction digestion. The products can be analyzed by normal gel electrophoresis, without any specific labeling.

Detection of Methylated Bases

Since there are isoschizomers like HpaII and MspI (see above) that differ in their sensitivity towards a methylation within their recognition sequence, methylated bases can be detected with them. As an example, so-called CpG islands are found in several promoter regions of eukaryotic genes. They are sections of DNA in which the dinucleotide CpG is overrepresented. If a gene is transcriptionally inactive, this is often connected to methylation of cytosines in the CpG island of the gene. If in a CpG island not all restriction sites cleaved by MspI are also cleaved by HpaII, this is an indication of methylation and thus of transcriptional inactivity of the respective gene. The restriction analysis has to be performed directly on genomic DNA and the cleavage is detected by Southern blot analysis (Section 27.4.3). The difficulties of DNA methylation analysis are discussed in detail in Chapter 30.
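The HpaII/MspI comparison can be illustrated with a toy digestion model: both enzymes recognize CCGG, but HpaII is assumed to be blocked when the internal cytosine is CpG-methylated. The sequence and the methylated position below are entirely hypothetical:

```python
# Toy model: MspI cuts every CCGG; HpaII skips methylated sites.
seq = "ATCCGGTTACCGGAATCCGGC"   # hypothetical sequence with three CCGG sites
methylated = {9}                # start position of a methylated CCGG (hypothetical)

def cut_sites(seq, blocked=frozenset()):
    """Start positions of cleavable CCGG sites."""
    return [i for i in range(len(seq) - 3)
            if seq[i:i + 4] == "CCGG" and i not in blocked]

mspi = cut_sites(seq)               # methylation-insensitive
hpaii = cut_sites(seq, methylated)  # methylation-sensitive
print("MspI sites: ", mspi)
print("HpaII sites:", hpaii)
print("methylated: ", sorted(set(mspi) - set(hpaii)))
```

Sites cut by MspI but not by HpaII flag the methylated positions, which is exactly the comparison made on the Southern blot.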

Detection of Mutations and Restriction Fragment Length Polymorphisms (RFLP) Individuals within a population differ in the composition of their genomes. There exist highly conserved regions that are of high relevance for the carrier and are nearly unchanged within the population or even between species (e.g., the globin genes). A mutation in such a region may cause illness or the death of the carrier (e.g., sickle cell anemia, caused by a mutation in a globin gene). On the other hand, there are regions with several variants within a population, so-called polymorphisms.


Figure 27.4 Partial restriction and end labeling. A 5 kb DNA molecule is labeled at one end, partially restricted by enzyme A, and the reaction products are separated by electrophoresis. Only the end-labeled fragments are detected (a). (b) The size of each fragment corresponds to the distance between a restriction site of enzyme A and the labeled end. The label is shown as a dot. The resulting restriction map is shown.

These differences in DNA sequence can be exchanges, deletions, or insertions of single bases or of sections of DNA. Such mutations can change the length of a restriction fragment, or a restriction site can be deleted or created. If a polymorphic region can be detected by a change of a restriction pattern, this is called an RFLP. Restriction analysis is then either performed on genomic DNA in combination with Southern blot analysis (Section 27.4.3), with the region of interest used as hybridization probe, or the region is amplified by PCR in vitro and the restriction analysis is performed on the PCR product. Since every individual has two homologous copies of every DNA section, in the case of heterozygosity two restriction patterns will be detected when analyzing such an RFLP, one representing the paternal and one the maternal allele (Figure 27.5). Figure 27.6 shows the heredity of an RFLP over three generations. Genetic Fingerprint A genetic fingerprint is based on the detection of highly variable RFLPs, which result in a restriction pattern that is highly characteristic for each individual. The basic cause is short, mostly two to three base pair long, highly repetitive sequences whose number of repetitions is highly variable. This is helpful in identifying individuals, for example, as proof of paternity or in forensics (compare Section 27.2.1). Restriction Fragment Length Polymorphisms in Genetic Mapping In genetic mapping it is not the nucleotide sequence that is evaluated but the relative order of so-called genetic markers with respect to each other. This is done by gene linkage analysis. Possible genetic markers are blood groups and disease genes, and also RFLPs. This is discussed in detail in Chapter 36.
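The band patterns of Figure 27.5 can be mimicked with a toy model in which one allele has lost the polymorphic restriction site, fusing two fragments into one; all fragment sizes are hypothetical:

```python
# Hypothetical RFLP: allele A carries the polymorphic site, allele B has lost it.
allele_fragments = {
    "A": [3000, 2000],   # site present: two fragments detected by the probe
    "B": [5000],         # site lost: one fused, longer fragment
}

def blot_pattern(genotype):
    """Bands detected for a diploid genotype, e.g. ('A', 'B') for a heterozygote."""
    bands = set()
    for allele in genotype:
        bands.update(allele_fragments[allele])
    return sorted(bands, reverse=True)

print(blot_pattern(("A", "A")))  # homozygote: two bands
print(blot_pattern(("A", "B")))  # heterozygote: both parental patterns
```

A heterozygote therefore shows the superposition of both allelic patterns, just as described for the maternal and paternal alleles above.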

RFLPs as Genetic Markers, Section 36.1.2


Figure 27.5 Detection of an RFLP by Southern blot analysis. (a) Homologous chromosomal sections of an individual containing a polymorphic region. Restriction sites are indicated by arrows. The area detected by the hybridization probe in the Southern blot analysis (Section 27.4) is marked. (b) The respective result after restriction, gel electrophoresis, and Southern blot analysis. One restriction site is missing in the maternal allele, which results in the detection of a longer restriction fragment; the shorter fragment corresponds to the paternal allele. Thus the respective restriction fragment is polymorphic in this individual.

Figure 27.6 Heredity of a restriction fragment length polymorphism over three generations. In the family analyzed, four alleles occur for the polymorphic region: alleles A, B, C, and D. The heredity is in accordance with Mendel's laws. Most individuals are heterozygous for the restriction fragment analyzed; others carry the same allele on both homologous chromosomes.

27.2 Electrophoresis

Marion Jurk, Miltenyi Biotec GmbH, Friedrich-Ebert-Straße 68, 51429 Bergisch Gladbach, Germany

Electrophoretic Techniques, Chapter 11

Electrophoresis is one of the most important methods for analyzing nucleic acids. Its advantages are obvious: electrophoresis can be performed in a very short time with low amounts of material, and the necessary equipment and detection methods are in most cases cheap and readily available in every laboratory. The underlying theoretical principles and the hands-on work have similarities to, but also significant differences from, the electrophoretic separation of proteins. As with proteins, the separation of nucleic acids in an electric field is performed in a solid carrier material such as agarose or polyacrylamide. In contrast to proteins, nucleic acids are negatively charged over a very broad pH range. The negative charges are carried by the phosphate groups on the backbone of


Figure 27.7 Theories explaining the movement of nucleic acids in the gel matrix. The Ogston theory (a) postulates a globular sphere for the nucleic acids; its radius is defined by the length of the molecule and thermal agitation. The molecules migrate through the pores of the gel matrix if the diameter of the nucleic acid is smaller than the average pore size. According to the reptation theory (b), the nucleic acids align themselves along the electric field and move snake-like through the gel matrix. Source: adapted according to Martin, R. (1996) Gel Electrophoresis: Nucleic Acids, BIOS Scientific Publishers Limited, Oxford.

the nucleic acids. The migration of nucleic acids in the electric field towards the anode is therefore pH independent. Another notable difference from proteins is their constant charge density: the ratio of molecular weight to negative charge remains unchanged. There is no need to generate homogeneous charge densities with SDS, as is the case for proteins. The electrophoretic mobility, that is, the velocity of migration in the electric field (Chapter 11), is equal for all nucleic acids in free solution, independent of their molecular weight. Differences in mobility can only be measured in a solid gel matrix, and the differences in migration velocity are caused solely by the different sizes of the molecules. The movement of nucleic acids in an electric field can be described by two theories (Figure 27.7); the migration of nucleic acids in reality can be seen as a "mixture" of the two. The Ogston sieving effect is based on the assumption that nucleic acids in solution have a globular, spherical structure. The size of a nucleic acid is described by the radius of the sphere that it theoretically occupies. The bigger the sphere, the more often collisions with the gel matrix occur and the more the migration of the nucleic acid is slowed down. Very small fragments are not slowed down by the pores of the gel matrix and therefore cannot be separated from one another. According to the Ogston sieving theory, very big molecules with sphere sizes bigger than the pores of the gel should not be able to migrate at all. A second theory, the reptation theory, aims to explain the migration of big nucleic acids in the electric field. It assumes that big nucleic acids can abandon their globular structure and align themselves with the electric field. The molecules migrate by threading one end ahead through the matrix pores (end-to-end migration).
The theory is called reptation owing to the snake-like movement of the nucleic acids. Size selection occurs because bigger molecules need more time to move than smaller ones. Together, the two theories can explain most of the phenomena observed in the electrophoresis of nucleic acids with sizes of up to about 10 kb. The behavior of very large molecules cannot be explained by these theories and requires additional models (Section 27.2.3).
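The Ogston picture is often summarized in a Ferguson-type relation: the logarithm of the mobility falls roughly linearly with the gel concentration, with a retardation coefficient that grows with molecule size. A toy sketch with entirely hypothetical parameter values:

```python
import math

def mobility(mu0, k_r, gel_conc):
    """Ferguson-type model: mobility = mu0 * exp(-K_r * T), T = gel concentration."""
    return mu0 * math.exp(-k_r * gel_conc)

# Larger molecules get larger retardation coefficients (values hypothetical).
for size_kb, k_r in [(0.5, 0.3), (2.0, 1.0), (10.0, 2.5)]:
    print(f"{size_kb:5.1f} kb: relative mobility {mobility(1.0, k_r, 1.0):.3f}")
```

The model reproduces the qualitative sieving behavior (larger molecules are slowed more) but, like the Ogston theory itself, breaks down for very large molecules.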

27.2.1 Gel Electrophoresis of DNA

Agarose Gels The choice of carrier material depends mainly on the size and kind of nucleic acid to be analyzed. Agarose, a linear polysaccharide polymer, is the most important electrophoresis material for nucleic acids. The migration velocity of DNA molecules is determined by several factors. The effective size of a nucleic acid is determined not only by its absolute mass but also by its form: superhelical (form I), open-circular (form II), double-stranded linear (form III), or single-stranded. Separation of Linear, Double-Stranded DNA Fragments Gel electrophoresis of linear DNA fragments (form III DNA) can be used to determine the size of the DNA reproducibly


Figure 27.8 Relationship between the migration distance and fragment length at various agarose concentrations. The semi-logarithmic curves were created using length standards. The size of a fragment can be determined from its position. Buffer: 0.5× TBE/0.5 μg ml⁻¹ ethidium bromide; electrophoresis at 1 V cm⁻¹ for 16 h. Source: adapted according to Maniatis, T., Fritsch, E.F., and Sambrook, J. (1989) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor.

with good accuracy. There is a linear relation between the logarithm (log10) of the size (in bp) of a fragment and its migration distance (measured in cm relative to the total separation distance) in an agarose gel (Figure 27.8). The migration velocity of linear DNA fragments depends on the agarose concentration, the applied voltage, the composition of the running buffer, and the presence of intercalating dyes. Agarose gels can separate linear DNA fragments over a broad range of fragment lengths (Table 27.4). Very small fragments (<100 bp) migrate at the same speed in 1–1.5% agarose gels because the pores of these gels are bigger than the fragments. Separation of such small fragments becomes possible by increasing the agarose concentration; small DNA fragments and oligonucleotides are usually separated using 2–3% agarose gels. The migration speed of the fragments is proportional to the applied voltage. However, the resolution of large fragments deteriorates if the voltage is too high; larger fragments should therefore be separated at lower voltages. Good separation of large fragments (>2 kb) is obtained when the applied voltage is less than 5 V cm⁻¹, with the distance between the electrodes being the influential parameter, not the length of the gel. For the separation of DNA molecules a running buffer with Tris acetate (TAE) or Tris borate (TBE) is used. Fragments separated in TAE buffer can be isolated more readily from the agarose gel, and the bands are usually sharper. A disadvantage of TAE buffer is its lower buffering capacity and lower stability during electrophoresis. If long electrophoresis times or high field strengths are necessary, TBE buffer is used. Linear fragments migrate faster (approx. 10% faster) in TBE buffer than in TAE buffer. The separation capacity is similar in both buffer systems; however, superhelical DNA is better separated in TBE buffer. The ion concentration of the running buffer is important as well.
Too low a concentration causes minimal electric conductivity, and consequently the migration of the nucleic acids is slow. Too high a concentration results in very high electric conductivity, which causes heating of the buffer; the DNA may then be denatured and the agarose melted. The presence of intercalating dyes influences the speed of the nucleic acids as well. The principle of intercalation is described in Section 27.3.1. Addition of ethidium bromide decreases the migration velocity of linear double-stranded DNA fragments by about 15%. Separation of Circular DNA The migration velocity of circular DNA of form I (superhelical) or form II (open-circular) depends mainly on the consistency of the agarose gel. Superhelical DNA migrates faster than linear DNA; relaxed DNA molecules (form II) are slower than linear or superhelical DNA (Figure 27.9). The migration velocity of these three forms is influenced by the running conditions: agarose concentration, applied voltage, and choice of running buffer. The different forms can be identified by ethidium bromide staining.

Table 27.4 Coarse separation ranges of DNA fragments at different agarose concentrations. Source: according to Sambrook, J. and Russell, D.W. (eds) (2001) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Press, Cold Spring Harbor.

Agarose concentration (% w/v)    Optimal separation range of linear double-stranded DNA fragments (kb)
0.3                              5–60
0.6                              1–20
0.7                              0.8–10
0.9                              0.5–7
1.2                              0.4–6
1.5                              0.2–3
2.0                              0.1–2

Practical Considerations Agarose gels can be poured as vertical or horizontal gels. In most laboratories the more practical horizontal gels are used. According to the size of the gel, one speaks of mini, midi, and maxi gels. Mini gels have a very short separation distance (6–8 cm) and are not suited for the size determination of DNA fragments; they are used for quick analysis of the quality of DNA and to check restriction digestions. Midi and maxi gels (approx. 20 or 30–40 cm) are used for accurate DNA size determination and for the isolation of fragments; their separation distance and loading capacity are much higher. The DNA is loaded onto the gel using a so-called loading buffer, which increases the density of the DNA solution (using Ficoll, glycerol, or sucrose) so that the DNA sinks into the gel pockets and does not diffuse into the running buffer. The loading buffer usually also contains negatively charged dyes that indicate the progress of migration during electrophoresis. The most commonly used dyes are bromophenol blue and xylene cyanol; bromophenol blue migrates in an agarose gel, depending on the exact conditions, like a linear DNA fragment of approx. 300 bp. An important means for the determination of DNA fragment sizes are DNA length standards. DNA standards are commercially available and contain DNA fragments of defined sizes; they are separated together with the DNA fragments of interest. Using the known sizes of the DNA standard, the unknown size of a DNA fragment can be determined. For accurate size determination it is important that the DNA standard and the unknown fragment are loaded in similar amounts and under similar buffer conditions. DNA standards can be produced by restriction cleavage of plasmid or phage DNA. A common DNA standard is the 1 kb DNA ladder (Figure 27.10). There are also standards available for small DNA fragments, for example, the 100 bp ladder, consisting of DNA fragments that differ by exactly 100 bp. The choice of standard depends on the size of the expected DNA fragments.
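The semi-logarithmic calibration with a ladder can be sketched in a few lines: log10(size) is simply interpolated between the two ladder bands flanking the unknown band. The migration distances below are hypothetical:

```python
import math

# Hypothetical migration distances (cm) for selected bands of a 1 kb ladder.
ladder = {10000: 1.0, 5000: 1.9, 3000: 2.6, 2000: 3.1, 1000: 4.0, 500: 4.9}

def estimate_size(distance_cm, ladder):
    """Interpolate log10(size) between the two ladder bands flanking the distance."""
    pts = sorted((d, math.log10(bp)) for bp, d in ladder.items())
    for (d1, l1), (d2, l2) in zip(pts, pts[1:]):
        if d1 <= distance_cm <= d2:
            frac = (distance_cm - d1) / (d2 - d1)
            return round(10 ** (l1 + frac * (l2 - l1)))
    raise ValueError("distance outside the calibrated range")

print(estimate_size(3.1, ladder))   # band co-migrating with the 2 kb marker
```

In practice a linear regression over all ladder bands in the log-linear region of Figure 27.8 is more robust than a two-point interpolation, but the principle is the same.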


Figure 27.9 Migration behavior of superhelical, open, linear, and denatured DNA in an agarose gel. The migration properties of superhelical DNA can be influenced by the ethidium bromide concentration.

Denaturing Agarose Gels

Single-stranded DNA readily forms intramolecular secondary structures and intermolecular aggregates. These structures influence the migration behavior in a gel. For the size determination of single-stranded DNA, denaturing agarose gels are therefore employed, in which the electrophoretic mobility of the DNA depends solely on the molecular size.

Figure 27.10 Various commonly used DNA length standards. The λ-DNA marker can be generated by digestion with certain restriction enzymes (Section 27.1). Source: Photography by Dr. Marion Jurk.


Alkaline agarose gels are used to determine the efficiency of first- and second-strand cDNA synthesis and to test the nicking activity of enzyme preparations. Sodium hydroxide is used as the denaturing agent. The agarose has to be dissolved in water, because addition of hot sodium hydroxide would hydrolyze its polysaccharide structures. Since ethidium bromide does not bind to DNA at high pH, the electrophoresis is performed in the absence of ethidium bromide. Low-Melting Agarose and Sieving Agarose Derivatization of agarose by the introduction of hydroxyethyl groups into the polysaccharide chains results in an agarose derivative with different properties. Like standard agarose, this low-melting agarose is dissolved by heating and gels on cooling, but its melting point is reduced. This property is exploited when isolating DNA fragments from agarose gels (Section 27.5.3). The migration velocity of DNA in low-melting agarose gels is higher, while the separation and loading capacities are lower. The properties of sieving agarose are similar to those of low-melting agarose; sieving agarose is used especially for the separation of small DNA fragments. For reasons of stability, neither type of agarose should be used at concentrations below 2%; recommended concentrations are 2–4%.

Gel Media for Electrophoresis, Section 11.3.2

Polyacrylamide Gels The properties of polyacrylamide and the definitions of concentration and degree of crosslinking have been introduced in Section 11.3.2. Electrophoresis of DNA in polyacrylamide gels (abbreviated as PAGE, polyacrylamide gel electrophoresis) can be performed under native or denaturing conditions, depending on the application. Polyacrylamide gels (like agarose gels, these are colloquially called slab gels) are poured as vertical gels between two glass plates. Advantages and disadvantages of polyacrylamide and agarose gels are listed in Table 27.5.

Table 27.5 Advantages and disadvantages of agarose and polyacrylamide gels.

              Agarose gels                                     Polyacrylamide gels
Advantages    DNA can easily be stained                        Higher loading capacity without loss of resolution
              DNA can easily be isolated                       Purification yields DNA of high quality
              Capillary and vacuum blotting possible
Disadvantages Bands are more diffuse and broader               Difficult to pour, greater technical effort
              Resolution of smaller fragments is lower         No capillary or vacuum blotting possible
              Isolated DNA fragments can contain impurities    Lower separation range

Non-denaturing Gels for Analysis of Protein–DNA Complexes Native, non-denaturing gels are used for electrophoretic mobility shift assays (EMSAs). With this method, protein–DNA complexes can be separated from free DNA. Large DNA–protein complexes are retarded in the gel by the cage effect. This method is described in detail in Chapter 32. Non-denaturing PAGE of Double-Stranded DNA Native polyacrylamide gels yield a higher resolution than agarose gels (Table 27.6) together with a higher loading capacity. This is used for the purification and isolation of double-stranded DNA fragments (<1000 bp).


Table 27.6 Separation range of native polyacrylamide gels. The ratio of acrylamide to N,N′-methylenebisacrylamide is 29 : 1.

Acrylamide concentration (%)    Separation range (bp)    Migration of bromophenol blue in native gels (bp)
3.5                             100–2000                 100
5.0                             100–500                  65
8.0                             60–400                   45
12.0                            50–200                   20
15.0                            25–150                   15
20.0                            5–100                    12

Some double-stranded DNA fragments migrate in native polyacrylamide gels at a rate that does not correspond to their actual size (number of base pairs) under certain electrophoresis conditions. This effect is thought to be due to conformational changes in the DNA, such as kinks or bends. The abnormal migration behavior is more pronounced at higher polyacrylamide concentrations, higher Mg2+ concentrations, or lower temperatures; an increase in temperature or in the concentration of Na+ ions has the opposite effect. Native PAGE of Single-Stranded DNA (SSCP) These gels are usually used to analyze changes within genomic DNA for certain disease indications. For the determination of various genetic mutations, methods are needed that allow many patient samples to be run at the same time; sequencing each individual genomic DNA would be far too expensive and time consuming. The commonly used SSCP (single-strand conformation polymorphism) method is based on the observation that single-stranded DNA molecules with different sequences assume different conformations. The double-stranded DNA fragments to be analyzed are denatured using formamide and heat and are loaded onto a native polyacrylamide gel. The separated single DNA strands assume individual conformations, resulting in different migration behavior (Figure 27.11). Gene sections of individuals

Figure 27.11 Schematic principle of SSCP analysis. The DNA fragments are denatured by heat in the presence of formamide. The resulting single-stranded molecules assume a conformation according to their sequence and base composition and are separated by native polyacrylamide gel electrophoresis; their migration properties differ according to their conformations. From the characteristic gel pattern, homozygous and heterozygous individuals carrying certain point mutations can be identified. Source: adapted according to Martin, R. (1996) Gel Electrophoresis: Nucleic Acids, BIOS Scientific Publishers Limited, Oxford.

Table 27.7 Separation of oligonucleotides in denaturing polyacrylamide gels. The ratio of acrylamide to N,N′-methylenebisacrylamide is 19 : 1.

Acrylamide concentration (%)    Separation range (nt; nt = nucleotide)
20–30                           2–8
15–20                           8–25
13.5–15                         25–35
10–13.5                         35–45
8–10                            45–70
6–8                             70–300

containing point mutations will have a different conformation and consequently a different migration speed. It is essential to perform native PAGE, as denaturing gels would separate the fragments uniformly according to size and not according to sequence. Using SSCP, many different DNA samples can be analyzed for point mutations simultaneously. The first SSCP analyses were performed using restriction-generated fragments of DNA followed by Southern blot analysis (Section 27.4.3). In more recent approaches, the gene sections of interest are amplified by PCR and radioactively labeled, so that no subsequent detection step is necessary. The analysis is performed under the assumption that mutated fragments will behave differently from the original fragment; a negative result is therefore not irrefutable proof that no point mutations are present in the analyzed gene section. Denaturing PAGE of single-stranded DNA or RNA is used in various fields of application owing to the very exact separation of the single-stranded molecules (Table 27.7). The most common denaturing agent is urea, but formamide is also used. Alkaline reagents cannot be used with polyacrylamide gels because the gel matrix would be destroyed. The gels are usually polymerized in the presence of 7 M urea, and the running buffer is TBE. The loading buffer usually contains formamide to denature the samples (Section 27.2.2). Single-stranded DNA or RNA migrates in this type of gel independently of its sequence; therefore, DNA molecules differing in size by only one nucleotide can be separated. These gels are accordingly used for sequencing, S1 nuclease analysis, and RNase protection experiments. They are also used for the DNA fingerprinting method. DNA Fingerprinting DNA fingerprinting is applied for the lineage analysis of genomic DNA. The technique is used in forensic analysis and zoological studies, and also for paternity testing.
DNA minisatellites, variable tandem repeats of 10–100 bp in genomic DNA, are inherited to a similar extent from both parents; their distribution and cleavage behavior are unique for each individual. The genomic DNA is cut with restriction enzymes and separated on denaturing polyacrylamide gels. The gels are blotted onto a membrane (Southern blotting, Section 27.4.3) and hybridized with specific probes that recognize the minisatellite DNA. Fingerprinting can also be performed using PCR with random primers: short DNA fragments are synthesized by PCR in the presence of radioactively labeled nucleotides, and two individuals will differ in the spectrum of synthesized DNA fragments when these are run on a denaturing polyacrylamide gel with high resolution. These electrophoretic methods have been partially replaced and will eventually be completely replaced by next-generation sequencing methods. Oligonucleotide Purification Denaturing polyacrylamide gels are often used for the purification of synthetic oligonucleotides or single-stranded DNA. Oligonucleotides with n bases can be separated from oligonucleotides with n − 1 bases, yielding a population of nucleic acids of uniform length. The oligonucleotides are detected by fluorescence quenching: the gel is laid on a thin-layer chromatography (TLC) plate and irradiated with UV light (long wavelength). The TLC plate fluoresces upon excitation, except where the oligonucleotides are located; the oligonucleotides can thus be identified as dark bands and excised from the gel. If the oligonucleotides are to be isolated from the gel, the irradiation time should be kept to a minimum to avoid damage to the nucleic acids or crosslinking to the gel matrix.


When this method is chosen for oligonucleotide purification the oligonucleotides should not contain modifications that interact with the polyacrylamide matrix.

27.2.2 Gel Electrophoresis of RNA

Like single-stranded DNA, single-stranded RNA forms secondary structures by intra- and intermolecular base pairing. These different conformations behave differently during gel electrophoresis; an exact and reproducible analysis of RNA is only possible using denaturing gel electrophoresis. In denaturing gels the hydrogen bonds are destroyed and all RNA molecules are separated according to their molecular weight. The electrophoresis of complex RNA mixtures (e.g., for Northern blotting, Section 27.4.4) is performed using denaturing 1–1.5% agarose gels; smaller RNA fragments are separated like oligonucleotides using denaturing PAGE. For a rapid analysis of RNA (e.g., for quality control), non-denaturing TBE gels can be used. The denaturing reagents employed are usually dimethyl sulfoxide/glyoxal or formaldehyde for agarose gels, and urea for polyacrylamide gels. Formaldehyde Gels The denaturing effect of formaldehyde is based on the formation of so-called Schiff bases between the aldehyde group and the amino groups of the adenine, cytosine, and guanine bases. Consequently, the amino groups of the nucleobases cannot form the hydrogen bonds needed for secondary structure formation or aggregation. The agarose gel usually contains 1.1% formaldehyde; for longer separations (overnight), the formaldehyde content must be increased. As formaldehyde is toxic, electrophoresis should be performed under a fume hood. Since formaldehyde also reacts with the amino groups of Tris (tris(hydroxymethyl)aminomethane), a different running buffer has to be used, usually a mixture containing 3-(N-morpholino)propanesulfonic acid (MOPS) and sodium acetate. Before loading onto the gel, the RNA has to be denatured in the presence of formaldehyde using formamide and MOPS; the formamide destroys the base pairing of the RNA, allowing the formation of Schiff bases between the formaldehyde and the amino groups.
Formamide can be contaminated with ions such as ammonium formate, which can hydrolyze the RNA; formamide is therefore deionized using ion exchange chromatography. Since MOPS possesses a very high buffering capacity, there is no need to replace or recycle the running buffer during the run, as is the case with glyoxal gels. If the gel is to be blotted afterwards, the formaldehyde needs to be removed before blotting; otherwise the amino groups of the bases will not be available for hybridization with the probes.

Glyoxal Gels Glyoxal gels yield sharper RNA bands than formaldehyde gels, which is an advantage for blotting. Glyoxal binds to guanine residues at neutral pH and prevents base pairing of the RNA. In contrast to formaldehyde gels, glyoxal is only added before loading and is not present in the running or loading buffer. The RNA is denatured in the presence of 1 M glyoxal, sodium phosphate, and dimethyl sulfoxide (DMSO) at 50 °C; sodium phosphate acts as buffer and DMSO destroys the inter- and intramolecular hydrogen bonds, enabling the glyoxal to react with the guanine residues. Glyoxal is easily oxidized to glyoxylic acid, which hydrolyzes RNA; contaminating glyoxylic acid has to be removed by ion exchange before glyoxal is used in gel electrophoresis. Glyoxal reacts with ethidium bromide; consequently, the separation is performed in the absence of the intercalating reagent. Above pH 8.0 the glyoxal dissociates from the RNA; to avoid pH gradients during electrophoresis, the running buffer has to be replaced or recycled using a pump. RNA Standards Cytoplasmic RNA of eukaryotic cells consists of approx. 95% ribosomal RNA (rRNA), comprising 28S, 18S, and 5S rRNA. RNA preparations of high quality display two sharp, clearly separated bands in an agarose gel (Figure 27.12) that can be used as internal standards. The exact size of the ribosomal RNA depends on its origin: for human rRNA the lengths were determined to be 5.1 kb for 28S and 1.9 kb for 18S rRNA. Other length standards are commercially available or can be generated by in vitro transcription of DNA fragments of defined length.

Since RNA is subject to nuclease digestion and to hydrolysis by acids or bases, the experimental set-up for RNA electrophoresis has to be modified compared with DNA electrophoresis. The same precautionary measures as for the isolation of RNA have to be taken: for example, the electrophoresis chamber needs to be cleaned carefully and only RNase-free water should be used.

Schiff bases are generated by the reaction of primary amino groups with aldehydes, with release of water; an imine bond is formed.


Figure 27.12 Migration of cytoplasmic RNA. High-quality RNA preparations should generate clearly visible 28S and 18S ribosomal RNA bands and should be barely degraded. These bands can be used as internal length standards. Source: Photography by Dr Marion Jurk.

27.2.3 Pulsed-Field Gel Electrophoresis (PFGE)

Principle High molecular weight nucleic acids cannot be separated by regular gel electrophoresis: they all possess the same so-called limiting mobility. This effect cannot be explained by the reptation theory, and several other theories have been put forward to explain the observed phenomenon. One model postulates that the DNA molecules act as rigid entities, whereby no separation effect can be obtained. Another model describes the movement of high molecular weight DNA as similar to movement in free solution, where likewise no separation occurs. The formation of loop structures could also explain the limiting mobility effect. Pulsed-field gel electrophoresis (PFGE) uses, instead of a continuous electric field, a pulsed field in which the direction of the electric field changes. DNA molecules assume a relaxed globular shape in free solution (without an electric field). When an electric field is applied, the molecules align themselves with the field and move towards the anode (according to the reptation theory); on removal of the electric field, the molecules resume the relaxed globular state. When an electric field with a different direction is applied, the molecules realign according to the new field, and each further change of field direction forces them to realign again. Relaxation and alignment of the molecules in the electric field depend on the size of the molecule: larger molecules need more time for relaxation and alignment than smaller molecules, so the time remaining for movement along the electric field is shorter for larger molecules than for smaller ones. The sum of all applied field vectors yields the direction of movement of the DNA molecules. The separation of large DNA molecules is thus based on the time a molecule needs to align itself with the applied electric fields.
PFGE encompasses several techniques, most of which differ in the direction and sequence of the electric pulses. Field inversion gel electrophoresis (FIGE) uses two electric fields with opposing directions. Net migration is achieved because the duration or amplitude of the forward pulse is greater (Figure 27.13). The method can separate molecules within a broad size range with high resolution. The CHEF (contour-clamped homogeneous electric field) method is a more commonly used PFGE variant. The electrodes are arranged in a hexagon around the agarose gel (Figure 27.14).

Figure 27.13 Principle of field inversion gel electrophoresis (FIGE). Two alternating electric fields with directions differing by 180° are applied. The migration direction towards the anode is determined by a longer or stronger pulse in this direction.

27 Analysis of Nucleic Acids


Figure 27.14 Principle of the contour-clamped homogeneous electric field (CHEF) method. The pulses are applied in different directions. The migration of the DNA is determined by the sum of all applied field vectors and follows, as displayed, a zigzag pattern.

The electric field is applied in such a way that the field vectors are aligned at angles of −60° and +60° relative to the vertical axis of the gel. The resulting movement of the nucleic acids resembles a zigzag pattern. The angle, field strength, and duration of the pulses can be varied. With this method molecules of up to 2000 kb can be separated. An improved version of the CHEF method is PACE (programmable autonomously controlled electrodes), in which twenty-four hexagonally arranged electrodes can perform any desired pulse sequence. Improved pulse sequences result in optimal resolution and separation properties of the gel.

Applications For the separation of high molecular weight DNA the integrity of the nucleic acids is of utmost importance. To avoid any destruction of the DNA, the material is embedded in agarose blocks before the cells are lysed with detergents and proteinase K. The high molecular weight DNA is isolated within the agarose block by incubating the agarose in the respective buffers. The agarose blocks are then placed into the sample pockets of the PFGE gel (usually a 1% agarose gel). In PFGE, higher voltages are applied, resulting in a temperature increase of the running buffer due to the higher currents. The running buffer is diluted TBE (0.5× TBE) with added glycine; glycine increases the mobility of the DNA without influencing the current. To avoid pH and temperature differences, the buffer is recirculated during electrophoresis. The electrophoresis is usually performed in the absence of ethidium bromide, but when separating molecules smaller than 100 kb the addition of ethidium bromide can improve the resolution because the dye influences the reorientation of the DNA molecules. The length and type of the pulse sequences vary widely between the different types of PFGE and need to be optimized for the individual conditions. Pulses are between 5 and 1000 s; the field strength is usually between 2 and 10 V cm⁻¹.
Running times can vary from 10 h up to several days.
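The CHEF geometry described above can be checked with a small vector sum. The sketch below (with an arbitrary unit field magnitude as assumption) shows that the lateral components of the −60°/+60° pulses cancel, so the zigzagging DNA nevertheless moves straight down the gel.

```python
import math

# Sketch of the CHEF geometry described above: field pulses at -60 and +60
# degrees relative to the vertical gel axis. Unit field magnitude is an
# arbitrary assumption; only the direction of the sum matters here.

def resultant(angles_deg, magnitude=1.0):
    """Vector sum of equal-magnitude field pulses (x: lateral, y: down the gel)."""
    x = sum(magnitude * math.sin(math.radians(a)) for a in angles_deg)
    y = sum(magnitude * math.cos(math.radians(a)) for a in angles_deg)
    return x, y

x, y = resultant([-60.0, +60.0])
print(f"net vector: ({x:.3f}, {y:.3f})")  # lateral components cancel
```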

Length standards can be high molecular weight nucleic acids such as the genomic DNAs of bacteriophages T7 (40 kb), T2 (166 kb), or G (756 kb). Ligation of bacteriophage lambda DNA yields an optimal length standard (Figure 27.15) with multiples of the lambda DNA length (48.5 kb).
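The band positions of such a lambda ladder follow directly from the monomer length; the trivial helper below (function name is an assumption) lists the first rungs.

```python
# The lambda ladder mentioned above: ligation concatemers of the lambda
# genome produce bands at integer multiples of the 48.5 kb monomer.
# The helper name is an assumption for illustration.

LAMBDA_KB = 48.5

def lambda_ladder(n_bands):
    """Band sizes (kb) of the first n_bands ladder rungs."""
    return [round(LAMBDA_KB * i, 1) for i in range(1, n_bands + 1)]

print(lambda_ladder(5))  # [48.5, 97.0, 145.5, 194.0, 242.5]
```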

Figure 27.15 Common length standards for PFGE gels. The λ-DNA ladder can be generated by ligation of λ-phage DNA. Source: with kind permission of Bio-Rad, Munich.


Part IV: Nucleic Acid Analytics

PFGE is commonly used for the analysis of pathogens from food or clinical isolates. Different strains of certain bacteria (e.g., Listeria monocytogenes) can be analyzed with respect to their origin. Although a resolution of 5 Mb is possible with PFGE, it cannot resolve human chromosomes (>50 Mb). However, using restriction enzymes, mapping and analysis of the human genome is possible. Rare cutters (e.g., NotI, NruI, MluI, SfiI, XhoI, SalI, SmaI; see Section 27.1) are used. The PFGE gels are then blotted and hybridized. Physical maps of the human genome obtained by PFGE are used for genomic fingerprinting and to analyze chromosomal deletions or translocations. The whole-genome mapping method is also used for subtyping pathogenic strains.
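Why enzymes such as NotI cut so rarely can be estimated from recognition-site statistics. The sketch below assumes independent, identically distributed bases (a crude approximation to a real genome); the function name and the 40% GC figure are illustrative assumptions.

```python
# Rough estimate of why the enzymes above are "rare cutters": expected
# spacing of a recognition site in a genome with a given GC content,
# assuming independent, identically distributed bases (a crude model).

def expected_fragment_kb(site, gc_fraction):
    """Average distance between recognition sites, in kb."""
    p = 1.0
    for base in site:
        p *= gc_fraction / 2.0 if base in "GC" else (1.0 - gc_fraction) / 2.0
    return 1.0 / p / 1000.0

# NotI (recognition site GCGGCCGC, 8 bp, all G/C) in a 40% GC genome:
print(f"{expected_fragment_kb('GCGGCCGC', 0.40):.0f} kb between sites")
```

In a 50% GC genome an 8-bp site occurs on average every 4⁸ = 65 536 bp; in an AT-rich genome a GC-rich site such as NotI's becomes markedly rarer still, which is exactly what makes these enzymes useful for PFGE mapping.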

27.2.4 Two-Dimensional Gel Electrophoresis

Two-dimensional gel electrophoresis (2D gel electrophoresis) is necessary when the information obtained by one-dimensional electrophoresis is not sufficient or clear-cut. The high resolution of 2D gels is achieved by repeating the electrophoresis under completely different conditions, so that complex nucleic acid mixtures can be separated that cannot be resolved by a single electrophoresis. The nucleic acid mixture is first separated by a standard electrophoresis in which the nucleic acids are separated by their molecular weight (first dimension). The gel lane containing the separated nucleic acids is cut out of the gel and applied to a second gel, where the nucleic acids are separated under different conditions (second dimension, Figure 27.16). Usually, the electric field of the second dimension is perpendicular to that of the first dimension. Two-dimensional electrophoresis can be applied to the analysis of both RNA and DNA.

Two-Dimensional Electrophoresis of RNA The electrophoresis conditions for RNA differ in the urea concentration, the polyacrylamide concentration, and the pH of the two dimensions (Table 27.8). By performing the electrophoresis in the presence (first dimension) or absence (second dimension) of urea (urea shift), the nucleic acid is first separated according to its size (first dimension) and then according to its conformation (second dimension). A change in the polyacrylamide concentration between the two dimensions can separate the RNA molecules by their interaction with the gel matrix: molecules with different conformations can display similar migration behavior at a given polyacrylamide concentration but will differ when the pore size of the gel is changed. In addition, the urea concentration, pH, and pore size can be changed simultaneously. The net charge of the RNA molecules is influenced by lower pH, so that not all RNA molecules are negatively charged.
As certain bases are protonated more easily at lower pH, the net charge of the whole RNA molecule depends on its sequence. The second dimension is then performed under conditions that separate the nucleic acids according to their molecular weight.

Two-Dimensional Electrophoresis of DNA The 2D electrophoresis of DNA can be used for genome mapping: the DNA is first cut with one restriction enzyme and separated. The separated fragments are then cut with a second enzyme and a second electrophoresis is performed. Fragments that are not cut by the second enzyme are found on the diagonal of the gel; only

Figure 27.16 Principle and practical application of two-dimensional gel electrophoresis. The gel lane is excised after the first-dimension electrophoresis and loaded onto a second-dimension gel. The direction of the second electrophoresis is perpendicular to that of the first dimension.


Table 27.8 Experimental conditions for the 2D gel electrophoresis of RNA molecules.

                           First dimension                Second dimension
                           %PAA a)  pH b)    Urea (M)     %PAA a)  pH b)    Urea (M)
Urea shift                 X        Neutral  5–8          X        Neutral  0
                           X        Neutral  0            X        Neutral  5–8
Concentration shift        X        Neutral  0            2X       Neutral  0
                           X        Neutral  4            2X       Neutral  4
pH/urea/concentration      X        Acidic   6            2X       Neutral  0

a) X represents a certain polyacrylamide concentration; a typical polyacrylamide concentration is 10–15%.
b) The pH range of neutral electrophoresis is 4.5–8.5; acidic electrophoresis occurs at a pH below 4.5. Neutral gels are usually run at pH 8.3, acidic gels at pH 3.3.

fragments cut by the second enzyme will run differently. With newer methods, such as next generation sequencing, these approaches are becoming less important. Two-dimensional electrophoresis has also been applied in mapping origins of replication and in the analysis of topoisomers of superhelical DNA. The curvature of DNA can also be analyzed by 2D gel electrophoresis.

Temperature Gradient Gels A distinct variant of the 2D gels is temperature gradient gel electrophoresis (TGGE), where the second dimension is temperature. The electrophoresis is performed in one direction and a temperature gradient is applied perpendicular to the electric field (Figure 27.17). For this method one sample has to be loaded across the whole width of the gel. With increasing temperature the DNA is converted into the denatured state. Melting of DNA is a cooperative process accompanied by a drastic reduction of electrophoretic mobility. The process is strongly dependent on the sequence of the DNA fragments, since A/T-rich regions melt at lower temperatures. Temperature gradients are applied for mutational analysis, as this method can resolve single base changes. Using parallel TGGE (temperature gradient parallel to the electric field), many different samples can be analyzed simultaneously. For denaturing gradient gel electrophoresis (DGGE), the temperature gradient is replaced by a chemical gradient with increasing concentrations of denaturing reagents in the opposite direction to the electric field. Double-stranded DNA fragments are separated according to their melting properties. Both methods, TGGE and DGGE, are applied in heteroduplex analysis and for the analysis of microorganisms in environmental analytics.
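The sequence dependence of melting can be made concrete with a classic rough approximation for duplex DNA, Tm ≈ 81.5 + 16.6 log₁₀[Na⁺] + 0.41 (%GC) − 675/N, which ignores denaturants and finer sequence context. The sketch below uses it to compare an A/T-rich and a G/C-rich fragment; the 50 mM Na⁺ and the fragment values are illustrative assumptions.

```python
import math

# Illustration of the sequence dependence of melting: a classic rough
# approximation for the Tm of duplex DNA, ignoring denaturant effects and
# fine sequence context. The 50 mM Na+ and the fragment values are
# illustrative assumptions.

def tm_estimate(gc_percent, length_bp, na_molar=0.05):
    """Approximate melting temperature (deg C) of a DNA duplex."""
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / length_bp

print(f"AT-rich (30% GC, 300 bp): {tm_estimate(30.0, 300):.1f} C")
print(f"GC-rich (70% GC, 300 bp): {tm_estimate(70.0, 300):.1f} C")
```

The roughly 16 °C difference between the two fragments is why A/T-rich regions of a fragment melt first as it migrates into the hotter (or more denaturing) part of the gradient.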

27.2.5 Capillary Gel Electrophoresis

Capillary gel electrophoresis (CGE) is mainly applied to the analysis of nucleic acids. The advantages of this method lie in its rapidity, small sample volume, higher sensitivity, and high resolution. The theory of CGE is covered in Chapter 12. The separation principle is the migration of negatively charged nucleic acids in an electric field. The separation is performed in a capillary (50–100 μm in diameter, approx. 20–50 cm long) utilizing the sieving effect of the gel matrix. An important difference from the slab gels already described is that in CGE non-crosslinked gels can also be used as sieving material. Crosslinked gels (also referred to as chemical gels) usually consist of polyacrylamide and can be used in the capillary for 30–100 runs. Non-crosslinked gels (referred to as physical gels) can easily be flushed out of the capillary after each run by applying pressure, allowing each run to be performed under reproducible conditions with fresh material. Polymers used for physical gels are hydroxypropylmethyl cellulose (HPMC), hydroxyethyl cellulose (HEC), poly(ethylene oxide) (PEO), polyvinylpyrrolidone (PVP), or linear polyacrylamide. The running buffer, TBE, is similar to that used with slab gels; for denaturing conditions urea is added. The samples (1–2 μl) are loaded by electrokinetic injection or with pressure (Chapter 12). Injection of the sample and migration in the capillary are strongly dependent on the salt concentration – in most cases the samples are desalted before loading. During electrophoresis high voltages are applied (1–30 kV). The nucleic acids are detected by UV light or by fluorescence in the presence of fluorescent dyes (OliGreen®,

Topoisomers: forms of DNA molecules of identical length and sequence that differ only in their linking number.

Figure 27.17 Principle of temperature gradient gels. The DNA is loaded onto the whole gel width and a temperature gradient is applied perpendicularly to the electric field.


SYBR® Green, Section 27.3). The UV-opaque polyimide layer stabilizing the capillary has to be removed at the detection site (Chapter 12). An important application of CGE is the quality control of synthetic oligodeoxy- and oligoribonucleotides, as CGE allows resolution of a single-nucleotide difference. Oligonucleotides of up to 100 nucleotides can be separated and failure sequences (n−1, n−2, etc.) can be detected. Using the area under the curve, the ratio of the separated nucleic acids and hence the purity of the material can be determined. Figure 27.18 shows an electropherogram of an oligonucleotide. CGE is also used in nucleic acid sequencing: the sequencing reactions, labeled with four different dyes, are separated by CGE and detected with laser-induced fluorescence (LIF). Conformational polymorphisms (SSCP and HA) can also be analyzed by CGE.

Figure 27.18 CGE electropherogram of an oligonucleotide. An oligodeoxyribonucleotide 23 bases long was analyzed by denaturing capillary electrophoresis. The main peak is the full-length product and the lower peaks are contaminations with failure sequences (n−1, n−2). By determination of the areas under the curves, the ratio of the different products can be calculated. Source: with kind permission of Dr. Bernhard Noll, Qiagen GmbH.
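The area-under-the-curve purity determination mentioned above reduces to a simple ratio once each peak has been integrated. A minimal sketch follows; the peak areas are invented values, not data from Figure 27.18.

```python
# The area-under-the-curve purity determination described above reduces to
# a simple ratio once each peak has been integrated. The peak areas below
# are invented values, not data from Figure 27.18.

def purity_percent(peak_areas, full_length_key):
    """Full-length product as a percentage of the total peak area."""
    total = sum(peak_areas.values())
    return 100.0 * peak_areas[full_length_key] / total

areas = {"n": 920.0, "n-1": 55.0, "n-2": 25.0}  # arbitrary units
print(f"{purity_percent(areas, 'n'):.1f}% full-length product")
```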

Figure 27.19 Chemical structure of ethidium bromide. Ethidium bromide intercalates preferentially into double-stranded DNA and interacts with the planar heterocyclic rings of the nucleobases. Newly developed fluorescent dyes that irreversibly stain the DNA are TOTO-1 and YOYO-1.

27.3 Staining Methods

27.3.1 Fluorescent Dyes

Ethidium Bromide Ethidium bromide (3,8-diamino-5-ethyl-6-phenylphenanthridinium bromide) is an organic dye with a planar structure that intercalates into DNA (Figure 27.19). Its aromatic rings can interact with the heteroaromatic rings of the nucleobases. Single-stranded DNA or RNA also intercalates ethidium bromide, but to a much lesser extent. The intercalated dye is excited by UV light (254–366 nm) and emits orange–red light (590 nm). Binding of the dye to DNA increases the fluorescence (increased quantum yield), so that the staining of the DNA is also


Figure 27.20 Changes of the geometric properties of circular DNA by intercalation of ethidium bromide. The intercalation of ethidium bromide into negative supercoiled DNA is energetically preferred compared to intercalation into the relaxed form of DNA as positive supercoils have to be introduced.

visible in the presence of unbound ethidium bromide. Ethidium bromide can be added to the gel and running buffer during electrophoresis, making post-staining of the gel unnecessary. The fluorescent ethidium cation migrates towards the cathode during electrophoresis. For longer electrophoresis runs the running buffer should therefore contain ethidium bromide, since smaller, faster-migrating fragments will otherwise be stained only weakly. For certain applications (e.g., blotting of the gel) the gels are stained after the electrophoresis; the blotting efficiency of RNA is diminished in the presence of ethidium bromide. In this case, lanes for staining are cut from the gel and stained separately to serve as quality control and size standard.

In agarose gels, approximately 10–20 ng of double-stranded DNA can still be detected. DNA with intercalated dye has a reduced mobility in the gel (approx. 15%). Since ethidium bromide intercalates into DNA, it is a strong mutagen and should be handled with extreme care.

Influence on DNA Geometry Ethidium bromide changes the superhelical density of circular DNA molecules (form I) through a reduction of the negative supercoiling: topoisomers with negative superhelical density turn into the relaxed form (increase of entropy). For thermodynamic reasons this intercalation is favored over intercalation into linear DNA fragments. Further intercalation of ethidium bromide then induces positive supercoiling; this process is less favorable than intercalation into linear DNA (decrease of entropy, Figure 27.20). For CsCl density gradient centrifugation the ethidium bromide concentration should be saturating. All supercoiled DNA is transformed into the positively supercoiled conformation while taking up a smaller amount of ethidium bromide than the relaxed forms. The positively supercoiled DNA therefore has a higher density than linear (chromosomal) DNA and can be separated from the chromosomal contamination. The intercalation of ethidium bromide can also be used to analyze the conformation of circular DNA.

Other Fluorescent Dyes In recent years various intercalating dyes based on asymmetric cyanine substance classes have been developed. These dyes are highly sensitive and less


mutagenic than ethidium bromide. A commonly used dye is SYBR® Green, with an excitation maximum at 492 nm and a second absorption maximum at 284 nm; the emission maximum is at 519 nm. The dye can be used in fluorescence reader instruments, allowing exact quantification of a nucleic acid solution by comparison with standards. Newly developed variants of this substance class display a higher affinity for single-stranded DNA or RNA (e.g., SYBR® Green II, OliGreen®). These dyes do not bind specifically to single-stranded nucleic acids, but their quantum yield is drastically increased on interaction with single-stranded RNA. Other dyes, like TOTO-1 and YOYO-1, bind with much higher affinity to DNA than ethidium bromide, which can be exploited in applications where higher sensitivity is a decisive factor. Some of the new dyes (e.g., TOTO-3, YOYO-1, JOJO-1) cannot be excited with UV light and can only be used with laser-induced fluorescence (LIF).
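Returning to the geometry changes described for ethidium bromide above: how much dye it takes to relax a negatively supercoiled plasmid can be estimated from the commonly quoted unwinding of about 26° per intercalated ethidium. All numbers below (superhelical density, plasmid size, function name) are illustrative assumptions.

```python
# Back-of-the-envelope estimate of the amount of ethidium bromide needed
# to relax a negatively supercoiled plasmid. Assumes the commonly quoted
# unwinding of about 26 degrees per intercalated ethidium; plasmid size
# and superhelical density are illustrative values.

def dyes_to_relax(plasmid_bp, sigma=-0.05, bp_per_turn=10.5, unwind_deg=26.0):
    """Approximate number of intercalated dye molecules needed to
    compensate the linking-number deficit of the plasmid."""
    delta_lk = sigma * plasmid_bp / bp_per_turn   # linking-number deficit
    return abs(delta_lk) / (unwind_deg / 360.0)   # dye molecules

print(round(dyes_to_relax(5000)))  # a ~5 kb plasmid: a few hundred molecules
```

Beyond this point, every further intercalation event drives the plasmid into positive supercoiling, which is the basis of the CsCl/ethidium bromide separation described above.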

27.3.2 Silver Staining

Silver staining can only be performed with polyacrylamide gels, as agarose gels yield too high a background staining. The gels need to be poured using high-quality reagents and should be handled with extreme care, as any protein or nucleic acid contamination will result in high background staining.

Silver staining is a less commonly applied method for the detection of nucleic acids. The advantage of this method, as for proteins, is its sensitivity: very small amounts of nucleic acids (down to 0.03 ng mm⁻²) can be detected, and there is no need for mutagenic or radioactive detection reagents. The method is, however, time consuming and background staining can be high. Silver staining is based on the change of redox potential in the presence of nucleic acids (or proteins), which catalyze the reduction of silver nitrate to silver. The metallic silver precipitates on the nucleic acids if the redox potential there is higher than in the surrounding solution; these conditions can be achieved through the choice of buffer and reagents. Recent analyses found that the purine bases account for this reaction.

27.4 Nucleic Acid Blotting

27.4.1 Nucleic Acid Blotting Methods

Electroblotting, Section 11.7

For further analysis, the nucleic acids are separated by gel electrophoresis and transferred to a membrane. The fixed nucleic acids can then be identified and analyzed by hybridization with labeled probes of known sequence. The nucleic acids can be transferred to a membrane by various methods: capillary blotting, vacuum blotting, and electroblotting. Figure 27.21 displays the principle underlying each method. While blotting and subsequent hybridization with specific labeled probes was once the first choice for analyzing nucleic acids (e.g., to detect chromosomal rearrangements, mutations, etc.), newer techniques (e.g., PCR, next generation sequencing) are nowadays applied more frequently, although blotting remains a useful, fast, and inexpensive way to analyze nucleic acids. Capillary blotting can be performed with the least technical effort: the nucleic acids are transferred to the membrane by the capillary forces of paper towels, through which the blotting buffer is soaked through the gel and membrane. In vacuum blotting, the membrane is attached to a vacuum chamber and the nucleic acids are drawn out of the gel by the vacuum. Capillary and vacuum blotting can only be performed with agarose gels; the nucleic acids will not move out of polyacrylamide gels because of their smaller pore size. For polyacrylamide gels, electroblotting systems are used, in which the nucleic acids are transferred to the membrane by an electric field. Electroblotting can be performed in a buffer-filled tank or by semi-dry blotting, where the gel and the membrane are sandwiched between wet filter paper.

Figure 27.21 Schematic drawing of the different blotting techniques.

27.4.2 Choice of Membrane

For nucleic acid blotting two types of membranes are used: nitrocellulose and nylon membranes. Nitrocellulose has long been used but is increasingly being replaced by nylon (or poly(vinylidene fluoride), PVDF) membranes with improved handling and binding properties. Table 27.9 gives an overview of commonly used membranes and their properties. The nucleic acids are bound covalently to the surface of the nylon membrane, which fixes them more firmly to the material; the filters can be used several times. Binding to nitrocellulose is non-covalent. The advantages of nylon membranes are manifold: higher stability, higher binding capacity, and better fixation. Nitrocellulose is harder to handle and more fragile.

Nylon and PVDF membranes can yield a strong background signal, which can be reduced by the choice of suitable blocking reagents. If the membrane needs to be hybridized more than once, nylon membranes are the material of choice.

27.4.3 Southern Blotting

In 1975, E. Southern was the first to immobilize DNA separated by gel electrophoresis on a nitrocellulose membrane. Since then, the transfer of DNA from gels to a membrane has been called Southern blotting.

Table 27.9 Properties of different blotting membranes. Source: according to Ausubel, F.M., Brent, R.E., Moore, D.D., Smith, J.A., Seidman, J.G., and Struhl, K. (1987–2005) Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York.

Property                                   Nitrocellulose         Improved nitrocellulose         Neutral nylon membrane          Positively charged nylon membrane
Application                                ssDNA, RNA, proteins   ssDNA, RNA, proteins            ssDNA, dsDNA, RNA, proteins     ssDNA, dsDNA, RNA, proteins
Binding capacity (μg nucleic acid cm⁻²)    80–100                 80–100                          400–600                         400–600
Type of nucleic acid binding               Non-covalent           Non-covalent                    Covalent                        Covalent
Size restrictions for transfer             500 nt                 500 nt                          50 nt or bp                     50 nt or bp
Resilience                                 Bad                    Good                            Good                            Good
Multiple hybridizations                    Bad (become brittle)   Bad (loss of signal intensity)  Good                            Good


Southern blotting can be described by three main steps: preparation of the gel, transfer, and immobilization of the DNA on the membrane. The efficiency of the transfer can be strongly improved by partial depurination of the DNA in the gel: the gel is treated with dilute hydrochloric acid, whereby purines are cleaved from the DNA backbone. When the DNA is denatured, either within the gel or during blotting, the phosphodiester bonds at the apurinic sites are broken, leading to fragmentation of the DNA during transfer. This procedure is necessary especially for large DNA fragments (>10 kb); however, the fragments should not become too small, because small fragments are not fixed efficiently to the membrane. The denaturing step depends on the type of nucleic acid to be transferred; for genomic DNA, denaturing is essential. Usually, the gels are not stained with ethidium bromide, to avoid gel and separation artifacts. If nitrocellulose is used for blotting, the transfer buffer has to be high in salt for efficient binding to the membrane; usually 20× SSC buffer, containing sodium chloride and sodium citrate, is used. If a denaturing step was performed, the gel needs to be neutralized before blotting, as DNA does not bind to nitrocellulose above pH 9. For nylon membranes, the DNA can likewise be transferred using 20× SSC buffer (after a denaturing step), or the denaturing step can be performed during blotting using an alkaline transfer buffer (e.g., 0.4 M sodium hydroxide) with sodium chloride. Capillary blots are usually transferred overnight, whereas vacuum blots are performed within 1–2 h. The DNA then needs to be immobilized on the membrane. Crosslinking of the DNA to the membrane can be performed using UV light, whereby thymine bases are covalently linked to the amino groups of the nylon membrane.
The duration and intensity of the UV crosslinking need to be optimized, as crosslinking that is too strong makes most of the thymine bases unavailable for hybridization, whereas crosslinking that is too weak leads to loss of signal intensity. If the transfer is performed using alkaline buffer, immobilization of the DNA on the nylon membrane is not necessary. With nitrocellulose membranes, the DNA is bound non-covalently by baking the blotted membranes (at not more than 80 °C and for not longer than 2 h, as the nitrocellulose can ignite). For the transfer of DNA out of polyacrylamide gels, only electroblotting onto nylon membranes can be used. Southern blotting can be used for genomic analysis, where the genomic DNA is cut with different restriction enzymes, separated, and blotted. The DNA can be analyzed for a specific hybridization pattern using known gene probes. It can also be used to detect gene families or single-copy genes. With increasing knowledge of the human genome from deep sequencing, genomic analysis by Southern blotting is becoming less common. Southern blotting techniques are and have been used to detect similarities between different species ("Zoo blot"): the genomic DNA of different species (or strain subtypes) is cut with a restriction enzyme and hybridized with the probe of interest.

27.4.4 Northern Blotting

Following the nomenclature of Southern blotting for the transfer of DNA to membranes, the transfer of RNA to membranes is termed Northern blotting (and the transfer of proteins Western blotting). As with the electrophoresis methods, the blotting techniques need to account for the different properties of DNA and RNA. Since RNA is often separated in denaturing gels, a denaturing step preceding the blotting is not necessary. However, the denaturing reagents used during electrophoresis need to be removed by soaking the gel in dilute sodium hydroxide solution or by incubation of the filter at higher temperatures. If high molecular weight RNA molecules need to be transferred, the gel is incubated briefly in sodium hydroxide (0.05 M) to partially hydrolyze the RNA for easier transfer. Nitrocellulose membranes are more often used for Northern blotting than for Southern blotting. RNA gels are usually blotted in 10× or 20× SSC buffer. An alkaline transfer is possible, but only with a very low concentration of sodium hydroxide (7.5 mM). RNA gels are blotted by vacuum or capillary blotting, but the transfer requires more time (usually two days for a capillary blot). The RNA is immobilized on the membrane in the same way as DNA. Northern blots are used to study the expression of certain genes in different cells or tissues using total RNA or poly-adenylated mRNA purified by oligo(dT) affinity chromatography. For genes with low expression, the use of purified poly(A) mRNA is mandatory. To detect


Figure 27.22 Scheme of the setup of a slot and dot blotting unit.

differences in the expression levels of the mRNA of a certain gene, it is necessary to load equal amounts of RNA onto the gel. To control the amount of RNA loaded and blotted, the blot is hybridized again with a probe for a gene that is equally expressed in all tissues or cells (housekeeping genes, usually glucose-6-phosphate dehydrogenase, β-tubulin, or β-actin mRNA). The signal strength obtained with the housekeeping probe should be similar for each of the different samples. A similar analysis can be performed with RT-PCR (Chapter 29), which is usually the method of first choice, as Northern blots are not as quantitative as RT-PCR and require more hands-on time.
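The loading correction described above is arithmetically just a ratio against the housekeeping signal from the same lane. A minimal sketch follows; the tissue names and densitometry values are invented.

```python
# The loading correction described above: divide the target signal by the
# housekeeping signal from the same lane. Tissue names and densitometry
# values are invented for illustration.

def normalized_expression(target_signal, housekeeping_signal):
    """Target mRNA signal corrected for the amount of RNA loaded."""
    return target_signal / housekeeping_signal

samples = {
    "liver":  {"target": 410.0, "actin": 820.0},
    "kidney": {"target": 300.0, "actin": 600.0},
}
for tissue, s in samples.items():
    print(tissue, normalized_expression(s["target"], s["actin"]))
```

Although the liver lane shows a stronger raw target signal in this made-up example, both tissues express the gene at the same level once loading is corrected.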

27.4.5 Dot- and Slot-Blotting

Dot- and slot-blotting are simple applications of membrane hybridization. The nucleic acids to be analyzed are applied to the filter without prior separation by electrophoresis. The transfer is performed in dot-blotting units (Figure 27.22), with which a large number of samples can be processed simultaneously. Dot- and slot-blots are used to screen large sample numbers for the presence or absence of a certain nucleic acid sequence.

27.4.6 Colony and Plaque Hybridization

A variant of the blotting technique is the generation of so-called colony or plaque filters for the screening of cosmid or phage libraries. Colonies of bacteria grown on agar plates (colony hybridization) or phage plaques (plaque filters) are transferred to membranes for hybridization with a certain probe. A huge number of colonies or plaques can be screened for the presence of the sequence of interest. For colony or plaque hybridization, the stable nylon membranes are used. The membranes are laid onto the colony plate and the colonies are lifted to produce exact copies of the agar plate. The orientation of plate and filter needs to be marked exactly to ensure that the colonies on the membrane can later be allocated to the colonies on the plate. Bacteria and phages are lysed using a denaturing solution (containing sodium hydroxide and sodium


chloride). The DNA is fixed to the membrane and the remaining, contaminating RNA is hydrolyzed. The filters are neutralized and treated with UV light. To avoid hybridization artifacts, each agar plate is lifted twice and only those hybridization signals that are detected on both membranes are considered positive.

27.5 Isolation of Nucleic Acid Fragments

To isolate pure DNA fragments, the fragments need to be separated with high resolution. This can be achieved by optimization of the agarose concentration and the electrophoresis conditions. The gel should not be overloaded, as the resolution will otherwise decrease.

Chaotropic agents Certain ions can disrupt hydrogen bonding and destroy chemical structures. These ions are negatively charged with large diameters and low charge density. Examples are I⁻, ClO₄⁻, and SCN⁻.

DNA fragments of defined sequence are the basis for various methods, and the isolation methods for DNA fragments depend on the type of application they are needed for. Isolation of DNA from polyacrylamide gels yields very pure material, but the separation range and isolation efficiency are lower than for agarose gels. Isolation of fragments from agarose gels can result in contamination with polysaccharides, depending on the quality of the agarose. Very large DNA fragments (>5 kb), as well as small amounts of DNA, are isolated from agarose gels only with low efficiency. For preparative approaches (e.g., isolation of plasmid vectors), the DNA digested with restriction enzymes is loaded into several gel pockets. After separation, the DNA fragment is cut out of the gel with a scalpel; the fragments are detected by ethidium bromide staining and UV light. For the isolation of fragments it is important to use long-wavelength UV light and to keep the exposure time as short as possible to avoid damage to the DNA and crosslinking to the gel matrix. The methods described below can also be used to isolate or purify fragments that have not previously been separated on slab gels.

27.5.1 Purification using Glass Beads

Most commercially available DNA fragment isolation kits use glass beads. DNA can be bound to the glass surface in the presence of chaotropic salts (e.g., lithium acetate or sodium perchlorate). Hydrogen bonds within the agarose polymer are disrupted by high concentrations of chaotropic salts and the gel matrix is dissolved. At these salt concentrations the DNA adsorbs to the silica surface of the glass beads. The adsorption is strongly pH dependent; some protocols use pH indicators to ensure the correct pH (it must be below 7.5 for optimal binding). The glass beads are either centrifuged or handled in columns. After several washing steps to remove residual agarose and salt, the adsorbed DNA is eluted from the glass beads with a low-salt buffer at high pH (TE usually works well). With this method, fragments with sizes between 40 bp and 50 kb can be isolated from 0.2–2% agarose gels. Higher molecular weight fragments (>4 kb) need more time and higher temperatures for elution from the glass beads, and the yield depends on the size of the DNA fragment. If the fragments are isolated from TBE gels, monosaccharides need to be present to chelate the borate anions. With some modifications, the method can also be used for the isolation of DNA from polyacrylamide gels.

27.5.2 Purification using Gel Filtration or Reversed Phase

The principle of gel filtration has been discussed in Section 26.1.2. For fragment isolation this method is used to separate (radioactively labeled) nucleotides from reaction products or to desalt nucleic acid solutions. The choice of column material depends on the fragment size. The method is also frequently used for the purification of PCR fragments, to remove primers, nucleotides, and fluorescent dyes. Gel filtration methods are also available in 96-well format.

27 Analysis of Nucleic Acids

27.5.3 Purification using Electroelution

This method is based, like electrophoresis itself, on the migration of nucleic acids in an electric field. It requires more instrumentation and is used less frequently than commercially available kits, but if an instrument is available, electroelution is cheaper than gel filtration columns. In the simplest experimental setup, the agarose gel piece is enclosed in a dialysis tube containing electrophoresis buffer and subjected to an electric field. The DNA migrates out of the agarose piece along the field and is trapped inside the dialysis tube. Depending on the type of electroelution instrument the technical setup can vary; however, the DNA always concentrates in the direction of the anode.

27.5.4 Other Methods

Oligonucleotides can be purified using denaturing polyacrylamide gels (Section 27.2.1), which make it possible to separate n-mers from failure sequences (n−1, n−2). The oligonucleotide band can be visualized by fluorescence quenching. Efficient elution of the excised bands is achieved by incubating the small polyacrylamide gel pieces in sodium acetate solution; the oligonucleotides diffuse into the solution. If only modest purity and yield are required (e.g., for simple cloning experiments), agarose gel pieces can be centrifuged through silanized glass wool: the agarose matrix is disrupted by the centrifugal forces and held back by the glass wool. Simple centrifugation of the gel pieces in a tube can also yield enough DNA fragment in the supernatant above the agarose pellet. These methods are not suitable when fragments of high purity and high yield are needed.

27.6 LC-MS of Oligonucleotides

Markus Weber¹ and Eugen Uhlmann²

¹ Currenta GmbH & Co. OHG, Chempark Q18, 51368 Leverkusen, Germany
² iNA ImmunoPharm GmbH, Zentastrasse 1, 07379 Greiz, Germany

Part IV: Nucleic Acid Analytics

27.6.1 Principles of the Synthesis of Oligonucleotides

Synthetic oligonucleotides and their derivatives are important tools in molecular biology and in the development of new types of drugs, in particular antisense oligonucleotides, siRNAs, aptamers, antagomirs, and CpG adjuvants. These days their synthesis takes place on gram and kilogram scales, primarily by the phosphoramidite method on a solid phase (Figure 27.23). The step-by-step, computer-controlled synthesis proceeds in the 3′ to 5′ direction. The first nucleoside residue is bound to the solid-phase support (organic polymer or controlled pore glass) via its 3′-hydroxyl group and a base-labile succinic acid ester. Orthogonal protective groups, like the acid-labile 5′-O-dimethoxytrityl (DMT) group and base-labile protective groups on the nucleobases and on the phosphate residue, allow the targeted exposure of reactive functions. In the first step, the DMT group is removed by treatment with diluted tri- or dichloroacetic acid. The free hydroxyl group is converted into a trivalent phosphite ester in a condensation reaction with a 5′-O-DMT-nucleoside-3′-phosphoramidite, catalyzed by tetrazole. These esters can then be oxidized to a phosphotriester with iodine or converted to a thiophosphotriester with a sulfurizing agent like the Beaucage reagent. After complete chain synthesis and removal of the protective groups, the oligonucleotides can be phosphodiesters, phosphorothioates, or mixed-backbone analogs, depending on the reagents employed. Since the substitution of an oxygen atom with a sulfur atom creates a chiral phosphate, the synthesis of phosphorothioates yields a mixture of 2^n diastereomers, where n is the number of internucleotide bonds. For example, a 20-mer oligonucleotide with 19 phosphorothioate internucleotide bonds consists of 524 288 diastereomers. An acylation (capping) reaction is used to prevent unreacted 5′-hydroxyl components from reacting in subsequent coupling cycles. After multiple repetitions of the reaction cycle, corresponding to the length and composition of the desired sequence, the oligonucleotide is cleaved from the solid-phase column with concentrated ammonia and the protective groups are removed. RNA synthesis differs from DNA synthesis only in that an additional protective group is required for the 2′-hydroxyl group. Often the t-butyldimethylsilyl (TBDMS) protective group is used, which is stable during the synthesis and can be removed in the very last step with triethylammonium fluoride.

Figure 27.23 Schematic representation of the reaction cycle of oligonucleotide synthesis according to the phosphoramidite method on a solid phase (TBDMS: t-butyldimethylsilyl, DMT: dimethoxytrityl).

Figure 27.24 Possible by-products of the oligonucleotide (phosphorothioate) synthesis.

Figure 27.25 Depurination reaction (iBu: isobutyryl).

Protective groups: temporarily introduced groups that enable a molecule with several reactive functions to react selectively at only one of these functions.

Diastereomers: substances with several chiral centers that are not mirror images of one another and are therefore different compounds with different physical properties. For example, phosphorothioate-modified oligonucleotides have, besides the chiral β-D-deoxyribose, a further chiral center at the phosphorus.
Although the cycles of the phosphoramidite method proceed with very high yields of 98–99%, by-products can be present that arise either from failed reactions during chain assembly or from the final reactions to remove the protective groups (Figure 27.24). Since the coupling reactions do not proceed with 100% efficiency, not only full-length oligonucleotides (N) are present but also shortened products (N−1, N−2, N−3, etc.), which are missing one or more nucleotides. Interestingly, the reactions can also yield products longer than expected (N+1). These arise from a double addition during the tetrazole-catalyzed coupling reaction, as a result of minor cleavage of the acid-labile DMT group either on the monomer or on the growing chain. This side reaction occurs most frequently during the condensation of deoxyguanosine, whose 5′-DMT group is the most acid-labile of the four nucleosides, owing to the slightly acidic conditions of the tetrazole-catalyzed coupling. In the case of incomplete sulfurization during the synthesis of phosphorothioates, reaction products contain a phosphodiester bond (mono-phosphodiester) in addition to the expected phosphorothioates. Another side reaction during the synthesis of purine-containing sequences is depurination (Figure 27.25), which takes place in an acidic environment: the hydrolysis of the N-glycosidic bond between the nucleobase and deoxyribose following protonation of the purine base at the 7-position. Possible side reactions of the deprotection are either incomplete removal of the protective groups, such as the isobutyryl group on the exocyclic amino function of guanine, or the formation of acrylonitrile adducts. The latter can come about after β-elimination of the 2-cyanoethyl phosphate protective group and subsequent base-catalyzed addition of the acrylonitrile to the N3 of the thymine base (Figure 27.26).

Figure 27.26 Formation of cyanoethyl adducts during the removal of the cyanoethyl protective groups with concentrated ammonia.

The by-products of oligonucleotide synthesis can, due to their complexity, only be partially separated by subsequent ion exchange or reversed-phase chromatography and are therefore only detectable by suitable analytical methods, such as the LC-MS method described in the following section.
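The arithmetic behind these numbers can be sketched in a few lines. This is an illustrative calculation only; the 98–99% per-cycle efficiency and the 20-mer example with 19 phosphorothioate linkages are taken from the text above:

```python
# Full-length yield after n coupling steps at a given per-cycle efficiency,
# and diastereomer count for a phosphorothioate with n stereorandom linkages.

def full_length_yield(coupling_efficiency: float, couplings: int) -> float:
    """Fraction of chains reaching full length N after `couplings` couplings."""
    return coupling_efficiency ** couplings

def diastereomer_count(thioate_linkages: int) -> int:
    """Each stereorandom phosphorothioate linkage doubles the isomer count."""
    return 2 ** thioate_linkages

# 20-mer: 19 couplings, 19 phosphorothioate internucleotide bonds
print(f"yield at 99%: {full_length_yield(0.99, 19):.1%}")   # ~82.6%
print(f"yield at 98%: {full_length_yield(0.98, 19):.1%}")   # ~68.1%
print(f"diastereomers: {diastereomer_count(19)}")           # 524288
```

Even at 99% coupling efficiency roughly a fifth of the chains are shortmers, which is why the N−1, N−2, ... species discussed above are always present.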

27.6.2 Investigation of the Purity and Characterization of Oligonucleotides

While traditionally oligonucleotides with the natural phosphodiester internucleotide bond were of primary interest, modified oligonucleotides now play a greater role, particularly in the area of therapeutic applications, and the demands on analytical methods have increased dramatically with their growing use in recent years. Methods like polyacrylamide gel electrophoresis (PAGE), capillary gel electrophoresis (CGE), and anion exchange high performance liquid chromatography (HPLC) have been the mainstay of analytical techniques in the past, but the investigation of synthetic oligonucleotides in particular makes the use of HPLC-MS methods increasingly important. The online coupling of HPLC with mass spectrometric (MS) detection results in extremely powerful and conclusive HPLC-MS methods. A successful LC-MS analysis requires the best possible separation of the analytes by an HPLC method. Furthermore, it must be ensured that the HPLC method is compatible with the subsequent MS method, which represents a significant hurdle in the development of new LC-MS techniques. Electrospray ionization mass spectrometry (ESI-MS) is the method of choice; it requires, however, the use of volatile buffer systems in the preceding HPLC. Direct infusion of the oligonucleotide to be investigated, without prior HPLC purification, is greatly complicated by the formation of cation adducts, a consequence of the high affinity of oligonucleotides for sodium and potassium ions. Greig and Griffey have shown that the addition of strong bases like triethylamine (TEA) or piperidine strongly reduces the formation of adducts and thereby increases the sensitivity of ESI detection. For the investigation of complex mixtures by ESI-MS, a separation of the analytes by HPLC is essential.
It turned out, however, that the mobile phases that led to a good separation of the analytes inhibited electrospray ionization. As a result of their weakly hydrophobic character and their polyanionic nature, oligonucleotides are ill suited to conventional reversed-phase (RP) HPLC. Ion pair (IP) reagents, which strengthen the interaction between the analytes and the stationary phase of the column, are therefore employed in the HPLC separation of oligonucleotides (Figure 27.27). Triethylammonium acetate (TEAA) and tetrabutylammonium bromide (TBAB) are two IP reagents frequently used in the separation of oligonucleotides by IP-RP-HPLC. TBAB, however, is not volatile and therefore cannot be used in combination with electrospray ionization. Although TEAA is a volatile ion pair reagent, it negatively impacts the sensitivity of MS detection: the TEAA concentration normally required for efficient separation leads, in general, to a significant loss of MS sensitivity. Apffel et al. were the first to use hexafluoroisopropanol/triethylamine (HFIP/TEA) as an ion pair reagent and thereby achieved high HPLC separation efficiency while maintaining high MS sensitivity and low adduct formation. Apffel attributed the increased MS sensitivity of the HFIP/TEA buffer to the different boiling points of HFIP, acetic acid, and TEA. Since acetic acid (boiling point 118 °C) has a higher boiling point than TEA (89 °C), TEA evaporates preferentially from a TEAA buffer, leading to a decrease in the pH of the HPLC eluent during the electrospray process. The drop in pH leads to protonation of the negatively charged oligonucleotide and therefore to a decrease in the sensitivity of MS detection. During the desolvation of an HPLC eluent from HFIP/TEA buffer, in contrast, the HFIP evaporates preferentially, which leads to an increase in the pH value and therefore to deprotonation of the phosphate groups in the oligonucleotide backbone. The resulting negatively charged oligonucleotide can be transferred into the gas phase during the ESI process and detected with high sensitivity. Gilar et al. further optimized the HFIP/TEA buffer system originally introduced by Apffel et al. for the separation of oligonucleotides by HPLC and thereby facilitated its broad adoption in the research and development of oligonucleotides.

Figure 27.27 Schematic representation of the separation mechanism of ion pair reversed-phase HPLC. Reversed-phase chromatography means that the stationary phase is less polar than the eluent mixture. Typical stationary phases are porous silica gels, which have alkyl groups of various lengths bound to their surface. The chain length of the alkyl residues determines the hydrophobicity of the stationary phase; most often C18 or C8 alkyl chains are used.

Depending on the nature of the column and eluents, both nonpolar and polar analytes can be separated by RP-HPLC. If the polar character is very pronounced, as with oligonucleotides, separation by normal RP methods is not possible. To increase the interaction, and thereby the affinity, of polar substances for RP phases, ion pair reagents are generally used. These are characterized by undergoing a hydrophobic interaction with the RP phase and an electrostatic interaction with the analyte. For the separation of oligonucleotides, ammonium ions bearing alkyl residues are used as ion pair reagents to increase the interaction between the analytes and the RP phase. Besides these electrostatic interactions, hydrophobic interactions between the RP phase and the hydrophobic bases of the oligonucleotides also contribute to the total retention of the analytes. The separation efficiency of an ion pair method for oligonucleotides is primarily determined by the lipophilic character of the ammonium cation; in addition, the counter-ion also has an impact on the separation. Gilar explains the high separation efficiency of the HFIP/TEA buffer as a result of the decreased solubility of protonated TEA molecules in HFIP, compared with acetic acid, which increases the surface concentration of the cation on the RP phase.

27.6.3 Mass Spectrometric Investigation of Oligonucleotides

HPLC systems coupled to mass selective detectors use electrospray ionization, in which the analytes, dissolved in the separation buffer that elutes them from the HPLC column, are introduced via a capillary into the ion source. At atmospheric pressure, a potential of several kilovolts is applied to the LC capillary, so that the eluate emerging from the capillary in the ion source forms a finely dispersed spray of highly charged solvent droplets with diameters of a few micrometers. The analysis of oligonucleotides, which owing to their phosphate backbone form negatively charged molecular ions very readily, takes place in negative ion mode, in which the LC capillary receives a positive charge. Ionization is particularly effective when the oligonucleotide is already in a deprotonated form due to the use of a suitable HPLC buffer system; this can be achieved with buffers that have an alkaline pH during the electrospray. The ions in the solvent droplets move into the gas phase through desolvation, during which, depending on the molecular weight of the oligonucleotide, primarily multiply charged molecular ions are formed. The charge distribution is determined mainly by the molecular weight of the analyte, but may also be influenced by the type of HPLC buffer and the instrument parameters. Fragmentation of the analytes is not observed, owing to the low thermal load during electrospray ionization. After transfer of the analytes into the gas phase, the mass analysis of the ions takes place in the mass spectrometer. For the analysis of oligonucleotides, HPLC coupling offers the advantage that complex substance mixtures can be investigated in a relatively simple manner. In addition, the chromatographic purification allows an almost complete removal of salts, which would otherwise inhibit the electrospray process. By coupling HPLC with ESI-MS one obtains, in addition to the UV chromatogram, a further chromatographic trace, the so-called total ion current (TIC) chromatogram, which usually correlates well with the UV chromatogram detected at 260 nm. Mass spectrometric detection allows a mass spectrum to be viewed at every point of the TIC chromatogram (Figure 27.28).

Figure 27.28 Schematic representation of the LC-MS analysis of a complex mixture of three components, of which two (2a and 2b) cannot be separated by chromatography.
The electrospray ionization of oligonucleotides generally leads to the formation of a series of multiply charged ions, which carry a variable number of negative charges in their backbones. In the mass spectrum of an oligonucleotide there is therefore always a series of ion signals that differ by exactly one charge, z; for a molecule of mass m, a series of different values of m/z is detected. Since ESI-MS does not detect the mass directly, but rather the ratio m/z, multiply charged molecules of relatively high mass can still be measured. The intensity of an ion signal depends on the statistical probability that the corresponding ion is formed during the electrospray process; in the ideal case the intensity distribution follows a Gaussian curve. The actual form of the intensity distribution and the position of its maximum are, however, dependent on the choice of MS parameters, since these can strongly influence the transmission of individual ions. The formation of ion series prevents the direct determination of the molecular weight from the mass spectrum of an oligonucleotide. The molecular weight can, however, be obtained by deconvolution of the data using the measured m/z values of the individual charge states. The mass spectra recorded by an ion trap in full scan mode can also be used to generate an extracted ion chromatogram (EIC). Here the individual m/z values of the ion series of a compound are used to calculate a chromatogram trace, which shows at what time point that component elutes. An EIC can thus be used to generate chromatogram traces of co-eluting substances of differing molecular weights. These can then be used like any normal UV or TIC chromatogram to quantify substances; even substances that cannot be separated by HPLC can be quantified in this way, provided they differ in molecular weight.

Gauss (or normal) distribution: a symmetrical, bell-shaped distribution with which many random processes in nature can be described.
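The deconvolution step can be illustrated numerically. In negative ion mode a peak of charge z appears at m/z = (M − z·mH)/z, so the spacing of two adjacent peaks fixes z, and each peak then yields the neutral mass M. The sketch below is a simplified illustration; the m/z values are computed for the 6665.6 Da main product discussed in the next section, not measured data:

```python
# Deconvolution of an ESI ion series measured in negative ion mode.
# For charge z the peak appears at m/z = (M - z*M_H)/z, where M_H is the
# mass of each removed proton (~1.00728 Da).

M_H = 1.00728

def deconvolute(peaks):
    """Estimate the neutral mass M from the m/z peaks of one ion series."""
    peaks = sorted(peaks)               # ascending m/z = descending charge
    masses = []
    for lo, hi in zip(peaks, peaks[1:]):
        # Charge of the higher-m/z peak follows from the peak spacing:
        # z*(hi + M_H) = (z + 1)*(lo + M_H)  =>  z = (lo + M_H)/(hi - lo)
        z = round((lo + M_H) / (hi - lo))
        masses.append(z * (hi + M_H))   # neutral mass from that peak
    return sum(masses) / len(masses)

# Simulated series for a 6665.6 Da oligonucleotide, charge states 6-10:
series = [(6665.6 - z * M_H) / z for z in range(6, 11)]
print([round(mz, 2) for mz in series])
print(round(deconvolute(series), 1))    # -> 6665.6
```

Averaging the mass estimates from all adjacent peak pairs mirrors what deconvolution software does with a full charge-state envelope.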

27.6.4 IP-RP-HPLC-MS Investigation of a Phosphorothioate Oligonucleotide

This section explains the investigation of a synthetic oligonucleotide by IP-RP-HPLC-MS, using the example of a 20-mer phosphorothioate oligonucleotide. Apart from the phosphorothioate modification of the oligonucleotide backbone there are no other chemical modifications. Table 27.10 shows the sequence of the phosphorothioate. The objective of the IP-RP-HPLC-MS analysis is to identify by-products of the oligonucleotide synthesis on the basis of their molecular weight and thereby to draw conclusions about the purity of the main product of the synthesis. The UV and TIC chromatograms shown in Figure 27.29 provide an overview of the number of contaminants present. In addition to the contaminants separated by HPLC, other contaminants that are not separable by chromatography can be present, which cannot be identified on the basis of the UV or TIC traces. Provided they differ in molecular weight, they can be detected by ESI mass spectrometry and their content estimated via the EIC method described earlier.

Table 27.10 Identification of the by-products from the synthesis of the 20-mer phosphorothioate oligonucleotide based on their molecular weight. By-products making up less than 0.5% of the total UV trace are not discussed here.

ID | MW (Da) | ΔMW (Da) | MW calc. (Da) | Identification a) | % UV
A  | 6665.6  | 0        | 6665.4        | N                 | 78.1
B  | 6532.7  | −132.9   | 6532.2        | N − guanine + H2O |
C  | 6649.8  | −15.7    | 6649.3        | Nox               |
D  | 6320.3  | −345.3   | 6320.1        | N − G             | 11.6
E  | 7010.8  | +345.2   | 7010.6        | N + G             | 5.3
F  | 6719.1  | +53.5    | 6718.4        | N + CE            | 1.8
G  | 5974.9  | −690.7   | 5974.8        | N − 2G            | 2.4
H  | 5629.6  | −1036.0  | 5629.5        | N − 3G            | 0.8

a) Nox: mono-phosphodiester. N − guanine + H2O: depurinated product. N + CE: cyanoethyl adduct. Sequence of the desired oligonucleotide: G*G*G*G*G*A*G*C*A*T*G*C*T*G*G*G*G*G*G*G, where * represents a phosphorothioate internucleotide bond.

Figure 27.29 Ion pair RP-HPLC separation of a 20-mer phosphorothioate oligonucleotide. The total ion current (TIC, dashed line) of the ESI detection corresponds to the conventional UV detection (solid line) at 260 nm.

With the aid of mass spectrometric detection, carried out in addition to conventional UV detection, it is possible to determine the molecular weight of the compounds eluting from the HPLC into the mass spectrometer. Figure 27.30 shows a typical measured MS spectrum of the desired main component (A). It shows a series of signals of the negatively charged ions typical of oligonucleotides. In this case, five signals were measured, which can be attributed to species with different numbers of negative charges (6–10) in the phosphorothioate backbone. Depending on the measurement conditions, a different number of charge states may be found in the primary spectrum. By deconvolution, the molecular weight of the analyte can be calculated from the m/z values of an ion series. Here a molecular weight of 6665.6 Da was measured, which agrees well with the calculated mass of 6665.4 Da for the desired oligonucleotide. The molecular weights of two analytes incompletely separated by HPLC (compare compounds E and F in Figure 27.29) are easily determined from their mass spectra (Figure 27.31). The ion series of component E correlates, by virtue of its higher signal intensity, with the more intense UV signal of the double peak (E, F) in the UV trace. From the higher intensity signals of series E a molecular weight of 7010.8 Da is calculated, which corresponds to an oligonucleotide extended by one nucleotide (N + G) (Table 27.10). The signals of series F yield a mass of 6719.1 Da after deconvolution, which corresponds to the cyanoethyl adduct of the oligonucleotide (N + CE).

Figure 27.30 ESI spectrum of the main component (A) of the chromatographic separation. The molecular weight of component (A) is determined by deconvolution of the ion series (A).
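An assignment of by-products as in Table 27.10 is essentially a lookup of characteristic mass shifts relative to the full-length product. The sketch below mimics that assignment; the shift values are taken from the table, while the ±1 Da matching tolerance and the function names are illustrative assumptions:

```python
# Assign oligonucleotide synthesis by-products from the mass difference
# to the expected full-length product N, using the shifts of Table 27.10.

SHIFTS = {
    0.0:     "N (full-length product)",
    -132.9:  "N - guanine + H2O (depurinated)",
    -15.7:   "Nox (mono-phosphodiester)",
    -345.3:  "N - G (shortmer)",
    +345.2:  "N + G (longmer)",
    +53.5:   "N + CE (cyanoethyl adduct)",
    -690.7:  "N - 2G (shortmer)",
    -1036.0: "N - 3G (shortmer)",
}

def assign(measured_mw: float, expected_mw: float, tol: float = 1.0) -> str:
    """Match the observed mass shift against the known by-product shifts."""
    delta = measured_mw - expected_mw
    for shift, name in SHIFTS.items():
        if abs(delta - shift) <= tol:
            return name
    return f"unassigned (delta = {delta:+.1f} Da)"

expected = 6665.6  # deconvoluted mass of the main product A
for mw in (6665.6, 6532.7, 7010.8, 6719.1):
    print(mw, "->", assign(mw, expected))
```

Running this reproduces the identifications of components A, B, E, and F from the table.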
The signals E and F, which could not be completely separated chromatographically, can thus be unequivocally assigned to the two by-products N + G (E) and N + CE (F) (Table 27.10). On the basis of these simple considerations, the molecular weights of components A and D to H can be determined without great difficulty (Figure 27.29). The by-products B and C of the synthesis cannot be separated from the main product A by HPLC; all three compounds therefore appear as a single signal in the UV and TIC detection traces. In this case it is, however, possible to use the mass-dependent detection of these components to identify them. In the mass spectra recorded on the leading edge of the main peak, two further ion series, B and C, appear beside the signal of the main component A (Figure 27.32). Deconvolution allows determination of the molecular weights of the by-products B and C, which coelute with the target compound A. In this case, besides the desired oligonucleotide N of mass 6665.4 Da, two other by-products of masses 6532.7 Da (B) and 6649.8 Da (C) are detected, which correspond to the depurinated product (B) and the mono-phosphodiester Nox of the oligonucleotide (C). The m/z values of the differently charged analytes of the ion series A, B, and C can also be used to extract ion chromatograms. The EIC method is particularly helpful in this case, since the by-products of the synthesis (ion series B and C) cannot be separated chromatographically from the main product A; in this manner the by-products B and C, which coelute with the main product, can be quantified (Figure 27.33).

Figure 27.31 ESI spectra of the chromatographically incompletely separated components (E) and (F).

Figure 27.32 ESI spectra from the leading edge of the main peak detect three components of differing molecular weight.

Figure 27.33 Extracted ion chromatograms (EIC) of components A, B, and C. By integration of the chromatogram traces, the ratio of the co-eluting compounds A, B, and C to one another can be determined.

The mass spectrometrically determined molecular weight of the main product A can be compared with the expected molecular weight to confirm its identity. In addition, in many cases the difference in molecular weight between the main product and a by-product can be used to determine the identity of the by-product (Table 27.10). In this example, what appears in the UV-HPLC chromatogram as a single uniform peak (A, B, C) (Figure 27.29) is revealed by the EIC traces to be a mixture of the desired main product A (88%), the depurinated product B (7.5%), and the mono-phosphodiester C (4.5%) (Figure 27.33).
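The EIC-based quantification of co-eluting species can be sketched as follows: for each compound, the intensities at the m/z values of its ion series are summed scan by scan to give a trace, and the integrated traces are normalized to percentages. The scan data and m/z values below are synthetic toy values, not measured data; only the procedure follows the text:

```python
# Extracted ion chromatograms (EIC) for co-eluting compounds and their
# relative quantification, as described for components A, B, and C.

def eic_trace(scans, series_mz, tol=0.5):
    """Sum, per scan, the intensities of all peaks matching an ion series."""
    return [
        sum(i for mz, i in scan if any(abs(mz - t) <= tol for t in series_mz))
        for scan in scans
    ]

def relative_amounts(traces):
    """Integrate each trace and normalize the areas to percentages."""
    areas = {name: sum(trace) for name, trace in traces.items()}
    total = sum(areas.values())
    return {name: 100 * a / total for name, a in areas.items()}

# Toy scans: lists of (m/z, intensity) pairs; two hypothetical ion series
# at m/z 832.2 (compound "A") and 815.6 (compound "B")
scans = [
    [(832.2, 100.0), (815.6, 10.0)],
    [(832.2, 300.0), (815.6, 25.0)],
    [(832.2, 40.0),  (815.6, 5.0)],
]
traces = {
    "A": eic_trace(scans, [832.2]),
    "B": eic_trace(scans, [815.6]),
}
print(relative_amounts(traces))   # A dominates, roughly 92% vs. 8%
```

In a real analysis each compound's `series_mz` would contain all m/z values of its charge-state series, and the integration would be restricted to the elution window of the peak.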

Further Reading

Section 27.1

Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Smith, J.A., Seidman, J.G., and Struhl, K. (1987–2005) Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York.
Roberts, R.J. et al. (2003) A nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases and their genes. Nucleic Acids Res., 31, 1805–1812.
Roberts, R.J., Vincze, T., Posfai, J., and Macelis, D. (2010) REBASE – a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res., 38, D234–D236.

Sections 27.2–27.5

Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Smith, J.A., Seidman, J.G., and Struhl, K. (1987–2005) Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York.
Chrambach, A., Dunn, M.J., and Radola, B.J. (eds) (1987) Advances in Electrophoresis, Volume 1, VCH Verlagsgesellschaft, Weinheim.
Darling, D.C. and Brickell, E.M. (1994) Nucleinsäure-Blotting. Labor im Fokus, Spektrum Akademischer Verlag, Heidelberg.
Glasel, J.A. and Deutscher, M.E. (1995) Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York.
Grossman, L. and Moldave, K. (eds) (1980) Nucleic Acids: Part I, Methods in Enzymology, vol. 65, Academic Press, New York.
Hafez, M. and Hausner, G. (2012) Homing endonucleases: DNA scissors on a mission. Genome, 55, 553–569.
Krieg, E.A. (ed.) (1996) A Laboratory Guide to RNA: Isolation, Analysis and Synthesis, Wiley-Liss, New York.
Martin, R. (1996) Elektrophorese von Nucleinsäuren, Spektrum Akademischer Verlag, Heidelberg.


Miller, J.M. (2013) Whole-genome mapping: a new paradigm in strain-typing technology. J. Clin. Microbiol., 51 (4), 1066–1070.
Nassonova, E.S. (2008) Pulsed field gel electrophoresis: theory, instruments and application. Cell Tissue Biol., 2 (6), 557–565.
Nowacka, M., Jockowiak, P., Rybarcyk, A., Magaz, T., Strozycki, P.M., Barciszewski, J., and Figlerowicz, M. (2012) 2D-PAGE as an effective method of RNA degradome analysis. Mol. Biol. Rep., 39, 139–146.
Rickwood, D. and Hares, B.D. (eds) (1990) Gel Electrophoresis of Nucleic Acids: A Practical Approach, IRL Press, Oxford.
Salieb-Beugelaar, G.B., Dorfman, K.D., van den Berg, A., and Eijkel, J.C.T. (2009) Electrophoretic separation of DNA in gels and nanostructures. Lab Chip, 9, 2508–2523.
Sambrook, J. and Russell, D.W. (2001) Molecular Cloning: A Laboratory Manual, 3rd edn, Cold Spring Harbor Press, Cold Spring Harbor.

Section 27.6

Apffel, A., Chakel, J.A., Fischer, S., Lichtenwalter, K., and Hancock, W.S. (1997) New procedure for the use of high-performance liquid chromatography–electrospray ionization mass spectrometry for the analysis of nucleotides and oligonucleotides. J. Chromatogr. A, 777, 3–21.
Engels, J.W. (2013) Gene silencing by chemically modified siRNAs. New Biotechnol., 30 (3), 302–307.
Gilar, M. (2001) Analysis and purification of synthetic oligonucleotides by reversed-phase high-performance liquid chromatography with photodiode array and mass spectrometry detection. Anal. Biochem., 298, 196–206.
Gilar, M., Fountain, K.J., Budman, Y., Holyoke, J.L., Davoudi, H., and Gebler, J.C. (2003) Characterization of therapeutic oligonucleotides using liquid chromatography with on-line mass spectrometry detection. Oligonucleotides, 13, 229–243.
Gilar, M., Fountain, K.J., Budman, Y., Neue, U.D., Yardley, K.R., Rainville, E.D., Russell, R.J. II, and Gebler, J.C. (2002) Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: retention prediction. J. Chromatogr. A, 958, 167–182.
Greig, M. and Griffey, R.H. (1995) Utility of organic bases for improved electrospray mass spectrometry of oligonucleotides. Rapid Commun. Mass Spectrom., 9 (1), 97–102.
Kusser, W. (2000) Chemically modified nucleic acid aptamers for in vitro selections: evolving evolution. Rev. Mol. Biotechnol., 74, 27–38.
Martin, R. (1996) Elektrophorese von Nucleinsäuren, Spektrum Akademischer Verlag, Heidelberg.
Matteucci, M.D. and Caruthers, M.H. (1981) Synthesis of deoxyoligonucleotides on a polymer support. J. Am. Chem. Soc., 103, 3185–3191.
Uhlmann, E. (2000) Recent advances in the medicinal chemistry of antisense oligonucleotides. Curr. Opin. Drug Discovery Dev., 3, 203–213.
Uhlmann, E. and Vollmer, J. (2003) Recent advances in the development of immunostimulatory oligonucleotides. Curr. Opin. Drug Discovery Dev., 6, 204–217.
Warren, W.J. and Vella, G. (1995) Principles and methods for the analysis and purification of synthetic deoxyribonucleotides by high-performance liquid chromatography. Mol. Biotechnol., 4, 179–199.

28 Techniques for the Hybridization and Detection of Nucleic Acids

Christoph Kessler¹ and Joachim W. Engels²

¹ PD Christoph Kessler, Consult GmbH, Icking, Schloßbergweg 11, 82057 Icking-Dorfen, Germany
² Goethe University Frankfurt, Institute of Organic Chemistry and Chemical Biology, Department of Biochemistry, Chemistry and Pharmacy, Max-von-Laue-Straße 7, 60438 Frankfurt am Main, Germany

The last two decades have seen the development of many new assay techniques for the detection and analysis of DNA or RNA sequences. These highly specific and sensitive methods have become standard methods in molecular biology within a short span of time. Today they are used for:

- diagnosis of infectious diseases: viral or bacterial identification;
- tissue and organ tolerance diagnostics: histocompatibility genes;
- cancer diagnosis and risk analysis: gene mutation analysis;
- diagnosis of inheritable diseases, pre-implantation diagnostics: gene and chromosome analysis;
- paternity testing, forensic medicine, animal breeding: DNA profiling;
- plant breeding: analysis of gene transfer, patterns of resistance;
- crop and wine analysis: tests for pathogens, resistance, or marker genes linked to new or modified gene products;
- molecular archeology and anthropology: gene analysis of mummies and archeological finds;
- production of recombinant human, pharmaceutically active proteins: quality control and specificity analysis;
- safety surveillance of genetic laboratories: contamination tests.

Other important fields of application are the elucidation of specific genetic changes, such as point mutations, deletions, insertions, or triplet repeats, which cause disease in oncology, genetic disease, and chronic infection. Knowledge of the precise sequence changes responsible for a disease is a prerequisite for the development of genetic diagnostic tests and gene therapy approaches. The Human Genome Project has provided a standard of reference for these efforts. Begun in the early 1990s and projected to last 15 years, the Human Genome Project completely elucidated the molecular structure and sequence of the human genome. The first data sets of the complete human genome were published at the beginning of 2001 and updated by the Human Genome Consortium in October 2004. In 2008 the sequences of eight human genomes were published, and at present efforts are underway to sequence 1000 genomes in the USA and 10 000 genomes in the UK (UK10K). Disease can result from defined gene defects, chromosome aberrations such as translocations, chromosome number abnormalities, or sub-chromosomal aberrations such as amplifications or deletions. A well-known example is Down's syndrome, which is caused by the presence of three copies of chromosome 21. Due to the great variety of potential gene changes, the methods used for the analysis of nucleic acids span a broad range. The analysis must be able to detect monogenetic mutations, those in single positions of the genome, as well as polygenetic mutation patterns, which involve a number of mutations. In many cases the mutations are polymorphic, meaning that more than one type of mutation is known, and new, spontaneous mutations may also arise.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Part IV: Nucleic Acid Analytics

Fluorescent in Situ Hybridization in Molecular Cytogenetics, Chapter 35
DNA Sequencing, Chapter 30

The type of mutation determines the assay type: while defined mutations or simple mutation patterns are primarily detected with hybridization techniques, the analysis of variable, complex mutations, such as polymorphisms and spontaneous mutations, is increasingly performed with the new sequencing methods. The sample type also influences the choice of assay: different techniques are required depending on whether the nucleic acids are isolated, amplified, or chromosome preparations, and whether they are fixed to solid membranes, in solution, or in cells, tissues, or organisms. An extreme case is Drosophila embryos fixed to glass. The analysis of sequence fragments or of amplified DNA or RNA sequences involves in vitro nucleic acid analysis. In situ analyses are used in molecular cytogenetics to detect chromosome aberrations or to analyze endogenous or exogenous sequences in cells, tissues, or organisms. Different hybridization and sequencing methods are used in these techniques, depending on the nature of the changes in the nucleic acids. In this chapter we discuss the most common methods of in vitro hybridization and detection; nucleic acid sequencing is explained in Chapter 30.

28.1 Basic Principles of Hybridization

Isolation of Plasmid DNA from Bacteria, Section 26.3.1

Figure 28.1 Complementarity of base pairing.

The complementary bases of DNA, A and T or C and G, are bound to one another via hydrogen bonds (Figure 28.1). Hydrogen bonds are non-covalent bonds of moderate to low strength, depending on the donor and acceptor atoms and the distance between them. This complementarity of base pairing is the basis of hybridization. In 1961, Julius Marmur and Paul Doty discovered that the DNA double helix can be separated into two strands by heating it above its melting temperature. If a mixture of sufficiently long single strands is then allowed to cool, the strands hybridize back into double strands. Based on this observation, Sol Spiegelman developed the technique of nucleic acid hybridization. Spiegelman investigated whether newly synthesized bacterial mRNA is complementary to T2 DNA after infection by T2 phages. In his experiment, he mixed ³²P-labeled T2 mRNA and ³H-labeled T2 DNA. After denaturation of the double-stranded DNA into single strands and subsequent reassociation, he analyzed the nucleic acid mix by density gradient centrifugation. Since RNA has a higher density than DNA, it is possible to separate them in a caesium


Figure 28.2 C₀t (cot) curves represent the renaturation of different thermally denatured DNA species. The left-hand curve, mouse satellite DNA, contains many repetitive sequences and therefore renatures quickly. Source: adapted from Britten, R.J. and Kohne, D.E. (1968) Science, 161, 1530.

chloride (CsCl) density gradient. A measurement of the distribution of the radioactivity showed that a third band of RNA–DNA hybrids had appeared, in addition to the single-stranded RNA and the DNA double helices. In further experiments with T2 mRNA and DNA from unrelated organisms, no such hybrids were observed. This experiment showed that the correct sequence of complementary bases in the antiparallel nucleic acids is necessary for the hybridization reaction to take place. Information about the complexity of a particular DNA sequence can be obtained from hybridization experiments. Sequences that appear frequently in the genome renature faster than those that appear only once. Eukaryotic DNA can be divided into four classes depending on the frequency of its sequences within the genome: DNA molecules containing inverted repeats (palindromes), which pair by folding back on themselves into hairpin (stem) loops, renature to double helices immediately. Highly repetitive sequences reform helices somewhat more slowly, followed by the moderately repetitive sequences and finally the unique sequences, which, under normal circumstances, are the last to rehybridize. The complexity of the DNA is expressed in the cot value: if c₀ is the concentration of single-stranded DNA at time point t = 0 and c(t) the concentration of single-stranded DNA at time point t, then:

c(t)/c₀ = 1/(1 + k·c₀·t)    (28.1)

where k is the association constant, a kinetic constant. The function c(t)/c₀ is the proportion of DNA still single-stranded at a particular time point t (Figure 28.2). At a specific time, t½, 50% of the DNA strands are hybridized: c(t)/c₀ = 0.5; the value of c₀ × t½ is the cot value. Besides conclusions about the complexity of DNA, hybridization experiments can also be used to identify particular DNA sequences in mixtures. A known sequence, a probe, is labeled, either radioactively or non-radioactively, and hybridized with the DNA to be analyzed. Identification results from the detection of the labeled hybrids.
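Equation (28.1) is easy to explore numerically; the following sketch (function and variable names are our own, chosen for illustration) reproduces the renaturation behavior behind the cot curves of Figure 28.2:

```python
def ss_fraction(c0, k, t):
    """Fraction of DNA still single-stranded after time t, Eq. (28.1).

    c0: initial single-strand concentration; k: association constant.
    """
    return 1.0 / (1.0 + k * c0 * t)


def cot_half(k):
    """C0t value at which half of the strands have reassociated.

    Setting ss_fraction = 0.5 in Eq. (28.1) gives c0 * t(1/2) = 1/k:
    repetitive DNA (large k) has a small cot value, complex unique
    DNA (small k) a large one, as in Figure 28.2.
    """
    return 1.0 / k


# A sequence class with a 16-fold larger association constant reaches
# half-renaturation at a 16-fold smaller c0*t:
fast = cot_half(4.0)
slow = cot_half(0.25)
```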

28.1.1 Principle and Practice of Hybridization

One thing all hybridization techniques have in common is that detection of the nucleic acid target molecules results from the sequence-specific binding of complementary labeled nucleic acid probes (Figure 28.3). In general, the nucleic acid mixture to be analyzed is blotted onto a membrane or other solid support, such as the surface of a microtiter plate, or left in solution. Under carefully controlled, stringent conditions (Section 28.1.2), the nucleic acids are mixed with a solution containing the labeled probe and incubated at a fixed temperature. Labeled probes can be made from oligonucleotides, DNA fragments, PCR (polymerase chain reaction) products (amplicons), in vitro RNA transcripts, or artificial probes such as PNA (Section 28.2.3). Single-stranded RNA and DNA hybridize with one another, so that all three possible double helices can be formed: DNA:DNA, DNA:RNA, and RNA:RNA. The probe hybridizes with the complementary target sequence. After the incubation, stringent washes are carried out


Figure 28.3 Nucleic acid detection by hybridization. Southern blots (named after Edwin Southern) are often used for this process. The polynucleotides, such as DNA fragments or PCR products, are incubated in a plastic bag for several hours with labeled probes at 60–70 °C. This temperature allows the DNA probes to hybridize with their complementary sequences on the blot. After washing, the position of the DNA segments complementary to the labeled probe is revealed by detection of the signal from the label.

to remove nonspecifically adsorbed probe molecules. Techniques without wash steps, called homogeneous assays, have also been described. The target sequence is identified by measurement of the specific binding of the labeled probe; this specific binding is visualized by autoradiography or by non-radioactive methods, which are described in Sections 28.2 and 28.3.

28.1.2 Specificity of the Hybridization and Stringency

The specificity of hybridization depends on the stability of the hybrid complex formed, as well as on the stringency of the reaction conditions. The stability of the hybrid correlates directly with its melting point (Tm). The Tm value is defined as the temperature at which half of the hybrids have dissociated. It depends on the length and base composition of the hybridizing section of sequence, the salt concentration of the medium, the presence or absence of formamide or other helix-destabilizing agents, and the type of hybridizing nucleic acid strands (DNA:DNA, DNA:RNA, or RNA:RNA). For DNA:DNA hybrids the following formula applies to a first approximation:

Tm = 81.5 °C + 16.6 log c(Na⁺) + 0.41 (% G + C) − 500/n    (28.2)

In this equation, c(Na⁺) is the concentration of Na⁺ ions and n the length of the hybridizing section of sequence in base pairs. Because G:C pairs contain three hydrogen bonds rather than the two of A:T pairs, a higher G+C content raises the melting temperature, which the formula takes into account. The term 500/n does not apply for sequences longer than 100 bp. The melting point of DNA:RNA hybrids is 10–15 °C higher, and that of RNA:RNA hybrids around 20–25 °C higher, than the value for DNA:DNA hybrids given by the formula above. Base pair mismatches lower the melting point. The kinetics of hybridization are decisively influenced by the diffusion rate and the length of the duplexes; the diffusion rate is highest for small probe molecules. As a result, hybridization with oligonucleotide probes is usually complete in 30 min to 2 h, while hybridization with longer probes is typically carried out overnight. A disadvantage of oligonucleotide probes, however, is that their sensitivity is lower than that of longer probes, since both the length of the hybridizing sequence and the number of labels that can be incorporated are limited. Nonetheless, the sensitivity can be increased by the use of oligonucleotide cassettes (Section 28.2.1) and through terminal tailing with multiple labels. Repetitive sequences hybridize faster, since the number of potential matches for any given sequence is greater. Reaction accelerators, such as the inert polymers dextran sulfate or


poly(ethylene glycol), coordinate water molecules and thereby increase the effective concentration of the nucleic acids in the remaining solution. The proportion of correctly paired nucleotides in a hybridized duplex molecule is determined by the degree of stringency with which the hybridization was conducted. Stringent conditions are those under which only perfectly base-paired nucleic acid double strands are formed and remain stable. When, under given conditions, oligo- or polynucleotide probes pair only with the desired target nucleic acid in a mixture of nucleic acids (i.e., no cross-hybridization with other nucleic acids takes place), the hybridization is defined as selective. An example of the use of specific hybridization with oligonucleotide probes is the detection of differences between almost identical sequences that differ by only a single base pair, such as ras wild-type versus mutant at codon 12, or the distinction between Neisseria gonorrhoeae and Neisseria meningitidis, which differ by only a single base. The factors that influence the stringency are mainly the ion concentration, the concentration of helix-destabilizing molecules such as formamide, and the temperature. Monovalent cations, usually Na⁺, coat the helix backbone and shield the mutually repulsive, negatively charged phosphates, thereby increasing the stability of the double helices, whereas formamide inhibits the formation of hydrogen bonds and thus weakens helix stability. Temperature is of considerable importance. For example, the melting point, Tm, of a DNA segment composed of 50% (G + C), at an ion concentration of 0.4 M NaCl, is 87 °C. Hybridization takes place between 67 and 72 °C in this case. Adding 50% formamide lowers the melting point of the DNA helix to 54 °C, so hybridization can take place at 37–42 °C. This decrease in temperature is exploited in in situ hybridization, for example, because cellular structural integrity is lost at typical hybridization temperatures.
At defined formamide and Na⁺ concentrations, it is the temperature that defines the stringency of hybridization. The Tm of a duplex molecule decreases by up to 5 °C for every 1% of base pair mismatch; higher temperatures therefore allow only perfectly complementary sequences to pair (high stringency). By reducing the temperature, hybrids with unpaired bases are also tolerated (low stringency). Use of high-stringency conditions restricts the detection of hybridizing sequences to those that find a perfectly complementary match. After hybridization, the wash steps are conducted only 5–15 °C below the Tm (destabilization of hybrid complexes) and in a buffer with a low ion concentration (0.1× SSC (saline sodium citrate), which corresponds to 15 mM Na⁺). A precise differentiation of mismatched base pairing is easiest using a PNA hybridization probe (Section 28.2.3). These artificial nucleic acid analogs, with an uncharged, peptide-like backbone, show more pronounced stability differences between wild-type and mutant hybridizations than RNA or DNA probes, which results in better discrimination of base pair mismatches (see also LNA, Section 28.2.4).
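Equation (28.2), together with the mismatch and formamide rules discussed in this section, can be combined into a rough Tm estimator. The sketch below uses our own function names; the per-percent corrections for formamide and mismatches are approximate rules of thumb rather than part of Eq. (28.2), and simplified formulas of this kind can deviate by several degrees from tabulated examples:

```python
import math


def tm_dna_dna(gc_percent, na_molar, n_bp,
               formamide_percent=0.0, mismatch_percent=0.0):
    """Approximate Tm (deg C) of a DNA:DNA hybrid, following Eq. (28.2).

    The 500/n term applies only to hybrids up to ~100 bp and is dropped
    for longer sequences. The formamide slope (~0.6 deg C per percent)
    and the mismatch penalty (up to ~5 deg C per percent, taken here at
    its maximum) are assumed rules of thumb, not part of Eq. (28.2).
    DNA:RNA hybrids melt 10-15 deg C higher, RNA:RNA 20-25 deg C higher.
    """
    tm = 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent
    if n_bp <= 100:
        tm -= 500.0 / n_bp
    tm -= 0.6 * formamide_percent   # helix destabilizer (assumed slope)
    tm -= 5.0 * mismatch_percent    # worst-case mismatch penalty
    return tm
```

Raising the formamide concentration or tolerating mismatches lowers the estimated Tm, which is exactly the handle used to tune stringency.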

28.1.3 Hybridization Methods

Heterogeneous detection systems employ a detection reaction after the unbound probe has been washed off. Homogeneous detection systems carry out detection without separating the remaining probe; they usually involve a change in probe state that turns a signal on upon hybridization.

Heterogeneous Systems for Qualitative Analysis In addition to the Southern blots already mentioned, dot, reverse dot, and slot blot methods are used for the qualitative analysis of DNA. The same methods can be used for RNA, except that the Southern blot is then called a Northern blot. Bacteria are detected with the aid of colony hybridization assays and viruses with plaque hybridization assays. Targets for in situ hybridization are chromosomes, cells, swabs, tissue, or even entire small organisms, such as Drosophila, on slides.

Heterogeneous Systems for Quantitative Analysis Heterogeneous systems for the quantitative analysis of nucleic acids include sandwich assays using capture and detection probes, displacement assays, in which a short detection probe is displaced from the complex, and special amplification methods, in which the labeling of the detection complex takes place by incorporation of a labeled dNTP or primer during amplification (e.g., with DIG; Section 28.4.3). The labeled amplicon is subsequently hybridized to a biotinylated capture probe and the complex immobilized on a streptavidin-coated membrane. Alternatively, reverse dot blots capture the target using a probe


In Situ Hybridization, Section 35.1.4

The higher the stringency, the more specific the hydrogen bonding between the complementary strands along the entire length of the hybridizing sequence. With oligonucleotide probes, it is possible to differentiate single mutations under stringent conditions, which is essential, for example, for the specific detection of point mutations, such as the single base pair difference in sickle cell anemia, or the detection of RNA sequences from certain pathogenic species of bacteria, such as Neisseria gonorrhoeae.


covalently bound to the membrane. After washing away the excess label, the amount of bound, DIG-labeled amplicon reflects the original concentration of the target. In amplification assays, the principle is turned around and the primer is labeled with biotin, while detection is via hybridization with a labeled detection probe (Figure 28.4).

Figure 28.4 Principle of heterogeneous amplification systems. The capture marker attached to the solid support (F) is incorporated during amplification. The amplicon strand attached to the solid support is detected by hybridization with an oligonucleotide probe (D: detection marker).


Homogeneous Systems for Quantitative Analysis Homogeneous systems are much more difficult to design and develop, but their convenience and large dynamic range give them significant advantages over heterogeneous systems. In combination with efficient amplification techniques like PCR, homogeneous amplification techniques have enabled quantitative and reproducible detection of femto- to attogram amounts (10⁻¹⁵ to 10⁻¹⁸ g), which suffices to detect as little as a few copies of the target sequence. At this extreme level of sensitivity, statistical limitations in collecting the samples begin to limit the overall sensitivity of detection of the target sequences. Homogeneous systems allow measurement of the amplification products during the course of the amplification reaction, without requiring the separation of reaction components before addition of the detection reagent. The risk of contamination can be greatly reduced by carrying out the reaction in a closed system of sealed vessels, which also allows the direct detection of fluorescence signals at any time during the amplification reaction. The use of glass capillary tubes, for example, allows detection of fluorescent signals directly through the wall of the tube. The formation of the amplicons can thus be followed in real time. Another advantage of the homogeneous systems is their larger dynamic range, up to eight to ten orders of magnitude, in comparison to heterogeneous systems. In homogeneous systems, the probes bind to target sequences between the primers during the amplification, generating a detectable signal. This results either from digestion of the probe by the elongating enzyme, in the case of TaqMan probes, or through the binding itself, as for the LightCycler rapid PCR with 3′- and 5′-end-labeled probes (HybProbes), which hybridize next to one another. Molecular beacon probes open their stem-loop structure on hybridization to the target sequence. These formats rely on labeling the probes with a fluorophore–quencher pair or with a FRET pair. Rhodamine or fluorescein are often used as the fluorescent marker in these systems, while dabcyl or cyanine dyes are used as the quencher. A particularly good quencher is the black hole quencher, which absorbs the emitted light almost completely. The type of label pair is tuned to the specific system.
This results from labeling the probes with a fluorescence-quencher set or two FRET pairs. Rhodamine or fluorescein are often used as the fluorescent marker in these systems and rhodamine derivatives, such as dabcyl or cyanine dyes, are used as the quencher. A particularly good quencher is the black hole quencher, which almost completely absorbs the emitted light. The type of label pair is tuned to the specific system. TaqMan or 5´ Nuclease Methods A modern homogeneous detection system is the TaqMan or 5´ nuclease amplification detection principle, also known as the 5´ nuclease assay. In this wellknown homogeneous detection system the probe is equipped with a marker pair, consisting of a fluorescent marker and a quencher. The distance between the pair is chosen such that the incident primary light stimulates release of fluorescent light from the marker, which is absorbed by the quencher as long as the probe is free, preventing release of a detectable signal. After hybridization of the detection probe to the amplified target molecules, the 5´ -3´ exonuclease activity of the Taq DNA polymerase frees the fluorescent nucleotide from the probe, which diffuses away from the quencher and can now emit unquenched light (Figure 28.5). Prior to degradation by the 5´ nuclease catalysis, the probe’s fluorescent signal is inactivated by the proximity of the quencher. As a result the free probe is inactive and need not be separated to specifically detect newly formed hybrid complexes. Since the quenched fluorescent nucleotide comes exclusively from hybridized complexes, the amount of fluorescence measured as a result of the decoupling of the fluorescent probe from the quencher is a direct measurement of the amount of the hybrid complex. This allows measurement of the intended target molecules without the need to separate the excess probes. Measurement of the signal increase allows the quantification of the amplified targets formed (Figure 28.6a). 
Figure 28.7 shows the results of a TaqMan measurement. The amount of amplification product is not measured after completion of the amplification reaction, as in end-point measurements; instead, the formation of amplification products is measured continuously during the course of the PCR cycles, which is why this is called a real-time measurement. The unit of measurement, the cT value, gives the PCR cycle during which a signal first rises above the threshold of detection. Plotting the cT value semi-logarithmically against the initial copy number, prior to amplification, yields a linear relationship between the cT value and the original copy number of the target sequence in the sample. By correlating the results with corresponding curves from external controls, or by co-amplification of the target sequence with internal standards of known copy number, the copy number of the target sequence in the sample can be quantitatively determined. Internal standards are constructed such that they use the same primers but contain a different probe-binding sequence. If the probe for the desired target sequence is labeled with one dye and the probe for the control with another, filters can be used to separate the two signals measured simultaneously.
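The linear relationship between the cT value and the logarithm of the initial copy number can be used directly for quantification. A minimal sketch (function names are ours; the example slope of −3.32 corresponds to an ideal doubling per cycle) fits a standard curve from external controls by least squares and inverts it for an unknown sample:

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting copy number."""
    return 10.0 ** ((ct - intercept) / slope)


def efficiency(slope):
    """Amplification efficiency; a slope near -3.32 means ~100%
    (each cycle doubles the amplicon, each tenfold dilution shifts
    the cT value by about 3.32 cycles)."""
    return 10.0 ** (-1.0 / slope) - 1.0
```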

Amplification of DNA, Section 29.2.2

Instruments, Section 29.2.1


Figure 28.5 Principle of the 5´ nuclease reaction format (TaqMan). As long as the fluorescent detection marker (D) and quencher (Q) are linked in close proximity, no signal is emitted. The 5´ nuclease activity eliminates the linkage, the detector diffuses away and is no longer quenched, resulting in a signal proportional to the number of targets created by the Taq DNA polymerase.


Figure 28.6 Homogeneous detection systems: (a) TaqMan/5´ nuclease system; (b) FRET system with HybProbes; (c) molecular beacon system; (d) intercalation system.

Figure 28.7 Coupled amplification and detection in the TaqMan system. The higher the cT value, the lower the copy number of the sample shown by the curve.


Figure 28.8 FRET principle. The absorption and emission spectra of the two FRET components are different. The emission spectrum of the energy donor, however, overlaps with the absorption spectrum of the energy acceptor, so that an energy exchange can take place. The detected secondary light emitted by the FRET pair is of a longer wavelength than the primary light input to induce the signal.

Förster Resonance Energy Transfer (FRET), Section 7.3.7

FRET System In the FRET system, probes are used that carry two fluorescent detection markers with different but overlapping absorption and emission spectra. This allows energy to be transferred from one to the other and also allows the emitted light signals to be distinguished. The two components form a fluorescence resonance energy transfer (FRET) pair. One component takes up the primary light and transfers the energy to the second component if it is close enough (0.5–10 nm). The second component emits the absorbed energy in the form of longer-wavelength secondary light. The increase in wavelength allows the output signal to be measured without interference from the primary light (Figure 28.8). HybProbes are probe pairs in which each of the two probe components is labeled individually with one of the two FRET components. The upstream binding probe is labeled at its 3′ end; the second probe, which binds immediately downstream, is labeled at its 5′ end. In solution, the two component probes do not produce a FRET signal, since they are not in physical proximity. If, however, the two probes hybridize to the newly synthesized target sequence during the annealing phase of the PCR reaction, they come into direct proximity to one another. Since the FRET components are attached to the directly adjacent ends, the FRET effect between the components results in a signal. The more probe pairs bind to the new amplicons, the stronger the signal becomes. Measuring the signal increase allows quantification of the amplified sequences (Figure 28.6b). In the LightCycler, the FRET probes and samples are contained in thin glass capillary tubes, which allow rapid PCR temperature cycles. The probes used in the LightCycler are HybProbes carrying the donor and acceptor of a FRET pair. The tiny volume and high surface area of the capillaries allow a very rapid temperature exchange. Each cycle is short, which reduces the total amplification time significantly.
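The strong distance dependence of the energy transfer follows Förster's law, E = R₀⁶/(R₀⁶ + r⁶), where R₀ is the Förster radius of the dye pair (the distance at which half of the energy is transferred). A minimal sketch; the distances below are invented for illustration:

```python
def fret_efficiency(r_nm, r0_nm):
    """Foerster transfer efficiency for a donor-acceptor distance r_nm.

    r0_nm is the Foerster radius of the dye pair (typically a few nm);
    at r = R0, exactly half of the excitation energy is transferred.
    """
    return r0_nm ** 6 / (r0_nm ** 6 + r_nm ** 6)


# The sixth-power falloff is what makes adjacent hybridized probes
# signal while free probes in solution stay dark: for an assumed
# R0 of 5 nm, a pair 2.5 nm apart transfers almost all energy,
# while one 10 nm apart transfers almost none.
close = fret_efficiency(2.5, 5.0)   # ~0.98
far = fret_efficiency(10.0, 5.0)    # ~0.02
```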
Molecular Beacon System The use of molecular beacons is another way to measure the formation of products during amplification. In this case, in contrast to the HybProbes, only a single probe is used. The probe has a reporter and a quencher bound to its two ends. An example of such a molecular beacon is the combination of fluorescein as the reporter and dabcyl as the quencher. The probe sequence is chosen such that the probe adopts a pronounced stem-loop structure, which brings the photosensitive components on the two ends into immediate proximity of one another, leading to quenching. As a result, no fluorescent signal is observed. The two arms of the stem are covalently connected through the loop, which contains the sequence that can hybridize with the target sequence. When it does so, the probe unfolds and the two ends are separated, so that the quencher no longer blocks the fluorescent signal of the detection marker. Thus, molecular beacons are also a type of fluorescence dequenching assay. In this homogeneous system, the signal becomes stronger as more probes bind to the formed amplicons, which directly reflects the amount of the amplified products. By measuring the increase in signal, this method allows quantification of the formation of amplified sequences (Figure 28.6c).

Intercalation Assay In dye intercalation assays the signal is generated by the intercalation of dyes like SYBR Green into the newly synthesized double strands of the amplification products. In


contrast to other assay types, no detection probe is hybridized to the amplicon; instead, the intercalation of the fluorescent dye into the amplification products allows a sequence-independent, quantitative measurement (see also Section 28.3.1). Different products generated in a single run can be distinguished by their shifted LightCycler melting profiles if the amplicons differ in length or base composition and thus have different melting points.

Other Homogeneous Systems Other homogeneous systems are used primarily for the quantitative analysis of nucleic acids of bacterial or viral infections. Examples are the activation of inactive β-galactosidase by a complementary α peptide in enzyme complementation assays, or the measurement of changes in the mass of the complex through changes in fluorescence depolarization. These homogeneous assays will not be discussed in more detail.
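The melting-point comparison mentioned above for intercalating dyes is usually done on the negative first derivative of the melt curve, where each amplicon's Tm appears as a peak. The sketch below simulates an idealized sigmoidal melt transition (all parameters invented for illustration) and locates the peak of −dF/dT:

```python
import math


def melt_fluorescence(temp_c, tm_c, width_c=2.0):
    """Idealized intercalator signal: high while the amplicon is
    double-stranded, dropping sigmoidally around its melting point."""
    return 1.0 / (1.0 + math.exp((temp_c - tm_c) / width_c))


def melting_peak(temps, fluorescence):
    """Tm estimate: temperature of the maximum of -dF/dT, using
    central finite differences on a uniform temperature grid."""
    best_t, best_slope = None, float("-inf")
    for i in range(1, len(temps) - 1):
        minus_dfdt = -(fluorescence[i + 1] - fluorescence[i - 1]) / (
            temps[i + 1] - temps[i - 1])
        if minus_dfdt > best_slope:
            best_t, best_slope = temps[i], minus_dfdt
    return best_t


# Simulated melt curve of an amplicon with an assumed Tm of 85 deg C,
# scanned from 70 to 95 deg C in 0.5 deg C steps:
temps = [70.0 + 0.5 * i for i in range(51)]
signal = [melt_fluorescence(t, tm_c=85.0) for t in temps]
print(melting_peak(temps, signal))   # 85.0
```

Two amplicons of different length or base composition simply give two separate peaks in the same −dF/dT plot.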

In Situ Systems In situ assays are used to analyze nucleic acids in fixed cells, tissues, chromosomes (metaphase chromosomes or prophase nuclei), or complete organisms, such as whole-mount embryos. Such molecular cellular analysis supplements the analysis of sequence changes or genetic aberrations in isolated nucleic acids discussed up to this point. In situ detection begins with the mounting of the biological material on slides or cover slips. Cell walls must be lysed enzymatically to allow hybridization with a labeled probe in the cells, in whole mounts, or on the fixed chromosomes. Detection is via an optical (e.g., DIG-AP plus BCIP/NBT) or fluorescent signal (e.g., directly coupled fluorescein or rhodamine). An example of sequence detection in whole mounts is the detection of mRNA expression from early developmental genes in Drosophila embryos. In cells of higher organisms, the DNA in the nucleus is in the form of chromatin or, during the metaphase of cell division, chromosomes. Cell division of many lymphoid cells can be stimulated by the application of phytohemagglutinin, a plant lectin, and cultivation for 2–3 days at 37 °C. The resulting metaphase chromosomes can be isolated by treating the cells with the spindle inhibitor colchicine, which arrests the cells in the middle of cell division. After pipetting the cells onto a glass slide, the cells are lysed with a hypotonic solution to create a chromosome spread. After fixing, the chromosomes can be observed under a microscope. Fluorescence in situ hybridization (FISH) is the best-known system for karyotype analysis of chromosomes with the aid of fluorescence-tagged probes; specific regions of the chromosomes are detected as fluorescent signals. The following sections go into the details of nucleic acid detection, with a primary focus on hybridization assays and non-radioactive labeling and detection. Radioactive labeling is also discussed. Staining is presented in Section 28.4.1.

28.2 Probes for Nucleic Acid Analysis

Labeled oligonucleotides or nucleic acid fragments play a central role in strategies for sequence-specific nucleic acid detection, serving as hybridization probes or as primers for nucleic acid synthesis in sequencing. Either radioactive or non-radioactive probes are employed. The repertoire of methods for generating and using non-radioactively labeled probes has expanded greatly in the last few years, making them the method of choice, particularly for standard methods of nucleic acid analysis. While radioactive methods were originally the only option, problems relating to contamination of sophisticated and expensive instrumentation, as well as disposal problems, have greatly favored the development of non-isotopic assays. Although the routinely used isotopes ³H, ¹⁴C, ³²P, ³³P, ³⁵S, and ¹²⁵I have the advantage that the chemical structure, and therefore the hybridization characteristics, of the probes remain unchanged during nucleic acid analysis, the use of isotopes has the following serious disadvantages:

- limited half-life and therefore limited detection window: for example, the frequently used ³²P isotope has a half-life of only 14.3 days;
- necessity of internal standards for quantitative analyses;
- molecular damage to the probe itself caused by its radioactive emissions;
- need to repeat probe labeling for longer experiments;
- necessity of a special safety laboratory with expensive safety precautions;
- necessity of disposing of the radioactive material;
- increased planning and logistics;
- potential health risks.

In Situ Hybridization, Section 35.1.4
Application of FISH and CGH, Section 35.2

These disadvantages make the use of isotopes increasingly problematic, particularly given the growing availability of non-radioactive methods with at least comparable sensitivity and range of use. Since, however, many laboratories in the research field still have the equipment for radioactive work, isotopes remain in use for blot hybridization and manual sequencing. It is to be expected, however, that with the increasing standardization and automation of analytical methods the radioactive methods will be used less and less often.

Probe Types DNA and RNA probes, that is, short single-stranded DNA oligonucleotides, longer double-stranded DNA fragments, or single-stranded RNA probes, are used in the analysis of nucleic acids. Cloned probes contain vector fragments unless the vector sequences are removed in additional steps. Vector fragments can lead to undesired cross-hybridization; for example, nonspecific cross-reactions of pBR vector sequences with genomic human DNA have been described. These undesirable effects can be avoided by using vector-free probes, which can be produced by PCR amplification, in vitro synthesis, or chemical synthesis. A more recently introduced alternative to DNA oligonucleotides is peptide nucleic acid (PNA) oligomers, which show the same base specificity and hybrid geometry but are built on an uncharged, peptide-like synthetic backbone. Due to the lack of mutually repelling phosphate groups, hybrids of PNA probes and nucleic acid sequences have a higher melting point, allowing the use of higher hybridization temperatures and increasing the specificity of the hybridization. A further advantage of PNA is an increased ability to discriminate mismatches.

28.2.1 DNA Probes

Polymerase Chain Reaction, Chapter 29

The main types of DNA probes used in nucleic acid analysis are cloned genomic or cDNA probes (or the corresponding restriction fragments), PCR-generated amplicon probes, and synthetic oligonucleotide probes. For many years, cloned cDNA or genomic fragments, usually between 300 bp and 3 kb in length, were the most common DNA probes for detecting complementary DNA or RNA sequences in Southern or Northern blots. The sensitivity depends on the length of the hybridizing region and the density of the labeling: genomic probes are therefore more sensitive than cDNA probes, since cDNA probes can hybridize only with the exon sequences of the target nucleic acids, while genomic probes also encompass the often extensive intron sequences. Both probe types, however, have the disadvantage that their cloning and subsequent plasmid isolation are time consuming, and the production of vector-free probes requires an additional restriction cleavage and fragment separation. Even then, repetitive sequences within the probe can cross-react with eukaryotic genomic DNA, amplified eukaryotic genes, or cellular total mRNA, leading to unspecific bands. The possibility of creating hybridization probes by PCR amplification with Taq DNA polymerase led to an enormous and rapid increase in the availability of probes. This procedure has several advantages:

• Cloning and plasmid isolation are no longer necessary, so the probes do not contain vector sequences.
• Both DNA and RNA can serve as templates to create probes (RNA must first be converted into cDNA with reverse transcriptase in RT-PCR).
• The probes are of a defined length, allowing easy adjustment of the stringency of the hybridization.
• Probe design is extremely flexible, since the probe length and position can be easily controlled by the selection of the primers.
• Only the sequences where the primers bind need to be known for the amplification; thus it is possible to generate probes for new, unknown sequences between the primer binding sites.

28

Techniques for the Hybridization and Detection of Nucleic Acids


• Similarly, probes for mutants with sequence variations in the probe area are readily available.
• The probes can be labeled during generation by using labeled nucleotides or primers, leading to a uniform density of labeling.

• By using new DNA polymerases or blends of polymerases, probes that are kilobases in length are possible, instead of the 100–1000 bp obtained with standard Taq DNA polymerase.

These advantages have made PCR the method of choice for the production of DNA probes. The probes can often be used directly in hybridizations; to avoid co-hybridization with unspecific amplification products, however, the amplification products are usually purified. Long PCR products can show secondary structure effects, resulting in lower sensitivity or unspecific hybridization signals; these can be avoided by an additional restriction digestion. Besides long PCR probes, synthetic oligonucleotides are increasingly used as hybridization probes. Modern oligonucleotide synthesis machines can create oligonucleotides up to a length of around 150 nucleotides with a defined sequence or with targeted sequence changes in any position. Oligonucleotide probes are well suited to detect point mutations. Oligonucleotides between 17 and 40 bp in length are used for this purpose, which allows the optimization of the hybridization and wash steps. Base pair mismatches are easiest to recognize when they are located in the middle of the hybridizing region; mutations in flanking sequences are not as well discriminated. A further advantage of short oligonucleotide probes is that they hybridize faster than longer probes. A disadvantage is their lower sensitivity (Section 28.1.2). The strength of oligonucleotide probes therefore lies not in the detection of single-copy genes or low-copy mRNA, but in the mutation analysis of PCR-amplified genes, strongly expressed mRNAs, or rRNA species, which are present in 10³–10⁴-fold amplified amounts. Oligonucleotide arrays are currently in development as hybridization tools for oligonucleotides and cDNA. These arrays contain a large number of oligonucleotide or cDNA capture probes with differing specificity.
With these chips, a large number of mutations in different positions of the target amplicon can be analyzed in mutation/polymorphism analysis or from differing cells, tissues, or organs in expression pattern analysis.
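For short oligonucleotide probes of the kind described above, hybridization and wash temperatures are chosen relative to the probe's melting temperature. A rough, widely used estimate for oligos up to about 20 nt is the Wallace rule (2 °C per A/T pair, 4 °C per G/C pair); the sketch below is illustrative only and ignores salt and formamide effects.

```python
def wallace_tm(seq: str) -> int:
    """Estimate the melting temperature (in degrees C) of a short DNA
    oligonucleotide via the Wallace rule: Tm = 2*(A+T) + 4*(G+C).
    Only a rough guide, valid for oligos up to roughly 20 nt."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc

# A 20-mer with 50% GC content:
print(wallace_tm("ACGTACGTACGTACGTACGT"))  # 60
```

Hybridization is then typically carried out a few degrees below this value, so that mismatched hybrids, whose Tm is lower, are removed in the wash steps.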

28.2.2 RNA Probes

Single-stranded RNA probes are produced by run-off in vitro transcription of sequences cloned into vectors containing bacteriophage SP6, T3, or T7 promoters (Figure 28.9). DNA fragments or PCR products are cloned into the multiple cloning site immediately downstream of the promoter. The recombinant vector is then cleaved directly at the 3´ end of the insert, which creates a fixed termination point for the run-off transcription. The strong promoter selectivity and the fixed

Figure 28.9 Synthesis of labeled RNA probes.

DNA Microarray Technologies, Chapter 37


Template DNA and Detection of in Vitro Transcripts, Section 34.3.3

termination point lead to transcripts of identical length. The transcription cycle terminates and reinitiates 100–1000 times, resulting in a high yield of the probe. Radioactive ribonucleotides or hapten-modified UTP can be added during transcription to label the RNA probes, in the same way as for PCR probes. Newer vectors contain two different promoters in opposite orientations on either side of the cloning region, which allows the transcription of complementary RNA strands with opposite polarities (sense/antisense RNA). To avoid unspecific hybridization signals from the vector segments, the in vitro transcripts are treated with RNase-free DNase. The main advantage of RNA probes is that DNA:RNA and RNA:RNA hybrid complexes have a higher melting point than DNA:DNA hybrid complexes. This provides increased sensitivity, so that low-abundance mRNA can also be detected in Northern blots or in situ. However, RNA probes are vulnerable to ubiquitous RNases, so all solutions and equipment must be treated with chemical additives such as diethyl pyrocarbonate and sterilized by heat prior to use. When used for in situ experiments, the run-off transcripts are treated with a limited amount of RNase, because full-length transcripts often penetrate the cell wall or membrane poorly; the shortened molecules penetrate better, which makes more probe available for hybridization in the nucleus and thus increases sensitivity.
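For long hybridization probes, the melting temperature, and with it the attainable stringency, is usually estimated from the salt concentration, GC content, and probe length. The sketch below uses a common empirical formula for long DNA:DNA hybrids (variants with adjusted constants and a formamide term exist for the more stable DNA:RNA and RNA:RNA hybrids mentioned above); the example values are illustrative.

```python
import math

def tm_long_probe(na_molar: float, gc_percent: float, length_nt: int) -> float:
    """Empirical Tm estimate (degrees C) for long DNA:DNA hybrids:
    Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 500/n."""
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 500 / length_nt

# Illustrative: a 1 kb probe with 50% GC in roughly 1x SSC (~0.195 M Na+):
print(round(tm_long_probe(0.195, 50.0, 1000), 1))  # 89.7
```

Lowering the salt concentration or shortening the probe lowers the Tm, which is why hybridization and wash stringency can be tuned via buffer composition and temperature.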

28.2.3 PNA Probes

Concept of Peptide Synthesis, Section 22.1

Figure 28.10 Structural comparison of PNA and DNA.

Synthetic peptide nucleic acids (PNAs) contain a peptide-like backbone, which gives them many advantages relative to DNA and RNA probes (Figure 28.10). PNA probes can be produced in peptide synthesis machines as well as in DNA synthesis machines: Boc chemistry is used for synthesis as a peptide analog, Fmoc chemistry for synthesis as a DNA analog. Both Boc and Fmoc chemistry rely on protective groups; only the structural elements of the backbone differ. The solubility of PNA oligomers can be increased by the introduction of charged terminal or internal side chains (e.g., Glu, Lys) or charged groups in the spacers of the labeling groups, so that the synthesis of oligomers up to about 30-mers is possible. PNA oligomers have a range of advantages relative to DNA oligonucleotides:

• Greater hybrid stability; therefore higher temperatures, and correspondingly more stringent hybridization conditions, can be used.
• Shorter oligomers have higher diffusion rates and faster reaction kinetics.
• Hybridization is independent of the ion concentration, allowing low salt concentrations: double-stranded PCR products hybridize without denaturing.
• Hybridization at low salt concentrations opens up potential secondary structures within the target molecules.
• The Tm difference between matched and mismatched base pairs is more pronounced with PNA probes than with DNA or RNA probes. This allows better mismatch discrimination.
• The discrimination of mismatched bases is optimal throughout the entire length of the probe, except for the terminal three or four bases on the flanks.
• PNA probes are more resistant to nucleases and proteases due to the artificial structure of their backbones and base linkage, which increases the stability of the probes.
• The solubility of PNA can be increased by the use of charged amino acids (e.g., Lys, Glu) in the backbone or by charges in the marker linkers.

These advantages make PNA oligomers an attractive alternative to oligonucleotide probes for point mutation analysis. They are particularly relevant to array systems on chips, since the selective detection of mismatches and the avoidance of secondary structure effects are of central importance in these systems, as is the solubility of the PNA probes, which is achieved with a long linker molecule. PNA capture probes allow the selective isolation of target nucleic acids through the formation of very stable triplexes on certain target sequences or through duplex formation in mixed target sequences.

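The practical benefit of a larger Tm difference between matched and mismatched hybrids can be sketched briefly: the wash temperature must lie above the Tm of the mismatched hybrid but below that of the matched hybrid, so a larger ΔTm gives a wider usable temperature window. All temperatures below are invented for illustration and are not measured PNA or DNA data.

```python
def discrimination_window(tm_match: float, tm_mismatch: float) -> float:
    """Width (degrees C) of the temperature window in which a wash step
    melts the mismatched hybrid but retains the matched one."""
    return tm_match - tm_mismatch

# Invented, illustrative values: a DNA probe with a small delta-Tm
# versus a PNA probe with a larger delta-Tm for the same mismatch.
dna_window = discrimination_window(tm_match=60.0, tm_mismatch=55.0)
pna_window = discrimination_window(tm_match=60.0, tm_mismatch=48.0)
print(dna_window < pna_window)  # True
```

A wider window makes the wash temperature far less critical, which is one reason mismatch discrimination is more robust with PNA probes.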
28.2.4 LNA Probes

Locked nucleic acids (LNAs) are a new class of bicyclic DNA analogs in which the 2´-O and 4´-C positions of the furanose ring are connected by a methylene bridge (LNA: stabilization by bicyclic


bridged ribose; Figure 28.11). The binding of these analogs to complementary nucleic acids is the strongest of the known DNA analogs. LNA probes are more hydrophilic than PNA probes and are resistant to nucleases, which makes them a desirable alternative for some purposes. In addition:

• LNA probes are more soluble than PNA probes, which makes their hybridization characteristics similar to those of DNA probes.
• The highest Tm values of any oligonucleotide analog pair are seen in LNA:DNA mixed oligonucleotides. This allows the use of shorter oligonucleotide probes for the determination of point mutations, which increases the discrimination between wild-type and mutant targets, similarly to PNA.

The oxygen of the bicyclic bridge can also be replaced by a thio or amino bridge (X = O, S, NH in Figure 28.11).

28.3 Methods of Labeling

Non-radioactive modifications can be incorporated into probes by enzymatic, photochemical, or chemical reactions. Isotopes are usually incorporated into probes by enzymatic reactions. The labeling positions and the type of label differ between methods, depending on whether isotopes or non-radioactive reporter groups are used. DNA, RNA, and oligonucleotides can be labeled by the enzymatic incorporation of labeled nucleotides, resulting in densely labeled, highly sensitive probes. The photolabeling of DNA and RNA results in less densely labeled probes, but avoids the damage caused by radioactive labeling, which can affect the length of the probe; it is therefore the most suitable method for the synthesis of labeled size standards. Chemical labeling was initially used to label DNA fragments and is increasingly used for the labeling of DNA or PNA oligomers. Figure 28.12 shows an overview of the most common non-radioactive labeling reactions, which are discussed in this section. Enzymatic labeling uses either 5´-labeled primers (PCR labeling) or labeled nucleotides instead of, or in addition to, unlabeled nucleotides. Non-radioactive labeling uses nucleotide analogs modified with haptens like digoxigenin or fluorescein, or with conjugates like biotin (Section 28.4.3). Hapten-dUTP or hapten-dCTP can be used as the enzyme substrate for RNA, DNA, and oligonucleotide labeling; the latter two can also make use of hapten-cATP (cordycepin triphosphate). The labels can interfere with one another if they are too close together; consequently, maximum sensitivity requires a certain distance between them. The distance needed to achieve the highest labeling density and the necessary minimum separation are hapten-specific. The optimal labeling density is achieved with hapten-specific mixes of hapten-dNTPs and non-modified dNTPs (e.g., 33% DIG-dUTP/67% dTTP). High sensitivity also requires that the modifications are not buried in the helix structure.
Therefore, a spacer between the nucleic acid strand of the probe and the modifying group is critical. In the case of the haptens previously mentioned, the spacers are at least 11 atoms long and are often composed of oxycarbonyl elements, which are coupled via ester or amide bonds. The N and O atoms make the linker sufficiently hydrophilic.
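The arithmetic behind such hapten-dNTP/dNTP mixes can be sketched as follows. Assuming a uniform base composition (25% T) and complete incorporation, a given DIG-dUTP fraction translates directly into an average spacing between labels; real incorporation efficiencies for modified nucleotides are lower, so the effective spacing is larger and the numbers here are illustrative only.

```python
def mean_label_spacing(label_fraction: float, base_fraction: float = 0.25) -> float:
    """Average distance (in nucleotides) between two labels when a
    fraction `label_fraction` of one base type (present at
    `base_fraction` of all positions) carries the hapten.
    Idealized: uniform base composition, complete incorporation."""
    return 1.0 / (label_fraction * base_fraction)

# Illustrative: the 33% DIG-dUTP / 67% dTTP mix mentioned above,
# with an assumed 25% T content:
print(round(mean_label_spacing(0.33), 1))  # 12.1
```

This idealized spacing of roughly one label per 12 nt is denser than the labeling densities actually observed, in part because modified nucleotides are incorporated less efficiently than their natural counterparts.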

28.3.1 Labeling Positions

Radioactive labeling involves exchanging stable natural isotopes for unstable radioactive isotopes at different positions of the nucleoside triphosphate, depending on the nature of the isotope. Exchanging isotopes does not change the chemical structure, so the labeled molecules have the same chemical properties as the natural substances. This means that the reaction conditions for enzymatically incorporating labeled nucleotides into probes do not have to be changed, nor do the hybridization conditions require adaptation. The most commonly used labels are ³²P or ³³P phosphates, exchanged for either the α- or γ-phosphate residue in 2´-deoxyribo-, 3´-deoxyribo- (cordycepin), or 2´-ribonucleotides. Labeling with ³⁵S replaces an oxygen atom of the α-phosphate with the radioactive sulfur (Figure 28.13a). The ³²P- or ³⁵S-labeled α position remains attached to the nucleoside when probes are homogeneously labeled by polymerases (e.g., random-primed labeling, nick translation, reverse transcription, or PCR amplification), whereas end labeling, such as with T4 polynucleotide kinase, involves transfer of the labeled γ-phosphate from ATP to the free 5´-terminal OH of the

Figure 28.11 Structural comparison of LNA and DNA; X = O, S, NH.


Figure 28.12 Schematic representation of the enzymatic, photochemical, and chemical labeling reactions, which are discussed in the following sections. Source: from Kessler, C. (1992) Non-radioactive Labeling and Detection of Biomolecules, Springer, Berlin, Heidelberg; and Kessler, C. (ed.) (2000) Nonradioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.


Figure 28.13 Exchange positions for radioactive labels: (a) nucleoside triphosphate; (b) positions of the base rings. For the residues R1 to R3, see the inside front cover; X = N, CH (7-deaza purine).

probe. The ³H isotope is usually used for in situ applications, owing to its low radiation scatter and long half-life; it is incorporated at various positions of the base ring. Labeling with ¹²⁵I is at the C5 position of cytosine (Figure 28.13b). The low radiation intensity and very long half-life of ¹⁴C, with its attendant disposal problems, have made the use of this isotope rare. The chemical structure of labeled nucleotides and probes is changed by the non-radioactive modification with reporter groups. This means, for example, that the reaction conditions for enzymatic incorporation of the label need to be adapted to the altered substrate characteristics. Unspecific binding to the modifications must be blocked with suitable blocking reagents (e.g., milk proteins); in the case of blot hybridization the entire membrane surface must be blocked, while for in situ hybridization the cell or tissue surface requires blocking. The correspondingly modified protocols are, however, well established and do not limit the use of non-radioactive probes.

The most commonly labeled position for sequencing primers, PCR primers, or hybridization oligomers (DNA, PNA) is the 5´ terminus. This preserves the characteristics of the probe, and the formation of hydrogen bonds during hybridization is not influenced. Modifications are introduced through bifunctional linear diamino linkers of variable length. Nucleotides are usually labeled via base modifications; however, labeling of the 2´ position of ribose has also been described. The base modifications are chosen such that they do not interfere with the formation of hydrogen bonds during hybridization. The most common modification position is the C5 position of uracil or cytosine; in the case of deoxyuridine, the modification group imitates the methyl residue of the thymine base of deoxythymidine (compare Figure 28.13b). In both cases the modifying group is not introduced directly, but attached via a spacer. Other suitable positions for the introduction of non-radioactive modifying groups are the C6 of cytosine and the C7 of 7-deaza-guanine and 7-deaza-adenine (Figure 28.13). The amino groups of cytosine, adenine, or guanine, which were frequent targets in the past, are less well suited, since these positions are involved in hydrogen bonding.

28.3.2 Enzymatic Labeling

Many of the enzymatic labeling reactions are analogous for radioactive and non-radioactive labels. The difference is that in radioactive labeling the enzyme substrates are structurally identical, isotope-labeled nucleotides, while non-radioactive labeling involves additionally modified nucleotides, which may require adaptation of the reaction conditions.


Enzymatic Labeling of DNA

Homogeneous DNA labeling is carried out by random priming with the large fragment of Escherichia coli DNA polymerase I (Klenow enzyme), by nick translation with the Escherichia coli DNA polymerase I holoenzyme (Kornberg enzyme), or by PCR amplification with Taq DNA polymerase. The density of the labeling is around one label per 25–36 base pairs. Oligonucleotides can be labeled with the help of the terminal transferase reaction; depending on the substrate, one to five labels per oligonucleotide are attached (for an overview see Figure 28.12).

Random Priming

In the random priming protocol, double-stranded DNA is denatured and rapidly cooled, and a high concentration of primer is added to prevent rehybridization of the target strands. The primers are a mix of all 4096 conceivable hexanucleotides (hence "random" primers), so that statistically every target sequence is covered and hybridization can occur at any point in the sequence. The Klenow enzyme, the large fragment obtained from the Escherichia coli DNA polymerase I holoenzyme by subtilisin cleavage, extends the primers in a template-dependent reaction. During the elongation reaction, unlabeled dNTPs and hapten-modified dUTP are incorporated. Since each template strand is replicated and strand displacement leads to new rounds of synthesis, the yield exceeds 100% of the input template DNA. Since statistically several primers bind per target, each primer elongation replicates only part of the sequence; the result is a mix of probes of variable length. The partial sequences are all target-specific, however, and carry a homogeneous labeling. Random-primed DIG-labeled probes reach a high detection sensitivity in the sub-picogram range.

Nick Translation

Nick translation uses the Escherichia coli DNA polymerase I holoenzyme (Kornberg enzyme) and very small amounts of pancreatic DNase I. This technique requires the 5´-3´ exonuclease activity of the DNA polymerase, as well as its polymerase activity.
The DNase catalyzes the creation of single-strand nicks; a very precisely controlled, small amount of DNase is required to ensure that the number of nicks remains limited. The two ends flanking a single-strand break serve as substrates, the 3´ end for the polymerase activity and the 5´ end for the 5´-3´ exonuclease activity. The 5´-3´ exonuclease activity successively removes 5´-phosphorylated nucleotides, while the polymerase activity simultaneously fills the gap with new, labeled nucleotides. Through this concerted action the nick moves in the 5´-3´ direction, which is referred to as nick translation. It is a DNA replacement synthesis; the yield therefore remains below 100%, less than in random priming. Besides the lower labeling, getting the ratio of DNase I to Escherichia coli DNA polymerase exactly right is a critical factor. As a result, nick translation is going out of fashion for labeling probes.
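The count of 4096 primers quoted above for random priming is simply the number of possible DNA hexamers, 4⁶; a short enumeration confirms it:

```python
from itertools import product

# All possible DNA hexanucleotides (4^6 = 4096), as used in the
# "random" primer mixes for random-primed labeling.
hexamers = ["".join(p) for p in product("ACGT", repeat=6)]
print(len(hexamers))  # 4096
```

Because every hexamer is present, statistically every stretch of the denatured template offers a priming site, which is what makes the labeling homogeneous along the probe.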

Polymerase Chain Reaction, Chapter 29

PCR Amplification

As previously mentioned, PCR amplification with Taq DNA polymerase is a useful method for the creation of vector-free probes. The amplification reaction consists of 30–40 temperature cycles with three partial reactions – denaturation, primer binding, and primer elongation – each at different temperatures. Besides this three-step protocol, two-step protocols have also been described, in which primer binding and primer elongation take place in a single step. Labeled primers or labeled dNTPs are used in the creation of labeled products as hybridization probes. PCR amplification is explained in detail in the next chapter.

Reverse Transcription

Labeled DNA probes can also be synthesized from RNA targets with viral reverse transcriptases (e.g., AMV RTase, MMLV RTase). For first- and second-strand cDNA synthesis, unlabeled dNTPs and hapten-labeled dNTPs are added to the reaction mix to label the probe.

Terminal Transferase Reaction

DNA oligonucleotides can be labeled enzymatically by the template-independent attachment of labeled dNTPs with the enzyme terminal transferase, sometimes referred to as tailing. If a mix of labeled and unlabeled dNTPs is employed, tails with multiple labels are created in the template-independent reaction. Using labeled cordycepin triphosphate (3´-dATP) or 2´,3´-ddNTPs allows the attachment of only a single labeled nucleotide, since the reduced 3´ position cannot be extended further (see Figure 28.12: enzymatic oligonucleotide 3´-end labeling, oligonucleotide 3´-tailing).

Template DNA and Detection of in Vitro Transcripts, Section 34.3.3

Enzymatic Labeling of RNA

RNA can also be labeled using the previously described in vitro run-off transcription with bacteriophage-encoded SP6, T3, or T7 RNA polymerases (Section 28.2.2). Owing to the re-initiation of transcription, high synthetic yields are achieved (up to


20 μg transcript from 1 μg recombinant vector). As with enzymatic DNA labeling, the labeling density is around one label for every 25–36 nucleotides.

28.3.3 Photochemical Labeling Reactions

There is only one photochemical DNA and RNA labeling reaction: aryl azide-activated haptens are reacted with nucleic acids under long-wavelength UV light. Light excitation leads to the release of elemental nitrogen (N2); the short-lived reactive nitrene reacts with different positions of the DNA or RNA and thus covalently bonds the hapten to the nucleic acid. The labeling density is relatively low, however, at around one label per 200–400 base pairs. A range of photoactive substances intercalates into nucleic acids and can subsequently be covalently bound to the nucleic acid bases in a photoreaction. These substances include coumarin compounds (psoralen, angelicin), acridine dyes (acridine orange), phenanthridines (ethidium bromide), phenazines, phenothiazines, and quinones (Figure 28.14). The most important of these for photolabeling is the bifunctional psoralen, which forms either mono- or bis-adducts with pyrimidine bases after intercalation and photoactivation.

Figure 28.14 Intercalating, photoactive substances for the detection of nucleic acids. The compound first intercalates and a subsequent photoreaction bonds it to the nucleic acid strand covalently.

28.3.4 Chemical Labeling

Chemical labeling reactions are used mainly for the labeling of DNA and LNA oligonucleotides and PNA oligomers. Labeling can take place during solid-phase oligonucleotide synthesis by the direct incorporation of modified phosphoramidites. Alternatively, labeling can take place after synthesis by coupling the modifying group to the 5´-terminal phosphate of the oligonucleotide with the help of bifunctional diamino compounds as linkers. Protected phosphoramidites containing the desired hapten are commercially available. Such labeled oligonucleotides can often be used directly after removal of the protective groups, without further HPLC purification. A second possibility is the incorporation, during nucleic acid synthesis, of uracil or cytosine phosphoramidites carrying protected allyl- or propargylamine residues at the C5 position, or of 7-deaza-purines modified at the C7 position (Figure 28.15). After removal of the protective groups, these react with N-hydroxysuccinimide-activated haptens or other coupling-capable reporter

Principles of the Synthesis of Oligonucleotides, Section 27.6.1

Figure 28.15 Examples of modified phosphoramidites used in oligonucleotide synthesis. Shown are a pyrimidine and a purine nucleoside as DNA and RNA building unit. R1 and R2 are protecting groups, R3 = H, OH, and X = hapten, reporter molecule.


molecules. In this case, subsequent purification by gel electrophoresis or HPLC is often necessary, owing to the often non-quantitative synthetic yield. Oligonucleotides can be coupled directly to labeling enzymes like alkaline phosphatase or horseradish peroxidase with bifunctional linking reagents. Direct detection of the enzyme is possible after hybridization with such enzyme-tagged probes; however, the hybridization temperature and duration are limited by the loss of enzyme stability. The chemical labeling of PNA, after synthesis and removal of the protective groups, involves coupling of the terminal amino group with the hapten with the aid of the previously described bifunctional diamino compounds.

28.4 Detection Systems

Radioactively and non-radioactively labeled probes, primers, and nucleotides are the central components of nucleic acid detection systems for sequencing and hybridization. The hybridization formats allow greater flexibility in the use of labels and detection components than sequencing protocols, because both direct and indirect systems can be used, in which a particular modifying group can be detected and measured in multiple ways. This allows great variety in the application of non-radioactive detection systems, not only in blot formats but also in the full range of qualitative and quantitative systems. If the detection of particular sequences is unnecessary and only the pattern of restriction fragments is of interest, staining with intercalating compounds like ethidium bromide is the method of choice. Newly formed nucleic acids can also be detected directly in scintillation counters after incorporation of radioisotopes and separation of the unincorporated labeled nucleotides; while these methods were once in widespread use, they are now only rarely employed.

28.4.1 Staining Methods

Determination of Nucleic Acid Concentration, Section 26.1.4

The absorption of UV light can be used to determine the concentration of nucleic acids; double-stranded DNA can also be detected by the intercalation of fluorescent dyes or by silver staining. Although silver staining is more sensitive, the visualization of DNA fragments in gels is routinely done by intercalation of ethidium bromide and illumination with UV light to generate a fluorescent signal. Like all staining methods, the sensitivity of ethidium bromide staining depends on the length of the fragments; it lies in the nanogram range, while the sensitivity of silver staining is in the sub-nanogram range.

28.4.2 Radioactive Systems

The β-emitters ³²P, ³³P, and ³⁵S are usually used for blotting procedures with radioactive probes. Although ³⁵S isotopes have approximately tenfold less energy, they have the advantage of a longer half-life. ³³P isotopes are more expensive, but offer more energy than ³⁵S and better resolution than ³²P, making them a good compromise in some situations. ³H isotopes are also β-emitters but have tenfold less energy than ³⁵S. Owing to its high stability and low scattering, this isotope is used in situ and in tissues, though these techniques are increasingly switching to non-radioactive methods. Scattering refers to the tendency of high-energy emitters to expose film even when striking it at an oblique angle, which blurs the signals into diffuse reflections. Trichloroacetic acid (TCA) precipitations have been used with ¹⁴C, but today this isotope is only of historical significance. Table 28.1 lists the key facts for probe labeling and sequencing. The highest-energy isotope in common use is ³²P, which makes it the most sensitive. However, its short half-life and high scattering limit the use of this isotope in sequencing and in the analysis of complex fragment patterns, since the resolution of closely packed bands is limited; the legibility of sequencing gels is also limited towards the origin. When high resolution is more important than high sensitivity, ³³P or ³⁵S are better choices; examples are enzymatic sequencing and the analysis of DNA- or RNA-binding proteins in gel shift assays via the altered mobility of the protein-bound fragments in gels. Besides ³H, the low radiation intensity of ³⁵S and ¹²⁵I makes them suitable for in situ applications, where limiting β-scatter is important for determining exact cellular localizations.

Table 28.1 Characteristics of radioisotopes used to label probes.

Isotope | Particle | Emax (MeV) | Half-life  | Application                               | Characteristics
³H      | β        | 0.0118     | 12.3 years | In situ                                   | Low sensitivity, high resolution
³⁵S     | β        | 0.167      | 87.4 days  | Filter hybridization, sequencing, in situ | Medium sensitivity, good resolution
¹²⁵I    | γ        | 0.035      | 60.0 days  | In situ                                   | Low sensitivity, high resolution
³²P     | β        | 1.71       | 14.2 days  | Filter hybridization, sequencing          | Highest energy, highest sensitivity, medium resolution due to reflection

Radioisotopes are commercially available in the form of nucleotides already labeled at the desired position. The aqueous solutions contain stabilizers to inhibit the degradation of the biologically active substances by the ionizing radiation. The isotopes are stored at –20 or –70 °C to minimize decomposition. Owing to the steady degradation and formation of radicals, radioactive nucleotides should be used as soon as possible, even when the half-life would allow longer use. The most important factors with respect to the storage of radioactively labeled probes are:

• the half-life of the isotope;
• the specific radioactivity of the probe: probes with a high labeling density are very sensitive but are subject to rapid degradation;
• the position of the radioactive atom in the molecule: internal labeling leads to strand breaks more easily than terminal labeling.

The detection of radioactive nucleic acids on blots takes place by autoradiography with X-ray film, which can be stored permanently to document the results. There are different means of detection, depending on the isotope and the required sensitivity:

• Direct autoradiography: The radiating surface (membrane, gel, cell layer, or tissue slice) is brought into direct contact with the X-ray film. This method applies to all β-emitters. Film without a protective layer is required for ³H, so that the energetically weak electrons can penetrate to the photoactive layer.
• Fluorography: The radiating surface is treated with fluorescing chemicals, which convert the radiation energy into fluorescence; the most common fluorophores are 2,5-diphenyloxazole (PPO) and sodium salicylate.
• Indirect autoradiography with intensifier screens: High-energy β-emissions are absorbed by the phosphor of the intensifier screen and converted into visible light; in storage phosphor (imaging plate) systems, the stored energy is read out by laser illumination.
• Fluid emulsions for cytological or cytogenetic in situ applications: The low- to mid-energy ³H or ³⁵S decay products require direct contact with the detection medium; the solid emulsion is melted at 45 °C and the slide is dipped into it. After drying, exposure occurs for days to months at 4 °C in the absence of light.
• Pre-exposed X-ray film for direct autoradiography and fluorography: A short pre-exposure activates the silver grains, which then require fewer photons to generate a signal. The pre-exposure can only be used for fluorography or light intensifiers (light processes).

The right method of detection depends on the characteristics of the isotope in use (type, specific radioactivity, and total radioactivity) as well as the requirements of the experiment, such as the required image sharpness and the maximum feasible exposure time.
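The influence of the half-life mentioned above on probe storage is simple exponential decay. The following sketch estimates the residual activity of a labeled probe after storage; the half-life values are standard physical constants and are not taken from the text:

```python
# Remaining activity of a radiolabeled probe after storage.
# Half-lives in days (approximate standard physical values, not from the text).
HALF_LIFE_DAYS = {"32P": 14.3, "33P": 25.3, "35S": 87.4, "125I": 59.4, "3H": 4490.0}

def remaining_fraction(isotope: str, days_stored: float) -> float:
    """Fraction of the original radioactivity left after the given storage time."""
    return 0.5 ** (days_stored / HALF_LIFE_DAYS[isotope])

# A 32P-labeled probe stored for two weeks has lost roughly half its activity,
# while a 35S-labeled probe decays far more slowly over the same period.
print(f"32P after 14 days: {remaining_fraction('32P', 14):.2f}")
print(f"35S after 14 days: {remaining_fraction('35S', 14):.2f}")
```

Note that, as the text stresses, radiolysis of the probe often limits usability before radioactive decay does, so the computed fraction is an upper bound on practical shelf life.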

28.4.3 Non-radioactive Systems

Non-radioactive labeling and detection systems are divided into direct and indirect indicator systems (Figure 28.16).


Part IV: Nucleic Acid Analytics

Figure 28.16 Direct and indirect detection systems.

The two types of systems differ in the number of components and reaction steps, as well as in their flexibility of use. While direct systems are primarily used in standardized processes (e.g., for labeling universal sequencing primers), the more flexible indirect systems are used to selectively detect nucleic acid sequences. In direct systems, the probes are directly and covalently coupled to the signal-generating reporter group; detection takes place in two reaction steps:

 hybridization between the nucleic acid target and the directly labeled probe;
 signal generation by the directly coupled reporter group.

The advantage of direct systems is that only hybridization between the target and the probe is needed. The disadvantage, however, is that each probe sequence must be covalently coupled to the detection label. Therefore, this detection method is used primarily for labeling standard sequences, such as sequencing primers, with easily coupled fluorescent dyes. In indirect systems, the probes are not directly labeled; instead, they are detected via an additional, non-covalent interaction between a low molecular weight tag and a universal detector. Thus, indirect systems first require the enzymatic, photochemical, or chemical incorporation of the modifying group into the probe (Section 28.3). These incorporation reactions are straightforward, and the corresponding protocols are well established. A universal detection unit binds to the tag, independent of the type of probe and its specificity. The detector, which contains a binding unit as well as the reporter, couples specifically and with high affinity to the tag of the probe. Indirect detection takes place in three reaction steps:

 hybridization between the nucleic acid target and the modified probe;
 specific, high affinity, non-covalent interaction between the modifying groups of the probe and the binding component of the universal detection unit;
 signal generation by the indirectly bound reporter component.

Although an additional reaction step is required, the high flexibility in generating probes and coupling them with diverse types of detectors is a significant advantage of indirect systems. Simple and fast reactions can be used to build different tags into different types of probe; in addition, the tags can be detected with a large set of alternative, universal detectors, depending on the application. The additional non-covalent interaction makes many combinations possible, allowing broad use of non-radioactive reporter systems in basic research and in the applications described at the beginning of the chapter.

Direct Detection Systems The most commonly used non-radioactive reporter groups in direct detection systems are fluorescent or luminescent reporter groups, as well as reporter enzymes. Gold labeling is used for in situ applications; additional use of colored latex beads or silver staining amplifies the signal up to 10^4-fold. Box 28.1 gives an overview of important reporter groups.

28 Techniques for the Hybridization and Detection of Nucleic Acids

Box 28.1 Important direct, non-radioactive reporter groups.

Fluorescent labels
 Direct fluorescence: fluorescein, rhodamine, hydroxycoumarin, benzofuran, Texas Red, bimane, ethidium/Tb3+
 Time-resolved fluorescence: lanthanide (Eu3+/Tb3+) micelles/chelates
 Fluorescence energy transfer (FRET): fluorescein (FAM), rhodamine (TAMRA), Cy3, Cy3.5, Cy5, Cy5.5, Cy7

Luminescence labels
 Chemiluminescence: (iso-)luminol derivatives, acridinium esters
 Electroluminescence: Ru2+ (2,2´-bipyridine)3 complexes
 Luminescent energy transfer

Metal labels
 Gold-labeled antibodies

Enzyme labels
 Direct enzyme coupling: alkaline phosphatase, horseradish peroxidase, microperoxidase, β-galactosidase, urease, glucose oxidase, glucose-6-phosphate dehydrogenase, hexokinase, bacterial luciferase, firefly luciferase
 Enzyme channeling: glucose oxidase:horseradish peroxidase
 Enzyme complementation: inactive β-galactosidase:α-peptide

Polymer labels
 Latex particles
 Polyethyleneimine

Examples of marker enzymes are alkaline phosphatase (Jablonski, E. et al. (1986) Nucl. Acids Res., 14, 6115–6128) and horseradish peroxidase (Renz, M. and Kurz, C. (1984) Nucl. Acids Res., 12, 3435–3444), as well as fluorescent tags such as fluorescein or rhodamine (Kessler, C. (1994) J. Biotechnol., 35, 165–189); see also Kessler, C. (ed.) (2000) Non-radioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.

Bacterial alkaline phosphatase (AP) is mainly used for the direct labeling of oligonucleotides, and horseradish peroxidase (POD) for the direct labeling of fragments. The use of marker enzymes requires an additional substrate reaction (see below). Coupling alkaline phosphatase to oligonucleotides is a single-step reaction using a bifunctional linker. Directly AP-coupled oligonucleotides are useful in standard assays with a fixed sequence. For example, AP-coupled primers are employed in sequencing, in direct blotting electrophoresis (DBE), and as universal amplifiers in signal amplification systems such as probe brushes (Section 28.5.3). The use of POD-labeled fragment probes is limited, since POD is increasingly unstable above 42 °C, which restricts the maximum hybridization temperature to a range that is not suitable for all purposes. Well-known fluorescent labels are fluorescein, rhodamine, and coumarin derivatives. Higher sensitivity is achieved with phycoerythrins or fluorescein latexes; in these cases the coupling


FISH Analysis of Genomic DNA, Section 35.2.1

reactions are more complex. Besides their use on sequencing primers, fluorescent markers are mainly used for fluorescence in situ hybridization (FISH). The detected fluorescence can contain unspecific signals due to background light or fluorescent reaction components. For example, hemoglobin fluoresces, which causes interference in experiments carried out with serum. This can be avoided with time-resolved fluorescence measurements using europium or terbium complexes, chelates, or micelles coupled directly to the probe via a linker, since the emission of the secondary light is delayed in these cases.

Direct luminescence markers can be grouped according to their activation type, which can be chemical, electrochemical, or biochemical. Well-known chemically activatable markers for the direct measurement of nucleic acids are acridinium esters, which are activated by H2O2/alkali, as well as the protein aequorin from the jellyfish Aequorea, which is activated by Ca2+ ions. Acridinium esters glow, releasing photons over a long time frame. Aequorin flashes a short, intense pulse of light; the extremely low background makes the signal very specific, leading to high sensitivity. Electrochemical luminescence markers are stimulated to emit photons by electrochemical reactions. Corresponding markers are [Ru2+(bipyridyl)3] or phenanthroline complexes. The ruthenium ions are oxidized at a gold electrode (Ru2+ → Ru3+), and the subsequent reduction of Ru3+ by tripropylamine (TPA) creates a chemiluminescent signal. The resulting Ru2+ ion is then available for the next reaction cycle.

Gold particles can be used in blot or in situ formats to directly visualize targets. Additional silver staining can increase the sensitivity of the detection: the original gold particles have silver layered onto them, increasing their size and making them easier to see.
Indirect Detection Systems Several different couplings are available to indirect detection systems due to the additional specific interaction between the modified nucleic acid hybrid and the universal detector carrying the reporter group. Table 28.2 shows the coupling groups commonly used for indirect nucleic acid detection. Most systems use antibodies or the biotin-binding proteins avidin or streptavidin to recognize special modifying groups attached to nucleic acids, but other, less widespread systems exist that use specific sequences in the hybrid or specific hybrid conformations as binding components (operators or promoters) to bind proteins like repressors or RNA polymerases. The DIANA concept rests on the binding of lac repressor:β-galactosidase conjugates to hybrid-coupled lac operator tag sequences. In addition, non-sequence-specific binding proteins like the single-stranded binding (SSB) protein or histones have been employed as tags. Conformation-specific antibodies are examples of conformation-recognizing binders. Tagging probes with metal ions or with poly(A) tails was also used in the early days of the development of non-radioactive reporter systems. Of the various systems, only the biotin (BIO) system and antibody systems with digoxigenin (DIG), fluorescein (FLUOS), and 2,4-dinitrophenol (DNP) have sensitivity in the sub-picogram range. These systems have therefore become standards for the non-radioactive detection of nucleic acids, while the other systems are described more out of historical interest. Enzymatic labeling takes place with hapten-labeled nucleotides; as examples of these labels, the structures of DIG-, FLUOS-, and BIO-labeled nucleotides are shown in Figure 28.17a.

Table 28.2 Interaction pairs for indirect, non-radioactive detection systems. Source: from Kessler, C. (ed.) (1992) Non-radioactive Labeling and Detection of Biomolecules, Springer, Berlin, Heidelberg; for further interaction pairs see Kessler, C. (ed.) (2000) Non-radioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.

DNA modification ↔ binding partner: examples
Vitamin ↔ binding protein: biotin ↔ streptavidin
Hapten ↔ antibody: digoxigenin ↔ anti-digoxigenin antibody
Protein A ↔ constant region of IgG: protein A ↔ IgG
DNA/RNA hybrid ↔ DNA/RNA-specific antibody: DNA/RNA ↔ anti-DNA/RNA antibody
RNA/RNA hybrid ↔ RNA/RNA-specific antibody: RNA/RNA hybrid ↔ anti-RNA/RNA antibody
Binding protein DNA sequence ↔ binding protein: T7 promoter ↔ Escherichia coli RNA polymerase
Heavy metal ↔ sulfhydryl reagent: Hg2+ ↔ HS-TNP ↔ (TNP/DNA)-specific antibody
Polyadenylation ↔ polynucleotide phosphorylase/pyruvate kinase: ATP-coupled firefly luciferase reaction


Figure 28.17 (a) Structure of hapten and biotin-labeled dNTPs. (b) DNP-modified phosphoramidites.


Figure 28.18 Structure of digoxigenin. Digoxigenin is a steroid with the formula C23H34O5. The A/B rings are cis-fused, the B/C rings are trans-fused, and the C/D rings are cis-fused.

FISH Analysis of Genomic DNA, Section 35.2.1

The labeling of oligonucleotides takes place mainly through modified phosphoramidites; as an example, the DNP-modified phosphoramidite is shown in Figure 28.17b. Mixtures of differently labeled probes are used for the parallel detection of different fragments on blots (DIG, BIO, FLUOS: rainbow detection) or in situ for the detection of different chromosomal segments or different chromosomes (DIG, BIO, DNP: multiplex FISH, chromosome painting).

The Digoxigenin System Digoxigenin is the chemically synthesized aglycone of the cardenolide lanatoside C (Figure 28.18). The digoxigenin:anti-digoxigenin (DIG) system is based on the specific interaction between digoxigenin and a high affinity, DIG-specific antibody covalently bound to a reporter group. The DIG system can specifically detect sub-picogram amounts

Table 28.3 Important optical, luminescent, and fluorescent detection types. Source: from Kessler, C. (ed.) (1992) Non-radioactive Labeling and Detection of Biomolecules, Springer, Berlin, Heidelberg; for further detection types see Kessler, C. (ed.) (2000) Non-radioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.

Blots
 Optical: AP/BCIP, NBT; AP/naphthol-AS azo dye, diazonium salt; HRP/TMB; immunogold
 Luminescence: AP/AMPPD, Lumiphos™, CSPD®, CDP-Star®; β-Gal/AMPGD
 Fluorescence: fluorescein, rhodamine, hydroxycoumarin; AP/AMPPD plus fluorescein/rhodamine/hydroxycoumarin; β-Gal/AMPGD plus fluorescein/rhodamine/hydroxycoumarin

Solution
 Optical: AP/p-NPP; β-Gal/CPRG; POD/ABTS™; GOD:POD pair/ABTS™; hexokinase:G-6-PDH pair/ABTS™
 Luminescence: AP/AMPPD, Lumiphos™, CSPD®, CDP-Star®; β-Gal/AMPGD; POD/luminol, isoluminol; xanthine oxidase/cyclic dihydrazide; aromatic peroxalate compounds/H2O2; hydrolysis of acridinium esters; rhodamine:luminol pair; AP/D-luciferin-O-phosphate:firefly luciferase/ATP/O2; Renilla luciferase:green fluorescent protein; Ru2+(bpy)3
 Fluorescence: AP/4-MUF-P; β-Gal/4-MUF-β-Gal; POD/homovanillic acid-o-dianisidine/H2O2; G-6-PDH/phenazinium salt; lanthanide (Eu3+/Tb3+) complexes, micelles, chelates

In situ
 Optical: AP/BCIP, NBT; POD/TMB; immunogold
 Fluorescence: fluorescein; rhodamine; hydroxycoumarin

Abbreviations:
ABTS: 2,2´-azino-di(3-ethyl)benzothiazoline sulfate
AMPGD: 3-(4-methoxyspiro(1,2-dioxetane-3,2´-tricyclo[3.3.1.1 3,7]decan)-4-yl)phenyl β-galactoside
AMPPD: 3-(4-methoxyspiro(1,2-dioxetane-3,2´-tricyclo[3.3.1.1 3,7]decan)-4-yl)phenyl phosphate, Na2
BCIP: 5-bromo-4-chloro-indolyl phosphate
CPRG: chlorophenol red β-galactoside
CSPD: 3-(4-methoxyspiro(1,2-dioxetane-3,2´-(5´-chloro)tricyclo[3.3.1.1 3,7]decan)-4-yl)phenyl phosphate, Na2
Naphthol-AS: 2-hydroxy-3-naphthoic acid anilide
Diazonium salts: Fast Blue B, Fast Red TR, Fast Brown RR
GOD: glucose oxidase
MUF: methylumbelliferyl
NBT: nitro blue tetrazolium salt
NPP: p-nitrophenyl phosphate
POD: horseradish peroxidase
TMB: 3,3´,5,5´-tetramethylbenzidine


Figure 28.19 Digoxigenin detection system with its detection marker alternatives: fluorescent, luminescent, and optical.

of DNA or RNA. Alkaline phosphatase, covalently conjugated to the antibody, is often used as the reporter group; it catalyzes the conversion of optical or chemiluminescent substrates (BCIP/NBT, AMPPD; Table 28.3). The DIG system is shown in Figure 28.19. Enzyme-linked immunosorbent assays (ELISAs) employ detection with universal antibody:marker enzyme conjugates. The high specificity of detection and the low background level of the DIG system are due to the fact that digoxigenin occurs naturally only in Digitalis plants (foxglove) in the form of lanatoside compounds; the antibodies employed thus do not recognize any cellular component of other biological materials. This is particularly important for in situ hybridizations. DIG-specific antibodies show very few unspecific cross-reactions with cellular components. Only human serum contains components that are known to interact with digoxigenin-specific antibodies; these serum components can, however, be specifically removed by pretreatment of the polyclonal anti-DIG antibodies with serum.

Due to the high specificity of the digoxigenin antibodies used, even structurally similar steroids are recognized only to a very limited extent or not at all; examples are the bufadienolide k-strophanthin (cross-reactivity < 0.1%) and the steroids estrogens, androgens, and progestogens (cross-reactivity < 0.002%). The important difference between digoxigenin and the sex hormones is that the C and D rings are cis-fused rather than trans-fused. Digoxigenin is isolated from the leaves of Digitalis lanata or Digitalis purpurea by cleavage of three digitoxose units and one glucose unit from the natural substance deacetyl-lanatoside C. It is coupled to the nucleotide via the –OH group at the C3 position of the cardenolide frame with a linear spacer. This does not interfere with antibody binding, so detection with antibody conjugates is still possible after incorporation of digoxigenin into the probe.

To further reduce unspecific effects, only the Fab fragment of the antibody is used, rather than the complete antibody. The Fab fragment is isolated by papain cleavage of the constant Fc fragment; it contains just the short antibody arms with the highly variable binding sites. The complete antibody is only used in coupled systems with secondary reporter antibodies, which recognize the Fc portion. For example, a secondary antibody from mouse that recognizes the Fc portion of the DIG-specific sheep antibody serves to amplify the signal. In this case the reporter groups are not coupled to the primary DIG-specific antibody but instead to the Fc-specific secondary antibody.

The Biotin System In the biotin:avidin (or streptavidin) (BIO) system, the ubiquitous biotin, also known as vitamin H, is used as the tag (for the structure of biotin see Figure 28.17a). Coupling

Immune Binding, Section 5.3.3


employs linear spacers attached to the terminal carboxy group of the biotin side chain. Avidin, from egg white, and streptavidin, from the bacterium Streptomyces avidinii, have four high affinity binding sites for biotin; their binding constants, at about 10^15 M^–1, are among the highest known natural affinities. Avidin or streptavidin is covalently coupled with a reporter group to allow detection. After binding to the biotin-modified hybrid complex, secondary biotinylated detection components can be coupled to the remaining free biotin binding sites to amplify the signal. Although the BIO system is as sensitive as DIG, its disadvantage is that biotin is ubiquitous in biological material. As a result, unspecific binding of free endogenous biotin is possible with almost all cellular material, resulting in a high background, particularly for in situ formats. Another problem is the tendency of the two binding proteins to stick to membranes, even after the membranes have been saturated with blocking reagents. This stickiness has two causes: the high basicity of avidin and the presence of multiple tryptophan residues in the binding pockets of avidin and streptavidin. Both factors lead to unspecific polar or hydrophobic interactions with proteins of the membrane blocking reagents. In the case of avidin, unspecific binding is also caused by glycan chains on its surface, which can bind to sugar-binding proteins on the surface of cells (in in situ procedures) or in the membrane blocking reagents (in blotting procedures). Since streptavidin is isolated from bacteria, and prokaryotes do not glycosylate proteins, this factor plays no role for streptavidin. These unspecific background reactions can be reduced in several ways:

 Acetylation or succinylation of the lysine chains or complex formation between avidin and the acidic protein lysozyme reduces the basicity of avidin and thus its unspecific polar interactions.

 Deglycosylation of avidin reduces unspecific adsorption caused by binding to sugars.
 Pre-incubation of the blocked membrane in a buffer of high ionic strength reduces unspecific protein interactions.
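The practical meaning of the very high biotin:streptavidin binding constant quoted above (about 10^15 M^–1) can be sketched with a simple equilibrium calculation. The Langmuir occupancy form and the example concentrations below are illustrative assumptions, not values from the text:

```python
# Equilibrium occupancy of a streptavidin binding site as a function of the
# free biotin concentration, assuming Ka ~ 1e15 M^-1 (i.e., Kd ~ 1 fM).
KA = 1e15  # association constant in M^-1 (illustrative, per the text's order of magnitude)

def fraction_bound(free_biotin_molar: float) -> float:
    """Langmuir occupancy: theta = Ka*[B] / (1 + Ka*[B])."""
    x = KA * free_biotin_molar
    return x / (1.0 + x)

# Even at picomolar free biotin the sites are essentially saturated, which is
# why biotin:streptavidin complexes survive stringent washing steps.
print(f"1 pM free biotin: occupancy = {fraction_bound(1e-12):.6f}")
print(f"1 fM free biotin: occupancy = {fraction_bound(1e-15):.2f}")
```

The same arithmetic explains the background problem described above: trace endogenous biotin in the sample is also bound almost quantitatively.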

Despite the high binding affinity and the measures to reduce unspecific background, a poor signal-to-background ratio is observed at low target concentrations, which limits the sensitivity of the entire system. While this matters when BIO is used in a detection system, it is less important for other applications. The biotin:streptavidin binding system is often employed in the isolation of nucleic acids by hybridization with biotinylated capture probes. The samples are coupled to a streptavidin-coated surface (e.g., beads or microtiter plates) and the unbound analytes are washed away. The sequence-specifically bound nucleic acids can subsequently be amplified and detected specifically and without interference, for example, after incorporation of DIG.

The Fluorescein System The fluorescein:anti-fluorescein (FLUOS) system (Figure 28.17a) is another antibody-based system. The hapten, fluorescein, is coupled by an amide bond between the spacer and the free carboxy group of the hapten. The sensitivity of the fluorescein-specific antibody is also high. Since fluorescein is light-sensitive, however, exposure to light during storage causes a loss of sensitivity; this can be avoided by refrigerated storage protected from light.

The Dinitrophenol System The 2,4-dinitrophenol:anti-2,4-dinitrophenol (DNP) system is also based on antibody binding (Figure 28.17b). The hapten is bound at the aromatic C1 position, for example, by chemical conversion of 1-fluoro-2,4-dinitrobenzene with a spacer carrying a protected amino terminus into DNP-labeled phosphoramidites. These are then built into oligonucleotides during chemical synthesis. Since DNP is a synthetic compound, it also has few unspecific interactions with biological material. DNP-labeled oligonucleotides are often used for more complex in situ investigations, like multiplex FISH.

Non-radioactive Detection in Indirect Systems In indirect detection systems, the entire repertoire of detection systems can be conjugated to the component that binds the tag on the hybridizing nucleic acid, depending on the format. Besides the reporter groups used in direct detection systems (fluorophores, luminescent dyes, gold atoms; see above), marker enzymes are often coupled to the tag-binding component; these create an optically visible, luminescent, or fluorescent reaction product through a catalytic substrate reaction. Well-known marker enzymes are bacterial alkaline phosphatase (AP), horseradish peroxidase (POD), β-galactosidase (β-Gal), luciferase, and urease.


The alternatives for detection can be summarized as follows:

 optical systems: mixtures of indolyl derivatives and tetrazolium salts or diazonium salts for coupled redox reactions;
 chemiluminescent systems: dioxetane derivatives, (iso-)luminol derivatives, acridinium esters, aequorin;
 electrochemiluminescent systems: Ru2+ or phenanthroline complexes;
 bioluminescent systems: luciferase derivatives;
 fluorescent systems: Attophos, Eu3+- or La3+-complexes, -micelles, or -chelates;
 FRET systems: fluorescein and rhodamine derivatives;
 fluorescence quencher systems: fluorescein (FAM) and rhodamine (TAMRA) derivatives as the fluorescing component; Black Hole Quencher (BHQ), TAMRA, dabcyl, and cyanine derivatives as quenchers;
 metal precipitating systems: silver deposition on antibody-bound gold atoms (immunogold);
 electrochemical systems: urease-catalyzed pH changes;
 in situ detection systems: fluorescence in situ hybridization (FISH), primed in situ hybridization (PRINS), chromosome painting, multiplex FISH (M-FISH with SKY probes), comparative genome hybridization (CGH).

In addition, the following amplification reactions can be coupled to the primary signal generator (Section 28.5):

 probe crosslinking: generation of crosslinked structures (e.g., Christmas trees, probe brushes);
 crosslinking of binding components: examples include polystreptavidin, polyhapten, PAP, APAAP;
 poly-enzymes: crosslinked marker enzymes, for example, poly-AP;
 signal cascades: for example, NAD+/NADH + H+ cycles coupled to a redox color reaction that recycles the substrate for the next round.

Table 28.3 summarizes the most important systems for optical, luminescent, and fluorescent detection.

Optical Detection Optical enzymatic detection systems are based on the conversion of colored substrates, coupled with a change in the wavelength of absorption. Either colored precipitates (blot or in situ applications) or colored solutions (for quantitative measurements) are used. The most important substrates for colored precipitates are mixtures of 5-bromo-4-chloro-indoxyl phosphate (BCIP, Figure 28.20) or the corresponding β-galactoside (X-Gal) and nitro blue tetrazolium salt (NBT, Table 28.3), giving rise to a deep blue-violet precipitate (indigo dye plus NBT formazan) after cleavage of the phosphate in a coupled redox reaction.

Luminescence Detection Luminescent systems are based on the chemical, biochemical, or electrochemical activation of substrates, which emit light as they return to their ground state.

Chemiluminescence Common substrates like AMPPD or AMPGD (Table 28.3) are used for chemiluminescence detection; they form intermediate dioxetanes that decompose with the emission of light. After enzymatic cleavage of the phosphate or β-galactoside residue, the dioxetane bond becomes labile. The AMPD anion is created in an unstable, excited state and decays by emitting light at a wavelength of 477 nm (Figure 28.21).

Figure 28.20 The coupled optical redox reaction BCIP/NBT.


Figure 28.21 Mechanism of the dioxetane chemiluminescence reaction. The target nucleic acid is anchored to a solid support by a biotin:streptavidin (SA) interaction and labeled with the DIG:AP system. AP then activates the chemiluminescent substrate.

Several derivatives differ in the stabilizing moiety and in the stabilizers of the substrate solutions (e.g., Lumiphos™, CSPD®, CDP-Star®). The different formulations lead to different rates of decay and different light intensities. The light emission can also be modulated by the addition of fluorescent detergents, which surround the dioxetane molecules in micelles. Depending on the type of additive, blue, red, or green secondary fluorescence (aquamarine, ruby, emerald) is created. POD catalyzes the oxidation of luminol. In enhanced chemiluminescence, the sensitivity is increased by the addition of certain phenols (e.g., p-iodophenol), naphthols (e.g., 1-bromo-2-chloronaphthol), or amines (e.g., p-anisidine).

Electrochemiluminescence In addition to their use in direct detection systems, Ru2+ complexes (Section 28.4.3) are also used in indirect systems, after coupling to hapten-specific antibodies, for luminescent detection after electrochemical excitation.

Bioluminescence Owing to their high sensitivity, luciferase enzymes from fireflies (Photinus pyralis) or bacteria are used. The eukaryotic enzyme catalyzes the conversion of luciferin into oxyluciferin. In an AP-coupled indicator reaction, D-luciferin-O-phosphate is the substrate for AP. After cleavage of the phosphate, luciferase converts the formed D-luciferin in the presence of ATP, O2, and Mg2+ ions into oxyluciferin, thereby emitting light (Figure 28.22).

Figure 28.22 Mechanism of the luciferin bioluminescence reaction. Analogous to the chemiluminescence reaction in Figure 28.21, the target nucleic acid is anchored to a solid support by a biotin: streptavidin (SA) interaction marking it with the DIG:AP system. AP then activates the bioluminescent substrate.


Alternatively, a combination of the enzymes glucose-6-phosphate dehydrogenase, NAD(P)H-FMN oxidoreductase, and bacterial luciferase can be used. The FMNH2 formed in the redox reaction is oxidized in the presence of decanal and O2, resulting in the emission of light. Renilla luciferase, from the sea pansy Renilla, catalyzes the bioluminescent oxidation of coelenterazine. In the presence of green fluorescent protein (GFP) this leads to green secondary fluorescence at a wavelength of 508 nm. The Renilla enzyme is often used indirectly, as a secondary reporter molecule, in the form of a biotin conjugate.


Green Fluorescent Protein (GFP), Section 7.3.4 and 34.4.3

Fluorescence Detection Fluorescent light is created by the absorption of primary light (broadband or monochromatic laser light); the excited fluorescent molecule returns to its ground state by emitting longer wavelength secondary light. Because the incident primary light and the emitted secondary light can overlap, background signals can occur. Background effects are reduced by the geometry of the detector (perpendicular detector orientation) or by time-resolved fluorescence (TRF) with Eu3+- or Tb3+-complexes, micelles, or chelates (Section 28.3.3). Coupling several of these TRF fluorophores to the four binding sites indirectly via biotin:streptavidin binding amplifies the resulting signal. Indirect fluorescence can also be detected by coupling the previously described direct fluorescence markers (e.g., fluorescein or rhodamine) to hapten-binding antibodies (e.g., anti-DIG:fluorescein or anti-DIG:rhodamine conjugates).

FRET Detection Probes labeled with interacting fluorescent components are used for FRET detection. The two components form a fluorescence resonance energy transfer (FRET) pair: one component absorbs the primary light and transfers the energy to the second component if it is in immediate physical proximity (Figure 28.8); the second component emits the absorbed energy as longer wavelength secondary light, which is measured selectively to separate the signal from the excitation light. A well-known modern homogeneous detection system is the 5´-nuclease system (TaqMan), in which the probe is labeled with a marker pair consisting of a fluorescent marker and a quencher (Figure 28.5). For HybProbes, two directly adjacent probes are used, labeled with a FRET pair at their vicinal ends. In the case of molecular beacons, one probe carrying the FRET pair at its two termini is sufficient. Figure 28.23 shows examples of FRET fluorophores and quenchers.

Förster Resonance Energy Transfer (FRET), Section 7.3.7

In Situ Detection Fluorescein and rhodamine derivatives are usually employed as the fluorescent markers for in situ detection. These are bound directly or indirectly to the probes; DIG, biotin, or dinitrophenol (DNP) are used as the interacting components for indirect binding. For SKY probes, direct or indirect probe labeling with different marker combinations (Orange, Texas Red, Cy5, Spectrum Green, and Cy5.5 or DIG:Cy5.5; or FITC, Cy3, Bio-Cy3.5, Cy5, DIG-Cy7) as fluorescent markers can create up to 24 different colors.

Figure 28.23 Examples of FRET components: (a) 6-carboxyfluorescein (FAM), (b) tetramethylrhodamine (TAMRA), (c) Cy5, and (d) dabcyl.


28.5 Amplification Systems

Detection is often coupled with nucleic acid amplification procedures. Three types of amplification format are known:

 target amplification: amplification of the nucleic acid to be detected;
 target-specific signal amplification: signal amplification coupled with target hybridization;
 signal amplification: amplification of the signaling components.

Table 28.4 gives an overview of the various amplification reactions. Target amplification procedures have many advantages. In most cases, they lead to exponential amplification and

Table 28.4 Overview of amplification reactions. Source: Kessler, C. (1994) Non-radioactive analysis of biomolecules. J. Biotechnol., 35, 165–189; and Kessler, C. (ed.) (2000) Non-radioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.

Target amplification
 Replication, temperature cycles: PCR (polymerase chain reaction)
 Replication, cDNA synthesis/temperature cycles: RT-PCR (PCR coupled with cDNA synthesis)
 Replication, isothermal cycles: in situ PCR
 Transcription, cyclical isothermal cDNA synthesis: NASBA (nucleic acid sequence based amplification), TMA (transcription mediated amplification), TAS (transcription based amplification system), 3SR (self-sustaining sequence replication)
 Increased rRNA copy number: 16S/23S rRNA probes

Target-specific signal amplification
 Replication, isothermal replication cycles: Qβ replication
 Ligation, temperature cycles: LCR (ligase chain reaction)
 Replication/ligation, temperature cycles: RCR (repair chain reaction)
 Probe hydrolysis, isothermal cyclic RNA hydrolysis: CP (cycling probes)
 Displacement of indicator probes on flap structures by cleavase enzymes and amplification of indicator probes: Invader technology

Signal amplification
 Tree structures, probe networks: hybridization trees, branched probes
 Networks of indicator molecules: PAP (peroxidase:anti-peroxidase complex), APAAP (alkaline phosphatase:anti-alkaline phosphatase complex)
 Enzyme catalysis, enzymatic substrate turnover: ELISA (enzyme linked immunosorbent assay)
 Polyenzymes: enzyme gel conjugates
 Coupled signal cascades, cyclic NAD+/NADP+ redox reaction: self redox cycle

28 Techniques for the Hybridization and Detection of Nucleic Acids

therefore to high amplification rates (10⁶–10⁹). In addition, the complexity of the detection system is significantly decreased, since only the targets to be detected are amplified and not unspecific sequences. An example of target amplification is the polymerase chain reaction (PCR), in which the section of nucleic acid to be detected is enzymatically reproduced in a primer-dependent reaction. This reduction in complexity is absent in plain signal amplification; consequently, for complex genomic DNA, like the human genome, it is only used in combination with a preceding PCR target amplification procedure. An example of a signal amplification combined with target amplification is the enzymatically catalyzed conversion of dye- or luminescence-generating substrates (ELISA). In addition, plain signal amplification leads only to linear signal amplification and therefore to lower amplification rates (10–10³); the result is lower sensitivity of the detection reaction. Since unspecific hybridizations are also amplified, this often results in a less favorable signal-to-noise ratio, which decreases the specific detection of the target sequence. The sensitivity reached in systems without target amplification, but with ELISA signal amplification, lies in the range of picograms (10⁻¹² g) to femtograms (10⁻¹⁵ g). In combination with target amplification systems like PCR, the sensitivity goes up to the attogram (10⁻¹⁸ g) range. Such reactions allow the detection of single molecules. In this range the sensitivity reached in practice is no longer limited by the sensitivity of the detection system; instead it is limited by statistical effects relating to preparation of the test sample.
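To make these mass-based detection limits concrete, they can be converted into approximate molecule counts. A minimal sketch, assuming the standard approximation of about 650 g mol⁻¹ per base pair of double-stranded DNA:

```python
# Convert a DNA mass detection limit into an approximate molecule count.
AVOGADRO = 6.022e23   # molecules per mole
MW_PER_BP = 650.0     # approximate g/mol per base pair of dsDNA

def molecules_from_mass(mass_g, length_bp):
    """Approximate number of dsDNA molecules of a given length in a mass."""
    moles = mass_g / (length_bp * MW_PER_BP)
    return moles * AVOGADRO

# 1 pg of a 500 bp fragment is on the order of two million molecules:
print(molecules_from_mass(1e-12, 500))
# 1 ag of the same fragment is only about two molecules -
# the single-molecule regime mentioned in the text:
print(molecules_from_mass(1e-18, 500))
```

This is why attogram-range sensitivity is equivalent to single-molecule detection, and why sample-preparation statistics, not the detector, become limiting.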

Polymerase Chain Reaction, Chapter 29
ELISA, Section 5.3.3

While signal amplification is an important component of protein or glycan analysis, only nucleic acid analysis allows amplification of the target. This enables the combination of both types of amplification for the detection of nucleic acids, which leads to very high detection sensitivities.

28.5.1 Target Amplification

Target Elongation

Target elongations are thermocyclic reactions in which both strands of the DNA to be detected are replicated. The most important method for target elongation is PCR. RNA must first be transcribed into cDNA with reverse transcriptase prior to use in a PCR (RT-PCR). In homogeneous detection systems, PCR amplification is often carried out in TaqMan or LightCycler format with fluorescence detection. Strand displacement amplification (SDA) is another means to amplify DNA, in this case under isothermal conditions. With in situ PCR, these amplification reactions can be carried out in fixed tissues or cells for the specific amplification of target sequence fragments.

Target Transcription

The amplification reactions involved in transcription are isothermal and cycle through reverse transcription and transcription of an intermediately formed transcription unit. The starting material is target RNA, which is converted into cDNA with the help of promoter-containing primers, forming intermediate double-stranded DNA molecules that contain a T7, T3, or SP6 promoter. These intermediary transcription units are transcribed back into the starting RNA. There are different ways to carry out the transcription amplification; the reactions differ, among other things, in whether only one or both primers carry promoter sequences. One variant, nucleic acid sequence-based amplification (NASBA), is described in the following section. An important alternative is transcription-mediated amplification (TMA).

Nucleic Acid Sequence-Based Amplification (NASBA), Section 29.6.1

In Vivo Amplification

Special systems for the detection of bacteria exploit the fact that ribosomal RNAs include species-specific sequences and are present in vivo at a copy number of 10³–10⁴, so they can be easily detected. The indicator sequences are the variable regions of the rRNA or the intergenic spacer regions, that is, segments between the rRNA genes.

28.5.2 Target-Specific Signal Amplification

This type of reaction does not involve amplification of the target itself; instead, a nucleic acid tag or oligonucleotide that hybridizes with the target is replicated or modified. These reactions have several disadvantages relative to direct amplification of the target. For one thing, they are pure detection reactions, since no new target sequences are created. For another, these detection reactions have limited specificity: because a target-coupled signal is amplified, unspecific amplification products cannot be filtered out as they would be, for example, by a subsequent hybridization with a target-specific probe. An example of target-specific signal amplification is the ligase chain reaction (LCR), which is briefly described in the next chapter. Another example is the cleavase reaction, which


Ligase Chain Reaction (LCR), Section 29.6.4


involves cleavage of a mismatched end of a probe, a flap, by the enzyme cleavase, which recognizes and cuts such ends. This leaves an extra base on the cleaved free end of the oligonucleotide, which then allows amplification of a signal sequence after binding to a complementary indicator probe, enabling elongation of its 3′ end.

28.5.3 Signal Amplification

This sort of amplification comprises reactions that boost the resulting signal instead of the number of targets, and it brings only limited increases in sensitivity (10–10³). The unspecific reactions mentioned for the target-independent amplifications can lead to unspecific signals. Nevertheless, signal amplifications are of great value, particularly when combined with target amplification reactions.

Branched Structures

Branched structures are built up from forked probes, which contain target-specific sequences as well as sequences for the generation of signal (Christmas trees, probe brushes). The branched probes are complemented by universal detection probes, which are linked to a marker enzyme such as AP. The primary, bivalent probes are composed of target-specific sequences and sequences complementary to the branched secondary probes. The secondary, branched probes are composed of sequences complementary to the primary probes as well as branched structures complementary to a third probe. The third components are universal, enzyme-labeled AP probes. By hybridization between the target, primary (bivalent), secondary (branched), and tertiary (AP-labeled) probes, complex structures are built up that lead to amplification of the signal. An example is the use of branched DNA (bDNA) for the identification of hepatitis C virus (Figure 28.24). Additional sensitivity can be achieved through the cassette-like attachment of several branched structures to a target. Disadvantages of this system are, besides the limited amplification rates, the complex setup and the resulting limited control over the branched structures, as well as the possibility of unspecific hybridization. These can lead to unspecific background reactions and therefore to a poor signal-to-noise ratio, which limits sensitivity.
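The theoretical gain of such a layered probe system is simply the product of the multiplicities at each level. A minimal sketch; the probe and label counts below are illustrative assumptions, not values from any specific bDNA assay:

```python
# Theoretical number of enzyme labels recruited per target molecule
# in a layered probe system: the product of the per-level multiplicities.
def labels_per_target(primary_per_target, branches_per_primary,
                      ap_probes_per_branch):
    return primary_per_target * branches_per_primary * ap_probes_per_branch

# e.g. 10 bivalent primary probes, 15 branches each, 3 AP probes per branch:
print(labels_per_target(10, 15, 3))  # 450 enzyme labels per target
```

In practice, incomplete hybridization and background keep the realized amplification well below this product, consistent with the limited rates noted above.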
Enzyme Catalysis

The amplification of signals through the enzymatic conversion of substrate with the help of marker enzymes is used in direct and indirect detection systems (e.g., AP, POD, β-Gal; Section 28.4.3). Enzyme catalysis as part of the detection reaction

Figure 28.24 Branched structures for signal amplification. The structural components are the capture probe, the bivalent primary detection probe, the branched secondary probes, and the universal AP detection probes. The capture probe can also be covalently bound directly to a membrane.


Figure 28.25 Signal amplification by a coupled cyclic ADH/diaphorase redox reaction. ADH: alcohol dehydrogenase; DP: diaphorase.

leads, depending on the reaction format and type of marker system, to up to a 10³-fold increase in sensitivity. The use of polyenzymes leads to an additional increase in sensitivity by a factor of 3–5. In this context, coupling of PCR and enzyme catalysis with luminescent substrates is often the best choice. For example, the DIG hapten (Section 28.4.3) is often incorporated into the amplicon via the primers or nucleotides during PCR; the DIG-labeled amplicon is subsequently detected with high sensitivity with the help of AP–antibody conjugates through the catalytic conversion of indolyl or dioxetane substrates. By combining amplification types, sensitivities down to the detection of single molecules are achieved.

Coupled Signal Cascades

For this reaction type, a primary substrate conversion is coupled to a secondary enzymatic reaction. If the secondary reaction is cyclic, a signal cascade results. An example of a signal cascade is the self-redox cycle shown in Figure 28.25. The primary reaction is catalyzed by the marker enzyme alkaline phosphatase (AP), which is linked, either directly or indirectly, to the hybridization probe. The primary substrate NADP⁺ is dephosphorylated by AP to NAD⁺. The NAD⁺ formed activates a secondary redox cycle, in which it is reduced to NADH by the enzyme alcohol dehydrogenase (ADH), coupled with the oxidation of ethanol to acetaldehyde. The secondary reaction cycle is closed by the enzyme diaphorase (DP), which re-oxidizes the NADH to NAD⁺ in a further coupled reaction and at the same time reduces the dye NBT violet to deeply colored formazan. The soluble formazan can be quantified photometrically (λ = 465 nm); the signal amplification leads to a 10- to 100-fold increase in sensitivity or, alternatively, to much shorter reaction times. Traces of NAD⁺ contamination in the primary substrate NADP⁺, which accumulate during extended storage, generate a background that can critically compromise the sensitivity.
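The gain from the cycle can be estimated with a toy model: every NAD released by AP passes repeatedly through the ADH/diaphorase loop, and each pass deposits one formazan molecule. The turnover rate used here is an illustrative assumption, not a measured enzyme kinetic constant:

```python
# Toy model of the coupled redox cycle: each NAD+ released by AP is
# recycled through ADH/diaphorase, yielding one formazan per pass.
def formazan_yield(nad_released, passes_per_min, minutes):
    return nad_released * passes_per_min * minutes

without_cycling = 1000                      # at most one event per NAD+
with_cycling = formazan_yield(1000, 5, 20)  # assumed 5 passes/min for 20 min
print(with_cycling / without_cycling)       # 100-fold gain, matching the text
```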

Further Reading

Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Seidman, J.G., Smith, J.A., and Struhl, K. (1987–2005) Current Protocols in Molecular Biology, vol. 1–4, Suppl. 1–69, Greene Publishing Associates and Wiley-Interscience, New York.
Clarke, J.R. (2002) Molecular diagnosis of HIV. Expert Rev. Mol. Diagn., 2, 233–239.
Giulietti, A., Overbergh, L., Valckx, D., Decallonne, B., Bouillon, R., and Mathieu, C. (2001) An overview of real-time quantitative PCR: applications to quantify cytokine gene expression. Methods, 25, 386–401.
Hames, B.D. and Higgins, S.J. (1995) Gene Probes, vol. 1 and 2, IRL Press, Oxford.


International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.
Keller, G.H. and Manak, M.M. (1993) DNA Probes, 2nd edn, Stockton Press, New York.
Kessler, C. (1991) The digoxigenin (DIG) technology – a survey on the concept and realization of a novel bioanalytical indicator system. Mol. Cell. Probes, 5, 161–205.
Kessler, C. (ed.) (1992) Non-radioactive Labeling and Detection of Biomolecules, Springer, Berlin, Heidelberg.
Kessler, C. (ed.) (2000) Non-radioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.
Kidd, J.M. et al. (2008) Mapping and sequencing of structural variation from eight human genomes. Nature, 453, 56–64. See also http://www.hgsc.bcm.tmc.edu/
Kricka, L.J. (1995) Nonisotopic Probing, Blotting, and Sequencing, Academic Press, San Diego.
Kricka, L.J. (2002) Stains, labels and detection strategies for nucleic acids assays. Ann. Clin. Biochem., 39, 114–129.
Lander, E.S., Linton, L.M., Birren, B. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921; 411, 720; 412, 565.
Lee, H., Morse, S., and Olsvik, O. (1997) Nucleic Acid Amplification Technologies, Eaton Publishing, Natick.
Marshall, A. and Hodgson, J. (1998) DNA chips: an array of possibilities. Nat. Biotechnol., 16, 27–31.
McPherson, M.J., Quirke, P., and Taylor, G.R. (1996) PCR – A Practical Approach, vol. 1 and 2, IRL Press, Oxford.
Nuovo, G.J. (1997) PCR In Situ Hybridization: Protocols and Applications, 3rd edn, Lippincott-Raven Press.
Nussbaum, R.L., McInnes, R.R., and Willard, H.F. (2001) Thompson & Thompson Genetics in Medicine, 6th edn, W.B. Saunders, Philadelphia.
Ørum, H., Kessler, C., and Koch, T. (1997) Peptide nucleic acid, in: Lee, H., Morse, S., and Olsvik, O. (eds) Nucleic Acid Amplification Technologies, Eaton Publishing, Natick.
Persing, D.H., Smith, T.F., Tenover, F.C., and White, T.J.
(1993) Diagnostic Molecular Microbiology: Principles and Applications, American Society for Microbiology, Washington, DC.
Ramsay, G. (1998) DNA chips: state-of-the-art. Nat. Biotechnol., 16, 40–44.
Reischl, U., Wittwer, C., and Cockerill, F. (eds) (2002) Rapid Cycle Real-Time PCR – Methods and Applications, Springer-Verlag, Berlin, Heidelberg.
Tyagi, S., Bratu, D., and Kramer, F.R. (1998) Multicolor molecular beacons for allele discrimination. Nat. Biotechnol., 16, 49–53.
Venter, J.C. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351; erratum: Science, 292, 1838.

29 Polymerase Chain Reaction

Nancy Schönbrunner,1 Joachim Engels,2 and Christoph Kessler3

1 Roche Molecular Systems, Inc., 4300 Hacienda Dr., Pleasanton, CA 94588, USA
2 Goethe University Frankfurt, Institute of Organic Chemistry and Chemical Biology, Department of Biochemistry, Chemistry and Pharmacy, Max-von-Laue-Straße 7, 60438 Frankfurt am Main, Germany
3 PD Christoph Kessler Consult GmbH, Icking, Schloßbergweg 11, 82057 Icking-Dorfen, Germany

The polymerase chain reaction (PCR), a method for the amplification of nucleic acids, is one of the greatest scientific discoveries of the recent past. Without exaggeration it can be said that this discipline, though still young, has revolutionized molecular biology. The possibilities of PCR appear to be almost unlimited. The number of articles in scientific journals about improvements, new applications, and breakthroughs in the areas of basic and applied research, as well as medicine, diagnostics, and other areas, grows daily.

The history of its discovery, during a night drive through the mountains of California, is portrayed in an impressive article by Kary B. Mullis, its inventor. He writes: “The polymerase chain reaction was not the result of a long development process or a lot of experimental work. It was invented by chance on a late evening in May, 1983 by the driver of a gray Honda Civic during a drive along Highway 128 through the mountains between Cloverdale and Anderson Valley in California.”

Since the discovery of PCR in 1983, around 800 000 publications in diverse journals and magazines had appeared as of March 2014. PCR applications lie in the most diverse areas: basic molecular biological research, cloning of defined sequence segments, generation of samples, genetics, medicine, genome diagnostics, forensics, the food sector, plant breeding, agriculture, the environment, and archeology, to name just a few. The PCR reaction has been integrated into the most diverse processes: PCR amplification with subsequent electrophoretic separation of the amplification products, cloning and sequencing of the amplified sequences, allele-specific PCR for the elucidation of mutations, analysis of the PCR products in blot detection formats, coupling with quantitative heterogeneous and homogeneous detection methods, or in situ PCR for the amplification of particular target sequences directly in tissues or cell cultures.
For biological samples an important requirement is upfront sample preparation to extract and concentrate the nucleic acid from the biological matrix and to remove inhibitors. Alternatives to PCR have also been established, NASBA (nucleic acid sequence-based amplification) and TMA (transcription-mediated amplification) being the best-known amplification procedures. While PCR begins with DNA as the starting nucleic acid and the amplification is accomplished by temperature cycles, NASBA and TMA begin with RNA as the starting nucleic acid and the amplification cycles are carried out at a constant temperature, such as 42 °C.

29.1 Possibilities of PCR

The PCR method as a means to amplify any given nucleic acid segment relies on an idea as brilliant as it is simple. To estimate the value of such an amplification, we will begin by considering conventional analysis procedures. For example, gel electrophoresis has a lower

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.


Figure 29.1 Schematic of the polymerization of DNA. In the presence of a primer with a free 3′-OH end and free nucleotides (dNTPs), DNA polymerases convert single-stranded DNA (ssDNA) into double-stranded DNA.

limit of detection of about 5 ng DNA. If you calculate the number of molecules, this ends up being about 10¹⁰ for a fragment 500 base pairs (bp) in length. To increase the sensitivity, the DNA in such gels can be transferred to solid supports in order to detect it with radioactive or non-radioactive probes. This increases the sensitivity such that approximately 10⁸ molecules can be detected; however, for many diagnostic purposes this is not even close to being sufficient. In viral diagnostics, titers of under 1000 particles per milliliter of blood are often present. Even the most sensitive conventional analysis procedures do not come close to reaching the sensitivity of PCR: it is theoretically capable, under optimal conditions, of generating up to 10¹² identical molecules from a single nucleic acid segment in a few hours, which are then available for diagnostics or other analytical methods (Section 29.5). How is this accomplished? PCR takes advantage of the ability of DNA polymerases to duplicate DNA. A requirement for this is a short segment of double-stranded DNA with a free 3′-OH end, which can be correspondingly extended (Figure 29.1). Mullis realized that such a short segment can be created artificially by adding DNA fragments of about 20 nucleotides in length, also referred to as oligonucleotides or primers. These bind, or anneal, to the ends of the DNA strand to be amplified and can now be extended by the polymerase. If the newly synthesized double-stranded DNA is denatured by increasing the temperature, new primer molecules can bind upon cooling and the process can begin again. If two primers are added to the sample, one of which binds to the sense strand and the other to the antisense strand, after each cycle of new synthesis and denaturation a doubling of the segment between the primers takes place. PCR leads to an exponential amplification, since the new strands are available as templates for the next round of amplification (Figure 29.2).
And something else was recognized by Mullis’s group: if one uses a temperature-stable DNA polymerase, such as those found in organisms that live in hot springs, it is possible to run the reaction without interruption.
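The ideal doubling per cycle means the copy number grows as 2ⁿ. A short sketch with an efficiency parameter; real reactions fall below an efficiency of 1.0 and plateau in late cycles, which this model ignores:

```python
# Ideal-case PCR yield: each cycle multiplies the copy number by
# (1 + efficiency); efficiency = 1.0 corresponds to perfect doubling.
def pcr_copies(start_copies, cycles, efficiency=1.0):
    return start_copies * (1.0 + efficiency) ** cycles

print(pcr_copies(1, 30))       # 2**30, roughly 1e9 copies from one molecule
print(pcr_copies(1, 40))       # roughly 1e12, the theoretical range in the text
print(pcr_copies(1, 30, 0.9))  # sub-perfect efficiency reduces the yield
```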

29.2 Basics

29.2.1 Instruments

The required reagents and tools for PCR are quite simple. In the course of time they have been steadily improved in terms of data security, throughput, and user comfort. The first thermocyclers consisted of three water baths set to different temperatures, and the samples were moved by hand, with the aid of a stopwatch, from one bath to the next. Later, robotic arms took over this task. Today, thermocyclers are relatively compact devices that hold the PCR samples in a 96-well micro-volume plastic plate in a metal block, which is systematically heated and cooled. A significant difference among modern thermocyclers is their heating technology, which employs either Peltier elements or works with the aid of liquids. The newest developments in this field aim at a drastic increase in speed through miniaturization – PCR in a glass capillary tube with a very low volume (LightCycler®) – or at amplification and detection in a single device (PCR combined with simultaneous FRET detection, TaqMan®). Devices of this sort allow real-time detection of the PCR product during the amplification (Section 29.7).


Figure 29.2 Schematic of PCR. The number of DNA segments can theoretically double in each cycle of denaturation, primer annealing, and primer extension. The first two cycles are shown. The number of copies grows exponentially with each round.


29.2.2 Amplification of DNA

Sample Preparation

If the starting material for PCR comes from biological material, the nucleic acids must first be released from, for example, virus particles, bacteria, or cells and separated from interfering components like proteins, lipids, or inhibitors, such as hemoglobin degradation products in blood. Besides releasing and purifying the nucleic acids, sample preparation also concentrates them. Various methods are used. A common method exploits the property that DNA binds to glass in the presence of chaotropic salts. If the glass particles have a magnetic core, they can be captured by placing a magnet on the vessel wall, separated from interfering substances by wash steps, and the surface-bound nucleic acids eluted and concentrated into a small volume with a suitable elution buffer. The advantage of this method is that it is generic, so it can be used on different target sequences. Another method uses hydroxyapatite columns. If the goal is to purify particular target sequences for PCR, capture beads, which carry capture probes on their surface, can be used. The capture probe binds to the particle through the binding pair streptavidin (adsorbed to the particle surface) and biotin (tagging the capture probe). Complementary target sequences bind to the capture probe, and if the particles employed are also magnetic, separation from interfering substances can take place by applying a magnetic field.

Cycles

A typical PCR run consists, as a rule, of three stages at different temperatures. This is particularly easy to visualize as a temperature–time profile (Figure 29.3). The reaction is started by heating to 92–98 °C. This step serves to denature the DNA into its single strands. Since the initial DNA is present in a complex, high molecular weight structure, a time of 5–10 min is chosen to ensure that even GC-rich sequences are denatured. The second step of the reaction is annealing of the primers.
For this to take place, the reaction must be cooled to a primer-specific temperature. Annealing of the primers to the single strands of the target sequence critically influences the specificity of the PCR. After the annealing step the temperature is increased to 72 °C, the optimum temperature of the enzyme used, Taq DNA polymerase. For both the annealing and extension steps a time of less than a minute is usually sufficient. Only for very long PCR products of over a kilobase is the extension time lengthened, to be sure that the complete strand is synthesized. This is important, since only completely extended DNA strands can function as templates in the next cycle of PCR. In the next step of the reaction the sample is again heated to 92–95 °C, in order to separate the double-stranded DNA product into single strands. Since in the ideal case only the newly synthesized segments are

Figure 29.3 Temperature–time profile of PCR (two- and three-step).
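The profile of such a run can be written down as data to estimate a total run time. The temperatures and hold times below are typical illustrative values, and ramping time between stages is ignored:

```python
# Rough run-time estimate for a three-step PCR (illustrative values).
initial_denaturation_s = 300      # 5 min initial denaturation
profile = [
    ("denaturation", 95, 30),     # step name, temperature in C, hold in s
    ("annealing",    58, 30),
    ("extension",    72, 60),
]
cycles = 30

seconds_per_cycle = sum(hold for _, _, hold in profile)
total_min = (initial_denaturation_s + cycles * seconds_per_cycle) / 60
print(total_min)  # 65.0 minutes, excluding ramp times
```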

Table 29.1 Summary of PCR master mixes (100 μl end volume).

Reagent – Final concentration
Taq DNA polymerase – 2–5 units
10× Taq buffer (100 mM Tris/HCl pH 8.3; 500 mM KCl) – 1×
10 mM nucleotide mixture (dATP, dCTP, dGTP, dTTP) – 0.2 mM
MgCl2 – 0.5–2.5 mM
Primer I – 0.1–1 μM
Primer II – 0.1–1 μM
H2O – variable
Template – variable
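Table 29.1 fixes final concentrations; the pipetting volumes follow from the dilution relation C1·V1 = C2·V2. A sketch with typical stock concentrations (the stock values are assumptions and vary between suppliers):

```python
# Volume of a stock solution needed to reach a final concentration
# in a reaction of v_total microliters: V_stock = C_final * V_total / C_stock.
def stock_volume_ul(c_final, c_stock, v_total=100.0):
    return c_final * v_total / c_stock

print(stock_volume_ul(0.2, 10.0))  # dNTPs: 10 mM stock to 0.2 mM final -> 2.0 ul
print(stock_volume_ul(1.5, 25.0))  # MgCl2: 25 mM stock to 1.5 mM final -> 6.0 ul
print(stock_volume_ul(1.0, 10.0))  # 10x buffer to 1x final -> 10.0 ul
```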

present as double strands, after the second cycle a much shorter denaturation time of 10–60 s suffices. Newer protocols combine annealing and extension into a single step, usually at 62–72 °C, turning the three-step PCR into a two-step one. For most applications, enough product for further analysis is present after 30–35 cycles. Only in exceptional situations, or for nested PCR (Section 29.3.1), does amplification require 40–50 cycles.

Enzyme

The most important requirements for DNA polymerases in PCR are high processivity – the ability to synthesize long stretches of DNA – and/or rapid binding and extension kinetics at 72 °C, as well as very high temperature stability at 95 °C. Polymerases that have these characteristics include Taq, Tth, Pwo, and Pfu DNA polymerases. Taq DNA polymerase is the one most commonly used in standard protocols. Tth DNA polymerase, like Taq DNA polymerase, has a high 5′→3′ polymerization activity, but in addition Tth possesses reverse transcriptase activity under certain conditions; this is explained in more detail in Section 29.2.3. Pwo and Pfu DNA polymerases have, in addition to their polymerization activity, a 3′→5′ exonuclease activity. This activity is referred to as proofreading activity, since these enzymes are capable of recognizing and removing an incorrectly incorporated nucleotide, so that the mistake can be corrected in a new round of polymerization. Besides the single polymerases, mixtures of the enzymes are often offered commercially. Examples of the reagents needed for amplification according to standard protocols are listed in Table 29.1.

Buffer

The buffer conditions need to be set according to the requirements of the polymerase. The ion concentration of the buffer, usually provided with the enzyme, is very important, since it influences the specificity and processivity of the entire reaction. The buffers supplied with Taq DNA polymerase usually come with and without magnesium chloride. For the optimization of a new PCR (Section 29.2.4), buffers without magnesium chloride are better, since titrating a separate magnesium chloride solution gives access to a much greater range of conditions. Other possible additives are bovine serum albumin (BSA), Tween 20, gelatin, glycerol, formamide, and DMSO. These can, in some cases, stabilize the enzyme and optimize primer annealing.

Nucleotides

The concentration of the four deoxynucleotide triphosphates (dATP, dCTP, dGTP, dTTP) is usually in the range 0.1–0.3 mM. All four dNTPs should be present in equimolar amounts. The only exception is when other nucleotides are used for amplification, for example dUTP. Nucleotide analogs are often used in excess, mostly 3 : 1, since they are not incorporated as well by Taq DNA polymerase.

Primers

An important prerequisite for an optimal PCR, measured in terms of specificity and sensitivity, is the selection of the primers. There are four basic types of primers:

 sequence-specific primers,
 degenerate primers,
 oligo(dT) primers (only for RNA),
 short random primers (usually for RNA).


Oligo(dT) primers and random primers (hexanucleotides) are usually used for the amplification of RNA. They are described in the following section. The use of degenerate primers is limited to particular questions and is treated separately in Section 29.3.3. By far the most often used primers are sequence-specific primers. To ensure that they actually bind to their target sequences, several rules must be followed; some of the most important are mentioned briefly here.

Primer requirements:

Specificity of the Hybridization and Stringency, Section 28.1.2

1. at least 17 nucleotides long (usually 17–28 nt);
2. balanced G/C to A/T content;
3. melting point between 55 and 80 °C;
4. melting points of the forward and reverse primers as close together as possible;
5. no hairpin structures, particularly at the 3′ end (Figure 29.4);
6. no dimerization, neither with itself nor with the second primer (Figure 29.4);
7. no G/C nucleotide at the 3′ end, if possible, since this increases the danger of mispriming;
8. no “strange” base sequences such as poly(A), more than four consecutive Gs, or long G/C stretches, if possible.

Today diverse computer programs support the user in the search for suitable primer sequences and also report their melting points. The melting point can be calculated with various formulas. The simplest of these assigns 2 °C for each A or T and 4 °C for every G or C. For a primer of 20 nucleotides in length, a 20-mer, with a balanced number of A/T and G/C, this gives a melting point of 60 °C.

Templates (Genomic DNA, Plasmids, Viral DNA)

The most important influences of the target on the success of PCR are the length of the segment to be amplified, the sequence of the primer binding sites, and the number of input molecules. A microgram of human genomic DNA contains 3 × 10⁵ target sequences, provided the target is a single-copy gene and not a repetitive element. The same mass of a 3 kb plasmid contains 3 × 10¹¹ molecules; in other words, 1 μg of genomic DNA contains as many target molecules as 1 pg of plasmid DNA. This needs to be taken into account when using different templates, particularly for preparative purposes. The maximum amplification length of a DNA is primarily determined by the processivity of the polymerase used. Today there are enzymes and enzyme mixtures that allow amplification of fragments up to 40 kb. In such cases the extension time must be much longer, up to 30 min per cycle. In general, short segments of 0.1–1 kb in length are favored, since these can be optimally amplified with PCR. Besides the length and the number of molecules, the primer binding sites also determine whether a PCR is successful. To avoid mispriming, repetitive sequences should be excluded and a single-copy site should be selected for the primers.
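The simple melting-point rule quoted above (2 °C per A/T, 4 °C per G/C, often called the Wallace rule) is easy to sketch. It is only a rough guide for short oligonucleotides; primer-design software uses more accurate nearest-neighbor models:

```python
# Wallace rule: Tm (in degC) is approximately 2*(A+T) + 4*(G+C),
# valid only as a rough estimate for short oligonucleotides.
def wallace_tm(primer):
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

# A balanced 20-mer (10 A/T, 10 G/C) gives the 60 degC quoted in the text:
print(wallace_tm("ACGTACGTACGTACGTACGT"))  # 60
```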

Figure 29.4 Secondary structures of primers. For the selection of primers it is important to avoid secondary structures. Complementary segments between the sense and antisense primers must also be factored into the analysis.


29.2.3 Amplification of RNA (RT-PCR)

Many methods to analyze RNA exist in molecular biology, such as Northern blots, in situ hybridization, RNase protection assays, and nuclease S1 analysis, to name just a few. All these methods have the disadvantage, however, that they are time-consuming and often not sensitive enough. This is particularly true for the analysis of low-copy-number transcripts or for viral RNA, which is only present at very low starting concentrations. Adaptation of PCR technology to allow the amplification of RNA led to many new discoveries and more sensitive diagnostics. It can be used to investigate gene expression in cells or, with the aid of quantitative RT-PCR, to determine the amount of a specific mRNA or viral RNA. In addition, with oligo(dT) priming (see below), complete cDNA banks can be created, which enables an overview of tissue-specific expression.

In situ Hybridization, Section 35.1.4
Ribonuclease Protection Assay (RPA), Section 34.1.3
Nuclease S1 Analysis of RNA, Section 34.1.2

Enzymes

Since the starting RNA cannot be used directly as a template by Taq DNA polymerase, it must first be transcribed into DNA before it can be amplified. There are several enzymes, called reverse transcriptases (RTases) or RNA-dependent DNA polymerases, for this purpose. The newly synthesized strand is termed complementary DNA (cDNA) and the step in which this cDNA is created is called reverse transcription (RT). The complete reaction of RT and amplification is therefore called RT-PCR (Figure 29.5). Several different reverse transcriptases can be used:

MMLV RTase This enzyme comes from Moloney murine leukemia virus, has a temperature optimum of 37 °C, and is able to synthesize cDNA up to a length of 10 kb thanks to its high processivity. Its optimum pH is 8.3.

AMV RTase Isolated from avian myeloblastosis virus (AMV), this enzyme has a temperature optimum of 42 °C and a similarly high processivity to MMLV RTase. Its optimum pH is 7.0.

Figure 29.5 Schematic portrayal of RT-PCR. Since RNA cannot be amplified directly by PCR, it must first be transcribed into cDNA. Enzymes that catalyze this step are MMLV RTase, AMV RTase, and Tth polymerase.


Part IV: Nucleic Acid Analytics

Tth DNA Polymerase This heat-stable enzyme comes from the bacterium Thermus thermophilus. In contrast to the other two enzymes, Tth DNA polymerase possesses two activities: in the presence of manganese ions it has both RT and DNA polymerase activity. Since Tth DNA polymerase comes from a thermophile, just like Taq polymerase, it has a temperature optimum of 60–70 °C. It is the only enzyme capable of carrying out both steps of an RT-PCR under the same buffer conditions. A high concentration of manganese ions is optimal for the RT step, but tends to inhibit the DNA polymerase activity and thus reduces the processivity of the enzyme. For this reason, Tth DNA polymerase is only able to synthesize 1–2 kb of cDNA.

Procedure

Corresponding to the different enzymes, there are different ways to carry out an RT-PCR.

In Two Reaction Tubes The first RT step is carried out in a relatively small volume (to increase the sensitivity). This has the advantage that the reaction conditions can be set optimally for the RT enzyme used. After the RT step, the entire reaction, or an aliquot, is transferred and a “normal” PCR is carried out. The disadvantage of this procedure is an increased danger of contamination (Section 29.4), since an additional pipetting step is required. For such a two-step process, AMV or MMLV RTase is used for the RT and Taq DNA polymerase for the PCR.

In a Single Reaction Tube For the reason mentioned above, it is advantageous to carry out the whole RT-PCR reaction in one tube without the need to pipette from one tube into another. This is possible, in principle, with all three reverse transcriptases, in some cases in combination with Taq polymerase; however, Tth DNA polymerase is particularly well suited. In almost all cases, its lower processivity is an acceptable price to pay, since the segment to be amplified is usually less than 2 kb. The main advantage of using Tth DNA polymerase, however, is that the reaction temperature of the RT step can be raised to 60 °C. This temperature helps to resolve secondary structures and thereby eases annealing of the primer.

Primers

For RT-PCR, three different types of primers can be used (Figure 29.6):

- Sequence-specific primers anneal specifically, in both the RT step and the following amplification, to the same site of the RNA or cDNA. These are most often used in diagnostic tests for the detection of viral RNA.
- Oligo(dT) primers consist of a segment of 12–18 dTs, which bind semi-specifically to the polyA tail of eukaryotic mRNAs. They are only useful for the RT step; for the following amplification, further sequence-specific primers are needed.
- Short random primers are a mixture of hexanucleotides of different sequences. They bind “randomly” to the RNA and lead to a pool of cDNAs of various lengths, which, like the products of oligo(dT) priming, are then amplified with a second set of sequence-specific primers.

Figure 29.6 Different priming methods in RT-PCR.


29.2.4 Optimizing the Reaction

One of the most important and time-consuming parts of establishing a new PCR is optimization of the reaction. Among the analytical aspects, sensitivity is particularly important; it is critical in forensic medicine and in the detection of very low amounts of DNA or RNA in infectious disease. In cases where the amplification serves preparative purposes, such as the synthesis of probes or of templates for sequencing, the yield, that is, the amount of PCR product formed, is of primary importance. This section explains a few of the important adjustable parameters and some strategies for optimization.

DNA Amplification

Choice of the Primers The selection of primers requires a substantial investment of time. If the sequence segments allow, several primer pairs should be tested, since secondary structure of the target sequence is, in principle, always to be expected. If no PCR product is synthesized, it can help to successively lower the annealing temperature. However, one must then pay particular attention to the possibility of nonspecific amplification.

Magnesium Ions An important factor determining the processivity and overall activity of Taq polymerase is the concentration of magnesium ions. In the first experiments a concentration range between 0.5 and 5 mM should be explored.

Additives Many additives to the reaction solution can help stabilize Taq polymerase or the annealing of the primers. These include glycerol, BSA, and PEG. Denaturation is aided by the addition of DMSO, formamide, Tween 20, Triton X-100, or NP40. Detergents may also stabilize the DNA polymerase.

Hot Start PCR

If nonspecific amplification is a problem, a hot start PCR can help in some cases: to suppress extension of nonspecifically hybridized primers at low temperatures, the activity of Taq polymerase is controlled such that it only begins at higher temperatures. One possibility is to add the enzyme only after the sample has been heated. There are also commercially available antibodies against Taq polymerase that denature only at higher temperatures and thereby set the enzyme free. Furthermore, there are chemically modified versions of Taq DNA polymerase (Taq GOLD), which are inactive below 60 °C; the chemical adducts are hydrolyzed under the specific buffer conditions at elevated temperatures. Finally, some DNA polymerases are supplied with specific aptamers that inhibit activity below the melting temperature of the aptamer.

Templates

The quality of the template is also very important for a successful PCR. Sample preparation should ensure that PCR inhibitors are efficiently removed. Of particular significance are degradation products of hemoglobin in blood preparations and ethanol, which is often used to precipitate DNA.

RNA Amplification In addition to the points mentioned above, there are a few more aspects to consider for RT-PCR. Single-stranded RNA often has more secondary structure than DNA. Since the formation of secondary structure is a very complex and poorly understood process, current computer programs are only of limited use in this aspect of primer selection. Even where primer design appears optimal, the synthesis of the PCR product may fail due to secondary structures that either prevent the annealing of the primers or block extension. In this context, the use of Tth DNA polymerase has often proven advantageous, since the RT reaction takes place at 60 °C, which helps to melt such secondary structures. In some cases, adding an RNase inhibitor is recommended, since RNA is fundamentally a much more vulnerable template than DNA.

29.2.5 Quantitative PCR

The quantification of nucleic acids with PCR or RT-PCR has become an essential component of diagnostics. This is particularly true for the diagnosis and monitoring of infectious diseases; two examples of great significance are the AIDS-causing human immunodeficiency virus (HIV) and the liver inflammation-causing hepatitis C virus (HCV). In addition, in oncology there is interest in the quantitative measurement of mRNAs.

RT-PCR is in principle less efficient than PCR. Even under optimal conditions only about 10–30% of the RNA present is transcribed into cDNA, which is then available for further amplification.

Figure 29.7 Typical course of a PCR. The kinetics of the PCR runs through an exponential phase into a plateau phase. The plateau comes about as a result of the build-up of inhibitors, competition between strand re-annealing and primer annealing, and because the reagents become limiting.

Quantification is complicated by the fact that PCR is not a linear amplification. The exponential nature of the amplification means that small differences in reaction efficiency, such as the presence of inhibitors in a particular sample, have profound effects on the yield of amplicon. The following equations demonstrate this:

N = N₀ · 2ⁿ    (29.1)

where N is the number of amplified molecules, N₀ the number of molecules prior to amplification, and n the number of cycles. The number of molecules doubles under these (idealized) conditions with each cycle. In practice, however, the following formula applies:

N = N₀ · (1 + E)ⁿ    (29.2)

where E is the efficiency of the reaction, with a value between 0 and 1. This value depends very strongly on the degree of optimization of the PCR. Experimentally, values between E = 0.8 and E = 0.9 have been found for an optimized PCR. The quantitative measurement is further complicated by the fact that towards the end of the amplification the exponential phase turns into a plateau; this means that the value of E changes during the PCR (Figure 29.7). The maximum amount of product that can be generated during a PCR is around 10¹³ molecules, but can deviate relatively strongly downwards. Many methods for quantitative measurement have been developed and continuously improved over the years:
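How strongly the efficiency E in Equation 29.2 affects the yield is easy to explore numerically. The sketch below is illustrative Python (function and variable names are our own):

```python
def pcr_yield(n0: float, cycles: int, efficiency: float = 1.0) -> float:
    """Amplicons after n cycles: N = N0 * (1 + E)**n  (Eq. 29.2).
    efficiency = 1.0 reproduces the idealized doubling of Eq. 29.1."""
    return n0 * (1.0 + efficiency) ** cycles

ideal = pcr_yield(100, 30, 1.0)  # perfect doubling
real = pcr_yield(100, 30, 0.8)   # a well-optimized real reaction
print(f"ideal: {ideal:.1e}, E=0.8: {real:.1e}, ratio: {ideal / real:.0f}")
```

After 30 cycles, an efficiency of 0.8 instead of 1.0 already means a more than twenty-fold difference in yield, which illustrates why small sample-to-sample efficiency differences distort quantification so strongly.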

- Limiting dilution: A reference standard of known concentration is diluted in multiple steps and amplified, and the dilution is determined at which the PCR generates a barely detectable amplification. A sample to be measured is then also diluted in multiple steps to find this same endpoint. The number of dilution steps then allows conclusions to be drawn about the starting concentration of the sample.
- External standard: The concentration of a sample to be measured is determined by comparing the signal it generates with that of a standard of known concentration.

Neither method is able to recognize internal interference in the amplification efficiency of an individual sample. The next generation of quantitative tests therefore focused on internal controls and standardized amplification reactions. There are two types of standardization:

- Internal endogenous standard: quantification against a so-called “house-keeping” gene.
- Competitive (RT) PCR: quantification using mimic fragments, which are added to the reaction and amplified along with the actual target sequence.

The following explains the last three options for quantification in more detail.

External Standard Samples of known concentration are used to generate a curve that provides an external standard. The standard should be relatively similar to the intended


Figure 29.8 Quantification using an external standard. Shown is the measurement of a sample against a standard curve that was created with a known concentration of HIV from a cell line (geq, genome equivalents).

target sequence and should use the same primers for amplification. Very suitable, for example, is an HIV cell line that contains a known number of proviral genomes. This cell line is serially diluted, processed, and amplified, and the signal is plotted against the starting concentration (Figure 29.8). After amplifying the unknown sample, its signal is looked up on the standard curve to determine the starting concentration. The disadvantage is, as mentioned, the lack of an internal control that the reaction ran properly and like the other samples. It is easy to imagine that even a low level of inhibition can lead to dramatic under-quantification.

Internal Standardization To provide internal standardization, the DNA amounts in the samples are calibrated against internal sequences of the genome. For DNA measurements (e.g., of HIV provirus genomes), the β-globin gene is usually used. This requires the use of two primer pairs in a multiplex PCR (Section 29.3.4): one pair amplifies the test DNA, the other a segment of the β-globin gene. Since the amount of the β-globin gene is known and its signal after amplification is constant, conclusions can be drawn about the amount of the target DNA. In contrast to external standards, this procedure allows the detection of inhibitory substances, as long as they have the same influence on both PCRs and are not sequence-specific. For the quantification of RNA, the signal can be calibrated against the signal from so-called “house-keeping” genes. These are genes that are thought to be expressed in all cells and tissues at the same level at all times. An important aspect that makes the quantitative measurement of RNAs significantly more difficult is the great variation in the efficiency of reverse transcription, in particular when cDNA is synthesized from two different RNAs. A possible way to limit this variance is the use of artificial standards, so-called mimic fragments.

Competitive (RT) PCR In this procedure an artificial, cloned standard of known concentration, containing the same primer binding sites as the target, is added to the reaction. Since it is coamplified with the same primers as the target sequence and thus “mimics” the target, it is called a mimic fragment and the reaction is called competitive (RT) PCR. Ideally, the amplified mimic fragment is the same size as the target sequence. After amplification, the two products are separated, either by differential hybridization or by different restriction sites, and analyzed. Alternatively, differentially labeled probes are used to detect the target and mimic products by TaqMan PCR (Section 29.2.1). Competition between the two target sequences occurs whenever the starting amounts differ by more than about three to four orders of magnitude (Figure 29.9). The sample for measurement is divided into about four equal aliquots and an increasing amount of RNA mimic fragment is added to each. After amplification, the two signals are plotted against the starting concentration of the RNA mimic fragment and the starting amount of the test sample is read off at the intersection of the two curves (Figure 29.9).

In general, RNA mimic fragments, and not DNA mimic fragments, should always be used, since the greatest variance comes from the RT step, as mentioned above. In addition, the RNA mimic fragments should be added at the very beginning in order to control each step along the way (sample preparation, amplification, and detection).
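The graphical read-out of the point of equivalence can also be approximated numerically. The sketch below is an illustration, not part of the original protocol; the function name and the example numbers are invented. It interpolates on log-log scales to find where the target/mimic signal ratio crosses 1:

```python
import math

def equivalence_point(mimic_added, signal_ratio):
    """Estimate a sample's starting amount from a competitive (RT) PCR
    titration: interpolate, on log-log scales, where the target/mimic
    signal ratio crosses 1 (the point of equivalence).

    mimic_added  -- known mimic copies spiked into each aliquot
    signal_ratio -- measured target signal / mimic signal per aliquot
    """
    xs = [math.log10(m) for m in mimic_added]
    ys = [math.log10(r) for r in signal_ratio]
    for x1, y1, x2, y2 in zip(xs, ys, xs[1:], ys[1:]):
        if y1 >= 0 >= y2:  # ratio crosses 1 between these two spikes
            return 10 ** (x1 + (0 - y1) * (x2 - x1) / (y2 - y1))
    raise ValueError("equivalence point not bracketed by the titration")

# Four aliquots spiked with increasing mimic amounts (invented numbers):
print(equivalence_point([1e2, 1e3, 1e4, 1e5], [100.0, 10.0, 1.0, 0.1]))  # → 10000.0
```

Here the ratio falls through 1 at the aliquot spiked with 10⁴ mimic copies, so 10⁴ copies is the estimated starting amount of the target.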

766

Part IV: Nucleic Acid Analytics

Figure 29.9 Competitive (RT) PCR: The sample to be measured is aliquoted and spiked with increasing, known amounts of mimic fragment. After the amplification, aliquots of the amplicons are hybridized with the corresponding probes and the signals are compared to the starting concentrations of the mimic fragments. The starting concentration is given by the point of equivalence of the sample.

29.3 Special PCR Techniques

29.3.1 Nested PCR

Nested PCR involves the use of two sets of primer pairs, an outer and an inner pair (Figure 29.10). The advantage of this method is the increased specificity and sensitivity of the entire reaction: the outer pair is first used to synthesize a larger amplicon (PCR product) and in a second reaction the inner primer pair is used to amplify within the first amplicon, which serves to eliminate the by-products of the first amplification. In general the inner pair is added to the mix after 15–20 cycles and the reaction is allowed to continue for another 20–25 cycles. The serious disadvantage of this method, however, is the drastically increased chance of contamination (Section 29.4) caused by pipetting of the amplicon. This danger can be avoided with one-tube nested PCR, in which all four primers are added to the reaction at the beginning. The outer primer pair must have a higher melting point than the inner pair, so that at first only the outer pair can anneal under the reaction conditions used. After the appropriate number of cycles, the annealing temperature is lowered so that the inner pair can now anneal and produce the inner product.

Figure 29.10 Nested PCR: The starting DNA is successively amplified in two separate PCRs. First an outer primer pair produces a somewhat larger segment, the first amplicon; then an inner primer pair binds and amplifies a smaller internal segment, the second amplicon, in a further 20–25 cycles. Nested PCR can lead to significantly increased sensitivity and specificity of the PCR.

29.3.2 Asymmetric PCR

When one of the two primers is present in excess relative to the other, this is referred to as asymmetric PCR. These conditions lead to selective amplification of one of the two strands. This technique is used for, among other things, sequencing PCR products (Section 29.3.5). If the goal is to hybridize the PCR product with a labeled probe after amplification, asymmetric PCR can be advantageous: the strand that hybridizes to the probe is preferentially amplified, which creates a more favorable situation in the competition between renaturation of the two amplified strands and hybridization with the labeled probe. This comes at the price of the amplification no longer being exponential; instead it quickly becomes linear.

29.3.3 Use of Degenerate Primers

Degenerate primers are a mixture of individual molecules that differ at certain positions in their sequence. They are used whenever the sequence of the amplification target is not exactly known or the target sequences diverge from one another. The first case arises when, for example, a gene segment needs to be amplified from a species for which the sequence is only known in other species. Homology searches allow variable positions to be identified and corresponding degenerate primers to be synthesized that contain all the expected nucleotide variations. These are then used to attempt to amplify the corresponding segment from the desired species. Another use is when only the amino acid sequence of the protein is known. In such cases the amino acid sequence can be used to narrow down the possible base sequences enough that degenerate primers can be designed. Degenerate primers are also often used when the target sequences vary from one another, as occurs, for example, in the amplification of different HIV subtypes: even in regions that are otherwise very strongly conserved, single nucleotides can differ from subtype to subtype. The more degenerate the primers are, the greater the danger of nonspecific amplification; this is a fundamental disadvantage of the approach.
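The degeneracy of such a primer mix, that is, the number of distinct molecules it contains, follows directly from the IUPAC ambiguity codes used at the variable positions. A small illustrative sketch (function name and example primer are invented):

```python
# Number of bases represented by each IUPAC nucleotide code
IUPAC = {"A": 1, "C": 1, "G": 1, "T": 1,
         "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
         "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}

def degeneracy(primer: str) -> int:
    """Number of distinct oligo species in a degenerate primer mix:
    the product of the per-position ambiguities."""
    count = 1
    for base in primer.upper():
        count *= IUPAC[base]
    return count

# A hypothetical primer back-translated from an amino acid sequence:
print(degeneracy("GAYGARGCNTAYMG"))  # → 64
```

The product grows quickly with each ambiguous position, which quantifies the warning above: highly degenerate mixes dilute each individual primer species and invite nonspecific amplification.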

29.3.4 Multiplex PCR

Multiplex PCR refers to the use of multiple specific primer pairs to generate the corresponding number of amplicons, all in one tube at the same time. The throughput increases drastically while the amount of work decreases. Classic uses of multiplex PCR are found especially in routine diagnostics. For example, the disease cystic fibrosis (CF) is due to certain mutations in the CFTR gene; however, over 100 mutations of this gene are known today, which can be spread across all 24 exons. Multiplex PCR is able to amplify many of the exons at the same time, so that the products can then be investigated for point mutations. The situation is similar for other inherited diseases, such as familial hypercholesterolemia, Duchenne muscular dystrophy, polycystic kidney disease, and many more.


Another very attractive indication is the diagnosis of several viral infections (HBV, HCV, HIV) at the same time from a single blood sample, which is of particular interest for blood banks. As with degenerate primers, however, it can be difficult to get the high complexity of the total reaction under control, which often leads to nonspecific amplification. The newest multiplex protocols have succeeded in measuring up to five viral parameters (HIV-1 M, HIV-1 O, HIV-2, HBV, HCV) at the same time without nonspecific by-products. Another example of a multiplexing application is a new test for sepsis, which uses primers that bind to the spacer region of the rRNA genes of pathogenic bacteria to detect and differentiate between a whole palette of pathogenic organisms. This allows the rapid selection of a suitable antibiotic, which would otherwise require days with conventional tests, such as selective cultivation of the bacteria, by which time life-threatening complications such as septic shock or multiple organ failure may have arisen. The test detects Gram-negative pathogenic bacteria (e.g., Klebsiella pneumoniae), Gram-positive pathogenic bacteria (e.g., Staphylococcus aureus), and pathogenic fungi (e.g., Candida albicans). The high specificity is demonstrated by the lack of cross reactions with over 50 closely related bacteria.

Multiplex PCR is also used for DNA analysis in forensics and paternity testing. Up to 16 PCR amplicons are generated at the same time, which can be separated after gel electrophoresis into different peaks on the basis of different primer fluorescence markers and fragment lengths. The individual heterogeneity of the amplified repetitive sequence segments yields multiplex PCR patterns specific to the individual. These patterns can be stored in databases and rapidly identified with the aid of search programs.

29.3.5 Cycle Sequencing

Sequencing According to Sanger: the Dideoxy Method, Section 30.1.1

To sequence a PCR product, the sequence need not necessarily first be cloned into phages (M13) or plasmids; instead it can be analyzed directly. Either the product is sequenced subsequent to the PCR or sequencing takes place during the amplification reaction; the latter is referred to as cycle sequencing. Since only one primer is used in each reaction tube, the amplification is linear instead of exponential. As in Sanger sequencing, the chain termination method with dideoxynucleotides is usually used. The sequencing takes place, therefore, in four reaction tubes, which differ in the corresponding termination mix (ddATP, ddCTP, ddGTP, ddTTP). The reaction can be started with very small amounts of DNA, which is a decisive advantage of cycle sequencing compared to older sequencing methods. In addition, any sort of double- or single-stranded DNA can be used as the template. Cycle sequencing is used particularly often for mutation analysis, since it allows the easy investigation of certain genome segments without the need to first clone the region. A disadvantage of this method, however, is that a polymerization error of the Taq polymerase occurring in an early cycle of the linear amplification can be interpreted as a suspected “mutation” and lead to false conclusions; in such cases the opposite strand should always be sequenced. After the reaction is complete, the products are electrophoretically separated on the basis of their differing lengths and the sequence is determined. Depending on the label used for the primer, the detection can be radioactive or non-radioactive. Modern approaches to cycle sequencing utilize differentially fluorescently labeled ddNTPs, so that only a single reaction tube is needed.
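The difference between the exponential amplification of standard PCR and the linear accumulation in cycle sequencing can be illustrated with an idealized count (a hypothetical Python sketch, not from the source; it assumes perfect efficiency):

```python
def product_molecules(n0: float, cycles: int, primers: int) -> float:
    """Idealized product count: with two primers the template doubles
    each cycle (exponential PCR); with a single sequencing primer each
    cycle adds at most one new copy per template (linear)."""
    return n0 * 2 ** cycles if primers == 2 else n0 * cycles

print(product_molecules(1000, 30, 2))  # standard PCR, exponential
print(product_molecules(1000, 30, 1))  # cycle sequencing, linear
```

The contrast also explains the note on errors above: in linear amplification every product derives directly from the original template, so a single early polymerase error is not propagated exponentially, but it can still dominate the signal of a rare template.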

29.3.6 In Vitro Mutagenesis

PCR is ideally suited to introducing mutations into DNA strands in vitro and producing the mutated DNA in amounts sufficient for many purposes. Substitution mutagenesis, the exchange of nucleotide sequences, is used, for example, to create mimic fragments for competitive PCR (Section 29.2.5). Mutagenic PCR can also be used to generate a diversity library that can be screened for desired features.

29.3.7 Homogeneous PCR Detection Procedures

To avoid contamination (Section 29.4), real-time procedures in closed systems are increasingly used, in which amplification and detection take place in a single step without the need to open the reaction vessel. This avoids contamination during pipetting steps or through the formation of aerosols when reaction vessels are opened. The use of closed systems and direct detection of the fluorescent signal, through the thin glass wall of the capillary tubes or the tops of film-sealed microwell plates, at any time during the amplification reaction has greatly reduced the danger of contamination: the formation of amplicons is followed in real time. Another advantage of this homogeneous format is its high dynamic range of up to 8–10 orders of magnitude in comparison to heterogeneous formats (see Chapter 28).

29.3.8 Quantitative Amplification Procedures

Besides the quantitative TaqMan format (5´ nuclease assay) with TaqMan, HybProbe, or molecular beacon probes, homogeneous amplifications are also carried out with the aid of fluorescent intercalating dyes. Different amplicons can be detected in multiplex procedures by creating amplicons of different lengths, and thus different melting points, and measuring the resulting melting curves during the PCR cycles. Another quantitative homogeneous amplification technique is the measurement of amplification products by fluorescence depolarization. In this format, detection takes place through an increase in polarization, which results from binding of the single-stranded probes to the generated amplicon: hybridization of the detection probe increases the molecular weight of the resulting complex and thereby the polarization of monochromatic light (e.g., xenon light with a monochromator at 495 nm) shone through the sample. This technique is also used in association with strand displacement amplification (Section 29.6.2).

29.3.9 In Situ PCR

Recently, protocols for in situ PCR, that is, amplification within cells (e.g., on histological sections), have been published. The difficulty in this procedure is to stabilize the tissue structure on the slide by suitable fixation such that the structure remains intact through the thermal cycles. This is accomplished with special protocols that fix the paraffin-embedded tissue structures with a 10% buffered formalin solution. In situ PCR leads to a large increase in the detected signals in histological sections. Thermocyclers that directly heat the slides are already on the market.

29.3.10 Other Approaches

Besides the special PCR approaches described above, there are many other techniques that are often implemented to answer specific questions or are used for preparative purposes. Only a brief overview can be presented here:

- Digital PCR: An alternative approach to quantifying starting DNA copies is to perform limiting dilutions of the sample, followed by hundreds or even thousands of PCRs, such that each reaction contains at most one template copy. Poisson statistics are applied to the fraction of positive reactions to extrapolate the starting concentration without the need for a standard curve. Microfabricated devices partition the reaction automatically, either into droplets or into chambers on a chip, coupled with sophisticated post-PCR detection formats. Such dilution of the sample also enables the differentiation and detection of rare mutant sequences.
- RACE PCR: Rapid amplification of cDNA ends is a method to amplify and clone the 5´ ends of cDNAs, in particular those of long mRNAs that were not completely synthesized during reverse transcription.
- Inverse PCR: An important method for the amplification of unknown DNA sequences. Primers binding in opposite directions to a known section of sequence are synthesized and used to amplify the DNA in both directions. The products are digested with a restriction enzyme and self-ligated to form circular DNA molecules, which can be amplified and sequenced with the starting primers.
- Vectorette PCR: This is used frequently for the characterization of unknown DNA segments.

Hybridization Methods, Section 28.1.3


- Alu PCR: Alu elements are short repetitive elements that are more or less evenly distributed throughout the genome of primates. Amplification with primers that bind to the Alu repeats creates a characteristic pattern of bands that is so distinctive it can be regarded as a genetic fingerprint of an individual.
- DOP PCR: Degenerate oligonucleotide-primed PCR (see also Section 29.3.3) is used for the analysis of micro-amplifications and deletions in comparative genome hybridization. In this method, the entire genetic material of a test cell (e.g., a cancer cell) and of a control cell is amplified and different detection markers are incorporated (e.g., rhodamine/digoxigenin and fluorescein/biotin labels; Chapter 28). After pre-hybridization of the chromosomes of the target cell with cot DNA, to saturate the repetitive sequences, the mix of digoxigenin- and biotin-labeled amplicons is hybridized with the chromosomes of the target cell. Detection is accomplished using the different filters of a fluorescence microscope for fluorescein and rhodamine. Overlaying the signals reveals a fluorescence pattern that marks those spots where the fluorescein or rhodamine signal predominates due to a micro-amplification or deletion.
- PRINS PCR: Primed in situ PCR is a precursor of in situ PCR, in which a primer is extended once after in situ hybridization to the target DNA in a fixed target cell.

cot Curves, Basic Principles of Hybridization, Section 28.1
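The Poisson correction mentioned under digital PCR can be sketched in a few lines (illustrative Python; the function name and the partition counts are invented example numbers):

```python
import math

def copies_per_partition(n_positive: int, n_total: int) -> float:
    """Digital PCR read-out: mean template copies per partition,
    lambda = -ln(1 - p), where p is the fraction of positive partitions.
    The correction accounts for partitions that received more than
    one template copy but still score as a single positive."""
    return -math.log(1.0 - n_positive / n_total)

# Invented example: 5000 of 20000 droplets show amplification
lam = copies_per_partition(5000, 20000)
print(f"{lam:.3f} copies per partition, ~{lam * 20000:.0f} input copies")
```

Note that the estimate exceeds the raw count of positive partitions (here 5000), precisely because some positives started with more than one copy.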

29.4 Contamination Problems

The high analytical sensitivity of PCR represents, for obvious reasons, enormous progress for science and diagnostics. However, the ability to create many millions of molecules from just a few in a very brief span of time creates an enormous danger of false positive results, since each molecule is an optimal template for further amplification. The danger of contamination is particularly great when laboratories frequently work with the same primers and always amplify the same target with them. Since PCR will become increasingly common in routine laboratory use in coming years, we examine the problem in more detail here. There are three basic types of contamination with DNA:

- cross contamination from sample to sample during isolation of the DNA;
- contamination with cloned material that contains the amplification target sequence;
- cross contamination with already amplified DNA.

In general the greatest danger is presented by aerosols, fine drops of liquid suspended in air. Table 29.2 shows what sort of contamination is probable from aerosols: starting from the assumed size of such droplets, the volume of the particles and the number of amplicons they carry are given. For an amplification reaction with a volume of one hundred microliters, the table shows that a single picoliter already contains some 10 000 amplifiable molecules that could cause contamination. Each of these amplicons is a perfect target for a new amplification.

29.4.1 Avoiding Contamination

Avoiding contamination should have the highest priority in diagnostic as well as research laboratories, even before decontamination. To accomplish this, it is important to be clear about the possible sources of contamination:

Table 29.2 Danger of contamination by aerosols.

Type of contamination    Size       Volume    Amplicons per volume
(complete reaction)      –          100 μl    10¹²
Splashes                 –          ∼1 μl     10¹⁰
Aerosol                  ∼100 μm    ∼1 nl     10⁷
Aerosol                  ∼10 μm     ∼1 pl     10⁴
Aerosol                  ∼1 μm      ∼1 fl     10

29 Polymerase Chain Reaction

• aerosol formation by centrifugation, ventilation, or uncontrolled opening of the sample and PCR tubes;
• transfer by contaminated pipettes, disposables, reagents, gloves, clothes, hair, and so on;
• splashes while opening tubes or pipetting liquids.

Numerous measures that can help to minimize the risk of contamination can be inferred from these points.

General measures to minimize the risk of contamination:

• aliquot frequently used reagents and samples;
• only use autoclaved reagents, pipette tips, and reaction tubes;
• minimize manual steps to the greatest extent possible;
• avoid pipetting;
• avoid strong drafts;
• clean and decontaminate devices and pipettes from time to time with dilute bleach solutions;
• if possible, avoid the use of nested PCR, since pipetting the samples drastically increases the risk of contamination (Section 29.3.1).

Sample handling:

• open tubes with cotton wool, a cloth, or something similar to avoid contamination of your gloves;
• if the tube has liquid in the lid, centrifuge briefly;
• work as closed as possible, which means only one tube should be open at a time;
• only use pipette filter tips or positive displacement pipettes;
• change gloves frequently;
• pipette slowly and with care;
• open tubes slowly and with care.

Waste disposal:

 inactivate used (contaminated) pipette tips with HCl or bleach (sodium hypochlorite);  do not dispose of remaining samples and amplicons in the regular trash, dispose of separately after inactivation;

 close PCR tubes before disposal. Separation of work areas:

 strict division into three areas:

  

– Area 1: preparation of the amplification mixes; for this purpose a laminar flow hood can be used; – Area 2: sample preparation; – Area 3: amplification and detection, must be in a separate room; under all circumstances, separate clothing must be worn in areas 1 or 2 and area 3; separate hardware (pipettors, tips, etc.) for each area; Sample flow should always be one way: Area 1 → Area 2 → Area 3 → autoclave or waste.

In general, samples and amplified material should be handled with the same amount of care as infectious or radioactive material. In addition, a suitable number of negative controls should be taken through each and every step (sample preparation, amplification, detection) in order to detect contamination early on.

29.4.2 Decontamination

Decontamination includes two different types of measures:

• Chemical or physical measures to clean equipment or the laboratory. This includes substances that destroy DNA directly or at least inactivate it such that it can no longer be amplified. Examples are HCl, sodium hypochlorite, and peroxide.


Part IV: Nucleic Acid Analytics

• Measures that are integrated into the routine operation of the tests and take place before or after every amplification. These can be further subdivided into physical, chemical, and enzymatic measures.

Physical Measures
UV irradiation: bombarding amplicons after PCR with UV light at a wavelength of 254 nm leads to the formation of pyrimidine dimers (T-T, C-T, C-C) within and between the DNA strands of the amplicon. Such inactivated DNA is no longer usable by Taq polymerase as a template. This measure has, however, a few disadvantages. There is a correlation between DNA length and the efficiency of the irradiation: the shorter the amplicon, the less effective the UV light is. In addition, the decontamination is less effective for GC-rich templates than for AT-rich templates.

Chemical Measures
Isopsoralens are intercalating dyes, which lead to crosslinking of the two strands when irradiated with long-wavelength UV light (312–365 nm). This also blocks the polymerase activity. 3´-Terminal ribonucleotides in primers create a base-sensitive position in the amplicon; a subsequent treatment with base hydrolyzes the primer binding sites.

Enzymatic Measures

• restriction digestion,
• DNase I digestion,
• exonuclease III digestion,
• UNG system.

Digestion with uracil-N-glycosylase (UNG) is the most efficient method for the decontamination of previously amplified DNA. This measure relies on the incorporation of dUTP instead of dTTP during amplification. The resulting PCR product contains uracil bases in both strands and is therefore different from all starting DNAs to be amplified. UNG is an enzyme that cleaves the glycosidic bond between uracil and the sugar phosphate backbone of the DNA. Through subsequent heating or base treatment such abasic DNA hydrolyzes into small fragments and thus can no longer be amplified. The UNG system is particularly effective for two reasons: first, every newly amplified molecule contains uracil bases and, thus, is a substrate for UNG. Second, UNG decontaminates before a new amplification, when possible contamination levels are lowest. Many other decontamination measures have the disadvantage that they either occur after the amplification and then need to be quantitatively effective, or they require additional steps, which bring with them an additional risk of contamination. Since uracil bases are only substrates for UNG in single- or double-stranded DNA, but not as individual nucleotides or in RNA, the enzyme can be added directly to the amplification mixture and is suitable for use in RT-PCR.
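The logic of the UNG system can be illustrated with a toy model (not a lab protocol): amplification is assumed to incorporate dUTP, so every amplicon carries U in place of T, and the UNG step before the next run destroys any U-containing template, while genuine sample DNA (containing T) survives. Sequences and names are invented:

```python
# Toy model of UNG carryover prevention. Amplification substitutes U for T;
# UNG treatment removes every U-containing template before the next run.

def amplify(template):
    """Return an amplicon: the same sequence, with U substituted for T."""
    return template.replace("T", "U")

def ung_treat(templates):
    """Keep only templates that survive UNG digestion (no uracil present)."""
    return [t for t in templates if "U" not in t]

sample_dna = "ATGCTTAGC"           # genuine target from the specimen
carryover  = amplify(sample_dna)   # contaminating amplicon from a previous run

survivors = ung_treat([sample_dna, carryover])
print(survivors)                   # only the genuine template remains
```

The model captures why the carryover amplicon, but never the genuine sample, is eliminated before each new amplification.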

29.5 Applications

Many applications of PCR have already been mentioned in Sections 29.2 and 29.3. Most of these address particular questions in a research laboratory. In the following section, applications in medical diagnostic laboratories will be described in more detail and the possibilities of PCR in genomic analysis will be sketched out.

29.5.1 Detection of Infectious Diseases

The detection of disease-causing agents is an ideal application for PCR, since many bacteria and viruses either cannot be cultivated at all or only very slowly, and conventional tests are not nearly as sensitive as PCR (Figure 29.11). Consequently, such tests have entered into the routine of molecular laboratories in food quality control, as well as in veterinary and human medicine. Examples are the viruses HCV (hepatitis C virus), HIV (human immunodeficiency virus), HBV (hepatitis B virus), and CMV (cytomegalovirus), as well as the bacteria Chlamydia, Mycobacterium, Neisseria, and Salmonella. In the detection of such pathogenic organisms by PCR, three aspects are critical: sufficient specificity of the reaction to avoid false positive results, a very high but also clinically relevant sensitivity of the test, and a clear, verified result. The challenge of the specificity of the complete reaction results from the question of how specifically a primer pair needs to bind in order to amplify only HIV, for example, but at the same time to recognize all its subtypes. The sensitivity is decisively influenced by the method of sample collection and the volume of the sample. The former must guarantee an efficient separation from inhibitors to enable an undisturbed reaction. This is particularly important for difficult sample material like sputum, stool, and urine. In addition, the amount of sample naturally has an effect on the sensitivity of the reaction. With ultrasensitive tests, such as those for the diagnosis of HIV, it is often necessary to enrich the virus in the samples, usually by ultracentrifugation. In this way a sensitivity of around 20 genome equivalents per milliliter can be achieved. On the other hand, in considering sensitivity it is always important to factor in the clinical relevance. For example, if a dose of more than 10⁵ bacteria of the Salmonella group is necessary to trigger an acute gastroenteritis, there is no need for an ultrasensitive test. The target sequence also plays a decisive role in the accuracy of the test. HIV, like all retroviruses, replicates its genome via a DNA intermediate, the provirus integrated into the host genome. Only during an acute infection can replicating RNA be found in the blood of the host organism. Besides the essentially qualitative yes–no answer of PCR, quantitative tests are gaining more and more prominence for certain parameters, such as HCV and HIV. These allow monitoring of the success of a therapy and thereby help to recognize early on the effect of certain therapeutic measures on the course of the disease.

Figure 29.11 Schematic of the detection of the trinucleotide (CAG) expansion typical for Huntington’s disease. The amplification takes place with specific primers that flank the CAG repeat. The size of the repeats in a polyacrylamide gel provides the diagnostic result (see Figure 29.12).

29.5.2 Detection of Genetic Defects

In the area of molecular medicine, PCR has provided the prerequisite to diagnose many genetic or acquired diseases at the level of DNA or RNA, prior to the appearance of symptoms. Many methods for this purpose have been developed and refined. In general this is a very new and innovative field that is subject to rapid changes. This section will only provide a general overview of current methods and highlight a few examples. The detection of known genetic defects can be classified, based on the type of mutation and with the exception of translocations, into point mutations or length variations, such as insertions, deletions, and expansions. It is also important to differentiate between simple single-site mutations and diseases that are caused by complex mutation patterns. For example, over 300 mutations are known for cystic fibrosis and familial hypercholesterolemia, while Huntington’s disease (see below) is caused by a single mutation. Each of these mutation types requires a different method.

Length Variation Mutations
Mutated and wild-type alleles of this mutation type can be differentiated on the basis of the length of the PCR product. A well-known example is the detection of trinucleotide expansions for a few neurodegenerative diseases. The causal mutation in Huntington’s disease (HD) is the expansion of a trinucleotide repeat (CAG) in the affected HD allele in the IT15 gene. The normal allele already varies in length, with up to 32 repeats being common. Research has shown, however, that a repeat length above 36 CAGs can be regarded as a positive result. The principle of the test is shown in Figure 29.12. Since the disease behaves in an autosomal dominant manner and homozygous carriers practically do not exist, a second, healthy allele is always found. After amplification of both alleles, the PCR products are separated on an electrophoretic gel and the corresponding repeat length is determined (Figure 29.12). Larger length variations, such as those found in Fragile X syndrome, can also be detected in Southern blots after hybridization with specific probes.

Figure 29.12 Detection of the expansion of the Huntington gene locus. Shown is the trinucleotide expansion (n = 55) of an affected woman and her pre-symptomatic children. While her son’s expansion is also 55 CAG repeats, transmission of the allele to the first daughter reduced the repeat length (n = 51), while the second daughter experienced an expansion (n = 59). The healthy father is homozygous, with each allele 19 CAG repeats in length.
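The repeat-length readout described above can be sketched in a few lines, using the thresholds quoted in the text (up to 32 repeats common in normal alleles, above 36 regarded as positive); the sequence and helper names are invented for illustration:

```python
# Illustrative sketch: count the longest uninterrupted CAG run in an amplified
# HD allele and classify it with the thresholds quoted in the text.
import re

def cag_repeat_length(seq):
    """Length, in repeat units, of the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", seq)
    return max((len(r) // 3 for r in runs), default=0)

def classify_allele(n_repeats):
    if n_repeats > 36:
        return "expanded"
    if n_repeats <= 32:
        return "normal"
    return "intermediate"

allele = "GGC" + "CAG" * 55 + "CCT"   # e.g. the n = 55 allele of Figure 29.12
n = cag_repeat_length(allele)
print(n, classify_allele(n))          # 55 expanded
```

In practice the repeat length is, of course, read from the gel rather than from a known sequence; the sketch only mirrors the classification step.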

Sequencing, Chapter 30

In vitro Restriction and Application, Section 27.1.4

Specificity of the Hybridization and Stringency; Section 28.1.2

Figure 29.13 Reverse dot blot: The detection of known mutations makes use of allele-specific hybridization probes (wt = wild type; mut = mutant), which are immobilized on a membrane. The position of hybridization is visualized by the label and allows genotyping.

Point Mutations

Sequencing
The surest way to identify and characterize known, as well as unknown, mutations is to sequence the PCR product (Section 29.3.5). Since this is technically demanding and labor-intensive, it is not suitable for screening procedures.

Restriction Fragment Length Polymorphisms
RFLPs can be used for analysis whenever mutations have led to the creation or loss of a restriction site. After amplification, the PCR product is cleaved with the corresponding restriction enzyme and the fragments are separated by gel electrophoresis. This method is only suitable for the detection of known mutations.

Reverse Dot Blot (Allele-Specific Hybridization)
Reverse dot blots involve immobilizing allele-specific probes on the surface of a membrane and hybridizing the PCR product to them. Only in the case of a perfect match between the probe and the amplicon does hybridization occur. This means that mutated alleles are not bound by the wild-type probe and the wild type does not hybridize to the mutant-specific probe. The location of the hybridization is visualized with specific labels and reveals the genotype. The principle is shown in Figure 29.13. To avoid unspecific binding, the stringency of the hybridization needs to be precisely adjusted (salt concentration, time, temperature). Exact knowledge of the mutations is also required to apply this method.
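The RFLP logic can be sketched in silico: a point mutation that destroys a restriction site changes the fragment pattern after digestion. EcoRI (G^AATTC) serves as the example enzyme here; the sequences are invented for illustration:

```python
# Hedged sketch of an in-silico RFLP check with EcoRI (cuts G^AATTC).

def digest(seq, site="GAATTC", cut_offset=1):
    """Split seq at every occurrence of the recognition site."""
    fragments, start = [], 0
    i = seq.find(site)
    while i != -1:
        fragments.append(seq[start:i + cut_offset])
        start = i + cut_offset
        i = seq.find(site, i + 1)
    fragments.append(seq[start:])
    return fragments

wildtype = "AAAGAATTCTTT"   # intact EcoRI site -> two fragments
mutant   = "AAAGACTTCTTT"   # point mutation destroys the site -> one fragment

print([len(f) for f in digest(wildtype)])  # [4, 8]
print([len(f) for f in digest(mutant)])    # [12]
```

The two banding patterns (4 + 8 bp versus an uncut 12 bp fragment) correspond to what the gel would show for wild-type and mutant amplicons.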


Figure 29.14 OLA (oligonucleotide ligation assay) technique: After a PCR, allele-specific oligonucleotides (wt, mut) bind to the single-stranded amplicon. Only when the 3´ ends hybridize perfectly can the oligonucleotide be ligated to another universal, labeled oligonucleotide. Since the mut oligonucleotide and the wt oligonucleotide differ in length, the ligated oligonucleotides can be separated electrophoretically and detected by their label.

Allele-Specific PCR
If the 3´ end of a primer cannot bind to a template, the amplification is inhibited, since Taq polymerase only efficiently extends a hybridized 3´-OH end. Allele-specific PCR takes advantage of this fact and uses two different forward or reverse primers. The amplification of the DNA to be investigated takes place in two separate PCR tubes, each with one of the two sets of primers. This allows the elegant characterization of the genotype (homozygous wild type, heterozygous, homozygous mutant). The exact mutations must be known for the design of the primers.
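The decision logic of allele-specific PCR can be sketched as follows (invented sequences and names; strand complementarity is ignored for brevity): extension succeeds only if the primer's 3´-terminal base is paired with the template.

```python
# Sketch of allele-specific PCR genotyping: two reactions with primers that
# differ only at the 3' base report which allele(s) are present.

def amplifies(primer, template):
    """True if the primer anneals and its 3'-terminal base matches."""
    pos = template.find(primer[:-1])          # anneal all but the last base
    if pos == -1 or pos + len(primer) > len(template):
        return False
    return template[pos + len(primer) - 1] == primer[-1]

def genotype(allele1, allele2, wt_primer, mut_primer):
    wt  = amplifies(wt_primer, allele1) or amplifies(wt_primer, allele2)
    mut = amplifies(mut_primer, allele1) or amplifies(mut_primer, allele2)
    if wt and mut:
        return "heterozygous"
    return "homozygous wild type" if wt else "homozygous mutant"

wt_allele  = "GGATCGATACCG"                   # variant base: T
mut_allele = "GGATCGACACCG"                   # variant base: C
wt_primer, mut_primer = "GATCGAT", "GATCGAC"  # differ only at the 3' end

print(genotype(wt_allele, mut_allele, wt_primer, mut_primer))  # heterozygous
```

Running both primer sets on the two alleles reproduces the three possible genotype calls described in the text.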

OLA Technique
The oligonucleotide ligation assay (OLA) also takes advantage of the fact that only perfectly hybridized, neighboring oligonucleotides can be ligated together. To analyze a known mutation, oligonucleotides are created that differ in length or labeling and bind to the PCR product in an allele-specific manner. Depending on the presence of the mutation, the ligase connects one or the other allele-specific oligonucleotide with the universal oligonucleotide. This provides the information for the genotyping (Figure 29.14).

Single-Strand Conformational Polymorphism (SSCP)
Single-stranded DNA (ssDNA) forms intramolecular secondary structures under renaturing conditions that, while not readily predictable, are strictly dependent on the primary sequence. Since a mutated allele carries a different sequence, it will also adopt a different conformation upon renaturation. In single-strand conformational polymorphism (SSCP) analysis the PCR product is denatured after amplification and immediately loaded onto a renaturing gel. Even the smallest changes in the conformation of the single strands lead to a different mobility in the gel, which is observed as a band shift (Figure 29.15). With this method, unknown mutations or polymorphisms can be recognized, but not characterized; that requires subsequent sequencing.

Gel Electrophoresis of DNA, Section 27.2.1

Figure 29.15 SSCP analysis: After amplification the PCR products are denatured and loaded onto a renaturing sequencing gel. Owing to the different refolding, the mobility of the mutated allele (mut) differs from that of the control DNA (wt). In general four bands appear, since each single strand of each allele refolds differently based on its sequence.

Denaturing Gradient Gel Electrophoresis (DGGE)
This approach is based on a principle very similar to that of SSCP. The double-stranded amplicon is loaded onto a gel containing an increasingly denaturing gradient. Depending on the sequence of the allele, denaturation at the mutated spot takes place earlier or later than in the wild type. The altered mobility is again observed as a band shift.

29.5.3 The Human Genome Project

Physical and Genetic Mapping of Genomes, Chapter 36 Generation of a Physical Map, Section 36.2.3

Identification and Isolation of Genes, Section 36.2.4

In October 2004 the sequence of the human genome was published in the journal Nature: the result of 13 years of work involving more than 2800 scientists. An analysis of the data and the 2.85 billion base pairs revealed the presence of 20 000 to 25 000 genes. The standard of quality applied required 99% of the gene-containing sequences to be included, and the accuracy was given as 99.999%. Sequence identification was only the first step; now understanding the function of the genes is the focus. PCR also strongly drove the development of the Human Genome Project. PCR allowed the introduction and use of sequence-tagged sites (STSs), which were a tremendous aid to the mapping work. STSs are specific DNA segments on chromosomes, which are defined by the sequence of two corresponding primers. Such information is readily available through databanks and can be used directly by every researcher involved in mapping the human genome or cloning genes. If an STS is part of an expressed sequence, it is referred to as an expressed sequence tag (EST). A particular form of STS are short tandem repeat polymorphisms (STRPs): short dinucleotide repeats, usually CA, which can vary in length from individual to individual. STRPs allow the determination of recombination frequencies and therefore conclusions about the segregation of such markers. They also make possible the characterization of haplotypes and the isolation of genes by a positional cloning approach. The first gene cloned using a positional cloning approach was the CFTR gene, which is responsible for cystic fibrosis, in 1989. The modern DNA/RNA sequencing methods (next-generation or deep sequencing) would not have been possible without the developments in the field of PCR. The parallelization and miniaturization of PCR were decisive. The use of ever smaller volumes (down to picoliters) allows the use of less sample and fewer reagents. Speed and precision also increase.
The current focus of development is the commercialization of microfluidics, in which preparative amounts


of nucleic acids can be created for sequencing purposes. By suitable choice of primers, tags and barcodes can be incorporated during PCR amplification. Depending on the sequencing protocol, the products can then be further amplified by, for example, emulsion PCR or bridge PCR.
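The STRP markers described above reduce, computationally, to finding the length of the longest (CA)n run at a marker locus, the quantity compared between individuals. A minimal sketch with invented sequences:

```python
# Illustrative STRP scan: report the length of the longest (CA)n run.
import re

def ca_repeat_length(seq):
    """Number of CA units in the longest uninterrupted (CA)n run."""
    runs = re.findall(r"(?:CA)+", seq)
    return max((len(r) // 2 for r in runs), default=0)

allele_a = "TTG" + "CA" * 17 + "GGA"   # 17 repeats
allele_b = "TTG" + "CA" * 21 + "GGA"   # 21 repeats
print(ca_repeat_length(allele_a), ca_repeat_length(allele_b))  # 17 21
```

Differences in this repeat count between individuals are what make STRPs informative markers for linkage and haplotype analysis.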

29.6 Alternative Amplification Procedures

Besides PCR, other amplification procedures exist for the multiplication of nucleic acids. Even if such procedures have found use in individual laboratories in the past, it is becoming increasingly clear that broad, routine application can only be accomplished through the use of PCR. Therefore, only a few of the most important alternatives will be discussed in this book.

29.6.1 Nucleic Acid Sequence-Based Amplification (NASBA)

This procedure involves an isothermal amplification of nucleic acids at 42 °C. Enzymatic components of the NASBA reaction are reverse transcriptase, RNase H, and T7 RNA polymerase. Another special aspect is the incorporation of a T7 promoter by means of a primer (primer A), whose promoter sequence is attached 5´ of the target-specific sequence of the primer. The reaction is started by the binding of primer A to the RNA template. Reverse transcriptase (RTase) converts the template into cDNA and the RNA of the hybrid is digested by RNase H. Primer B then binds to the opposite strand and the DNA-dependent DNA polymerase activity of the RTase completes synthesis of the strand, thereby creating a functional double-stranded promoter. T7 polymerase, as the third enzymatic component of the mixture, binds to the promoter and synthesizes around 100 new RNA molecules (depending on the length of the amplicon), and the process repeats from the beginning (Figure 29.16). Self-sustained sequence replication (3SR) is based on the same principle. NASBA and 3SR can be slightly modified to allow the use of DNA templates, which are transcribed into RNA prior to beginning the reaction.

Transcription-Mediated Amplification (TMA)
This procedure is an alternative isothermal amplification of RNA at 42 °C. Enzymatic components of the TMA reaction are a reverse transcriptase and T7 RNA polymerase. RNase H is replaced in this reaction by the partial RNase H activity of the reverse transcriptase. This amplification procedure also involves the incorporation into the amplicon of a T7 promoter attached to a primer (primer A).

29.6.2 Strand Displacement Amplification (SDA)

This method of nucleic acid replication also involves an isothermal reaction. It is based on the ability of DNA polymerases to begin new synthesis at single-strand breaks and to displace the old strand in the process. Amplification involves a cyclical cleavage of single strands and the subsequent strand displacement. Since restriction enzymes normally cleave double-stranded DNA completely, the single-strand breaks, called nicks, are created by incorporating a nucleotide analog into the opposite strand. After cleavage, the single-stranded (inactive) remaining sequence of a restriction site is made double-stranded by extending a primer. New synthesis does not employ the four natural dNTPs, which would lead to a double-stranded restriction site subject to complete cleavage, but occurs instead in the presence of three of the natural dNTPs and a thio-deoxynucleotide. This results in the formation of a double-stranded hybrid between the normal and the newly synthesized sulfur-containing strand. This site cannot be completely cleaved by the restriction enzyme, which leads to the desired single-strand nicks. The principle is illustrated in Figure 29.17.

Figure 29.16 Nucleic acid sequence-based amplification (NASBA): The starting point of the amplification is a single-stranded RNA that binds to primer A. This primer contains a T7 promoter sequence on its 5´ end. A cDNA is synthesized by reverse transcriptase and the RNA in the hybrid is digested immediately by RNase H. Primer B now binds to the single-stranded DNA and the opposite strand is synthesized, in so doing creating a functional T7 promoter. The T7 RNA polymerase recognizes its promoter and synthesizes around 100 RNA molecules, depending on amplicon length, which then trigger the cyclical phase of the NASBA reaction, in which the described steps are repeated. The complete reaction takes place at a constant temperature and in a single amplification buffer.

Figure 29.17 Strand displacement amplification (SDA): This is a cyclical process consisting of synthesis, restriction digestion, and strand displacement. The primers contain the recognition sites for the restriction enzyme. Since new synthesis is carried out in the presence of a thio nucleotide, nicks result, since such thio bonds are resistant to restriction enzymes. SDA, like NASBA, runs under isothermal conditions. Source: according to Persing, D.H. et al. (1993) Diagnostic Molecular Microbiology: Principles and Applications, American Society for Microbiology, Washington D.C.

29.6.3 Helicase-Dependent Amplification (HDA)

As an alternative to strand displacement, a DNA helicase is employed and the resulting single-stranded DNA is protected from reassociation by single-strand DNA-binding proteins. In the next step two primers are used, as in PCR, and DNA polymerase generates the two daughter strands. These strands then become available for the helicase and the next round of amplification begins. In a sense, it is just like PCR at a constant temperature. One advantage is that it does not require a thermocycler. A disadvantage is, however, that the choice of primers and the optimization of reaction conditions require a more extensive search.

29.6.4 Ligase Chain Reaction (LCR)

The ligase chain reaction (LCR) does not increase the amount of the actual target sequence; instead it amplifies two oligonucleotides ligated together, which are complementary to the original strands (Figure 29.18). After the initial hybridization of two immediately neighboring oligonucleotides, they are linked together by a thermostable ligase. The ligation products in turn form the target for two complementary oligonucleotide pairs, which also hybridize and are linked by the ligase. In about 30 cycles LCR reaches a sensitivity similar to that of PCR. To increase the amplification specificity, LCR protocols have been developed in which the two inner 5´ ends of the oligonucleotides are selectively phosphorylated, to avoid unspecific ligation.


Figure 29.18 Ligase chain reaction (LCR): Theoretically each round of the cycle, consisting of denaturation, annealing of four oligonucleotides, and ligation, leads to the doubling of the number of the oligonucleotides linked together. Analogous to PCR, LCR also leads to exponential amplification.

Repair Chain Reaction (RCR)

The repair chain reaction is related to the ligase chain reaction. In contrast to the ligase chain reaction, the two complementary oligonucleotide pairs do not abut, but instead are separated by a gap of one or more nucleotides (Figure 29.19). The gap is selected such that the addition of dGTP and dCTP (or dATP and dTTP), a polymerase, and the ligase repairs the missing nucleotides of the gap in a double strand-dependent reaction. This combination of limited elongation and ligation increases the specificity of the amplification, since the two oligonucleotide pairs cannot be ligated together without gap filling taking place.

29.6.5 Qβ Amplification

Qβ amplification does not involve the elongation of a primer (like PCR or NASBA/TMA); instead new synthesis is triggered by the structure of the Qβ genome. Copying the Qβ structures leads to (–) and (+) copies in each reaction cycle and therefore to exponential amplification. The proliferation takes place isothermally, after the triggering Qβ indicator sequence is coupled to a parameter-specific probe, which hybridizes with the target sequence (Figure 29.20). A disadvantage of this method is that it is a signal amplification, which can also lead to the amplification of falsely hybridizing probes, causing false positive results.


Figure 29.19 Repair chain reaction (RCR): In contrast to LCR, RCR does not lead to the annealing of the two primers next to one another, instead it leaves a gap of several nucleotides. Ligation only becomes possible after a polymerase has filled in the gap in the presence of dGTP and dCTP or dATP and dTTP, which increases the specificity of the reaction. The gap is chosen such that only dG and dC or dT and dA are contained in the opposite strand of the target DNA in the gap.

Figure 29.20 Qβ amplification: Qβ amplification involves a structure-initiated replication of the Qβ indicator sequence by the enzyme Qβ replicase under isothermal conditions. The amplification is exponential and leads to a rapid increase in the number of amplification products through (–) and (+) strand intermediates. In contrast to PCR or other target amplifications, Qβ amplification leads to the proliferation of the Qβ indicator sequence; this is a signal amplification of a coupled, target-independent indicator sequence, not of the target sequence itself.


29.6.6 Branched DNA Amplification (bDNA)

Signal Amplification, Section 28.5.3

The branched DNA amplification (bDNA) method is a more recent means by which to amplify signals. The target nucleic acid to be detected is coupled to a solid surface with specific capture probes. Other oligonucleotides, so-called extenders, then hybridize to the nucleic acids. The extenders bind target-independent amplifier molecules, which then stick out from the original target like antennae. These amplifiers bind many more oligonucleotides, which carry an alkaline phosphatase label. After several washes, the phosphatase substrate is added and chemiluminescence is used to detect the nucleic acids. The bDNA method allows the detection of around 10⁵ target molecules. A disadvantage of this method, however, is that it involves a signal amplification, which likewise amplifies the signal resulting from a falsely hybridizing probe and can therefore lead to false positive results.

29.7 Prospects

PCR has become a central bioanalytical method of molecular research laboratories. If current trends continue, it will become increasingly important in routine diagnostics. There are many reasons why PCR is one of the cornerstones of modern molecular biology. Increased automation, not only of PCR but also of the critical preceding sample preparation, plays an important role. The enzymatic contamination control systems have been particularly important for routine diagnostic use. In addition, closed-tube detection formats, such as TaqMan, allow routine use with fully automated PCR analysis devices with sensitivity in the range of a few copies. RT-PCR has allowed application to retroviral diagnostics (e.g., for HIV and HCV) and expression profiling. Another trend is the shortening of reaction times with faster thermocyclers and the minimization of the reaction volume, as has already been realized in the LightCycler®. One also hopes to reduce amplification times to minutes by miniaturizing the PCR reaction vessel in the form of chips (lab on a chip). The continuous development of microfluidics technology is promising for future routine analytics. Digital PCR offers the possibility of precise absolute quantitation, rare mutation detection, and quantitation of small differences in copy number. It can be expected that further automation of the entire workflow, together with miniaturization, will make different and more challenging sample types, such as whole blood, sputum, urine, and spinal fluid, amenable to PCR. Future applications that take advantage of further automation and miniaturization of both sample preparation and PCR are on the horizon. Especially exciting are point-of-care PCR devices that allow minimally trained operators to perform PCR from sample to answer, a task currently relegated to sophisticated laboratories with highly trained technicians.
Such rapid, simple RT-PCR would facilitate the spread of cost-effective DNA diagnostic methods for diseases such as Ebola and HIV in countries with limited financial resources and the detection of foodborne contaminants and fraudulently labeled foodstuffs on-site.
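The absolute quantitation offered by digital PCR rests on partitioning the sample so that each partition receives zero, one, or a few templates; counting positive partitions and applying the standard Poisson occupancy correction then yields a copy number without a standard curve. A sketch with invented example numbers:

```python
# Digital PCR quantitation: the fraction p of positive partitions gives the
# mean copies per partition as lambda = -ln(1 - p), from which the absolute
# concentration follows. Example figures are invented.
import math

def copies_per_ul(positive, total, partition_volume_nl):
    """Estimated template copies per microliter of reaction."""
    p = positive / total                        # fraction of positive partitions
    lam = -math.log(1.0 - p)                    # mean copies per partition
    return lam / (partition_volume_nl * 1e-3)   # nl -> ul

# e.g. 5000 positives out of 20 000 droplets of 0.85 nl each
print(round(copies_per_ul(5000, 20_000, 0.85)))  # ~338 copies per microliter
```

The correction matters because a positive partition may contain more than one template; simply dividing positives by volume would underestimate the concentration.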

Further Reading

Dieffenbach, C.W. and Dveksler, G.S. (2003) PCR Primer: A Laboratory Manual, 2nd edn, Cold Spring Harbor Laboratory Press, New York.
Logan, J., Edwards, K., and Saunders, N. (eds) (2009) Real-Time PCR: Current Technology and Applications, Caister Academic Press.
International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.
Kessler, C. (ed.) (2000) Non-radioactive Analysis of Biomolecules, Springer, Berlin, Heidelberg.
Larrick, J.W. and Siebert, E.D. (1995) Reverse Transcriptase PCR, Ellis Horwood, London.
Yang, S. and Rothman, R.E. (2004) PCR-based diagnostics for infectious diseases: uses, limitations, and future applications in acute-care settings. Lancet Infect. Dis., 4 (6), 337–348.
Rabinow, P. (1996) Making PCR: A Story of Biotechnology, University of Chicago Press.
Espy, M.J., Uhl, J.R., Sloan, L.M., et al. (2006) Real-time PCR in clinical microbiology: applications for routine laboratory testing. Clin. Microbiol. Rev., 19, 165–256.

29 Polymerase Chain Reaction Millar, B.C., Xu, J., and Moore, J.E. (2007) Molecular diagnostics of medically important bacterial infections. Curr. Issues Mol. Biol., 9, 21–40. Reischl, U., Wittwer, C., and Cockerill, F. (2002) Rapid Cycle Real-Time PCR. Methods and Applications, Springer, Berlin, Heidelberg. Saunders, N.A. and Lee, M.A. (eds) (2013) Real-Time PCR: Advanced Technologies and Applications, Caister Academic Press. Hugget, J.F., Foy, C.A., et al. (2013) The digital MIQE guidelines: minimum information for publication of quantitative digital PCR experiments. Clin. Chem., 59 (6), 892–902. Niemz, A., Ferguson, T.M., and Boyle, D.S. (2011) Point-of-care nucleic acid testing for infectious diseases. Trends Biotechnol., 29, 240–250. UNITAID (2014) HIV/AIDS Diagnostic Landscape, 4th edn, World Heath Organization. http://www .unitaid.eu/images/marketdynamics/publications/UNITAID-HIV_Diagnostic_Landscape-4th_edition. pdf.

30 DNA Sequencing

Jürgen Zimmermann and Jonathon Blake
EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany

In 1975, Fred Sanger laid the foundation for the most powerful tool for the analysis of the primary structure of DNA with the development of an enzymatic sequencing method. At that time, neither the far-reaching implications for the understanding of genes and whole genomes nor the rapid development of this method could have been foreseen. Back then, Fred Sanger was happy about the sequencing of five bases in one week, as he himself noted in retrospect at a reception in the Sanger Centre (Cambridge, England) in 1993.

In comparison to these five bases, genome sizes reach astronomical dimensions. The average length of a small viral genome is in the range of 10^5 base pairs (bp). With the increasing complexity of organisms, further orders of magnitude are quickly exceeded: Escherichia coli already reaches 4.7 × 10^6 bp, Saccharomyces cerevisiae 1.4 × 10^7 bp, Drosophila melanogaster 1.8 × 10^8 bp, and humans 3.2 × 10^9 bp. Genomes of plants and even of lower organisms can reach even greater lengths: wheat (Triticum aestivum) 1.6 × 10^13 bp and Amoeba dubia 1.2 × 10^14 bp. Depending on the strategy used, the actual number of bases to be sequenced can easily reach a hundred times the genome size. This does not yet include the needs of diagnostic DNA sequencing, which is increasingly gaining in importance. In addition, even though DNA sequencing analyzes only small fragments, it processes these in large numbers.

At the same time as the development of the Sanger method, cloning methods in M13 phages became available, which allowed both the biological amplification of DNA fragments in a size range of up to two kilobase pairs and the generation of "easily" sequenceable single-stranded DNA. A maximum read length of 200 bp could be achieved; therefore, only a fraction of the entire sequence could be determined in a single sequencing run.
This disparity forced the development of sequencing strategies that, at a reasonable expense, enabled the reconstitution of the whole sequence. Equipped with the tools of the Sanger method, the analysis of whole genomes began in the 1970s. In 1977, Sanger and his coworkers published the 5386 bp DNA sequence of the phage phiX174. In 1982, the complete sequence of the human mitochondrion, with a length of 16 569 bp, was determined. By 1984, with the sequence of the Epstein–Barr virus, a length of 172 282 bp had already been achieved. Only 25 years after Sanger's breakthrough, in 2001, the first sequence of the human genome was published. The cost of the Human Genome Project (HGP) far exceeded one billion US dollars. The project made one thing clear: cost-effective and rapid sequencing of larger genomes would hardly have been possible with the existing technologies, and many questions, such as the dynamics of genomes, remained unanswered, even though the data constituted a milestone and made many procedures, such as microarray analysis, possible. It would take many years for massively parallel sequencing (MPS) methods, also known as next generation sequencing (NGS), to become available. The gross output of 300 Gb and more per experiment and the associated substantial cost reduction created the chance for new procedures addressing objectives covering all aspects of genomes that were previously out of reach (Table 30.1).

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Part IV: Nucleic Acid Analytics

Table 30.1 Application areas in the field of next generation sequencing (NGS).

Structural genomics:
- De novo sequencing
- Resequencing (SNPs, structural variants, exome, tumor/normal tissue, personalized medicine, GWAS)
- Metagenomics
- Pharmacogenomics

Functional genomics:
- Transcriptome sequencing
- sRNA, miRNA

Protein–DNA interactions:
- ChIP

Epigenetic modifications:
- DNA methylation

The strength of the new sequencing systems lies in the freely scalable coverage (depth) with which the target regions are analyzed. In a microarray experiment this is limited, and it is difficult to find rare transcripts in a sample; with a sequencer, rare transcripts are in principle easier to detect by adjusting the variables mentioned above. Most of these methods are still based on the Sanger method in their basic approach, and despite the substantial improvement in throughput the previous sequencing processes are still used in their own right, for example, for sequencing individual genes to validate expression constructs and PCR products.

In addition, it should not go unnoticed that the new methods have their own systematic errors: enzymatic amplification is necessary in most cases to obtain enough product that can later be detected in an instrument. Sources of bias include preferences of the amplification techniques used, the fact that the generation of sequencing libraries is not fully random (e.g., selection against GC-rich fragments), and preferences of the sequencing enzymes that distort the results. This may result in a non-representative coverage of all sequence regions, so that a complete picture is questionable. Even in methods that are not based on amplification, the instruments themselves suffer from selectivity. While the MPS methods are reaching increasing methodological maturity, the first single-molecule sequencing devices are already available and further methods are in development.

In 1996, more than 300 megabases (Mbp) of sequence information were newly recorded in the EMBL Nucleotide Sequence Database, nearly as much as had been registered in the 13 years since the founding of the institution. In June 2005, the database held 95 gigabases (Gb) in 54 × 10^6 entries; this corresponds to the daily production of an MPS machine. As of March 2011, there are 206 × 10^6 entries and 319 × 10^9 bp.
These figures illustrate both the technological advances of the processes used as well as the increasing use of these techniques. As great as the numbers may seem, they represent sequences from different organisms, different versions, and also sequence fragments of small size, of which the location is not known in all cases. The path to a final single and complete sequence as the representation of a genome is not to be underestimated and requires a considerable effort. Recently the first projects for “Platinum genomes” have been started.

30.1 Gel-Supported DNA Sequencing Methods

As already mentioned, gel-supported methods are still broadly used for DNA sequencing of smaller projects and diverse objectives, such as the screening of expression constructs or PCR products, or for templates that do not parallelize for use on an MPS (massively parallel sequencing) device, are too short, or cannot yet be resolved on the new sequencers. The original shotgun method was based on the statistical reduction of the genome into small fragments of 1–2 kb in length, their cloning and sequencing, and the assembly of the individual sequences like a puzzle. The overall picture of the sequence thus arises only at the end of a project. This uncertainty during the project was addressed by developing orderly and

purposeful methods. Primer walking, nested deletions, delta restriction cloning, chromosome walking, or combinations of these methods allow a localization of the obtained information even during sequencing. Hybrid procedures combine the high initial rate of data accumulation of a random strategy with the reduced sequencing effort of a directed strategy. The starting point for genome-wide DNA sequencing is a correspondingly precise physical map that acts as a rough guide. Positions and relations of individual clones are reviewed at large scale with fingerprinting methods and in detail with fine-mapping. The entire process of cloning and sequencing is based on statistical events that lead to an unpredictable sequence representation of individual results. Accordingly, gaps between sequence contigs are to be expected. These may have their origin both in sequence gaps (e.g., a read length that is too short) and in physical gaps (a section missing from the corresponding clone library). Sequence gaps are usually closed by primer walking on the clone in which the neighboring, but not yet joined, contigs lie. Physical gaps can only be closed with the aid of a second clone library, one that was created using a different cloning vector, size selection, and fragment generation. Using an independent library is usually necessary to bypass possible instability of target sequences in a vector/host system. For identification of the missing sequences, the second library is screened by PCR with oligonucleotides whose sequences correspond to the ends of the previously unconnected contigs. A PCR product can arise only where the two end sequences lie on the same clone; the corresponding clone thus contains the missing sequence to connect the two contigs in question. Sequence repeats (direct repeats, inverted repeats) that exceed the read length can also make it difficult to reconstruct the original sequence.
Only the use of additional map information of different resolutions as well as the examination of 5´ and 3´ border sequences can help in such cases.
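The puzzle-like assembly of overlapping fragments into contigs can be sketched with a minimal greedy overlap-merge in Python. This is an illustrative toy, not the text's method: the reads, function names, and minimum-overlap threshold are invented, and real assemblers additionally handle sequencing errors, repeats, and reverse complements.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len, else 0)."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:l]):
            return l
    return 0

def greedy_assemble(frags, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.
    When no overlap >= min_len remains, the leftover pieces are separate
    contigs -- the 'gaps' discussed in the text."""
    frags = list(frags)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    l = overlap(a, b, min_len)
                    if l > best[0]:
                        best = (l, i, j)
        l, i, j = best
        if l == 0:        # no overlaps left -> gap between contigs
            break
        merged = frags[i] + frags[j][l:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]   # invented example reads
print(greedy_assemble(reads))               # -> ['ATGGCGTACGGAT']
```

With three overlapping reads the sketch reconstructs a single contig; removing the middle read would leave two contigs, mimicking a sequence gap that primer walking or PCR screening would have to close.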

Primer walking
Primer walking is a directed DNA sequencing strategy. In a first step, the DNA fragment to be sequenced is sequenced from both ends, in one reaction each. Starting from the information obtained, a new primer is placed in the same reading direction at each end; in this way a prolonged sequence section is determined in each step. In addition, primers are placed in the opposite reading direction to determine the sequence of the complementary DNA strand. At each point of a primer walking project, the position and reading direction of the sequence is uniquely determined. The redundancy of the obtained sequence approaches a value of two in the optimal case. The sequence information, however, can only be generated serially, step by step, and each reaction depends on the synthesis of a new primer.
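The serial nature of primer walking can be made concrete with a back-of-the-envelope calculation: each walk yields one read, and the next primer is placed shortly before the end of the previous read, so the number of reactions per strand grows linearly with insert length. The function and all numbers below are a hypothetical model for illustration, not values from the text.

```python
import math

def walking_steps(insert_len, read_len, primer_offset=50):
    """Number of sequencing reactions needed per strand when each new
    primer is placed primer_offset bases before the end of the previous
    read (the overlap confirms contiguity). Illustrative model only."""
    step = read_len - primer_offset      # net new sequence per walk
    return math.ceil(insert_len / step)

# e.g. a 5 kb insert, 800 bp reads, 50 bp overlap per step (invented numbers):
per_strand = walking_steps(5000, 800, 50)
both_strands = 2 * per_strand            # opposite-direction walks as well
print(per_strand, both_strands)          # -> 7 14
```

The factor of two for sequencing both strands corresponds to the redundancy of about two mentioned above; the serial dependence (each step needs the previous read and a newly synthesized primer) is what makes the strategy slow despite its low redundancy.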

Nested deletions
The generation of unidirectional nested deletions was developed by Steven Henikoff in 1984. This sequencing strategy allows an ordered sequence extraction using merely a standard primer. The starting material consists of plasmids with a known priming sequence, which must have a certain structure: between the priming site and the cloned insert there must be unique recognition sites for two different restriction endonucleases whose sequences do not occur in the insert itself. The cutting site closer to the priming site must generate a 3´ overhang; the second, closer to the insert, must generate a 5´ overhang. After a double digestion with both restriction endonucleases, the linearized DNA molecule is treated with exonuclease III (Exo III). Exo III attacks the end of the DNA double strand carrying the 5´ overhang, where it acts as a 3´→5´ exonuclease and successively degrades the strand from the recessed 3´ end; the 3´ overhang at the other end is protected from digestion. At specific time points, an aliquot is removed and processed in parallel through the next steps. In the next step, the remaining single-stranded overhang is removed by S1 nuclease to produce a blunt end. After end repair, recircularization, and transformation, plasmids are obtained whose inserts are shortened by a certain number of bases. The shortening takes place directly next to the binding site for the sequencing primer; thus, immediately after the primer, a section of unknown sequence begins. Fragments with a maximum length of about 3 kb can be processed. A second series of deletions must be generated to determine the sequence of the complementary DNA strand.

Delta restriction cloning
Delta restriction cloning allows the generation of numerous subclones, whose positions relative to one another are known, through simple digestion with polylinker restriction endonucleases. In the first step, the

Contig: A continuous (contiguous) DNA sequence that is generated during the computational assembly of overlapping DNA fragments.

Cosmids: Circular DNA molecules like plasmids. The name cosmid refers to the DNA sequences referred to as cos, which are derived from phage lambda; these segments make it possible to clone larger inserts into cosmids than into plasmids.
BAC: The bacterial artificial chromosome (BAC) is derived from the single-copy F plasmid of the bacterium Escherichia coli and permits stable cloning of longer inserts of more than 300 kbp in bacteria.

Possibilities of PCR, Section 29.1

fragment to be sequenced is cut and analyzed on an agarose gel. From all clones that have an internal restriction site for a polylinker enzyme, a piece is removed by this digestion. Simple recircularization of the fragments results in clones that carry a deletion relative to the position of the primer and thus deliver new sequence during the reaction with standard primers.

Transposon-mediated DNA sequencing
This method enables the introduction of primer binding sites in a simple enzymatic step, which in turn allows the sequencing of longer DNA sections without subcloning or primer walking. Transposons are used that provide the primer binding sites, encode a selectable marker (e.g., kanamycin or tetracycline resistance), and allow bidirectional sequencing. The insertion is carried out by simple incubation of the corresponding transposase with the target DNA mixture. After transformation into Escherichia coli, clones can be selected and sequenced. In principle, these systems have no sequence-specific preference; in some cases, however, accumulation in the area of certain hotspots can occur.

While the sequencing of whole genomes or genes was carried out with classical subcloning strategies, as a shotgun within a framework of directional clones, until the turn of the century, direct template production through PCR has prevailed in phylogenetic analyses and in clinical diagnostics. The use of PCR products as sequencing templates requires only a simplified purification (e.g., with silica material or magnetic particles) prior to the actual DNA sequencing reaction.

Parallel to the development of sequencing strategies was the development of sequencing techniques. While the first sequencing reactions were still performed with radioactive labels, current techniques make use of fluorescent labels. In addition, the increased understanding of DNA polymerases led to the use of other and new enzymes. After initial work with DNA polymerase I and the Klenow fragment and the use of genetically modified T7 DNA polymerases, thermostable DNA polymerases or even mixtures are now being used. Meanwhile, read lengths of 1000 bp and more can be achieved.

The classic retrieval of RNA/DNA sequences occurs primarily in six steps:

1. Isolation and purification of nucleic acid: Genomic DNA is extracted with a method adapted to the target organism(s) and/or the target region. Direct sequencing of RNA is rarely used nowadays; as a rule, a cDNA copy is generated by reverse transcription and subsequently sequenced.

2. Cloning or PCR amplification: The DNA obtained in the first step is in several respects not suitable for direct DNA sequencing. The length of genomic DNA is too great to be processed with a conventional procedure: at present, a sequence of ∼1 kb can be produced in one sequencing run, and only in favorable cases. During the analysis of, for example, a human gene of 50 kb, one obtains only a fraction of the total information; this fragment must still be positioned and reconstructed into the original sequence together with the other generated fragments.
In addition, for all types of DNA isolates, the number of copies contained in a preparation is not sufficient for sequencing. The automated DNA sequencing systems currently available have a detection limit of about 10 000 molecules, which is not reached in a simple DNA isolate. To obtain a sufficient starting amount for DNA sequencing, the DNA is therefore amplified. For the reconstitution of longer DNA sequences (>2 kb) this must be done in vivo in cloning systems; the limitations of this process can be avoided with appropriate sequencing strategies. For shorter sequence segments, which are frequently analyzed in medical diagnostics, PCR reactions are sufficient.

3. DNA purification for the sequencing reaction: To obtain optimal DNA sequencing results, a further purification is required. Contaminating proteins, carbohydrates, and salts can influence the reaction environment in an uncontrolled manner and lead to dramatically reduced read lengths. For a more detailed description of the procedures, please refer to the relevant chapters of this book.

4. DNA sequencing and electrophoresis: The reaction products of the sequencing reaction of a DNA fragment are fractionated by gel electrophoresis; the generated band patterns are recorded online or offline and subjected to the following analysis.

5. Reconstitution of the original sequence information: As mentioned above, the sequence generated in a sequencing reaction is in most cases smaller than the entire sequence that must

be determined. From the many fragments generated by sequencing reactions, the original picture must be restored, as in a genetic puzzle. For this, computerized and automated methods are used.

6. Error correction and sequence data analysis: The obtained sequence information is subjected to quality control. To suppress possible errors of individual sequencing experiments, the DNA sequences to be determined are sequenced with multiple redundancy and, in addition, both complementary strands are sequenced. Subsequently, the sequence is examined in a first step for possible contamination by vector, bacterial, or foreign DNA, which, where appropriate, is removed from the sequence. With the aid of codon usage tables, potential ORFs (open reading frames), other reference sequences, and various other tools, possible sequencing errors can be detected. These steps ensure that errors arising during sequencing and the subsequent assembly are corrected. The final sequence can be compared with DNA sequences in the established databases and subjected to a detailed sequence analysis.

The overall process, as outlined in the list above, requires the combined use of a multiplicity of methods that have been described in detail elsewhere in this book. The following sections of this chapter are limited to the actual processes of gel-supported DNA sequencing, as well as the associated labeling and detection methods. Gel-supported DNA sequencing methods are mainly based on the production of base-specifically terminated DNA populations that are separated according to their size in a subsequent denaturing polyacrylamide gel/linear acrylamide gel electrophoresis. It is therefore basically an endpoint analysis. Up to 96 samples can be processed in parallel and read lengths of up to 1000 bp per sample can be achieved.
These populations can be generated in two different ways: the reaction products can be prepared by synthesis of a DNA strand (dideoxy sequencing, Sanger method) or by base-specific cleavage (chemical cleavage, Maxam–Gilbert method). Gel electrophoresis is performed in capillaries filled with linearly polymerized acrylamide gels.
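The redundancy-based error suppression described in step 6 above can be sketched as a majority vote over reads that are assumed to be already aligned to common coordinates. This toy sketch (function name and example reads invented) ignores alignment itself, base quality values, and strand information, which real base callers take into account.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over reads already aligned to the same
    coordinates; '-' marks positions a read does not cover."""
    length = max(len(r) for r in aligned_reads)
    out = []
    for i in range(length):
        bases = Counter(r[i] for r in aligned_reads
                        if i < len(r) and r[i] != '-')
        # most common base wins; 'N' where no read covers the position
        out.append(bases.most_common(1)[0][0] if bases else 'N')
    return ''.join(out)

reads = ["ATGGCA-",
         "ATGACAT",   # simulated sequencing error at position 4 (G -> A)
         "-TGGCAT"]
print(consensus(reads))  # prints ATGGCAT
```

With threefold redundancy the single simulated miscall is outvoted; this is the rationale for sequencing each region multiple times and on both strands.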

30.1.1 Sequencing according to Sanger: The Dideoxy Method

The dideoxy method (also known as the chain termination method or terminator method) is based on the enzyme-catalyzed synthesis of a population of base-specifically terminated DNA fragments that can be separated by gel electrophoresis according to their size. From the resulting band patterns of a denaturing polyacrylamide gel, the sequence can be reconstructed. The basic principle is described in the following section.

Starting from a known start sequence, the synthesis of a complementary DNA strand is initiated by adding a sequencing primer (a short DNA oligonucleotide of about 20 bases), a nucleotide mix, and a DNA polymerase. To detect the reaction products, they must be labeled with either radioactive isotopes or fluorescent reporter groups (Section 30.1.2). The use of a primer is necessary, on the one hand, to obtain a defined starting point for sequencing and, on the other, for the formation of the initiation complex and thus the start of synthesis by the DNA polymerase. The reaction is started simultaneously in four parallel aliquots, which differ only in the nucleotide mixes used. The reactions labeled A, C, G, and T each contain a mixture of the naturally occurring 2´-deoxynucleotides and only one type of synthetic 2´,3´-dideoxynucleotide, the so-called terminator (Figure 30.1).

During strand synthesis through stepwise condensation of nucleoside triphosphates in the 5´→3´ direction, two different reaction events can occur. During the condensation of the 5´-triphosphate group of a 2´-deoxynucleotide (dNTP) with the free 3´-hydroxyl end of the DNA strand, an elongated DNA molecule is produced with the release of inorganic diphosphate (pyrophosphate); the elongated strand in turn has a free 3´-hydroxyl group, so the synthesis can be continued in the next step (Figure 30.2a).
If, however, a condensation reaction occurs between the free 3´-hydroxyl end of a DNA strand and the 5´-triphosphate group of a 2´,3´-dideoxynucleotide, an extension of this strand is no longer possible, because a free 3´-hydroxyl group is no longer available: the strand is terminated (Figure 30.2b). The characteristics of the DNA polymerase used and the structure of the dideoxynucleotide determine the mixing ratio, leading to the production of a population of base-specifically terminated

Figure 30.1 Structural comparison of a 2´,3´-dideoxynucleotide and a 2´-deoxynucleotide. The illustrated 2´,3´-dideoxy-CTP differs from its naturally occurring analogue 2´-deoxy-CTP by the lack of the hydroxyl group at C3´ of the sugar.

Figure 30.2 Synthesis of a DNA strand with incorporation of a 2´ -deoxynucleotide (a) and a 2´ ,3´ -dideoxynucleotide (b). In case of the latter, further polymerization is no longer possible.

Gel Electrophoresis of DNA, Section 27.2.1

Figure 30.3 Principle of the chain termination method according to Sanger. In a primed DNA synthesis reaction catalyzed by a polymerase, base-specific terminated DNA fragments of different lengths are synthesized. These fragments produce a particular band pattern in the gel electrophoresis that is used for reformation of the base sequence.

reaction products that in each case differ by only one base. At a molar ratio of dNTP:ddNTP of 200:1, termination events are relatively rare in T7 DNA polymerase-catalyzed reactions; long reaction products of up to 1000 bp in length are then created. As mentioned above, the reaction is carried out in four aliquots. Each of these partial reactions contains only one type of terminator (ddATP, ddCTP, ddGTP, or ddTTP), which statistically replaces its naturally occurring counterpart in the synthesized DNA chain. In every reaction vessel, products are therefore generated that end on only one base type, such as A. The reaction products are separated by electrophoresis according to their size in a denaturing polyacrylamide gel. The labeled reaction products of all four partial reactions create a "ladder" of bands that each differ by one base (one rung). From this series of rungs of the reactions A, C, G, and T, the base sequence can be read from the bottom (position 1) upwards (Figure 30.3).
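The logic of reading a sequence off the four-lane ladder can be illustrated with a small sketch: each of the four reactions contributes the lengths of the fragments terminated on its base, and sorting all fragment lengths reproduces the base sequence. For brevity the template is taken directly as the strand being synthesized (no primer or complementary strand is modeled), and all names are invented for illustration.

```python
def sanger_ladder(template):
    """For each base type, collect the lengths of the terminated synthesis
    products: one fragment per occurrence of that base in the synthesized
    strand (a ddNTP was incorporated at that position)."""
    ladder = {b: [] for b in "ACGT"}
    for pos, base in enumerate(template, start=1):
        ladder[base].append(pos)
    return ladder

def read_gel(ladder):
    """Read the band ladder from the bottom (shortest fragment) upwards,
    noting for each rung which of the four reactions produced it."""
    bands = sorted((length, base) for base, lengths in ladder.items()
                   for length in lengths)
    return ''.join(base for _, base in bands)

seq = "GATTACA"                              # invented template strand
print(read_gel(sanger_ladder(seq)))          # -> GATTACA
```

The sketch makes the key property of the method explicit: because every position of the strand is represented by exactly one terminated fragment, the sorted fragment lengths and their lane assignments together encode the full sequence.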

Figure 30.4 Autoradiogram of a sequencing run. Four adjacent tracks each represent the partial reactions A, C, G, and T of one sample. In accordance with the running direction of the gel, the smaller reaction products are at the lower end of the figure. The sequence is read from bottom to top; each band differs, ideally, by one base from the previous one. Source: adapted from Nicholl, D.S.T. (1994) An Introduction to Genetic Engineering, Cambridge University Press.

Figure 30.4 shows a classic, old-fashioned but instructive, autoradiograph of a sequencing gel with radioactively labeled reaction products. The other figures in this section show pseudo-chromatograms of fluorescently labeled reaction products. In this representation, the bands of a lane are connected by a section line in the running direction of the gel and the corresponding band intensities are determined. In this way, the data are reduced by one dimension, creating a representation known as trace data, which shows the intensity curve as a function of time.

This reaction principle has remained unchanged from the developments of Sanger in 1977 until today. The discovery and modification of DNA polymerases led to the refinement of the method described above, to protocols in which T7 DNA polymerase is used, and to the development of cyclic processes that combine signal amplification and sequencing in one reaction.

T7 DNA Polymerase-Catalyzed Sequencing Reactions
T7 DNA polymerase belongs to the class I DNA polymerases, like Escherichia coli DNA polymerase I and Taq DNA polymerase. Nevertheless, these enzymes differ in their function, their properties, and their structure. T7 DNA polymerase is the replicating enzyme of the T7 phage genome, while the other polymerases are responsible for repair and recombination. Accordingly, the native enzymes contain exonuclease activities: T7 DNA polymerase has a 3´→5´ exonuclease activity, DNA polymerase I both 3´→5´ and 5´→3´ exonuclease activities, while Taq DNA polymerase shows only a 5´→3´ exonuclease activity. Through N-terminal deletion, the 3´→5´ exonuclease activity could be removed from T7 DNA polymerase and some thermostable DNA polymerases, as it would otherwise lead to deterioration of the sequencing results. T7 DNA polymerase is further distinguished from unmodified enzymes by its significantly lower discrimination against modified nucleotides. This has the advantage of lower costs and improved sequencing results.
It is thus close to an ideal DNA polymerase for DNA sequencing; its stability at higher temperatures is low, however. Because of its more uniform distribution of signal intensities in the sequence patterns and its low signal background, T7 DNA polymerase has made the Klenow fragment, which was

Figure 30.5 Trace (raw) data of a T7 DNA polymerase-catalyzed sequencing reaction. A plasmid was alkaline denatured, neutralized, and subjected to Sanger reaction with fluorescently labeled primers. The data were generated on an automated DNA sequencing apparatus and analyzed.

used as the original DNA sequencing enzyme, largely obsolete. In Figure 30.5, the results of a T7 DNA polymerase-catalyzed reaction with fluorescently labeled primers are shown as an example. The representation differs from the top view of a sequencing gel as chosen in Figure 30.4: Figure 30.5 shows a longitudinal section through the four lanes of a DNA sequence ladder. The individual bases (labels) can be rendered in color.

Specifically, a sequencing reaction with a dsDNA molecule is composed of the following partial steps: denaturation and neutralization, primer annealing, strand synthesis, reaction stop, and final denaturation, which are considered below in more detail.

Denaturation and Neutralization
DNA sequencing reactions can be carried out with both double-stranded (ds) and single-stranded (ss) DNA. They differ only in the denaturation carried out in the first step. Denaturation in the presence of an added primer is a prerequisite for the subsequent enzymatic DNA synthesis. Single-stranded DNA is denatured by brief heat treatment and then brought to the optimum reaction temperature of the DNA polymerase (37 °C) for the further reaction. In the case of double-stranded DNA, as described here, the protocol begins with an alkaline denaturation, since heat incubation alone is not sufficient for complete denaturation; only the combination with a strongly alkaline agent such as NaOH brings about the desired strand separation. The subsequent neutralization allows adjustment of the reaction conditions required for strand synthesis.

Basic Principles of Hybridization, Section 28.1

Primer Hybridization
The ordered initiation of synthesis only takes place if the DNA used is converted into a linear, single-stranded form and a primer is subsequently hybridized at the designated location. The kinetics of oligonucleotide hybridization is described by the formula derived by Lathe. It provides a connection between the hybridization time t1/2, in which 50% of

the oligonucleotides hybridize to a template, and the complexity of the sequence as well as the length and the concentration of the oligonucleotides:

t1/2 = N ln 2 / (3.5 × 10^5 × √L × Cn)    (30.1)

where t1/2 is the hybridization time in seconds, N the number of base pairs of the non-repetitive sequence (complexity), L the length of the oligonucleotide, and Cn the oligonucleotide concentration in mol l^-1. If one inserts the parameters most commonly occurring in a sequencing reaction, a primer length of 18–25 bases and a concentration of 10^-7 M, hybridization times of between 3 and 5 seconds result.

Strand Synthesis and Pyrophosphorolysis
The synthesis of DNA catalyzed by a DNA polymerase is an equilibrium reaction. The equilibrium is shifted to the side of the condensation products, that is, the reverse reaction runs considerably more slowly than the forward reaction. With increasing reaction times, and depending on the amount of DNA used, however, the reaction can run essentially backwards. In these cases terminal dideoxynucleotides may be removed again. This effect, which is called pyrophosphorolysis, manifests itself in disappearing sequencing bands in the sequencing gel (Figure 30.6). Pyrophosphorolysis appears to be sequence-specific, occurring at selected positions in both T7-catalyzed and cycle sequencing reactions. However, the removal of diphosphate from the reaction equilibrium can suppress the

Figure 30.6 Comparison of a sequencing run with diphosphatase (a) and without diphosphatase (b). The complete lack of reaction products (position 91, 92) and the reduced signal strengths at specific positions (position 68, 86, 106) are easy to detect.

reverse reaction almost quantitatively. This can be achieved by adding a diphosphatase (pyrophosphatase), which cleaves the diphosphate into monophosphates.

Strand Synthesis and Cofactors
The process of enzyme-catalyzed DNA polymerization is Mg2+-dependent. It is speculated that Mg2+ ions are required during catalysis for the stabilization of the α-phosphate group of the incorporated nucleotide. Sequencing reactions that contain Mg2+ as counter-ions for the nucleotides are characterized by strong variations in the signal levels of the individual sequencing bands. The partial substitution of Mg2+ by Mn2+ in T7 DNA polymerase reactions leads to a homogenization of the signal intensities and thus facilitates reading of the sequences, especially in regions in which the resolution of the sequencing gel decreases. This effect, however, cannot be transferred to thermostable DNA polymerases, as Mn2+ inhibited the entire reaction in the cases studied.

Strand Synthesis and Additives
As additives in DNA sequencing reactions, different groups of substances come into consideration, such as proteins or detergents. The addition of the single-strand-binding protein of Escherichia coli (SSB) or of the T4 gene 32 product can indeed stabilize single-stranded DNA structures; however, it brings only marginal improvements to the reaction and requires high concentrations. The combination of DNA polymerase and diphosphatase has already been discussed above. Further possibilities include combinations of DNA polymerases with different characteristics: in a mixture, one polymerase for labeling the DNA strand can be combined with another polymerase that synthesizes the sequencing products. The reduction of background signals on addition of DMSO to T7 polymerase reactions is attributed to its denaturing effect. The addition of formamide and of detergents such as Triton X-100 reduces the background of sequencing reactions catalyzed by thermostable enzymes.
Strand Synthesis and Nucleotide Analogs In dideoxy sequencing reactions the nucleotide analogues c7-deaza-dGTP (Figure 30.7a) and dITP (Figure 30.7b) are used. Both analogues are accepted by DNA polymerases and incorporated into the polymerized strand. However, their derivatization weakens the formation of hydrogen bonds, and thus of stable secondary structures. This effect is exploited in regions or structures with a high GC content. Otherwise an effect may occur that is referred to as compression: a zone is observed in the sequencing gel in which the band spacing shortens continuously until several bands accumulate in a considerably broader zone. After this compression the next bases appear only after a significantly wider, empty zone. Within the compression zone, more than one sequencing product can be located within a single visible band. The usually regular spacing pattern

Figure 30.7 Structure of deoxynucleotide analogues in dideoxy sequencing reactions: (a) C7 deaza-dGTP and (b) dITP.

30 DNA Sequencing


for a base is severely disrupted. This compression is caused by the interaction of highly GC-containing, complementary sections. It can result, for example, from hairpin structures, whose mobility differs drastically from that of regular sequences. Final Denaturation For exact sizing of the DNA fragments, complete denaturation is necessary to prevent sequence-dependent folding of DNA molecules or the formation of aggregates of multiple DNA molecules, and thus uncontrollable migration behavior. Urea is usually used as the denaturing agent at a concentration of 7–8 M. Because of their polar properties, the carbonyl group and the amino groups of urea compete with the individual bases for the formation of hydrogen bonds and can thus prevent the formation of base pairs. The additional use of formamide – after completion of the sequencing reactions and subsequent heat treatment – results in extensive denaturation in the reaction vessel even before the sample is applied to the sequencing gel. Chelation of the divalent metal ions (Mg2+, Mn2+) present in the reaction environment by EDTA leads to the dissociation of the DNA polymerase complexes.
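Hairpin-prone regions of the kind described here can also be screened for computationally. The following Python sketch (illustrative, not part of the original text; the stem and loop lengths are arbitrary assumptions) searches a sequence for short inverted repeats that could fold back into hairpins:

```python
# Illustrative sketch: scan a sequence for short inverted repeats,
# which can fold into hairpins and cause band compression on gels.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def hairpin_candidates(seq, stem=6, min_loop=3, max_loop=8):
    """Return (start, stem_seq) pairs where a stem of length `stem`
    is followed, after a short loop, by its reverse complement."""
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == revcomp(left):
                hits.append((i, left))
                break
    return hits

# A GC-rich stem (GGGCCC) separated from its inverted repeat by a loop:
print(hairpin_candidates("ATGGGCCCTTTTGGGCCCAT"))
```

Real hairpin prediction additionally weighs base-pairing energies (GC pairs being more stable), which is why GC-rich stems dominate compression artifacts.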

Cycle Sequencing with Thermostable DNA Polymerases The information obtained with T7 DNA polymerase about the structure and function of the enzyme could largely be transferred to thermostable DNA polymerases. By genetic engineering it was thus possible to completely remove the 3´-5´ exonuclease activity. The favorable property of T7 DNA polymerase of discriminating only slightly between dNTPs and ddNTPs could be attributed to the tyrosine residue 526 of the nucleotide binding site. Exchanging the corresponding phenylalanine residue at position 667 of Taq DNA polymerase for tyrosine likewise reduced this discrimination. Today, thermostable DNA polymerases are available that match the properties of T7 DNA polymerase in all essential points but have considerable advantages because of their thermal stability. Mixtures of polymerases, additives, and modifications that can mediate, for example, hot-start properties make cycle sequencing the method of choice. The application of thermostable DNA polymerases allows amplification and sequencing to be combined in a process analogous to PCR (Figure 30.8). This process is called cycle sequencing. Unlike PCR, only one primer is present in the reaction, so the template is amplified only linearly and not exponentially. The repeated heat incubation also suffices for the denaturation

Figure 30.8 Principle of a cycle sequencing reaction. In a cyclically repeated (about 30 times) sequencing reaction, a mixture consisting of primers, template, thermostable DNA polymerase, dideoxynucleotides, and deoxynucleotides is incubated at 97, 60, and 72 °C. At the highest temperature, primer and template are thermally dissociated. At the lowest temperature, primer and complementary DNA section associate, to be extended and terminated at the intermediate temperature by the DNA polymerase. Source: adapted according to Strachan, T. and Read, A.P. (2005) Human Molecular Genetics, 3rd edn, Oxford University Press, Heidelberg.


Figure 30.9 Comparison of enzyme processivity in highly repetitive sequence sections. The repeat shown has a degree of repetition of 200(!). The structure is clearly resolved with T7 DNA polymerase (a), while a thermostable DNA polymerase (b) shows nonspecific termination after the first repeat.

of double-stranded DNA. In a mixture of DNA template, primers, thermostable DNA polymerase, and a dNTP/ddNTP mixture, a thermal profile consisting of denaturation, primer annealing, and DNA synthesis is repeated up to 30 times, thereby repeating the sequencing reaction up to 30 times and producing a correspondingly large amount of sequencing fragments. The use of genetically modified enzymes, detergents, and diphosphatase allows the production of sequencing data that reach the quality of T7 polymerase reactions. Despite the much improved cycle sequencing conditions, there are structures, in particular repetitive sections, that cannot yet be accurately determined with these protocols; Figure 30.9 gives an example. Even today, DNA sequencing requires the use of different methods, since no single method can cover all areas satisfactorily. Even the laborious chemical cleavage method of Maxam and Gilbert (Section 30.1.3) is rarely used, but still employed in difficult cases. Figure 30.10 Introduction of fluorescent labels in DNA sequencing reactions: (a) labeled primers, (b) labeled deoxynucleotides, and (c) labeled dideoxynucleotides. In cases (a) and (b), further deoxynucleotides can be added after incorporation of the labeled group.
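The contrast between the linear accumulation of cycle sequencing (one primer) and the exponential amplification of PCR (two primers) can be sketched numerically. The Python sketch below is illustrative and not from the text; the function names and the idealized 100% efficiency are assumptions:

```python
# Illustrative sketch: product accumulation per thermal cycle.
# Cycle sequencing uses a single primer, so each original template
# yields at most one new terminated product per cycle (linear growth);
# PCR with two primers doubles the amplifiable material each cycle.

def cycle_sequencing_products(templates, cycles):
    # one terminated fragment per template per cycle
    return templates * cycles

def pcr_products(templates, cycles, efficiency=1.0):
    # idealized PCR: each cycle multiplies by (1 + efficiency)
    n = templates
    for _ in range(cycles):
        n *= (1 + efficiency)
    return int(n)

print(cycle_sequencing_products(1000, 30))  # linear: 30 000 fragments
print(pcr_products(1000, 30))               # exponential: ~10**12 molecules
```

The linear growth explains why cycle sequencing needs more template than PCR, but it suffices because each cycle adds a fresh set of terminated fragments for detection.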

DNA Probes, Section 28.2.1

30.1.2 Labeling Techniques and Methods of Detection

In the following, the incorporation of label groups into the sequencing products and their detection in automated systems are described. Isotopic Labeling The use of radioactive isotopes in DNA sequencing rests on the fact that DNA polymerases do not discriminate between different isotopes and that incorporation into a DNA strand does not change its mobility in the gel. The isotopes 32P and 35S are used. The radiation from 32P is of higher energy than that of 35S. The exposure times in the autoradiography


that follows gel electrophoresis can therefore be shorter with 32P. However, the spatial resolution is significantly lower, because the higher-energy radiation blackens a larger area of the film. Therefore, especially for longer DNA sequencing runs, 35S is preferred. The label may be introduced by a radioactively labeled primer or during the sequencing reaction itself. Radioactive labeling of DNA sequencing primers is performed by phosphorylation with γ-32P-ATP and polynucleotide kinase. Primers produced by chemical synthesis have free 5´-OH groups, which can be phosphorylated directly, without prior dephosphorylation. Labeling during the dideoxy sequencing reaction is performed by adding α-32P-dATP; the isotope is incorporated into the synthesized DNA strand, which thus becomes detectable. Detection is carried out either by autoradiography or by the use of imaging plates. The autoradiograms can be evaluated manually or semi- or fully automatically with a digitizer or scanner. This method is now nearly obsolete. The introduction of fluorescent labels is more difficult than that of radiotracers. The fluorescent groups are of considerable size and in many cases are only marginally accepted by the enzymes for steric reasons. If the detection groups are accepted, however, statistical multiple incorporation must be prevented: owing to the altered charge ratios and the additional mass, it would inevitably change the gel electrophoretic mobility and thus preclude sequence determination. Labeling can be carried out with a labeled primer, by internal labeling, or with labeled terminators (Figure 30.10). Fluorescein, NBD, tetramethylrhodamine, Texas Red, and cyanine dyes are used for labeling. The dyes can be linked to the appropriate components (nucleotides, amidites) and are accepted by DNA polymerases as either 2´-deoxynucleotides or 2´,3´-dideoxynucleotides.
The dyes are only slightly bleached by the laser excitation in the detection systems and are sufficiently stable under coupling, sequencing, and electrophoresis conditions. Fluorescent Labeling and Online Detection

Combinations of fluorescent dyes are also used. These so-called energy transfer dye systems are based on the idea of exciting individual dyes whose excitation spectrum does not correspond to the excitation wavelength of the system, or of using the emission wavelength of a dye that does not interact with the other dyes used in the system (minimization of spectral overlap). The laser-induced emission of one dye is used to excite the actual fluorescent label (Figure 30.11).

Energy Transfer Dyes

The use of a 5´-fluorescently labeled primer in a DNA sequencing reaction is unproblematic. The unlabeled products present in every primer preparation are not visible in the analyzers and therefore do not disturb the results. A rarely occurring self-priming of the template DNA, caused by partial self-complementarity, would likewise not be visible. Non-specific termination, which is generally due to inadequate reaction conditions or to the structure of the DNA being sequenced, can, however, be recognized as a non-readable structure. The primers are generally labeled by coupling a fluorescent amidite in the last step of the synthesis (at the later 5´ end). Alternatively, an amino linker can be added through another amidite and coupled to the fluorescent dye after the synthesis is complete. However, the efficiency of this process is lower Primer Labeling

Figure 30.11 Energy transfer dyes: (a) 5-(and 6-)carboxytetramethylrhodamine succinimidyl ester (5(6)-TAMRA, SE); (b) 5-(and 6-)carboxy-X-rhodamine succinimidyl ester (5(6)-ROX, SE); (c) 5-(and 6-)carboxyfluorescein succinimidyl ester (5(6)-FAM, SE).


than the simple linking and requires further purification steps. The labeled primer is used in the reaction in the same way as a radioactive primer. Internal Labeling with Labeled Deoxynucleotides For internal labeling, fluorescently labeled 2´-deoxynucleotides (e.g., fluorescein-15-dATP) and a DNA polymerase are used. The sequencing process must run in two stages. In a first step, with a low concentration of the labeled 2´-deoxynucleotide, a maximum of one label is added to each strand. A prerequisite is that the position following the primer allows the incorporation of this nucleotide. The concentration of the labeled component must be kept so low that in the next step of the reaction no further nucleotides of this type are incorporated, since additional random incorporation would lead to uncontrolled mobility changes. The actual sequencing reaction is carried out according to the classical Sanger principle. This technique offers the advantage of using inexpensive unlabeled primers. However, undesirable reaction products resulting from self-priming can also be labeled and complicate the sequencing results by overlaying multiple sequences. This method has become increasingly obsolete with the broader use of improved cycle sequencing methods. Labeled Terminators

Fluorescently labeled terminators have become the standard and offer the possibility of using unlabeled primers in DNA sequencing reactions. In the reaction mixtures, the labeled 2´,3´-dideoxynucleotides replace the completely unlabeled analogues, and labeling occurs through the single incorporation of a labeled dideoxynucleotide. A further advantage is that reaction products not properly terminated with a dideoxynucleotide are, because they lack the fluorescent group, not detected; such false, sequence-dependent reaction products are visible in systems using labeled primers or deoxynucleotides. The availability of base-specifically color-labeled dideoxynucleotides allows the reaction to be carried out in one reaction vessel and the gel electrophoresis in a single track. The disadvantage, however, is that DNA polymerases accept these modified nucleotides poorly, so that a high working concentration, and thus a subsequent purification of the reaction products, is required. Furthermore, the dyes have different mobilities in gel electrophoresis, which must be corrected by software. For self-priming products the same applies as for labeled deoxynucleotides. Duplex DNA Sequencing

Online Detection Systems In 1986 and 1987, online DNA sequencing systems based on laser-induced fluorescence were developed. Significant contributions were made by L. Hood in the USA and W. Ansorge in Europe. Online detection systems consist mainly of a vertical electrophoresis system, an excitation laser, a detector, and a recording computer system (Figure 30.12). The laser is coupled into the sequencing gel either transversely through the longitudinal sides, perpendicular to the detector, or at a certain angle from the front or rear. The spatially resolved band patterns of the classic method are now seen as time-resolved band patterns. The first system of Smith, Hood, and colleagues was based on the detection of reaction products produced in the dideoxy method with primers labeled in different fluorescent colors. All products of a reaction were applied to one track of the gel. The fluorescent dyes used, such as fluorescein, Texas Red, tetramethylrhodamine, and NBD, have sufficient spectral separation to enable reliable discrimination of the bases. In a further development, fluorescently labeled terminators became available, which made the elaborate primer labeling virtually superfluous. The mobility differences between the dyes used make manual analysis of the primary data impossible, but can be corrected automatically by software. The differing absorption spectra also require excitation with two different wavelengths. For observation of a gel across its entire width, a scanning mechanism was developed. The original electrophoresis systems were based on planar gels and were later replaced by capillary electrophoresis systems, which allow higher speed and better resolution at lower thermal load. Automated DNA sequencing systems support a much higher throughput of sequencing reactions than traditional methods (Figure 30.13).
While an average of four to six reactions can be run on a radioactive gel, online systems now process 96 clones in parallel with read lengths of up to 1000 bp per sample. Automated Sample Preparation Automation strategies are often based on a flexible liquid-handling system, which permits various adaptations and accessories and thus covers a very wide range of applications effectively. There are also devices available that have been specifically designed and optimized for a particular chemical reaction sequence. These


Figure 30.12 Principle of an online DNA sequencing device. A vertical gel electrophoresis apparatus is observed at its lower end by a detector. The exciting laser beam is coupled into the gel at the level of the detector (not shown here). The signals obtained from the detector are sent to an analyzing computer. The spatially resolved band patterns of radioactive DNA sequencing are replaced by a time-resolved banding pattern recorded at the detection finish line. Source: adapted according to Smith, L.M. et al. (1986) Nature, 321, 674–679.

Figure 30.13 Automated capillary DNA sequencer. Source: adapted according to Perkel, J.M. (2004) The Scientist, 18, 40–41 (now LabX Media Group).


different automation strategies are derived both from the different sample throughputs and from the level of complexity of the process. Flexible automation strategies are used in the field of small and medium sample numbers (

A: A-cleavage reaction: After methylation of the adenine residue at N3 with dimethyl sulfate, the N-glycosidic bond is cleaved. The added piperidine leads to the elimination of the base and the simultaneous β-elimination of both phosphates.


Figure 30.17 (a) Thymine- and cytosine-specific cleavage reaction. Hydrazine leads to a ring opening of the base between C4 and C6 and to its cleavage. The added piperidine leads to the simultaneous β-elimination of both phosphates. (b) C-cleavage reaction. Hydrazine opens the ring between C4 and C6 and splits off the base residue. The added NaCl suppresses the reaction of thymine. The added piperidine leads to elimination of the base and simultaneous β-elimination of both phosphates.


G > A: The methylated DNA is first heated at neutral pH for 15 min at 90 °C. This leads to the elimination of all methyl-A and all methyl-G. The A-depurination occurs four to six times faster than the G-depurination. Because of the low degree of adenine methylation, however, the G-depurination outweighs the A-depurination. Hot alkaline treatment (piperidine or NaOH) of the partially depurinated DNA therefore leads to a strengthening of the G-cleavage and a weakening of the A-cleavage (G > A pattern). The A-part of the G > A-reaction is shown in Figure 30.16b. Single-stranded DNA is also methylated at the N1 atom of adenine and at C3 of cytosine. Methylation at N1 of adenine does not lead to cleavage. Cytosines methylated at C3 are cleaved and appear as bands with about 1/10 to 1/5 of the intensity of the G-bands.

A > G: The methylated DNA is treated at 0 °C with dilute acid (e.g., 0.1 M HCl) for 2 h. This leads to preferential hydrolysis of the glycosidic bond of the methyladenines. Hot alkali treatment of the depurinated DNA thus results in a stronger A- and a weaker G-cleavage (A > G pattern).

A > C: To achieve this specificity, the DNA is treated with strong alkali (1.2–1.5 M NaOH) at 90 °C for 15–30 min. This opens the adenine and cytosine rings. Hot piperidine treatment leads to the elimination of these bases and to A > C cleavage patterns.

A + G: Limited treatment of the DNA with formic acid leads to unspecific depurination. Alkali treatment generates the A + G pattern.

C + T: Cleavage at the pyrimidine bases is performed by modification with aqueous hydrazine and subsequent cleavage with alkali. The chemical reactions of the thymine part of this combined reaction are shown in Figure 30.17a. Piperidine reacts with all glycoside products generated by hydrazine.

C: Inclusion of 1–2 M NaCl in the previous reaction suppresses the reaction of hydrazine with thymine. The steps of the cytosine reaction are shown in Figure 30.17b. Piperidine again reacts with all glycoside products of the hydrazine.

Except in the case of the G-reaction, NaOH (0.1 M) or piperidine (1 M) can be used as the alkali. Piperidine is preferred because it can be removed very easily by evaporation; when NaOH is used, the DNA must be precipitated with ethanol. Further base-specific reactions of DNA are presented in Chapters 31 and 32; for the methods discussed here, however, they are of lesser importance for the determination of the DNA sequence. Basically, the reactions G, A + G, C + T, and C are sufficient to read a sequence unambiguously. Solid Phase Process Chemical cleavage was greatly facilitated by the development of a solid phase process and its adaptation to automatic fluorescence-based DNA sequencing. After binding of the DNA template to an activated membrane surface, all cleavage reactions and the required washing procedures, up to the elution of the reaction products, can be performed on the solid surface.
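Why the four reactions G, A + G, C + T, and C suffice can be made concrete: each base leaves a characteristic presence/absence pattern across the four lanes at a given band position. The following Python sketch is illustrative only (the boolean encoding of lanes is an assumption, not part of the protocol):

```python
# Illustrative sketch: deduce a base from which of the four
# Maxam-Gilbert lanes (G, A+G, C+T, C) show a band at a position.

def call_base(g, a_g, c_t, c):
    """Each argument is True if that lane shows a band at this position."""
    if g and a_g:
        return "G"   # G cleaves in both purine lanes
    if a_g and not g:
        return "A"   # band only in the combined purine lane
    if c and c_t:
        return "C"   # C cleaves in both pyrimidine lanes
    if c_t and not c:
        return "T"   # band only in the combined pyrimidine lane
    return "?"       # ambiguous or failed reaction

# Hypothetical lane patterns for the sequence GATC, one row per band:
lanes = [
    (True,  True,  False, False),   # G
    (False, True,  False, False),   # A
    (False, False, True,  False),   # T
    (False, False, True,  True),    # C
]
print("".join(call_base(*row) for row in lanes))
```

Reading a real autoradiogram follows the same logic, band position by band position from the bottom of the gel upward.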

Figure 30.18 Structure of α-thionucleotides.

α-Thionucleotide Analogues The incorporation of α-thionucleotides (Figure 30.18), for example during a PCR, allows a simple chemical cleavage reaction with 2,3-epoxy-1-propanol and, compared to the standard Maxam–Gilbert method, has the additional advantage that the DNA strand is not attacked by exonuclease activities. The 5´-O-(1-thiotriphosphates) show good incorporation characteristics with the major DNA polymerases. However, the band patterns generated only allow reliable sequence determination over short stretches. Multiplex DNA Sequencing Multiplex DNA sequencing (Figure 30.19) is based on the chemical cleavage of DNA described above. While in the above case detection can take place immediately, multiplexing requires a sophisticated hybridization scheme. The goal is the simultaneous processing of 50 sequences in one reaction, thereby minimizing the workload of the elaborate cleavage reactions and gel operations. The 50 DNA fragments to be sequenced are cloned into 50 different vectors. These vectors differ in the linker sequences flanking the fragments on the left and right. Each vector-specific oligonucleotide sequence is followed by a standardized restriction site. After excision, the DNA fragments are thus enclosed by the above-mentioned sequence regions. All fragments are combined in a cleavage reaction, separated by gel electrophoresis, and transferred to a nylon membrane (blotting). The reaction products are detected by successive hybridizations of the


Figure 30.19 Multiplex DNA sequencing.

fragments with labeled oligonucleotides that are complementary to the flanking regions. In this way, a filter can be reused up to 50 times. The multiplex walking variant dispenses with the elaborate cloning at the beginning of the process; instead, a cloned DNA is fragmented and the above procedure is then essentially followed. The first hybridization begins at the starting point specified by the vector. From the sequence obtained, a new oligonucleotide is synthesized and hybridized again. By repeating the process, new starting points are generated that make it possible to travel along the fragment (primer walking). RNA Sequencing Typically, mRNA is not sequenced directly but is converted by reverse transcription into cDNA and sequenced enzymatically by Sanger's method. For the sequencing of rRNA, chemical cleavage is sometimes used; chemical RNA sequencing means chemical cleavage by an analogue of the Maxam and Gilbert method. Owing to the lower stability of the phosphodiester bond and the different chemical composition of RNA, modified cleavage reactions are used. A 3´-terminally labeled RNA fragment is subjected to four parallel batches of base-specific modification reactions and the RNA strand is cleaved with aniline instead of NaOH or piperidine:

 G-reaction: base methylation with dimethyl sulfate, followed by reduction with sodium borohydride and cleavage of the modified RNA strand in the ribose bond with aniline.

 A > G-reaction: ring opening of the bases with diethyl pyrocarbonate (diethyl dicarbonate) at N7 and subsequent strand cleavage with aniline.

 U-reaction: Treatment with hydrazine results in cleavage of the base by nucleophilic addition to the 5,6-double bond. The modified RNA strand is cleaved by aniline.


 C > U-reaction: Treatment with anhydrous hydrazine in the presence of NaCl leads to preferential release of cytosine. Again, the strand is cleaved with aniline.

30.2 Gel-Free DNA Sequencing Methods – The Next Generation

Gel-free sequencing methods ushered in a new wave of sequencing technologies, starting in 2005 with a version of pyrosequencing technology named 454. One particular feature of this and the following technologies is that they produce large amounts of sequencing data, but made up of reads that are generally much shorter than those of standard Sanger sequencing. To obtain the mass of sequencing data that characterizes the next-generation sequencing (NGS) wave, all the techniques rely on a high degree of parallelization of their sequencing chemistry; massively parallel sequencing (MPS) is another term used for NGS. With the start of the NGS wave it became apparent that the sequencing of individual genomes was within reach, with all the impact that this may have on health care and personalized medicine. As a result, many different NGS methods have been proposed and several attempts have been made to commercialize them (for a review of the methods proposed and the relevant timeline see Reuter, J.A., Spacek, D.V., and Snyder, M.P. (2015) Mol. Cell, 58 (4), 586–597). As a result of the intense method development and rapid commercialization, many of the newer sequencing technologies are referred to by their commercial name (e.g., Illumina sequencing) rather than by the underlying method (sequencing by synthesis). The gel-free methods eliminate gel electrophoresis as the throughput- and resolution-limiting factor in DNA sequencing, but cannot hide their similarities to the Sanger method. Many are based on the incorporation of labeled nucleotides whose label is removed after detection, converting them back into an actively extendable polynucleotide chain. Detection occurs in situ, bound to a solid surface, instead of at a measured end point in a gel. It is therefore possible to achieve an extreme reduction in the reaction volume and the size of the associated detectors.
The use of activated surfaces makes dense packing of the reacting molecules possible, and thereby a huge parallelization of the sequencing and a huge sequence output. As almost every single incorporation event is detected (HiSeq, 454), the throughput is mainly governed by the microfluidics of the system in question: how quickly the sequencing cycles can be captured versus the buildup of fluorescent noise in the system over time. However, there are already systems available whose throughput is mainly governed by the processivity of the polymerase (PacBio RS). The next generation of sequencing devices, those that use nanopores or other methods, are capable of measuring single molecules and are already available to some extent. Since next-generation sequencing devices came onto the market, the amount of sequencing data has expanded hugely, as has the size of the sequencing projects being attempted (e.g., 1000 genomes and 1000 cancer genomes). The rate of sequence data production continues to increase, with systems such as the X Ten from Illumina promising $1000 human genomes. Further increases in sequencing output are predictable; the latest sequencing instruments can produce over a terabase of data in less than 4 days. This has meant that not only are new opportunities appearing for biological research, but new challenges also appear in dealing with the large amounts of data produced. New bioinformatics methods for analyzing the data, such as aligning large numbers of short sequences, have had to be developed, and the IT infrastructure that holds and processes these data has had to be expanded. NGS data analysis can last for days or weeks even on a powerful cluster. It can have very high memory requirements (1 terabyte of RAM), and the transfer of 100 GB of result data can push the boundaries of conventional WAN (wide area network) technology.
In some respects sample preparation has become easier: the classic cloning steps, which can introduce an unwelcome bias, are eliminated, although other biases are added, such as the sensitivity of PCR to base composition. In other respects, however, the process requires a much greater effort in library preparation. The available systems have a detection limit of around 25 000 molecules and therefore require amplification of the sequencing product in order to detect it. The sequencing methods are also very dependent on the size distribution of the molecules to be


sequenced; shearing and accurate size selection of the input material are therefore required to reduce the starting molecules to a length of a few hundred bases.

30.2.1 Sequencing by Synthesis

Classic Pyrosequencing By analyzing the byproducts of a polymerization reaction, one can determine the nucleotide sequence (pyrosequencing, Figure 30.20). As mentioned above, every polymerization step releases a free diphosphate, which can be detected by a coupled chemical reaction: the enzyme ATP sulfurylase converts the diphosphate into ATP, which luciferase then uses to convert luciferin into oxyluciferin with the emission of light. Polymerization can thus be detected by adding single nucleotides to a polymerization mixture in a reaction chamber; a successful polymerization produces the expected emission of light. Through successive cycles of nucleotide addition, detection of polymerization by the emitted light, and removal of the reaction products, the sequence of a nucleic acid can be determined. Light is emitted only when the nucleotide matching the sequence is added. Homopolymer stretches give a larger signal, in proportion to the number of nucleotides incorporated, although the increase in signal cannot be relied upon to be a linear representation of homopolymer length; in general, stretches of more than eight identical nucleotides cannot be reliably quantified. Each cycle takes a few minutes per base and read lengths of around 60 bases are achievable, so in this form pyrosequencing has generally been regarded as a mini-sequencing technique. 454-Technology (Roche) The 454 system was the first second-generation MPS device on the market. It uses pyrosequencing as its basis, although the read lengths are longer than in classic pyrosequencing, and the high degree of parallelization of the process gave a huge increase in sequence data output at the time. The sequencing process (Figure 30.21) is made up of three steps: 1. Creation of a library of DNA molecules in the range 300–800 bp.
Starting material longer than the acceptable range, such as genomic DNA, is sheared by nebulization and purified, and the overhanging ends are repaired and then phosphorylated (Figure 30.21a, 1). Afterwards, two adapters, A and B, are ligated to the target molecules (2). The adapters are each 44 bp long; they are not phosphorylated but carry the target sequences for amplification and sequencing. A 4 bp long key sequence identifies the library sequences to the system and facilitates the calibration of base recognition and calling. This step is omitted when sequencing PCR products, as the necessary adapters should have been added during the PCR amplification itself. The B adapter carries an additional 5´-biotin modification, which enables a purification step using streptavidin-coated paramagnetic beads (3): library fragments without biotin are simply washed away. The gaps between adapter and fragment are filled in a reaction catalyzed by a strand-displacing DNA polymerase. During the purification procedure, alkaline denaturation releases the single-sidedly modified fragments, and occasionally the unmodified complementary strand, for the further steps, while doubly labeled fragments spontaneously reanneal under the chosen conditions. 2. The DNA to be sequenced must undergo clonal amplification to give a clear signal above the detection system's lower limit (4, Figure 30.21a). A bead dilution is chosen so that ideally only one DNA molecule binds to one capture bead. The beads carry matching capture primers, allowing them to bind a DNA fragment. The beads are then mixed into a water-in-oil emulsion, in which an emulsion PCR (emPCR) takes place and the sequences are amplified (5). A further biotinylated primer is used to allow downstream purification of the clonally amplified sequences. After the PCR the emulsion is broken, freeing the DNA-carrying beads, byproducts, and empty beads.
Only the beads with amplified sequences are bound by streptavidin beads and magnetically filtered out of the mixture. Breaking the emulsion and the subsequent alkaline denaturation yield a bead carrying a bound, single-stranded DNA molecule that can be sequenced. 3. For the DNA sequencing (Figure 30.21b), the sequencing primer is hybridized to the bound template in a reaction mixture containing sequencing primers, DNA polymerase, and the required cofactors. The mixture is transferred to a pico-titer plate with the aim of having just one bead occupy each well of the plate. Each well has a diameter of 44 μm and has a single

Figure 30.20 Principle of pyrosequencing: an immobilized DNA probe is incubated in cycles with a reaction mixture containing only one nucleotide (A, C, G, or T). If the correct nucleotide is present in the current mixture it is incorporated into the growing nucleotide chain and pyrophosphate (PPi) is released, which is converted into ATP by ATP sulfurylase. The ATP is in turn used by luciferase to convert luciferin into oxyluciferin along with the emission of light. This light emission is used to measure the incorporation of the base or bases in this cycle. Source: Ronaghi, M. et al. (1996) Anal. Biochem., 242, 84–89. With permission, Copyright © 1996 Academic Press. All rights reserved.


Part IV: Nucleic Acid Analytics

Figure 30.21 454-System Workflow: (a) Library preparation: 1, fragmenting the nucleic acid to a length between 300 and 800 bp; 2, ligation of A and B adaptors; 3, binding of template to a capture bead; 4, bead after clonal amplification; 5, bead after the emulsion is broken and clean-up. (b) Bead bound pyrosequencing. Source: adapted after Roche Diagnostics Corporation.

Phred: The calculation of Phred quality scores dates back to the automated gel sequencing methods of the 1990s. The value is based on the error probability P for each base in a read and is calculated using the following formula: Q = −10 · log10(P)

A Phred score of 40 represents an error probability of 1 in 10 000. Phred scores are presented as a string of characters in which the ASCII value of each character minus a fixed offset, 33 in most cases, gives the Phred score.
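The encoding described in this box can be sketched in a few lines of Python (an illustrative sketch; the function names are hypothetical):

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]

def error_probability(q):
    """Invert Q = -10 * log10(P): the error probability is P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# 'I' has ASCII value 73; with offset 33 it encodes Q = 40,
# that is, an error probability of 1 in 10 000.
scores = phred_scores("II5!")   # [40, 40, 20, 0]
```

The same offset convention is used in FASTQ files, which is why quality strings and base strings always have the same length.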

engraved glass fiber within the plate housing assigned to it. The fiber in each well transfers its signal to a point on a CCD sensor, allowing the pattern of sequencing to be detected. The dimensions of the well prevent more than one capture bead from occupying it. Added enzyme beads carry the two enzymes luciferase and sulfurylase, which generate emitted light in the presence of PPi. The process follows the classic pyrosequencing described above; however, 400 000 reaction wells are simultaneously active and the achieved read length is significantly longer. During the run the detector images are analyzed and the signal intensities calculated from the pixel data. These data are reduced to intensity values assigned to the particular positions on the picotiter plate. The series of single images is finally used to calculate the quality values for each base (Phred-encoded probability of error) and the read sequence is presented as a flowgram. Yields of the 454 high-throughput variant of this system are around 700 Mb with an average read length of 700 bases in around 23 h. The 454 system was a major breakthrough in the world of sequencing technology.
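The relationship between template sequence and flow signals can be illustrated with a small simulation (a toy sketch: for simplicity it treats the flowed nucleotides as matching the read sequence directly and ignores the enzymology; the flow order and function name are made up):

```python
def flowgram(template, flow_order="TACG", cycles=4):
    """Simulate a pyrosequencing flowgram: for each nucleotide flow, count
    how many consecutive template bases are read out at once. Homopolymer
    runs produce proportionally stronger light signals in a single flow."""
    signals = []
    pos = 0  # next template position to be read
    for _ in range(cycles):
        for nt in flow_order:
            n = 0
            while pos < len(template) and template[pos] == nt:
                n += 1
                pos += 1
            signals.append((nt, n))
    return signals

# A "TT" homopolymer lights up with double intensity in one T flow:
flows = flowgram("TTAG", cycles=1)   # [('T', 2), ('A', 1), ('C', 0), ('G', 1)]
```

This also makes the main weakness of intensity-based flow chemistries visible: the length of long homopolymers must be inferred from a single, proportionally scaled signal.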

30 DNA Sequencing


Illumina Sequencing by Synthesis (HiSeq, MiSeq, NextSeq 500, HiSeq X Ten)
The Illumina technology is based on the incorporation of reversible terminator dNTPs carrying fluorescent markers; the sequencing result is derived from the optical detection of the incorporated fluorescent markers through successive sequencing cycles. As with other sequencing methods a library preparation step is required here as well. Ultrasound is used to produce a set of DNA fragments of the targeted length of 200–500 bp. A mixture of T4 DNA polymerase and Klenow fragment removes the 3′ overhangs and fills in the 5′ overhangs. T4 polynucleotide kinase phosphorylates the blunt ends of the fragments, and incubation with an exonuclease-deficient Klenow mutant and dATP leads to a single-base adenylation of the ends. Finally, the two differing end adapters, possessing the necessary 5′ T overhang, can be ligated (1 in Figure 30.22). The P5 adapter possesses a region complementary to the sequencing primer and the P7 adapter attaches a complementary sequence to allow binding of the fragment to the flow-cell where sequencing will actually take place. Size selection and separation from unligated adapter sequences are performed by gel electrophoresis followed by gel elution, or by magnetic bead purification, and PCR can be used to further enrich the sample being sequenced. For the DNA to be actually sequenced it must be bound to the sequencing flow-cell and clonally amplified to allow detection. This is done through a process known as clustering (Figure 30.22).

Figure 30.22 Illumina sequencer workflow: 1, fragmentation of the nucleic acid and adapter ligation; 2, loading of the template library on to the flowcell; 3, strand initialization; 4, template preparation by denaturation; 5, clonal amplification; 6, incorporation of fluorescently labeled reversible terminators; 7, detection of the incorporated bases using the scanned images.
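Steps 6 and 7 in the caption above, calling one base per cycle from the four channel intensities of a cluster, can be sketched as picking the brightest channel (an illustrative toy; real base callers also correct for cross-talk and phasing, and the purity value here is only loosely modeled on Illumina's chastity measure):

```python
def call_bases(cycle_intensities):
    """Call one base per cycle by taking the brightest of the four channels;
    purity = brightest / (brightest + second brightest) gives a crude
    confidence measure for each call."""
    calls = []
    for channels in cycle_intensities:  # one dict {base: intensity} per cycle
        ranked = sorted(channels.items(), key=lambda kv: kv[1], reverse=True)
        (base, first), (_, second) = ranked[0], ranked[1]
        purity = first / (first + second) if first + second else 0.0
        calls.append((base, purity))
    return calls

# Invented intensities for two cycles of one cluster:
cycles = [{"A": 900, "C": 40, "G": 30, "T": 20},
          {"A": 60, "C": 700, "G": 50, "T": 45}]
```

A low purity value (two channels almost equally bright) is exactly the situation produced by over-clustered flow-cells, where neighboring clusters bleed into each other.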

810


Clustering occurs on the transparent flow-cell where the sequencing takes place. The flow-cell can have up to eight channels (HiSeq), allowing eight separate DNA libraries to be run without multiplexing with index barcodes. The surface of the flow-cell is coated with adapters complementary to the P5 and P7 adapters added in the library preparation above. The library material is denatured and added to the flow-cell as single-stranded DNA. After hybridizing with surface oligonucleotides the template strand is copied (3). The product is denatured again; the newly synthesized strand stays bound to the surface of the flow-cell while the other material is washed out (4). In the next step the template can bend over and hybridize to a nearby oligonucleotide on the surface, allowing a further strand synthesis to occur. In this way the solid-phase cluster PCR is initialized. The resulting bridge PCR brings about the required clonal amplification of the sequence template (5). On most Illumina sequencing systems cluster formation is randomly distributed over the flow-cell, and the cluster density (over-clustered and therefore unreadable, or under-clustered and producing a poor yield) is highly dependent on the concentration of the DNA library added. The size selection of the fragments is also essential to ensure that the PCR products stay within the formed clusters, otherwise the sequence data cannot be unambiguously read. A general rule of thumb is a cluster density of 610–678 K clusters per mm2 for the HiSeq 2000 platform, although other sequencers of the Illumina family require different densities. For their later sequencing platforms HiSeq X Ten and HiSeq 4000 Illumina has produced patterned flow-cells that allow higher loading by restricting cluster formation to fixed well positions on the flow-cell. Once prepared, the flow-cell is loaded onto the sequencer.
In the first sequencing step all DNA molecules with extension-capable free 3′ ends are blocked to stop unspecific primer extension. The sequencing process is then initialized by adding a mix of sequencing primers, polymerases, and four differently fluorescently labeled, reversibly terminated dNTPs. Owing to the extension-blocked nucleotide, each cluster incorporates just one base complementary to the template strand per cycle (6, Figure 30.22). At the end of each sequencing cycle the flow-cell is scanned, producing four images, one for each of the fluorophore laser excitation wavelengths and thereby one for each base. Afterwards the fluorophores are cleaved off and the ends of the sequence strands are reactivated for further polymerization. The sequence is called from the positional information of each cluster and its series of base-specific fluorescence signals (7). The maximum output of this system is governed by the flow-cell surface available for clustering and by the read length, which is itself dependent on the speed at which the fluorescent sequencing cycles can be performed. Illumina offers various models with higher throughput rates or longer read lengths. Currently, the highest throughput is achieved on a HiSeq 4000, where 2 × 150 base reads give around 1.5 terabases of data in 3 days. Longer reads but much less data are produced by the MiSeq, with read lengths of up to 2 × 300 bases.

Adaptations to Library Preparation and Sequencing
The amount and properties of the sequence reads produced, such as read length and read number, are governed by the capabilities of the sequencing instrument and the chemistry it uses. This often means that the read sequence is too short for certain tasks, such as spanning a repeat region, or that there are not enough reads to cover a whole genome. Certain adaptations to the library preparation and sequencing techniques have been developed to deal with this.
The examples given below are taken from the Illumina platform but can be applied to other platforms as well.

Paired-End Sequencing
When the template is sequenced from both ends (Figure 30.23), the newly synthesized first-read strand is removed after the first end is read. The template can then re-form a cluster via bridge PCR and the reverse strand is sequenced in the same way as the first read. The advantages of paired-end sequencing are (i) the extra coverage of a second read and (ii) the positional information provided by having two anchor reads a known distance apart. Paired-end reads are often used in genomic sequencing, allowing structural variations in the sequenced genome to be identified, and they assist genomic assembly considerably (Figure 30.24).

Figure 30.23 Paired-end sequencing: adapters A1 and A2 allow bridge amplification from either end of the strand, and sequencing primers SP1 and SP2 allow sequencing to be started from each end sequentially.
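The positional information from paired ends can, for example, be used to flag pairs whose mapped distance deviates strongly from the expected insert size, which hints at structural variants (a minimal sketch with made-up coordinates and thresholds):

```python
def discordant_pairs(pairs, mean_insert=500, tolerance=150):
    """Flag read pairs whose mapped distance deviates from the expected
    insert size by more than the tolerance (e.g. three standard deviations):
    such discordant pairs can indicate insertions, deletions,
    or rearrangements between the two anchor reads."""
    flagged = []
    for name, pos1, pos2 in pairs:
        distance = abs(pos2 - pos1)
        if abs(distance - mean_insert) > tolerance:
            flagged.append(name)
    return flagged

# Invented alignments: read2's mates map 1900 bp apart, far beyond the
# expected ~500 bp insert, suggesting a deletion in the sample genome.
pairs = [("read1", 1000, 1498), ("read2", 5000, 6900)]
```

Real structural-variant callers additionally consider read orientation and split alignments, but the insert-size test above is the core signal.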

Mate-Pair Library Creation
This refers to the generation of a specific type of library to obtain sequence anchors spanning a wider distance than can be handled with normal paired-end reads. The libraries are formed from fragments with a large inner distance between the ends; these are circularized, and a shorter fragment spanning the junction is used to form paired-end reads. The goal is to obtain positional information from widely separated anchor reads, which is often



Figure 30.24 Alignment of paired-end sequences allows both ends of a sequenced fragment to be aligned. When one read of a pair falls into a repeated region, the unique alignment of its mate outside the repeat anchors the pair, allowing repetitive regions to be unambiguously aligned and making assembly of repeated sequences easier. Source: adapted after Illumina, Inc.

used in scaffolding of de novo assemblies (Figure 30.25). For this type of library the first step is to fragment the DNA to 5 kb or greater instead of 200–500 bp. This gives a large spanning distance between the reads that can be used to bridge repeated regions in the genome. As fragments of this length cannot be used for clustering on the flow-cell, the ends are biotinylated and the fragments circularized. The circularized DNA is fragmented once more (400–600 bp) and the biotinylated region captured. The captured fragments are then sequenced in the normal paired-end fashion.

Indexed Libraries
As the number of sequence reads per sequencing run increases, the question arises: how deeply does one have to sequence a particular sample? If a desired sequence can be covered with less depth than one lane or run of the sequencer delivers, several samples can be run together, with sequence barcodes included during library preparation (Figure 30.26). In the case of Illumina sequencing a library is created as described above. After adapters that already carry sequencing read primer 1 are ligated, the fragments are PCR amplified; here the index barcode is introduced into the library fragments. One of the PCR primers carries the index sequence and the sequencing primer for the second read, a second primer contains the P5 structure for attachment to the flow-cell surface, and a third primer carries the P7 attachment section and the index sequence. The first set of index barcodes consisted of 12 modified P7 adapter sequences, so that several different samples could be run in the same flow-cell lane. The index is read separately in its own index read and can be up to eight bases long; dual index barcodes totaling 16 bases are now possible. Other methods integrate the index directly after the standard sequencing primers.
However, this method has the disadvantage that sample sequence length is lost, as the index is read at the start of the first sequencing read.

Target (Exome) Enrichment
When it is uneconomical to sequence whole genomes of individuals, or if only a particular genomic region needs to be sequenced, target enrichment can be used to extract only the desired regions of the genome for sequencing. The most frequent example of this is exome enrichment or exome capture. This method restricts the sequencing to the coding regions of the genome and allows comparison of the exome sequences from a range of individuals. Illumina technology is again used as the example here (Figure 30.27). After denaturation, the fragments of the DNA library are hybridized to a capture library of biotinylated probes, bound to magnetic streptavidin-coated beads, and eluted. The enrichment process is repeated and the eluted capture product is amplified before sequencing. The produced inserts cover up to 460 bases around the center of the probe and therefore can also include exon-flanking regions. The methods of other companies are based on similar principles but often use other probe sequences to capture a different subset of the genome.

Semiconductor Sequencing (Ion Torrent)
This method is also based on DNA sequencing by synthesis, and the material is handled similarly to the processes above (creation of the library and clonal amplification). The technology here, however, is based on a different form of

Figure 30.25 Mate-pair sequencing uses biotinylation, circularization, a second fragmentation, and then ligation of paired end sequencing adaptors to the ends of the original molecule. This allows the sequencing of the ends of a fragment, which can be in the range of 5 kb long, giving spatial information about the genome. Source: adapted after Illumina, Inc.



Figure 30.26 Indexed libraries: (a) creation of the library. 1, Rd1-SP-adapter ligation; 2, PCR to add the P5 and index-SP and Rd2-SP and P7 sequences with index barcode sequence; 3, structure of the sequencing template. (b) During sequencing three independent read cycles are carried out: 1, read 1; 2, index read after the removal of the first strand; 3, read 2 of the library molecule. The index read is a separate read and in this way does not reduce the length of either read 1 or 2.
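Conceptually, assigning reads back to samples from the index read amounts to comparing each observed index against the known barcodes, typically tolerating a mismatch (a simplified sketch; production demultiplexers additionally handle dual indices and base qualities):

```python
def demultiplex(reads, samples, max_mismatches=1):
    """Assign reads to samples via their index read, tolerating a limited
    Hamming distance against each known barcode. Ambiguous or unmatched
    indices go to an 'undetermined' bin."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    assigned = {name: [] for name in samples}
    assigned["undetermined"] = []
    for seq, index in reads:
        hits = [name for name, barcode in samples.items()
                if hamming(index, barcode) <= max_mismatches]
        # an index must match exactly one barcode to be assigned safely
        target = hits[0] if len(hits) == 1 else "undetermined"
        assigned[target].append(seq)
    return assigned

# Invented barcodes and reads (read sequence, observed index):
samples = {"s1": "ACGTACGT", "s2": "TGCATGCA"}
reads = [("AAAA", "ACGTACGA"),   # one mismatch away from s1
         ("CCCC", "GGGGGGGG")]   # matches no barcode
```

Allowing one mismatch requires the barcode set to be designed with sufficient pairwise distance, which is why commercial index sets are error-correcting by construction.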

detection. With every DNA-polymerase-catalyzed extension of a nucleotide chain a hydrogen ion (proton) is released, which changes the pH of the reaction volume. In other systems these pH changes are buffered away by the reaction mixture; in this method they are used to detect the incorporation of a base. The design of the system transfers a great deal of the sequencing device onto a disposable, silicon-based sequencing chip. The sequencing reaction takes place on this microfluidic chip, in which numerous cavities are loaded with DNA-bound beads. The nucleotide bases are flooded into the reaction chambers in cycles, one base at a time, so that a change of pH measured in the presence of a particular nucleotide indicates which base was incorporated. The charge changes on the sensor surface at the base of the chamber are measured as a voltage change and are used to produce a flowgram of the sequence; the magnitude of the voltage change indicates how many bases were incorporated (Figure 30.28).

Applications of Sequencing Technologies
NGS has proven exceptionally useful in scientific research. Although the sequence reads are short, their sheer number together provides a large amount of data that can be put to good use.

Figure 30.27 Exome enrichment for Illumina systems. (a) Denaturing of the dsDNA library. (b) Hybridization of biotinylated capture probes to the target regions. (c) Enrichment of the target regions using streptavidin beads. (d) Elution of the target fragments from the capture beads. Source: adapted after Illumina, Inc.

Genomic Sequencing and Resequencing
Sequencing the first human genome was an exceptionally expensive and lengthy project, with an estimated cost of $3 billion. NGS technologies now allow resequencing of a human genome for under $1000 and in under a week. With the availability of cheap mass sequencing, the genomes of organisms that would not normally be sequenced can be constructed, although several issues exist with short-read data. If no reference genome is available, one has to be assembled de novo. This is a challenging task, especially for short-read technologies where the read length is often shorter than the repeated sequences that are widely spread through eukaryotic genomes. Several programs using a range of techniques, such as de Bruijn graphs or string graphs, have been published to deal with the task of constructing larger contiguous sequences, contigs, from short reads. One way to extend the reach of short reads over the length of short repeats is to use paired-end and mate-pair sequencing strategies.
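The core idea of de Bruijn graph assembly can be shown in miniature: reads are decomposed into overlapping k-mers, each k-mer defines an edge between two (k−1)-mer nodes, and unambiguous paths through the graph become contigs (a toy sketch that assumes error-free reads and a branch-free graph; real assemblers must resolve branches, sequencing errors, and repeats):

```python
from collections import defaultdict

def de_bruijn_contig(reads, k=4):
    """Build a de Bruijn graph of (k-1)-mer nodes from overlapping k-mers
    and walk the single unambiguous path to reconstruct one contig."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            if right not in graph[left]:
                graph[left].append(right)
                indegree[right] += 1
    # start at a node with no incoming edge (the contig's left end)
    start = next(node for node in graph if indegree[node] == 0)
    contig, node = start, start
    while graph.get(node):
        node = graph[node][0]   # assumes no branching in this toy example
        contig += node[-1]
    return contig

# Three overlapping error-free reads covering one 10 bp region:
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
```

In a real genome, repeats longer than k collapse into shared nodes and break such walks into fragments, which is exactly why mate-pair spanning information is needed for scaffolding.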



Despite the best efforts of biologists and computer scientists, the repetitive nature of eukaryotic genomes has led to several new genome assemblies that unfortunately consist of fragmented contigs and would benefit from the promise of the third generation of sequencing technologies presented below. In cases such as humans, where a good reference genome exists, it is possible to produce a consensus by aligning reads to the reference. This is widely used in human genome resequencing, where variation from the reference is being studied. Single-base changes and small deletions can be detected by modern alignment programs, whereas larger structural variants (inversions, deletions, or chromosomal rearrangements) may be detected using paired-end or mate-pair information. In some cases it is not necessary to sequence the whole genome of an individual but only a reduced portion, such as the coding sequences of the genes, the exome. This approach was heavily exploited in disease screening in humans while it was still impractical to sequence the whole genome. With the advent of higher throughput, many large projects have moved to whole-genome sequencing, gaining more structural-variant and non-coding-variant information. However, exome sequencing still provides an economical way to sequence the gene-coding information from many individuals.

Tag Counting – RNASeq and ChIPSeq
The production of millions if not billions of small sequence reads lends itself very well to various semi-quantitative and qualitative techniques that use different ways of isolating nucleic acids to investigate a particular biological question. The field has expanded widely, encompassing 70 techniques (Illumina has published a summary of available methods, ForAllYouSeqMethods.pdf, available online). Two of the earliest and most widely used techniques, RNASeq and ChIPSeq, are outlined below. RNASeq (Mortazavi, A. et al. (2008)
Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods, 5, 621–628) allows the investigation of gene expression by measuring RNA levels in a cell or tissue. Total RNA is isolated from the tissue and enriched for coding sequences either by capturing poly(A)-tail-containing molecules or by depleting ribosomal RNA with sequence-specific capture probes. The RNA is then converted into single-stranded cDNA using random hexamer priming, after which the second strand of the cDNA is synthesized, creating a library of cDNA molecules that can be further processed in library preparation. Once the library is prepared it is sequenced as normal on the sequencer. The short read tags are then aligned to a genome and the alignment positions mapped back to the relevant gene annotation. The number of aligned counts per gene, transcript, or exon is taken as a measure of gene expression, with higher tag counts indicating higher gene expression. Paired-end sequencing, or an aligner capable of splitting short-read alignments so that they span exon junctions, is used when investigating alternative transcript usage. ChIPSeq (chromatin immunoprecipitation sequencing) (Robertson, G. et al. (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods, 4, 651–657) is used to investigate the binding sites of DNA-binding proteins such as transcription factors. DNA and bound proteins (chromatin) are reversibly crosslinked using formaldehyde. The chromatin is sheared and enriched by immunoprecipitation using an antibody specific for the protein in question. After enrichment the crosslinking is reversed and the DNA is processed into a sequencing library and sequenced.
The short reads are aligned against a genome; the number of tags aligning above background and the shape of the peak of aligned reads are used to ascertain the binding position of the protein under investigation. In the case of transcription factors the question is often: which gene lies downstream of the binding site?
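At their simplest, both of these tag-counting techniques reduce to counting aligned reads per genomic feature (a naive sketch with hypothetical gene intervals; production tools use indexed interval structures and handle strandedness, overlapping features, and normalization):

```python
def count_tags(alignments, genes):
    """Count aligned read start positions falling within each gene interval.
    Higher counts suggest higher expression (RNASeq) or stronger binding
    enrichment (ChIPSeq). Intervals are half-open: [start, end)."""
    counts = {name: 0 for name in genes}
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos < end:
                counts[name] += 1
    return counts

# Invented annotation and alignment positions (chromosome, start coordinate):
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 800)}
alignments = [("chr1", 150), ("chr1", 160), ("chr1", 600), ("chr2", 150)]
```

Raw counts must still be normalized for sequencing depth and feature length (e.g., counts per million or per kilobase) before samples can be compared.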

30.2.2 Single Molecule Sequencing
The established sequencing methods described above require an amplification step, either by classic cloning in the older processes or by clonal amplification in the newer massively parallel sequencing techniques. The availability of new, more sensitive detectors and of methods based on other

Figure 30.28 Ion torrent: (a) Single reaction chamber of a sequencing chip with a single bead bound DNA template, the sensor, and electronics. With the addition of dNTPs, protons (H+) are set free that change the pH of the chamber. These changes are measured by the chamber’s sensor. (b) Variation of bases with time. The peak heights of the figure represent the number of bases detected. The bases incorporated are identified by the step in the synthesis and the dNTP added. Source: adapted after Ion Torrent Systems, Inc. (now Thermo Fisher Scientific GENEART GmbH).
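The decoding indicated in the figure, from signal height to homopolymer length, can be sketched as rounding each flow's measured signal to the nearest integer (an idealized toy; the flow values below are invented, and real signal processing must additionally correct for drift and noise):

```python
def decode_flows(flow_signals):
    """Decode an Ion Torrent style flow series: each (base, signal) pair
    represents a voltage change roughly proportional to the number of
    identical bases incorporated, so rounding the signal recovers the
    homopolymer length for that flow."""
    sequence = []
    for base, signal in flow_signals:
        n = round(signal)        # 0 flows contribute nothing to the read
        sequence.append(base * n)
    return "".join(sequence)

# Signal near 2 for a double C incorporation, near 0 when nothing happens:
flows = [("T", 1.05), ("A", 0.02), ("C", 2.10), ("G", 0.97)]
```

As with pyrosequencing, the proportional signal makes long homopolymers the dominant error source, since the difference between, say, 7 and 8 incorporations becomes hard to resolve.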



Figure 30.29 PacBio RS single molecule real time DNA sequencing. (a) Single ZMW with a polymerase bound to the lower surface of the chamber. The illumination from underneath only affects the lower region of the chamber. (b) Diagram of the sequencing process (incorporation, fluorescence emission, cleavage of the fluorophore). Source: Eid, J. et al. (2009) Science, 323, 133–138. With permission, Copyright © 2009, American Association for the Advancement of Science.

principles brings the possibility of measuring single molecules within reach. This has given rise to the concept of third-generation sequencing techniques focused on single molecules.

Single Molecule Real Time (SMRT) DNA Sequencing (Pacific Biosciences)
In this technique a single DNA molecule is sequenced in a very small chamber called a zero-mode waveguide (ZMW) in a sequencing cell. There are thousands of ZMWs in the chip, so thousands of molecules are sequenced in parallel and in real time. The special feature of the ZMW is that it is so narrow that the excitation light cannot propagate through it; only a tiny volume at its base is illuminated, so fluorescence is detected essentially only from labeled nucleotides held there during incorporation. 1. The starting material is fragmented and end-repaired. A hairpin adapter is ligated to each end, resulting in a closed circular molecule; the hairpin structure contains a sequence complementary to the sequencing primer. The final products are size selected and cleaned. 2. The sequencing primer is hybridized onto the hairpin structure, after which Phi29 polymerase is added and an initiation complex is formed that can be loaded onto the chip and sequenced. 3. Once immobilized in the ZMW, the sequence of the DNA template is read out through the fluorescence of incorporated nucleotides. Each base carries a different fluorescent label and, while being incorporated, is held in the observation volume long enough to give a clearly measurable burst of light. This incorporation signal is much stronger than that of unincorporated labeled nucleotides, which diffuse in and out of the volume at the bottom of the ZMW quickly. The fluorophore is carried on the terminal phosphate and is cleaved off with the phosphate during incorporation, leaving the growing strand unlabeled. The result is a real-time film of bases being incorporated, light being emitted, the next base being incorporated, and the next burst of light being emitted (Figure 30.29).
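Because the circular template lets the polymerase pass the same insert repeatedly, random single-pass errors can be averaged out into a consensus. A majority-vote sketch (a toy example; real circular-consensus calling must first align the passes against each other, since raw single-molecule errors include insertions and deletions, not only substitutions):

```python
from collections import Counter

def consensus(passes):
    """Majority vote per position over repeated, already-aligned sequencing
    passes of the same insert: the most frequent base in each column wins."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*passes))

# Three invented passes of one 8 bp insert, each with one random error:
passes = ["ACGTACGT",
          "ACGAACGT",   # substitution at position 4
          "ACGTACCT"]   # substitution at position 7
```

With enough passes the consensus accuracy rises far above the raw single-pass accuracy, which is the basis of high-fidelity circular consensus reads.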

Figure 30.30 Native hemolysin pore incorporated in a membrane. Source: Zwolak, M. and Di Ventra, M. (2008) Rev. Mod. Phys., 80, 141–165. With permission, Copyright © 2008, American Physical Society.

Nanopore Sequencing
A nanopore is a very small pore through which ions flow when a voltage is applied across it in a conducting fluid. The resulting ionic current is characteristic of the pore shape and changes in a characteristic fashion when DNA is threaded through the pore. By threading DNA through a pore, the sequence of nucleotides can be called from the characteristic current changes caused by the different nucleotides as they pass through (Figure 30.30). This technique promises several potential advantages over previous techniques: it can measure a DNA molecule of any length without being restricted to a set number of run cycles, and it measures a single molecule without the need for amplification. The difficulty of the technique lies in the small changes in current, which are hard to measure accurately enough to call the base sequence reliably. Despite these practical difficulties, several companies have attempted to bring a nanopore sequencer to market; Oxford Nanopore Technologies is the nearest to commercially launching a product at the time of writing. Using protein nanopores placed in an artificial, non-conducting chip surface, the DNA is threaded through the pore by a processive motor protein attached to one end of the DNA during library preparation, while the other end receives a hairpin sequence. As the DNA passes the narrowest point of the pore, the sequence causes a change in the current flowing, which is measured by the chip's detectors under each pore. When one strand of the DNA has been sequenced, the hairpin passes through the pore and the second strand is sequenced as well, allowing the information from both strands to be used in base calling.


Further Reading
Ansorge, W., Voss, H., and Zimmermann, J. (1996) DNA Sequencing Strategies, John Wiley & Sons, Inc., New York.
Bentley, D.R. et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.
Blazej, R.G. et al. (2007) Inline injection microdevice for attomole-scale Sanger DNA sequencing. Anal. Chem., 79, 4499–4506.
Branton, D. et al. (2008) Nanopore sequencing. Nat. Biotechnol., 26, 1146–1153.
Brenner, S. et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol., 18, 630–634.
Clarke, J. et al. (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol., 4, 265–270.
Craig, D.W., Pearson, J.V., Szelinger, S. et al. (2008) Identification of genetic variants using bar-coded multiplex sequencing. Nat. Methods, 5, 887–893.
Deamer, D. (2010) Nanopore analysis of nucleic acids bound to exonucleases and polymerases. Annu. Rev. Biophys., 39, 79–90.
Drmanac, R. et al. (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science, 327, 78–81.
Fuller, C.W. et al. (2009) The challenges of sequencing by synthesis. Nat. Biotechnol., 27, 1013–1023.
Hodges, E. et al. (2007) Genome-wide in situ exon capture for selective resequencing. Nat. Genet., 39, 1522–1527.
Horner, D.S., Pavesi, G., Castrignano, T. et al. (2009) Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Briefings Bioinformatics, 11, 181–197.
Korlach, J. et al. (2008) Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc. Natl. Acad. Sci. U.S.A., 105, 1176–1181.
Levene, M.J. et al. (2003) Zero-mode waveguides for single-molecule analysis at high concentrations. Science, 299, 682–686.
Mardis, E.R. (2008) Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet., 9, 387–402.
Margulies, M. et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437, 376–380.
Maxam, A.M. and Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A., 74, 560–564.
Niedringhaus, T.P. et al. (2011) Landscape of next-generation sequencing technologies. Anal. Chem., 83, 4327–4341.
Pop, M. and Salzberg, S.L. (2008) Bioinformatics challenges of new sequencing technology. Trends Genet., 24, 142–149.
Ramanathan, A., Huff, E.J., Lamers, C.C. et al. (2004) An integrative approach for the optical sequencing of single DNA molecules. Anal. Biochem., 330, 227–241.
Richardson, P. (2010) Special issue: next generation DNA sequencing. Genes, 385–387.
Rothberg, J.M. and Leamon, J.H. (2008) The development and impact of 454 sequencing. Nat. Biotechnol., 26, 1117–1124.
Rothberg, J.M. et al. (2011) An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475, 348–352.
Sanger, F., Nicklen, S., and Coulson, A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A., 74, 5463–5467.
Schadt, E.E. et al. (2010) A window into third-generation sequencing. Hum. Mol. Genet., 19, R227–R240.
Schuster, S.C. et al. (2008) Method of the year: next-generation DNA sequencing. Functional genomics and medical applications. Nat. Methods, 5, 11–21.
Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing. Nat. Biotechnol., 26, 1135–1145.
Stoddart, D. et al. (2009) Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc. Natl. Acad. Sci. U.S.A., 106, 7702–7707.
Stoddart, D. et al. (2010) Nucleobase recognition in ssDNA at the central constriction of the alpha-hemolysin pore. Nano Lett., 10, 3633–3637.
Sultan, M. et al. (2008) A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321, 956–960.
Tabor, S. and Richardson, C.C. (1990) DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. J. Biol. Chem., 265, 8322–8328.
Trepagnier, E.H. et al. (2007) Controlling DNA capture and propagation through artificial nanopores. Nano Lett., 7, 2824–2830.
Venter, J.C., Adams, M.D., Myers, E.W. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
Wang, H. and Branton, D. (2001) Nanopores with a spark for single-molecule detection. Nat. Biotechnol., 19, 622–623.


31 Analysis of Epigenetic Modifications
Reinhard Dammann

Justus Liebig-University of Gießen, Institute of Genetics, Department of Biology and Chemistry, Heinrich-Buff-Ring 58-62, 35392 Gießen, Germany

With the complete sequencing of the human genome, the number of genes involved in the complex interactions of cellular development can now be assessed. Basically, each genome consists of four bases: adenine, thymine, cytosine, and guanine, which, however, can be modified covalently. Since these modifications are maintained through DNA replication, they considerably extend the coded information of the genome. The most important DNA modifications are the methylation of cytosine at the C5-position to 5-methylcytosine (5mC) and the methylation of adenine at the N6-position to N6-methyladenine (N6mA) (Figure 31.1). Methylation of adenine is mainly found in prokaryotes (Dam methylation, GN6mATC) and protects the cell's own DNA against sequence-specific restriction enzymes. Cytosine methylation is frequently found in bacteria (Dcm methylation, C5mCWGG) and is also detected as a modification in plants, invertebrates, and vertebrates (i.e., CpG methylation, 5mCG). In mammals cytosines are mainly methylated when the base is followed by a guanine, designated as the dinucleotide 5mCpG. This methylation is achieved in vivo by DNA methyltransferases (DNMTs). Interestingly, additional 5mCpA methylation and 5hmC have been found in human stem cells. In the brain, too, 5-hydroxymethylcytosine (5hmC) has been detected; it results from the oxidation of 5mC by TET enzymes. TET further oxidizes 5hmC to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), which can be converted into C by base excision repair. The function of 5hmC is not yet completely understood; it may represent either an intermediate of the DNA demethylation pathway or a specific epigenetic mark. In human somatic cells the 5mC base makes up only about 1% of all DNA bases, but 70–80% of the CpG dinucleotides are methylated. In the human genome the dinucleotide CpG is underrepresented, but it is often found in GC-rich sequences, so-called CpG islands.
Nearly 60% of all human genes harbor a CpG island in their promoter region. Normally, these CpG-island promoters are unmethylated. Methylation of CpGs in the promoter region has an important influence on the regulation of gene expression and leads to epigenetic inactivation of the affected gene. Epigenetic control plays an important role in the inheritance of gene activity, since DNA methylation modifies the information of the genome without changing the primary sequence of the DNA. Cytosine methylation influences gene activity directly through the binding of regulatory proteins (e.g., methyl binding domain proteins) to the methylated sequence, and indirectly through inactivation of the chromatin structure and altered histone modifications. Thus, the fifth base 5mC acts as a reversible epigenetic switch and is essentially involved in the inheritance of gene activity. Since hypermethylation of regulatory sequences results in inactivation of gene expression, these epigenetic changes are considered an important mechanism in the inactivation of tumor suppressor genes. DNA methylation plays a fundamental role not only in carcinogenesis but also in cellular development and aging. Furthermore, DNA methylation determines the allele specific expression of paternally and maternally inherited genes, a mechanism termed imprinting. DNA methylation is also involved in dosage compensation by X-chromosome inactivation. Moreover, chromatin structure and modifications of nucleosomes are also important for epigenetic gene
Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Figure 31.1 5-Methylcytosine, 5-hydroxymethylcytosine, and N6-methyladenine.


Part IV: Nucleic Acid Analytics

regulation. While active genes display an open chromatin structure (euchromatin) with acetylated histones, silenced genes show a closed chromatin structure (heterochromatin) with deacetylated histones. Additionally, nucleosomes are altered by methylation, phosphorylation, and other modifications of the histones. This chapter describes several methods that can be used to analyze epigenetic modifications (DNA methylation and other chromatin changes).

31.1 Overview of the Methods to Detect DNA-Modifications

There are six main techniques for analyzing DNA-methylation:

Chemical modification of the unmethylated bases with:
1. bisulfite.

Protein specific analyses of DNA-sequences with:
2. methylation sensitive restriction enzymes;
3. 5mC-binding domain (MBD) proteins;
4. 5mC-, 5hmC-, 5fC-, or 5caC-specific antibodies.

Analysis of the base configuration of the complete DNA with:
5. DNA hydrolysis;
6. nearest neighbor-analysis (Table 31.1).

Real Time PCR, (quantitative) PCR, Section 29.2.5

The different reactivities of modified cytosines and cytosine toward bisulfite-induced deamination to uracil are used to determine the methylation status of the DNA. With the protein specific methods the differential activity of restriction enzymes or the binding of mC-binding proteins (MBDs or antibodies) is used to analyze the methylation status of the DNA. Certain methylation sensitive restriction nucleases do not cut their recognition site if the DNA is methylated, whereas others only cut methylated sites in the DNA or are insensitive to methylation. These enzymes allow analysis of the methylation status of the respective restriction sites. 5mC-binding domain proteins (MBDs) and mC-antibodies (against 5mC or 5hmC) are utilized to precipitate the modified DNA and to quantify its methylation status by real-time PCR. Moreover, antibodies that recognize 5fC or 5caC are available. A further possibility is to analyze the composition of the DNA bases. The genomic DNA is completely hydrolyzed and different modifications of a

Table 31.1 Overview of important methods for analysis of DNA-modifications.

Method: Bisulfite modification
Concept: Chemical resistance of 5mC and 5hmC to deamination to uracil by bisulfite (5fC, 5caC, and C are deaminated to U)
Identification: sequencing; restriction analysis (COBRA); methylation specific PCR (MSP); TET-assisted bisulfite sequencing (TAB-Seq)
Scope: All modified Cs in a DNA fragment can be analyzed, in the upper as well as in the lower strand

Method: Methylation sensitive restriction enzymes
Concept: Different accessibility of methylated DNA for restriction enzymes
Identification: Southern blotting; qPCR
Scope: Only DNA-modifications within a restriction site can be analyzed

Method: 5mC-binding domain (MBD) protein-specific analyses (e.g., MIRA)
Concept: Precipitation of DNA or chromatin with MBD proteins
Identification: pulldown and sequencing; qPCR; microarray; immunofluorescence
Scope: The methylation status of a specific DNA fragment or region can be analyzed

Method: Antibody specific for modified DNA (e.g., MeDIP)
Concept: Precipitation of DNA with 5mC-, 5hmC-, 5fC-, or 5caC-antibodies
Identification: immunoprecipitation and sequencing; qPCR; microarray; immunofluorescence
Scope: The methylation status of a specific DNA fragment or region can be analyzed

Method: DNA hydrolysis
Concept: Complete analysis of the different base modifications
Identification: HPLC; mass spectrometry
Scope: Different modifications of the genomic DNA can be analyzed

Method: Nearest neighbor-analysis
Concept: Analysis of the different DNA modifications in connection to the 3′-base
Identification: HPLC; mass spectrometry
Scope: The amount of distinct modifications can be analyzed

31 Analysis of Epigenetic Modifications

single base are analyzed. With the nearest neighbor-analysis dinucleotides are labeled and their composition is subsequently dissected. However, DNA hydrolysis and the nearest neighbor-analysis cannot reveal the exact sequence context of the modified bases. These methods are explained in more detail below.

31.2 Methylation Analysis with the Bisulfite Method

The easiest and most effective way to analyze DNA methylation is the bisulfite technique. This method has a very high resolution and makes it possible to analyze the methylation status of a whole DNA population at a specific sequence or to dissect the methylation pattern of single DNA fragments. The technique was developed in 1974, but it only became popular in 1992 after further development by Frommer and coworkers. Meanwhile it has come into broad usage because of its high resolution and reliability. The principle of the method is the reaction of bisulfite (HSO3−) with DNA, which converts cytosine (C, 5fC, and 5caC) into uracil. The C6 position of an accessible cytosine is sulfonated at a high bisulfite concentration (3.0 M) and acidic conditions (pH 5.0). In this process the amino group at the C4 position is hydrolyzed and uracil is generated (Figure 31.2). The particularity of this reaction is that methylated cytosines (5mC and 5hmC) are not converted and remain cytosines. Thus, unmethylated cytosines (deaminated to U) and methylcytosines (remaining C) can be distinguished (Figure 31.3). However, this method cannot distinguish between 5mC and 5hmC. In a PCR reaction the bisulfite-treated DNA is amplified with primers that are complementary to the deaminated DNA sequence. This PCR results in the substitution of uracil by thymine (Figure 31.2). Through the PCR amplification the bisulfite method is highly sensitive and only a small amount of genomic DNA (50 ng) is needed. It is even possible to analyze the DNA methylation of fewer than 100 cells. Before the bisulfite treatment it can be advantageous to embed the cells or the DNA in agarose, so that the loss of DNA is minimized. A problematic issue of the bisulfite treatment is incomplete conversion of Cs into Us.
Partial denaturation of the DNA during the bisulfite treatment may lead to incomplete deamination of unmethylated Cs, which can then be misinterpreted as methylated Cs. To overcome this problem, different modifications of the bisulfite method have been established. One possibility is to digest the DNA into short fragments with a restriction enzyme before denaturation. (However, no restriction site should lie in the investigated DNA region.) The deamination can also be improved by repeated denaturation in a thermocycler during the bisulfite treatment; however, too intensive a treatment degrades the DNA into small fragments. It is relatively simple to verify complete bisulfite conversion of the DNA by PCR amplification and sequencing: the presence of methylated C in a non-CpG context is most likely an artifact of an incomplete bisulfite reaction and should be verified with an alternative method. In practice the bisulfite technique has proved to be a very efficient and reliable procedure for analyzing DNA methylation.

Figure 31.2 Sodium bisulfite catalyzes the deamination of unmethylated cytosine to uracil.

31.2.1 Amplification and Sequencing of Bisulfite-Treated DNA

The bisulfite method allows the gene specific analysis of the methylation level of a cell population and of the methylation pattern of single DNA molecules, depending on whether PCR products are sequenced directly (e.g., by pyrosequencing) or first subcloned and then sequenced. Since the two strands of the DNA are no longer complementary after bisulfite conversion, it is possible to investigate strand specific methylation with separate primer pairs (Figure 31.3). The PCR amplification converts U (unmethylated C) into T, while methylated C remains C. On the complementary strand, G (opposite an originally unmethylated C) is converted into A. It is very easy to mimic the bisulfite conversion in silico, even with a word processing program, to generate the deaminated sequence needed to design primers for the amplification of the bisulfite-converted DNA. The following aspects should be considered when designing the primers:

1. To exclude the amplification of non-bisulfite-modified DNA, primer pairs should include some deaminated Cs (T in the forward primer and, correspondingly, A in the reverse primer).

Sequencing by Synthesis, Classic Pyrosequencing, Section 30.2.1


Figure 31.3 Principle of the methylation analysis with the bisulfite method. DNA is denatured and treated with bisulfite. With this method the methylated cytosines (5mC and 5hmC) are conserved while the unmethylated cytosines (C, 5fC, and 5caC) are deaminated to uracil (U) and appear as thymine (T) after PCR amplification. Note that after the bisulfite treatment the DNA strands are no longer complementary and can be amplified with different primer pairs (A, B or C, D).

2. The primers should not include CpGs of the original DNA sequence, to avoid specific amplification of methylated or unmethylated DNA (see also methylation specific PCR). If this is unavoidable, it is possible to insert Y (pyrimidine: C or T) instead of C and, in the complementary strand, R (purine: G or A) instead of G.
3. Since the primers do not contain Cs (in the complementary strand no Gs), their annealing temperature is often low; therefore primers with a length of 25–30 nt should be used.
4. For pyrosequencing a biotinylated primer and a sequencing primer are necessary.
5. The PCR product should not be longer than 500 bp, since longer DNA fragments are amplified at a low rate. This is caused by the fact that the bisulfite treatment degrades DNA, so that long intact DNA molecules are not available for the amplification.
6. The methylation patterns of the two complementary strands of an individual DNA molecule are analyzed separately; methylation in the context of the double strand can be analyzed by ligation of a hairpin linker prior to the bisulfite treatment.

Semi-nested or Nested PCR, Section 29.3.1
Chemical Cleavage According to Maxam Gilbert, Section 30.1.3

Normally, 50 ng of bisulfite-treated DNA is fully sufficient for a PCR reaction. If only a small amount of DNA is available, the sensitivity of the detection can be increased by a semi-nested or nested PCR. The DNA methylation is identified either by direct sequencing of the PCR products or by sequencing many individual DNA molecules after cloning into a vector system. One advantage of the bisulfite method is that the DNA methylation can be detected by conventional sequencing (dideoxy or Maxam–Gilbert method) or by pyrosequencing (Figure 31.4). Methylated as well as unmethylated C can be detected by DNA sequencing: mC occurs as C and unmethylated C as T (Figure 31.4). Hence, the methylation of all CpGs in a DNA fragment can be analyzed. Pyrosequencing additionally allows quantitative conclusions about the methylation level of the PCR products and reveals incomplete conversion of unmethylated cytosines (see bisulfite control in Figure 31.4). Alternatively, short DNA fragments can be sequenced by mass spectrometry. During the PCR reaction a preferential amplification (bias) of unmethylated or methylated DNA may occur. This PCR bias can be investigated with samples representing defined mixtures of methylated and unmethylated DNA. To generate methylated standards, DNA is methylated with a CpG methylase (e.g., SssI methylase) before bisulfite conversion and, if necessary, by cloning of PCR products in Escherichia coli.
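The in silico deamination described above is easily scripted. A minimal Python sketch, assuming the CpG methylation status is unknown (converted CpG cytosines are written as the degenerate base Y, and top-strand Gs opposite bottom-strand CpGs as R); the function names are mine, not from the chapter:

```python
# Sketch of in silico bisulfite conversion for primer design.
# Assumption: methylation status of CpGs is unknown, so they become
# degenerate bases (Y = C/T, R = G/A); non-CpG Cs are always deaminated.

def bisulfite_convert_top(seq: str) -> str:
    """Deaminate the top strand: non-CpG C -> T, CpG C -> Y."""
    s = seq.upper()
    out = []
    for i, base in enumerate(s):
        if base == "C":
            nxt = s[i + 1] if i + 1 < len(s) else ""
            out.append("Y" if nxt == "G" else "T")  # CpG C may survive if methylated
        else:
            out.append(base)
    return "".join(out)

def bisulfite_convert_bottom(seq: str) -> str:
    """Converted bottom strand, written in top-strand coordinates:
    a G opposite an unmethylated C reads as A; at CpGs it becomes R."""
    s = seq.upper()
    out = []
    for i, base in enumerate(s):
        if base == "G":
            prev = s[i - 1] if i > 0 else ""
            out.append("R" if prev == "C" else "A")  # CpG G may survive as G
        else:
            out.append(base)
    return "".join(out)
```

`bisulfite_convert_top` yields the template for forward-primer design (rules 1–3 above); `bisulfite_convert_bottom` yields the converted bottom strand in top-strand coordinates for the second primer pair.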

31.2.2 Restriction Analysis after Bisulfite PCR

In Vitro Restriction Analysis, Section 27.1.4

For further analysis of bisulfite-modified DNA different sensitive detection methods have been developed. One of them is the restriction analysis of PCR products of bisulfite-treated DNA, termed combined bisulfite restriction analysis (COBRA). The principle is that in a CpG sequence a methylated C remains C after bisulfite treatment, while an unmethylated C is converted into T. If the C lies in a palindromic sequence (e.g., 5′-TCGA) the methylation can be investigated with a restriction enzyme. The restriction enzyme TaqI cuts the recognition sequence TCGA, so that the “methylated” PCR products are digested (Figures 31.5 and 31.6). If, on the other hand, this restriction site is missing and the product is not cut, the PCR product originates from an


Figure 31.4 Examples of bisulfite methylation analysis after conventional sequencing (a) or pyrosequencing (b) of PCR products of the RASSF1A-promoter. (a) After bisulfite treatment and PCR amplification all unmethylated cytosines (C) are replaced by thymines (T). Methylated Cs in a CpG context are resistant to this conversion. (b) Pyrograms of three sequencing reactions with the sequence YGTTYGGTTYGYGTTTGTTA and different levels of methylation of the PCR products. Double height of a signal indicates the incorporation of two nucleotides.

Figure 31.5 Principle of the restriction analysis after bisulfite PCR. The DNA is denatured, treated with bisulfite, and amplified by PCR. While unmethylated cytosines are converted to thymines (T), the methylated cytosines (mC) remain C. In this way, for example, the restriction site for TaqI (5′-TCGA) emerges in the “methylated” DNA and the recognition site for TasI (5′-AATT) in the “unmethylated” DNA.
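The emergence of diagnostic sites shown in Figure 31.5 can also be checked in silico. A small sketch under the simplifying assumption of complete conversion and all-or-none CpG methylation (function names are my own):

```python
# Illustrative COBRA-style check: after simulated bisulfite conversion,
# count restriction sites that appear only in the "methylated" or only
# in the "unmethylated" version of a fragment.

def convert(seq: str, methylated: bool) -> str:
    """Bisulfite-convert the top strand; CpG cytosines survive only if methylated."""
    s = seq.upper()
    out = []
    for i, b in enumerate(s):
        if b == "C":
            in_cpg = i + 1 < len(s) and s[i + 1] == "G"
            out.append("C" if (methylated and in_cpg) else "T")
        else:
            out.append(b)
    return "".join(out)

def diagnostic_sites(seq: str, site: str) -> tuple:
    """Occurrences of a recognition site in the (methylated, unmethylated)
    bisulfite-converted versions of the fragment."""
    return (convert(seq, True).count(site), convert(seq, False).count(site))
```

For the fragment 5′-CCGA a TaqI site (TCGA) appears only in the methylated version; for 5′-AATCGT a TasI site (AATT) appears only in the unmethylated version, mirroring Figure 31.5.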


Figure 31.6 Example of a restriction analysis of the RASSF1A-promoter after bisulfite PCR. The 205 bp long “unmethylated” PCR product is not cut by TaqI. The “methylated” PCR product is digested into 171 bp (partially methylated), 90 bp, and 81 bp fragments. (The 34 bp long fragment is not visualized.) A 100 bp marker (M) serves as length standard for the 2% agarose gel.

Table 31.2 Enzymes for the restriction analysis of PCR products of bisulfite-modified DNA.

Restriction enzyme – Recognition sequence

For methylated DNA (CpG):
TaqI – T/CGA
BstUI – CG/CG
MaeII, HpyCH4IV, or TaiI – A/CGT
BsiWI – C/GTACG
PvuI – CGAT/CG
ClaI – AT/CGAT
MluI – A/CGCGT

For unmethylated DNA (TpG):
TasI, Tsp509I – /AATT
AseI, VspI – AT/TAAT
SspI – AAT/ATT

unmethylated DNA (Figures 31.5 and 31.6). For COBRA all enzymes with CG in the recognition sequence can be used as diagnostic restriction enzymes (Table 31.2), as well as those with only one C at the 3′-end (e.g., EcoRI: GAATTC). The most common enzymes, however, are the four-base-pair cutters TaqI, BstUI, and MaeII, since their recognition sites are more abundant (Table 31.2). Interestingly, this assay can also be used to verify the complete conversion of C into T in the analyzed DNA, since a new restriction site is created through the bisulfite conversion. For example, a cutting site for TaqI (TCGA) is only created from the original sequence 5′-CCGA when the unmethylated 5′-C was modified to T while the second C, in a methylated CpG context, was not converted (Figure 31.5). The same principle can also be used to investigate an unmethylated C in a CpG sequence. If a C at the 3′-end of a putative recognition site is modified to T, a new restriction site is created only in the deaminated “unmethylated” DNA (Figure 31.5). For example, if the sequence AATCG is modified to AATTG, a restriction site for the enzyme TasI (AATT) will be found in a PCR product amplified from unmethylated DNA (Figure 31.5). One main limitation of COBRA is that only DNA methylation at restriction enzyme recognition sites can be analyzed; therefore not all CpGs in a DNA molecule can be investigated.

31.2.3 Methylation Specific PCR

Figure 31.7 Principle of the methylation specific PCR. DNA is denatured and treated with bisulfite. In this process methylated cytosines (mC) are retained while unmethylated Cs are deaminated to uracil (U). The “methylated” DNA is amplified with the methylation specific primers (MF and MR) and the “unmethylated” DNA with the unmethylation specific primers (UF and UR).

In 1996 Herman and coworkers developed methylation specific PCR (MSP) to increase the sensitivity of detecting methylated DNA after bisulfite treatment. MSP is very sensitive and can detect as little as 0.1% methylated (or unmethylated) DNA sequences per sample. The MSP method uses different primer pairs for the amplification of methylated and unmethylated DNA after the bisulfite modification (Figure 31.7). These primers are located at specific CpGs and


their amplification rate reveals the methylation status of these Cs. For the amplification of methylated DNA a methylation specific primer pair with Cs in the forward primer and Gs in the reverse primer at the investigated CpG sites is utilized (Figure 31.7). These primers bind and amplify only the previously methylated, bisulfite-modified DNA. In contrast, for the amplification of the unmethylated bisulfite-treated DNA an unmethylation specific primer pair is used, in which the Cs of the forward primer are replaced by T (and the Gs of the reverse primer by A). These primers bind and amplify only the previously unmethylated, bisulfite-modified DNA. After gel electrophoresis the methylation status is directly detected from the amounts of methylated and unmethylated PCR products (Figure 31.8). To increase the specificity of the amplification of “methylated” or “unmethylated” bisulfite-treated DNA by MSP, several aspects should be considered during primer design. Moreover, it is important to use DNA controls with known methylation status (i.e., methylated, unmethylated, and unconverted negative controls). The following aspects should be considered for the primer design:

– The methylation specific forward primer should harbor a C at the 3′-end (the reverse primer a G).
– The unmethylation specific forward primer should have a T at the 3′-end (the reverse primer an A).
– The primers should cover three to four CpGs or TpGs to ensure specific amplification of the methylated or unmethylated DNA, respectively.
– To increase the specificity of the primer pairs for bisulfite-modified DNA, the forward primer should harbor several Ts for “deaminated Cs” (and the reverse primer the corresponding As).

The advantages of the MSP method are its high sensitivity and the simplicity of the assay. MSP has also been combined with real-time detection (real-time MSP). The MethyLight method uses methylation- (or unmethylation-) specific TaqMan probes during the PCR and is therefore suited to quantify the methylation level of the analyzed CpGs.
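The MSP primer-design rules above can be illustrated with a small sketch that derives the two forward-primer variants from a genomic primer-binding region, assuming all-or-none CpG methylation (the function name and the example region are illustrative). A real design must additionally check melting temperature, the 3′-terminal CpG, and the number of converted non-CpG Cs:

```python
# Sketch: derive MSP forward-primer variants from a genomic region.
# Methylation-specific variant: CpG Cs stay C; unmethylation-specific
# variant: every C reads as T. All-or-none methylation is assumed.

def msp_forward_primers(region: str) -> tuple:
    region = region.upper()
    meth = []
    for i, b in enumerate(region):
        if b == "C" and i + 1 < len(region) and region[i + 1] == "G":
            meth.append("C")           # methylated CpG C survives conversion
        elif b == "C":
            meth.append("T")           # non-CpG C is always deaminated
        else:
            meth.append(b)
    unmeth = region.replace("C", "T")  # unmethylated template: all Cs read as T
    return "".join(meth), unmeth
```

The two returned strings differ exactly at the CpG positions, which is what makes the PCR discriminate methylated from unmethylated template.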

31.3 DNA Analysis with Methylation Specific Restriction Enzymes

Some restriction endonucleases do not cut the DNA when their recognition sites are methylated, while other restriction enzymes are insensitive to such DNA methylation. For a third group of enzymes methylation of their recognition site is required for cutting. These restriction endonucleases are used to identify methylated Cs or methylated As in their recognition sites. Often the enzymes HpaII and MspI are used for the analysis of the methylation status of cytosines at CpG dinucleotides (Figure 31.9). Both enzymes recognize the sequence 5′-CCGG, but HpaII is able to cut this sequence only when the second cytosine is unmethylated. In contrast, the methylation insensitive isoschizomer MspI cuts the methylated CmCGG sequence as well as the unmethylated CCGG sequence. The methylation specific restriction fragments of HpaII and MspI are analyzed by Southern blot or PCR (Figure 31.9). For Southern blot analysis approximately 10 μg of genomic DNA is necessary. This technique allows a quantitative estimation of the methylation rate at the specific cutting site. Southern blot and hybridization can be conducted by a standard protocol.
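The HpaII/MspI logic lends itself to a simple in silico digestion check. In the sketch below (function name mine, not from the chapter), methylation is supplied as the set of 0-based positions of the methylated internal Cs of CCGG sites:

```python
# Sketch of the HpaII/MspI logic: HpaII cuts CCGG only when the internal
# C is unmethylated; MspI cuts regardless of CpG methylation.

def count_cuts(seq: str, methylated_positions: set, enzyme: str) -> int:
    """Count cuts for HpaII or MspI on a sequence with known methylated Cs."""
    s = seq.upper()
    cuts = 0
    i = s.find("CCGG")
    while i != -1:
        inner_c = i + 1                       # the C of the central CpG
        if enzyme == "MspI":
            cuts += 1                         # insensitive to CpG methylation
        elif enzyme == "HpaII" and inner_c not in methylated_positions:
            cuts += 1                         # HpaII is blocked by methylation
        i = s.find("CCGG", i + 1)
    return cuts
```

A site whose internal C is methylated is counted for MspI but skipped for HpaII, mirroring the fragment patterns in Figure 31.9.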


Figure 31.8 Example of a methylation specific PCR of the RASSF1A-promoter. After bisulfite treatment, previously unmethylated DNA is amplified with the unmethylation specific primer pair (u) and a 105 bp PCR product is detectable after gel electrophoresis. On the other hand, a 93 bp PCR product is obtained by amplification of previously methylated DNA with the methylation specific primer pair (m). For partially methylated bisulfite-treated DNA, PCR products are detectable with both primer pairs. A 100 bp marker (M) served as length standard for the 2% agarose gel.


Figure 31.9 DNA-methylation analysis with methylation specific restriction enzymes. The methylation sensitive enzyme HpaII cuts only unmethylated DNA. The methylation specific inhibition is analyzed by Southern blot or PCR and controlled with the insensitive enzyme MspI, which cuts both the methylated and the unmethylated DNA. For the Southern blot analysis the DNA is digested with an additional insensitive restriction enzyme (R).

Isoschizomers are distinct restriction endonucleases with identical recognition sites, which generate identical or different cleavage products.

For the PCR analysis after a methylation specific restriction, only little genomic DNA (50 ng) is needed, which allows detection of low levels of methylated DNA. However, this technique is prone to restriction artifacts (e.g., incomplete digestion, see below) and should therefore be well controlled. For this analysis two primers flanking the restriction site are designed. PCR products are analyzed by gel electrophoresis and compared to specific controls. Only for the methylated sample is a PCR product obtained after a restriction digest with the methylation sensitive enzyme HpaII (Figure 31.9). As a control, no product should be obtained after digestion with the insensitive MspI and PCR amplification. As a further control, the DNA can be digested with a restriction enzyme that cuts outside of the analyzed fragment; after this restriction a PCR product should be detected. For the methylation analysis of mCpG a number of methylation sensitive restriction enzymes can be utilized; several of them are listed in Table 31.3. With these enzymes not only the methylation of known cutting sites can be investigated, but also novel methylated DNA

Table 31.3 Methylation sensitive enzymes and insensitive isoschizomers for restriction analysis.

Sensitive enzyme (insensitive isoschizomer) – Methylated recognition sequence (isoschizomer)

mCpG-methylation:
HpaII (MspI) – C/mCGG (C/mCGG)
BstUI – mCG/mCG
NotI – GmC/GGCmCGC
AscI – GG/mCGmCGCC
SmaI (XmaI) – CCmC/GGG (C/CmCGGG)

Dcm-methylation – CmCWGG:
EcoRII (BstNI) – /CmCWGG (CmC/WGG)
SfoI (NarI) – GGC/GCmCWGG (GG/CGCmCWGG)
Acc65I (KpnI) – G/GTACmCWGG (GGTAC/mCWGG)

Dam-methylation – GmATC:
MboI (Sau3AI) – /GmATC (/GmATC)
DpnII (Sau3AI) – /GmATC (/GmATC)
BclI – T/GmATCA
AlwI – GGmATC(4/5)


regions can be isolated that are resistant to digestion by methylation sensitive restriction enzymes. Methylation-sensitive arbitrarily primed PCR, differential methylation hybridization, and restriction landmark genome scanning are examples of methods used to identify potentially methylated DNA regions in the genome (see Further Reading). For methylation analysis with restriction enzymes a certain caution is necessary: incomplete digestion of unmethylated DNA with a methylation sensitive enzyme can be mistaken for a partially methylated restriction site. Since the restriction of the genomic DNA can be inhibited by contamination of the sample with cell membranes, carbohydrates, or lipopolysaccharides, or by wrong reaction conditions (e.g., salt concentration or pH), the purity of the genomic DNA is essential. In Escherichia coli cytosine is methylated only within the sequence 5′-CmCWGG (W = A or T), which is called Dcm-methylation. The methylation status of this sequence can be analyzed with the isoschizomer pair EcoRII/BstNI (Table 31.3). While EcoRII does not cut the methylated sequence (CmCWGG), the isoschizomer BstNI cuts both the methylated and the unmethylated sequence. Some further enzymes are sensitive to Dcm-methylation; their consensus sequences are listed in Table 31.3. In prokaryotes methylation of adenine is found in the sequence 5′-GmATC, which is termed Dam-methylation. To analyze this adenine methylation the isoschizomers MboI/Sau3AI can be used (Table 31.3). Both enzymes recognize the sequence GATC. MboI is sensitive to mA and does not cut the methylated sequence. In contrast, Sau3AI is insensitive to this methylation and cuts the sequence GmATC. Methylated GmATC sequences can also be detected with DpnI. Interestingly, DpnI only cuts the DNA when the adenines on both strands of its recognition site are methylated.
It should be considered that, because of Dcm- and Dam-methylation, the cloning of certain DNA fragments from Escherichia coli is sometimes problematic when certain restriction sites are methylated – especially when the methylation motif is completed by the flanking sequence. The restriction enzymes ClaI (ATCGAT) and XbaI (TCTAGA) do not cut the DNA when an adenine has been modified by an overlapping Dam-methylation (ATCGmATC and TCTAGmATC, respectively). The enzyme StuI (AGGCCT) can be inhibited by an overlapping Dcm-methylation (AGGCmCTGG). Thus it is important to consider the flanking bases when a specific recognition site is not cut by the appropriate restriction enzyme. This problem can be avoided by using a methylation insensitive isoschizomer or an Escherichia coli strain that is negative for Dcm- or Dam-methylation.

31.4 Methylation Analysis by Methylcytosine-Binding Proteins

This method utilizes specific proteins that bind with high affinity to methylated DNA. Different proteins that bind methylated cytosine and are involved in the inactivation of gene expression and in changes of the chromatin state have been isolated from mammalian cells. These proteins are termed methyl-CpG-binding proteins (MeCPs) and possess a methyl binding domain (MBD). Several such proteins (e.g., MeCP2, MBD1, MBD2, and MBD3) have been characterized and isolated. MBD proteins are used in different methods to analyze the methylation status of specific genomic regions or to identify novel differentially methylated regions (Figure 31.10). For example, tagged MBD proteins are immobilized on nickel agarose beads and used for the enrichment of methylated DNA. Genomic DNA is fragmented by sonication or restriction digestion and purified on a column. At a specific salt concentration the methylated DNA binds to the column while the unmethylated DNA is eluted. Afterwards the methylated DNA is eluted at high salt concentration and analyzed: it can be quantified by real-time PCR, hybridized on microarrays, or analyzed by deep sequencing. The methylated-CpG island recovery assay (MIRA) uses the ability of MBD3L to bind MBD2, which increases the affinity of MBD2 for methylated DNA. For MIRA, recombinant GST-tagged MBD2b protein and His-tagged MBD3L1 protein are expressed in bacteria and purified with glutathione-Sepharose 4B or Ni-NTA agarose beads, respectively. The genomic DNA is isolated and cut with the enzyme MseI, which recognizes the site TTAA but cuts rarely within CpG islands. Then 1 μg purified GST-MBD2b and 1 μg His-MBD3L are pre-incubated together with 500 ng


Figure 31.10 Analysis of DNA-methylation with methyl-binding proteins. The genomic DNA is fragmented by sonication or restriction digestion into small fragments, bound specifically by methylcytosine binding (MBD) proteins or methylcytosine antibodies, precipitated, and purified. The methylated DNA can be quantified by PCR, sequenced, or analyzed on microarrays.

unmethylated DNA (e.g., bacterial DNA) and then incubated for some hours with approximately 500 ng of fragmented genomic DNA. During this step the methylated DNA binds to MBD2b/MBD3L. The methylated DNA is precipitated with immobilized glutathione paramagnetic particles and the beads are washed. Subsequently, the methylated DNA is eluted from the beads and analyzed. For example, after linker ligation the enriched DNA can be amplified, labeled, and hybridized on microarrays. As an example application, tumor specific DNA methylation can be revealed by labeling DNA from tumor tissue with the fluorescent dye Cy5 (red) and comparing it to DNA from normal tissue labeled with Cy3 (green) (Figure 31.11). With the chromatin immunoprecipitation (ChIP) method endogenous MBD proteins are crosslinked to genomic DNA in vivo with formaldehyde. Subsequently, the protein–DNA complexes are purified and precipitated with an anti-MBD antibody (see also Section 31.7). Again the methylated DNA can be analyzed by PCR, next generation sequencing (NGS), or microarray (ChIP-on-chip). These methods allow the identification of methylated DNA sequences in a given genome.

31.5 Methylation Analysis by Methylcytosine-Specific Antibodies

Figure 31.11 Analysis of differential methylation levels of normal and tumor tissue. Genomic DNA is fragmented and purified with MBD proteins (MIRA) or methylcytosine specific antibodies (MeDIP). The methylated DNA is fluorescently labeled and analyzed on microarrays.

A further method for detecting and quantifying modified DNA is based on antibodies that bind specifically to 5mC, 5hmC, 5fC, or 5caC (Figure 31.10). These antibodies interact with single-stranded modified DNA, so that, for example, methylated DNA can be precipitated and analyzed. The sensitivity of the anti-5-methylcytosine antibody is very high: the monoclonal mouse 5mC antibody interacts specifically with methylated DNA in 10 ng genomic DNA containing only 3% 5mC. With these antibodies methylated DNA can be precipitated and analyzed by real-time PCR, deep sequencing, or microarray technology (Figures 31.10 and 31.11). This method is also called methylated DNA immunoprecipitation (MeDIP). During the MeDIP procedure 4 μg of genomic DNA is cut into 300–1000 bp fragments by sonication and denatured by heating. An aliquot of the sheared DNA can be used as an input control. Subsequently, the DNA is incubated with 10 μl monoclonal mouse 5mC antibody at 4 °C for several hours. The methylated DNA is precipitated with anti-mouse IgG beads,

31

Analysis of Epigenetic Modifications

827

purified, and eluted by proteinase K digestion. The enriched methylated DNA can now be investigated by real-time PCR, sequencing, or microarray analysis. The 5mC, 5hmC, 5fC, or 5caC antibodies can also be used in immunofluorescence to investigate chromosomal segments with a high density of modified cytosines.
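Enrichment of a locus in the MeDIP fraction is commonly reported as percent of input from real-time PCR Ct values. A minimal sketch; the Ct values and the 10% input fraction below are hypothetical, not taken from the protocol above:

```python
import math

def percent_of_input(ct_ip, ct_input, input_fraction):
    """Recovery of a locus in the immunoprecipitate relative to total input DNA.

    ct_input is first rescaled to 100% of the starting material, since only
    a fraction (e.g. 0.1 for a 10% aliquot) was saved as input control.
    """
    ct_input_100 = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (ct_input_100 - ct_ip)

# Hypothetical Ct values: a methylated promoter is recovered efficiently,
# an unmethylated control region is not
print(f"{percent_of_input(26.0, 28.0, 0.1):.1f}% of input (methylated locus)")
print(f"{percent_of_input(33.0, 28.0, 0.1):.2f}% of input (unmethylated locus)")
```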

31.6 Methylation Analysis by DNA Hydrolysis and Nearest Neighbor Assays

The following methods allow the frequency of modified bases in DNA to be analyzed; with nearest neighbor assays, their dinucleotide context can additionally be determined (e.g., 5mCpN or 5hmCpN). However, these methods cannot locate the modified base within the genomic sequence. Since DNA from contaminating organisms can distort the results, the investigated cells should be free of foreign DNA from viruses, mycoplasmas, or other endoparasites. In the DNA hydrolysis method the DNA is completely hydrolyzed, the resulting bases are fractionated, and the modified bases are quantified. Since the products of chemical hydrolysis are rather complex, enzymatic hydrolysis is the preferred method. Spleen phosphodiesterase or Micrococcus nuclease produces 3′-phosphorylated mononucleotides; pancreatic DNase I or snake venom phosphodiesterase produces 5′-phosphorylated mononucleotides. Afterwards, the 3′- or 5′-phosphates are removed with an alkaline phosphatase and the hydrolysis products are identified by techniques such as high-performance liquid chromatography (HPLC), mass spectrometry, or capillary electrophoresis (CE). With HPLC it is possible to detect 5mC levels between 0.04 and 0.005% in 2.5 μg of DNA. With nearest neighbor analysis the frequency of methylated bases in the context of their 3′ neighbor can be dissected. For this purpose the purified genomic DNA is labeled with one of the four [α-32P]dNTPs by nick translation at random strand breaks, which can be generated by DNase I (Figure 31.12). The DNA is then digested with Micrococcus nuclease and calf thymus phosphodiesterase (an exonuclease) to 3′-dNMPs, whereby the radioactive 32P from the 5′-position of the labeled nucleotide is transferred to the 3′-position of its 5′ neighbor. The 3′-labeled dNMPs are separated by adsorption chromatography or HPLC and compared to modified standards.
For example, one can analyze how often 5mC or N6mA occurs as the 5′ neighbor of each of the four labeled bases. Since this method is rather laborious, it has been refined: the DNA is digested with a restriction enzyme and the cutting site is then labeled with a specific [α-32P]dNTP and Klenow fragment. To analyze methylated CpG, the DNA can be digested with MboI (/GATC) and the cutting site labeled with [α-32P]dGTP and Klenow fragment (Figure 31.12). The DNA is digested with nucleases to 3′-labeled dNMPs and the modified base is quantified by chromatography. The intensities of labeled 5mdCp, dCp, dTp, dGp, and dAp indicate the quantities of 5mdCpG, dCpG, dTpG, dGpG, and dApG, respectively, at MboI cutting sites. Alternatively, the DNA can be digested with FokI (GGATG(9/13)) and labeled with one of the four [α-32P]dNTPs.
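In the MboI variant, the fraction of methylated CpG follows directly from the relative chromatographic signals of labeled 5mdCp and dCp. A sketch with hypothetical count rates:

```python
def cpg_methylation_fraction(counts_5mdCp, counts_dCp):
    """Fraction of CpG dinucleotides that are methylated at MboI sites,
    estimated from the labeled 5m-dCp and dCp signals (e.g. cpm)."""
    return counts_5mdCp / (counts_5mdCp + counts_dCp)

# Hypothetical scintillation counts from the separated 3'-dNMP fractions
print(cpg_methylation_fraction(70000, 30000))  # → 0.7
```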

Figure 31.12 Principle of nearest neighbor analysis. (a) [α-32P]dGTP is inserted 3′ of the methylated adenine and analyzed by chromatography. (b) The DNA is digested with MboI and the cutting site is labeled with [α-32P]dGTP and Klenow fragment. Subsequently, methylated Cs are quantified.


Part IV: Nucleic Acid Analytics

Chromatin Immunoprecipitation (ChIP), Section 32.4.7

The advantage of FokI is that DNA modifications can be investigated independently of their recognition site, in the context of all four 3′-downstream bases (NpA, NpC, NpG, and NpT). With the enzyme MvaI (CC/WGG) only the methylation of CpA and CpT can be analyzed. In principle, several enzymes could be used; however, the selected restriction enzyme must not be sensitive to DNA modifications. Nearest neighbor analysis is a preferred technique for identifying new DNA modifications; however, it cannot localize these modifications within the genome.

31.7 Analysis of Epigenetic Modifications of Chromatin

Chromatin modifications and the specific binding of proteins (e.g., transcription factors) to DNA are preferably investigated with chromatin immunoprecipitation (ChIP) (Figure 31.13). This method is based on the crosslinking of nucleic acids to proteins, and of proteins to each other, with formaldehyde. The crosslinked protein–nucleic acid complexes are precipitated with a specific antibody against the protein or the chromatin modification. The precipitated complexes are washed stringently to remove nonspecifically bound chromatin. The crosslinks are reversed by heating and the proteins are digested. The purified DNA can be investigated by PCR (real-time PCR), deep sequencing (ChIP-Seq), or microarray (ChIP-on-chip). One important requirement for this method is that the antibody against the analyzed modification or protein also works on crosslinked chromatin. For ChIP analysis, DNA-binding proteins are crosslinked in vivo with 1% formaldehyde. During this procedure proteins are linked covalently to each other as well as to the genomic DNA. The crosslinking can be carried out in cell culture for 10 min at 37 °C. Afterwards the cells are washed and harvested. The cells are lysed in an SDS buffer and the chromatin is sheared by sonication into fragments of approximately 500 bp. Subsequently, the chromatin is diluted in a binding buffer to 200–300 μg μl−1 protein and an aliquot is retained as an input control. To reduce nonspecific binding of antibodies, the chromatin sample is pre-incubated with protein A agarose. The chromatin is then incubated at 4 °C overnight with a specific antibody against the chromatin modification or factor (Figure 31.13). In parallel, a nonspecific mouse IgG antibody can be used as a negative control, and a histone H3 antibody is suitable as a positive control.
The bound chromatin is precipitated by addition of protein A agarose for 1 h at 4 °C and centrifuged. Subsequently, all samples are washed extensively. Finally, the crosslinks in the samples and the input control are reversed by the addition of 5 M NaCl and incubation at 65 °C overnight. Proteins are digested with proteinase K and the DNA is purified. The enriched DNA can now be analyzed by real-time PCR, ChIP-on-chip (chromatin immunoprecipitation on microarray), or ChIP-Seq (chromatin immunoprecipitation with ultra-deep sequencing).
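The real-time PCR readout of a ChIP is often expressed as fold enrichment of the specific antibody over the IgG control. A minimal sketch with hypothetical Ct values; it assumes both IPs received equal amounts of chromatin:

```python
def fold_over_igg(ct_specific, ct_igg):
    """Fold enrichment of the specific IP over the IgG negative control
    for one locus (simple Delta-Ct comparison, equal chromatin input)."""
    return 2.0 ** (ct_igg - ct_specific)

# Hypothetical Ct values for a target locus
print(fold_over_igg(ct_specific=24.0, ct_igg=29.0))  # → 32.0
```

In practice both IPs would additionally be normalized to the input control, and several control loci would be assayed in parallel.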

31.8 Chromosome Interaction Analyses

Figure 31.13 Analysis of DNA-associated protein modifications or factors with chromatin immunoprecipitation (ChIP). Chromatin is crosslinked with formaldehyde and fragmented by sonication into pieces of approximately 500 bp. The chromatin is incubated with specific antibodies, precipitated, and purified. The purified DNA can be quantified by PCR, sequenced (ChIP-Seq), or analyzed on microarrays (ChIP-on-chip).

During interphase, chromosomes are uncoiled and organized into topologically associating domains (TADs) and chromosomal territories. In this configuration chromatin interacts both intrachromosomally and interchromosomally; for example, there are intrachromosomal interactions between enhancers and promoters. This chromosomal topology can be investigated with the chromosome conformation capture technique (CCC or 3C analysis) (Figure 31.14). In a 3C analysis, chromatin is crosslinked, cut, and ligated. During this treatment, associated and crosslinked DNA fragments are preferentially ligated to each other, and these specific interactions can be analyzed by quantitative PCR. For the 3C method the chromatin is incubated with formaldehyde to covalently crosslink protein–protein and protein–DNA interactions; this treatment can be performed directly in cell culture. The cells are lysed and the crosslinked chromatin is cut with a restriction enzyme. Subsequently, the restriction enzyme is inactivated and the chromatin is ligated with T4 DNA ligase. During this step, all DNA fragments that are held together by crosslinks are preferentially ligated (Figure 31.14). Afterwards, the crosslinks are reversed by incubation at 65 °C overnight and the DNA can be purified and precipitated. The amount of each ligation product can be determined by quantitative real-time PCR: the more often a ligation product is detected, the higher the probability that the corresponding DNA fragments interact in vivo. With the help of circular chromosome conformation capture (4C or circular 3C analysis), new chromosomal interactions can be detected (Figure 31.14). In this analysis, the ligation products resulting from the 3C analysis described above are cut with a further restriction enzyme and converted into circular DNA by a second ligation. These circular ligation products can be amplified by inverse PCR and the unknown DNA sequence identified by sequencing or microarray techniques. For this analysis, primers directed at a fixed, so-called anchor point are used: the inverse PCR runs from the known DNA fragment (bait DNA) into the unknown part of the ligation product, whose sequence can thereby be determined.
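3C ligation products are typically quantified relative to a control primer pair, with the same measurement on a randomly ligated control template to correct for primer efficiency. A sketch with hypothetical Ct values; a real assay would include replicates and loading normalization:

```python
def relative_interaction(ct_3c, ct_ctrl_3c, ct_random, ct_ctrl_random):
    """Relative interaction frequency of one fragment pair (Delta-Delta-Ct):
    the 3C sample is normalized to a control primer pair, and the same ratio
    on a randomly ligated control template corrects for primer efficiency."""
    delta_sample = ct_3c - ct_ctrl_3c
    delta_random = ct_random - ct_ctrl_random
    return 2.0 ** -(delta_sample - delta_random)

# Hypothetical values: the test pair ligates far more often than expected
print(relative_interaction(24.0, 26.0, 28.0, 26.0))  # → 16.0
```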

Hi-C is another modification of 3C that identifies chromatin interactions genome-wide. Again, cells are fixed with formaldehyde and the DNA is cut with a restriction enzyme. Subsequently, the 5′ overhangs are filled in with biotinylated nucleotides and Klenow fragment, and a blunt-end ligation is performed under very dilute conditions. This results in a library of ligation products that represent interacting genomic regions. The library is sheared and the junctions are pulled down with streptavidin beads. Finally, interacting regions are identified by paired-end sequencing. Another technique combines ChIP and chromosome conformation capture with high-throughput sequencing. This technique, called ChIA-PET (chromatin interaction analysis by paired-end tag sequencing), allows the identification of genome-wide chromosome interactions that are associated with a specific chromatin factor or modification.
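After paired-end sequencing, Hi-C read pairs are usually binned into a genome-wide contact matrix. A toy sketch for a single chromosome; the positions and bin size are hypothetical, and a real pipeline would also track chromosome names and filter ligation artefacts:

```python
from collections import Counter

def contact_counts(read_pairs, bin_size=1_000_000):
    """Count Hi-C contacts between genomic bins; each read pair marks one
    contact between the bins containing its two mapped positions."""
    counts = Counter()
    for pos1, pos2 in read_pairs:
        b1, b2 = pos1 // bin_size, pos2 // bin_size
        counts[(min(b1, b2), max(b1, b2))] += 1  # store each pair symmetrically
    return counts

pairs = [(1_200_000, 5_700_000), (5_100_000, 1_900_000), (1_500_000, 9_300_000)]
print(contact_counts(pairs))  # bins (1, 5) contacted twice, (1, 9) once
```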

31.9 Outlook

DNA modifications and chromatin play important roles in the regulation of gene expression (epigenetics). With the human genome sequenced, the next challenge is to decode the epigenetic modifications and the configuration of chromatin. Tissue- and disease-specific epigenetic patterns could be utilized as biomarkers for the early diagnosis of diseases and their molecular classification.

Further Reading

Beck, S. and Rakyan, V.K. (2008) The methylome: approaches for global DNA methylation profiling. Trends Genet., 24, 231–237.
Collas, P. (ed.) (2009) Chromatin Immunoprecipitation Assays: Methods and Protocols, Methods in Molecular Biology, vol. 567, Springer.
Dammann, R., Li, C., Yoon, J.H., Chin, P.L., Bates, S., and Pfeifer, G.P. (2000) Epigenetic inactivation of a RAS association domain family protein from the lung tumour suppressor locus 3p21.3. Nat. Genet., 25, 315–319.
Dekker, J., Rippe, K., Dekker, M., and Kleckner, N. (2002) Capturing chromosome conformation. Science, 295, 1306–1311.
Esteller, M. (ed.) (2005) DNA Methylation: Approaches, Methods and Applications, CRC Press, Boca Raton.
Fullwood, M.J., Liu, M.H., Pan, Y.F., Liu, J., Xu, H., Mohamed, Y.B., Orlov, Y.L., Velkov, S., Ho, A., Mei, P.H., Chew, E.G., Huang, P.Y., Welboren, W.J., Han, Y., Ooi, H.S., Ariyaratne, P.N., Vega, V.B., Luo, Y., Tan, P.Y., Choy, P.Y., Wansa, K.D., Zhao, B., Lim, K.S., Leow, S.C., Yow, J.S., Joseph, R., Li, H., Desai, K.V., Thomsen, J.S., Lee, Y.K., Karuturi, R.K., Herve, T., Bourque, G., Stunnenberg, H.G., Ruan, X., Cacheux-Rataboul, V., Sung, W.K., Liu, E.T., Wei, C.L., Cheung, E., and Ruan, Y. (2009) An oestrogen-receptor-alpha-bound human chromatin interactome. Nature, 462, 58–64.
Herman, J.G., Graff, J.R., Myohanen, S., Nelkin, B.D., and Baylin, S.B. (1996) Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc. Natl. Acad. Sci. U.S.A., 93, 9821–9826.
Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., Sandstrom, R., Bernstein, B., Bender, M.A., Groudine, M.,

Figure 31.14 Analysis of chromosomal interactions by the chromosome conformation capture technique. Chromosomal interactions are fixed by formaldehyde crosslinking; the chromatin is cut by restriction digestion and the DNA fragments are ligated. The ligation products are purified and analyzed by quantitative real-time PCR or by other techniques (e.g., inverse PCR).


Gnirke, A., Stamatoyannopoulos, J., Mirny, L.A., Lander, E.S., and Dekker, J. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326, 289–293.
Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G., Tonti-Filippini, J., Nery, J.R., Lee, L., Ye, Z., Ngo, Q.M., Edsall, L., Antosiewicz-Bourget, J., Stewart, R., Ruotti, V., Millar, A.H., Thomson, J.A., Ren, B., and Ecker, J.R. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462, 315–322.
Xiong, Z. and Laird, P.W. (1997) COBRA: a sensitive and quantitative DNA methylation assay. Nucleic Acids Res., 25, 2532–2534.
Zhao, Z., Tavoosidana, G., Sjölinder, M., Göndör, A., Mariano, P., Wang, S., Kanduri, C., Lezcano, M., Sandhu, K.S., Singh, U., Pant, V., Tiwari, V., Kurukuti, S., and Ohlsson, R. (2006) Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet., 38, 1341–1347.

32 Protein–Nucleic Acid Interactions

Rolf Wagner
Gustav-Stresemann-Straße 15, 41352 Korschenbroich, Germany

Protein–nucleic acid interactions are fundamental to all events in living organisms that serve the conservation and propagation of genetic information. All steps in the flow of genetic information, such as replication, transcription, and translation, as well as events during chromatin remodeling, repair, maturation, or transport, are characterized by extensive contacts between nucleic acids and diverse classes of proteins. Notable examples of such proteins are polymerases, transcription factors, helicases, topoisomerases, ligases, ribosomes, and telomerases, the latter two being themselves complexes of RNA and protein. Hence, for modern molecular biology and biochemistry it is of central importance to unravel the molecular mechanisms underlying the recognition between proteins and nucleic acids. Although recognition between proteins and DNA or RNA shares many common molecular principles, there are several subtle differences, caused by the fundamentally different higher-order structures and functions of the two nucleic acid classes, that lead to specific peculiarities in their interaction mechanisms. A separate chapter is therefore devoted to methods for the analysis of RNA–protein complexes. It should be emphasized, however, that many of the methods described are suitable for the analysis of RNA–protein complexes as well as DNA–protein complexes. This is especially true for the physical methods described in this chapter.

32.1 DNA–Protein Interactions

32.1.1 Basic Features for DNA–Protein Recognition: Double-Helical Structures

DNA predominantly exists as a double-stranded helical structure, which over large stretches adopts the B-form postulated by Watson and Crick. This helical structure consists of two base-paired polynucleotide strands of opposite polarity that are plectonemically intertwined (the strands cannot be pulled apart without unwinding). In this structure, the negative charges of the sugar-phosphate backbone point outwards at an optimal distance from each other. The helix is furthermore characterized by a major groove and a minor groove, which wind in right-handed turns around the helix axis. The paired aromatic bases (A:T and G:C) are stacked on top of each other, essentially perpendicular to the helix axis. Neighboring base pairs are rotated relative to each other by 36° in a right-handed sense (twist), which results in a full helical turn after about ten base pairs. The donor and acceptor positions of the nucleotide bases are involved in base pairing within the helix and are thus shielded by the sugar-phosphate backbone from the functional side groups of approaching proteins. Apart from the interactions of polymerases and some single-strand-binding proteins, recognition by most DNA-binding proteins occurs without breaking the double-stranded base-pairing structure. Hence, the interaction between DNA and

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.




Figure 32.1 Structure of helical B-DNA. (a) Arrangement of the sugar-phosphate backbone and the major and minor grooves. (b) Chemical recognition motifs within the grooves. A: H-bond acceptors, D: H-bond donors, and M: methyl group.

proteins does not involve specific Watson–Crick-type base pairing, which is otherwise extremely important for nucleic acid interactions in biological processes. However, the grooves of the DNA double helix provide very specific surfaces for the recognition of protein structures: each base pair exhibits an individual pattern of H-bond donors, H-bond acceptors, and methyl groups for interaction with the amino acid side chains of proteins. The helical grooves of the DNA therefore play a predominant role in the interaction between DNA and proteins (Figure 32.1). There are, of course, special proteins that recognize less abundant DNA structures, such as single-stranded DNA or alternative helix structures like Z-DNA, which is a left-handed helix. Interestingly, these proteins often share structural similarities with RNA-binding proteins.

32.1.2 DNA Curvature

Figure 32.2 Parameters describing helical DNA conformations. (a) DNA-helix parameters; (b) schematic models explaining DNA curvature.

Overall, DNA does not exist solely in the B-form. In fact, the exact helical geometry of a given DNA results from the sequence of its base pairs. Depending on the individual sequence, the DNA does not uniformly follow a B-form structure; local differences in the helical conformation occur as a result of deviations in rotational (twist, tilt, or roll angles) or translational parameters (shift, rise, slide) of the base pairs. Often these structural deviations cause a curved path (curvature) of the otherwise straight DNA. Such changes in the DNA contour often provide additional recognition signals for the interaction of specific proteins. Characteristically, curvature arises at consecutive A:T base pairs clustered in helical phase. DNA curvature can, however, also result from GGCC sequence repetitions, although the direction of the curvature differs: at A:T clusters the minor groove points to the inside of the curve, whereas at GGCC repeats the minor groove points to the outside. The resulting contour angles are quite remarkable; for a single A:T cluster, angles between 12° and 23° have been reported. Several models have been put forward to explain the occurrence of curvature; the most descriptive are probably the ApA wedge and B-junction models (Figure 32.2). In a simplified way, the ApA wedge model predicts that each ApA dinucleotide step causes a change in the tilt and roll angles, resulting in a wedge-like opening of the base stack. Several such alterations in helical phase (at ten base-pair spacing, corresponding to one helical turn) lead to a continuous DNA curvature.
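The additivity underlying the wedge model can be illustrated numerically: if identical wedges repeat in helical phase (every ~10 bp), their deflections add in the same plane. This is an idealized sketch; the per-cluster angle is illustrative, chosen within the 12–23° range quoted above:

```python
def phased_curvature(wedge_angle_deg, n_repeats):
    """Total bend when identical wedge deflections repeat in helical phase,
    so that the individual angles add coplanarly (idealized model)."""
    return wedge_angle_deg * n_repeats

# Five phased A:T clusters of 18 degrees each give a pronounced overall bend
print(phased_curvature(18.0, 5))  # → 90.0
```

Out-of-phase wedges would partly cancel, which is why phasing with the helical repeat is essential for macroscopic curvature.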
In the B-junction model, the curvature is explained by the fact that only in B-form DNA are the stacked bases perpendicular to the helix axis, while in other helical conformations, such as the A-form, the plane of the bases relative to the helix axis is changed by the tilt angle. DNA sequences consisting of A:T clusters have a tendency to adopt the A-form. A kink of the helix axis occurs at the junction between the A- and B-conformations because the stacking of the aromatic bases forces all base pairs to remain packed in a parallel manner. Notably, DNA curvature need not be static to enable the interaction of curvature-dependent proteins. Often, anisotropic flexibility of the DNA (preferential bending in one direction) suffices: it shortens the apparent persistence length and facilitates deformation in one but not the other direction. A specific interaction is supported if an adequate adaptation of the DNA structure to the protein surface is possible; this type of induced conformational change is termed DNA bending. DNA curvature, and with it the efficiency of binding various proteins, depends on several external parameters, including temperature (DNA curvature normally melts above 50 °C) and the presence of divalent ions such as Mg2+ and Ba2+, which generally enhance curvature, while antibiotics like distamycin, which binds in the minor groove of A:T-rich sequences, reduce curvature. Moreover, superhelicity has a profound effect on DNA curvature. How can the curvature of DNA be detected, how can its position and magnitude be determined, and how could one show that the binding of a protein alters the curvature of a given DNA? The simplest approach, applicable in almost every laboratory, is gel electrophoresis. As outlined below, DNA curvature reduces gel electrophoretic mobility: a curved DNA fragment migrates more slowly than a non-curved fragment of the same length. To determine the degree of curvature of a particular DNA, one compares the gel electrophoretic mobilities of the curved (μobs) and the non-curved (μact) DNA fragments. This can easily be done for DNA fragments between 100 and 500 bp on native polyacrylamide gels of 8–10%. To measure the difference in mobility, the electrophoresis has to be performed at low temperature (<50 °C), at which the curvature has not yet melted. The ratio of the two mobilities is termed the k-factor (k = μact/μobs). A k-factor larger than 1 (k > 1) indicates that the DNA is curved, and the magnitude of the k-factor correlates with the angle of the curvature.
Moreover, the magnitude of the k-factor depends on the position of the center of curvature within the DNA fragment: the reduction in mobility caused by curvature correlates with the end-to-end distance of the fragment. The end-to-end distance of a curved DNA fragment is a function of both the curvature angle and the position of the center of curvature relative to the fragment ends. For the same angle, the end-to-end distance is smaller if the center of curvature lies close to the middle of the fragment rather than near its ends. This means that DNA fragments of equal size with the same curvature exhibit the lowest gel electrophoretic mobility when the center of curvature is located in the middle of the fragment. If one determines the mobility of a DNA fragment with a given curvature in the middle of the fragment (μM) and compares it with the mobility of the same curvature localized at the fragment ends (μE), the curvature angle α can be derived from the empirical relationship μM/μE = cos(α/2). This gel electrophoretic measurement not only allows determination of the degree of curvature, it also discloses the position of the center of curvature within a given DNA fragment. Special plasmids have been constructed to clone curved DNA fragments or curved protein-binding sites at different positions within a series of DNA fragments of exactly equal length (circular permutation assay). The different DNA fragments are generated by hydrolysis with a set of restriction enzymes whose cleavage sites are positioned as direct repeats flanking a central cloning site that takes up the DNA of interest. The combined restriction digests yield a set of DNA fragments of identical size with the inserted DNA at different distances from the fragment ends.
A plot of the gel electrophoretic mobility of the different fragments against the position of the insert (in bp) yields the center of curvature as the extrapolated position of minimal mobility. The curvature angle can additionally be derived from μM/μE = cos(α/2) (Figure 32.3).
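The relations above translate directly into code. A sketch with hypothetical relative mobilities (μ values normalized to a straight fragment of the same length):

```python
import math

def k_factor(mu_act, mu_obs):
    """k = mu_act / mu_obs; k > 1 indicates a curved fragment."""
    return mu_act / mu_obs

def curvature_angle_deg(mu_mid, mu_end):
    """Curvature angle from the empirical relation mu_M / mu_E = cos(alpha/2)."""
    return 2.0 * math.degrees(math.acos(mu_mid / mu_end))

# Hypothetical mobilities from a circular permutation gel
print(f"k = {k_factor(1.00, 0.85):.2f}")
print(f"alpha = {curvature_angle_deg(0.90, 0.95):.1f} degrees")
```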

32.1.3 DNA Topology

Single-stranded DNA forms are extremely rare; hence, most existing DNA conformations can be described by different double-helical forms and sections of static or dynamic curvature. However, biological DNAs are often covalently closed circles and/or very large molecules whose ends are fixed and not free to rotate. Such structures give rise to different topological isomers, which are characterized by additional parameters. The topology of DNA molecules is very important for biological processes such as replication or transcription, which generally involve DNA–protein interactions. Some fundamental facts needed to understand the effect of topology on DNA–protein interactions are listed below.


Figure 32.3 Scheme of a permutation analysis to determine the center of curvature within a given DNA. (a) Arrangement of DNA fragments of identical length resulting from restriction hydrolysis with enzymes A–G. Restriction sites of the enzymes A–G for the integration of a DNA-binding region or a curved DNA flank the cloning site (grey box) as direct repeats. Hydrolyses with the different enzymes result in fragments of equal length in which the region for the integration of the DNA in question exhibits different distances to the fragment ends. (b) Schematic depiction of a retardation gel with the different DNA fragments. (c) Diagram showing the relative mobilities of the DNA fragments as a function of the position of the cloning site relative to the fragment start. The center of the curvature is derived by extrapolation of the position with minimal mobility versus the base position.


The spatial description of molecules that exist as closed circles, or whose ends are fixed, requires an additional dimension defining the topology of the system. Such molecules can exist as different topoisomers. In the case of circular DNA molecules, different superhelical structures give rise to different topoisomers. Superhelical windings are divided into positive (left-handed screw, DNA overwound) and negative (right-handed screw, DNA underwound). Circular DNA with neither positive nor negative superhelical windings is termed relaxed.

For a more detailed description of DNA topology and related phenomena the reader is referred to the specialized literature. The parameters relevant for topological molecules are related by a simple equation:

LK = TW + WR (32.1)

Figure 32.4 Schematic illustration of the parameters LK (a), TW (b), and WR (c) describing superhelical DNA structures.

Figure 32.5 Coupling between transcription and superhelical DNA according to the twin supercoiled domain model.


where LK is the linking number, TW the twisting number, and WR the writhing number. The linking number LK describes how often the two DNA strands are intertwined; it is the topological constant, which can only be changed if a DNA strand is broken, and it is necessarily an integer. The twist TW gives the number of rotations of the antiparallel DNA strands around the helix axis; in B-DNA there is, for instance, one turn per approximately 10.5 bp. The writhing number WR reflects the three-dimensional contour of the helix axis and describes the number of superhelical over- or underwindings. For relaxed circular DNA without any superhelical windings, WR = 0; for such a molecule the linking number and twisting number are identical (LK = TW, which follows from LK = TW + WR). For right-handed superhelical windings the writhing number is negative (WR < 0), while for left-handed supercoils it is positive (WR > 0) (Figure 32.4). How can these parameters be influenced? As outlined above, the topological constant LK can only be changed by breaking covalent bonds (the enzymes responsible in the cell are called topoisomerases). Both WR and TW, in contrast, are subject to change by a number of biologically relevant processes related to protein binding. Examples are changes in twist through overwinding or melting of the double-stranded structure. Proteins that change the twist upon DNA binding either enhance or reduce the superhelicity of the DNA; in turn, enhanced or reduced superhelicity may change the binding affinity of such proteins. Processes like the intercalation of aromatic amino acids between DNA base pairs automatically change the twist. The intercalation of dyes (e.g., ethidium bromide), or the binding of antibiotics in the grooves of the DNA, has a similar effect. The superhelicity also changes if base pairs are disrupted and the DNA melts in response to protein binding, because unwinding the DNA strands reduces TW.
This effect applies to all polymerases, which cause DNA melting over a defined range. Transcription therefore has a direct influence on DNA superhelicity, and vice versa. Note that polymerases, owing to their size and steric constraints, are unable to rotate around the template at the necessary speed (∼300 rpm). As a consequence, regions of positive and negative superhelicity flank the section along which the RNA polymerase moves (twin supercoiled domain model). Within the cell such regions of superhelically over- or underwound DNA are normally relaxed by cellular enzymes (topoisomerases) (Figure 32.5).
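Equation 32.1 can be illustrated with a small numerical sketch. The plasmid size and the twist change below are hypothetical; the point is that with LK fixed, any change in TW must be compensated by WR:

```python
def writhe(lk, tw):
    """Wr = Lk - Tw, following Lk = Tw + Wr with a fixed linking number."""
    return lk - tw

# Hypothetical relaxed 4200 bp closed-circular plasmid, B-DNA with 10.5 bp/turn
lk = round(4200 / 10.5)          # Lk must be an integer: 400
print(writhe(lk, tw=400.0))      # relaxed: Wr = 0.0

# Intercalation locally unwinds the helix and lowers Tw by 10 turns;
# Lk cannot change without strand breakage, so Wr must compensate
print(writhe(lk, tw=390.0))      # Wr = +10.0 (writhe appears)
```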


32.2 DNA-Binding Motifs

Comparative structural analyses of known DNA-binding proteins have led to the classification of characteristic amino acid sequence motifs for the recognition and binding of DNA. The most prominent DNA-binding motifs can be divided into five major classes (Figure 32.6): helix-turn-helix structures, leucine zipper structures, zinc-finger domains, helix-loop-helix domains, and β-sheet structures. Helix-turn-helix structures (HTHs) consist of a section of roughly 20 amino acids in which two α-helices are linked by a short β-turn of approximately four amino acids with an invariant glycine at the second position. The two α-helices are oriented almost perpendicular to each other. The helix closer to the C-terminus is defined as the recognition helix; it fits exactly into the DNA major groove and is responsible for recognition. HTH proteins are ubiquitous in prokaryotes and eukaryotes. In prokaryotes, HTH proteins generally recognize palindromic DNA sequences and therefore normally exist as symmetrical dimers or even-numbered oligomers. Eukaryotic HTH proteins, for instance the homeodomain protein family, bind non-symmetrical DNA sequences as monomers or heterodimers. Some contain additional N-terminal sequences that facilitate binding through interaction with the DNA minor groove. A variant of HTH proteins are those containing winged-HTH domains, in which the recognition motif is extended by a third α-helix with a neighboring β-sheet. This secondary structural element makes additional contacts with the DNA backbone. Zinc-finger proteins exist in many variations and are mainly found in eukaryotes. They are all characterized by the tetrahedral coordination of one or two Zn ions by conserved cysteines or histidines, which stabilizes modular domains of the protein.
In the classic case of Zn-finger transcription factors two antiparallel β-strands, linked via a loop to an α-helix, are held in place by a Zn ion coordinated between two cysteines and two histidines. The DNA contact is made by the α-helix, which recognizes a stretch of three base pairs through the major groove. Zinc-finger proteins often consist of multiple such motifs arranged consecutively, such that they wind helically around the DNA during binding. A special situation is found in Gal4, a yeast transcription factor. Here, two neighboring Zn ions are coordinated by six cysteines, with two of the cysteines bridging both Zn ions (shared ligands). The two Zn ions stabilize the position of two α-helices, which also interact with the DNA major groove.

Leucine zipper proteins are named after their mechanism of dimerization. They exist as homo- or heterodimers and have almost exclusively been described in eukaryotes. They are composed of an α-helical recognition helix linked to a C-terminal dimerization helix. Dimerization is maintained through hydrophobic interactions between two amphipathic dimerization helices, which form a coiled-coil structure. This structure is characterized by pairs of hydrophobic amino acid residues (generally leucines) separated by two α-helical turns (heptad repeat) and therefore lying almost on the same side of the helix. The leucine side chains are arranged like the teeth of a zipper. The DNA interaction is made by the two separate N-terminal domains, which contain positively charged side chains (basic region). These recognition helices form a fork that fits into opposite faces of the DNA major groove.

Helix-loop-helix proteins (HLHs) are related to zipper proteins. They consist of a shorter DNA-binding helix and a longer dimerization α-helix, which are linked by an unstructured loop; the dimerization helices associate into a four-helix bundle.
HLH proteins form homo- or heterodimers similarly to leucine zipper proteins. One α-helix from each of the two monomers binds into the DNA major groove. The binding specificity and affinity can thus be modulated by different protein partners.

β-Sheet proteins use this particular secondary structure as the principal element for DNA binding. A pair of antiparallel β-strands fits into the DNA minor groove. Solved high-resolution structures (e.g., of the TATA-binding protein (TBP)) reveal that the conserved β-sheet structures of two pseudo-identical domains form a saddle-like structure, which fits into the minor groove of the DNA recognition sequence. The aromatic side chains of two conserved phenylalanines at the end of each β-sheet intercalate between two DNA base pairs. This interaction creates a kink, such that the DNA bends away from the binding protein.

Figure 32.6 Schematic depiction of different DNA binding motifs.


Part IV: Nucleic Acid Analytics

32.3 Special Analytical Methods

Several very powerful methods, ranging from technically very simple to very demanding, are presented below. As a detailed introduction to the technical and theoretical requirements exceeds the scope of this chapter, the different methods are only introduced briefly and their applications exemplified.

32.3.1 Filter Binding

One of the earliest methods for the analysis of protein–nucleic acid interactions is the filter binding technique. It relies on the principle that proteins bind to nitrocellulose membranes while nucleic acids, if not too large or complex, pass through the membrane during filtration. In a typical binding experiment a mixture of protein and the putative target nucleic acid (preferably radiolabeled) is filtered through a nitrocellulose membrane. The non-bound nucleic acid is subsequently washed from the filter; protein–nucleic acid complexes are retained on the filter by the protein. The use of radiolabeled nucleic acid allows the amount of complex to be determined by counting the radioactivity on the filter. If adequately performed, a differential determination of the filtrate is also possible. Although filter binding is not a true equilibrium binding method it yields relatively exact quantitative data. Filter binding therefore serves to determine apparent binding constants and, moreover, is suitable for slower kinetic measurements. Owing to the low technical requirements and the fact that the method is simple, fast, and generally applicable, filter binding is still among the frequently used methods. Note, however, that the mechanism of interaction between a given protein and the nitrocellulose membrane is not completely understood. It has been observed that certain proteins are not bound, or lose their binding properties through induced conformational changes.
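As a sketch of how filter-binding data are evaluated: the fraction of labeled nucleic acid retained on the filter at each protein concentration gives a binding curve whose half-saturation point marks the apparent KD. All counts and concentrations below are hypothetical.

```python
def fraction_bound(cpm_filter, cpm_total, cpm_background=0.0):
    """Fraction of labeled nucleic acid retained on the filter."""
    return (cpm_filter - cpm_background) / (cpm_total - cpm_background)

# hypothetical titration: protein concentration (nM) -> cpm retained
total_cpm = 10000.0
titration = {10: 1200, 30: 2900, 100: 5100, 300: 7600, 1000: 9000}

curve = {p: fraction_bound(c, total_cpm, cpm_background=200.0)
         for p, c in titration.items()}
# half-saturation is reached near 100 nM protein -> apparent KD ~ 100 nM
print(round(curve[100], 2))  # 0.5
```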

32.3.2 Gel Electrophoresis

EMSA, electrophoretic mobility shift analysis (in short, mobility shift or gel retardation), describes a method used to separate protein–nucleic acid complexes from the free nucleic acid.

The analysis of protein–nucleic acid complexes by gel electrophoretic methods is, next to filter binding, technically relatively simple and probably the most popular method of all for complex analysis. Today, the terms EMSA (electrophoretic mobility shift analysis) or gel retardation summarize qualitative as well as quantitative procedures for the analysis of protein–nucleic acid complexes. The method is equally suitable for the investigation of DNA or RNA complexes. In fact, gel retardation was initially established for the analysis of complexes between ribosomal proteins and ribosomal RNAs, while today the technique is preferentially used to study the interactions of DNA-binding proteins. The method is based on the observation that binding of a protein to a nucleic acid generally reduces the electrophoretic mobility of the nucleic acid in non-denaturing polyacrylamide or agarose gels. In a typical experiment the protein or proteins under study are incubated with the nucleic acid and the complexes formed are subsequently separated from the free nucleic acid by gel electrophoresis. Visualization of the complex bands usually occurs by autoradiography of the radiolabeled nucleic acid. If lower sensitivity is acceptable (amounts above the nanogram range), non-labeled DNA or RNA can also be detected by staining (fluorescent dyes, ethidium bromide, or toluidine blue). One important advantage of gel retardation for the analysis of protein–nucleic acid interactions is the fact that studies can also be performed with impure protein preparations. Moreover, a binding analysis of several different proteins to the same DNA or RNA molecule is equally possible. Under favorable conditions complexes with different protein stoichiometry can be separated. This is a notable advantage of gel retardation compared to spectroscopic methods.
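To make the femtomole sensitivity quoted for gel retardation concrete: converting a nanogram amount of a short double-stranded fragment to moles, using the standard average mass of ~650 g mol⁻¹ per base pair.

```python
def dsdna_fmol(mass_ng, length_bp, g_per_mol_per_bp=650.0):
    """Convert a mass of double-stranded DNA to femtomoles."""
    mol = (mass_ng * 1e-9) / (length_bp * g_per_mol_per_bp)
    return mol * 1e15  # femtomoles

# e.g. 1 ng of a 260 bp fragment (the fragment length used in Figure 32.7a)
print(round(dsdna_fmol(1.0, 260), 1))  # ~5.9 fmol
```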
The method requires only minute amounts of material and is applicable in the range of nanograms of protein or femtomoles (10⁻¹⁵ mol) of nucleic acid. If pure proteins are available, thermodynamic or kinetic parameters, such as equilibrium binding constants and association or dissociation rate constants, can be determined (Figure 32.7).

Background to Gel Retardation What are the physical principles governing the mobility of a DNA molecule during gel electrophoresis? In a first approximation the migration of a DNA


Figure 32.7 Examples of retardation gels. (a) DNA–protein complex analysis. The concentration-dependent binding of the transcription factor FIS to a 260 bp DNA fragment with the regulatory region of the Escherichia coli rrnD operon is shown. The FIS concentration is increased in 70 nM steps from 0 to 700 nM (lanes 1–11). The different occupation of three independent binding sites (complexes 1–3) is visible. (b) RNA–protein complex analysis. The binding of a regulatory RNA from Escherichia coli (6S RNA) to the bacterial RNA polymerase associated with different sigma factors is shown. The two holoenzymes form different complexes.

molecule during electrophoresis can be described by the following equation:

ν = h²QE / (L²f)    (32.2)

where ν = migration velocity, h = end-to-end distance of the DNA molecule, Q = effective charge, E = electric field strength, L = contour length of the DNA, and f = friction coefficient.
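One way to read Equation 32.2: with charge, field, contour length, and friction held fixed, the velocity scales with the square of the end-to-end distance h, so anything that shortens h, such as a static bend or protein-induced bending, retards migration. The sketch below models a fragment as two straight arms joined at a planar bend; the bend angle is purely illustrative.

```python
import math

def relative_velocity(bend_angle_deg, contour_length=1.0):
    """Velocity relative to a straight fragment, v ~ h^2 (Eq. 32.2),
    for a fragment modeled as two equal straight arms joined at a bend."""
    arm = contour_length / 2
    interior = math.radians(180 - bend_angle_deg)  # angle between the arms
    # law of cosines gives the end-to-end distance h
    h = math.sqrt(2 * arm**2 - 2 * arm**2 * math.cos(interior))
    return (h / contour_length) ** 2

print(round(relative_velocity(0), 2))   # straight DNA: 1.0
print(round(relative_velocity(90), 2))  # 90 degree bend: 0.5, markedly slower
```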

DNA molecules, which are generally long and thin, behave under the conditions of electrophoresis, where they have to pass through a three-dimensional network of pores, in a worm-like fashion. Their motion can be described by a reptation model. The worm-like mobility depends not only on the length of the DNA strand but also on the flexibility (DNA persistence length) and the conformation of the DNA molecules. A static curvature increases the bulkiness of the DNA, and according to the reptation model the worm-like movement of such a molecule through the gel matrix is impeded. Many proteins induce bends or enhance existing DNA curvature when bound to DNA. Hence this effect, and not only the change in mass, must be considered in the interpretation of retardation gels. How much the mobility of a DNA–protein complex is altered with respect to the free DNA depends primarily on the mass ratio of DNA and protein, along with the mobility change induced by an altered conformation of the DNA. This ratio, rather than the absolute masses of protein or DNA, is especially important for the resolution of complexes during gel electrophoresis. In the case of very acidic (negatively charged) proteins it is possible that no retardation of the complex occurs because the increase in mass is compensated by the overall increase in negative charge contributed by the protein. This has for instance been noted for the Trp repressor protein. To obtain the highest possible resolution between complex bands and free DNA the average pore size of the gel should not be much larger than necessary for separation of the samples in question. The pore size of polyacrylamide gels depends directly on the concentrations of acrylamide and crosslinker (bisacrylamide) within the gel. At acrylamide concentrations between 10% and 4% the average pore size lies between 5 and 20 nm, respectively, depending on the concentration of the crosslinker.
In comparison, the tetrameric Lac repressor has approximate dimensions of 3.5 × 3 × 13 nm and a 50 bp DNA fragment approximate dimensions of 2 × 17 nm. For good resolution in comparable cases the acrylamide concentration should not exceed roughly 5%. When choosing the gel system one must also keep in mind that polyacrylamide gels of less than 4% are difficult to handle. For

Electrophoresis, Reptation Model, Chapter 27


the separation of larger proteins or longer DNA molecules it is therefore recommended to use agarose gels. The pore size of agarose gels is generally much larger: in the commonly used range of agarose concentrations (0.5–2%) pore sizes of 700–70 nm, respectively, are observed. For very large molecules (polymerases or very long DNA molecules) agarose gels are thus a good alternative. One important aspect of the gel analysis of DNA–protein complexes is the stability of the complex during separation. Gel retardation studies benefit from an effect for which the term caging has been coined. This effect is based on the observation that complex bands can still be detected even when the time for the electrophoretic separation of the samples exceeds the half-life for dissociation of the complex by orders of magnitude. The caging effect does not mean that electrophoresis affects the dissociation rate constant. The phenomenon rather results from an enhancement of the local concentration of the reacting molecules, because the spatial separation of the partners following dissociation is restricted within the gel matrix. This enhances the (concentration-dependent) re-association. Moreover, the reduced activity of water molecules within polyacrylamide gels plays an important role in the cage effect. As a result of the enhanced association kinetics distinct complex bands can be observed even if the time of electrophoresis exceeds the half-life of complex dissociation several-fold and dissociation may occur several times during separation. All proteins that interact with nucleic acids also undergo non-specific interactions to a certain degree. Non-specific interactions are essentially the result of the different charges of the macromolecules and are therefore generally purely electrostatic. During gel electrophoresis non-specific interactions may give rise to extreme band broadening and impair resolution.
Of course, they also interfere with the determination of binding constants.

Specific versus non-specific binding: For all interactions between proteins and nucleic acids it is possible to distinguish between non-specific and specific interactions. Non-specific interactions are generally purely electrostatic in nature and result from the different charges of proteins and nucleic acids (polyanions). Specific interactions additionally involve H-bonds, hydrophobic interactions (often associated with structural adaptation), stacking interactions between aromatic residues, or directed salt bridges, which contribute to stable binding. Non-specific binding, caused by mere charge interactions, can therefore largely be suppressed by the addition of salt. To suppress non-specific binding during the analysis of protein–nucleic acid complexes it is common, therefore, to add a competitor. Typically, an excess of unrelated DNA or the polyanionic compound heparin is added as competitor substance.

It is important, therefore, to distinguish specific from non-specific interactions or, better, to suppress non-specific binding during analysis. The simplest way is to compensate the surface charges of the interacting partners by adding salt (e.g., NaCl or KCl at around 150 mM) to the reaction mixture; the charges contributing to non-specific binding are then shielded by the ions. Yet the presence of high salt concentrations may affect the efficiency of electrophoretic separations. The general method of choice is the use of an excess of unrelated DNA as competitor substance. As competitor DNA a mixture of chromosomal DNA of different origin, such as calf thymus DNA, may be used. Natural DNA may, as a disadvantage, contain unwanted specific binding sites. The use of a synthetic DNA, such as poly(dI-dC) or poly(dA-dT), avoids such potential complications. Heparin, a polyanionic compound (a mucopolysaccharide with sulfate groups), has proven to be of general value as competitor substance. There are no fixed rules predicting the exact amount of competitor required for a particular gel retardation experiment. The appropriate concentration has to be determined in prior titration experiments for each binding partner. A good starting point is between 20 and 150 ng μl⁻¹ competitor in the reaction mixture. Another rule of thumb suggests a roughly 200-fold molar excess of competitor with respect to the target DNA. Note, however, that at high competitor concentrations not only unbound protein is captured: high competitor or heparin concentrations may also actively induce the dissociation of complexes. For binding interactions defined by strongly different on- and off-rates the order of addition of the competitor must also be considered.
The choice of a suitable competitor, the time of addition, and the optimal adjustment of its concentration are therefore crucial steps for any gel retardation experiment!
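The rules of thumb above (20–150 ng μl⁻¹ competitor, or a ~200-fold molar excess over the target DNA) can be turned into a small planning helper; the concrete numbers below are only starting points to be refined by titration, as the text emphasizes.

```python
def competitor_ng(reaction_volume_ul, ng_per_ul=50.0):
    """Competitor mass for a reaction, within the 20-150 ng/ul window."""
    if not 20.0 <= ng_per_ul <= 150.0:
        raise ValueError("outside the recommended starting range")
    return reaction_volume_ul * ng_per_ul

def competitor_fmol(target_fmol, molar_excess=200.0):
    """Competitor amount from the ~200-fold molar-excess rule."""
    return target_fmol * molar_excess

print(competitor_ng(20))     # 1000.0 ng for a 20 ul reaction at 50 ng/ul
print(competitor_fmol(5.0))  # 1000.0 fmol for 5 fmol of target DNA
```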


32.3.3 Determination of Dissociation Constants

In a reaction where a protein–DNA complex is in equilibrium between binding and dissociation the equilibrium is described by the equilibrium constant K. The value of K is given by the ratio of the concentration of the complex to the product of the concentrations of unbound protein and free DNA at equilibrium. The inverse measure 1/K is termed the dissociation constant KD:

KD = [P][D] / [PD]    (32.3)

where P = protein and D = DNA-binding site. To determine dissociation constants one selects a low DNA concentration (smaller than the expected dissociation constant) and adds increasing concentrations of protein to the reaction. If [D] ≪ KD it follows that [P]free ≈ [P]total, leading to:

KD = [P]total[D] / [PD]    (32.4)

Under these conditions the protein concentration at which free DNA and DNA–protein complex are present in equal amounts ([D] = [PD]) equals the dissociation constant.

For a quantitative determination of complex dissociation constants one normally measures the concentration of protein required to bind half of the DNA present. This is best done in a pilot experiment in which the protein concentration is varied over several orders of magnitude. The range of the dissociation constant can be taken from the protein concentration that yields half-saturation of the complex. To obtain a more accurate value of KD than the visually estimated half-saturation, the precise amounts of free DNA and DNA–protein complex are determined by a preferably exact method of quantification. Typically, the amount of free DNA is plotted against the logarithm of the protein concentration (Bjerrum plot). Provided that complexes can be resolved sufficiently and quantified by exact densitometric methods (autoradiography or phosphoimaging), binding constants can be obtained with relatively high precision. For a correct measurement of the binding or dissociation constant it is important that the ratio of DNA to protein concentration under conditions of half-saturation is small (0.01–0.1). The quantitative evaluation can be based either on the intensity increase of the complex or on the decrease in concentration of the free DNA. Both methods yield consistent results if the separation technique (gel electrophoresis) itself has no effect on complex formation. In the case of unstable complexes or poorly resolved complex bands, which impede an exact quantification, measurement of the free DNA concentration is often the only possible way to determine the KD value. Importantly, dissociation constants measured by gel retardation are not true equilibrium determinations; only apparent constants are obtained, which generally depend on the gel electrophoretic separation conditions (e.g., temperature, competitor concentration, buffer conditions).
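The half-saturation read-out described above follows from the simple binding isotherm obtained when [D] ≪ KD: the fraction of DNA in the complex is [P]/([P] + KD), which equals 0.5 exactly at [P] = KD. A minimal sketch with a hypothetical KD:

```python
def fraction_complex(p_total, kd):
    """[PD]/[D]_total from Eq. 32.4, valid for [D] << KD."""
    return p_total / (p_total + kd)

kd_true = 50.0  # nM, hypothetical
protein_nm = [5, 15, 50, 150, 500]
theta = [round(fraction_complex(p, kd_true), 2) for p in protein_nm]
print(theta)  # [0.09, 0.23, 0.5, 0.75, 0.91] -> half-saturation at [P] = KD
```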
To obtain exact binding or dissociation constants the KD values obtained should be verified by true equilibrium measurements. Gel retardation also provides a method for determining the kinetic parameters of a binding reaction. To measure association rate constants, DNA and protein samples are mixed and aliquots of the mixture are separated at short time intervals by gel electrophoresis (samples may be loaded on a running gel!). The association reaction is stopped by virtue of the instant separation of free DNA. The amounts of free DNA or formed complexes are determined by densitometry as outlined above. In this way the increase in complex concentration over time (rate of complex formation) is obtained. If dissociation rate constants are required the experiment starts with pre-formed complexes at equilibrium. Dissociation is induced by rapid dilution of the sample or by addition of excess competitor DNA (quenching). To ensure that dissociated proteins are effectively captured, the competitor concentration must be clearly larger than the protein concentration and is generally used in excess. Aliquots of the sample are separated by gel electrophoresis immediately after the quenching reaction. Of course, measurements of kinetic constants by gel retardation are limited by the time scale of sample manipulation and are therefore hardly feasible in the range below a few seconds.
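The dissociation experiment just described (quench with excess competitor, sample aliquots over time) yields a first-order decay of the complex, so k_off follows from a ln-linear fit of the band intensities versus time. The data below are simulated for a hypothetical k_off:

```python
import math

k_off_true = 0.02                  # s^-1, hypothetical
times = [0, 30, 60, 120, 240]      # sampling times in seconds
complex_frac = [math.exp(-k_off_true * t) for t in times]  # densitometry

# least-squares slope of ln(fraction) versus time equals -k_off
n = len(times)
mean_t = sum(times) / n
mean_y = sum(math.log(y) for y in complex_frac) / n
slope = (sum((t - mean_t) * (math.log(y) - mean_y)
             for t, y in zip(times, complex_frac))
         / sum((t - mean_t) ** 2 for t in times))
k_off = -slope
half_life = math.log(2) / k_off
print(round(k_off, 3), round(half_life))  # 0.02 s^-1, half-life ~35 s
```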


The stoichiometry within a formed DNA–protein complex can be determined in several ways. Double labeling of DNA and protein is generally applicable. In the case of radiolabeling the protein is labeled with [3H] and the DNA usually with [32P]. The maximal energies of the radioactive decay spectra of the two isotopes are sufficiently different to allow parallel scintillation measurements. An exact stoichiometry of the complex partners can be derived if the specific activities of both DNA and protein are known. Alternatively, with somewhat lower sensitivity, non-labeled proteins can be quantified by quantitative Western blotting of a retardation gel with known amounts of DNA, or by protein staining of the complex band with Coomassie blue.

The gel retardation method also enables the identification of unknown DNA-binding proteins that interact with a specific target DNA (in cases in which cell extracts or mixtures of proteins are used). Such analyses require scaling up the amount of DNA used for complex formation compared with radiolabeled gel retardation assays, because in gel retardation the amount of DNA generally limits the amount of complex formed. For a DNA fragment of roughly 200 bp and a dissociation constant in the micromolar range about 0.5 μg of DNA is required. The required amount of protein depends on its DNA-binding affinity; for a KD of 10⁻⁶ M it should be employed in the microgram range. The non-labeled DNA and the DNA–protein complex are briefly stained with ethidium bromide. The stained complex band can be excised and the included protein separated in a second electrophoretic step on a discontinuous SDS gel. If the protein is unknown the stained protein band can be excised from the SDS gel, digested with suitable proteases, and identified by mass spectrometry (MALDI-TOF analysis).
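The double-labeling arithmetic described at the start of this section is straightforward: with known specific activities, the 3H and 32P counts measured in the excised complex band convert to moles of protein and DNA, and their ratio is the stoichiometry. All counts and specific activities below are hypothetical.

```python
def pmol_from_cpm(cpm, specific_activity_cpm_per_pmol):
    """Convert measured counts to picomoles via the specific activity."""
    return cpm / specific_activity_cpm_per_pmol

protein_pmol = pmol_from_cpm(8000, 2000)  # 3H channel, hypothetical values
dna_pmol = pmol_from_cpm(6000, 3000)      # 32P channel, hypothetical values

stoichiometry = protein_pmol / dna_pmol
print(stoichiometry)  # 2.0 -> two protein molecules per DNA
```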
With known proteins, for which antisera might be available, the retarded complex band can be blotted on a membrane and the protein identified subsequently by Western analysis. Often, however, DNA–protein complexes are difficult to transfer to membranes and in unfavorable cases excess free protein migrates at the position of the complex band. In such cases it is sometimes possible to obtain a supershift on the retardation gel by the addition of an antibody, which specifically interacts with the binding protein. A supershift occurs because the complex is additionally retarded by the molecular weight of the antibody. In rare cases it is possible, however, that the antibody obscures complex formation or causes dissociation. Indirectly, this can be taken as an indication for the identity of a protein.

32.3.4 Analysis of DNA–Protein Complex Dynamics

TGGE: temperature-gradient gel electrophoresis; a gel electrophoretic method used to study changes in the gel electrophoretic mobility of nucleic acids or protein–nucleic acid complexes resulting from temperature-dependent conformational changes of the sample.

The thermal stability of protein–nucleic acid complexes can ideally be studied by temperature-gradient gel electrophoresis (TGGE). In this method a horizontal gel chamber is used in which a linear temperature gradient is applied perpendicular to the direction of separation. The method has great potential for studying temperature-induced structural transitions that change the electrophoretic mobility of the sample to be separated. Along with the dissociation of complexes, this of course includes conformational changes of the nucleic acid (or protein) itself. The method naturally has a large field of application for RNA molecules and RNA–protein complexes. The temperature range in which DNA–protein complexes dissociate generally lies below the DNA melting point of roughly 70 °C. Structural transitions involving DNA curvature or protein-induced DNA bending generally occur below 50 °C, considerably below the melting point of DNA. If the binding of a particular protein depends on DNA curvature this can elegantly be studied within this temperature range by TGGE analysis (Figure 32.8). If the DNA curvature is caused by A:T clusters arranged in helical phase, the curvature-dependent DNA binding can alternatively be studied using substances that influence DNA conformation. One such substance is the antibiotic distamycin. This oligopeptide binds specifically to the minor groove of DNA bent by A:T clusters and abrogates existing curvature. Distamycin-dependent dissociation of DNA–protein complexes indicates that the binding mechanism involves DNA curvature. The removal of curvature can also be demonstrated directly with the free DNA by TGGE or with distamycin, because the loss of curvature enhances the electrophoretic mobility, which in most cases is easily recognized by gel electrophoresis.
Such studies are especially suitable to analyze the effect of proteins on DNA conformation.


Figure 32.8 Examples of conformational analyses of nucleic acids by gel electrophoresis. (a) Temperature-dependent conformational change of RNA (Escherichia coli 6S RNA) analyzed by temperature-gradient gel electrophoresis (TGGE); a 7.5% polyacrylamide gel is shown, with a temperature gradient between 38 and 48 °C applied perpendicular to the direction of electrophoresis. A clear conformational transition of 6S RNA is evident at 43 °C. A 400 bp DNA fragment was loaded as control. (b) Conformational change of a curved DNA on a native polyacrylamide gel. Shown is a 260 bp DNA fragment (containing the regulatory region of a bacterial rRNA promoter) with intrinsic curvature. The DNA curvature caused by A:T clusters of the sequence is abrogated in the presence of distamycin, which results in an increased gel mobility. (c) TGGE analysis of the thermal stability of a DNA–protein complex. The left-hand panel shows the separation of a 260 bp DNA fragment with the regulatory region of a bacterial rRNA promoter (P1 DNA) with a temperature gradient of 20 to 60 °C. The right-hand panel shows a separation of the same DNA bound to the DNA-binding protein H-NS under the same conditions. It can be seen that the complex disintegrates at 55.5 °C.

32.4 DNA Footprint Analysis

The term footprint analysis was coined for the characterization of the interacting domains of a protein with its target DNA; it indicates that something like an “imprint” of the protein on the surface of the DNA is determined. Footprint analyses are generally performed in combination with gel electrophoresis. They are based on determining the accessibility of the nucleic acid towards nucleases or modifying reagents, which differs in the absence or presence of a bound protein. Conversely, the accessibility of a protein towards proteases or modifying reagents can also be determined. Positions at which protein and nucleic acid are in contact are naturally less accessible, resulting in reduced enzymatic hydrolysis or chemical modification in this region. Enhanced signals, which may occur as well, are generally interpreted as a conformational change of the binding partners. DNA fragments derived from enzymatic hydrolysis or cleavage at the modified positions are separated at nucleotide resolution by denaturing gel electrophoresis. A separate DNA sample, hydrolyzed or modified under the same conditions but in the absence of the protein, is separated in parallel as control. The difference in band intensities between the two samples indicates the positions where the binding protein was localized. Yet footprint analyses can provide many more details of a DNA–protein complex than just the sequence region where the protein was bound. Special footprint techniques may yield information about the mode of interaction in addition to indicating the borders of protein contact. Conclusions about which bases or chemical groups of the nucleic acid are involved in binding, whether the protein binds in the minor or major DNA groove, whether binding induces conformational changes, or whether the DNA is locally opened into single strands during binding can all be inferred from footprint data (Figure 32.9).
Apart from such structural information, footprint analyses are suitable for following dynamic processes, such as conformational changes, and they can be used to measure binding constants. One advantage of determining binding constants by footprinting rather than gel retardation is that footprint analyses can be performed under true equilibrium conditions

In molecular biology a footprint denotes different accessibilities of protein– nucleic acid complexes towards chemical reagents or limited enzymatic hydrolysis, usually visualized by gel electrophoresis. It serves to localize or to narrow down binding sites between the protein and the nucleic acid.


Figure 32.9 (a) Example of a DNase I footprint. The figure shows a section of a primer extension analysis following limited DNase I hydrolysis of the upstream regulatory regions of three rRNA promoters from Escherichia coli (rrnA, rrnB, and rrnC) in the presence (+) or absence (−) of the binding protein FIS. Arrows indicate enhanced reactivity (conformational change), open bars mark protected regions. Sequencing lanes are indicated by A, C, G, and T and the sequence positions relative to the transcription start site are shown. (b) Example of a DMS footprint. Results from a primer extension analysis of a DMS footprint of the rrnB operon regulatory region are shown. Nucleotide positions and sequencing lanes are indicated. The footprint reaction was performed in the presence (1) and absence (0) of FIS (left-hand side) or presence (1) and absence (0) of H-NS (right-hand side), respectively. Blue arrows indicate protections, black arrows enhanced accessibilities. The lower part of the figure summarizes the reactivity differences in a DNA helix


because protein and nucleic acids are not separated during the reaction. Of course, to solve these specific problems special footprint techniques are required, which will be described below.

32.4.1 DNA Labeling

The simplest and most direct method for visualizing footprint bands involves radioactive labeling of the DNA before complex formation and hydrolysis. The resulting fragments of different length are subsequently separated by denaturing gel electrophoresis and their positions on the gel visualized by autoradiography or with a phosphoimager. Radioactive labeling of the DNA is performed at only one end of the molecule. This is important because labeling multiple positions results in more than one labeled product band after hydrolysis, which obscures identification. Two principal methods for labeling DNA ends can be used: end-labeling at either the 5´ or the 3´ end of each strand. Polynucleotide kinase and γ-[32P]-ATP as substrate are used for 5´ labeling. If the 5´ ends are already phosphorylated, as is the case after restriction enzyme hydrolysis, the 5´ phosphate group must first be removed by a phosphatase. Alternatively, the kinase reaction may be performed under “exchange” conditions. If DNA fragments generated by PCR are used, the 5´ ends are normally not phosphorylated unless phosphorylated primers were employed. Radioactive labeling at the 3´ end is best done by a “fill-in” reaction with a DNA polymerase, provided that the DNA has a 5´ overhanging end; this can be achieved by the use of appropriate restriction enzymes. Suitable enzymes for the “fill-in” reaction are either T4 DNA polymerase or the Klenow fragment of DNA polymerase I, together with the suitable α-[32P]-dNTP. Both the 5´ kinase reaction and the 3´ “fill-in” reaction can in principle occur on each of the two antiparallel DNA strands. To avoid such double labeling, which obscures the assignment of hydrolytic fragments, a strategy has to be followed that either limits the labeling reaction to one strand or removes the second label by cleavage with a restriction enzyme cutting once close to the distal DNA end.

Radioactive Labeling, Labeling Positions, Section 28.3.1 Enzymatic Labeling, Section 28.3.2

32.4.2 Primer Extension Reaction for DNA Analysis

Bands resulting from footprint reactions can be visualized indirectly, without prior radiolabeling of the DNA. A variant of the primer extension reaction is used for this indirect analysis. The principle of the primer extension reaction for DNA corresponds to the reaction frequently used for the analysis of RNA molecules. Primer extension describes a method in which a nucleotide sequence (RNA or DNA) is transcribed into a complementary DNA sequence (cDNA), starting from a DNA oligonucleotide (primer). The primer oligonucleotide is chosen such that it specifically binds to a complementary target sequence located 3´ of the sequence to be transcribed. The primer is subsequently extended in the presence of deoxyribonucleoside triphosphates by Klenow DNA polymerase (for DNA primer extension) or by reverse transcriptase (for primer extension of RNA). The extension reaction stops either at the 5´ end of the target nucleic acid or at bases modified by the footprint reaction in a way that blocks elongation (chemical footprint).

During the primer extension of RNA a cDNA is created by reverse transcription of an RNA template. In contrast, for the primer extension method of DNA, a primer oligonucleotide selecting a DNA strand for the copying reaction is used and instead of reverse



(c) Example of a hydroxyl radical footprint analysis. The analysis of the binding of Epstein–Barr nuclear antigen (EBNA) to a synthetic consensus DNA sequence is shown. On the left-hand side the autoradiogram of hydroxyl radical cleavage products of the upper (lanes 1, 2) and the lower strand (lanes 3, 4) are shown. Lanes 2 and 4 reflect the presence of the binding protein. In the middle is a densitometric evaluation of the autoradiogram. Lanes 1 and 2 indicate the upper, lanes 3 and 4 the lower strand. Lanes 2 and 4 reflect the presence of the binding protein. The scheme on the right-hand side depicts helical B-form DNA with the upper strand (light) and the lower strand (dark). Protected regions are marked by dots. (d) Example of a KMnO4 footprint analysis for identifying single-stranded DNA regions. The figure shows the primer extension analysis of the KMnO4-reacted template strand DNA of a bacterial promoter (upper panel) and the same analysis of the non-template strand (lower panel). K and K+ mark control lanes without binding protein before and after KMnO4 reaction, respectively. Lanes 0 and 300 indicate the absence or presence, respectively, of RNA polymerase during the reaction, which renders the promoter region into a single-stranded “transcription bubble” giving access to the nucleotides for modification. Source: Part (c), Kimball, A.S., Milman, G., and Tullius, T.D. (1989) Mol. Cell Biol., 9, 2738–2742. With permission, Copyright © 1989, American Society for Microbiology. Part (d), Jöres, L. and Wagner, R. (2003) J. Biol. Chem., 278, 16834–16843. With permission, Copyright © 2003, The American Society for Biochemistry and Molecular Biology, Inc.


Part IV: Nucleic Acid Analytics

transcriptase a DNA polymerase serves as the enzyme. In that case, the DNA fragments obtained by limited hydrolysis or chemical modification in the footprint reaction are directly used as templates for the synthesis of a complementary DNA strand by the Klenow fragment of DNA polymerase I. The polymerization reaction is started by a primer oligonucleotide, which is complementary to a sequence at or close to the 3´ side of the region to be analyzed. Information can be obtained for both strands if appropriate primers are selected that are complementary to either the upper or the lower strand. Primers are always extended in the 5´ to 3´ direction, the template strand being read 3´ to 5´. The synthesis stops at the 5´ ends of the DNA fragments, and DNA strands complementary to the fragments resulting from the footprint reaction are generated. Many of the chemical reagents for DNA footprint analysis alter the nucleic acid bases in such a way that the primer extension reaction is aborted. Hence, such fragments can also be used directly for primer extension analysis, without the need for prior hydrolysis of the DNA strand. To visualize the newly formed DNA fragments, either the primer is radiolabeled at its 5´ end by a kinase reaction or the extension reaction is carried out in the presence of a suitable α-[32P]-dNTP. Because the primer extension reaction works with non-labeled DNA, and because the sequence region as well as the particular strand can be selected by the primer, footprint reactions can be analyzed on DNA molecules that would normally be much too large to be separated on sequencing gels. The gel will only show the labeled products of the primer extension reaction, which can be chosen such that they fall within the optimal resolution range of sequencing gels. As a further advantage of primer extension analysis, circular and superhelical DNA molecules can also be studied, which enables the analysis of DNA–protein interactions of different DNA topoisomers under conditions close to the in vivo situation.
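The read-out of a primer extension footprint can be sketched numerically (a hypothetical Python illustration with invented coordinates; for simplicity it assumes that elongation stops just before a modified template base):

```python
# Hypothetical sketch of primer-extension read-out (positions invented).
# Template positions are numbered 5'->3'. The primer anneals immediately 3'
# of position primer_start, so synthesis copies positions primer_start,
# primer_start-1, ... toward the template 5' end.

def extension_products(primer_start, template_5prime_end, modified_positions):
    """cDNA product lengths (copied nt) when elongation aborts at modified bases.

    Assumes the polymerase stops just before each modified base; the run-off
    product corresponds to copying up to the template 5' end.
    """
    stops = [p for p in modified_positions if p < primer_start]
    lengths = [primer_start - p for p in stops]               # aborted products
    lengths.append(primer_start - template_5prime_end + 1)    # run-off product
    return sorted(lengths)

# First copied template position 120, template 5' end at position 1,
# modifications (e.g. KMnO4-oxidized pyrimidines) at positions 40 and 75:
print(extension_products(120, 1, [40, 75]))  # [45, 80, 120]
```

Each aborted product length maps back to one modified template position, which is how the sequencing-gel bands are assigned.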

32.4.3 Hydrolysis Methods

To obtain information by limited enzymatic digestion, preferably over an entire DNA molecule, sequence-independent DNases are employed. DNases are classified as endonucleases, which can cut anywhere within the molecule, or exonucleases, which digest DNA only from either the 5´ or the 3´ end. Sequence-specific restriction enzymes (endonucleases), which are frequently used in molecular biology, are of little use for the analysis of protein–DNA complexes by limited hydrolysis because they cut at only a few specific sites. A basic prerequisite for obtaining information by enzymatic hydrolysis over the entire length of a particular DNA molecule is that the hydrolysis must be limited. In principle, each accessible nucleotide position within the sample should statistically be cleaved, but each DNA molecule should be cleaved (hit) only once. Such a situation is characterized as a single-hit condition. For a DNA molecule of limited length, as generally used in footprint analyses, single-hit conditions are fairly easily achieved, because the probability of hitting a molecule that has not been hit before is much greater than that of hitting the same molecule twice. It can be shown for DNA molecules between 50 and 200 bp that each molecule is statistically cleaved only once if about 70% of the starting molecules remain non-cleaved. Hence, if the limited hydrolysis reaction is performed under conditions that leave about 70% of the starting material non-cleaved, one can assume single-hit conditions with a statistical distribution of the single cuts. Whether such conditions are met in a particular experiment can easily be inferred when about 70% of the input DNA migrates as non-cleaved material on the gel, at the same position as the non-treated DNA sample.

DNase I

The nuclease used most frequently for mapping protein-binding sites within a DNA molecule is DNase I. This endonuclease cleaves DNA in a largely sequence-independent manner (there is a slight preference for purine–pyrimidine sequences). On the other hand, it has some structural limitations in the hydrolysis of double-stranded B-form DNA. The enzyme fits specifically into the minor groove of B-form DNA and cleaves the phosphodiester bond such that the phosphate remains at the 5´ end of one of the cleaved strands; the other end is left with a free 3´-OH group. Owing to the dimensions of the enzyme, a minor groove width of about 13 Å is optimal for the hydrolysis reaction; any width smaller or larger than this reduces the cleavage efficiency. Because of this steric requirement, the hydrolysis efficiency is reduced wherever the DNA conformation narrows the minor groove. This is the case for curved DNA, for instance. The minor groove at the inside radius of the curvature is

32 Protein–Nucleic Acid Interactions

only poorly cleaved by DNase I. At sequence regions with regular A:T clusters, as well as at some G:C-rich sequences causing an intrinsic curvature, the minor groove at the inside of the curvature is so narrow that cleavage by DNase I is sterically hindered because of the enzyme's dimensions. This is why, during limited DNase I hydrolysis, sections of strong cleavage alternate with regions of weak hydrolysis signals; DNase I therefore does not create a completely regular hydrolysis pattern along the entire length of natural DNA. DNase I is very efficient, which means that only very small amounts of the enzyme and correspondingly short reaction times are required for a limited hydrolysis reaction. The temperature may also be kept low to avoid too extensive a hydrolysis. The enzyme requires Mg2+ ions for its activity. This fact can be used to control the hydrolysis reaction, which can be started precisely by the addition of Mg2+ ions after all other necessary compounds have been added to the reaction mixture (this method of activating DNase I is only possible if the complex formation between DNA and protein does not itself depend on the presence of Mg2+ ions). Rapid addition of excess EDTA, which sequesters all the required Mg2+ ions, immediately ends the hydrolysis reaction. The enzyme can then be removed by phenol extraction. DNase I can also perform hydrolysis reactions within a gel matrix. This fact can be used to perform footprint reactions on complexes still present in the retardation gels used for their separation (in-gel footprint). Complexes separated from the unbound DNA thus do not need to be extracted from the gel, a step that often results in dissociation. For the in-gel hydrolysis, the complex band is conveniently cut from the retardation gel and a DNase I solution in the presence of EDTA is allowed to soak into the gel piece (this is a slow process and requires about 15–30 min for a gel piece 1–2 mm in size).
The reaction is then started and stopped by the addition of Mg2+ ions or EDTA-containing buffer, respectively (the ions diffuse rather rapidly into the gel). The hydrolyzed DNA fragments can subsequently be recovered from the gel by extraction in the presence of phenol. After a precipitation step the sample may be analyzed directly on a sequencing gel.

Exonuclease III

Along with DNase I, several exonucleases are suitable enzymes for footprint reactions. Exonuclease III (Exo III), as indicated by its name, hydrolyzes DNA only from its 3´ end. The hydrolysis starts at the 3´ end and progresses in the 3´ to 5´ direction; hence the two antiparallel strands of a DNA fragment are digested in opposite physical directions (each toward its own 5´ end). The preferred substrate for Exo III is double-stranded DNA, which under conditions of complete hydrolysis ideally yields two single-stranded fragments of about half the length of the original DNA, each containing an original 5´ end. Exo III hydrolysis stalls at sites where a protein is stably bound to the DNA. Characterization of the DNA fragments remaining after hydrolysis allows definition of the DNA borders protected by the interacting protein. For the analysis, only one DNA strand is labeled at its 5´ end. In contrast to DNase I footprint reactions, Exo III reactions are not performed under limiting conditions; rather, as many DNA molecules as possible should be digested until the enzyme reaches the protein border. In a parallel experiment, in which the opposite strand is labeled, the protein border on the other strand can be mapped, completing the information on the protein binding site. Some problems with Exo III footprints may be encountered if the protein binds extremely weakly and is displaced by the nuclease. Certain stable DNA secondary structures may also disturb the analysis, and in cases where a protein binds to multiple sites on the DNA only the site closest to the 3´ end is detected.
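The two-strand border mapping with Exo III can be sketched numerically (a hypothetical Python illustration; all coordinates are invented):

```python
# Hypothetical Exo III read-out (coordinates invented). Each strand is
# 5'-labeled in a separate experiment; Exo III chews from the 3' end until it
# stalls at the bound protein, so the length of the surviving labeled fragment
# marks the protein border on that strand.

def binding_site(top_stop_len, bottom_stop_len, duplex_len):
    """Map Exo III stall lengths from both strands onto top-strand coordinates.

    top_stop_len:    protected fragment length, top strand (5' label at pos 1)
    bottom_stop_len: protected fragment length, bottom strand (its 5' label
                     sits at the opposite end of the duplex)
    Returns (left_border, right_border) of the protein on the top strand.
    """
    right = top_stop_len                       # top-strand stall position
    left = duplex_len - bottom_stop_len + 1    # bottom-strand stall, remapped
    return left, right

# 200 bp duplex; top strand protected to 140 nt, bottom strand to 90 nt:
print(binding_site(140, 90, 200))  # (111, 140)
```

Combining both experiments thus delimits a protected region of about 30 bp in this invented example.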
λ-Exonuclease

Another exonuclease, with the complementary direction of hydrolysis, is λ exonuclease. This enzyme degrades DNA from its 5´ end, which makes it a complementary tool to Exo III. For footprint analyses the DNA must, of course, be labeled at the 3´ ends.

Exonuclease Activity of DNA Polymerases

To map the borders of DNA-binding proteins, the 3´–5´ exonuclease activities of the proofreading-competent T4 or T7 phage DNA polymerases may also be employed. In reactions under footprint conditions all substrate dNTPs are omitted to favor the exonuclease over the polymerization activity. In rare cases, where single-stranded DNA is to be analyzed, the highly single-strand-specific nuclease S1 can be used. For S1 mapping reactions, as for DNase I reactions, strict single-hit conditions are required. Nuclease S1 cuts DNA in a sequence-independent manner and, because the specificity of the enzyme is not limited to DNA, S1 is also of value for the structural analysis of RNA molecules.
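The single-hit criterion required for DNase I and S1 reactions follows Poisson statistics and can be checked with a short calculation (a sketch assuming a Poisson distribution of cuts per molecule; the 70% figure is the one quoted in the text):

```python
# Single-hit statistics sketch: if a fraction f0 of the input DNA remains
# uncleaved, the mean number of cuts per molecule is lambda = -ln(f0).
import math

def single_hit_stats(uncut_fraction):
    lam = -math.log(uncut_fraction)      # mean cuts per molecule
    p0 = math.exp(-lam)                  # P(0 cuts) = uncut fraction
    p1 = lam * math.exp(-lam)            # P(exactly one cut)
    multi = 1.0 - p0 - p1                # P(two or more cuts)
    # among the cleaved molecules, fraction carrying more than one cut:
    multi_among_cut = multi / (1.0 - p0)
    return lam, p1, multi_among_cut

lam, p1, bad = single_hit_stats(0.70)
print(f"mean cuts per molecule:          {lam:.3f}")  # 0.357
print(f"singly cut fraction:             {p1:.3f}")   # 0.250
print(f"multiply cut among cleaved DNA:  {bad:.3f}")  # 0.168
```

At 70% uncleaved material, only about one in six cleaved molecules carries more than one cut, which is why this operating point is taken as a practical single-hit condition.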


32.4.4 Chemical Reagents for the Modification of DNA–Protein Complexes

Along with limited enzymatic hydrolysis, the use of nucleic acid-modifying chemical reagents is a well-established method for the characterization of DNA–protein complexes. It has the advantage that, owing to the small dimensions of most reagents, they can penetrate closer to the interacting surface of the protein–nucleic acid complexes under study, which results in a higher resolution of the footprint data. Moreover, the distinct specificity of reacting only with certain functional groups, such as H-bond donor or acceptor groups within the different DNA grooves, often enables detailed conclusions to be drawn about the mechanism of protein–DNA interaction. For most of the chemical reagents established for footprint analyses it must be considered that they will also react to some extent with functional groups of amino acid side chains. It must therefore be verified that the protein does not lose its binding activity or specificity during the reaction; hence, a gel retardation control is advisable in all cases.

Dimethyl Sulfate (DMS)

A very versatile reagent, proven in the analysis of many different complexes, is dimethyl sulfate (DMS). DMS methylates the N7 position of guanine and the N3 position of adenine in double-stranded B-form DNA. The reagent is suitable, therefore, for mapping protein contacts within the major groove of DNA at G positions, and within the minor groove at A positions. Moreover, DMS reacts with the N1 position of adenines and, somewhat more weakly, with the N3 position of cytosines in single-stranded DNA, which is present, for instance, in RNA polymerase–DNA complexes as a result of the melting activity during transcription. In double-stranded DNA these positions are generally not reactive because they are involved in Watson–Crick base-pairing interactions.
To identify chemical modification sites on simple sequencing gels, one of the following conditions must be met: the chemical reaction must either result in a direct strand break or destabilize the DNA chain in such a way that a selective cleavage can be induced in a second reaction. Alternatively, identification of the modification site is possible if the chemically modified base aborts the copying reaction of the DNA polymerase during primer extension analysis. Both conditions apply to DMS, as well as to many other chemical reagents. The identification of guanine methylation by DMS can also be carried out indirectly by a subsequent cleavage of the DNA chain with piperidine. The N7 methyl group resulting from the DMS modification confers a positive charge on the imidazole ring of the guanine base, which is eliminated by the piperidine reaction, followed by a strand break. To assign the DNA fragments after cleavage on denaturing sequencing gels, a non-treated DNA sample is separated in parallel after a Maxam–Gilbert sequencing reaction. In the case of a well-known sequence it often suffices to perform only the G + A Maxam–Gilbert reaction to enable a correct assignment. DMS footprint reactions can alternatively be carried out with non-labeled DNA molecules when a primer extension reaction is used for the identification of modified sites. This type of analysis takes advantage of the fact that the N7-methylated base acts as a block for the DNA-copying enzyme (e.g., the Klenow fragment of DNA polymerase I). DMS footprints can therefore also be carried out with circular superhelical plasmids. Because DMS penetrates membranes, it is also possible to use the technique for in vivo footprint studies.

KMnO4 and OsO4

The two chemicals KMnO4 and OsO4 are able to penetrate membranes and cell walls and are therefore also suitable as in vivo footprinting reagents.
Both reagents share the same specificity, but owing to its lower toxicity KMnO4 is used much more frequently. The reagents oxidize the 5,6-double bond of thymine and, with somewhat lower reactivity, that of cytosine within single-stranded DNA. The accessibility of the 5,6-double bonds of pyrimidines is significantly impeded in double-stranded DNA owing to strong base stacking. The reaction is thus sensitive to changes in DNA conformation, such as untwisting, but also to transitions between B- and Z-form DNA or the formation of cruciform structures. The two reagents are especially helpful for the analysis of protein binding associated with melting of the DNA double strand or severe changes in the helix geometry. KMnO4 footprints are therefore especially practical for the analysis of RNA polymerase–DNA promoter interactions, which usually result in a local melting of the DNA double strand. Because the oxidation of the 5,6-double bond of


pyrimidines by KMnO4 or OsO4 generates a block for DNA polymerases, the modified nucleotides can easily be identified by primer extension analysis.

Diethyl Pyrocarbonate (DEPC) and Halo-acetaldehydes

DEPC and the halo-acetaldehydes, iodo- or bromo-acetaldehyde, are primarily used as probes for structural changes of the DNA conformation. Generally, these reagents do not react with B-form DNA, but they become DNA-reactive after the transition into Z-DNA, the formation of cruciform structures, or melting into single strands. DEPC carbethoxylates the N7 position of guanine and adenine as well as the NH2 group at C6 of adenine. The halo-acetaldehydes react with non-paired cytosines, adenines, and guanines. Modified DNA strands can be cleaved with piperidine. In the case of halo-acetaldehyde modification, a prior treatment with DMS (to analyze non-paired adenines or cytosines), formic acid (for non-paired cytosines), or hydrazine (for non-paired adenines and guanines) is required. This rather high level of effort has limited the broad use of this technology in recent times.

N-Ethyl-N-nitrosourea (ENU)

Apart from reacting with some bases, ENU mainly forms ethyl esters with DNA phosphate groups. It is a straightforward probe for analyzing the proximity of proteins to the DNA phosphate backbone. The reaction conditions (50 °C, 50% ethanol), however, are incompatible with most protein complexes. The reagent is therefore mainly used for interference studies (see below). The hydroxyl radical footprint method, which likewise senses changes at the sugar-phosphate backbone, has largely replaced ENU modification in recent times.

Hydroxyl Radical Footprint

Their high resolution and sequence-independent reaction make hydroxyl radicals, which are probably the smallest chemical footprinting reagents, a very versatile and important tool. The size of hydroxyl radicals is comparable to that of H2O molecules, which enables a sequence-independent reaction with any nucleotide position of DNA or RNA strands.
Presumably, hydroxyl radicals attack the C1´ or C4´ position of the deoxyribose by oxidation and ultimately cause the elimination of a nucleotide, leading to strand scission. Within double-stranded DNA this reaction creates a gap (gapped duplex). The DNA ends at such gaps can exist in either 3´- or 5´-phosphorylated form; with lower probability, 3´ phosphoglycolates are also formed (these different ends cause differences in the electrophoretic mobilities of fragments shorter than about 25 nucleotides). The high reactivity and the small steric requirements of hydroxyl radicals give rise to a very even hydrolysis pattern over the entire length of a DNA molecule. Periodic conformational anomalies, as they occur in curved DNA or certain protein complexes, however, change the hydrolysis pattern in a characteristic way. The lifetime of hydroxyl radicals is extremely short and they react only within a small radius of their creation; based on the very fast reaction, the radius of diffusion is assumed to be limited to about 1 nm. The observed cleavage rate at a given DNA position correlates with the accessibility of the corresponding bond or atom to hydroxyl radical attack. Each change in reactivity therefore allows very precise conclusions to be drawn about the geometry of the DNA–protein complex. Hence, hydroxyl radical footprints are excellent surface probes, recording structural changes with very high sensitivity. The generation of hydroxyl radicals is based on the Fenton reaction. In this reaction Fe(II) is oxidized by H2O2 to Fe(III), whereby hydroxide ions (OH−) and hydroxyl radicals (OH•) are formed:

Fe(II) + H2O2 → Fe(III) + OH− + OH•  (32.5)

In the laboratory, [Fe(EDTA)]2−, hydrogen peroxide (H2O2), and sodium ascorbate are used for the reaction. Sodium ascorbate regenerates Fe(II) from Fe(III) in a cyclic reaction:

ascorbate + Fe(III) → ascorbate radical + Fe(II)  (32.6)

To adjust exact reaction times, the hydroxyl radical reaction can be stopped by an excess of thiourea (quenching). Typically, hydroxyl radical footprints are performed with end-labeled DNA, and the positions are assigned by means of a Maxam–Gilbert sequencing reaction. Because of the stereoselective reaction, hydroxyl radical footprints enable a rather precise description of the topography of a protein-binding site. Hydroxyl radicals attack the

Hydroxyl radicals are generally formed through the Fenton reaction, in which Fe(II) ions are oxidized by H2O2 to Fe(III). Hydroxyl radicals are characterized by their very high reactivity and immediate attack of chemical structures in the vicinity of their creation, causing strand scissions in nucleic acids and proteins. Hydroxyl radical footprint reactions are common for the structural analysis of DNA and RNA as well as protein–nucleic acid complexes.


deoxyribose along the minor groove. In curved DNA the minor groove is narrowed at the inside of the curvature, which results in reduced accessibility. In the case of curved DNA it is therefore possible to recognize in the footprint pattern the exact helix periodicity, or any alteration of this parameter mediated by bound protein. The high resolution of the hydroxyl radical reaction enables detection of even minor changes in accessibility induced by interacting proteins. To evaluate hydroxyl radical footprints it is recommended to perform a densitometric scan of the lanes of the footprint gel. Integration of the peak areas can be used to draw quantitative conclusions, and superimposing (subtracting) the scan profiles of footprint lanes with and without bound protein will reveal architectural intricacies of the protein-interacting domain. The quality of the footprints can be further improved when the DNA–protein complexes are separated from the free DNA on retardation gels after the hydroxyl radical treatment. Complexes treated under single-hit conditions, with at most one nick in the DNA, remain intact. The DNA is extracted from the gel band and directly separated by denaturing gel electrophoresis. The background of non-bound DNA is thus eliminated, leading to a significantly improved contrast of the complex signals. Hydroxyl radical reactions require some precautions. Radical scavengers in the solution may reduce the lifetime of the hydroxyl radicals significantly, causing a strong reduction of the band intensities. Such radical scavengers include stabilizing substances like glycerol, a frequent additive in protein solutions; consequently, glycerol-containing protein samples have to be dialyzed before the reaction. The same is true for solutions containing SH groups. Care should be taken, for instance, that the concentration of 2-mercaptoethanol does not exceed 5 mM. High concentrations of divalent cations are also known to disturb the reaction.
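The densitometric evaluation described above amounts to subtracting normalized lane profiles; a minimal Python sketch with invented intensities (the 50% protection threshold is an arbitrary choice for illustration):

```python
# Sketch of a densitometric footprint evaluation (profiles invented).
# Lane profiles (band intensity per nucleotide position) with and without
# protein are normalized to equal total signal and compared; positions with
# a strong fractional loss of signal are scored as protected.

def protected_positions(free_profile, bound_profile, threshold=0.5):
    total_free = sum(free_profile)
    total_bound = sum(bound_profile)
    protected = []
    for pos, (f, b) in enumerate(zip(free_profile, bound_profile), start=1):
        f_norm = f / total_free
        b_norm = b / total_bound
        # protection = fractional loss of normalized signal upon binding
        if f_norm > 0 and (f_norm - b_norm) / f_norm >= threshold:
            protected.append(pos)
    return protected

free  = [10, 11, 10, 12, 10, 11, 10, 10]
bound = [10, 11,  2,  3,  2, 11, 10, 10]   # signal lost at positions 3-5
print(protected_positions(free, bound))    # [3, 4, 5]
```

Normalizing each lane to its total signal compensates for unequal loading, so only the relative loss of intensity at individual positions is scored.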

32.4.5 Interference Conditions

A special variant of the hydroxyl radical footprint technology is termed missing nucleotide analysis. The principle is also known as damage selection or interference footprinting. In this type of footprint the limited hydrolysis or chemical modification of a nucleic acid is performed before it is used for complex formation with a protein. In the subsequent binding reaction only those molecules are selected that have retained their binding capability, and the complexes are then separated from the free nucleic acid. Cuts or modifications at sites essential for protein binding are absent from the fraction of the complexes and are found exclusively in association with the non-bound nucleic acid. Positions whose cleavage or modification is present only in the fraction of the free nucleic acid are therefore most likely involved in direct protein binding. A particular advantage of the interference footprint is that modification conditions may be applied that would normally cause the dissociation of protein–nucleic acid complexes.

Because of the reactivity of hydroxyl radicals, which primarily attack the deoxyribose, this type of footprint provides information exclusively about the contacts of binding proteins with the DNA sugar-phosphate backbone. Conclusions about interactions between the bases and functional groups of the protein cannot be drawn directly. Exactly this information, however, is provided by the missing nucleotide principle. The method takes advantage of the hydroxyl radical cleavage chemistry to remove single nucleotides from a DNA molecule at random positions. The reaction is performed under single-hit conditions, resulting on average in less than one random gap per DNA strand. The double strands remain intact through base pairing even with a missing nucleotide in one of the strands. The DNA containing statistically distributed nucleotide gaps is then used in a protein binding reaction and the complexes formed are separated from the non-bound DNA by gel retardation. The binding experiment selects those DNA molecules in which the nucleotide gap does or does not have an essential influence on the ability to form a protein complex. DNA from the complex band and from the band with the free DNA is extracted, and the fragments created by the prior hydroxyl radical cleavage are separated on a sequencing gel. Nucleotides that are essential for protein binding can be identified directly by comparing the band patterns of DNA extracted from the complex and of non-bound DNA. Nucleotides essential for binding are missing or weakly represented in the lane of the sequencing gel corresponding to the complex; in contrast, strong bands can be expected at the same positions in the lane representing the free DNA (Figure 32.10).
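The band-pattern comparison between complexed and free DNA can likewise be expressed as a simple intensity-ratio calculation (a hypothetical Python sketch with invented intensities; the 30% depletion threshold is an arbitrary choice):

```python
# Sketch of the lane comparison in missing-nucleotide (interference) analysis.
# For each gel position, the normalized intensity in the bound-DNA lane is
# divided by that in the free-DNA lane; strongly depleted positions mark
# nucleotides essential for protein binding.

def essential_positions(bound_lane, free_lane, depletion=0.3):
    """Positions whose bound/free intensity ratio falls below `depletion`."""
    # normalize each lane so the comparison is independent of loading amounts
    b_total, f_total = sum(bound_lane), sum(free_lane)
    essential = []
    for pos, (b, f) in enumerate(zip(bound_lane, free_lane), start=1):
        ratio = (b / b_total) / (f / f_total)
        if ratio < depletion:
            essential.append(pos)
    return essential

bound = [8, 9, 1, 0.5, 8, 9]
free  = [8, 8, 12, 13, 8, 8]   # gaps at positions 3 and 4 abolish binding
print(essential_positions(bound, free))  # [3, 4]
```

Positions depleted from the bound fraction but over-represented in the free fraction are exactly those whose removal interferes with complex formation.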


Figure 32.10 Scheme of interference footprints.

The principle of missing nucleotide analysis can also be applied with methylating reagents like DMS; in this case one speaks of methylation interference. Because the DMS modification occurs at the nucleotide bases, this method allows study of the effects of DMS-reactive nucleotide positions on protein binding. In contrast to the hydroxyl radical reaction, methylation interference does not provide information about all bases: only purine positions can be tested, and additional reactions with alternative reagents are necessary to gain information on all DNA positions. Unlike missing nucleotide analysis, methylation interference requires that the DNA strands be cleaved with piperidine prior to separation on sequencing gels.

32.4.6 Chemical Nucleases

Owing to their negative charge, Fe(II)–EDTA complexes, which are used to generate hydroxyl radicals, do not bind to DNA. The generated radicals therefore react by free diffusion with arbitrary positions of the DNA. If footprint reagents are employed in which the EDTA moiety is coupled to an intercalating reagent that preferentially interacts with specific DNA structures, a directed and less homogeneous cleavage pattern results. The compound methidiumpropyl-EDTA-Fe (MPE-Fe) belongs to such a group of synthetic footprinting reagents. Because the radical reaction cleaves the phosphodiester backbones of both DNA and RNA, reagents of this type are designated chemical nucleases (Figure 32.11). MPE-Fe, like the structurally related ethidium bromide, intercalates between the stacked bases of double-stranded nucleic acids. The EDTA–iron complex is positioned such that it points into the minor groove. Addition of O2 or H2O2 causes hydroxyl radical formation, which ultimately leads to cleavage of the phosphodiester chain. Cleavage occurs preferentially close to the minor groove. Bound proteins, which either interfere with MPE-Fe intercalation or inhibit the diffusion of hydroxyl radicals into the minor groove, cause an altered cleavage pattern. Protection from MPE-Fe-induced strand scissions therefore indicates protein binding in the minor groove. A locally reduced reactivity may, on the other hand, have more complex causes: if, for instance, the DNA double strand is unwound by protein binding, MPE-Fe is unable to intercalate, and this section of DNA consequently appears protected (pseudo-protection). Because of its sensitivity towards changes in the minor groove, MPE-Fe is frequently regarded as a chemical version of DNase I. Another metal ion-chelating compound with the properties of a chemical nuclease is the bis(1,10-phenanthroline)copper(I) complex (OP-Cu) (Figure 32.11).
The reagent attaches without covalent binding to the DNA minor groove and, in the presence of ascorbic acid or mercaptopropionic acid and H2O2, causes cleavage of the phosphodiester backbone deep within the minor groove. DNA in the B-form conformation is the preferred substrate, while under the same conditions A-form DNA is only weakly cleaved and Z-form DNA is not reactive at

Figure 32.11 Structure of chemical nucleases.


Figure 32.12 Structure of Fe-BABE, Fe-1-(p-bromoacetamidobenzyl)ethylenediaminetetraacetic acid.

all. OP-Cu is much more sensitive than DNase I with respect to protein-induced conformational alterations in the minor groove. Moreover, it can cleave sequence regions characterized by frequent A:T clusters, which are only scarcely hydrolyzed by DNase I. Contrary to MPE-Fe, OP-Cu is able to cleave single-stranded regions, and the fact that lower reagent concentrations are required, compared to MPE-Fe, normally ensures that protein–nucleic acid interactions are not disturbed. The reagent may also be employed for structural studies of RNA–protein complexes; however, the rules that determine the cleavage efficiency are far less well known, owing to the generally much more complex structure of RNA. The cleavage reaction induced by OP-Cu can efficiently be stopped by the addition of 2,9-dimethyl-1,10-phenanthroline (neocuproine) (Figure 32.11). Neocuproine sequesters the available copper ions in a stable, inert complex. Owing to their small size, the chemical nucleases described above can also conveniently be used to perform footprints directly in the retardation gel, because they rapidly diffuse into the matrix of polyacrylamide gels. In this case, the complex band and the band containing the free DNA may be subjected separately to footprint analysis without prior extraction from the gel pieces. The cleaved DNA is extracted only after the treatment and analyzed directly on a sequencing gel. A particularly elegant method for the mapping of protein–nucleic acid interacting domains is based on the targeted modification of complexes in which the interacting partners have been conjugated to reactive groups. Hydroxyl radical-generating EDTA complexes conjugated to functional side chains of proteins (but also of nucleic acids) have proven successful. The best known of these conjugated “scissors” is certainly the iron-chelating reagent Fe-BABE.
Fe-1-(p-bromoacetamidobenzyl)ethylenediaminetetraacetic acid (Fe-BABE) serves to generate radicals in the presence of H2O2 and ascorbate (Figure 32.12). The reagent can be conjugated via its bromoacetamide linker to any SH group of cysteine side chains of proteins, where it effects the contact-dependent cleavage of a nucleic acid in complex with the protein. Contacts, or close proximity, of proteins and nucleic acids may thus be localized easily with Fe-BABE. Alternatively, the reagent can be conjugated to proteins through 2-iminothiolane and the amino group of lysine residues, which extends the range of applications significantly because lysine is a more common amino acid in proteins than cysteine. The use of Fe-BABE-conjugated proteins not only allows a precise localization of the protein by cleavage of the nucleic acid chain; the radicals can likewise cleave the peptide chains of neighboring proteins and thus provide information about protein–protein interactions. Because of the small activity radius of the generated radicals, the cleavage positions are generally only about 1.2 nm distant from the cysteine (or lysine) to which the reagent has been conjugated.

32.4.7 Genome-Wide DNA–Protein Interactions

Chromatin Immunoprecipitation, Section 31.7

Chromatin Immunoprecipitation

One important method for the in vivo analysis of DNA–protein interaction regions is chromatin immunoprecipitation (ChIP). This technique enables us to map the genomic localization of many regulatory proteins, which is a prerequisite for understanding many regulatory networks. The method can be combined with DNA microarrays or high-throughput sequencing and thus makes possible the genome-wide characterization of the DNA binding sites of regulatory proteins. Whereas in vitro studies with purified proteins and isolated DNA fragments are limited to short DNA regions and do not reflect the actual in vivo situation of the DNA structure (chromatin structure or superhelicity), the ChIP methodology can provide information under almost physiological conditions. The method is, furthermore, suitable not only for the analysis of relatively simple bacterial DNA but also for the more complex genomes of eukaryotic organisms (Kim, T.H. and Ren, B. (2006) Genome-wide analysis of protein–DNA interactions. Annu. Rev. Genomics Hum. Genet., 7, 81–102). Initially, the ChIP method was based on the treatment of living cells with crosslinking reagents, usually formaldehyde, to fix existing protein–DNA complexes. The formaldehyde-crosslinked DNA–protein complexes are extracted, followed by fractionation of the chromatin and immunoaffinity purification of the DNA fragments linked to the protein. This is achieved with an antibody specific for the binding protein to be analyzed. The crosslink is subsequently reversed and the DNA binding region can be characterized by various analysis techniques such as Southern blotting, PCR, or DNA sequencing. The original technique allows the analysis of protein-binding regions of a limited number of DNA target sequences but

32 Protein–Nucleic Acid Interactions

no genome-wide information about the binding regions of a distinct protein can be obtained. The ChIP method may, however, be combined with microarray techniques, enabling the localization of protein–DNA binding sites distributed over an entire genome (ChIP-on-chip technique). This requires that the total DNA (input DNA) and the immunoprecipitated DNA are labeled with two different fluorescent dyes. Both DNAs are hybridized to the same microarray chip, with the input DNA serving as hybridization control. The differences in hybridization intensity on the chip are taken as a measure of the enrichment caused by the bound protein. Because the method does not require any knowledge of the potential recognition sequence, it can be used for the characterization of as yet unknown binding regions, leading, for instance, to a global overview of the binding sites of a distinct transcription factor on a complete genome. One important prerequisite, which poses a limitation for the method, is the availability of a highly specific antibody against the binding protein in question. If no such antibody is available, the limitation can in some cases be overcome in organisms that are amenable to genetic manipulation: fusion of the binding protein with an epitope tag or with DNA adenine methyltransferase can be helpful for the identification of the binding positions. The fusion with DNA adenine methyltransferase causes a local DNA methylation in the vicinity of the protein-binding site, which can readily be identified by methylation-sensitive restriction enzymes. The relatively high number of cells required for a ChIP-on-chip experiment is a further limitation: generally, 10⁷ cells are necessary for a reaction, making the method unsuitable for certain types of cells.
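The enrichment readout described above, the hybridization intensity of the immunoprecipitated (IP) DNA relative to the input DNA per probe, can be sketched in a few lines. The probe intensities, the pseudocount, and the peak threshold below are hypothetical illustrations, not values prescribed by the method:

```python
import math

def chip_enrichment(ip_signal, input_signal, pseudo=1.0):
    """Per-probe log2 enrichment of immunoprecipitated (IP) DNA over
    total (input) DNA; a small pseudocount avoids division by zero."""
    return [math.log2((ip + pseudo) / (inp + pseudo))
            for ip, inp in zip(ip_signal, input_signal)]

# Hypothetical probe intensities along a genomic region:
ip = [120, 95, 980, 1020, 110]     # IP channel (e.g., Cy5)
inp = [100, 100, 100, 100, 100]    # input channel (e.g., Cy3)
ratios = chip_enrichment(ip, inp)

# Probes enriched more than 4-fold mark a putative binding region:
peaks = [i for i, r in enumerate(ratios) if r > 2.0]
```

Real ChIP-on-chip pipelines add normalization and replicate statistics; the log-ratio itself is the core quantity compared across the chip.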

32.5 Physical Analysis Methods

Physical methods provide several advantages in comparison to conventional biochemical techniques such as gel electrophoretic methods combined with radioactive labeling. Firstly, most physical methods allow the observation of a binding reaction and characterization of the complexes under true equilibrium conditions, whereas gel electrophoretic analyses are performed under non-equilibrium conditions and complexes may dissociate during separation, depending on the circumstances of the electrophoresis. External conditions such as temperature and salt concentration can also not be chosen freely for gel electrophoresis. A similar situation applies to enzymatic or chemical footprint methods and filter binding experiments.

32.5.1 Fluorescence Methods

A main advantage of fluorescence-based analyses is that the experiments can be performed in solution, without the need to separate or quantify free and complexed components. A further advantage resides in the physical principles of fluorescence itself. The timescale of fluorescence emission lies between picoseconds and a few hundred nanoseconds, depending on the particular fluorophore and the measurement conditions. This timescale equals that of many molecular dynamic effects, such as molecular rotations, diffusion of small molecules, solvent reorientations, movements of individual molecular domains, or the energy transfer between chromophores. Hence, fluorescence spectroscopy enables conclusions to be drawn about these fast processes and short-lived molecular states. Along with the analysis of equilibrium phenomena, the method is of benefit for kinetic analyses in the range of milliseconds. Moreover, the sensitivity is extremely high, owing to the very low limit of detection of fluorescent dyes, and covers a range between millimolar (10⁻³ M) and approximately picomolar (10⁻¹² M). Among the most frequently used analysis methods are measurements of fluorescence anisotropy, measurements of fluorescence intensity changes, and fluorescence resonance energy transfer (FRET).
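As an illustration of an anisotropy-based binding measurement, the following sketch computes the anisotropy expected for a labeled nucleic acid probe titrated with protein, assuming a simple 1:1 equilibrium. All numbers (concentrations, KD, free and bound anisotropy limits) are hypothetical:

```python
import math

def fraction_bound(p_total, l_total, kd):
    """Exact 1:1 binding: fraction of the labeled ligand L in complex
    at total protein concentration P (all in the same units, e.g., nM)."""
    b = p_total + l_total + kd
    return (b - math.sqrt(b * b - 4.0 * p_total * l_total)) / (2.0 * l_total)

def observed_anisotropy(p_total, l_total=1.0, kd=10.0,
                        r_free=0.05, r_bound=0.20):
    """The measured anisotropy rises from r_free toward r_bound as the
    slower-tumbling protein-probe complex forms."""
    f = fraction_bound(p_total, l_total, kd)
    return r_free + (r_bound - r_free) * f
```

Fitting such a curve to a titration series yields KD directly, without any separation step.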

32.5.2 Fluorophores and Labeling Procedures

A set of different fluorescent reagents for labeling proteins and nucleic acids is commercially available. Labeling of nucleic acids is generally preferred because it is more versatile and because proteins are often more difficult to obtain in pure form. Frequently used labels are, for instance, fluorescein- or rhodamine-tagged phosphoramidites, which are linked to the last nucleoside

position of an oligonucleotide, yielding a 5′-labeled probe. Alternatively, different coupling reagents are available, in addition to spacer molecules consisting of a variable number of CH₂ groups (tether). Coupling reactions may also be performed with thiol- or amino-reactive dye molecules after the nucleic acid has been modified at the 5′ end with thiol- or amino-group-containing phosphoramidites or with ATP-γS and T4 polynucleotide kinase. Oligonucleotides that have been modified with amino groups may be coupled with isothiocyanates or linked to fluorescent dyes via succinimide esters. A crucial point for the success of fluorescence-based binding studies is the purity of the fluorescent probe. It is essential for the sensitivity of the respective study that 100% of the DNA or RNA is labeled with the fluorescent dye. Any contamination by free fluorophore should also be avoided because it will contribute to the anisotropy of the overall signal, notably limiting the sensitivity of the study. HPLC purification of the oligonucleotides after labeling is the method of choice. Oligonucleotide purification kits, size exclusion chromatography (e.g., Sephadex G-10), or gel electrophoresis are also an option. Preparative gel electrophoresis under denaturing conditions has proven of value as a universal and very efficient purification method. For the preparation of fluorescence-labeled DNA double strands the labeled single strands are usually hybridized, followed by separation of the fluorescence-labeled, non-base-paired single strands by native gel electrophoresis.

32.5.3 Fluorescence Resonance Energy Transfer (FRET)

FRET, Section 16.7.1–16.7.4

A special application of fluorescence analysis for the characterization of spatial relationships in complex macromolecules is the fluorescence resonance energy transfer (FRET) technique. The method is particularly suitable for the determination of structural changes induced by protein–nucleic acid interactions. FRET allows measurement of the distances between fluorescent dyes residing at defined positions of a macromolecule, thereby providing three-dimensional structural information that is otherwise very difficult to obtain. The method can be used successfully, for instance, to determine in a qualitative and quantitative way the protein-induced DNA bending of labeled molecules. For the analysis, a donor and an acceptor fluorescent dye must be linked to the two ends of a DNA fragment containing a protein binding site. Of course, the end-to-end distance should not exceed the range limit of the fluorescence energy transfer (Förster transfer). Fluorescence resonance energy transfer describes a non-radiative transfer of energy (without emission of a photon) from the excited state of a chromophore, the energy donor D, by intermolecular long-range dipole–dipole coupling to a neighboring chromophore, the energy acceptor A. The efficiency of the transfer is inversely proportional to the sixth power of the distance between the fluorescent dyes, which limits the method to measurements between 1 and 10 nm. Quite obviously, for a FRET system to be efficient the fluorescence spectrum of D and the absorption spectrum of A must overlap sufficiently. A further critical parameter for FRET analysis resides in the orientation of the two dyes with respect to each other: the transfer efficiency depends on the relative orientation of the dipoles of the donor and acceptor. To enable highly efficient transfer the transition dipoles of donor and acceptor should preferentially be parallel.
The relative orientation of the dipoles with respect to each other is given by the orientation factor κ², which is not precisely known in most cases; the simple assumption that both chromophores rotate freely and fast with respect to the lifetime of the excited state of the donor (κ² = 2/3) is certainly not always correct. If one assumes, as may be true for many donor–acceptor pairs, that one chromophore is static while the other rotates fast and freely, the resulting error in the distance determination amounts to no more than about 12%. The distance at which the energy transfer is 50% is defined as the Förster distance (R₀). The magnitude of R₀ depends on the spectral properties of the donor and acceptor molecules and is calculated by:

R₀ = 9.79 × 10³ (J Q n⁻⁴ κ²)^(1/6)    (32.7)

where:

J = the spectral overlap integral,
Q = the fluorescence quantum yield of the donor in the absence of the acceptor,
n = the refractive index of the medium,
κ² = the dipole orientation factor.

Figure 32.13 Molecular beacons for the detection of protein–nucleic acid complexes. (a) Structure of a molecular beacon; due to the close proximity of the fluorophore (F) to a quencher (Q) no fluorescence signal is emitted. (b) The interaction of a protein causes the spatial separation of fluorophore and quencher, which results in the emission of a fluorescence signal. Source: taken from Li, J. J., Fang, X., Schuster, S. M., Tan, W. (2000) Angew. Chem., Int. Ed., 39, 1049–1052. With permission, Copyright © 2000 Wiley-VCH Verlag GmbH.

It is known that many dyes, which are used for labeling of nucleic acids, interact with neighboring bases. This can significantly contribute to quenching of the excited state of the chromophore. Moreover, since these interactions are sequence-dependent the newly formed nucleic acid–dye complexes may have completely different spectroscopic and stereochemical properties. To enable identical FRET efficiencies for quantitative comparisons it is important, therefore, that all labeled probes have the same nucleotide sequence in the vicinity of the fluorophore.

32.5.4 Molecular Beacons

The descriptive name molecular beacon denotes a special application of fluorescence analysis that depends on changes in the fluorescence of a chromophore caused by a quencher molecule located nearby. Molecular beacons consist of a hairpin nucleic acid with a fluorescent molecule linked to one end and a quencher molecule to the other. The spatial proximity of the quencher to the fluorophore prevents the emission of a fluorescence signal. When the hairpin structure is disrupted by the interaction with other biomolecules (DNA, RNA, or proteins), the distance between fluorophore and quencher increases, which causes an increase in the fluorescence intensity (Figure 32.13). Molecular beacons are therefore favorable tools for the identification of protein–nucleic acid interactions. The principle can also be applied in intact cells; in such cases the probes can be introduced into the cells through liposomes.

Molecular Beacon, Section 28.1.3

32.5.5 Surface Plasmon Resonance (SPR)

Surface plasmon resonance (SPR), often referred to by the instrument name Biacore, describes an optical method measuring changes of the refractive index close to a sensor layer (∼300 nm) between glass and metal. The surface forms the bottom of a tiny flow cell (20–60 nl) through which an aqueous solution containing one binding partner (the analyte) flows continuously. Onto the sensor surface (usually a chip with a thin gold film) a binding protein (the ligand) is immobilized. Binding of the analyte to the ligand increases the molecular mass at the surface, resulting in an increase in the refractive index. Several chemical procedures are available to immobilize the ligand on the chip surface. The choice of the coupling conditions depends on the

Surface Plasmon Resonance (SPR), Section 16.6


Figure 32.14 Principle of an SPR measuring device.

chemical properties and the expected sizes and structures of the complexes between ligand and analyte. The physical principle of surface plasmon resonance is relatively complex: it occurs when a monochromatic, plane-polarized light beam is totally reflected at the interface of a glass surface that is coated on its outside with a gold layer. This creates two different optical waves: one wave with exponentially decreasing intensity (evanescent wave) resulting from total internal reflection, and one wave propagating through the metal. At a certain angle the interaction between the two waves causes a drop in the intensity of the surface-reflected light (plasmon resonance), which can be recorded. The resonance conditions depend on the material that is adsorbed to the metal layer (analyte). The resonance energy shows an almost linear dependence on the mass concentration of biomolecules, such as proteins, DNA, or RNA, that are fixed to the sensor surface. The SPR signal, recorded in resonance units (RU), is a measure of the concentration of mass on the surface of the sensor chip (∼1000 RU corresponds to the adsorption of 1 ng mm⁻² protein or nucleic acid; this results in a shift of the reflection angle by 0.1°). SPR measurements allow us to record the association and dissociation of analyte and ligand and to determine equilibrium as well as rate constants of the interaction (Figure 32.14).
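The association and dissociation phases of such a sensorgram are commonly evaluated with a simple 1:1 (Langmuir) interaction model. The sketch below uses this standard model with hypothetical rate constants; it is not tied to any particular instrument software:

```python
import math

def association(t, conc, ka, kd, rmax):
    """Response during the association phase of a 1:1 interaction:
    R(t) = Req * (1 - exp(-(ka*C + kd)*t)), with the equilibrium
    response Req = Rmax * C / (C + KD) and KD = kd / ka."""
    kobs = ka * conc + kd
    req = rmax * conc / (conc + kd / ka)
    return req * (1.0 - math.exp(-kobs * t))

def dissociation(t, r_start, kd):
    """Buffer wash-off phase: pure exponential decay with rate kd."""
    return r_start * math.exp(-kd * t)

# Hypothetical constants: ka = 1e5 M^-1 s^-1, kd = 1e-3 s^-1,
# giving KD = kd/ka = 10 nM; an analyte at C = KD plateaus at Rmax/2.
```

Fitting both phases of a measured sensorgram to these expressions yields ka and kd, and hence the equilibrium constant, from a single experiment.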

32.5.6 Scanning Force Microscopy (SFM)

Atomic Force Microscopy, Section 20.2

Scanning force microscopy (SFM), also termed atomic force microscopy (AFM), is a typical method for studying single molecules. With SFM, surfaces are scanned in the nanometer range with a very fine tip fixed to a cantilever. Piezoelectric elements are used to maneuver the tip exactly over the sample. The tip hovers over the sample surface, held in place by a minute spring force. During line-by-line scanning of the surface structure the cantilever tip is deflected. This deflection is recorded by a laser beam reflected from the back side of the cantilever; the information is registered by a photodetector and converted into a three-dimensional image. SFM can routinely visualize structures in the atomic range (between 0.1 and 10 nm). Measurements are possible in air or in aqueous solution and, hence, under physiological conditions. The measurements can be performed in different ways. In the contact mode the tip remains at very close distance to the sample material to be scanned and interacts with the sample through van der Waals, capillary, or electrostatic forces. As a disadvantage, shearing forces may distort or destroy the sample material. Alternatively, measurements can be performed in the tapping mode, in which the cantilever carrying the tip oscillates at high frequency and the measuring procedure is largely contact-free. If the fine tip touches the sample, the oscillation amplitude decreases; a feedback circuit keeps the amplitude constant so that only very light contacts occur. As an advantage, the lateral resolution is higher at lower applied forces, which preserves the sample. For the measurements the sample, purified as highly as possible, is deposited in a dilute buffer onto a freshly cleaved mica surface. Adsorption to the mica can be improved by Mg²⁺ ions. Preferred objects for SFM are primarily protein–nucleic acid complexes of higher molecular weight, such as DNA bound to RNA polymerase or chromatin complexes.
However, successful analyses have also been performed with small proteins, such as the transcription factors Cro and FIS, with molecular weights of 2 × 7.6 kDa and 2 × 11.2 kDa, respectively. The method is especially capable of


Figure 32.15 Analysis of the binding of RNA polymerase to a promoter DNA by scanning force microscopy (SFM). (a) RNA polymerase in complex with a DNA fragment containing a bacterial rRNA promoter. (b) The same complex in the presence of the repressor protein H-NS. The DNA is fixed by H-NS around the RNA polymerase, which prevents it from leaving the promoter. Source: courtesy of Dr. R. Dame, Leiden, The Netherlands.

detecting conformational changes of macromolecules, enabling the visualization of DNA bending, loop formation, or topological alterations of circular molecules (Figure 32.15).

32.5.7 Optical Tweezers

Optical tweezers use light to manipulate microscopically small objects, down to the size of an atom. The method takes advantage of the fact that the focus of a laser beam can hold a small particle in place (optical trap). With such a device, forces in the range of piconewtons (pN) and deflections in the range of nanometers can be determined precisely. If, for instance, the interaction forces of a molecular system are to be determined, one component of the system is bound to a solid support while a second component (often fixed to a silica bead) is held in an optical trap. The deflection of the silica bead within the focused laser beam is measured and converted into the restoring force. With such a device it is possible to determine the interaction forces between macromolecules or, for instance, the motion of, and the forces generated by, an RNA polymerase synthesizing an RNA chain while traveling along a DNA molecule (Figure 32.16).
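Within the linear regime of the trap, the restoring force on the bead follows Hooke's law, F = k·x, which is how the measured deflection is converted into a force. The trap stiffness in the sketch below is a hypothetical but typical order of magnitude:

```python
def trap_force(displacement_nm, stiffness_pn_per_nm=0.1):
    """Restoring force (pN) on a bead displaced from the trap center,
    valid in the linear (Hookean) regime of the optical trap.
    The stiffness value is illustrative; it is calibrated per
    experiment, e.g., from the bead's thermal fluctuations."""
    return stiffness_pn_per_nm * displacement_nm

# A bead pulled 50 nm off-center at k = 0.1 pN/nm reports about 5 pN,
# i.e., the piconewton force scale relevant for molecular motors.
```

In practice the stiffness is calibrated first, after which the position signal of the bead directly yields the force exerted by the tethered macromolecule.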


32.5.8 Fluorescence Correlation Spectroscopy (FCS)

Figure 32.16 Examples of the analysis with optical tweezers. (a) Schematic depiction of an experiment to measure the force of a transcribing Escherichia coli RNA polymerase. (b) Schematic arrangement to measure the forces of the transcription and exonuclease reaction of T7 DNA polymerase. Source: taken from Wuite, G.J. et al. (2000) Nature, 404, 103–106. With permission, Copyright © 2000, Rights Managed by Nature Publishing Group.

Fluorescence correlation spectroscopy (FCS) is a very fast and dynamic method for the analysis of DNA– and RNA–protein interactions in solution. The method combines laser technology and confocal microscopy. Although long established, the method has been developed further into a most promising technique by the introduction of modern laser microscopy. In FCS the random motion of fluorescence-labeled molecules is recorded in a very small volume, which roughly corresponds in size to the volume of an Escherichia coli cell (10⁻¹⁵ l, i.e., one femtoliter). A focused laser beam irradiates this small volume. The application is based on fluctuations of the fluorescence signal resulting from single molecules diffusing through the solution under study. The diffusion times of the molecules can be obtained from the fluctuation data and correlate directly with the mass of the particles. Every change in the molecular mass, such as complex formation with another macromolecule, strongly alters the diffusion times and can be used for the analysis of thermodynamic or kinetic parameters of the interacting partners. In this way, mobility coefficients and characteristic rate constants of inter- or intramolecular reactions can be determined at nanomolar concentrations. The time resolution of the method spans from a millisecond to more than 10 s. The measurement requires, along with a fluorescence correlation spectrophotometer, a fluorescence-labeled probe. Relatively small fluorescence-labeled ligands that bind to heavy (about a factor of 8–10 bigger) non-fluorescent molecules, with correspondingly long diffusion times, are ideally suited for the analysis (Figure 32.17). Devices have lately also become available for the simultaneous analysis of two differently fluorescence-labeled components (two-color detection).
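The fluctuation record is usually condensed into an autocorrelation curve. The following illustration uses the standard single-species model for free 3D diffusion through a Gaussian confocal volume (a common textbook form, with hypothetical parameters) and shows why an 8–10-fold mass increase roughly doubles the diffusion time, since the hydrodynamic radius scales with about the cube root of the mass:

```python
def fcs_autocorrelation(tau, n_mean, tau_d, s=5.0):
    """Autocorrelation G(tau) for one freely diffusing species in a
    Gaussian confocal volume: n_mean = mean molecule number in the
    volume, tau_d = diffusion time, s = structure parameter
    (axial/lateral extent of the volume).  G(0) = 1/N."""
    lateral = 1.0 / (1.0 + tau / tau_d)
    axial = (1.0 + tau / (s * s * tau_d)) ** -0.5
    return lateral * axial / n_mean

# Hypothetical free vs. bound probe: a partner ~9x heavier increases
# the hydrodynamic radius by ~9^(1/3), so tau_d roughly doubles.
tau_d_free = 1.0e-4                        # s, free labeled ligand
tau_d_bound = tau_d_free * 9.0 ** (1 / 3)  # ~2.1x slower in the complex
```

Fitting measured curves with a two-component version of this model yields the fractions of free and bound probe, and hence the binding parameters.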

32.6 RNA–Protein Interactions

Remarkably, RNA molecules within a cell rarely exist as single molecules. Almost all the time they are complexed with proteins and execute their functions as ribonucleoprotein (RNP) particles. RNPs can be very complex, like ribosomes or spliceosomes, but there are also simple binary complexes, like RNase P or telomerase. Often, RNA molecules, depending on their functional tasks in the cell, rapidly change between different complexes. If tRNA is taken as an example, it almost never exists without a protein partner in the cell. Its functional path right after synthesis and maturation begins with the aminoacyl-tRNA synthetase, which catalyzes the linkage with the specific amino acid. Following aminoacylation, tRNAs form a ternary complex with the elongation factor EF-Tu•GTP. They are then docked to the ribosomal A-site, from which they move to the P- and E-sites before being aminoacylated again after leaving the ribosome. Clearly, such RNA molecules must possess a complex pattern of different specific interaction mechanisms.

32.6.1 Functional Diversity of RNA

Figure 32.17 Example of an FCS measurement. The change of the diffusion time is recorded that results from the formation of a complex between a large molecule (protein) and a small fluorescence labeled oligonucleotide (ligand).

In addition to their general task as mRNAs in the transfer of information from DNA to protein, RNA molecules have many different functions. For instance, tRNAs, a family of stable RNAs, function among several alternative tasks as adaptors for amino acids, which they activate and transport to the protein synthesis machinery. A different set of RNA molecules, the ribosomal RNAs (rRNAs), not only constitute the most important structural part of all ribosomes but are also active in binding and catalytic steps during protein synthesis. Generally, RNA molecules exerting catalytic functions are collectively termed ribozymes. Among the catalytic functions of natural ribozymes are, for instance, nucleolytic activities (hydrolysis of nucleic acids), formation of phosphodiester bonds (ligation or splicing), and peptide bond formation. In many higher organisms guide RNAs (gRNAs) play a role in RNA editing. Moreover, small nuclear RNAs (snRNAs) and small nucleolar RNAs (snoRNAs), which are responsible for the splicing process and for the maturation and sequence-directed modification of eukaryotic rRNAs, respectively, are of special importance. Recently, in all organisms studied so far, many small non-coding RNAs (ncRNAs) have been characterized, which have variable regulatory functions at the different levels of gene expression. Among these RNAs are the small interfering RNAs (siRNAs), responsible for the RNA interference phenomenon, and the microRNAs (miRNAs). Moreover, the


recently discovered CRISPR RNAs (crRNAs), involved in the prokaryotic CRISPR defense system against foreign DNA, belong to the group of versatile regulatory RNAs.

32.6.2 RNA Secondary Structure Parameters and Unusual Base Pairs

What are the reasons for the specific differences in the interactions of proteins with either DNA or RNA? Despite the rather small chemical differences between the two classes of molecules, DNA and RNA (namely, the change of 2′-deoxyribose to ribose and the exchange of the methyl group of thymine for a hydrogen atom in uracil), there are enormous differences in the structures and functions of the two macromolecules. The preferred sugar pucker in 2′-deoxyribose is the C2′-endo form, while in ribose, due to the larger space requirements and the ability of the C2′-OH group to form H-bonds, the C3′-endo form is prevalent. As a consequence of this conformational difference the distance to neighboring phosphates is shortened, which in double-stranded polynucleotides causes a stabilization of the A-form helix. In the A-form helix, in contrast to the B-form helix (the standard conformation of DNA), the base pairs are not oriented at right angles to the helix axis but are tilted and shifted slightly laterally out of the helix center (slide, shift). This changes the dimensions of the major and minor grooves compared to the B-form helix. The major groove becomes a deep, narrow groove in which the functional groups of the bases are not accessible for protein contacts, as they are in double-stranded B-form DNA. The minor groove of the B-form is converted into a shallow groove in the A-form. Contrary to the deep major groove, the shallow minor groove of the A-form RNA helix is easily accessible for functional groups of proteins. However, the structures of natural RNAs solved so far indicate that helical sections are rarely longer than a complete helical turn without being interrupted by mismatches, bulges, or other secondary structural elements, which enables access to the deep groove from the helix ends.
The discontinuation of the regular helix structure enables a multiplicity of structural variants, which deviate considerably from the A-form, producing an inexhaustible repertoire of options for interactions (Figure 32.18). Additional structural peculiarities of RNA molecules are deviations from the standard Watson–Crick base pairs, such as the existence of wobble, Hoogsteen, or reversed-Hoogsteen pairs, mismatch pairs, bulges, stem-loop structures, bifurcations, tetraloops, and tertiary interactions resulting from base triples or pseudoknots (Figure 32.19). This structural multiplicity makes up for the somewhat lower information content of the RNA helix grooves, which, as indicated above, are anyway rather rare in natural RNAs without irregularities. RNA-binding proteins therefore recognize perfect double-helical elements with a conserved sequence only in special cases. In contrast, single-stranded regions between secondary structure elements, in which the functional groups of the bases are exposed for sequence-dependent interactions, are generally recognized. Additionally, intermolecular stacking interactions, which are normally very scarce in DNA, are often found in RNA–protein complexes; their energetic contribution to the stability of the complexes is especially high (∼3 kcal mol⁻¹). For the description and a molecular understanding of RNA–protein interactions, exact knowledge of the complex folded structure of the RNA molecules is important.

32.6.3 Dynamics of RNA–Protein Interactions

The specificity of DNA–protein interactions depends to a large degree on distinct interactions of amino acid side chains with functional groups of bases of the DNA strands in the major or minor groove. The specificity of the interaction is often based on symmetrical DNA structure elements (palindromes, sequence repetitions). Additional interactions with the sugar-phosphate backbone enhance the specificity of complex formation. In other words, the interactions are predominantly based on the helical nature of the nucleic acid chain. In contrast, the specificity of RNA–protein interactions, in agreement with the much higher variability of the RNA 3D structure, is considerably more versatile. Moreover, RNA molecules are often much more flexible, which is a prerequisite for many catalytic processes. A clear tendency of RNA-binding proteins that can be noted is to “freeze” the correct structure from an ensemble of alternative structures.

Figure 32.18 Differences in the geometry of the helical grooves between DNA and RNA. The accessibility of reactive groups of proteins is strongly restricted in the deep groove of the RNA helix.


Figure 32.19 Characteristic structural elements of RNA. (a) Different secondary structures; (b) examples of special secondary structures; (c) a pseudoknot as an example of a tertiary structure; (d) examples for special base-pairs and base-triples: (1) Hoogsteen-base-triple A:U:U, (2) protonated Hoogsteenbase-triple C+:C:G, (3) purine-base-triple in tRNAPhe, and (4) adenosine-H-bound to a 2´ -OH-group of a reverse Hoogsteen-A:U pair in tRNAPhe. (Batey, R.R., Rambo R.P., and Doudna, J.A. (1999), Tertiary Motifs in RNA Structure and Folding. Angew. Chemie Int. Ed. 38, 2326–2343)


Presumably, conformational adaptation during RNA–protein recognition plays a much greater role than in the interaction between DNA and protein. Surprisingly, the binding energies of RNA–protein complexes do not correlate with the size of the interaction surface, which participates predominantly in van der Waals interactions and makes an entropic contribution through the displacement of solvent molecules and ions. For example, the interface of the very stable U1A protein–RNA complex (KD ≈ 10⁻¹¹ M) is very small, and the binding includes an entropically expensive step, which converts a disordered single-stranded RNA structure into an ordered secondary structure. It is assumed that RNA–protein interface regions are not completely rigid. Rather, it is supposed that a compromise between rigidity and flexibility exists, which determines whether high specificity, with the necessary very high entropic expense, or the absence of any selectivity results. An important parameter determining the binding specificity resides in the energetics of the intrinsic RNA folding. For some RNA–protein complexes, melting of the RNA secondary structure causes an up to 10⁵-fold lower binding energy. If one assumes that a preformed RNA secondary structure reflects exactly the spatial counterpart of a rigid protein binding domain, then the entropic expense of RNA folding, when bound to the protein surface, is reduced by the fraction of the electrostatic binding energy contribution that derives from the interaction with the phosphate backbone. If one and the same protein recognizes two different RNA molecules, the one that more readily forms its correct structure will be bound much better, even if all contacts between the protein and the two RNAs are identical in the bound state. The formation of RNA–protein complexes is therefore clearly a highly dynamic process.
The recognition directed by the RNA structure generally changes the RNA conformation to form a fitting intermolecular binding surface. Hence, knowledge of the static RNA structure alone is not sufficient to understand the specificity of RNA–protein interactions on a quantitative level; it must be complemented by additional thermodynamic and kinetic studies.
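To put such affinity ratios on an energy scale, the standard relation ΔΔG° = RT ln(K₁/K₂) can be applied. The short Python sketch below is purely illustrative (it is not part of the text); it converts the up to 10⁵-fold affinity loss quoted above into a free-energy penalty:

```python
from math import log

R = 8.314e-3  # gas constant, kJ mol^-1 K^-1
T = 298.0     # temperature, K (25 deg C)

def delta_delta_g(affinity_ratio):
    """Free-energy difference corresponding to a ratio of binding
    constants: ddG = RT * ln(K1/K2)."""
    return R * T * log(affinity_ratio)

# Melting of the RNA secondary structure can lower binding up to
# 10^5-fold, i.e. a penalty of roughly 28-29 kJ/mol at 25 deg C:
penalty = delta_delta_g(1e5)
```

This makes concrete why a preformed, correctly folded RNA is bound so much better: the folding penalty enters the observed binding energy directly.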

32.7 Characteristic RNA-Binding Motifs

Commonly, RNA-binding proteins consist of independently folded, compact, globular αβ-recognition domains, which may occur singly or in multiple repetitions. Obviously, αβ-structures provide a particularly suitable platform for specific interaction with RNA molecules. The most prominent αβ-domains of RNA-binding proteins are the ribonucleoprotein (RNP) domain, the K-homology (KH) domain, and the double-stranded RNA-binding domain (dsRBD). The arrangement of the αβ-secondary structure motifs differs between these three RNA-recognition modules: the RNP domain is characterized by a repetitive βαβ-arrangement, dsRBD proteins have an αβββα organization, and KH proteins exhibit a βααββα secondary structure fold.

Another class of RNA-binding proteins is characterized by a basic domain. This domain consists of a stretch of about 10–15 amino acids with a very high content of arginines and lysines (arginine-rich motif, ARM). In contrast to the recognition motifs described above, ARM proteins do not have independently folded compact domains. Rather, the structure of the binding domain adapts itself to the respective RNA secondary structure. Moreover, the basic amino acids are not found at conserved positions but occur as clusters with relatively high variability. RNA secondary structures that interact with ARM proteins often show structural irregularities like bulges or internal loops within a regular RNA double helix. The major groove of the RNA helix is widened such that a β-sheet or an α-helical secondary structure can enter, allowing specific interactions with functional groups of the RNA bases from the major groove. The specificity depends to a great extent on how exactly an adaptation to the target RNA can occur. This becomes apparent from the observation that even short synthetic arginine-rich peptides possess notable affinities for RNA molecules (KD values in the range of 10⁻⁹ M).
According to thorough studies, such peptides do not discriminate by more than a factor of two between related and non-related RNA molecules. In contrast, a specific ARM RNA-binding protein like the HIV-1 Rev protein binds its specific RNA 1000-fold better than a non-related RNA. The RBD domain was described as an RNA-binding motif rather early on; it is known to occur, for instance, in the U1A protein, which binds the U1 snRNA. Proteins from this family exist in very different organisms and occur in all possible compartments and organelles.


Part IV: Nucleic Acid Analytics

Individual proteins from this family bind to various different RNA molecules. Based on their large number and functional diversity, proteins with RBD motifs form the biggest family of RNA-binding proteins. Usually an RNA hairpin is recognized by a βαβ-βαβ protein fold with characteristic aromatic side chains. Examples of RBD proteins are hnRNP A1, PABP, U2AF (U2 auxiliary factor), and the general splicing factor SF2/ASF.

A relatively common motif found in RNA-binding proteins is the KH domain. The name is derived from hnRNP K, a human RNA-binding protein. The KH motif exists singly or in several copies and confers binding towards single-stranded RNA. The domain is composed of approximately 50 amino acids containing a conserved octapeptide, -Ile-Gly-X2-Gly-X2-Ile- (X denoting any amino acid). Structure predictions indicate a three-stranded β-sheet with two α-helices arranged opposite. The ribosomal protein S3 is a typical example of a protein with a KH domain. The occurrence of KH domains in very different organisms suggests that the KH motif is, in evolutionary terms, a rather ancient protein motif.

Another well-defined RNA-binding domain is the so-called RGG box. This motif is built from 20 to 25 amino acids with closely spaced Arg-Gly-Gly (RGG) repeats, often separated by aromatic amino acids. The number of RGG motifs (RGG boxes) within different proteins can vary between 6 and 18. The arginine residues within the RGG boxes are often modified as NG,NG-dimethylarginine, which contributes to the variation of the steric properties of the motif. RGG motifs are often found in combination with other RNA-binding domains, enhancing the specificity.

The double-stranded RNA-binding domain (dsRBD) is an approximately 70 amino acid long sequence region, interspersed with numerous basic amino acids, that recognizes double-stranded RNA.
Proteins containing dsRBD motifs do not bind to double-stranded DNA, however, which indicates that the motif specifically recognizes the geometry of the A-form helix. Often, dsRBD motifs exist in multiple copies.

Rather scarce as RNA-binding motifs are zinc finger domains, although some exhibit additional DNA-binding ability. TFIIIA, for instance, binds via nine zinc fingers both to the 5S rRNA gene and to the 5S rRNA itself, whereby the three middle zinc finger domains are predominantly responsible for the RNA binding. Similar dual capacities for interaction with DNA and RNA are known for Y box proteins, which comprise several eukaryotic transcription factors. These proteins can additionally recognize RNA and often function to mask mRNAs. With respect to their structure they resemble bacterial cold-shock proteins (CSPs).

Numerous RNA-binding proteins are characterized by a so-called oligonucleotide/oligosaccharide-binding fold (OB fold). This motif consists of a five-stranded β-barrel structure. Members of the OB fold class are, besides tRNA synthetases, the bacterial termination factor Rho, which recognizes C-rich RNA, and the trp RNA-binding attenuation protein (TRAP). For both proteins the structure has been solved at high resolution. The Rho factor binds as a hexameric protein to C-rich RNA with little or no secondary structure. In contrast, TRAP recognizes a specific sequence within the trp operon RNA leader. Binding results in the distortion of a hairpin structure that normally acts as a terminator for RNA polymerase. The protein is composed of 11 identical subunits consisting exclusively of β-sheet structures. Recognition involves 11 repeats of a trinucleotide sequence (G/UAG), separated by two nucleotides each. The affinity between TRAP and the leader RNA is modulated by tryptophan: a stable RNA complex will only form when all TRAP subunits contain bound L-tryptophan.
Both the termination factor Rho and the RNA-binding protein TRAP are impressive examples of how binding of a protein to its target RNA can be accompanied by dramatic changes in the RNA secondary structure with distinct functional consequences (riboswitch).

32.8 Special Methods for the Analysis of RNA–Protein Complexes

Many of the analysis techniques described for DNA–protein complexes are suitable for RNA molecules as well. The same applies to the physical methods outlined in Section 32.5, which are generally suitable for studies of both types of complexes. The following section summarizes those types of analyses that have been developed especially for the handling of RNA–protein complexes, taking advantage of RNA-specific enzymes or chemicals.

Table 32.1 Specificities of RNases.

RNase            Substrate                          Cleavage
RNase T1         Single-stranded GpN                Gp (3´-P)
RNase U2         Single-stranded ApN                Ap (3´-P)
RNase CL3        Single-stranded CpN ≫ ApN > UpN    Cp, Ap, Up (3´-P)
RNase T2         Single-stranded NpN                Np (3´-P)
Nuclease S1      Single-stranded NpN                pN (5´-P)
RNase CVE (V1)   Double-stranded NpN                pN (5´-P)

32.8.1 Limited Enzymatic Hydrolyses

For the analysis of RNA structures several specific enzymes (RNases) are available, which differ notably in their specificity and structure selectivity from DNases (Table 32.1). Although RNases can be divided into endo- and exonucleases, there is no parallel to the large number of sequence-specific, endonucleolytically hydrolyzing restriction enzymes available for DNA. Instead, several RNases are known that are highly specific for a single nucleotide and that hydrolyze the phosphodiester chain such that the phosphate remains either at the 3´- or the 5´-side of the recognized nucleotide. Because the functional groups of the bases that are recognized are usually those involved in Watson–Crick-type base pairing, and hence not freely accessible in the paired state, most RNases are preferentially single-strand-specific. There are also RNases without distinct base specificity, which hydrolyze the RNA chain after each nucleotide A, U, G, and C with statistical frequency. By combining different RNases it is possible to gain information about each nucleotide of an RNA molecule, both in isolation and within a protein complex. In addition to the base-specific RNases, some enzymes are known with a distinct specificity for RNA secondary structure, cutting only either single-stranded or double-stranded RNA. The use of these enzymes represents the experimentally simplest and most generally applicable way to solve the secondary structure of RNA molecules, alone or in complex with a protein.
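The logic of combining a single-strand-specific nuclease (e.g., S1) with the double-strand-specific RNase CVE (V1) can be sketched in a few lines of Python. The sequence and cut positions below are invented purely for illustration:

```python
def classify_positions(seq, ss_cuts, ds_cuts):
    """Assign a secondary-structure state to each 1-based nucleotide
    position from single-strand-specific (e.g., nuclease S1) and
    double-strand-specific (RNase CVE/V1) cleavage data."""
    state = {}
    for i in range(1, len(seq) + 1):
        if i in ss_cuts and i in ds_cuts:
            state[i] = "ambiguous"       # cut by both, e.g. a breathing helix end
        elif i in ss_cuts:
            state[i] = "single-stranded"
        elif i in ds_cuts:
            state[i] = "double-stranded"
        else:
            state[i] = "no cut"          # inaccessible, e.g. protein-protected
    return state

# invented example: a 7-mer with an S1 cut at position 4 and V1 cuts at 2 and 6
states = classify_positions("GGCAUCC", ss_cuts={4}, ds_cuts={2, 6})
```

Positions cut by neither enzyme are the interesting ones when comparing free and protein-bound RNA, since loss of accessibility upon complex formation points to a protein contact.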

32.8.2 Labeling Methods

The analysis following enzymatic hydrolysis is basically always possible by direct gel electrophoresis of the cleavage products of radiolabeled RNA molecules. The labeling can be done either at the 3´- or at the 5´-ends. For the direct analysis of end-labeled RNA there is, however, a certain length limitation (about 200–400 nucleotides), depending on the resolving power of the sequencing gel system. Hence, information will only be accessible for nucleotides that are no further than 200–400 nucleotides away from the labeled end. Labeling of the 5´-ends is performed as described for DNA fragments by the transfer of a [32P]-phosphate group from [γ-32P]-ATP, catalyzed by T4 polynucleotide kinase. In the case of already 5´-phosphorylated RNA fragments, the 5´-phosphate is removed by alkaline phosphatase or a polynucleotide kinase exchange reaction with [γ-32P]-ATP is performed. Labeling at the RNA 3´-ends is carried out by ligation of a short radioactively labeled nucleotide diphosphate (typically 5´-[32P]-pCp) to the free 3´-OH group of the RNA; the appropriate enzyme is T4 RNA ligase. To facilitate the assignment of bands on the denaturing sequencing gels it is common to co-separate an aliquot of the labeled RNA after a mild statistical alkali hydrolysis. This generates a "ladder" representing cleavage products at all nucleotide positions. In addition, a limited hydrolysis of the labeled RNA with RNase T1 has proven to be of value: owing to its high specificity, it indicates the positions of all guanines within the RNA. The following should be considered for the analyses. With the enzymatic cleavage of RNA molecules it is possible that the first cut already induces dramatic structural changes, which may cause the native complexes to fall apart. As a consequence, secondary cuts may appear rapidly, which falsifies the analysis results. There is, however, an elegant way to distinguish such secondary from primary cuts. It takes advantage of the fact that primary cuts should be visible with both the 5´-end-labeled and the 3´-end-labeled RNA molecule, while secondary cuts will only show up in one of the labeled samples. For a correct analysis it is reasonable, therefore, to use samples labeled at either end and to interpret only those results that are evident with both labeled samples.
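This cross-check amounts to a set intersection of the cut positions read from the two gel lanes, as the following minimal Python sketch shows (the position lists are hypothetical gel readouts, not data from the text):

```python
def primary_cuts(cuts_5p_labeled, cuts_3p_labeled):
    """Keep only cleavage positions seen with BOTH the 5'- and the
    3'-end-labeled RNA; cuts visible with only one label are treated
    as secondary and discarded."""
    return sorted(set(cuts_5p_labeled) & set(cuts_3p_labeled))

# hypothetical positions read off the two sequencing-gel lanes:
sites = primary_cuts([12, 27, 45, 60], [12, 45, 88])
```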

32.8.3 Primer Extension Analysis of RNA

A second method used to assign cleavage or modification sites within an RNA molecule is the primer extension reaction, which in principle can also be used for the analysis of protein–DNA complexes. Primer extension reactions with RNA molecules are, however, not performed with a DNA polymerase but with reverse transcriptase. This enzyme, starting from a complementary DNA oligonucleotide hybridized to the RNA, synthesizes a cDNA in the presence of the four dNTPs as substrates. In this reaction the RNA sequence is copied from the 3´ position of the annealed oligonucleotide in the 5´ direction. The primer is extended until the 5´-end of the RNA is reached. This means that every new 5´-end generated by the cleavage reaction results in a new abortive primer extension product on the denaturing gels. Furthermore, reverse transcription stops at bases modified at positions that interfere with Watson–Crick base pairing. These are modifications at N1-Ade, N1-Gua, N2-Gua, N3-Cyt, and N3-Ura. The N7-Ade modification introduced by the DEPC reaction (see below) also suffices for a primer extension abort, although this position is not involved in typical Watson–Crick base pairing. Modifications at N7-Gua require a subsequent strand break reaction to be detected by primer extension analysis. Usually, for the analysis, the oligonucleotide is labeled at the 5´-end with [γ-32P]-ATP and polynucleotide kinase. Modified positions close to the 3´-end of the RNA appear as short cDNA sequences (at the bottom of the sequencing gel), while positions modified near the 5´-end of the RNA are reflected by longer cDNA products (in the upper part of the gel). The choice of different oligonucleotide primers allows even very long RNA molecules to be analyzed.
The only requirement is that oligonucleotides complementary to the target RNA are available at intervals of roughly 200 nucleotides, allowing the total sequence of even a very complex RNA molecule to be scanned. For the exact assignment of the primer extension products on the denaturing polyacrylamide gels, a parallel separation of products from a sequencing reaction of the non-modified target RNA with all four dideoxy-NTPs (ddNTPs) is commonly performed. A certain challenge for the interpretation of primer extension analyses is posed by sequence-dependent aborts, which may arise from very stable secondary structures or from the presence of naturally modified nucleotides, as they occur in some RNA molecules. A control with non-modified RNA is therefore advisable.
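The arithmetic for converting an abortive cDNA product into a modification site can be sketched as follows. This is an illustrative Python snippet: the one-nucleotide offset assumes the common convention that reverse transcriptase stops immediately 3´ of the modified base, and all numbers are hypothetical:

```python
def modified_position(primer_3p_template_pos, extension_length):
    """Map an abortive primer-extension product to the modified nucleotide.

    primer_3p_template_pos: RNA position (1-based from the 5' end) paired
        with the primer's 3' terminus.
    extension_length: nucleotides added to the primer before the reverse
        transcriptase stopped.

    The last nucleotide copied lies extension_length positions 5' of the
    primer terminus; the modification sits one position further 5'.
    """
    last_copied = primer_3p_template_pos - extension_length
    return last_copied - 1

# a primer whose 3' end pairs with position 200, extended by 49 nucleotides,
# points to a modification at position 150:
pos = modified_position(200, 49)
```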

32.8.4 Customary RNases

A set of commercially available purified enzymes for the analysis of RNA–protein complexes is listed in Table 32.1. They can be used universally and cover a broad range of applications for the analysis of RNA structures and RNA–protein complexes.

RNase T1 RNase T1, isolated from Aspergillus oryzae, cleaves single-stranded RNA at the 3´-side of guanosine moieties. The phosphate group remains at the 3´-terminal guanosine. The enzyme is relatively insensitive to changes in pH and remains active in the presence of 7 M urea. RNase T1 is a very frequently used enzyme because of its high specificity and limitation to only one base (guanosine). The methylated guanosines m1G and m7G occurring in natural RNA molecules are not recognized.

RNase U2 RNase U2 is isolated from Ustilago sphaerogena and cleaves at the 3´-side of all four common nucleotides in single-stranded RNA, with a strong preference for A (A > G ≫ C > U). Cleaved fragments remain phosphorylated at the 3´-end. RNase U2 has a pH optimum of 4.5, and 7 M urea hardly inhibits the cleavage. For the analysis of most complexes the conditions have to be adjusted to neutral pH, which requires correspondingly higher enzyme concentrations.


RNase CL3

RNase CL3 is isolated from chicken liver. The enzyme is preferentially used to cleave at non-paired cytosines. Cleavage also occurs after A and U but requires considerably longer reaction times or higher enzyme concentrations. In all cases 3´-phosphorylated cleavage fragments are generated. The presence of magnesium ions or spermidine enhances the activity of the enzyme.

RNase T2 RNase T2 is generally derived from Aspergillus oryzae. This nuclease cleaves at any nucleotide, even at most of the modified ones. Cleavage results in 3´-phosphorylated ends. A slight preference for cleavage at adenosine has been observed. The enzyme is active at slightly acidic and neutral pH but is inhibited by metal ions (Cu2+).

Nuclease S1 This nuclease is likewise isolated from Aspergillus oryzae and hydrolyzes DNA as well as RNA in a sequence-independent manner. Hydrolysis leads to 5´-phosphorylated ends. Nuclease S1 is the major enzyme used for the characterization of single-stranded sequences. Zn2+ ions are essential for the reaction. Cleavage is optimal at pH 4.5, and reactions at neutral pH require correspondingly higher enzyme concentrations.

RNase CVE This enzyme, often also designated V1, is isolated from the venom of the cobra Naja naja oxiana and selectively cleaves double-stranded RNA without notable sequence preference, leading to 5´-phosphorylated ends. A minimum of four to six paired nucleotides is required for recognition of a cleavage site. Occasionally, non-paired sequences may be cleaved if they are in a tightly stacked conformation. The enzyme requires Mg2+ ions for activity and is strongly inhibited by EDTA.

32.8.5 Chemical Modification of RNA–Protein Complexes

Chemical reagents can be applied to the structural characterization of RNA–protein complexes in a similar fashion as described for DNA–protein complexes. Owing to the different secondary structures of RNA and the correspondingly different accessibilities of the functional groups of the bases, additional reagents are in use to account for the different reactivity of RNA molecules. Here, too, the smaller size of chemical reagents compared to enzymes enables deeper penetration into complex structures, contributing to a higher resolution of the analysis. The specificity of the chemical reaction yields not only information about the nucleotides involved in RNA–protein complex formation. Because the reagents attack very distinct functional positions of the nucleic acid chain, information about the accessibility of these positions is provided at an atomic level. Among these positions are the N1 and N2 positions of guanine, the N7 and N1 positions of the purine bases, the N3 position of the pyrimidines, and the phosphate groups of the nucleic acid backbone. By combining reagents it is possible to detect differences responsible for the formation of Watson–Crick base pairs or alternative pairings (e.g., Hoogsteen pairs), or differences resulting from deviations in the sugar-phosphate backbone. It is important to note that chemical modifications primarily indicate changes in reactivity. This is not necessarily identical with changes in the stereochemical accessibility of the molecule. Another point demanding caution is that many chemical reagents can also modify functional groups of proteins. It is obligatory, therefore, to test whether the complexes remain intact under the reaction conditions.
Moreover, buffers containing amino groups should be avoided for chemical modification reactions; Tris buffers, for instance, are inappropriate and should be replaced by HEPES, phosphate, or cacodylate buffers. On the other hand, chemical modifications are much less dependent on the presence of Mg2+ ions or EDTA and, in contrast to many enzymes, are not confined to narrow pH optima. In contrast to RNases, chemical modifications do not directly lead to cleavage of the RNA chain. To localize the modified position, either an additional cleavage reaction must be performed or, provided the modification causes a polymerase abort, the analysis can be completed by a subsequent primer extension reaction. The reagents described below have proven to be of special value for the analysis of RNA structures and RNA–protein complexes (Table 32.2; Figure 32.20).

Kethoxal (α-Keto-β-ethoxybutyraldehyde) Kethoxal reacts specifically with non-paired guanosines by forming a five-membered ring involving the guanine N1 atom and the amino group at C2. The product is stable at acidic pH and can be stabilized further in the presence of borate. Kethoxal-modified guanosines within an RNA chain are no longer attacked by RNase T1, which can be utilized for their identification. Because the kethoxal adduct is alkali-labile, modified positions within an RNA chain can be identified relatively easily by RNase T1 hydrolysis before and after alkali treatment. Kethoxal penetrates membranes and cell walls and can therefore also be applied for in vivo studies with intact cells.

Table 32.2 Reagents for the chemical modification of RNA.

Reagent       Specificity                 Analysis method a),b)         Remarks
Kethoxal      N1-Gua, N2-Gua              PE                            Adduct RNase T1-resistant; application in vivo possible
DMS           N3-Cyt                      Cleavage, PE                  Application in vivo possible
              N1-Ade                      PE
              N7-Gua                      Cleavage, PE after cleavage
DEPC          N7-Ade                      Cleavage, PE
ENU           Sugar-phosphate backbone    Cleavage, PE after cleavage
OH radicals   Sugar-phosphate backbone    Directly, PE                  Often only applicable as interference method

a) PE: primer extension. b) Cleavage: treatment with aniline.

Figure 32.20 Structural formulas of RNA-modifying reagents.

DMS (Dimethyl Sulfate)

DMS, equally suited for the modification of DNA, can be employed under special conditions for the selective analysis of accessible positions of guanine (N7-Gua), cytidine (N3-Cyt), and adenine (N1-Ade) within RNA. The reaction of DMS with guanine occurs at the N7 position, introducing a methyl group and a positive charge into the imidazole ring. The slightly destabilized 7,8-double bond can be reduced with sodium borohydride at mildly alkaline pH. At the resulting intermediate product (m7-dihydroguanosine) a strand scission can be performed with aniline. The reaction discloses guanine residues whose structural contribution rests on non-Watson–Crick base pairs; strong stacking interactions inhibit the modification. DMS reacts with non-paired cytidines at the N3 position. The modification can be identified by primer extension because the modified cytosine gives rise to a reverse transcription abort. Alternatively, a cleavage reaction can be performed with end-labeled RNA. This requires a reaction of the modified base with hydrazine prior to the cleavage of the chain by aniline. Note that the cleavage reaction results in unique products with 3´-end-labeled RNA molecules, yielding sharper bands on the subsequent sequencing gel than 5´-end-labeled RNAs! For the analysis of single-stranded adenosines the DMS reaction with N1 of adenine, which yields 1-methyladenine, can be used. The subsequent analysis can only be performed by primer extension because no chemical strand scission reaction is known for this modification. DMS reacts with many proteins; hence, for the analysis of RNA–protein complexes it is advisable to test the stability of the complexes under the reaction conditions. Because the hydrolysis of DMS in water leads to the formation of sulfuric acid, it is important to buffer the drop in pH at high reagent concentrations.

DEPC (Diethyl Pyrocarbonate)

The reaction of diethyl pyrocarbonate (DEPC) causes a carbethoxylation of the N7 position of purines. DEPC can favorably be used under neutral conditions to test the participation of the N7 position of adenine in tertiary interactions. Watson–Crick base-paired adenosines in helical regions are not reactive with DEPC. The modification is usually identified with end-labeled RNA by an aniline-catalyzed strand scission. At mildly acidic pH the N7-Gua position, and in slightly alkaline buffer the N3-Ura position, are also carbethoxylated. Like DMS, DEPC is known for its reactivity towards proteins, which explains its preferred use for structural analyses of isolated RNA. If RNA–protein complexes are studied, their stability should be assured in advance.

CMCT (1-Cyclohexyl-3-(2-morpholinoethyl)carbodiimide metho-p-toluenesulfonate) CMCT reacts with uridine (N3) and guanosine (N1) if the respective bases are not paired according to Watson–Crick. Both modifications can be analyzed by primer extension. For the analysis of RNA–protein complexes, CMCT is actually only used in special cases.

ENU (Ethylnitrosourea) This N-nitroso alkylating reagent attacks the oxygen atoms of the phosphate backbone and thus differs from, for example, DMS, which attacks the ring nitrogen atoms of the bases. The RNA phosphotriesters resulting from ethylnitrosourea (ENU) reactions are unstable and can easily be cleaved under mildly alkaline conditions. ENU preferentially attacks phosphates oriented in exact helical geometry. The participation of the phosphates in tertiary interactions or in H-bonds to amino acid side chains, as well as cation coordination, reduces the reactivity. For the analysis of RNA–protein complexes ENU is usually only used under damage-selection conditions.

Hydroxyl Radicals The reaction of hydroxyl radicals with DNA–protein complexes has already been described in detail. Hydroxyl radicals (•OH), generated by the Fenton reaction (32.5), can also successfully be used for the analysis of RNA–protein complexes. The radicals that arise from the reaction of (Fe2+)-EDTA complexes with H2O2 (32.6) attack the C4´ of the ribose, causing a strand scission. The reaction is independent of the base sequence and only marginally influenced by the secondary structure. Protein contacts or notable changes in the groove geometry, however, cause differences in the modification pattern. The modification pattern is visualized by autoradiography of the sample separated on a sequencing gel, either directly with end-labeled RNA or after a primer extension reaction with non-labeled RNA. The analysis of hydroxyl radical-modified RNA generally provides information about almost all nucleotides of the molecule; regions of protection or enhanced reactivity are best identified when the profiles resulting from densitometry of the gel lanes of free and complexed RNA are superimposed.
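The superposition of densitometry profiles reduces, in its simplest form, to a per-position intensity ratio. The Python sketch below is an illustration (thresholds and the toy traces are arbitrary assumptions, not values from the text); it scores each nucleotide as protected, enhanced, or unchanged:

```python
def protection_profile(free_trace, complexed_trace, lo=0.5, hi=1.5):
    """Compare per-nucleotide band intensities of free and
    protein-complexed RNA after hydroxyl-radical cleavage. Ratios below
    `lo` indicate protection by the protein; ratios above `hi` indicate
    enhanced reactivity (e.g., a widened groove)."""
    calls = {}
    for i, (f, c) in enumerate(zip(free_trace, complexed_trace), start=1):
        ratio = c / f if f > 0 else float("nan")
        if ratio < lo:
            calls[i] = "protected"
        elif ratio > hi:
            calls[i] = "enhanced"
        else:
            calls[i] = "unchanged"
    return calls

# toy traces for four nucleotides:
calls = protection_profile([1.0, 0.9, 1.1, 1.0], [0.95, 0.30, 1.90, 1.02])
```

In practice the traces would first be normalized to equal total lane intensity before taking ratios.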
Fe-BABE (Fe 1-(p-Bromoacetamidobenzyl)ethylenediaminetetraacetic Acid) The reagent Fe-BABE, conjugated to proteins, has proven to be of special value for the analysis of large, complex RNA–protein particles. It enables the elucidation of complicated neighborhood relations between RNA molecules and different ligands. Valuable information has been obtained on the topography of ribosomes and of RNA polymerases bound to nucleic acids. Because the reagent can also be coupled to RNA, spatial information on RNA–RNA as well as RNA–protein interactions can be obtained.

In-Line Probing During the hydrolysis of the 3´-5´-phosphodiester bond of single-stranded RNA, in contrast to double-stranded RNA in the A-form, the 2´-oxyanion and the neighboring 5´-oxyanion leaving group can adopt an in-line orientation. This orientation significantly facilitates a transesterification reaction and thereby the spontaneous or metal-catalyzed hydrolysis of non-helical RNA regions. As a result, internucleotide bonds within RNA sequences engaged in secondary structure undergo spontaneous hydrolysis less frequently than unstructured RNA sequences. This fact can successfully be used to distinguish helical from single-stranded RNA sequences. The method, termed in-line probing, is especially effective in deriving structural information on RNA interacting with ligands or metal ions. For the analysis, an end-labeled RNA is incubated at room temperature for about 40 h in the presence of Mg2+ ions, and the products resulting from spontaneous cleavage are separated on denaturing gels and visualized by autoradiography. A limited RNase T1 hydrolysis, showing all guanosine positions, and an alkaline hydrolysis lane, showing statistically all sequence positions of the end-labeled RNA, are generally used for sequence assignment at the nucleotide level.

SHAPE Analyses

A recently developed method for the analysis of RNA secondary and tertiary structures, designated by the acronym SHAPE (selective 2´-hydroxyl acylation analyzed by primer extension), can also be applied to study RNA–protein complexes. The method is based on the difference in reactivity of single-stranded versus paired nucleotides of an RNA molecule towards electrophilic reagents like NMIA, 1M7 (1-methyl-7-nitroisatoic anhydride), or benzoyl cyanide (BzCN). The reagents react selectively with the ribose 2´-OH group, and local differences in the mobility/flexibility of the nucleotides can be recorded because more flexible, less constrained nucleotides, as they exist in single-stranded RNA, are found more frequently in a conformation that is reactive towards the electrophilic reagents. As a result, nucleotides in single-stranded regions or nucleotides not fixed by protein contacts show stronger modification by the SHAPE reagents than nucleotides involved in stable base pairing or fixed by tertiary interactions. Because the SHAPE modification causes an abort of cDNA synthesis, the modified positions can readily be identified by a primer extension reaction. The method provides information at the nucleotide level about the structure and dynamics of RNA molecules or RNA–protein complexes. It can furthermore be performed as a high-throughput technology (hSHAPE), in which different fluorescently labeled primers are used and the resulting cDNA products are separated by capillary electrophoresis.
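Raw SHAPE band or capillary intensities are usually rescaled before interpretation. One widespread convention (assumed here for illustration, not prescribed by the text) discards the top 2% of values as outliers and normalizes to the mean of the next 8%, as in this Python sketch:

```python
def normalize_shape(reactivities):
    """2%/8% normalization of SHAPE reactivities: drop the top 2% of
    values as outliers and divide everything by the mean of the next 8%,
    so that typical flexible (single-stranded) nucleotides score near 1."""
    ranked = sorted(reactivities, reverse=True)
    n_out = max(1, int(0.02 * len(ranked)))   # outliers to discard
    n_ref = max(1, int(0.08 * len(ranked)))   # reference window
    scale = sum(ranked[n_out:n_out + n_ref]) / n_ref
    return [r / scale for r in reactivities]

# toy profile for ten nucleotides (arbitrary values):
norm = normalize_shape([0.1, 0.2, 2.5, 0.05, 1.8, 0.3, 0.0, 0.9, 2.2, 0.4])
```

After normalization, values near 0 suggest paired or protein-fixed nucleotides, while values around 1 or higher suggest flexible, single-stranded positions.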

32.8.6 Chemical Crosslinking

Chemical Crosslinking, Section 6.3

For the identification of contact sites between binding partners, chemical crosslinking is still a method of choice. As an advantage, crosslinking can convert weak, non-covalent interactions between proteins and nucleic acids into covalent and chemically stable conjugates. For that reason even complexes of low stability and short lifetime can be characterized by crosslinking. Note, however, that with crosslinking only positive results can be interpreted, because the absence of a crosslink may have chemical or steric reasons and does not rule out a direct contact! For crosslink analyses different bifunctional reagents are available that possess protein-specific as well as nucleic acid-specific reactive groups (hetero-bifunctional reagents). Each crosslinking reagent has a characteristic distance between its functional groups (Table 32.3). This distance limits the precision of the analysis. Because only information on closely neighboring functional groups is desired from crosslinking experiments, reagents spanning short distances (about 1 nm) are preferred. Nucleic acids can also be crosslinked by direct irradiation with energy-rich UV light between 250 and 280 nm; this type of crosslinking occurs at virtually zero length. Crosslink reactions often suffer from low yields, which sometimes enforces more drastic reaction conditions. Crosslinks from such studies are not always restricted to the protein–nucleic acid contact (intermolecular), and numerous intramolecular links within the nucleic acid, as well as degradation caused by photocleavage or oxidation, have to be accepted. This can in many cases be avoided by the use of crosslinking reagents with photoreactive groups that are activated by mild UV radiation (>300 nm); 2-iminothiolane is such a reagent, for instance.
Table 32.3 Some crosslinking reagents and their properties.

| Reagent | Reaction conditions | Protein side | Nucleic acid side | Distance (nm) |
|---|---|---|---|---|
| 2-Iminothiolane | UV (350 nm) | –NH2 | uridine | 0.5 |
| Diepoxybutane (DEP) | 37 °C, pH 7 | –NH2 (Lys), –SH (Cys) | N7-G | 0.46 |
| Methyl-p-azidophenylacetimidate (APAI) | 1. Tris buffer; 2. 300–460 nm | –NH2 (Lys) | non-specific | ca. 1 |
| Formaldehyde (H2C=O) | 37 °C, pH 8 | Lys, Arg, His, Trp | C, G, A, T; amino and imino groups | 0.2 |

Photoreactive reagents can also be used for stepwise crosslinking, with a first chemical reaction to functional groups of a binding protein followed, after formation of the protein–nucleic acid complex, by a second reaction with the nucleic acid initiated by mild UV radiation. There are several different strategies for the subsequent localization of the contact points at either the protein or the nucleic acid level. The originally very laborious conventional peptide fingerprint methods used to identify the amino acids involved have been replaced by fast and sensitive mass spectrometric (MS) techniques that allow a high sample throughput and work with protein amounts in the picomole range. Basically, the crosslinked complex is hydrolyzed by nucleases and proteases. The protein can, for instance, be digested with endopeptidase Lys-C, while nucleic acids (in the case of RNA) are hydrolyzed by RNase T1. The resulting peptide–oligonucleotide adducts are then purified by column chromatography or gel electrophoresis. The purified hetero-conjugates are subsequently sequenced by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry, which allows determination of the sequences of both the peptide and the nucleic acid and identification of the point of the crosslink. With chemical crosslinking reagents (in contrast to direct UV crosslinking), the exact mechanism of the crosslinking reaction must be known to derive the exact mass of the hetero-conjugate consisting of the peptide moiety, the crosslinking reagent, and the oligonucleotide.
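The mass bookkeeping behind such an identification can be sketched as follows. The residue masses are standard monoisotopic values; the example sequences and the crosslink mass shift `delta` are hypothetical illustrations, not data from any particular experiment:

```python
# Sketch: predicting the neutral monoisotopic mass of a peptide-oligonucleotide
# hetero-conjugate, as needed to assign crosslinked species in MALDI-TOF spectra.

H2O = 18.010565  # monoisotopic mass of water (Da)

# Monoisotopic residue masses (Da) of amino acids within a peptide chain (subset)
AA = {"A": 71.03711, "C": 103.00919, "K": 128.09496, "L": 113.08406,
      "S": 87.03203, "G": 57.02146, "R": 156.10111, "V": 99.06841}

# Monoisotopic residue masses (Da) of ribonucleotides within an RNA chain
NT = {"A": 329.05252, "C": 305.04129, "G": 345.04744, "U": 306.02530}

def peptide_mass(seq):
    """Neutral mass of a linear peptide with free N- and C-termini."""
    return sum(AA[a] for a in seq) + H2O

def rna_mass(seq):
    """Neutral mass of an RNase T1 fragment (5'-OH, linear 3'-phosphate).
    A 2',3'-cyclic phosphate product would be one H2O lighter."""
    return sum(NT[n] for n in seq) + H2O

def conjugate_mass(pep, rna, delta=0.0):
    """Mass of the crosslinked hetero-conjugate; delta is the reagent-dependent
    mass shift (0 here, as for an idealized zero-length UV crosslink)."""
    return peptide_mass(pep) + rna_mass(rna) + delta

print(round(conjugate_mass("GLAK", "AUG"), 3))   # -> 1385.384
```

In a real assignment, `delta` would be derived from the known chemistry of the reagent, which is exactly why the crosslinking mechanism must be known.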

32.8.7 Incorporation of Photoreactive Nucleotides

Instead of using bifunctional reagents or non-directional UV irradiation, crosslinking can also be achieved by the directed incorporation of photoreactive nucleotides. A set of different nucleotide analogs is available that can be integrated into the nucleic acid chain either by enzymatic incorporation or by chemical synthesis. They may alternatively be added to the 3´ or 5´ ends. For the analysis of protein–nucleic acid complexes the incorporation of 4-thio-UTP (4S-UTP) has proven valuable. This nucleotide analog is accepted by T7 RNA polymerase and can be positioned by in vitro transcription in place of UTP, either randomly or, following suitable transcription strategies, at defined positions within an RNA chain. 4S-uracil has the advantage of behaving very similarly to uracil, so no notable changes in the structure of the nucleic acid need be expected. The excitation wavelength for crosslinking (suitable for RNA with and without bound protein) is 300 nm, which poses no risk to the integrity of RNA–protein complexes. The crosslink positions within the RNA can generally be identified by comparative primer extension reactions with irradiated and non-irradiated RNAs. Photoreactive nucleotide derivatives likewise exist for directed incorporation into DNA, enabling crosslinking analyses of DNA–protein complexes as well. The strategies for incorporation differ here, of course. Derivatized oligonucleotides serving as primers for a PCR reaction, which can place the reactive nucleotides at any desired position within a DNA fragment, are the method of choice. Alternatively, crosslinking nucleotides can be placed at 5´ overhanging ends of DNA fragments by appropriate DNA polymerase fill-in reactions with photoreactive deoxynucleotide triphosphates as substrates. The fragment can then be flanked by a second DNA fragment in a subsequent ligation reaction.
When choosing the reaction conditions it must be considered that some reagents are light-sensitive, so all steps of the procedure have to be performed under darkened laboratory conditions. Some photoreactions are disturbed by the presence of reducing agents; photoreactive azides, for example, can be reduced to non-reactive amines. The thiol reagents 2-mercaptoethanol and DTT are also among the interfering substances. These substances, often added to protect SH groups in proteins, should therefore not exceed a concentration of 50–100 μM.

32.8.8 Genome-Wide Identification of Transcription Start Sites (TSS)

The quantitative determination of the total RNA of an organism by microarray technology is a very useful method for characterizing changes in expression at the RNA level caused by alterations of the external conditions or as a consequence of mutations (transcriptome analyses). Today, the identification of transcripts relies increasingly on high-throughput sequencing techniques (RNA-Seq) instead of hybridization-based methods. The RNA-Seq procedure is based on the conversion of the total RNA preparation into a library of shorter cDNA fragments. Sequencing adaptors are ligated to these fragments and the sequence is established by current high-throughput techniques (e.g., 454 pyrosequencing). The initially obtained short sequences (sequence reads) are aligned with the reference genome, which often enables a transcription profile of all genes of the organism to be established at nucleotide resolution. Conventional transcriptome analyses generally record the steady-state concentration of all RNAs present, as long as they are represented on the microarray chips. It is not possible to distinguish between primary transcripts and processed RNAs. There are, however, suitable methods to

Transcriptome Analysis, Section 37.1
High-Throughput Sequencing Techniques, Section 30.2.1


Part IV: Nucleic Acid Analytics

determine, on a genome-wide basis, direct changes of the primary transcripts, allowing a primary transcriptome to be mapped and annotated accordingly. If the exact transcription start sites are known, microarrays with overlapping probes (tiling arrays) can be used to quantify the RNA level in direct proximity to the promoter. In eukaryotic systems primary transcripts can also be identified by their 5´ cap structure, and suitable procedures for the specific enrichment and identification of cap structures exist (cap analysis of gene expression, CAGE). For prokaryotic transcripts, which do not have a cap, advantage can be taken of the fact that prokaryotic primary transcripts generally carry a 5´ triphosphate, whereas the 5´ end of processed RNA is characterized by a 5´ monophosphate. For TSS determination in prokaryotic systems, tiling arrays and RNA-Seq procedures are therefore combined with a technique that distinguishes between RNAs with a 5´ triphosphate and a 5´ monophosphate moiety. For this procedure two different cDNA libraries are established. The first library is prepared from non-treated total RNA, while the RNA for the second library is treated with terminator exonuclease (TEX), which digests RNA with a 5´ monophosphate from the 5´ end but not RNA with a 5´ triphosphate or a free 5´ OH group (or RNA with a 5´ cap). The primary transcripts enriched in this way are then identified by RNA sequencing techniques (e.g., 454 pyrosequencing). The method is designated differential RNA-seq (dRNA-seq) and allows genome-wide localization of RNA polymerase start sites (TSS or promoter regions).
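The enrichment logic behind this library comparison can be sketched as a simple per-position test on 5´-end read counts. The counts, pseudocount, and thresholds below are illustrative assumptions, not parameters of any published dRNA-seq pipeline:

```python
# Sketch: calling candidate TSS positions from dRNA-seq data. Positions whose
# 5'-end read counts are enriched in the TEX-treated library (where primary,
# 5'-triphosphate transcripts survive digestion) relative to the untreated
# library are reported as candidate transcription start sites.

def call_tss(tex_counts, untreated_counts, min_reads=10, min_ratio=2.0):
    """Return candidate TSS positions (0-based) from two coverage vectors
    of 5'-end read counts along the genome."""
    tss = []
    for pos, (tex, unt) in enumerate(zip(tex_counts, untreated_counts)):
        ratio = (tex + 1) / (unt + 1)          # pseudocount avoids division by zero
        if tex >= min_reads and ratio >= min_ratio:
            tss.append(pos)
    return tss

tex = [0, 2, 50, 3, 0, 40, 1]   # made-up 5'-end counts, TEX-treated library
unt = [0, 3, 10, 4, 0, 30, 2]   # made-up 5'-end counts, untreated library
print(call_tss(tex, unt))        # -> [2]: enriched ~4.6-fold in the TEX library
```

Position 5 illustrates the point of the comparison: it is well covered in both libraries (a processed 5´ end) and is therefore not called.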

32.9 Genetic Methods

32.9.1 Tri-hybrid Method

Two-Hybrid System, Section 16.1

Figure 32.21 Scheme of a tri-hybrid system for the in vivo characterization of RNA–protein recognition. Source: according to Bernstein, D. et al. (2002) Methods, 26, 123–141. With permission, Copyright © 2002 Elsevier Science (USA).

Like the two-hybrid system, which serves for the identification of interacting proteins within the cell, an analogous procedure, the tri-hybrid system, serves to identify RNA-binding proteins. The principle is very similar to the two-hybrid system, which is based on the activation of reporter genes within the cell, for instance by the transcriptional activator GAL4. This factor is expressed in the cell from special expression vectors as two separate fusion proteins. One of these proteins contains the activation domain, the other the DNA-binding domain. When both proteins interact the transcription factor activity is regenerated and the reporter gene can be transcribed. In this way, the two-hybrid system serves to demonstrate protein–protein interactions. The tri-hybrid system goes one step further. Again two fusion proteins are expressed, which in this case consist, for instance, of the GAL4 DNA-binding domain fused to the protein RevM10 (a mutated HIV Rev protein), and of the GAL4 activation domain fused to a putative RNA-binding protein. In addition, a hybrid RNA, consisting of a binding sequence for the RevM10 protein (RRE) and the RNA to be examined, is transcribed in the cell by RNA polymerase II. The activation of the reporter gene depends exclusively on the binding between the RNA to be studied and the putative RNA-binding protein, by which the functional elements for transcription activation are brought together (Figure 32.21). Alternatively, the tri-hybrid method can make use of fusion proteins from bacterial transcription factors when the reporter

32 Protein–Nucleic Acid Interactions


gene contains the respective regulatory sequences in the vicinity of the promoter. For instance, for the LexA operator, a fusion protein of the LexA repressor and the MS2 phage coat protein can be expressed together with a fusion of an activation domain with iron-regulatory protein 1. As hybrid RNA, a fusion with binding sites for the MS2 coat protein and the iron-responsive element (IRE) can be used. In addition, systems have been described that take advantage of the interaction between the HIV transactivator protein (Tat) and an RNA with the HIV trans-activation response element (TAR). Generally, tri-hybrid systems are especially appropriate for identifying specific RNA-binding proteins or target RNAs in vivo from a library of cDNAs, or for analyzing structural peculiarities of an already identified RNA–protein interaction. In the latter case mutant RNA molecules are expressed. There also exists a one-hybrid screen for the identification of proteins binding specific DNA sequences. In this case the DNA sequences are cloned upstream of a yeast reporter gene (target element). The proteins to be analyzed are cloned as fusion proteins with an activation domain for the reporter gene. A convenient combination is, for instance, the GAL4 activation domain with its recognition sequence and β-galactosidase or HIS3 as reporter gene. For screening experiments a yeast strain is constructed that carries this reporter gene, with the target element positioned upstream, integrated into its genome (reporter strain). This strain is then transformed with a cDNA library of candidate genes containing the protein sequences to be analyzed fused to the activation domain (GAL4 system). Positive candidates can be identified by a color reaction on corresponding agar plates.

32.9.2 Aptamers and the Selex Procedure

The Selex (systematic evolution of ligands by exponential enrichment) method combines the possibilities of biochemical in vitro synthesis with the potential of genetic selection. It involves the in vitro selection of RNA (or DNA) molecules, roughly 20 to 60 nucleotides long, with specific properties, such as binding of small ligands, protein binding, or catalytic activity, from an enormous pool of random sequences. The Selex procedure has been used successfully in the past for the characterization of protein binding sites and of the binding of various ligands to nucleic acids, as well as for the construction of ribozymes. The selected nucleic acids exhibit the properties of highly specific receptors and are designated aptamers. The term aptamer (Latin: aptus, to fit) indicates that, out of a huge number of non-functional molecules (>10^15), those with a suitable function were enriched. First, a random library of DNA or RNA molecules is generated whose middle section is randomized as much as possible. The ends of the molecules, in contrast, contain known sequences, which serve as primer binding sites for PCR amplification. The evolutionary Selex method is based on the separation of those molecules from the total pool that exhibit binding properties (or catalytic activity). This is achieved, for instance, by the use of an affinity column. Those molecules with the best retention are used for further rounds of selection. In the case of RNA they are subjected to reverse transcription, converted into double strands, and amplified by PCR. Usually, the ends of the molecules contain a sequence for the phage T7 RNA polymerase promoter, which facilitates the synthesis of correspondingly large amounts of the enriched single-stranded RNA.
From this pool of RNAs the fraction of functional molecules is separated again by a new binding reaction (or test for catalytic activity), and the cycle is repeated until no further improvement in binding can be observed (about 10–20 cycles). The stringency of binding may be increased between individual cycles, or the pool may be re-randomized by deliberately inaccurate cDNA synthesis, enhancing the rate of evolution. Usually, the number of sequence variants in the starting pool is limited: for an RNA 220 nucleotides in length there are theoretically 4^220 ≈ 10^132 possible sequences. Such an enormous amount of RNA cannot be realized; generally, one starts with about 10^15 sequences. Renewed randomization during amplification then allows the active pool to evolve further. With the Selex procedure numerous nucleotide sequences with amazing properties have been generated. These include molecules with a specific binding capacity for small biomolecules such as amino acids, sugars, dyes, or antibiotics. In addition, aptamer molecules with catalytic properties not yet observed in nature have been generated. Of special interest for the analysis of protein–nucleic acid interactions is the fact that RNA-binding motifs for specific proteins, for instance the HIV Rev protein, could be evolved. Sequence comparison and the measured binding constants allow conclusions to be drawn about the specific structural elements involved in protein binding.
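A toy simulation may help to make the cycle of selection and mutagenic amplification concrete. The pool size, the motif-counting "affinity" score, and the mutation rate are all invented for illustration; a real Selex experiment selects by physical partitioning, not by a scoring function:

```python
# Toy simulation of a Selex cycle: random pool -> select best "binders" ->
# re-amplify with mutations -> repeat. Counting occurrences of a fictitious
# binding motif stands in for retention on an affinity column.
import random

random.seed(1)
ALPHABET = "ACGU"

def random_pool(n, length):
    return ["".join(random.choice(ALPHABET) for _ in range(length))
            for _ in range(n)]

def score(seq, motif="GGAC"):
    return seq.count(motif)          # illustrative stand-in for binding affinity

def mutate(seq, rate=0.05):
    """Inaccurate amplification: each position mutates with probability rate."""
    return "".join(random.choice(ALPHABET) if random.random() < rate else b
                   for b in seq)

def selex(pool, rounds=10, keep=0.1):
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)
        winners = pool[:max(1, int(len(pool) * keep))]   # best binders retained
        # re-amplify with mutations back to the original pool size
        pool = [mutate(random.choice(winners)) for _ in range(len(pool))]
    return max(pool, key=score)

best = selex(random_pool(200, 30))
print(best, score(best))
```

Even this toy version shows the two levers discussed above: tightening selection (smaller `keep`) raises stringency, while a higher mutation `rate` re-randomizes the pool between cycles.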

Selection of Aptamers (SELEX), Section 38.4.1


32.9.3 Directed Mutations within Binding Domains

Once an RNA–protein (or likewise a DNA–protein) binding partner has been identified and the binding sequence within the nucleic acid has been localized, subsequent directed or statistical in vitro mutagenesis procedures, combined with binding experiments, can yield a very precise description of the molecular basis of the protein–nucleic acid recognition. In recent years the methods for targeted or comprehensive mutagenesis have been refined and simplified. The use of oligonucleotides has become a standard application for the molecular biologist, and numerous commercially available kits facilitate fast, PCR-based mutagenesis techniques. For a systematic change of preferably all nucleotides or amino acids potentially involved in the interaction, mutagenesis methods designated linker scanning and alanine scanning have been developed. In linker-scanning mutagenesis a fixed sequence (linker) is used for the systematic replacement of nucleotide sequences over a defined sequence region. In the method as originally developed, two nested series of 5´ and 3´ deletions starting from a restriction enzyme recognition site of the target DNA are created first. To each end of the deletion constructs the same short oligonucleotide (linker) is ligated. Subsequently, pairs of 3´- and 5´-truncated deletion constructs that add up to the original length are ligated, so that the linker sequence is located at a different position in each newly formed fragment. As a result, the position of a defined sequence is permuted over a large range. In this way position effects of binding sites can be varied systematically. The isolated constructs can then be used for binding studies with any suitable assay and the quality of complex formation tested for each separate construct.
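The end result of linker scanning, a fixed linker walking across the target while the overall length stays constant, can be sketched in a few lines. The target sequence and linker below are made up:

```python
# Sketch of linker-scanning constructs: the fixed linker replaces a window of
# the target sequence at every offset, so the contribution of each sub-region
# to binding can be tested construct by construct.

def linker_scan(target, linker):
    """Yield (position, construct) with `linker` substituted at each offset;
    every construct has the same length as the original target."""
    w = len(linker)
    for i in range(len(target) - w + 1):
        yield i, target[:i] + linker + target[i + w:]

target = "AATTGCGCAATTGGCC"      # made-up binding region
for pos, construct in linker_scan(target, "CCGG"):
    print(pos, construct)
```

Each printed construct corresponds to one member of the set of deletion/religation products described above, with the linker at a different position.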
In the alanine-scanning method, codons within the coding sequence of a binding protein of interest are systematically replaced with codons for the neutral amino acid alanine (GCU, GCA, GCC, GCG). This is achieved by standard mutagenesis procedures; the gene sequence must of course be known and accessible for mutagenesis. The mutated sequences have to be cloned into an expression vector, and potential effects on binding can be analyzed in vivo, after transformation of a suitable strain, by a screening procedure (tri-hybrid assay). Frequently, the altered proteins need to be studied directly. This is of course a laborious step; it is advisable to choose an expression vector that adds a His-tag to the mutant protein gene, which enables a one-step enrichment of the expressed protein variants from the cell extract. If the aim is to learn the molecular details of a distinct binding mechanism, generally only target-directed mutations are created; for instance, charged or hydrophobic amino acids would be replaced by neutral ones. In the same manner, binding regions of the nucleic acid components can of course also be mutated by single-nucleotide replacements. In this case, already existing information on the nature of the binding site is essential; for studies like this, Selex experiments, for instance, can provide excellent information. In any case, in vitro mutagenesis of binding sites in combination with binding experiments represents a simple but very powerful approach with great potential to elucidate the molecular mechanisms of protein–nucleic acid interactions.
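At the sequence level, the systematic replacement can be sketched as follows. The coding sequence is hypothetical, and GCU is chosen arbitrarily among the four alanine codons:

```python
# Sketch of alanine scanning at the sequence level: every non-alanine codon in
# a (hypothetical) coding sequence is replaced in turn with an alanine codon,
# producing one single-mutant variant per position.

ALA_CODONS = {"GCU", "GCC", "GCA", "GCG"}

def alanine_scan(cds, ala="GCU"):
    """Yield (codon_index, mutant_cds) for each codon replaced by alanine."""
    assert len(cds) % 3 == 0, "coding sequence must be a whole number of codons"
    for i in range(0, len(cds), 3):
        if cds[i:i + 3] not in ALA_CODONS:      # skip positions already alanine
            yield i // 3, cds[:i] + ala + cds[i + 3:]

cds = "AUGAAACGUGCAUGG"                          # Met-Lys-Arg-Ala-Trp (made up)
for idx, mutant in alanine_scan(cds):
    print(idx, mutant)
```

Each mutant would then be cloned and expressed as described, and its binding behavior compared with the wild type.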

Further Reading
Bernstein, D., Buter, N., Stumpf, C., and Wickens, M. (2002) Analyzing mRNA–protein complexes using a yeast three-hybrid system: methods and applications. Methods, 26, 123–141.
Bustamante, C. and Rivetti, C. (1996) Visualizing protein–nucleic acid interactions on a large scale with the scanning force microscope. Annu. Rev. Biophys. Biomol. Struct., 25, 395–429.
Dame, R.T., Wyman, C., and Goosen, N. (2001) Structural basis for preferential binding of H-NS to curved DNA. Biochimie, 83 (2), 231–234.
Draper, D.E. (1995) Protein-RNA recognition. Annu. Rev. Biochem., 64, 593–620.
Famulok, M. and Verma, S. (2002) In vivo-applied functional RNAs as tools in proteomics and genomic research. Trends Biotechnol., 20, 462–466.
Favre, A., Saintome, C., Fourrey, J.L., Clivio, P., and Laugaa, P. (1998) Thionucleobases as intrinsic photoaffinity probes of nucleic acid structure and nucleic acid-protein interactions. J. Photochem. Photobiol. B, 42 (2), 109–124.
Fried, M.G. and Daugherty, M.A. (1998) Electrophoretic analysis of multiple protein–DNA interactions. Electrophoresis, 19 (8–9), 1247–1253.
Hermann, T. and Westhof, E. (1999) Non-Watson–Crick base pairs in RNA–protein recognition. Chem. Biol., 6 (12), 335–343.

Hill, J.J. and Royer, C.A. (1997) Fluorescence approaches to study of protein–nucleic acid complexation. Methods Enzymol., 278, 390–416.
Huber, P. (1993) Chemical nucleases: their use in studying RNA structure and RNA–protein interactions. FASEB J., 7, 1367–1374.
Jensen, O.N., Kulkarni, S., Aldrich, J.V., and Barofsky, D.F. (1996) Characterization of peptide-oligonucleotide heteroconjugates by mass spectrometry. Nucleic Acids Res., 24, 3866–3872.
Lane, D., Prentki, P., and Chandler, M. (1992) Use of gel retardation to analyze protein–nucleic acid interactions. Microbiol. Rev., 56 (4), 509–528.
Moine, H., Cachia, C., Westhof, E., Ehresmann, B., and Ehresmann, C. (1997) The RNA binding site of S8 ribosomal protein of Escherichia coli: Selex and hydroxyl radical probing studies. RNA, 3 (3), 255–268.
Mukherjee, S. and Sousa, R. (2003) Use of site-specifically tethered chemical nucleases to study macromolecular reactions. Biol. Procedures Online, 5, 78–89.
Pavski, V. and Le, X.C. (2003) Ultrasensitive protein–DNA binding assays. Curr. Opin. Biotechnol., 14 (1), 65–73.
Putz, U., Skehel, E., and Kuhl, D. (1996) A tri-hybrid system for the analysis and detection of RNA–protein interactions. Nucleic Acids Res., 24, 4838–4840.
Record, M.T., Jr, Zhang, W., and Anderson, C.F. (1998) Analysis of effects of salts and uncharged solutes on protein and nucleic acid equilibria and processes: a practical guide to recognizing and interpreting polyelectrolyte effects, Hofmeister effects, and osmotic effects of salts. Adv. Protein Chem., 51, 281–353.
Schulz, A., Mücke, N., Langowski, J., and Rippe, K. (1998) Scanning force microscopy of Escherichia coli RNA polymerase sigma54 holoenzyme complexes with buffer and in air. J. Mol. Biol., 283, 821–836.
Senear, D.F. and Brenowitz, M. (1991) Determination of binding constants for cooperative site-specific protein-DNA interactions using the gel mobility-shift assay. J. Biol. Chem., 266 (21), 13661–13671.
Soukup, G.A. and Breaker, R.R. (1999) Relationship between internucleotide linkage geometry and the stability of RNA. RNA, 5, 1308–1325.
Steen, H. and Jensen, O.N. (2002) Analysis of protein–nucleic acid interactions by photochemical crosslinking and mass spectrometry. Mass Spectrom. Rev., 21 (3), 163–182.
Tan, W., Wang, K., and Drake, T.J. (2004) Molecular beacons. Curr. Opin. Chem. Biol., 8 (5), 547–553.
Thiede, B., Urlaub, H., Neubauer, H., Grelle, G., and Wittmann-Liebold, B. (1998) Precise determination of RNA–protein contact sites in the 50S ribosomal subunit of Escherichia coli. Biochem. J., 334, 39–42.
Wower, I., Wower, J., Meinke, M., and Brimacombe, R. (1981) The use of 2-iminothiolane as an RNA–protein crosslinking agent in Escherichia coli ribosomes, and the isolation on 23S RNA of sites crosslinked to proteins L4, L6, L23, L27, and L29. Nucleic Acids Res., 9, 4285–4302.


Part V

Functional and Systems Analytics

33 Sequence Data Analysis

Boris Steipe
University of Toronto, Department of Biochemistry, 1 King's College Circle, Toronto, Ontario M5S 1A8, Canada

33.1 Sequence Analysis and Bioinformatics

Over less than the last half century, the life sciences have undergone their most radical transformation yet. We set out from the detailed study of individual molecules, arrived in the genomic era, and are now setting our sights on a post-genomic era in which the heritable information that makes up the identity of an organism can simply be assumed as given. Several paradigm shifts have accompanied this development: in the past, we focused on developing algorithms that interpret sequences; more recently the challenge has been to integrate information across a large number of data sources that are freely available on the Internet; the present discussion in the field is focused on the challenges of "Big Data." In the past, we programmed institutional mainframe computers in compiled languages such as Fortran and C; later the focus shifted to Perl, PHP, and other interpreted languages to assemble data from the Internet on our desktop computers; currently the emphasis is on large libraries of functions, as in the statistical workbench R, while storage and computation are shifting into the cloud. This chapter will introduce all three paradigms: sequence information, integration, and genome-scale analysis, providing pointers to practical approaches wherever possible. With industrial-scale sequencers now available for less than the cost of a dump truck, sequencing has gone mainstream and the number of sequenced genomes is still growing exponentially: the US National Center for Biotechnology Information (NCBI) lists more than 4900 sequenced eukaryotic genomes and more than 125 000 sequenced prokaryotes as of December 2017. The management of such data volumes is not trivial. We are fortunate, however, that a massive reduction of data storage costs in the last decade has approximately kept pace with the deluge of data.
The trend to delocalize storage into the cloud – globally distributed commodity data centers that store data efficiently and securely for a small fee – is growing, reducing the cost of data management further. And the data is public, and freely available. Two major institutions are global hubs of databases and online services: the NCBI (https://www.ncbi.nlm.nih.gov) and the European Bioinformatics Institute (EBI, https://www.ebi.ac.uk) in England. These exchange data daily through a data-sharing agreement that also includes the DNA Data Bank of Japan (DDBJ), manage a large set of related and extensively cross-referenced databases, and run powerful data-analysis centers to freely support public queries. Beyond that, literally thousands of smaller online data resources and services are available. Certainly, solving most practical sequence analysis problems today requires nothing more than an Internet connection. Sequence analysis is located roughly between two poles. On one hand we have bioinformatics in its narrow sense: the technologies behind the management of large datasets,

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

DNA Sequencing, Chapter 30


search, retrieval, consistency, and cross-referencing. On the other hand, we have computational biology: the abstraction of biomolecules and their study with computational methods. This is reflected in the topic of this chapter, biological sequence analysis. Biological macromolecules are linear heteropolymers of nucleotide or amino acid units. This makes it simple to define an abstraction that is ideally suited for computational analysis: each unit is described with a letter of the alphabet, and thereby a complex biomolecule is mapped to a simple string, which can be efficiently stored, manipulated, and retrieved. But we must not forget that such abstractions lose information – for example, sequences do not represent post-translational modifications or structural conformers, and thus the representation of biology is incomplete. Sequences are models of molecules, and we need to be aware of the models' limitations as well as the necessity to relate our computational results back to the molecules they represent.
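The string abstraction is trivial to use in practice. The following sketch expresses the C/G and A/T complementarity of DNA as a character mapping; the example sequence is arbitrary:

```python
# A DNA sequence as a plain Python string: complementarity becomes a character
# translation, and simple sequence properties become string operations.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse to read 5'->3' on the other strand."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

seq = "ATGGCGTAA"
print(reverse_complement(seq))    # -> TTACGCCAT
print(round(gc_content(seq), 2))  # -> 0.44
```

Note how the abstraction's limitations show immediately: nothing in the string records methylation, strand breaks, or secondary structure.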

33.2 Sequence: An Abstraction for Biomolecules

It is impossible to develop an intuition for biological sequences without familiarity with the one-letter code that maps amino acids to characters (Table 33.1). One of the most important procedures of sequence analysis is to evaluate sequence changes, and this requires relating sequence changes to changes in molecular properties. Without mapping sequences back to molecules through the one-letter code, we may be doing informatics, but it is not "bio"informatics. Nucleotide sequence information, in contrast, resides mostly in the identity or non-identity of a nucleotide, and perhaps the complementarity of C/G and A/T pairs; treating nucleotide sequences merely as abstract strings is a reasonable approximation.

Table 33.1 The 20 proteinogenic amino acids and their one-letter code (according to IUPAC-IUB). Compare this with Appendix 1 regarding the amino acids' biophysical properties and Appendix 3 for the chemical structure.

| Code | Amino acid | Mnemonic |
|---|---|---|
| A | Alanine | A-lanine |
| C | Cysteine | C-ysteine |
| D | Aspartic acid | aspar-D-ic acid |
| E | Glutamic acid | glut-E-mic acid |
| F | Phenylalanine | F-enylalanine |
| G | Glycine | G-lycine |
| H | Histidine | H-istidine |
| I | Isoleucine | I-soleucine |
| K | Lysine | Y-turned-sideways resembles K |
| L | Leucine | L-eucine |
| M | Methionine | M-ethionine |
| N | Asparagine | asparagi-N-e |
| P | Proline | P-roline |
| Q | Glutamine | "Q"-tamine |
| R | Arginine | "R"-ginine |
| S | Serine | S-erine |
| T | Threonine | T-hreonine |
| V | Valine | V-aline |
| W | Tryptophan | with a lisp: t-W-yptophan |
| Y | Tyrosine | t-Y-rosine |
33 Sequence Data Analysis

33.3 Internet Databases and Services

Several factors have led to a virtual explosion of biological databases and services on the Web. These include:

- a large volume of sequence and related data;
- inexpensive computer storage, processing, and Internet connections;
- the availability of free, powerful, well-documented database back-ends for all common computer platforms;
- the availability of free, powerful, well-documented Web servers, such as Apache (http://apache.org/);
- the ease of use of scripting and programming languages such as R (http://r-project.org), Perl (http://perl.org), Python (http://python.org/), PHP (http://php.net), and JavaScript (automatically included in all modern Web browsers), which have excellent support for tying databases to dynamically created Web pages;
- the availability of important algorithms as free and open software, supported by a large user base that maintains and develops large libraries of functions – most notably Biopython (http://biopython.org, although with less comprehensive function libraries for bioinformatics than for many other fields) and R (http://r-project.org) with the associated Bioconductor project (http://bioconductor.org), which currently has the most active community of open-source bioinformatics developers and by far the most comprehensive code base.

These factors have created unprecedented opportunities for small and medium-sized research groups to create and publish their own data resources, to make them available on the Internet, and to connect them to other databases. While this is a marvelous development in principle, it does raise the question of how to maintain an overview of which resources exist, how well they are updated and maintained, and what is currently considered the state of the art. Beyond the NCBI, the EBI, and the larger, community-maintained model-organism databases, evaluating a given online resource is a non-trivial task for the non-expert. Which alternatives exist? How exactly are the resources generated? How many false-positive or false-negative results are to be expected? How often is the resource updated? Is the long-term, stable existence of the resource ensured? What is the level of user support and documentation? It is obvious that reproducible research requires confidence in these issues, but it is certainly true that not all online resources can satisfy these requirements and that nearly all have room for improvement. There are several resources that help maintain a rough overview: Nucleic Acids Research (NAR, https://academic.oup.com/nar) publishes annual issues of peer-reviewed database resources and Web services. These are collected, organized by keyword, and searchable online in the curated Bioinformatics Links Directory of Bioinformatics.ca (https://bioinformatics.ca/links_directory/). Besides NAR, bioinformatics articles are frequently published in Bioinformatics (https://academic.oup.com/bioinformatics), BioMed Central Bioinformatics (https://bmcbioinformatics.biomedcentral.com), and more than 170 other journals that can be found as the result of a keyword search for "bioinformatics" at the US National Library of Medicine (https://www.ncbi.nlm.nih.gov/nlmcatalog).
More information about current developments can be found in conference abstracts, for example, those of the annual Intelligent Systems in Molecular Biology conference (ISMB, https://www.iscb.org/) of the International Society for Computational Biology. Finally, dedicated online forums exist that help find answers to practical questions about current best practice. The most active one specifically for bioinformatics is Biostars (https://biostars.org), but many related questions have been discussed (and answered) on one of the invaluable StackExchange forums (https://stackexchange.com/); Quora is a newer addition, often with very high-quality answers (https://www.quora.com/topic/Bioinformatics). To solve a specific problem with the best currently available tools, most likely the best advice is to follow your peers: find a relevant recent publication – and in this field recent means no older than two or three years – in a well-reviewed journal, and study the methods section carefully. Obviously, the paper's authors are usually pleased to pass on advice.


33.3.1 Sequence Retrieval from Public Databases

The first step of sequence analysis is obviously to obtain the sequence. Sequences may come from in-house projects or be downloaded from public databases. Sequences can be retrieved via a feature search (e.g., searching for human hemoglobin) or, more specifically, via accession numbers. Searches in general are very well integrated across the various NCBI databases, using the Gquery interface to the Entrez database system (Figure 33.1). This is an interconnected system of cross-referenced databases that can be accessed through a unified interface. For example, a search for the Mbp1 transcription factor in baker's yeast yields links into literature, nucleotide, protein, and structure databases, cross-references to the yeast genome, sequences in related organisms, expression profiles, and more (Table 33.2). All database entries – sequences or otherwise – are identified by one or more unique accession numbers. In principle, such accession numbers can be used to cross-reference data between databases. However, one needs to consider whether (biological) objects that are identified by the same accession number are in fact identical; this depends on the exact database semantics and data management policies. In practice, databases usually maintain their own systems of accession numbers, and mapping – translating between them – is often necessary. Web services for accession number mapping can be found at the UniProt ID mapping service

Figure 33.1 Result of a keyword search for the Mbp1 protein (by text-word) in Saccharomyces cerevisiae (by organism) at the NCBI. This is a quick and comprehensive way to obtain cross-referenced results. In this example, we see links to journal articles, nucleotide and protein sequences, gene expression profiles, protein structures, and more.

33 Sequence Data Analysis

Table 33.2 Selected accession numbers and their semantics for the yeast Mbp1 transcription factor.

  Accession number   Semantic
Name
  MBP1               Standard name
Gene
  YDL056W            Systematic gene name
  851503             NCBI Entrez Gene ID
  NM_001180115       RefSeq nucleotide (mRNA) ID
  X74158.1           Nucleotide accession number of the gene (NCBI GenBank and European Nucleotide Archive), version 1
  GI:296143308       GeneInfo – NCBI internal accession number for the gene
Protein
  MBP1_YEAST         Swiss-Prot name
  NP_010227          RefSeq protein ID
  P39678             UniProt (protein) identifier
  GI:6320147         GeneInfo – NCBI internal accession number for the protein
Annotation cross-references
  S000002214         Saccharomyces Genome Database (SGD) ID
  sce:YDL056W        KEGG pathway database ID
  1BM8               PDB (protein structure database) ID for the DNA binding domain
  16823090           iREF protein interaction database ID

(http://www.uniprot.org/mapping/). Uploading single or multiple accession numbers translates them into their corresponding identifiers at the EBI, the NCBI, organism-specific databases, and many others. Probably the most exhaustive translations are provided by bioDBnet (https://biodbnet-abcc.ncifcrf.gov/db/db2db.php), which covers more than 200 databases with more than 700 mappings between them. But sequences do not only need to be mapped between individual providers' resources; the major providers may also hold the same information in several different databases. One example is the NCBI's RefSeq project (https://www.ncbi.nlm.nih.gov/refseq), a curated database designed to solve the problem that primary databases may contain large numbers of redundant, identical sequences from distinct submissions: it stores only non-redundant, well-annotated sequences. A similar purpose is fulfilled by the Swiss-Prot subset of the UniProtKB databases (https://www.uniprot.org), a collection of curated, reviewed, and manually annotated sequences that is probably the gold standard for biological sequence quality. Moreover, UniProt offers the option to retrieve only sequences whose mutual differences exceed a selectable threshold, so that near-identical sequences can be removed from a result.
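Such redundancy filtering can be sketched in a few lines. The function names and the 90% threshold below are illustrative choices, not part of any particular service; production systems (UniRef clusters, CD-HIT, and others) use optimized clustering over properly aligned sequences rather than this quadratic scan over equal-length strings.

```python
# Sketch: greedy removal of near-identical sequences from a result set.
# All names and thresholds are illustrative, not taken from any real service.

def percent_identity(a: str, b: str) -> float:
    """Naive identity of two equal-length sequences (no alignment)."""
    if len(a) != len(b) or not a:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def remove_redundant(seqs, threshold=90.0):
    """Keep a sequence only if it is < threshold % identical to all kept ones."""
    kept = []
    for s in seqs:
        if all(percent_identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

seqs = ["MSNQIYSA", "MSNQIYSA", "MSNQVYSA", "MKTAYIAK"]
print(remove_redundant(seqs))
```

The greedy scan keeps the first representative of each near-identical group; the exact duplicate is dropped, while the single-residue variant (87.5% identity, below the 90% cutoff) is retained.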

33.3.2 Data Contents and File Format

Modern data technologies often work with well-defined data grammars, such as XML or JSON. Bioinformatics, however, is still largely the domain of so-called flat-file formats. Despite the difficulties of writing consistent file parsers, or even of obtaining precise format specifications in the first place, the field values nimble, human-readable formats that are conceptually easy to handle and that can be read and produced by a very large number of legacy applications (Figure 33.2). Such flat files usually contain records of information, each introduced by an identifier. Database entries displayed on the Web usually contain linked cross-references and graphical elements; the raw information can be downloaded as text. Since most analyses require only the sequence itself, however, the FASTA format is generally adopted as the de facto standard for data interchange (Figure 33.3). It consists of only two elements: a "header" line, prefixed with a single ">" character, and any number of lines containing the actual sequence data in one-letter code, until the end of the file. Often – but not mandatorily – lines are limited to 80 characters, a vestige of the size limits of punch cards for data transfer. The format is compact, efficient, readable, and easy to open and edit in any word processor. Virtually all analysis programs accept input in FASTA format.

Figure 33.2 "Anatomy" of a GenBank flat-file formatted database entry: protein sequence of the Mbp1 protein. Elements of the entry are annotated.
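Because the format is so simple, a complete FASTA reader fits in a dozen lines. The sketch below (standard library only; the example record is illustrative and shows only the first 20 residues) parses header/sequence records exactly as described:

```python
# Minimal FASTA parser for the format described above: a ">" header line
# followed by any number of sequence lines, until the next header or EOF.

def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Illustrative record: first 20 residues of yeast Mbp1, wrapped at 10 columns.
example = """\
>NP_010227.1 Mbp1p [Saccharomyces cerevisiae]
MSNQIYSARY
SGVDVYEFIH
""".splitlines()
for name, seq in read_fasta(example):
    print(name, len(seq))
```

The same function reads multi-record files unchanged, since a new ">" line simply closes the previous record.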

Figure 33.3 GenBank sequence of the Mbp1 protein of baker's yeast in FASTA format. A subset of the information in the GenBank record's header is given here: RefSeq ID and organism description. The sequence data are represented in one-letter code, with line breaks after 70 characters. There is no explicit end-of-sequence character.


33.3.3 Nucleotide Sequence Management in the Laboratory

Dedicated commercial and open-source solutions exist for managing in-house sequences in the molecular biology laboratory; a Web search will identify them readily. Many laboratories simply store sequences in text files on their personal computers. In all cases, a robust backup scheme is critical. To edit, translate, and annotate sequences, ApE (A plasmid Editor, http://biologylabs.utah.edu/jorgensen/wayned/ape/) is a popular, free tool for the most common computer platforms.

33.4 Sequence Analysis on the Web

Just as with the databases, finding the right source for analysis services is less a question of what is possible than of how to choose among the many alternatives. The annotated NAR issues and the searchable directory of Bioinformatics.ca have been mentioned above; an integrated package of basic analysis functions is provided by EMBOSS.

33.4.1 EMBOSS

The European Molecular Biology Open Software Suite (EMBOSS) is a collection of programs for basic sequence analysis tasks (http://emboss.sourceforge.net/). They are generally freely available as services via dedicated Web servers; a simple Google search for "emboss explorer" will find access points (Table 33.3). The EMBOSS package is open-source software and can easily be installed on a laboratory's local computers. Wrappers also exist within Biopython, so that automated workflows can be constructed.

Table 33.3 Selection of applications from the EMBOSS analysis package.

  Application   Function
  trimest       Remove poly-A tails from EST sequences
  einverted     Find DNA inverted repeats
  eprimer3      Select PCR primers and hybridizing oligonucleotides
  geecee        Compute G/C content of nucleotide sequences
  revseq        Reverse-complement sequences
  remap         Plot a nucleotide sequence with restriction endonuclease sites and translation
  banana        Sequence-dependent twist and curvature of B-DNA
  transeq       Translate nucleotide sequences
  pepstats      Peptide statistics: molecular mass, extinction coefficient, and so on
  iep           Compute the isoelectric point of a peptide
  pepcoil       Predict coiled-coil regions
  sigcleave     Predict secretion signal peptidase sites
  fuzzpro       Discover patterns in protein sequences
  pepwheel      Plot peptides as helical wheels to emphasize the amphipathic moment
  dotmatcher    Dot-plot sequence comparison
  needle        Needleman–Wunsch global-optimal pairwise sequence alignment
  water         Smith–Waterman local pairwise sequence alignment


Figure 33.4 Partial output of the PEPSTATS routine of the EMBOSS package, for the amino acid sequence of yeast Mbp1. The molecular weight and coefficient of extinction are properties that are frequently needed in a laboratory.

33.5 Sequence Composition

Several features of a sequence can be computed directly as additive properties of its molecular building blocks. These include composition – the molar ratios of the constituent amino acids – as well as molecular mass, extinction coefficient, antigenicity, and isoelectric point (which, however, may be shifted by the variation of pK values in a folded protein) (Figure 33.4).
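The additive calculations performed by tools such as pepstats can be illustrated in a short sketch. The residue masses below are rounded average values, and the 280 nm extinction increments are the commonly used Gill–von Hippel values (5500 per tryptophan, 1490 per tyrosine, in M^-1 cm^-1); cysteines are assumed reduced, so the cystine contribution of 125 per disulfide is omitted:

```python
# Sketch of pepstats-style additive sequence properties: composition,
# molecular mass from average residue masses, and the molar extinction
# coefficient at 280 nm. Mass values are rounded averages (Da).

from collections import Counter

RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water per peptide chain

def composition(seq):
    """Molar ratios of the constituent amino acids."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in counts}

def molecular_mass(seq):
    """Average molecular mass of the peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def extinction_280(seq):
    """Molar extinction coefficient at 280 nm, reduced cysteines assumed."""
    return 5500 * seq.count("W") + 1490 * seq.count("Y")

peptide = "WGYG"  # illustrative test peptide
print(round(molecular_mass(peptide), 2), extinction_280(peptide))
```

Note that the computed isoelectric point and extinction coefficient of a real protein can deviate from such sequence-based estimates, for the reasons given above.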

33.6 Sequence Patterns

(See also: Transcription Maps of the Human Genome, Section 36.3.5.)

As genome sequencing has become readily accessible, the annotation of raw nucleotide data has become increasingly important. Even distinguishing transcribed from untranscribed sequence, or information encoded in the translated polypeptide from regulatory genomic information, is challenging. Untranslated sequences may contain functional sites of promoters, terminators, enhancers, insulators, and much more. Such functional sites are mostly associated with the regulation of gene expression – the genome, after all, encodes both structural and regulatory information – but also with the regulation of replication. For the human genome, the ENCODE project (http://genome.ucsc.edu/ENCODE/) is a large-scale effort to annotate the entire genome and provide a list of all of its functional elements. At least for human genes, functional motifs have been experimentally determined and the annotations are available on the Web (Figure 33.5).

Genome-level regulatory information is determined both by sequence and by context. The sequence pattern encodes the functional potential; where and when it is expressed depends on the level of nucleosome compaction, epigenetic modifications, occupancy of activation and silencing sites, and much more. While ENCODE is a fundamentally experimental approach, bioinformatics methods complement the experiments. In principle we can distinguish between analysis by signal, that is, pattern analysis, and analysis by content, that is, sequence comparison. Pattern analysis determines the presence or absence of a defined sequence pattern, which may be continuous and exact, discontinuous with fixed or variable gap lengths, and/or may contain ambiguities. We can further distinguish between deterministic and probabilistic analysis. Deterministic analysis gives a yes/no answer: a pattern is either present or it is not. The classical paradigm is the search for restriction endonuclease recognition sites in a nucleotide sequence.
Algorithms for this are fast and well understood, and programming languages support such searches through regular expression functions. For protein sequences, functional patterns have been compiled in the PROSITE database of domains, families, and functional sites (http://prosite.expasy.org/), which can be used to scan for functional motifs given a UniProt accession number or an uploaded sequence. Motifs should be sensitive and specific, that is, define a short pattern that occurs in all sequences that share a particular function and is absent from sequences that do not. An example of a sequence motif is the pattern

L-x(6)-L-x(6)-L-x(6)-L

which is PROSITE motif PS00029, a leucine zipper. The notation describes four leucine residues ("L"), each pair separated by six unspecified residues (x). Such a "zipper" can be found in a homologue of Mbp1, the yeast Swi4 protein, at residues 848–883 (leucines capitalized):

LespsslLpiqmspLgkyskpLsqqinkLntkvssL

Figure 33.5 ENCODE project annotation patterns overlaid on a human genome map (cf. http://genome.ucsc.edu). The grey bars show experimentally determined ChIP-seq and DNase sensitivity clusters in the vicinity of the human SOD1 gene. This image indicates the very high level of annotation that is currently available via simple Web queries.

This zipper has not been annotated in the feature table of the Swi4 GenBank entry; an example of how such feature tables are helpful for orientation, but necessarily incomplete. Each PROSITE motif is accompanied by an expert-curated description of its biological implications.

More complex sequence patterns are commonly analyzed through probabilistic pattern analysis, which aims to quantify the probability that a pattern is present rather than giving only a yes/no answer. This is frequently computed via a position-specific scoring matrix (PSSM), or profile, in which each position of the sequence is associated with a vector of the possible states it can take and their associated probabilities. Such a matrix is a computational tool that quantifies how well a particular sequence matches a given sequence motif: the algorithm simply adds the profile values for the observed character in each position. If these are properly scaled, the result can be interpreted as the probability that the observed sequence was part of the set on which the weighting matrix was based. Such sequence propensities can be displayed in a so-called sequence logo, which represents the information content of every position and its observed states. This corresponds to our biological intuition that conserved positions should be considered more important than variable regions (Figure 33.6).

Figure 33.6 Sequence logo, computed from 21 Escherichia coli promoter sequences. The height of the character stack at every position corresponds to the information-theoretical information content, in "bits"; this corresponds to the weight of a position-specific scoring matrix. The sizes of the individual letters correspond to their frequencies at a position. These logos were pioneered by Tom Schneider and can easily be generated online from multiple alignments (http://weblogo.berkeley.edu/).
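The deterministic leucine-zipper scan described above maps directly onto a regular expression. Below, the PS00029 pattern is translated into regex syntax and applied to the Swi4 segment quoted above (converted to uppercase first, since regex matching is case-sensitive):

```python
# Deterministic pattern matching with a regular expression: the PROSITE
# pattern L-x(6)-L-x(6)-L-x(6)-L translated into regex syntax and applied
# to the Swi4 residues 848-883 quoted above.

import re

zipper = re.compile(r"L.{6}L.{6}L.{6}L")

swi4_segment = "LespsslLpiqmspLgkyskpLsqqinkLntkvssL".upper()
match = zipper.search(swi4_segment)
print(match.group(0) if match else "no zipper found")
```

The search reports the first heptad-spaced run of four leucines; `re.finditer` would report all non-overlapping occurrences.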


Figure 33.7 SMART server domain and pattern annotation of the yeast Mbp1 protein. Features are mapped to the sequence, and detailed annotations can be studied further. In this case we see a KilA-N family DNA binding domain and a series of ankyrin protein–protein interaction modules annotated by comparison to the Pfam database of protein domain families, and an ankyrin domain annotated by sequence similarity to a known domain (BLAST). Other features such as a coiled-coil dimerization domain (green) and unstructured, low-complexity regions (pink) are determined from first principles from the target sequence itself. Further features (not shown) include acetylation and phosphorylation sites, determined from sequence patterns. Such local patterns alone are generally necessary but not sufficient to make a prediction: whether they are actually functional depends on their local context in the native, folded protein.

Probabilistic pattern descriptions for annotation and classification are the domain of machine learning in computer science. PSSMs make the implicit assumption that the contributions from each position are independent, but many more powerful tools have been developed over the last two decades that can capture higher-order preferences that may be subtly distributed over the entire pattern. Methods include Hidden Markov Models, Neural Networks, Support Vector Machines, Random Forests, and more. They can be implemented, for example, with packages for the R programming language (https://cran.r-project.org/web/views/MachineLearning.html), and excellent introductory tutorials are available online (http://www-bcf.usc.edu/~gareth/ISL/). The common theme is that the high-dimensional features of objects (such as sequences) with known properties are represented in some consistent way; this is called a training set. Then objects with unknown properties – the test set – are queried to determine which subsets of the training set they are similar to. If the similarity is high, the algorithm can conclude that the new object should share annotations with the annotated subset. This strategy is common to all "supervised" machine learning approaches; they merely differ in the details of how the information is represented, yet often lead to results of comparable quality. This leads to very general pattern-matching procedures that can be used for sequence annotation, for example at the SMART database of the European Molecular Biology Laboratory in Heidelberg (http://smart.embl-heidelberg.de/) (Figure 33.7).
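The supervised scheme described above can be made concrete with a deliberately tiny sketch: sequences are represented as feature vectors (here, dinucleotide frequencies), a labeled training set is stored, and a query receives the label of its most similar training example (1-nearest-neighbour). All sequences and labels are invented; real classifiers use the far richer representations and models named above.

```python
# Toy supervised classification: feature vectors, a labeled training set,
# and annotation transfer from the most similar training example.

from itertools import product
import math

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def features(seq):
    """Dinucleotide frequency vector of a DNA sequence."""
    n = max(len(seq) - 1, 1)
    return [sum(seq[i:i + 2] == d for i in range(len(seq) - 1)) / n
            for d in DINUCS]

def classify(query, training):
    """Label of the training sequence with the smallest feature distance."""
    fq = features(query)
    best = min(training, key=lambda item: math.dist(fq, features(item[0])))
    return best[1]

training = [("GCGCGCGCGCGC", "GC-rich"), ("ATATATATATAT", "AT-rich")]
print(classify("GCGCATGCGCGC", training))
```

The same structure underlies far more sophisticated methods; what changes is the feature representation and the decision function, not the train/query logic.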

33.6.1 Transcription Factor Binding Sites

One example of the role of pattern recognition in sequence analysis is the annotation of regulatory sequences. The sequence patterns of experimentally validated transcription factor binding sites are frequently redundant, and they occur in the genome much more often than we would expect for functional reasons. Which of these sites are actually functional? Databases of validated binding sites such as JASPAR (http://jaspar.genereg.net/) can help answer this question. JASPAR allows us to select sets of curated transcription factor binding profiles and predict binding sites in uploaded sequences (Figure 33.8). If only genomic sequences of co-regulated genes are known, but the binding sites and factors have not been determined, motif discovery algorithms can be applied. The principles behind several successful algorithms are similar: given a set of nucleotide sequences that are known to contain binding sites somewhere, mask trivially shared sequences and then test all possible subsequences (words) of a given size to find those that are significantly overrepresented. Excellent results have been achieved with the programs Weeder (http://159.149.160.51/modtools/), MEME/MAST (http://meme.nbcr.net/meme/), the Gibbs Motif Sampler (http://ccmbweb.ccv.brown.edu/gibbs/gibbs.html), and, most recently, XXmotif (http://xxmotif.genzentrum.lmu.de/). A server that integrates several approaches by running them independently and clustering the results is WebMOTIF (http://fraenkel.mit.edu/webmotifs-md-programs.html). Beyond merely identifying statistically overrepresented patterns, their functional relevance can be evaluated in the light of biological knowledge. Criteria that support functional relevance include:

- enrichment of motifs in the upstream untranscribed regions of co-regulated genes, relative to the regulatory regions of randomly chosen genes;
- conservation in syntenic genomic regions of related organisms;
- a higher propensity to appear together with other validated binding sites or regulatory motifs.


Figure 33.8 Information window for the binding profile of yeast Mbp1, curated in the JASPAR database. The binding site propensity matrix is shown as a scoring matrix. Links lead to additional information resources. It is easy to appreciate that the site is at its core only a tetranucleotide, and the biological function of the transcription factor is established through contextual binding in concert with other factors.

Fundamental to this type of analysis is the hypothesis that co-occurring motifs indicate shared function; we call this "guilt by association." Obviously, there may also be other reasons for the association of recurring patterns, such as segmental duplications. Importantly, highly repetitive sequences can appear as overrepresented patterns, which is technically correct but biologically meaningless. Such sequences can be identified with the program RepeatMasker (http://www.repeatmasker.org/) and excluded from the analysis. One application domain is the annotation of motifs in "peaks" – frequently identified genomic regions – of ChIP-seq or ChIP-chip experiments. For example, the Bioconductor ChIPpeakAnno package provides functions to retrieve collated genome features and to add further annotations, such as identifying multiple-transcription-factor loci (MTLs) or attaching functional annotations from the Gene Ontology database (GO, http://geneontology.org/).

33.6.2 Identification of Coding Regions

(See also: Analysis of Epigenetic Modifications, Section 31.7.)

The majority of genomic sequence is untranslated; the identification of coding regions is therefore one of the key challenges for newly sequenced genomic DNA. This too is a task for pattern recognition, although the issue here is less one of recognizing particular signals such as splice donor and acceptor sites. These are important, but their predictive value is determined by an array of more delocalized properties: G/C content; relative di-, tri-, and hexanucleotide frequencies; absence of stop codons; degree of conservation in related genomes; and other measures. RNA-seq, as an experimental method to discover transcribed regions, has become accessible and increasingly important to complement the purely computational approaches. Once again, machine learning methods play a big role in integrating large numbers of features to evaluate the evidence.

Early successes were primarily based on Hidden Markov Models. GENSCAN is a popular example for eukaryotes (http://genes.mit.edu/GENSCAN.html), while Glimmer is an application that has annotated hundreds of prokaryotic and viral genomes (http://ccb.jhu.edu/software/glimmer/). Newer approaches, however, make extensive use of EST and RNA-seq data, which improves performance in situations that are difficult for purely computational approaches: short exons, alternative splicing, overlapping genes, non-canonical splice sites, and RNA editing. A program that facilitates the integration of various types of evidence is JIGSAW (https://www.cbcb.umd.edu/software/jigsaw/). Recent interest has also focused on microRNA prediction; MiPRED (http://www.bioinf.seu.edu.cn/miRNA/) provides a tool with satisfactory performance. This type of analysis scales well to whole-genome annotation, but actually using the additional information that is available through the large number of previously sequenced genomes requires a bit more thought. For prokaryotic genome annotation, the RAST annotation engine (Rapid Annotation using Subsystem Technology, http://rast.nmpdr.org/) accesses the SEED collection of functionally related protein families and annotates genes by comparison to the collection. Once a genome is made public, its annotated proteins are added to the collection, which thus provides a constantly improving, growing body of functional annotation information.
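Two of the simpler signals mentioned above are easy to compute directly. The sketch below determines G/C content and scans the three forward reading frames for the longest ATG-to-stop open reading frame; real gene finders combine many such features probabilistically, and would also scan the reverse complement.

```python
# Sketch of two simple coding-region signals: G/C content and a crude
# ORF scan (longest ATG..stop stretch, three forward frames only).

STOPS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of G and C nucleotides."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_orf(seq):
    """Longest ATG-to-stop open reading frame in the three forward frames."""
    best = ""
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for i, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                orf = "".join(codons[start:i + 1])
                if len(orf) > len(best):
                    best = orf
                start = None
    return best

dna = "TTATGGCTGCTTGCTAAGGATGAAATAAT"  # invented example sequence
print(round(gc_content(dna), 3), longest_orf(dna))
```

Because both measures are additive over the sequence, they can be computed in sliding windows to profile a whole chromosome; the "delocalized" signals used by real gene finders are generalizations of exactly this idea.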

33.6.3 Protein Localization

Secretory proteins have a characteristic N-terminal signal sequence that is cleaved after translocation through the inner membrane. The information about the precise cleavage site is distributed along the sequence, which has several characteristic features (Figure 33.9). Predicting these cleavage sites was one of the earliest successful applications of neural networks in molecular biology, and the current version of the SignalP algorithm (http://www.cbs.dtu.dk/services/SignalP/) performs well enough to be suitable for automated, whole-genome annotation, addressing the non-trivial challenge of distinguishing secretion signals from transmembrane helices. It achieves a Matthews correlation coefficient (a number characterizing the relationship between the sensitivity and specificity of a procedure) of better than 0.8, the best performance of all currently available alternatives. Similar algorithms are suitable for the prediction of protein localization, transmembrane helices, domains, and so on. One server that integrates a number of such predictions is PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/); alternatives include PSORT for localization prediction (http://www.psort.org/), and TMHMM (http://www.cbs.dtu.dk/services/TMHMM) and TOPCONS (http://topcons.cbr.su.se/) for transmembrane helices.
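The Matthews correlation coefficient mentioned above is computed from the four entries of a binary confusion matrix. The sketch below shows the formula; the two confusion matrices are invented for illustration, not taken from any published benchmark.

```python
# Matthews correlation coefficient from a binary confusion matrix.
# Values range from -1 to +1: +1 is perfect prediction, 0 is chance level.

import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal is empty
    return (tp * tn - fp * fn) / denom

# A perfect predictor, and an invented confusion matrix at the 0.8 level:
print(mcc(50, 50, 0, 0), round(mcc(90, 90, 10, 10), 2))
```

Unlike raw accuracy, the MCC stays informative on imbalanced data sets, which is why it is the standard figure of merit for predictors like SignalP, where true secretory proteins are a small minority of a genome.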

Figure 33.9 Sequence logo for the N-terminus of Gram-negative bacterial secreted proteins, including the signal peptide. There is a clear preference for a positively charged N-terminus, a hydrophobic central stretch of amino acids with high helical propensity, and a preference for small amino acids at positions -3 and -1 relative to the cleavage site. Downstream of the cleavage site, hydrophilic residues predominate.


33.7 Homology

Pattern-based methods can contribute important basic information to sequence analysis; for practical purposes, however, homology-based methods are far more important. The basic procedure is to construct an optimal alignment between two sequences and to evaluate it. If the alignment suggests that the sequences diverged from a common ancestor, one can deduce domain arrangements, phylogenetic history, important functional residues, and much more. This wealth of information rests on the empirical observation that related sequences usually have related functions and virtually always have similar structures, and that important functions are associated with the evolutionary conservation of residue motifs.

33.7.1 Identity, Similarity, Homology

Identity, similarity, and homology are terms with precise meanings, but they are often used imprecisely. Identity is the percentage of residues that are identical in an alignment. The quantification of similarity is more difficult (Figure 33.10). It requires a measure of similarity between pairs of amino acids – for example, how similar one would consider a valine to be to a threonine or to a leucine. One can base such a measure on biophysical similarity, on the number of nucleotide changes required to get from one to the other, or on other considerations. There is, however, no generally accepted way to define similarity, and therefore the term should not be used in a quantitative sense. Homology, on the other hand, means that two proteins have diverged from a shared ancestor. It has an exact meaning and describes a quality, not a quantity: two proteins are either homologous or they are not. To say that two genes share "50% homology" abuses the terminology. Homology does not necessarily imply identity or similarity, since homologous sequences can diverge to the point that they have lost all recognizable similarity. However, if the degree of similarity is high, shared ancestry is a much more likely explanation than random chance.

In the following we concentrate on amino acid sequences. Since nucleotide variability is generally much higher than amino acid variability, it is usually not very meaningful to compare the untranslated nucleotide sequences of coding regions. Obviously, this does not apply to sequences in which the nucleotides themselves are functionally conserved, such as promoter or tRNA sequences. For these cases the methods of amino acid sequence comparison can be readily generalized – with the caveat that helical stems can conserve their base-pairing patterns despite sequence change and thus may need to be considered separately.
We further assume that sequence similarity is an additive property of independent sequence positions. This is a crude approximation that could be improved upon if additional information about amino acid interactions were available. In practice this is usually not the case, and assuming independence is the best we can do. Nevertheless, there is potential for improving alignment algorithms by including secondary structure propensities, conservation patterns, solvent accessibility, or the precise position of gaps in the alignment – albeit at the cost of much added algorithmic complexity.

Similarity scores of individual amino acid pairs are commonly represented in a "scoring matrix." All results that use such a matrix for alignment obviously depend on its values. The matrix is a tool to quantify how well an alignment reflects a particular model of similarity. If we construct the matrix to represent evolutionary relationships, an optimal alignment – that is, an alignment that gives the highest possible aggregate score for this matrix – will be suitable to evaluate homology. We consider two proteins as likely homologues if the alignment score is higher than what we would expect for two randomly chosen sequences. The matrix most commonly chosen for this task is BLOSUM62, developed by Jorja and Steve Henikoff; it captures exchange probabilities from blocks of aligned, ungapped sequences, clustered at a threshold of 62% sequence identity (Figure 33.11). Restricting the counts to ungapped blocks reflects the fact that amino acids in gapped alignment regions are structurally not comparable, so their alignment would be meaningless in the first place.

Figure 33.10 An overview of the biophysical properties of amino acids as a Venn diagram. Amino acids are given in the one-letter code; cysteine appears twice, since the properties of the free thiol of the cysteine side chain (C-SH) and of the disulfide-bonded cystine (C-S-S) are very different. Individual properties can be accurately quantified: for example, "hydrophobicity" can be measured as the free energy of transfer from a water phase to octanol, and "size" can correspond to the volume of the solvent-accessible surface of a residue. However, their relative weights are arbitrary, and such measures cannot be combined into a single number in a rigorous way; which measure is the most appropriate depends on the question. To quantify similarity for the purpose of sequence alignment, measures based on the ability of one amino acid to substitute for another in natural sequences have proven most successful. (Diagram after and adapted from Taylor, W.R. (1986) The classification of amino acid conservation. J. Theor. Biol., 119 (2), 205–218.)


Figure 33.11 The BLOSUM62 mutation data matrix. Amino acids are given in the one-letter code. Positive values indicate a high probability of exchange in related sequences and increase the alignment score; negative values decrease the score and indicate that such pairs would be more likely in an alignment of unrelated sequences. An F → Y pair is highlighted; its value is +3. Values on the diagonal correspond to the probability that a residue is conserved. For example, a conserved tryptophan (+11) or cysteine (+9) is a much more significant indicator of homology than the more generic alanine, leucine, or serine (+4).

33.7.2 Optimal Sequence Alignment

How can we use the mutation data matrix to determine the correct alignment, in which all matched amino acid pairs have descended from the same ancestral residue? The simple answer is: we can't. In general we cannot prove that a particular alignment is "correct" unless we could observe the sequence through its entire evolutionary trajectory. What we can do is compute an optimal alignment. To the degree that the mutation data matrix represents evolution, the optimal alignment is our best guess at the correct one. And computing an optimal alignment should be simple: just create all possible alignments, score them, and pick the best.

Unfortunately, it turns out that this cannot be done. Natural sequences do not always conserve their length; we frequently find insertions and deletions. An insertion in one sequence is a deletion in the other, so we usually use the portmanteau "indel" to describe both. Since indels exist, generating all possible alignments means considering three options at every position: a match, an insertion, and a deletion. Thus the number of possible alignments for two sequences of 100 amino acids is larger than the number of particles in the visible universe; we generally consider computational problems of that size intractable. Fortunately, an efficient algorithm to solve this problem was published by Saul Needleman and Christian Wunsch in 1970. The principle is quite straightforward and relies on the fact that we treat the aligned amino acid pairs as independent. Assume we want to align two sequences of 100 amino acids optimally. We can simplify this problem by computing an optimal alignment for 99 amino acids and extending it in the best possible way. But how do we compute that alignment? By computing an optimal alignment for 98 amino acids and extending it in the best possible way. And so on . . .
until we need to compute the optimal alignment for just one amino acid pair. This score we can simply retrieve from the scoring matrix. In this way, building the alignment from the bottom up, we can construct the alignment in a recursive procedure. The problem is that the recursive procedure is still not efficient enough to be tractable. But considering how it is computed, Needleman and Wunsch noticed that the same partial alignments needed to be computed over and over again. That is of course not efficient – it

33 Sequence Data Analysis

889

Figure 33.12 Partial output of the program needle in the EMBOSS package. The sequences of yeast Mbp1 and its homologue Swi4 have been aligned, the region covering the DNA binding domain is shown. Identical amino acid pairs are marked with a “|” character, more or less similar characters are marked with “:”, respectively “.”.

is much better to compute each partial alignment only once, store it, and reuse it when necessary. This strategy is called memoization or dynamic programming, and applying it makes the algorithm quite fast – its computational requirements scale only with the square of the sequence length: roughly 10^4 steps versus the 10^100 alignments of the brute-force approach, a rather large difference. Several details still need to be clarified, such as how to penalize indels and which global optimum is reported in case there are several equally good ones, but these do not change the essence of the approach. There is no quantitative theory that describes the probability and length of indels through evolutionary change. Empirically we find that indels become less frequent the longer they are, but there is evidence that more than one distinct mechanism causes them, since log-length versus log-frequency plots of indels from database-scale comparisons show distinct regimes. Still, it is computationally efficient to simply model indels with a constant insertion (gap-opening) penalty plus an extension penalty that grows with the indel length. The actual values obviously have to harmonize with the pair-scores in the mutation data matrix; typical values ensure that an indel of two residues or so needs to be justified by two extra identities in the alignment to improve the overall score. Default values vary a bit between implementations: for example, the NCBI uses (insertion, extension) penalties of (11, 1) for BLAST alignments (see below), while the EMBOSS package uses (10, 0.5) for optimal sequence alignments (Figure 33.12), both with the BLOSUM62 matrix. Users can adjust the parameters to favor or disfavor indels and thereby gain some intuition about which parts of an alignment are robust and which depend critically on the detailed alignment parameters. The Needleman–Wunsch algorithm computes global optimal alignments that cover the full length of both sequences.
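The dynamic-programming recurrence described above can be sketched in a few lines of Python. This is a minimal illustration: a toy match/mismatch score and a simple linear gap penalty stand in for a real mutation data matrix and affine gap costs, and only the optimal score is returned, without the traceback that recovers the alignment itself.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (memoized table)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning the prefix a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # prefix of a aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match or mismatch
                          F[i - 1][j] + gap,     # gap in b (deletion)
                          F[i][j - 1] + gap)     # gap in a (insertion)
    return F[n][m]
```

Filling the (n+1) × (m+1) table touches each cell once, which is the source of the quadratic scaling discussed above.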
Frequently, however, we are interested in locally optimal alignments that cover only those parts of the sequences that can be well aligned in the first place. This is especially important if our sequences have very different lengths, or if we want to focus on individual domains – which is always to be recommended, since multidomain proteins are not necessarily homologous over their entire length. Temple Smith and Michael Waterman proposed a variation of the Needleman–Wunsch algorithm that achieves this. Two modifications are required. First, the scoring matrix must be scaled such that it has a negative overall expectation value for pair scores; otherwise a random extension of an alignment could still improve the score. Second, whenever a partial alignment reaches a negative score, that score is reset to zero. The highest-scoring partial alignment is then sought out; this marks one end of the


Part V: Functional and Systems Analytics

best local alignment in the dynamic-programming matrix. Tracing back through its sub-alignments until the score drops to zero recovers the other end; the stretch in between is the best local alignment.
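The Smith–Waterman modification changes only one line of the dynamic-programming recurrence: a zero floor on every cell, so that a poorly scoring partial alignment can be abandoned and restarted. A minimal sketch, again with toy scores in place of a real scoring matrix, returning only the best local score:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score: negative partial scores are reset to zero."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # restart the alignment here
                          H[i - 1][j - 1] + s,    # match or mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best
```

Note that the first row and column stay at zero: unlike the global algorithm, unaligned flanking sequence costs nothing.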

33.7.3 Alignment for Fast Database Searches: BLAST

While the optimal sequence alignment is the best we can create, it requires significant resources to compute on a large scale – and by large scale we mean comparing millions of sequences. In fact, the protein section of the NCBI's non-redundant RefSeq database contains on the order of 100 million sequences. Stephen Altschul and colleagues published an algorithm in 1990 that is much better suited than optimal alignment to searching for homologous sequences in large databases. The alignments are not guaranteed to be optimal, but the differences are usually negligible, and the NCBI's BLAST (Basic Local Alignment Search Tool, blast.ncbi.nlm.nih.gov/) services are powerful enough to process the world's sequence searches, free of charge, every day; BLAST is by far the most frequently used bioinformatics tool. The algorithm works by breaking a query sequence into small fragments and looking up clusters of matching fragments in a large index table. Once matches are found, the algorithm attempts to extend them locally without indels. These ungapped seed alignments are evaluated for statistical significance and joined into full-length alignments. The algorithm is fast because the first step – the look-up of initial matches in a table – is fast; the program's effort is thus concentrated in those regions of database sequences where matches are promising, while the very large sequence space in which no significant alignment is possible in the first place is ignored. A key feature of this process is the computation of statistically meaningful measures of the significance of a match – BLAST reports these as E-values. The E-value is the number of alignments of at least the same score that would be expected by chance alone between unrelated sequences in a database of the same size.
This means the E-value is not a measure of whether two sequences are homologous, but of how often unrelated sequences could reach the same score by chance. This depends on the database size – and, paradoxically, this means that the more sequences become available, the harder it is for a given alignment to achieve significance. As a corollary, the database in which a sequence search is performed should be chosen carefully: the smallest suitable database (usually RefSeq) and the most specific subset of organisms we are interested in. Organism subsets can be selected directly on the BLAST input form.
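The dependence of the E-value on database size can be made explicit with the Karlin–Altschul formula E = K·m·n·e^(−λS), where m and n are the query and database lengths and S is the alignment score. The parameter values below are illustrative defaults in the range reported for gapped BLOSUM62 scoring; real BLAST derives K and λ from the actual scoring system and applies edge-effect corrections.

```python
import math

def e_value(score, query_len, db_len, K=0.041, lam=0.267):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).
    K and lam are illustrative values for gapped BLOSUM62 scoring;
    BLAST computes them from the scoring system and corrects for
    sequence-edge effects."""
    return K * query_len * db_len * math.exp(-lam * score)
```

Doubling the database length doubles E, which is exactly why a given alignment finds it harder to reach significance as databases grow.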

33.7.4 Profile-Based Sensitive Database Search: PSI-BLAST

What are the options if a BLAST search yields too few homologous sequences? In many cases a profile-based search with the PSI-BLAST algorithm can yield additional results. Profiles average the sequence information of individual hits and thus increase the sensitivity of searches by enhancing the signal-to-noise ratio of alignments. In essence, a profile is a position-specific scoring matrix over the length of a sequence; it allows the search to focus on the structurally and functionally important conserved regions while down-weighting spurious similarities in highly variable segments. PSI-BLAST can simply be selected as a search option on the NCBI's BLAST input form and initially works just like a BLAST search. After the first search, hits are assembled into a profile and the search is iterated. Typically, new sequences whose alignments were initially not significant are discovered by searching with the profile, and these can themselves be added to the profile over several rounds, perhaps until convergence – that is, until no new homologues are found. In this way homologous sequences can be identified even when their pairwise sequence identity is far below significance thresholds. This is, however, not an entirely automatable process: user discretion is required when adding sequences, since the mistaken addition of even a single non-homologous sequence will corrupt the profile – it will pull in more and more of the false-positive match's homologues and lead to completely misleading results. To guard against this, one should evaluate sequence annotations and be wary of sequences with very different annotated functions or cellular locations. Sequences that cover only small portions of the query may also be suspicious. In case of doubt, it is better to manually exclude a questionable


sequence and observe how its E-value changes from iteration to iteration. If the E-value gets worse or stays approximately constant, the growing information content of the profile does not agree with the sequence information in the questionable sequence, and it is unlikely to be a member of the family. The E-values of genuinely homologous sequences, in contrast, are expected to improve (decrease) markedly over the iterations.
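A profile in the PSI-BLAST sense is a position-specific scoring matrix. The sketch below builds per-column log-odds scores from a gap-free alignment, using a uniform 1/20 background and simple pseudocounts; these are deliberate simplifications of PSI-BLAST's sequence weighting and background model, intended only to show the principle.

```python
import math
from collections import Counter

def pssm(aligned_seqs, background=0.05, pseudo=1.0):
    """Position-specific log-odds scores from a gap-free multiple alignment.
    Uniform 1/20 background and simple pseudocounts -- a toy stand-in for
    PSI-BLAST's sequence weighting and background frequencies."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    length = len(aligned_seqs[0])
    matrix = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        total = len(aligned_seqs) + pseudo * len(alphabet)
        # log-odds of observed (pseudocount-smoothed) vs background frequency
        scores = {aa: math.log2(((counts[aa] + pseudo) / total) / background)
                  for aa in alphabet}
        matrix.append(scores)
    return matrix
```

Residues conserved in a column receive positive scores, unseen residues negative ones, which is how the profile emphasizes conserved regions over variable ones.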

33.7.5 Homology Threshold

Finally, what is the threshold of similarity at which we may conclude that two sequences are indeed homologous? Homologues may have no recognizable sequence similarity at all and are sometimes discovered solely from structural similarity, similar function, and a similar organization of the protein's active site; low sequence similarity therefore does not exclude homology. On the other hand, as a good rule of thumb, there are virtually no sequence pairs with more than 25% identity over the length of a structural domain that are not homologous. This value is therefore often used as a cut-off for pairwise comparison.
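The 25% rule of thumb refers to identity over the aligned region. A small helper shows one common way to compute it; note that gap-handling conventions (and hence the reported percentage) differ between programs, so the choice to divide by gap-free columns here is just one convention.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over aligned columns, skipping positions where
    either sequence has a gap ('-'). Conventions vary between tools;
    this divides by the number of gap-free columns."""
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)
```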

33.8 Multiple Alignment and Consensus Sequences

In the post-genomic era, the likelihood is very high that database searches will discover more than one homologue of a given target sequence. The additional information that can be derived from such sequences is significant, since the conservation of individual positions correlates with their structural and/or functional importance. A multiple sequence alignment can additionally help define domain boundaries, allow evaluation of whether sequence annotations are conserved, and serve as the input for analyzing phylogenetic relationships. However, while optimal pairwise alignments are exactly computable, this approach cannot be extended to multiple alignments, for two reasons. First, there are practical limitations to the algorithm: the resource requirements grow exponentially with the number of aligned sequences. More importantly, however, the objective function – the score our algorithm should optimize – is hard to define in a biologically meaningful way. Should we maximize column-wise sequence similarity? Should we minimize the number and size of indels, since these are rare evolutionary events? Should we aim to cluster indel sites outside of secondary-structure elements? Or should we ensure that sequence patterns and motifs are conserved as highly as possible? Each of these objectives suggests a different computational strategy, and they cannot necessarily be achieved simultaneously. Even today, multiple sequence alignment is not a solved problem, and promising new algorithms appear almost every year. Different alignment programs will indeed give different alignments, and it is not trivial to choose the best. We can certainly say that first-generation algorithms like CLUSTAL are no longer state of the art and should no longer be used.
But many modern algorithms show only marginal differences when benchmarked against a curated database of structurally aligned reference domains such as BAliBASE (http://www.lbgi.fr/balibase/). Tools that can be highly recommended include T-Coffee, MUSCLE, and MAFFT; these regularly outperform other algorithms in direct comparisons, and all are accessible through a launch page at the EBI (http://www.ebi.ac.uk/Tools/msa/). However, the "winner" by a small margin is often the program ProbCons (http://probcons.stanford.edu/). In practice one is well advised to perform alignments with several different programs and compare them carefully to identify those parts of the alignment that are robust and independent of the details of algorithm and parameters (Figure 33.13). The program Jalview (https://www.jalview.org/) is an excellent tool for organizing and editing alignments – yes, manual editing and improvement is encouraged, since biological background knowledge can often improve an automated alignment – as well as for computing alignment features, including a first-look phylogenetic analysis, and managing annotations. Alignments can be imported and exported in various formats and can be computed efficiently on a dedicated server.




Figure 33.13 Multiple sequence alignment of the N-terminal PH domain of pleckstrin and the spectrin and dynamin PH domain. For these proteins structure coordinates are available and the alignment can be validated with the structural superposition. The top three sequences thus represent the ground truth, the bottom three sequences are excerpts of a larger alignment that was computed using CLUSTAL. Secondary structure elements are shown and the conserved residues of the hydrophobic core shaded. Errors in the CLUSTAL alignment are visible where the shaded columns are not aligned correctly.

Figure 33.14 Use of a large multiple alignment to engineer protein stability in an immunoglobulin VL variable domain. (a) A sequence alignment is used to determine positional amino acid frequencies. For example, leucine appears four times as often as alanine in position 15 of the sequence. (b) Experimentally determined reversible folding stability changes: A15L stabilizes the domain by 5.7 kJ mol⁻¹. Virtually all predicted mutations could be experimentally verified. Source: Steipe, B. (2004) Consensus-based engineering of protein stability: from intrabodies to thermostable enzymes. Methods Enzymol., 388, 176–186. With permission, Copyright © 2004 Elsevier Inc.

A remarkable application of multiple alignments is the use of consensus sequences for protein stability prediction (Figure 33.14). It can be shown that changes in amino acids that make a sequence more similar to a family-consensus sequence in general stabilize a protein. Such changes have additive, independent effects to a good approximation.
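The consensus approach of Figure 33.14 rests on positional amino acid frequencies. The sketch below derives them from a gap-aware alignment and lists the positions where a target sequence deviates from the family consensus – candidate stabilizing substitutions in the spirit of consensus engineering. The function name and output format are illustrative, not taken from any published tool.

```python
from collections import Counter

def consensus_analysis(aligned_seqs, target):
    """Return (position, target residue, consensus residue, consensus
    frequency) for every position where the target deviates from the
    family consensus. Positions are 1-based; gaps are ignored."""
    suggestions = []
    for pos, residue in enumerate(target):
        counts = Counter(seq[pos] for seq in aligned_seqs if seq[pos] != "-")
        consensus, n = counts.most_common(1)[0]
        if consensus != residue:
            suggestions.append(
                (pos + 1, residue, consensus, n / sum(counts.values())))
    return suggestions
```

Because the stabilizing effects are approximately additive and independent, each suggested position can in principle be tested and combined separately.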

33.9 Structure Prediction

The accurate prediction of structure from sequence has long been regarded as the holy grail of computational biology. While we have seen impressive results in this respect in recent years, we cannot yet consider it a completely solved problem. The quest for structure prediction has, however, lost some of its prominence in a field that has moved away from considerations of individual molecules and towards more comprehensive questions of systems biology.


Since homologous proteins have similar structures, a careful sequence alignment can serve as the basis for homology-based structure prediction. In its simplest form this consists of replacing the side chains of the aligned template structure. For proteins with about 90% sequence identity, homology models are about as accurate as experimentally determined structures. Unfortunately, regions that contain indels cannot be accurately modeled, and even elaborate force-field-based energy minimization has not been consistently shown to improve the modeled structure. The positive side of this is that it keeps homology modeling simple, and for most purposes Web-server-based modeling, for example at the SWISS-MODEL server (http://swissmodel.expasy.org/), is perfectly adequate. The critical step of homology modeling is not the generation of 3D coordinates but producing an accurate alignment between target and template. One should aim to produce the best possible alignment from a number of selected homologues, perhaps manually edited with Jalview, as the basis of the model. Such alignments can be uploaded to the server and are likely superior to the automated procedures the server itself offers. Automated ab initio predictions have been successful in many cases with the Rosetta program pioneered by David Baker. The algorithms are available through the online Robetta server (http://robetta.bakerlab.org/) or from the open-source collaboration RosettaCommons (https://www.rosettacommons.org/). Highlights of successful predictions include models built for crystallographic phasing and successful de novo enzyme design.

33.10 Outlook

Entering the post-genomic era of molecular and cellular biology has changed the field of sequence analysis profoundly. Procedures are readily available on the Internet and can be executed on standard workstations or on free public servers. Genome browsers (http://genome.ucsc.edu/) deliver annotated maps of whole genomes to researchers around the globe at the click of a mouse. Problems of data integration are being addressed and solved step by step. At the same time, novel problems become apparent, especially regarding mining large datasets for information, visualizing large, high-dimensional relationships, filtering relevant information from the abundance of available resources, and unraveling the complex relationship between genotype and phenotype. We are still far from routinely making confident predictions about cellular processes and their dynamics. However, modern methods of sequence analysis provide a multitude of views on the function of individual components, and the stage is set to improve our understanding of how to integrate these views into larger systems. Computational methods will clearly continue to grow in importance as we pursue a deeper understanding of life.


34 Analysis of Promoter Strength and Nascent RNA Synthesis

Renate Voit University of Applied Sciences Bonn-Rhein-Sieg, Department of Natural Sciences, von-Liebig-Straße 20, 53359 Rheinbach, Germany

Maintenance of the basal physiological functions of a cell depends on tissue-specific regulation of gene expression at a particular time. A complex network of cellular pathways and factors controls transcription, the first step in gene expression, leading to a cell-type-specific transcriptome. The nucleus has a highly ordered structure, with the genome packaged into more or less tightly compacted chromatin. Since highly compacted chromatin inhibits access of proteins involved in gene expression, transcribed genes are usually located in less compacted areas, the euchromatin, permitting transcription factors to access promoters and adjacent regulatory cis-acting sequence elements. Gene expression can be regulated at several levels: by long-range elements as well as by promoter-proximal and promoter-distal regions, including locus control regions (LCRs). Many questions concerning gene expression require analysis of the corresponding RNA produced from the gene locus. In addition to determining the amount of transcript, which usually reflects the rate of gene expression and the strength of the promoter, analysis may include mapping the 5´ and 3´ ends of the transcription unit and mapping alternative splice variants. Promoter analysis can be used in one of two ways. The cis-active sequence elements can be altered by in vitro mutagenesis to elucidate the effects of these mutations on the rate of mRNA synthesis. Alternatively, the wild-type promoter can be used to determine how the amount of mRNA transcribed from a chosen gene varies under different physiological conditions. Experimental solutions for both types of question are offered by in vitro transcription in cell-free systems or by the analysis of cloned genes in vivo after transfection and expression in cultured mammalian cells. In addition, epigenetic factors such as histone modifications and DNA methylation alter gene expression levels.
Experimental approaches related to these topics have been described in detail in other chapters of this book. Histone modifications at specific gene loci can be mapped most effectively with the chromatin immunoprecipitation (ChIP) assay.

34.1 Methods for the Analysis of RNA Transcripts

34.1.1 Overview

To understand the regulation of a particular gene, it is important to know the amount of RNA produced by that gene. Quantification of the transcript of interest delivers an important indication of the strength of gene expression. Several methods are available to determine the amount of RNA produced by a gene. A prerequisite for the successful execution of all the methods described is the excellent quality of the RNA employed. To isolate intact RNA molecules from cells or tissues, contamination with ribonucleases and unspecific degradation during extraction and isolation must be avoided.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Isolation of Intact RNA, Section 26.6



Northern Blot, Section 27.4.4 Dot-blot, Section 27.4.5

Polymerase Chain Reaction (PCR), Chapter 29

The methods described below for the analysis of RNAs are based on the hybridization of nucleic acids. Since complementary RNA and single-stranded DNA form very stable hybrids, the resulting RNA-RNA or RNA-DNA hybrid molecules can be detected qualitatively and quantitatively either directly (Northern blot, dot blot) or after treatment with single-strand-specific nucleases (nuclease S1, ribonucleases A and T1). With the aid of nuclease S1 analysis and ribonuclease protection assays (RPAs) it is possible to quantify the amount of an RNA of interest and to map its introns and the 5´ and 3´ ends on its gene. The primer extension assay uses oligonucleotides complementary to the RNA of interest that are hybridized to the isolated RNA. The resulting hybrids are extended towards the RNA's 5´ end with reverse transcriptase (RT) to generate cDNAs. This technique allows the amount of gene-specific RNA to be measured and the 5´ end of the RNA to be determined, even across intron-containing regions. With the fourth technique, Northern blot hybridization, the amount and the absolute size of an mRNA can be determined, but not the precise locations of its ends. A variant of the Northern blot, dot blot analysis, serves only to quantify the amount of RNA. The most modern method of quantification is quantitative RT-PCR, in which the mRNA is first transcribed into cDNA with RT, followed by quantification of the cDNA with the aid of gene-specific primers.

34.1.2 Nuclease S1 Analysis of RNA

This widely used procedure, published by Berk and Sharp back in 1977, employs the enzyme nuclease S1, a single-strand-specific ribo- and deoxyribonuclease from the fungus Aspergillus oryzae. It hydrolyzes only single-stranded DNA and RNA, such as single-stranded overhangs in otherwise double-stranded DNA or DNA-RNA molecules, or single-stranded loops in double-stranded molecules. Cleavage by nuclease S1 thus selectively removes single-stranded segments while leaving double-stranded segments intact. Although the method is suitable for the quantification of transcripts, its particular strength lies in mapping the 5´ and 3´ ends of transcripts and localizing introns.

Reaction Principle

In the first step of the nuclease S1 analysis, a labeled single-stranded DNA probe complementary to the RNA sought is hybridized in solution to the isolated mixture of RNAs. After addition of nuclease S1, all single-stranded sections are digested and only the paired RNA-DNA hybrids remain. This double-stranded, cleavage-resistant RNA-DNA fragment is often referred to as the protected fragment. The labeled DNA fragments are then separated on an acrylamide gel and visualized by autoradiography or other suitable means, depending on the label employed. The resulting data provide two pieces of information: the size of the remaining fragments allows calculation of the distance to the end of the transcript or to a splice site, and the intensity of the signal is proportional to the concentration of the complementary RNA species in the RNA mix and therefore reports the amount of the specific transcript.

Quantitative Nuclease S1 Analysis

If the goal is to quantify an RNA species, a complementary DNA probe can be chosen that hangs over the 5´ end of the RNA of interest. After hybridization, nuclease S1 digests the overhang (Figure 34.1). DNA probes between 40 and 80 nucleotides long are particularly well suited for this assay. T4 polynucleotide kinase can be used to efficiently label the 5´ end of the DNA oligonucleotide. The specific activity with which the hybridization probe is labeled is of decisive importance for the sensitivity of detection. Alternatively, or in addition, a second labeled probe can be added to the sample in an equimolar amount to the first. This allows direct comparison of the amounts of two different RNA species in a single experiment, provided the resulting fragments differ in length. To ensure the validity of the RNA quantification, the hybridization probe must always be present in significant molar excess relative to the amount of RNA to be quantified, and the hybridization must run to completion. The ideal hybridization conditions for a particular DNA probe and the corresponding RNA samples must be determined empirically.



Figure 34.1 Principle of the nuclease S1 analysis for the quantification of RNA. The schematic diagram shows a typical result with the full-length probe in lane 1 and the protected fragment in lane 2. Lane 3 contains the negative control lacking RNA and lane 4 the nucleic acid size standards. A 5´ -end labeled probe is used for this purpose.

The labeled DNA probe is present in sufficient excess when, at a fixed amount of DNA probe, the signal resulting from a titration of differing amounts of RNA is proportional to the RNA concentration. To determine whether the hybridization reaction is complete, samples hybridized for varying lengths of time can be digested with nuclease S1. The reaction can be regarded as complete when the resulting signal no longer increases with longer hybridization periods.

Nuclease S1 Mapping of RNA 5´ and 3´ Ends

Figures 34.2 and 34.3 show how nuclease S1 can be used to map the ends of an RNA and localize introns. A suitable hybridization probe must be selected depending on the question. If the 5´ end of a transcript is under investigation, a complementary DNA fragment is chosen that hangs over the 5´ end of the RNA and is labeled at its own 5´ end. After incubation with nuclease S1 the non-paired 3´ end of the DNA probe is degraded, resulting in a shortened fragment. On a denaturing polyacrylamide gel the length of this protected fragment can be measured; it corresponds to the distance between the 5´ end of the probe and the 5´ end of the RNA, the transcription start site. For this sort of high-precision mapping a suitable size standard should be present on the same gel, ideally a Maxam–Gilbert sequencing reaction of the probe employed.
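The mapping arithmetic can be written down explicitly. In the sketch below the coordinate convention is illustrative: the probe's labeled 5´ end is assumed to pair with a known position on the sense strand (1-based gene coordinates), and the protected fragment length, read off the gel, is the distance from that position back to the transcription start site.

```python
def map_transcription_start(probe_5prime_pos, protected_len):
    """Infer the transcription start site (TSS) from a nuclease S1 mapping.
    probe_5prime_pos: 1-based sense-strand position paired with the 5'-end
    label of the antisense probe. Only the stretch between the label and
    the RNA 5' end survives digestion, so the protected fragment length
    measures the distance back to the TSS. Coordinates are illustrative."""
    return probe_5prime_pos - protected_len + 1
```

For example, a 61-nucleotide protected fragment from a probe whose label pairs at position 160 places the start site at position 100.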



Figure 34.2 Mapping the 5´ and 3´ ends of an RNA. Lanes of the schematic are as described in Figure 34.1. Quantification makes use of a 5´ or 3´ end labeled probe.

A general problem of nuclease S1 mapping is that under sub-optimal reaction conditions nuclease S1 can hydrolyze double-stranded stretches of nucleic acids, which compromises the specificity of the method. For this reason it is important to always first optimize the reaction conditions in terms of reaction temperature and salt content.

PCR, Chapter 29

Mapping of the 3´ end of an RNA proceeds analogously (Figure 34.2): here the DNA probe must extend beyond the 3´ end of the corresponding RNA and must be labeled at its 3´ end. To map exon/intron positions, a probe corresponding to the genomic sequence is labeled and hybridized to the isolated RNA. If introns are present in the genomic sequence, the regions corresponding to exons pair with the RNA, while the regions corresponding to introns remain as single-stranded loops (Figure 34.3). Nuclease S1 digests these single-stranded regions, leaving the paired regions corresponding to the exons intact. PCR can also be used to determine intron/exon boundaries: PCR assays for the characterization of RNA species are carried out in parallel with sets of primer pairs on genomic DNA and on reverse-transcribed RNA. If the resulting products differ in length, this indicates the presence of an exon/intron junction in that segment.

34.1.3 Ribonuclease Protection Assay (RPA)

The ribonuclease protection assay is an alternative to the nuclease S1 technique. Like nuclease S1 analysis, RPA is based on the hybridization of isolated cellular RNA to a labeled nucleic acid probe complementary to the RNA of interest. RPA is more sensitive than nuclease S1 analysis because the probe is labeled along its entire length, not just at its ends, and therefore has a higher specific activity. In addition, RNA-RNA hybrids are unusually thermostable.



Figure 34.3 Locating an exon of a gene. Lanes of the schematic as described in Figure 34.1. Quantification makes use of a homogenously labeled probe.

As hybridization probe one needs an RNA complementary to the RNA of interest, generated by in vitro synthesis of an antisense transcript, often referred to as a riboprobe. A prerequisite for the riboprobe is that the genomic region containing the gene segment of interest has been cloned into a transcription vector. Transcription vectors typically contain specific promoters for one or more of three different bacteriophage RNA polymerases: SP6, T3, and T7. With such a vector it is easy to generate a run-off transcript of the region downstream of the promoter in vitro by adding the phage-specific RNA polymerase and the four ribonucleotides, as illustrated in Figure 34.4. The resulting probe is useful not only for RPA but also as a hybridization probe in Northern blots. RPA makes use of the labeled antisense RNA probe for hybridization with isolated total RNA (Figure 34.5). The resulting perfectly paired RNA-RNA hybrid molecules are incubated with a mix of ribonucleases A and T1. Both nucleases specifically hydrolyze single-stranded RNA but differ in their specificity: RNase A, which is isolated from bovine pancreas, cleaves the phosphodiester bond after pyrimidine nucleotides, while RNase T1, from Aspergillus oryzae, cleaves after guanine nucleotides. In the reaction, therefore, all single-stranded RNA regions are digested: all RNA molecules not complementary to the probe, single-stranded overhangs in the RNA-RNA hybrids, and the free, unbound probe. Only perfectly paired RNA-RNA hybrids consisting of cellular and probe RNA remain intact as protected fragments. These hybrids are subsequently separated



Figure 34.4 Creation of a riboprobe from a cloned gene.

Figure 34.5 Principle of the ribonuclease protection assay. The schematic diagram shows the labeled probe in lane 1, the protected fragment in lane 2, and the nucleic acid size standards in lane 3.


on a gel, as in nuclease S1 assays, and visualized. Since the intensity of the resulting signal is proportional to the amount of hybridized cellular RNA, the amount of this RNA species in the reaction can be determined. Besides quantification of a particular RNA species, this detection method can be used to map the ends of transcripts. The same criteria discussed in Section 34.1.2 apply to the choice of hybridization probe, except that a transcription vector must be used for the subcloning to allow generation of the probe. RPA has several advantages over classical nuclease S1 analysis. The antisense RNA can be created in relatively large amounts and labeled to high specific activity, and RNA-RNA hybrids are significantly more thermostable than RNA-DNA hybrids – a key requirement for the creation of distinct protected fragments after nuclease treatment. Both factors increase the sensitivity of the detection: compared to a nuclease S1 assay with end-labeled probes, RPA typically increases the sensitivity 20- to 50-fold. In addition, the hydrolysis with the ribonucleases is more reliable and reproducible, since, as mentioned above, nuclease S1 reactions must first be optimized in terms of temperature, salt, and enzyme concentration. The high sensitivity of the method also has disadvantages, however. If incomplete transcripts are generated during the in vitro transcription reaction in addition to full-length ones, a mix of protected fragments will be observed after the RNase treatment. Incomplete transcripts can result when the bacteriophage RNA polymerase pauses on the template and the reaction terminates prematurely. Pausing is sequence dependent and can be minimized by choosing a relatively short gene segment, between 100 and 300 bp in length, as the DNA template for the riboprobe.
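The complementary specificities of the two nucleases can be summarized in a few lines: RNase A cuts after pyrimidines (C, U), RNase T1 after G, so together they cleave after every nucleotide except A. The helper below, a mnemonic for these specificities rather than a simulation of the assay, lists the cut positions in a single-stranded RNA.

```python
def rnase_cleavage_sites(rna):
    """1-based positions after which an RNase A + T1 mix cuts
    single-stranded RNA: RNase A cleaves after pyrimidines (C, U),
    RNase T1 after guanine (G). Double-stranded regions are not cut."""
    return [i + 1 for i, nt in enumerate(rna.upper()) if nt in "CUG"]
```

An unpaired stretch therefore survives digestion only if it consists entirely of A residues, which is why essentially all single-stranded regions are removed in the assay.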

34.1.4 Primer Extension Assay

Primer extension analysis is often used to determine the 5′ end of an RNA species and its amount in the same assay. The reaction is catalyzed by the RNA-dependent DNA polymerase reverse transcriptase (RT) in the presence of an RNA template and a DNA primer. The primer is single-stranded, complementary to a given RNA, has a length of about 20–40 nucleotides, and is radioactively labeled at the 5′ end using [γ-32P]-ATP and T4 polynucleotide kinase (PNK). In the first step of the reaction, this short 32P-labeled DNA primer (5–10 pmol) is hybridized in solution with total or poly(A)-RNA (1–5 μg). After addition of RT and deoxynucleotides, the DNA primer is elongated from its 3′-OH end by the enzyme, generating a cDNA (copy DNA) complementary to the RNA template bound by the primer. Ideally, the reaction stops at the 5′ end of the RNA template. The schematic in Figure 34.6 summarizes the steps of a primer extension assay. The newly synthesized 5′-end-labeled cDNA can be visualized after denaturing polyacrylamide gel electrophoresis (PAGE) by autoradiography or phosphorimaging. The autoradiogram provides information about both the quantity and the length of the original RNA species. The amount of synthesized cDNA corresponds to the amount of the given RNA species in the initially isolated RNA pool: a high copy number of the given RNA in the total RNA population leads to synthesis of many cDNA molecules, which in turn results in a strong radioactive signal on the autoradiogram. On denaturing gels the length of the cDNA can be determined exactly with appropriately sized markers separated on the same gel. Together, the length of the cDNA and the position of the DNA primer reflect the distance from the primer to the 5′ end of the corresponding RNA. The position and specificity of primers used in this reaction must be optimized.
The distance of the primer to the 5′ end of the corresponding RNA should not exceed 100 bases, to avoid premature termination of cDNA synthesis by the RT due to extensive secondary structures in the RNA. Multiple radioactive bands on the autoradiogram may indicate premature termination. In such a case, however, it is difficult to distinguish prematurely terminated cDNAs from variable 5′ ends of the original RNA template. Indeed, more than one transcription start site of the corresponding gene will produce heterogeneous RNAs that differ only in their 5′ ends. In addition, RNAs encoded by multigene families may produce cDNAs of different lengths in a primer extension assay if the target sequence selected for the primer is conserved among members of the gene family.
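The mapping arithmetic described above can be sketched in a few lines of Python; the primer position, product length, and resulting start-site coordinate below are hypothetical and serve only to illustrate how a cDNA length read off the gel translates into a 5′-end position.

```python
def map_five_prime_end(primer_end_position, cdna_length):
    """Infer the 5' end (transcription start) of an RNA from a primer
    extension product.

    primer_end_position: sense-strand coordinate paired with the 5' end of
        the antisense primer (the most downstream position covered by the
        cDNA); hypothetical numbering for illustration.
    cdna_length: length of the extension product in nucleotides, read off
        the denaturing gel (the product includes the primer itself).

    The cDNA spans from that downstream position back to the RNA's 5' end,
    so the start site lies (cdna_length - 1) nucleotides upstream of it.
    """
    return primer_end_position - (cdna_length - 1)

# Hypothetical example: a primer whose 5' end pairs with sense position
# 1080 yields a 95-nt extension product, placing the 5' end at 986.
tss = map_five_prime_end(1080, 95)
print(tss)  # 986
```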


Part V: Functional and Systems Analytics

Figure 34.6 Principle of the primer extension assay. The schematic diagram shows the position of the end-labeled oligonucleotide in lane 1, the cDNA primer extension product in lane 2, and excess labeled primer in both lanes.

Primer extension is frequently used for the detection of transcripts produced after transient transfection of cells or by in vitro transcription in cell-free transcription systems. Because variations in the amount of product may not reflect genuine differences between samples, it is important to build controls into the experiment that monitor the efficiency of transfection, of RNA isolation, and of the primer extension reaction itself. A heterologous RNA can serve as an internal control; it is detected with a second specific oligonucleotide probe in the same reaction, in addition to the primary probe used to detect the RNA of interest.

Mapping of the 5′ ends of RNAs can also be accomplished with two other methods described earlier in this chapter. What are the basic differences? The nuclease S1 and ribonuclease protection assays are based on the formation of stable RNA–DNA or RNA–RNA hybrids, respectively, between the RNA and the complementary radioactive probe. Probe regions that remain unhybridized are digested by nuclease S1 (RNA–DNA) or by ribonuclease (RNA–RNA). The resulting protected fragments may therefore represent either the true 5′ end of a processed RNA or an intron/exon boundary of the pre-mRNA. In contrast, primer extension by RT stops at the 5′ end of the RNA template after splicing has been completed, thereby delineating the true 5′ end of the relevant RNA.

34.1.5 Northern Blot and Dot- and Slot-Blot

Northern Blot

Northern blotting is a common and simple method to quantify a particular RNA within an RNA pool. This method is also described in Chapter 27, so only points related to RNA quantification will be discussed here. For Northern blotting, either total cellular RNA or poly(A)+-enriched RNA is denatured to unfold secondary structures in the RNA molecules, followed by electrophoresis on denaturing agarose gels to separate the RNAs according to their size. The RNA is then transferred from the gel to a nitrocellulose or nylon membrane. Specific RNAs are detected by membrane hybridization using a radioactively labeled RNA or DNA hybridization probe complementary to the relevant RNA. Northern blotting is useful to determine the relative amount, as well as the size, of a particular RNA, but not to define the precise 5′ or 3′ ends of the RNA. This technique is often used for analysis of a given RNA within a highly heterogeneous RNA pool, or in RNA pools isolated from different sources (e.g., different cell lines or tissues). Comparison of the amount of a particular RNA from different RNA sources requires loading of equal amounts of total RNA as well as re-hybridization of the membrane with a normalization probe, specific to an RNA that is similarly expressed in many cell types and tissues, to act as an internal standard. Usually, transcripts from so-called "housekeeping" genes – that is, genes believed to be expressed at a similar level in a wide range of cell types and tissues, irrespective of signaling pathways and cell differentiation – are used for normalization, such as cytochrome c or actin mRNA. To avoid erroneous normalization, more than one housekeeping gene should be selected, since even the expression of common housekeeping genes fluctuates in a signaling- and differentiation-dependent fashion. An example of RNA quantification by Northern blotting is given in Figure 34.7, which shows the main principles of the analysis. The type and amount of RNA to be loaded depend on the abundance of the particular RNA species.

Figure 34.7 Example of RNA quantification by Northern blot. Each lane contains 10 μg total RNA. The figure shows an autoradiogram of the filter after RNA transfer and hybridization. The hybridization is carried out with an RNA probe complementary to the RNA of interest and a second probe complementary to cytochrome c mRNA to allow the results to be normalized.
Assuming an abundance of 0.1%, loading 5–20 μg of total RNA is sufficient to allow quantification of the relevant RNA species. Within a total RNA preparation the fraction of any mRNA of interest is low, since the majority of the RNA consists of rRNA (80%) and tRNA (15%); an abundance of 0.1% for an mRNA of interest within a total RNA pool is therefore already very high. Consequently, for detection and quantification of low abundance mRNA species, that is, gene transcripts present at low copy numbers (50 copies per cell), enrichment of poly(A)+-RNA is required, and up to 10 μg of poly(A)+-RNA should be used. In comparison to other methods, Northern blots require a large quantity of RNA for detection, are quantitatively inaccurate for low abundance RNAs, and are time consuming, which has led to the development of several techniques that overcome these limitations.
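The comparison across lanes boils down to ratio arithmetic: each target signal is divided by the housekeeping signal of the same lane before lanes are compared. A minimal sketch, with invented densitometry values in arbitrary units:

```python
def normalized_signal(target, housekeeping):
    """Ratio of target-band to housekeeping-band intensity for one lane."""
    if housekeeping <= 0:
        raise ValueError("housekeeping signal must be positive")
    return target / housekeeping

# Hypothetical densitometry values for two lanes (arbitrary units).
lane_a = normalized_signal(target=4200, housekeeping=2100)  # 2.0
lane_b = normalized_signal(target=1500, housekeeping=3000)  # 0.5
fold_change = lane_a / lane_b
print(fold_change)  # 4.0 -> the RNA appears 4-fold more abundant in lane A
```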

Northern, Section 27.4.4

Dot- and Slot-Blot Analysis

Dot blots and slot blots are used to detect RNA in complex samples, similarly to Northern blots, but without prior separation of the nucleic acids by electrophoresis. Instead, the RNA mixture is denatured with formamide, formaldehyde, and heat (65 °C) and then spotted directly onto nitrocellulose or nylon membranes under vacuum using a dot blot or slot blot apparatus. Fixation of the RNA on the membrane and detection of a particular RNA species with specific nucleotide probes are done as described for Northern blotting. This method is quick and often used to compare the amount of a specific RNA in many samples in parallel, for example, to determine expression patterns of genes in different tissues. The technique has therefore been important for clinical diagnostics, such as monitoring expression of oncogenes, although real-time PCR has largely replaced it for this purpose owing to its greater speed and sensitivity. Dot and slot blots give no information about the size of the target RNA, since the initial RNA mixture is not fractionated by electrophoresis. To avoid errors due to cross-hybridization of the radiolabeled probe to non-target RNAs, appropriate negative and positive control samples should be analyzed in parallel. Moreover, quantification of the target RNA in the cellular RNA mixture requires parallel analysis of known amounts of the target RNA synthesized in vitro.

Dot- and Slot-blot, Section 27.4.5
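Quantification against known amounts of in vitro synthesized target RNA amounts to reading sample signals off a calibration line. A minimal sketch, with hypothetical calibration points and a hand-rolled least-squares fit (no external libraries assumed):

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical calibration: known amounts (pg) of in vitro transcribed RNA
# spotted alongside the samples, and their hybridization signals (a.u.).
amounts = [10.0, 50.0, 100.0, 200.0]
signals = [120.0, 600.0, 1200.0, 2400.0]
slope, intercept = fit_line(amounts, signals)

def amount_from_signal(signal):
    """Convert a sample's hybridization signal into an RNA amount (pg)."""
    return (signal - intercept) / slope

print(amount_from_signal(900.0))  # 75.0 pg in this toy data set
```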

34.1.6 Reverse Transcription Polymerase Chain Reaction (RT-PCR and RT-qPCR)

RT-PCR, Section 29.2.3

Quantitative PCR, Section 29.2.5

A powerful method for quantification of gene expression is transcription of total cellular RNA into cDNA by RT, coupled with amplification of the cDNA by PCR (RT-PCR) using gene-specific PCR primers. This technique is now widely used to monitor gene expression. In RT-PCR, total RNA is first converted into cDNA by RT, usually using random hexamers for priming cDNA synthesis instead of gene-specific primers. The cDNA pool is then used as template for amplification by PCR in the presence of gene-specific primers. RT-(q)PCR is the method of choice to measure even low gene expression accurately: its high sensitivity allows detection of one specific RNA molecule among up to 10⁸ unrelated RNA molecules. It has a wide range of clinical and non-clinical applications, including diagnosis of genetic diseases and cancer, analysis of gene expression patterns, and cloning of RNA. While RT-PCR is used for qualitative studies of gene expression or for cloning purposes, RT-qPCR (RT coupled with quantitative PCR) allows relative and absolute RNA quantification. Determination of gene expression levels by RT-PCR techniques is only reliable if the amount of DNA amplified by PCR is proportional to the amount of RNA in the original sample material. Given that random hexamer-primed reverse transcription into cDNA should not be rate limiting, it is crucial to analyze the amount of PCR-amplified DNA during the exponential amplification phase, before the plateau phase of the PCR reaction is reached. In both methods, RT-PCR and RT-qPCR, quantification of a specific RNA is not possible without appropriate internal controls, in particular if expression patterns in samples from different biological materials are compared. To avoid errors in quantification due to variation in RNA preparation or cDNA synthesis efficiency between different samples, the amount of PCR product obtained for the RNA of interest is normalized to the amount of PCR product obtained for a reference RNA.
Commonly, ribosomal RNA, actin mRNA, tubulin mRNA, or other transcripts from "housekeeping" genes are used for normalization. If accurate quantification is essential, additional assays should be performed to validate the results. Alternatively, an artificial RNA template is produced in an in vitro transcription assay. Such an artificial RNA contains the same primer binding sites as the natural RNA template but differs in length. This approach employs a cloned gene fragment encoding the RNA of interest, which is first mutated, either by a short deletion or insertion, and then shuttled into an in vitro transcription vector just downstream of the promoter for RNA polymerase SP6, T3, or T7. Incubation of the in vitro transcription reaction with the corresponding RNA polymerase (SP6, T3, or T7) produces the artificial RNA. After purification and determination of the concentration of the in vitro synthesized RNA, a defined amount is reverse transcribed into cDNA along with the natural RNA and subsequently co-amplified by PCR with its natural counterpart. RT-PCR is usually performed as a two-step reaction. In the first step, RNA is incubated in assay buffer with non-specific primers, usually random hexamers, and RT. In the second step, PCR is performed on an aliquot of the cDNA reaction in the presence of gene-specific primers. Quantification of RT-derived cDNA can be achieved either by end-point PCR (RT-PCR) or by real-time PCR (RT-qPCR). RT-qPCR is the method of choice when gene expression is analyzed on a global scale and/or in many samples. Nevertheless, semi-quantitative end-point PCR is still used for measurement of gene expression in a small number of samples or when RT-qPCR is not available. Quantification of gene expression by relative RT-PCR requires co-amplification of an internal control together with the gene of interest to normalize the samples. As mentioned above, either synthetic RNAs or mRNAs from housekeeping genes are suitable as internal controls.
Upon normalization, relative abundances of transcripts can be compared across multiple samples. However, quantification is only reliable when both the target and control amplifications are within the linear range of the PCR reaction, which is usually determined in pilot experiments with serial dilutions of target and control cDNA or by trying different cycle numbers. Detection of the end-point RT-PCR products is done either by analysis on agarose gels and staining with ethidium bromide or by performing the PCR reaction in the
presence of 32P-labeled nucleotides, followed by measurement of the radioactively labeled DNA by phosphorimaging or by scintillation counting. The values are usually presented as ratios of the target gene signal to the internal control signal. Real-time RT-PCR (RT-qPCR) has become the method of choice for quantification of gene expression, largely replacing conventional RT-PCR and Northern blotting. The accumulation of PCR products is recorded during the individual amplification cycles, not only at the end of the reaction, by continuous measurement of fluorescent dyes that intercalate into, or associate quantitatively with, the amplified DNA. Two detection chemistries are widely used to quantify the accumulation of PCR products. SYBR Green is a fluorescent dye that intercalates into any double-stranded DNA in a sequence-independent fashion. Alternatively, fluorescent reporter probes, such as TaqMan probes, are sequence-specific oligonucleotide probes labeled with a fluorescent reporter. During PCR, the probe hybridizes to its complementary sequence within the heterogeneous pool of PCR products, thereby allowing quantification of individual amplicons. Since fluorometric detection of the labeled DNA is very sensitive, both approaches allow precise quantification of RNA (cDNA) in the original samples. Usually 10–100 cDNA copies per sample can be detected using SYBR Green or TaqMan probes in combination with an appropriate thermocycler.
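The relative quantification underlying most RT-qPCR analyses can be illustrated with the widely used 2^(−ΔΔCt) calculation, which assumes close to 100% amplification efficiency (one doubling of product per cycle); the Ct values below are invented for illustration.

```python
def ddct_fold_change(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Fold change by the 2^-(ΔΔCt) method, assuming one doubling of
    product per PCR cycle. Ct = cycle at which fluorescence crosses the
    detection threshold; 'ref' is the internal control (e.g., a
    housekeeping transcript)."""
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    ddct = dct_treated - dct_control
    return 2 ** (-ddct)

# Hypothetical Ct values: target vs housekeeping reference in two samples.
fold = ddct_fold_change(ct_target_treated=22.0, ct_ref_treated=18.0,
                        ct_target_control=25.0, ct_ref_control=18.0)
print(fold)  # 8.0 -> 8-fold higher expression in the treated sample
```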

Hybridization Methods, Section 28.1.3

34.2 Analysis of RNA Synthesis In Vivo

All the previously described methods are suitable for measuring the steady-state level of an RNA, but they do not provide information about ongoing gene expression. Nuclear-run-on assays, however, allow the measurement of promoter activity and ongoing gene expression in living cells, thus facilitating studies on the regulation of gene expression at the transcriptional level. This method was originally established to demonstrate that regulation of the transcription initiation rate is the main rate-determining step in the synthesis of non-mature precursor RNAs in mammalian cells. Briefly, nuclear-run-on assays measure ongoing transcription via incorporation of [α-32P]-UTP into nascent RNA chains. Cells do not take up UTP efficiently from the culture medium. Therefore, cells are permeabilized under conditions that perforate the cell membrane but do not destroy the nuclear membrane. Upon uptake, [α-32P]-UTP is incorporated into initiated, nascent RNAs by the nuclear RNA polymerases I, II, and III. Since radiolabeling of the transcripts occurs during elongation, most of the transcriptome (steady-state RNAs) is not labeled. As a result, a nuclear-run-on assay measures the amount of ongoing transcription largely independently of RNA stability. This assay is important for addressing various scientific questions, such as the regulation of gene expression during cell differentiation, at distinct phases of the cell cycle, or by chemical compounds. Nuclear-run-on assays consist of the following steps:

- cell permeabilization/lysis,
- nuclear-run-on transcription (RNA labeling reaction),
- isolation of RNA,
- detection and quantification of the specific RNA.

Alternatively, a halogenated pyrimidine nucleoside, 5-fluoro-uridine (FU), bearing a fluoro substituent at position 5 on the uracil ring, is used for labeling of nascent RNAs in vivo. In contrast to [32P]-UTP, FU is cell permeable, and permeabilization of the cell membrane is not required. FU-labeled transcripts are detected either by immunofluorescence microscopy or by RNA-immunoprecipitation (RNA-IP) using antibodies specific to FU.

34.2.1 Nuclear-run-on Assay

The following summarizes the important steps of nuclear-run-on assays using mammalian cells in culture. Cells are chilled on ice, and the cytoplasmic membrane is permeabilized or lysed. This treatment causes all RNA polymerases to pause. Usually, cell lysis is performed in a hypotonic or isotonic buffer containing the non-ionic detergent NP-40 (NP-40 cell lysis buffer). The intact nuclei are pelleted by centrifugation, while components of the cytoplasm, as well as the cytoplasmic membrane, remain in the soluble supernatant. Depending on the cell type used, efficient cell lysis may require homogenization with a Dounce homogenizer. In addition, further purification of nuclei is achieved by centrifuging them through a sucrose cushion. Nuclei are then incubated for a short time (pulse labeled) at 30 or 37 °C in the presence of radiolabeled [α-32P]-UTP and non-labeled ribonucleoside triphosphates (NTPs). During this step, new transcripts are not initiated, but [α-32P]-UTP is incorporated into virtually all nascent transcripts that had been initiated at the time the cells were chilled and lysed. Finally, DNA and proteins are digested with RNase-free DNase and proteinase K, and RNA is extracted by guanidinium thiocyanate–phenol–chloroform extraction (TRIzol). Quantification of the relative amount of nascent transcripts in each sample involves a modified dot (or slot) blot technique (Section 34.1.5). In this reverse dot blot procedure, a non-labeled DNA probe containing the gene of interest is immobilized on a nitrocellulose filter or nylon membrane and hybridized to the purified 32P-labeled RNA synthesized during the nuclear-run-on transcription. The membrane-bound DNA probe should be in high molar excess to prevent its saturation during the hybridization reaction; otherwise the calculations underestimate the amount of RNA synthesized. The amount of radioactivity that hybridizes to the membrane is approximately proportional to the number of nascent transcripts. Since the number of nascent transcripts on a gene is thought to be proportional to the frequency of transcription initiation, the amount of hybridized 32P-labeled RNA reflects the activity of the corresponding gene promoter.

Isolation of intact nuclei, which preserve their RNA polymerase activity, is essential for nuclear-run-on assays. Thus, cell lysis conditions and nuclei purification have to be optimized for a given cell type. Isolated nuclei are either used immediately for nuclear-run-on transcription or are snap-frozen in liquid nitrogen and stored at −80 °C in buffer containing glycerol.

Typically, a nuclear-run-on transcription contains between 5 × 10⁶ and 5 × 10⁷ nuclei per assay. Incorporation of [α-32P]-UTP into nascent RNA drops significantly at low densities of nuclei, which decreases the detection of gene-specific transcripts by reverse dot blotting. Under optimal assay conditions up to 10–30% of the input radioactivity is incorporated into nascent transcripts during the nuclear-run-on transcription.

Isolation of Genomic DNA, Section 26.2
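The reverse dot blot quantification reduces to background correction and normalization. A minimal sketch with invented phosphorimager counts, assuming (hypothetically) that an empty-vector probe is spotted on the membrane to estimate background hybridization and that a reference-gene probe serves for normalization:

```python
def promoter_activity(gene_signal, vector_background, reference_signal):
    """Background-corrected run-on signal of a gene probe, expressed
    relative to a background-corrected reference probe on the same
    membrane. All inputs are raw hybridization counts."""
    corrected = gene_signal - vector_background
    reference = reference_signal - vector_background
    if reference <= 0:
        raise ValueError("reference signal does not exceed background")
    return corrected / reference

# Hypothetical phosphorimager counts for three spots on one membrane.
activity = promoter_activity(gene_signal=5400,
                             vector_background=400,
                             reference_signal=2900)
print(activity)  # 2.0 -> promoter twice as active as the reference gene
```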

34.2.2 Labeling of Nascent RNA with 5-Fluoro-uridine (FUrd)

Genomic loci with high transcriptional activity can be visualized by short-term labeling of nascent RNA with the non-radioactive uridine analogues 5-fluoro-uridine (FU) or 5-bromo-UTP (BrU) (Figure 34.8). Both analogues can substitute for UTP during transcription of nascent transcripts and are not incorporated into mature RNA. Labeled RNA is subsequently detected by immunochemical methods. In contrast to BrUTP, fluoro-uridine is efficiently taken up by intact cells, and no permeabilization of cells is necessary. Upon uptake, fluoro-uridine is quickly metabolized to fluoro-UTP, the NTP required for incorporation into RNA by the nuclear RNA polymerases I, II, and III. BrUTP is not cell permeable, so additional procedures are required to facilitate its uptake, such as permeabilization of the cytoplasmic membrane or purification of nuclei (as described above), microinjection, or use of liposome particles. Fluoro-uridine-based pulse labeling of nascent RNA is gentle and quick, since the entire cell remains intact. For in situ labeling, FU (2 mM) is added directly to the cell culture medium for 5–30 min. After labeling, cells are fixed, permeabilized, and prepared for indirect immunofluorescence microscopy. Detection of nascent FU-labeled RNA is accomplished by incubation with a primary antibody specific to fluoro-uridine or bromo-uridine, followed by incubation with an appropriate fluorescently labeled secondary antibody. Fluorescence signals are finally visualized under a microscope. If labeling is done for a very short time, visualization of nascent transcripts is restricted to genomic loci with the highest RNA polymerase activity. Under these conditions, usually only transcripts in nucleoli are detectable. Nucleoli are the nuclear bodies in which rDNA is transcribed into rRNA from arrays of more than 150 repeated gene copies by highly active RNA polymerase I (Figure 34.9).
Extension of the labeling time up to 30 min allows visualization of FU-labeled mRNAs as well, which becomes evident through the increasing accumulation of fluorescence signals in the nucleoplasm. Alternatively, antibodies specific to FU can be used to immunoprecipitate FU-labeled nascent transcripts. Immunoprecipitated RNA is extracted with guanidinium thiocyanate–phenol–chloroform, and the purified RNA is analyzed by either RT-qPCR (Section 34.1.6) or RNA-seq.

Figure 34.8 Structures of 5-bromo-UTP (BrU) and 5-fluoro-uridine (FU).

34.3 In Vitro Transcription in Cell-Free Extracts

Transcription initiation is controlled by a multitude of factors, including general transcription factors, gene-specific factors, and accessory co-regulators. The interplay between sequence-specific binding of transcription factors to their binding sites in the vicinity of the core promoter and the sequential assembly of general transcription factor complexes around the core promoter and transcription start site (TSS) is necessary for gene-specific recruitment of RNA polymerase I, II, or III and accurate transcription initiation. The level of transcription initiation is highly variable, depending on the level and activity of gene-specific transcription factors. Often, their amount and activity differ according to cell type, cell cycle, differentiation state of a cell, and intra- and extracellular signaling pathways. Moreover, in vivo, transcription initiation depends on the accessibility to transcription factors of the gene regulatory cis-acting sequence elements, which are embedded in chromatin. Thus, changes in chromatin structure introduced by chromatin remodelers and histone-modifying complexes are crucial for the regulation of transcription. Many questions related to the regulation of gene expression can be addressed by in vitro transcription of cloned gene promoters in cell-free extracts or reconstituted transcription systems. This experimental approach helps to identify protein components that activate or repress transcription, as well as the relevant cis-acting regulatory elements of the gene of interest.

Figure 34.9 In vivo labeling of nascent RNA by 5-fluoro-uridine. Human U2OS cells were cultured in the presence of 2 mM 5-fluoro-uridine (FU) for 10 min. Cells were then fixed and permeabilized. FU-labeled RNA was detected by immunofluorescence microscopy using a monoclonal antibody against FU/BrdU and a secondary Cy3-labeled anti-mouse antibody (red signals). The visualized signals correspond to nascent nucleolar RNA, since they co-localize with the nucleolar transcription factor UBF, visualized by rabbit anti-UBF and secondary FITC-labeled anti-rabbit antibodies (green signals). The DNA is counterstained with the dye Hoechst 33342 (blue signal).

34.3.1 Components of an In Vitro Transcription Assay

In vitro transcription needs an appropriate DNA template, which typically contains the core promoter, gene control elements located upstream of the core promoter, and a stretch of the transcribed region located downstream of the transcription start site (TSS). These cis-acting sequence elements are cloned into a plasmid; the resulting final construct is called a minigene reporter. Since the plasmid is amplified in and purified from Escherichia coli, it is nucleosome-free. Transcription of the minigene in vitro requires a protein extract that contains the minimal set of protein components essential for transcription, the ribonucleotides ATP, CTP, GTP, and UTP, radiolabeled [α-32P]-UTP, a defined set of cations (Mg²⁺, K⁺, Zn²⁺, and others, depending on the gene context), and buffer substances (Tris, HEPES). Provided that all components of the assay system are active, accurate initiation and elongation of transcription in vitro can be assessed. To decipher regulatory cis-elements in the gene of interest, deletions, insertions, or point mutations can be introduced into the cloned gene fragment by in vitro mutagenesis. In vitro transcription of mutated promoter constructs is performed in transcription-competent extracts, and the amount of reporter transcript is compared to the level of transcript synthesized from the non-mutated wild-type minigene. Similarly, up- and down-regulation of transcription by gene-specific transcription factors can be assessed. For this purpose, the transcription factor of interest is depleted from the protein extract, to determine whether depletion up- or down-regulates in vitro transcription. After the depleted extract is reconstituted with the removed protein or protein complex, transcription should be restored to the levels observed in the non-depleted extract, provided the depleted protein/protein complex was the only factor removed from the extract. Such a transcription assay also facilitates the identification of functional domains in transcription factors when transcriptional activity is compared in extracts reconstituted with the wild-type factor or with a mutant protein.

34.3.2 Generation of Transcription-Competent Cell Extracts and Protein Fractions

Extracts for in vitro transcription assays are commonly prepared from cultured cells. Transcription extracts fall into three categories, depending on the method used for cell lysis and fractionation: cytoplasmic extracts, nuclear extracts, and whole cell extracts. Usually, extracts prepared from isolated nuclei can be used directly for functional studies, including in vitro transcription and RNA processing; moreover, nuclear extracts are the starting material for the purification of proteins involved in these processes. Depending on the cell type used for extract preparation, individual steps have to be optimized to generate reproducibly active transcription extracts and to increase the yield of components necessary for transcription. Briefly, cells are incubated in hypotonic buffer, the swollen cells are lysed by Dounce homogenization, and the intact nuclei are collected by centrifugation. The nuclei are then resuspended in high salt buffer to extract soluble proteins. All buffers are free of detergents, to prevent inactivation of components of the transcription machinery. All components of the nuclei that are resistant to high salt extraction (e.g., chromatin) are pelleted by centrifugation, while nuclear transcription factors and RNA polymerases remain in the supernatant (the nuclear extract). After dialysis of the nuclear extract against a moderate salt solution, it is ready to use for in vitro transcription. Alternatively, the nuclear extract can serve as starting material for biochemical enrichment of factors by ion exchange, gel filtration, or affinity chromatography.

34.3.3 Template DNA and Detection of In Vitro Transcripts

Accurate transcription in vitro depends on active cell extracts, or a well-defined set of proteins fractionated by biochemical methods, and a DNA template that contains all essential promoter elements of the corresponding gene. If an important element is missing, synthesis of in vitro transcripts will be impaired, regardless of whether active extracts or protein fractions were used in the assay system. Two types of construct are widely used: (i) DNA templates that contain a G-less cassette (Figure 34.10) and (ii) DNA templates that are linearized by restriction enzymes downstream of the transcribed region, producing in vitro run-off transcripts (Figure 34.11).

The G-less Cassette

The G-less cassette is often used to determine the promoter strength of a gene of interest in vitro. To generate a G-less cassette DNA template, the promoter, together with upstream regulatory elements, or the core promoter alone, is inserted into a plasmid containing the G-less cassette. Usually, a G-less cassette consists of a synthetic DNA fragment of 350 bp lacking guanine residues in the sense strand. The promoter is cloned immediately upstream of the G-less cassette. The G-less cassette is transcribed in the in vitro system using the constituents described above (Section 34.3.1), except that GTP is replaced by the GTP analogue 3′-O-methyl-GTP. Upon transcription initiation, RNA polymerases transcribe the G-less cassette up to the first guanine residue. At this first guanine downstream of the G-less cassette, 3′-O-methyl-GTP is incorporated into the nascent transcript, which prevents further elongation of the nascent RNA chain and terminates the transcript. The 350 nt labeled transcript is released from the template DNA (Figure 34.10) and subjected to polyacrylamide gel electrophoresis. The amount of label incorporated into the transcript is quantified by autoradiography, phosphorimaging, or other suitable means, and indicates the relative strength of the promoter of the gene of interest. Transcription from G-less cassette templates can be performed on circular plasmids. In many transcription systems circular plasmids are transcribed more efficiently than linearized templates, which are used for run-off transcription. In addition, the presence of 3′-O-methyl-GTP suppresses promoter-independent random and non-specific transcription throughout the plasmid.

34

Analysis of Promoter Strength and Nascent RNA Synthesis

909

Figure 34.10 Use of G-less cassettes for in vitro transcription. The promoter-containing plasmid is incubated with extract or partially purified transcription factors and RNA polymerase, ribonucleotides, and reaction buffer. GTP is replaced by 3′-O-methyl-GTP in the reaction, leading to chain termination once the analogue is incorporated into the RNA. The resulting transcripts are subsequently separated on a gel and detected.

Run-off Transcripts
For use in run-off transcription assays, the promoter of the gene of interest and regions downstream of the transcription start site (TSS) are cloned into a plasmid. Termination of transcription is achieved by linearization of the plasmid at a defined restriction enzyme cleavage site. Upon initiation, transcripts are synthesized until the RNA polymerase arrives at the 3′ end and "runs off" the linearized template. Since the distance between the TSS and the restriction enzyme cleavage site is defined, a ³²P-labeled run-off transcript of the predicted size indicates specific initiation of transcription. The size of the RNA is verified by gel electrophoresis. As before, the intensity of the signal indicates the relative amount of RNA produced and reflects the strength of the promoter (Figures 34.11 and 34.12).
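The predicted run-off size is simply the distance from the TSS to the restriction cut. A small sketch (function name and coordinates are invented for illustration):

```python
def runoff_length(tss, cut_pos):
    """Predicted run-off transcript size in nucleotides.

    tss:     position of the transcription start site (+1) on the plasmid
    cut_pos: position of the restriction cut downstream of the TSS
    Both are coordinates on the same (sense) strand.
    """
    if cut_pos <= tss:
        raise ValueError("restriction site must lie downstream of the TSS")
    return cut_pos - tss

# a unique cut 300 bp downstream of the TSS predicts a 300 nt transcript,
# within the usual 150-500 bp range mentioned in Figure 34.11
print(runoff_length(1200, 1500))  # -> 300
```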

910

Part V: Functional and Systems Analytics

Figure 34.11 Principle of the in vitro run-off transcription reaction. The promoter-containing plasmid is linearized with a restriction enzyme whose unique recognition site is located at an appropriate distance (usually 150–500 bp) downstream of the transcription start site. The linearized template is incubated with extract or partially purified transcription factors and RNA polymerase, ribonucleotides, and reaction buffer, and the resulting run-off transcripts are analyzed as in Figure 34.10. Tx, transcription.

Figure 34.12 Example of an in vitro run-off transcription analysis. A linearized template containing the rDNA promoter was incubated with increasing amounts (lanes 1–6) of essential transcription factors and RNA polymerase I.


Additional Techniques to Analyze In Vitro Transcripts
Sometimes it is not feasible to use G-less cassettes as DNA templates for in vitro transcription. In addition, direct detection of ³²P-labeled run-off transcripts on polyacrylamide gels is only possible if strong promoters are tested, for example, the rDNA promoter and promoters of viral genes. If a promoter is weak, the level of specific transcripts may be no higher than that of promoter-independent background transcripts. In such cases, in vitro transcripts should be analyzed by the more laborious techniques described above (Section 34.1) to validate the RNA qualitatively and quantitatively: nuclease S1 mapping, ribonuclease protection, and primer extension assays.

34.4 In Vivo Analysis of Promoter Activity in Mammalian Cells

Many questions relating to promoter activity cannot be definitively answered by in vitro transcription approaches (e.g., changes of promoter activity in response to growth factors and signaling molecules) but must be assessed in vivo in their proper cellular context. In addition, the significance of gene-regulatory cis-acting sequence elements identified in vitro has to be validated in vivo. Analysis of promoter activity in mammalian cells consists of three main steps: (i) cloning of the regulatory elements of a gene of interest into an appropriate transfer or reporter gene plasmid, (ii) introduction of the cloned plasmid into mammalian cells, and (iii) assays aimed at quantifying the activity of the cloned promoter.

34.4.1 Vectors for Analysis of Gene-Regulatory cis-Elements

Identification of promoter elements and promoter-proximal regions that up- or down-regulate the corresponding gene commonly requires cloning of the promoter elements into an appropriate plasmid vector, which is introduced into mammalian cells, followed by measurement of promoter strength. Usually, the vectors employed for this purpose are reporter gene vectors. The vector backbone of such a plasmid has several functional parts (Figure 34.13):

     

– bacterial origin of replication (for propagation in E. coli),
– antibiotic resistance gene (selection marker),
– multiple cloning site (MCS),
– reporter gene sequence cloned into the MCS,
– polyadenylation signal sequence downstream of the cloned reporter sequence.

Optional elements are:
– a second antibiotic resistance gene for selection in mammalian cells,
– a weak minimal promoter immediately upstream of the reporter gene,
– a eukaryotic origin of replication (for propagation in mammalian cells).

Reporter genes are used to assess the activity of promoters and additional gene-regulatory DNA elements of a gene of interest (target) in cultured mammalian cells. For this purpose, elements of the target gene promoter are cloned immediately upstream of the reporter gene coding sequence. After introduction into cells, the reporter gene is expressed under the control of the target gene promoter. The amount of reporter gene product is measured and the results are normalized to the amount or activity of a reporter gene driven by a reference promoter (Figure 34.13). Commonly used reporter genes are luciferase (Luc), β-galactosidase (β-Gal), green fluorescent protein (GFP), and chloramphenicol acetyltransferase (CAT). Detection methods used to measure the expressed reporter gene product involve luminescence, fluorescence, thin-layer chromatography (TLC), or RNA analysis (Section 34.4.3).

Mutational Analysis of Promoters
Promoter malfunction has been associated with hundreds of diseases, and is often caused by mutation of a promoter sequence or of upstream regulatory elements. To decipher the underlying molecular mechanisms it is important to test individual promoter elements of a candidate gene, either alone or in combination, in reporter gene

Isolation of Plasmid DNA from Bacteria, Figure 26.6, Section 26.3.1


Figure 34.13 Principle of mapping the important cis-acting sequences with the aid of a reporter gene vector. (a) Promoter structure of the hypothetical gene. Enhancer, proximal promoter, and the transcription start site are shown. (b) After restriction analysis or sequencing, the entire enhancer region or fragments thereof are cloned into the multiple cloning site of a eukaryotic expression vector containing a basal promoter and reporter; after transfection, the reporter gene activity of the cell lysates is measured (++++ = high; + = low). (c) To analyze the minimal proximal promoter, the same basic procedure is followed as described for (b). Here, however, a promoter-less plasmid with the reporter gene is used. Without a functioning promoter, no reporter protein is made (−); if important regulatory sequences are missing, the rate of reporter gene synthesis is low (++). (d) From the results of the experiments described in (b) and (c), the functionally important enhancer regions (dark gray) and promoter regions (red) can be elucidated.

assays for gain-of-function or loss-of-function effects. Based on mutational analysis, functionally important nucleotides of sequence elements can be identified (Figure 34.13). Commonly, point mutations are introduced by site-directed mutagenesis, and sequence stretches are deleted or inserted by PCR and/or restriction enzyme digestion followed by re-cloning of the modified DNA fragment into the reporter gene plasmid.
A gene "enhancer" has no intrinsic promoter activity but activates (enhances) expression of the corresponding gene in cis, often at a considerable distance from the promoter and in an orientation-independent manner. To test enhancer activities, putative enhancer elements are cloned into a reporter gene plasmid that contains a eukaryotic reference promoter; viral promoters (SV40, RSV, CMV) are usually used for this purpose. Active enhancer elements stimulate reference promoter activity, thereby increasing expression of the reporter gene product.
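The scoring in Figure 34.13 amounts to comparing each construct's reporter signal with the basal (promoter-less or minimal-promoter) signal. A minimal sketch with invented luminometer readings and an arbitrary fold-change cutoff scheme:

```python
# Hypothetical luminometer readings (arbitrary units); all values invented.
basal = 120.0  # promoter-less reporter plasmid
constructs = {
    "enhancer_full":  9600.0,  # complete enhancer region
    "enhancer_del_A": 9100.0,  # deletion of a dispensable fragment
    "enhancer_del_B":  460.0,  # deletion removing an important element
}

def score(fold):
    # crude mapping onto the ++++/+ scale used in Figure 34.13
    if fold >= 50:
        return "++++"
    if fold >= 3:
        return "++"
    return "+"

for name, signal in constructs.items():
    fold = signal / basal
    print(f"{name}: {fold:.0f}-fold over basal ({score(fold)})")
```

Comparing the scores of the deletion constructs with the full-length construct points to the fragment carrying the functionally important element, as in panels (b)–(d) of the figure.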

34.4.2 Transfer of DNA into Mammalian Cells

Foreign DNA can be introduced into mammalian cells in two ways: either by transfection or with the help of viruses as vehicles (viral transduction). For each category, several techniques have been established. Transfection of mammalian cells is based on a transient increase in the permeability of the cytoplasmic membrane, which allows uptake of the foreign nucleic acids or plasmids by the cell. Non-viral introduction of DNA is achieved by chemical-based transfection, non-chemical-based techniques, or lipofection. Notably, the efficiency of transfection is highly variable between different cell types, and the proper method is best determined empirically on a case-by-case basis.


Transient and Stable Expression
Upon uptake, the vast majority of transfected plasmids are not integrated into chromosomal DNA, which results in transient expression of the cloned genes (transient transfection). The plasmids are diluted in each round of cell division or degraded. However, transfection of plasmids containing viral origins of replication, such as those of Epstein–Barr virus (EBV) or SV40, allows episomal amplification of the plasmids in appropriate daughter cells: the EBV origin of replication requires cells expressing the EBV-encoded nuclear antigen 1 (EBNA1), and the SV40 origin of replication requires cells expressing the SV40 large T-antigen. Routinely, HEK293E or HEK293T cells are used for this purpose. Episomal amplification greatly reduces dilution of the plasmid during cell division. In very few cells (approximately 1 out of 10⁴) the foreign plasmid will have randomly integrated into the chromosomal DNA of the host cell (stable transfection). To accomplish selection and enrichment of the few stably transfected cells, the plasmid backbone of eukaryotic expression plasmids usually encodes a drug resistance gene. Since this drug resistance gene is co-integrated into the chromosomal DNA, it is constitutively expressed. If cells are cultured in the presence of the drug, only the few cells that express the resistance gene will survive, whereas the other cells will die. Common drugs used for selection of stably transfected cells are geneticin (G418), puromycin, and zeocin.

Chemical-Based Transfection
Chemical-based transfection is widely used for the introduction of nucleic acids into a broad range of mammalian cell types. One of the most popular techniques for transfection of cells growing in a monolayer is based on calcium phosphate. The target DNA is mixed with a solution of HEPES-buffered saline (HBS) containing phosphate and calcium chloride.
When all three components are combined, a precipitate of positively charged calcium ions, negatively charged phosphate ions, and the DNA is formed. Upon addition to cells in culture, cells take up the DNA–calcium phosphate crystals by a process that is not completely understood. Depending on the cell type, transfection efficiencies of up to 90% or more have been observed. Alternatively, cationic polymers are used such as DEAE-dextran or polyethyleneimine (PEI). The cationic polymers form a complex with the negatively charged DNA, which cells take up by endocytosis.

Non-chemical Methods
Electroporation is a popular and efficient technique to transfect mammalian cells growing in suspension or adherent cells that are resistant to transfection by chemical methods. In this technique, a suspension of cells and the DNA to be transfected is exposed to short pulses of intense electricity using a special electroporation device. Electric pulses of several hundred volts are applied that transiently increase the permeability of the cell membrane, allowing introduction of the DNA into the cells. Alternatively, expression plasmids can be mechanically introduced into cells or nuclei by microinjection with a glass micropipette under a microscope. This technique is only convenient for analyses of a limited number of cells, although automated systems are available to improve the handling.

Lipofection
A broad range of mammalian cells is efficiently transfected by means of liposomes. Lipofection uses cationic liposomes, often in combination with neutral co-lipids, as carriers for the DNA. Positively charged liposomes and negatively charged DNA form complexes, with the positive charge on the surface and the packaged DNA inside. These DNA–liposome complexes fuse with the cell membrane, releasing the packaged DNA into the cell.

Viral Transduction
DNA can also be introduced into mammalian cells using viruses as carriers (viral transduction). Delivery of a gene of interest by a virus requires cloning of the corresponding gene into a viral vector. Transduction efficiency is extremely high compared to the transfection methods described above, often close to 100%. In addition, transduced genes are often integrated into the chromosomal DNA of the infected cell, facilitating stable gene expression. Viral vectors have been engineered for use in basic research and gene therapy. Widely used are recombinant retroviruses, lentiviruses (a subclass of retroviruses), adenoviruses, and adeno-associated viruses. The following key features are common to all viral vectors: (i) low toxicity, which reduces undesired side effects on the physiology of the infected cell; (ii) high stability of the viral genome after integration into the genome of the infected cell, preventing frequent gene rearrangements; (iii) cell-type specificity, to ensure infection of the desired range of cells; and (iv) marker genes, for identification of cells infected by


the virus (e.g., antibiotic resistance genes). In particular, viral transduction is applied for delivering recombinant genes into cells in which transfection with the aforementioned techniques is inefficient, such as mouse embryonic fibroblasts (MEFs). Retroviruses are widely used because the recombinant retroviral vector integrates into the mammalian host genome at high frequency. Upon integration, the recombinant gene is stably propagated to the daughter cells of the host. The gene of interest is first cloned into a retroviral vector. Such vectors are genetically modified but retain all genetic elements required for replication of the recombinant retroviral vector in the mammalian host cell and for integration into the mammalian genome. Safety is of high priority to minimize the risk of handling the viral vectors. Usually, the parts of the viral genome necessary for production of infectious viral particles are deleted from the viral vector. Therefore, production of infectious viral particles requires that the recombinant viral vector be transfected into a packaging cell line that encodes the missing proteins for assembly of infectious recombinant viruses, such as envelope proteins. During this process, a recombinant viral stock is generated that can be used to infect appropriate host cells and introduce the gene of interest.

34.4.3 Analysis of Reporter Gene Expression

Reporter gene assays have become an indispensable tool for studying gene expression. They are widely used in basic, biomedical, and pharmaceutical research. Reporter gene proteins are expressed under the control of the promoter and/or promoter response elements of interest. Following expression, the cells are assayed for the reporter protein, either by direct determination of its amount or by measurement of its enzymatic activity. Ideal reporter gene proteins are easily detectable and their activity correlates directly with their expression level. This ensures that expression of the reporter gene protein reflects the strength of the promoter under investigation. Some reporter gene systems can be used to monitor transcriptional activity of the promoter of interest in vivo, for example, when a promoter is fused to the green fluorescent protein (GFP) gene. The basic principles of widely used reporter genes are described below.

Chloramphenicol Acetyltransferase (CAT) Assay

In the CAT reporter system, the reporter gene encodes the enzyme chloramphenicol acetyltransferase (CAT). CAT is a bacterial enzyme that detoxifies the antibiotic chloramphenicol by acetylation and confers chloramphenicol resistance to bacteria. Acetyl-CoA-dependent acetylation of chloramphenicol (Cm) by CAT produces 1,3-diacetylated chloramphenicol and the mono-acetylated 3-acetyl and 1-acetyl intermediates, all of which are biologically inactive. The activity of CAT in cell lysates is monitored by acetylation of radioactively labeled [¹⁴C]-chloramphenicol. Non-acetylated and acetylated forms of [¹⁴C]-chloramphenicol are separated by thin-layer chromatography (TLC). The ratio between the acetylated derivatives, which are more mobile, and non-acetylated chloramphenicol reflects the CAT activity and the level of gene expression (Figure 34.14).
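Quantification of the TLC plate reduces to the fraction of substrate converted. A sketch with invented spot intensities (e.g., phosphorimager counts of the scraped or scanned spots):

```python
def cat_conversion(acetylated, unreacted):
    """Fraction of [14C]-chloramphenicol converted to acetylated forms,
    computed from TLC spot intensities. The assay should be kept in its
    linear range, i.e., well below complete conversion of the substrate.
    """
    return acetylated / (acetylated + unreacted)

# invented counts: summed acetylated spots vs. the unreacted Cm spot
print(f"{cat_conversion(3500.0, 6500.0):.0%} converted")  # -> 35% converted
```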

Figure 34.14 CAT assay: schematic diagram of the thin-layer chromatographic separation of the reaction products resulting from a CAT assay.


Luciferase Assay
The luciferase gene from the firefly Photinus pyralis is the most popular reporter gene. Luciferase is an enzyme that oxidizes D-luciferin in the presence of ATP, oxygen, and Mg²⁺, yielding CO₂ and a light-emitting product (oxyluciferin) that can be quantified by measuring the amount of emitted light. Photon emission is detected using a light-sensitive luminometer. Since light excitation is not required for luciferase bioluminescence, background fluorescence is extremely low. Therefore, luciferase-based reporter assays allow accurate quantification of subtle changes in gene expression, owing to their high sensitivity and the broad linear range of the enzymatic reaction. The activity of less than 0.1 pg of luciferase can still be accurately measured in a standard reaction.


Non Radioactive Systems: Bioluminescence, Section 28.4.3

β-Galactosidase Assays
β-Galactosidase (β-Gal) is an enzyme that catalyzes the hydrolysis of β-galactosides into monosaccharides and, in addition, hydrolyzes several non-physiological substrates. Such non-physiological substrates are used to measure β-galactosidase activity in cell extracts. Depending on the substrate, β-galactosidase activity is determined by colorimetric or fluorometric assays, or by chemiluminescence. In the basic colorimetric assay, o-nitrophenyl-β-D-galactopyranoside (ONPG) is used as an artificial chromogenic substrate. ONPG is colorless, while the product of the reaction, o-nitrophenol (ONP), is yellow (λmax = 420 nm). Therefore, β-galactosidase activity is measured as the rate of appearance of the yellow color using a spectrophotometer. Although ONPG is the most commonly used substrate, the sensitivity of the colorimetric assay is low. Fluorescence-based assays, which utilize substrates that fluoresce upon hydrolysis, provide increased sensitivity. The substrate 4-methylumbelliferyl-β-D-galactopyranoside (4-MUG) does not fluoresce until cleaved by β-galactosidase, generating the fluorophore 4-methylumbelliferone (4-MU). Production of the fluorophore is monitored at an excitation/emission wavelength of 365/460 nm. Most sensitive are the chemiluminescence assays based on 1,2-dioxetane substrates (i.e., 1,2-dioxetane-galactopyranoside derivatives). β-Galactosidase cleaves the galactoside, releasing an unstable 1,2-dioxetane that decomposes and emits light with a maximum intensity at a wavelength of 475 nm. Light production is measured with a luminometer. The chemiluminescence-based assay allows detection of less than 1 pg of β-galactosidase in the reaction. Although β-galactosidase expression can be used as a standard reporter for monitoring the strength of a promoter or enhancer, it is now predominantly used as an internal control in transfection experiments.
When used in this manner, cells are usually transfected with a control plasmid expressing β-galactosidase under the control of a viral promoter, such as the SV40 promoter, together with a second plasmid containing another reporter gene (e.g., luciferase or chloramphenicol acetyltransferase) under the control of the promoter or enhancer of interest.

Green Fluorescent Protein
Expression of green fluorescent protein (GFP) is a widespread tool used to visualize spatial and temporal gene expression patterns in vivo. Details regarding green fluorescent proteins are described in Chapter 7. Usually, GFP reporters are not employed to quantify gene expression levels.
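The internal-control normalization described above reduces to dividing the test-reporter signal by the β-galactosidase signal in each well. A minimal sketch with invented readings:

```python
# Invented raw readings from three wells transfected in parallel with a
# test promoter-luciferase plasmid plus an SV40-beta-galactosidase control.
wells = [
    {"luc": 52000.0, "bgal": 1.10},
    {"luc": 48500.0, "bgal": 0.95},
    {"luc": 61000.0, "bgal": 1.30},
]

# Dividing by the beta-gal signal corrects for well-to-well differences in
# transfection efficiency and lysate recovery before averaging replicates.
ratios = [w["luc"] / w["bgal"] for w in wells]
mean = sum(ratios) / len(ratios)
print(f"normalized luciferase activity: {mean:.0f} (arbitrary units)")
```

Only these normalized values, not the raw luciferase counts, are comparable between constructs and experiments.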

Analysis of the Transcripts from Transfected Cells
In addition to, or instead of, reporter gene protein analysis, the amount of mRNA transcribed specifically from the test promoter can be quantified. Many of the methods presented above in Section 34.1 are suitable for this purpose, in particular quantitative RT-PCR, but nuclease S1 assays, ribonuclease protection assays, and Northern blots are also useful. Reporter mRNAs differ from native mRNAs in that they are usually not subject to post-transcriptional regulation and are usually considerably more stable than native mRNAs. This has the desirable effect of amplifying signals, but in doing so it also reveals only part of the full picture of the expression and regulation of an mRNA. For a native mRNA subject to regulation by miRNAs, for example, measuring the reporter will completely ignore the effects of this regulation. This is useful in that it allows the researcher to separate promoter activity from mRNA regulation, but if the question relates to regulation of mRNA levels it provides only a partial understanding. Current research clearly indicates that post-transcriptional regulation is more frequent than originally thought, though it is also limited by the fact that it rarely affects levels by more than about two- to four-fold and never acts as a clear on–off switch, as a promoter can.

Green Fluorescent Protein (GFP) as a Unique Fluorescent Probe, Section 7.3.4


Further Reading
Brown, T., Mackey, K., and Du, T. (2004) Analysis of RNA by northern and slot blot hybridization. Curr. Protoc. Mol. Biol., unit 4.9, doi: 10.1002/0471142727.mb0409s67.
Carey, M.F., Peterson, C.L., and Smale, S.T. (2013) The RNase protection assay. Cold Spring Harbor Protocols, issue 3.
Chen, J.L. and Tjian, R. (1996) Reconstitution of TATA-binding protein-associated factor/TATA-binding protein complexes for in vitro transcription. Methods Enzymol., 273, 208–217.
Cornetta, K., Pollok, K.E., and Miller, A.D. (2008) Retroviral vector production by transient transfection. Cold Spring Harbor Protocols, issue 4.
Cornetta, K., Pollok, K.E., and Miller, A.D. (2008) Generation of stable vector-producing cells for retroviral vectors. Cold Spring Harbor Protocols, issue 4.
Eyler, E. (2013) Explanatory chapter: nuclease protection assays. Methods Enzymol., 530, 89–97.
Fu, Y. and Xiao, W. (2006) Study of transcriptional regulation using a reporter gene assay. Methods Mol. Biol., 313, 257–264.
Johnson, G., Nour, A.A., Nolan, T., Huggett, J., and Bustin, S. (2014) Minimum information necessary for quantitative real-time PCR experiments. Methods Mol. Biol., 1160, 5–17.
Ko, H.Y., Hwang, D.W., Lee, D.S., and Kim, S. (2009) A reporter gene imaging system for monitoring microRNA biogenesis. Nat. Protoc., 4, 1663–1669.
Li, L. and Chaikof, E.L. (2002) Quantitative nuclear run-off transcription assay. Biotechniques, 33, 1016–1017.
McPheeters, D.S. and Wise, J.A. (2013) Measurement of in vivo RNA synthesis rates. Methods Enzymol., 530, 117–135.
Nolan, T., Hands, R.E., and Bustin, S.A. (2006) Quantification of mRNA by real-time RT-PCR. Nat. Protoc., 1, 1559–1582.
Percipalle, P. and Louvet, E. (2012) In vivo run-on assays to monitor nascent precursor RNA transcripts. Methods Mol. Biol., 809, 519–533.
Romero-Lopez, C., Barroso-del Jesus, A., Menendez, P., and Berzal-Herranz, A. (2012) Analysis of mRNA abundance and stability by ribonuclease protection assay. Methods Mol. Biol., 809, 491–503.
Sambrook, J. and Russell, D.W. (2006) Mapping RNA with nuclease S1. Cold Spring Harbor Protocols, issue 1.
Sambrook, J. and Russell, D.W. (2006) Calcium-phosphate-mediated transfection of eukaryotic cells with plasmid DNAs. Cold Spring Harbor Protocols, issue 1.
Sambrook, J. and Russell, D.W. (2006) DNA transfection mediated by lipofection. Cold Spring Harbor Protocols, issue 1.
Sambrook, J. and Russell, D.W. (2006) DNA transfection by electroporation. Cold Spring Harbor Protocols, issue 1.
Schmittgen, T.D. and Livak, K.J. (2008) Analyzing real-time PCR data by the comparative C(T) method. Nat. Protoc., 3, 1101–1108.
Schnapp, A. and Grummt, I. (1996) Purification, assay, and properties of RNA polymerase I and class I-specific transcription in mouse. Methods Enzymol., 273, 233–248.
Smale, S.T. (2010) Luciferase assay. Cold Spring Harbor Protocols, issue 5.
Smale, S.T. (2009) Nuclear run-on assay. Cold Spring Harbor Protocols, issue 11.
Southern, M.M., Brown, P.E., and Hall, A. (2006) Luciferases as reporter genes. Methods Mol. Biol., 323, 293–305.
Spector, D.L. and Goldman, R.D. (2006) Constructing and expressing GFP fusion proteins. Cold Spring Harbor Protocols, issue 1.
Tuschl, T. (2006) Transfection of mammalian cells with siRNA duplexes. Cold Spring Harbor Protocols, issue 1.
Venkatesh, L.K., Fasina, O., and Pintel, D.J. (2012) RNase mapping and quantification of RNA isoforms. Methods Mol. Biol., 883, 121–129.

Fluorescent In Situ Hybridization in Molecular Cytogenetics

35

Michelle Neßling and Karsten Richter Service Unit Electron Microscopy, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany

Molecular cytogenetics aims to characterize the genomic state of clonally related cell populations, for example, to determine aberrations that lead to tumorigenesis, to disclose gene defects of unborn children by prenatal diagnosis, or to reveal the degree of genomic relationship among species. Like classic karyotyping, fluorescent in situ hybridization is able to reveal numeric (ploidy, polysomy) and structural (translocation) chromosomal aberrations. However, the potential of the method extends further, to high-throughput determination of loss or gain of chromosomal regions. The resolution is also significantly better, though diagnosis of point mutations requires other techniques. To reveal aberrations in the genome of a cell, classic cytogenetics resorts to the structural analysis of banding patterns from karyograms (karyotyping). Serious drawbacks of this approach are: (i) only dividing cells are addressable by karyotyping; (ii) identification of aberrations from the ideogram requires excellent expertise; (iii) resolution is poor: a single band represents more than 1 Mbp of a genome, while more than one band is required to identify a translocated piece of chromatin. Molecular cytogenetics solves these problems: as pioneering work by John et al. and by Pardue and Gall in 1969 demonstrated, genomic sites of interest can be revealed by their hybridization with a labeled marker DNA (probe). Here, two approaches used to diagnose genomic aberrations by fluorescent hybridization will be discussed, namely, in situ hybridization (ISH) and comparative genomic hybridization (CGH). ISH uses fluorescence-labeled oligonucleotides as reporters for their target sequence within a sample, for example, to show gene loci within interphase nuclei or chromosome regions in a metaphase spread. In contrast, CGH reverses this principle: the genome to be investigated is prepared, fluorescently labeled, and hybridized to an immobilized target of known spatial composition.

35.1 Methods of Fluorescent DNA Hybridization

35.1.1 Labeling Strategy

Fluorescence as a signal (FISH: fluorescent ISH) is now established over the older strategies that used radioactivity or enzymatic color reactions. The fluorescent dark-field signal is very sensitive, it is quantitative, and a large gamut of fluorochromes is available to label different targets in parallel with different colors. Radioactive labeling, in contrast, demands exhaustive exposure times and yet delivers inferior spatial resolution. Furthermore, safety issues demand the maintenance of an expensive hot laboratory. DNA probes linked to enzymes, instead of to fluorochromes, had been employed to reveal hybridization targets by a downstream enzymatic reaction that produces colored precipitates (e.g., using horseradish peroxidase to produce brown diaminobenzidine). Such color precipitates are observed in

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Physical and Genetic Mapping of Genomes, Chapter 36

Karyogram – Ordered representation of all chromosomes from a single cell nucleus, prepared as a so-called metaphase spread and stained according to Giemsa, which results in a chromosome-specific banding pattern. Both numeric and structural aberrations relative to the standard banding pattern of the organism (ideogram) are discernible to the level of a few bands.


Detection Systems, Section 28.4

bright field, showing the histology of the sample in the usual counterstain. However, colored precipitates from enzymatic reactions are far less sensitive than fluorescence, and quantitative evaluation is only approximate. Two types of fluorescent labeling are commonly distinguished, direct and indirect labeling, according to the strategy used to couple the reporter (fluorochrome) to the probe. Direct labeling, with fluorochromes covalently bonded to the probe, is particularly beneficial for multicolor applications that reveal multiple targets in parallel. Instead of a fluorochrome, a hapten may be bonded covalently to the probe; a second reaction is then required to reveal the hapten with a fluorescent reporter. Two commonly used systems are biotin/fluorescent streptavidin and digoxigenin/fluorescent anti-digoxigenin. An important advantage of indirect labeling is the potential to amplify signal strength via cascades of secondary marker reactions. Technical parameters that influence the signal strength are the length of the oligonucleotide probe, the density of reporter nucleotides per probe, the efficiency of hybridization along the target sequence, and the suppression of background from nonspecifically bound reporter.

35.1.2 DNA Probes

Isolation of Genomic DNA, Section 26.2

The haploid genome of human beings weighs about 3 pg at a cumulated length of 3 × 10⁹ bp.

Technical demands for hybridization increase on going from repetitive to singular probes. Repetitive probes target clusters of repetitive genomic sequences (e.g., centromeric satellite DNA). Sequences addressed by these probes can be very short (below 1 Mb) and still yield strong signals; false positive background is not an issue. In contrast, singular probes, which are designed to label extended genomic regions, provoke a strong false positive background, which needs to be suppressed to obtain meaningful signals. Since unique probes need to cover quite long stretches of target sequence to produce measurable signals, they typically include interspersed repetitive sequences (IRSs), which occur throughout the entire genome, such as short and long interspersed nuclear elements. Section 35.1.4 gives details of a special approach, called CISS, that is used to suppress genomic IRS hybridization with singular probes. This kind of background suppression is particularly important for the application of painting probes used to study the spatial extension of particular chromosome bands or arms, or even whole chromosomes, throughout a cell's genome. Many probes used for medical applications are commercially available. Furthermore, probes may be tailored by applying dedicated primer-extension strategies to genomic DNA. Modern synthetic primer design profits from the rapid development of online databases as indispensable tools for the search for adequate sequences (NCBI, Ensembl, and UCSC). In the past, genomic libraries were maintained to provide specified DNA fragments (ImaGenes, EMBL, Sanger Centre, NIH), amplified in various vector systems based on phages, cosmids, BACs (bacterial artificial chromosomes), and PACs (P1 bacteriophage artificial chromosomes). Painting probes for whole chromosomes and chromosome arms and bands, for example, were collected from DNA libraries.
Such libraries may be initiated from flow-sorted chromosomes or even from needle-scratches of chromosomes on conventional metaphase spreads, which then require uniform amplification (e.g., using DOP-PCR). CGH probes comprise the entire DNA of the cell population under study. Even small tumor biopsies yield enough DNA for an experiment. DNA from single cells and small cell populations, however, needs to be amplified. Since CGH measures the balance between two DNA pools (e.g., tumor versus wild-type), amplification strategies must have equal efficiency over the whole genome.

35.1.3 Labeling of DNA Probes

Methods of Labeling, Section 28.3

Hybridization probes acquire their reporters through the incorporation of modified nucleotides during synthesis, for example by nick translation, random priming, or the polymerase chain reaction (PCR). The probe length should range between 300 and 800 bp. Long probes are sterically hindered from pervading the target efficiently, while short probes tend to hybridize less specifically; both cases increase the labeling background. One special advantage of nick translation is that the probe length can be controlled via the efficiency of the DNase I digest. After synthesis, free unbound reporter nucleotides must be removed from the probe sample (e.g., by ethanolic

35 Fluorescent In Situ Hybridization in Molecular Cytogenetics


precipitation of the probe DNA or by column filtration). One hybridization experiment consumes about 10–100 ng of probe DNA.

Nick Translation
Nick translation is based upon polymerase I (Pol I) dependent incorporation of nucleotides at single-strand breaks (nicks). The first reaction step is a DNase I digestion of the double-stranded DNA to introduce single-strand breaks at an appropriate rate. Second, starting at each nick, polymerase I exchanges one half-strand, degrading the old strand by its 5´-exonuclease activity while polymerizing new nucleotides onto the 3´-OH residue at the nick, thus allowing the incorporation of reporter nucleotides. In this way the nick moves in the 5´→3´ direction of synthesis until another nick on the complementary strand of the DNA is encountered, causing a double-strand break. The length of the labeled DNA fragments therefore depends on the frequency of nicks and can be controlled experimentally by the intensity of the DNase I digestion.

Random Priming
This is the method of choice for labeling fragments of initial DNA that are too short for nick translation (i.e., below 2 kb). The DNA pool is denatured into single strands, which serve as templates for primer extension by the Klenow fragment of Pol I. The required primers are offered as a mixture of hexanucleotides in all possible combinations of the four bases.

PCR Labeling
This is the natural way of labeling probes that are obtained by PCR amplification.

35.1.4 In Situ Hybridization

Strategies towards high-specificity hybridization for FISH and CGH proceed in four distinct steps: denaturation (of probe and target DNA), pre-annealing (saturation of ubiquitous repetitive sequences), hybridization, and stringency differentiation. Further required measures are the improvement of target accessibility (e.g., acidic histone extraction, protein digestion, and detergent extraction), as well as additional labeling steps, for example those involved in indirect labeling and in counterstaining chromosomes or nuclei.

Specificity of the Hybridization and Stringency, Section 28.1.2

Denaturation
The denaturation of DNA occurs a few degrees above its melting temperature Tm. Simply boiling for some minutes suffices to melt probe DNA. If structural integrity is of concern, however, as is typically the case for the target DNA, this much heat is not acceptable. To diminish Tm, incubation buffers therefore include 50–70% formamide as a destabilizer in addition to a high concentration of monovalent salt.

Pre-annealing
This strategy suppresses background from IRS (interspersed repetitive sequences), which is necessary for the detection of unique target sequences by singular probes. After melting the labeled probe DNA, non-labeled Cot-1 DNA is added in excess and allowed to "pre-anneal" with the complementary sequences of the probe DNA, which thereby lose their potential to hybridize to the corresponding sites of the target (chromosomal in situ suppression, CISS).

Hybridization
The hybridization of probe DNA to the target DNA is initiated by lowering the temperature below Tm. To guarantee saturation, conditions are set to low stringency. A reaction time of a few hours suffices for the hybridization of simple repetitive probes, while CGH experiments with complex singular probes require several days.

Stringency
Another source of background is the hybridization of partners that match only partially. The probability of such pairings is intrinsically high in painting experiments. Two strategies are useful to suppress this effect: (i) adding non-labeled DNA of another species (e.g., salmon sperm DNA) to the hybridization mix and (ii) appending a stringency wash after hybridization. Poorly paired partners re-dissociate in washing buffers of reduced salt concentration and at increased temperature.
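The combined effect of salt, formamide, GC content, and probe length on Tm can be illustrated with the widely quoted empirical rule of thumb for long DNA hybrids. The coefficients below are approximate and vary between sources; this is a hedged sketch, not a protocol value:

```python
import math

def tm_hybrid(na_molar, gc_percent, formamide_percent, length_bp):
    """Empirical melting temperature (deg C) of a long DNA duplex.

    Rule of thumb: Tm = 81.5 + 16.6 log10[Na+] + 0.41(%GC)
                        - 0.72(%formamide) - 500/L
    Coefficients (especially the ~0.72 per % formamide) are approximate.
    """
    return (81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent
            - 0.72 * formamide_percent - 500.0 / length_bp)

# Illustrative 500 bp probe, 50% GC, ~0.3 M Na+ (roughly 2x SSC),
# with and without 50% formamide in the buffer:
without_fa = tm_hybrid(0.3, 50.0, 0.0, 500)
with_fa = tm_hybrid(0.3, 50.0, 50.0, 500)
print(round(without_fa, 1), round(with_fa, 1))
```

With these assumed coefficients, 50% formamide depresses Tm by roughly 36 °C, which is why chromosomal targets can be denatured without destroying their morphology.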

Cot is a value used to characterize the complexity of a given DNA pool; it is the product of the initial DNA concentration (C0) and the reassociation time (t), and the Cot value typically required for reassociation reflects the pool's complexity. The part of a metazoan genome that reassociates at Cot < 1 represents its repetitive sequences.
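Why Cot-1 DNA is enriched in repetitive sequences follows from second-order reassociation kinetics: the fraction still single-stranded at a given Cot is 1/(1 + Cot/Cot½), where Cot½ is proportional to sequence complexity. A minimal sketch with assumed, purely illustrative Cot½ values:

```python
def fraction_single_stranded(cot, cot_half):
    """Second-order reassociation: fraction of a DNA pool still
    single-stranded at a given C0*t, parameterized by its Cot-1/2."""
    return 1.0 / (1.0 + cot / cot_half)

# Assumed Cot-1/2 values: a fast (repetitive) and a slow (unique) fraction
for label, cot_half in [("repetitive", 0.01), ("unique", 1000.0)]:
    print(label, round(fraction_single_stranded(1.0, cot_half), 3))
```

At Cot = 1 the repetitive fraction has almost completely reannealed while the unique fraction is still essentially single-stranded, which is exactly what makes Cot < 1 DNA a useful competitor for IRS suppression.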


Part V: Functional and Systems Analytics

35.1.5 Evaluation of Fluorescent Hybridization Signals

Light Microscopy Techniques – Imaging, Chapter 8
DNA Microarray Technology, Chapter 37

FISH and CGH experiments, which are performed for their complex structural information, are evaluated by fluorescence microscopy. Microarray CGH, in contrast, by design allows fully automatic data acquisition by a simple chip reader. In any case, fluorescence is measured, and the choice of fluorescent markers evidently must account for the technical capabilities of the available instrumentation (filter settings and laser types). In this respect, multicolor applications are particularly demanding. For such cases it is worth determining whether or not physical overlap of signals occurs. Thus, for multicolor FISH (Section 35.2.1) on metaphase spreads, signals are identified unequivocally by their spectral signature, as chromosome paints do not overlap. Using the same probes to evaluate neighborhood relationships of chromosomes in interphase, in contrast, requires sufficient spectral separation of the detection channels (Figure 35.1) to avoid false signals due to cross-talk. Although cross-talk can, in principle, be eliminated analytically, this inevitably reduces the signal-to-noise ratio.
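The analytical elimination of cross-talk mentioned above amounts to inverting a channel mixing matrix. A minimal two-channel linear-unmixing sketch, with purely illustrative bleed-through factors (real values must be calibrated per instrument and fluorochrome pair):

```python
# Model: measured_ch1 = true1 + b21*true2, measured_ch2 = b12*true1 + true2.
# Unmixing inverts this 2x2 system by hand.

def unmix(ch1, ch2, bleed_1to2=0.2, bleed_2to1=0.05):
    """Recover true dye signals from two measured channels.
    The bleed-through coefficients are assumed, illustrative values."""
    det = 1.0 - bleed_1to2 * bleed_2to1
    true1 = (ch1 - bleed_2to1 * ch2) / det
    true2 = (ch2 - bleed_1to2 * ch1) / det
    return true1, true2

# A dye-1-only signal of 100 units leaking 20 units into channel 2
# is correctly attributed entirely to dye 1:
print(unmix(100.0, 20.0))
```

Note that the subtraction also propagates the noise of both channels into each unmixed signal, which is the signal-to-noise penalty referred to in the text.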

35.2 Application: FISH and CGH

The aim of FISH is to localize a target within its native environment: the target sequence is known and the experiment is designed to localize it and assess its abundance, for example to map a genomic region throughout the genome of a tumor. As a consequence, FISH refers to single-cell events, and many observations must be collected for a representative appraisal of a cell population. CGH, in contrast, is designed to reveal numeric genomic aberrations of a cell collective (e.g., a tumor sample). A known target (a metaphase spread of a standard cell line, or a sequence collection on a DNA chip) is co-hybridized with the test genome labeled in one color and a reference genome labeled in another color, and a hybridization profile along the genome is read out as the normalized signal ratio between the two probes.

35.2.1 FISH Analysis of Genomic DNA

Figure 35.1 Comparison of the absorption (a) and emission spectra (b) of DAPI, FITC, TRITC, and Alexa633®. All four spectra are separated sufficiently to allow unequivocal allocation of four non-overlapping targets. However, overlap is also notable: TRITC is significantly excited at 488 nm, the excitation maximum of FITC, and emission of DAPI still occurs in the detection range of FITC.

Metaphase FISH
Metaphase spreads are prepared from cells in culture, which are arrested in metaphase by treatment with colchicine, a drug that destabilizes microtubules (Figure 35.2). The arrested culture is swollen by hypotonic incubation and dropped from some distance onto a glass slide, causing the cells to burst while their chromosomes adsorb to the glass surface in groups representing the chromosomal composition of single cells. Hybridizing the probe of interest to this slide then allows the respective genomic positions to be spotted on the adsorbed chromosomes at light-microscopic resolution. As a matter of course, this approach is applicable only to dividing cells; fixed biopsies or fresh tumor material that does not grow in culture cannot be investigated this way. While ploidies and polysomies are satisfactorily diagnosed by classical karyotyping, FISH adds the important option of recording distinct genomic regions. Typical applications of metaphase FISH are:

Translocation: shift of a chromosomal region to a new position in the genome.
Marker chromosome: additional chromosome of unknown origin.

- physical mapping of genes and genomic marker sequences: the resolution to separate two locations is limited to a few Mbp because of the dense packing of chromatin in metaphase; a more detailed view is possible, however, using protocols to spread the DNA (fiber-FISH);
- identification of sites of ectopic DNA integration (e.g., upon stable transfection of cells);
- identification of genome homologies among species (zoo-FISH): hybridizing labeled whole DNA of one species to metaphase spreads of another species readily reveals the extent of homology;
- detection of translocations via chromosome paints; multicolor FISH (see below) unravels unknown translocations;
- composition analysis of marker chromosomes via painting probes towards sections of chromosomes or whole chromosomes of interest.

Multicolor FISH This takes advantage of the fact that genomic regions on metaphase chromosomes do not overlap. Thus, to separate signals from different probes, they may be


labeled by mixtures of fluorochromes (spectral signature). Three probes, for example, are readily distinguishable when labeled with colors A, B, and a combination of A and B, respectively. The ratio method extends this approach to the use of only two fluorochromes for labeling many probes, each in a well-controlled ratio of the two. Labeling all 22 autosomes plus the X- and Y-chromosomes of the human genome, to allow their direct identification by color instead of by spatial signature (i.e., size, length of the p-arm, and banding pattern), is feasible with a combination of only five fluorochromes (multiplex (M-)FISH).

Interphase FISH and Fiber-FISH
As mentioned above, the preparation of metaphase spreads requires proliferating cells. However, valuable information is also obtained by FISH from interphase chromosomes of non-proliferating systems. Thus, numeric aberrations are directly accessible through the number of focal signals per cell nucleus. Ploidies, for example, are detected using repetitive probes towards centromeric regions. Though the chromosomal position of a signal is not directly visible, genomic distances between targets are accessible by exploiting their linear relationship with the measurable mean squared spatial distance. This approach yields even higher resolution (100 kb) than metaphase genome mapping, since interphase chromatin is less condensed. Even higher resolution, down to 1 kb, is provided by fiber-FISH, where the chromatin is splayed by exhaustive spreading strategies (halo preparation). Fiber-FISH makes it possible to display genomic aberrations such as deletions, inversions, duplications, amplifications, or translocations at the gene level.
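That five fluorochromes suffice for M-FISH labeling of all 24 human chromosome types follows from simple combinatorics: five dyes allow 2⁵ − 1 = 31 nonempty combinations, each a distinct spectral signature. A short sketch (dye names are placeholders):

```python
from itertools import combinations

fluors = ["A", "B", "C", "D", "E"]   # five fluorochromes, as in M-FISH

# All nonempty subsets of the dye set: each is one combinatorial signature
signatures = [c for r in range(1, len(fluors) + 1)
              for c in combinations(fluors, r)]
print(len(signatures))               # 31 distinct spectral signatures

# 22 autosomes + X + Y = 24 chromosome types to be distinguished
assert len(signatures) >= 24
```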

Comparative Genomic Hybridization (CGH), Section 37.2.4

35.2.2 Comparative Genomic Hybridization (CGH)

CGH (comparative genomic hybridization) was originally developed to analyze solid tumors for genomic imbalances. Its readout is a listing of genomic gains and losses in the genome of a particular cell population (e.g., a tumor) relative to a reference cell population. Interest in this technology has, though, declined, as modern high-throughput sequencing answers the same question quickly, economically, and with higher accuracy. CGH measures genomic imbalances while ignoring genomic aberrations such as translocations and rearrangements that do not affect the numerical balance (Figure 35.3). Furthermore, ploidies affecting the entire genome remain undiscovered, owing to the normalization of the data (see below). The determination of genomic imbalances is of particular interest in cancer research to gain access to the causes of tumorigenic degeneration: genomic loss points to potential tumor suppressor genes, gain to proto-oncogenes. Comparing the gains and losses within large collectives of patient material allows minimal altered regions, so-called hot spots, to be narrowed down and further analyzed by other methods, for example to qualify candidate genes and to unravel functional pathways involved in tumor development. CGH is a two-color ratio application. Test DNA (e.g., DNA of a tumor biopsy) labeled in one color and reference DNA (from a healthy donor) labeled in another color are co-hybridized to a target (a metaphase preparation in chromosomal CGH; an immobilized array of characterized nucleotide sequences in microarray CGH) and the relative intensities of the two signals are measured as a function of target localization. Sequences that are overrepresented in the test DNA cause dominance of the test-DNA signal at the corresponding target site; losses are indicated by dominance of the reference-DNA signal.
Since the labeling efficiencies of the two colors are technically not sufficiently reproducible, data analysis requires normalization to the balanced state for each experiment.

Normalization
Since the genomic imbalances of an unknown tumor sample can be unexpectedly vast, it is difficult to identify a reference region within the measured dataset for normalization. A simple approach is to accept that most of the genome is balanced and that normalizing the integral signal (test signal/reference signal set to 1) is sufficiently precise. Alternatively, there may be good reasons to assume that a certain genomic region has remained balanced.

Intermixture
Imbalances measured by CGH appear weaker than their nominal value. This happens because non-tumor cells, which are part of each biopsy, contribute a balanced fraction to the extracted test DNA. Since this damping of the ratios by intermixture reduces the

Figure 35.2 Scheme of common applications of metaphase and interphase FISH: Singular probes towards autosomal sequences regularly label two target loci, the maternal and the paternal gene copy, respectively. In metaphase (left-hand side), signals appear paired, since chromosomes consist of two sister chromatids; this is not the case in interphase. (a) Metaphase FISH towards three different target sequences directly demonstrates their sequential order, while interphase FISH only reveals neighborhood relationships upon statistical evaluation. (b) Both metaphase and interphase FISH allow the direct demonstration of deletions. (c) Showing translocations directly also requires metaphase spreads as a target. (d) Gains of genomic regions associated with translocation (asterisk) or polysomy become visible on metaphase as well as interphase preparations. Tandem repeats (triangle), however, are difficult to resolve spatially.


Figure 35.3 Chromosomal CGH and microarray CGH: CGH (comparative genomic hybridization) measures the ratio of fluorescence intensities of two probes, test and reference DNA, cohybridized to the same target and distinguished by their reporter-fluorescence (here: green signal for tumor DNA; red signal for DNA of a healthy donor). Chromosomal CGH (left-hand side) uses metaphase spreads as targets, while the target in microarray CGH is an array of predetermined nucleotide sequences immobilized onto a glass slide (DNA chip). As a result (bottom), genomic gains within the test-genome are expressed by the dominance of green signals over red (e.g., arrowhead); the dominance of red signals over green indicates genomic losses accordingly (e.g., arrow).

sensitivity of the method, eventually only high-level amplifications are detected significantly. To improve the situation, cell-sorting or micro-dissection techniques have been exploited to enrich the fraction of tumor cells before preparation of the test DNA.
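The normalization and significance thresholding described above can be sketched numerically. This is a minimal illustration with toy log2 ratios; the MAD-based sigma estimate is a design choice of this sketch (it keeps true aberrations from inflating the noise estimate), not a procedure taken from the text:

```python
import statistics

def cgh_calls(log2_ratios, n_sigma=3.0):
    """Median-center log2 test/reference ratios and score gains/losses.

    Assumes most targets are balanced: their median is shifted to zero,
    and targets outside an n-sigma band are flagged. Sigma is estimated
    robustly from the median absolute deviation (MAD)."""
    med = statistics.median(log2_ratios)
    centered = [r - med for r in log2_ratios]
    mad = statistics.median(abs(r) for r in centered)
    sigma = 1.4826 * mad   # MAD -> stdev for normally distributed noise
    calls = []
    for r in centered:
        if r > n_sigma * sigma:
            calls.append("gain")
        elif r < -n_sigma * sigma:
            calls.append("loss")
        else:
            calls.append("balanced")
    return calls

# Mostly balanced toy targets with one amplification and one loss
ratios = [0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, 2.0, -1.5, 0.0]
calls = cgh_calls(ratios)
print(calls.count("gain"), calls.count("loss"))
```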

DNA Microarray Technology, Chapter 37
Microarray Comparative Genome Hybridization (CGH), Section 37.2.4

Chromosomal CGH
This is based on metaphase spreads as a target. Such spreads are typically obtained from peripheral blood cells of healthy donors; the preparation methods are established and reliable. The target obviously represents the entire genome including all repetitive sequences, which need to be blocked before hybridization (CISS, see Section 35.1.4). After co-hybridization, the samples are counterstained with DAPI, which allows the identification of chromosomes by a banding pattern reminiscent of classic Giemsa staining. Data acquisition and processing proceed by the following steps: (i) several metaphase spreads are digitally imaged into three channels representing the test probe, the reference probe, and the chromosome structure; (ii) chromosomes are identified by their structure (DAPI stain) and digitally segmented for each individual metaphase spread; (iii) chromosomes are straightened by a kind of computer animation and superposed to obtain an averaged ideogram for the acquired dataset; (iv) the intensities of the test and reference signals, cumulated perpendicular to the chromosome axes, are divided by each other, normalized, and read out as a profile for each chromosome ideogram (Figure 35.4). To be detected, low-number gains need to stretch over more than 10 Mbp; significance for high-level amplifications is reached already for genomic regions shorter than this. Nevertheless, in the search for candidate genes the physical resolution of chromosomal CGH is rather poor.

Microarray CGH
This technique (also known as matrix CGH or array CGH) uses DNA chips as a target instead of metaphase preparations. These chips are glass slides with thousands of pre-chosen target sequences immobilized in an array of tiny spots. In contrast to metaphases, DNA chips do not necessarily cover a whole genome (tiling-path resolution) and their composition of target sequences is a strategic decision.
Different chips for different purposes are commercially available, for example, tailored for differential diagnosis of related disorders.


Figure 35.4 Exemplary demonstration of results from chromosomal CGH (a) and microarray CGH (b). (a) On the left-hand side, gastric cancer with amplification of 2p23-p24 (arrowhead; see also Figure 35.3, which shows the entire metaphase spread of this experiment). The ratio profile plotted next to the chromosome 2 ideogram represents an average over more than ten single measurements. The nominally balanced state is marked with the black baseline; red lines to the left and to the right indicate the agreed confidence limit beyond which aberrations are scored. On the right-hand side in (a): To allow straightforward comparison within a whole patient collective, the extent of scored imbalances is visualized as bars allotted to the corresponding positions in the ideogram (right, gains; left, losses). Recurrent aberrations within a collective of pancreatic cancers are readily appreciable in this example. (b) Bar diagram to visualize microarray CGH ratios measured for a case of glioblastoma (female patient, male reference DNA): Each bar represents the CGH ratio of one target sequence plotted on a logarithmic scale in chromosomal order (1–22, X, and Y). The nominal value for the balanced state on the logarithmic scale is zero; ratios of balanced regions scatter around this baseline. Significance levels of aberrations can be calculated from the statistical mean variation of the balanced regions for each experiment. Here, the confidence level of three standard deviations is indicated by the red lines. Gains in chromosomal regions 1q23-q43 and 7p12, and losses in 6q24-q25, 9p21, and chromosome 10, are readily appreciable (in brackets: candidate genes).

The chip technology offers two important advantages over chromosomal CGH:

- The physical resolution is much better, limited only by the target-sequence length required for efficient hybridization (roughly 100 bp), which is short compared to the length of single genes.
- Analysis of microarrays is fully automatic and therefore suited to high-throughput approaches.

In the early days, DNA chips were composed from DNA libraries. The management of such whole-genome libraries, including amplification, purification, characterization, and archival storage of the DNA, is tedious, and in practice the quality and assignment of the ordered material was often insufficient. Nowadays, oligonucleotide chips are used. The oligonucleotides are either synthesized directly on-chip or spotted from solution onto the chips by robotic loading. As they are designed artificially, they do not contain unwanted sequences, in particular no IRS, and the hybridization signals are much higher than for chips based on library DNA. Nevertheless, target sequences are placed in replicates to increase the statistical significance of the hybridization experiment. Hybridization and data acquisition are performed with special instrumentation to fit a high-throughput workflow, in which error-proof tracking of the sample data must be guaranteed. Data analysis involves large program packages. Starting with two scanned


images of the chip, representing the signal intensities of the test and reference DNA, respectively, the single spots are segmented by automatic image analysis, and automatic filters are applied to exclude spots of poor quality, for example due to dirt, contact with neighboring spots, or weak intensity. After local background subtraction, the median of the mean intensities of the replicates is taken as the final measurement for each target sequence and exported in a table for normalization and segmentation of genomic imbalances with other software. Further analysis relies on computational tools to manage the enormous amount of data. Typically, one experiment delivers more than 100 genes as potentially affected, so further criteria are required to filter the datasets for relevant clues. Strategies range from elaborate cluster analysis to the screening of databases for pathway analysis.
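The spot-level processing described above (quality filtering, local background subtraction, median over replicates) can be sketched as follows. The tuple structure and the intensity threshold are assumptions of this sketch, not specifics from the text:

```python
import statistics

def spot_value(replicate_spots, min_intensity=50.0):
    """Aggregate the replicate spots of one target sequence.

    Each spot is assumed to be (mean_foreground, local_background).
    Weak spots are filtered out, the local background is subtracted,
    and the median over the surviving replicates is the final value.
    Returns None if no replicate passes the quality filter."""
    corrected = [fg - bg for fg, bg in replicate_spots if fg >= min_intensity]
    return statistics.median(corrected) if corrected else None

# Three good replicates and one weak spot (excluded by the filter)
spots = [(820.0, 120.0), (790.0, 100.0), (30.0, 25.0), (845.0, 130.0)]
print(spot_value(spots))   # median of [700.0, 690.0, 715.0] -> 700.0
```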

Further Reading

Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Aparicio, S.A., Behjati, S., Biankin, A.V., et al. (2013) Signatures of mutational processes in human cancer. Nature, 500, 415–421.
Gray, J.W., Pinkel, D., and Brown, J.M. (1994) Fluorescence in-situ hybridization in cancer and radiation biology. Radiat. Res., 137, 275–289.
Kallioniemi, A., Kallioniemi, O.-P., Sudar, D., Rutovitz, D., Gray, J.W., Waldman, F., and Pinkel, D. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258, 818–821.
Lichter, P., Chang Tang, C.-J., Call, K., Hermanson, G., Evans, G.A., Housman, D., and Ward, D.C. (1990) High-resolution mapping of human chromosome 11 by in situ hybridization with cosmid clones. Science, 247, 64–69.
Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, J., Kowbel, D., Collins, C., Kuo, W.L., Chen, C., Zhai, Y., Dairkee, S.H., Ljung, B.M., Gray, J.W., and Albertson, D.G. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat. Genet., 20, 207–211.
Solinas-Toldo, S., Lampel, S., Stilgenbauer, S., Nickolenko, J., Benner, A., Döhner, H., Cremer, T., and Lichter, P. (1997) Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chrom. Cancer, 20, 399–407.
Trask, B.J. (1991) Fluorescence in-situ hybridization: applications in cytogenetics and gene mapping. Trends Genet., 7, 149–154.

36 Physical and Genetic Mapping of Genomes

Christian Maercker
Esslingen University, Kanalstraße 33, 73728 Esslingen, Germany

The creation of genetic and physical maps of genomes is an important area of research in the life sciences and medicine. Work in this field plays an essential role in understanding the genetics of organisms and enables prenatal diagnostics, diagnostics of complex diseases, and therapeutic approaches in personalized medicine. Scientists currently estimate over 10 000 diseases to be monogenic (WHO). In addition, many diseases have polygenic etiology, meaning that several gene defects contribute to the expression of the disease phenotype. One of the goals of gene mapping is to identify all genetic variants relevant for disease. A physical map contains the sequences of markers and shows the distances between markers within the genome (see cytogenetic methods described in the previous chapter). A genetic map is based on the analysis of the common inheritance of defined markers of known position coupled to a certain phenotype, in this case a specific disease.

36.1 Genetic Mapping: Localization of Genetic Markers within the Genome

Genetic information in eukaryotes is distributed over discrete chromosomes, which are visible during the metaphase stage of mitosis. An image of all the metaphase chromosomes within a single cell is described as its karyotype. The karyotype of a normal diploid human cell contains 46 chromosomes. Diploid organisms have two copies of each autosome and, in addition, two sex chromosomes. During metaphase, each chromosome in turn consists of two identical sister chromatids.

36.1.1 Recombination

Somatic cells proliferate by mitotic division. Before the cell divides (cytokinesis), each chromosome is replicated and one copy of each chromosome is transferred into each daughter cell. The germ cells are generated by meiosis, that is, two successive cell divisions resulting in a haploid chromosome set. During meiosis, genetic material is exchanged between non-sister chromatids in a process called crossing over, resulting in two recombinant chromatids. The probability of a crossover event between two loci on a chromosome depends on the distance between these loci. The recombination frequency is high for loci that are far apart, which results in them being inherited independently. Neighboring genes, on the other hand, are usually inherited together, exhibiting genetic linkage. The degree of genetic linkage is therefore an indirect measure of the physical distance between genes on a chromosome:

Distance (cM) = (number of recombinants / number of offspring) × 100    (36.1)
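Equation (36.1) translates directly into a one-line computation (the offspring counts below are invented for illustration):

```python
def genetic_distance_cM(n_recombinant, n_offspring):
    """Map distance from Eq. (36.1): recombination frequency in percent,
    i.e. one centimorgan per 1% recombinants among the offspring."""
    if n_offspring <= 0:
        raise ValueError("need at least one offspring")
    return 100.0 * n_recombinant / n_offspring

# Hypothetical cross: 17 recombinant individuals among 200 offspring
print(genetic_distance_cM(17, 200))   # 8.5 cM
```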

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Cytogenetic Methods, Section 36.1


One centimorgan (cM) is therefore defined as a recombination frequency of one percent. However, this type of genetic measurement is only possible over limited distances, since the probability of two crossovers increases with the distance between two loci. In that event, two distant loci separated by two crossover events cannot be distinguished from ones where no crossover has taken place; thus the real distance between two loci on the genetic map might be underestimated. It is, however, also possible to map longer distances via intermediate steps: the distances of the respective single steps are then added to determine the genetic distance between the genes of interest. This method is also recommended as verification of distances measured directly. In the human genome, the genetic

Figure 36.1 Recombination and crossing over. During meiosis, genetic material is frequently exchanged between non-sister chromatids. This results in recombinant chromatids containing both maternal and paternal genetic material. Detecting recombination is only possible if maternal and paternal genetic material can be discriminated, which is achieved via polymorphic markers whose alleles can be distinguished. Here the marker at locus 1 appears as a or A, whereas locus 2 is detectable as b or B.


distance of 1 cM corresponds to about 1 Mb, when averaged over the entire genome. However, this correlation is of limited value, because the recombination frequency, and thus the apparent distance, varies between genomic regions. Some regions are rarely affected by recombination events, whereas others show a very high recombination frequency and are therefore referred to as recombination hot spots. The determined genetic distance therefore frequently differs from the calculated physical distance. To map the chromosomes quickly and completely, markers, fixed points on the genomic map, are determined. A genomic locus qualifies as a marker if it is present in different variants across the whole population. Only when the parents carry different variants, alleles, of a marker can a new combination be observed in the F1 generation. Such markers are called polymorphic markers (Figure 36.1). The quality of a marker is determined by its heterozygosity, which is described by the number of possible alleles and their relative frequencies in the population. The more distinct alleles a marker has and the more evenly these alleles are distributed in the population, the more helpful it is for the analysis of recombination events from one generation to the next. Markers that can be used for an individual analysis in this respect, that is, which appear as different alleles in both parents, are called informative. Markers can be subdivided into two categories. For the generation of genetic maps, polymorphic markers are especially valuable, whereas for physical mapping a unique, well-defined position is most important: each marker must be attributable to a single copy at a single locus within the genome, but it need not necessarily be informative. Genetic markers are restriction fragment length polymorphisms (RFLPs), microsatellites, and single nucleotide polymorphisms (SNPs)/single nucleotide variants (SNVs).
Physical markers are genes, sequence-tagged sites (STSs), and chromosome breakage sites.

Informative markers are defined points in the map of a genome in which different alleles are present in the parental generation, for example, microsatellite markers, a subset of the polymorphic sequence tagged sites (STSs).

36.1.2 Genetic Markers

Restriction Fragment Length Polymorphisms
Restriction fragment length polymorphisms (RFLPs) were for a long time the most important class of DNA polymorphisms. They are most often the result of single base changes, but can also result from insertions or deletions. If a change alters the recognition site of a restriction endonuclease, this leads to differences in the length of restriction fragments of the genome, which are easily observed. Although most sequence variations cannot be correlated with phenotypic changes, they nevertheless behave as Mendelian genes and can therefore be used as genetic markers (Figure 36.2).

Microsatellites (Polymorphic STSs)
Microsatellites (polymorphic sequence-tagged sites, STSs) are a special type of STS (Section 36.2.3). For STSs, knowledge of the specific sequence of the locus is necessary. Conserved primer sites serve to amplify the polymorphic STS, while the sequences between the primer binding sites differ in length. Therefore, in contrast to non-polymorphic STSs, allele-specific PCR products can be generated. These allelic differences are inherited according to Mendelian rules, which qualifies them as genetic markers. In most cases, microsatellites consist of repetitive sequences with varying numbers of repeats in the different alleles. Such short repeats appear in all eukaryotic genomes. Most of these repetitive sequences contain C/A units, which can be amplified with locus-specific flanking PCR primers. The number of C/A units is very often highly polymorphic, which is very convenient for genetic mapping. In addition, a high density of markers can be achieved, because microsatellites can be as frequent as one repeat cluster per 50 kb of genomic sequence.

Single Nucleotide Polymorphisms (SNPs)/Single Nucleotide Variants (SNVs)
SNPs/SNVs are variants of single base pairs (bp) in a DNA strand.
SNPs/SNVs represent about 90% of all genetic variants of the human genome and appear at different densities in certain genomic regions. In two-thirds of all SNPs/SNVs, cytosines and thymines are exchanged, because in vertebrate genomes cytosine is very frequently methylated, and by spontaneous deamination 5-methylcytosine is converted into thymine. These exchanges are also generally called "successful" point mutations, that is, genetic changes that have successfully spread to a certain degree within the gene pool of a population. In the International HapMap Project, more than four million SNPs/SNVs were identified. Since SNPs/SNVs are very frequent and can be analyzed


Part V: Functional and Systems Analytics


Figure 36.2 (a) Restriction fragment length polymorphisms (RFLPs). RFLPs can be recognized when a restriction site (RE) is affected by a polymorphism. RFLPs can be detected with a hybridization probe located near the locus; the respective alleles can be visualized by autoradiography. (b) Polymorphic STSs (microsatellites) are revealed by PCR, because the distance between the primer binding sites differs between different alleles. The variation between DNA fragments is often the result of a variable number of small repetitive elements, in most cases consisting of (CA)n. For verification, genomic DNA can be separated by high-resolution gel electrophoresis (e.g., PAGE), which allows the detection of length differences of even a single nucleotide.

36 Physical and Genetic Mapping of Genomes

using microarray analyses or high-throughput sequencing, they are now the most commonly used genetic markers.
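Returning to the microsatellite markers described above, a (CA)n repeat can be located in a raw sequence with a simple pattern search. This is a sketch: the sequences and the minimum repeat count are invented, and real genotyping compares PCR-product lengths rather than reading the repeat directly.

```python
import re

def ca_repeats(seq, min_units=6):
    """Return (start, repeat_count) for every uninterrupted (CA)n run.

    min_units is an invented threshold for this example; it filters out
    chance dinucleotide pairs.
    """
    return [(m.start(), len(m.group()) // 2)
            for m in re.finditer("(?:CA){%d,}" % min_units, seq)]

# Two hypothetical alleles of the same locus differing only in repeat
# number, as they would appear as PCR products of different length
# (compare Figure 36.2b).
allele_1 = "GGTTAC" + "CA" * 12 + "TGGA"
allele_2 = "GGTTAC" + "CA" * 15 + "TGGA"

print(ca_repeats(allele_1))   # [(6, 12)]
print(ca_repeats(allele_2))   # [(6, 15)]
```

The differing repeat counts (12 vs 15) are exactly the allelic length difference that high-resolution gel electrophoresis resolves.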


36.1.3 Linkage Analysis – the Generation of Genetic Maps

The generation of genetic maps is often called linkage analysis, because it investigates the common inheritance of two or more markers "linked" together. Linkage is broken up by the reciprocal exchange (crossing over) of genetic material during meiosis (Section 36.1.1). Recombination between homologous chromosomes during meiosis is a common event, and the resulting new combinations of markers serve as the basis for linkage analysis. The goal is to find out whether two loci are inherited together more often than would be expected if they were located on two different linkage groups (chromosomes or distant segments of a single chromosome). During meiosis, each pair of homologous chromosomes is distributed independently into the daughter cells; a locus therefore co-segregates with a second locus on another chromosome with a probability of 50%. Loci on the same chromosome are expected to be separated by recombination with a probability of less than 50%, depending on the distance between them. The proportion of recombinants observed between two loci is called the recombination fraction, θ. The value of θ ranges from 0, for closely neighboring loci that are not separated by recombination, up to 0.5 for distant loci or loci located on different chromosomes. Thus, θ is a measure of the genetic distance between two loci. However, as previously mentioned, this principle is only valid for short distances: since the probability of multiple recombination events increases with the distance between two loci, θ has to be converted into a genetic distance by a mapping function. Two loci are considered genetically linked if θ is less than 0.5. The task of linkage analysis is to determine θ and to calculate its statistical significance when it is less than 0.5. The χ² test is a simple method for the detection of genetic linkage, which, however, is only valid for organisms with a large number of offspring.
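The conversion of θ into a genetic distance requires a mapping function, as noted above. The text does not specify one; Haldane's function, d = −(1/2) ln(1 − 2θ), which assumes the absence of crossover interference, is a common choice and is sketched here:

```python
import math

def haldane_cm(theta):
    """Haldane mapping function: genetic distance in centimorgans (cM)
    for a recombination fraction theta (0 <= theta < 0.5).
    Assumes no crossover interference."""
    return -0.5 * math.log(1.0 - 2.0 * theta) * 100.0

# For small theta the map distance approaches theta x 100 cM ...
print(round(haldane_cm(0.01), 2))   # ~1.01 cM
# ... while for large theta hidden multiple crossovers inflate it:
print(round(haldane_cm(0.43), 1))   # ~98.3 cM
```

The second value illustrates why θ alone underestimates long distances: a θ of 0.43 corresponds to a map distance of almost 100 cM.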
The χ² Test

The statistical significance of a linkage can be estimated as follows. As described above, linkage analysis is based on the recombination frequency between two defined loci, which is reflected in the relative frequencies of the different classes of meiotic products. Two heterozygous loci in a diploid organism (Aa Bb) can result in four different combinations in gametes: AB, ab, Ab, and aB. If the two loci are not linked (not located on the same chromosome), all four combinations are expected in the proportion 1 : 1 : 1 : 1. If the markers are linked, the distribution is different: the marker combinations produced by recombination are under-represented. The first question to be answered is therefore whether the proportion is 1 : 1 : 1 : 1. If it is, the loci are not linked; if the proportions deviate, the loci are linked. If the deviation from an equal distribution is obvious, mapping is simple; small deviations from the expected proportions, however, have to be confirmed statistically before a clear conclusion can be drawn.

An example: 500 products of meiosis are analyzed, with the following experimental results:

Class 1: AB 145
Class 2: ab 140
Class 3: Ab 105
Class 4: aB 110

The test of recombination frequency reveals 105 + 110 = 215 recombinants (43%, θ = 0.43). If inheritance of these markers is not linked, the expected value is 50%, which differs from the observed value of 43%. Is the difference between the experimentally observed and the expected value (50% − 43% = 7%) significant, or is it a random result because only 500 meiotic products were analyzed? The χ² test is the method of choice to answer this question:

1. The null hypothesis posits no linkage.
2. The value of χ² is calculated. The number of test points is the critical variable for the evaluation of the significance of the results. For the calculation of χ², the number of

Recombination fraction: the proportion of recombinants arising by crossing over between two markers during meiosis; it is related to the distance between the respective loci on the chromosome. This distance, given in centimorgans (cM), is determined by linkage analysis.


Table 36.1 Calculation of χ²: χ² = Σ (N − E)²/E, summed over all classes.

Class   N     E     (N − E)²   (N − E)²/E
AB      145   125   400        3.2
ab      140   125   225        1.8
Ab      105   125   400        3.2
aB      110   125   225        1.8
Sum     500   500              χ² = 10.0

experimental meiosis products (N) is compared to the number of meiosis products that would be expected according to the null hypothesis (E) (Table 36.1).
3. By means of the χ² value, the plausibility, p, of the null hypothesis is calculated. The first step is to determine the degrees of freedom (df):

df = number of classes − 1    (36.2)

In our case: df = 4 − 1 = 3. Table 36.2 shows the probability of the null hypothesis. With χ² = 10 and df = 3, the probability of non-linked inheritance for the two analyzed loci is p ≈ 0.015.
4. Acceptance or rejection of the null hypothesis. The value p = 0.05 is usually chosen as the threshold for exclusion of the null hypothesis. Since the determined value of 0.015 is less than 0.05, the null hypothesis is rejected: the assumption is that a linkage exists in this case.

Tests for Plausibility

For many pedigrees in higher mammals, including humans, it is often not possible to determine or measure all recombinants and non-recombinants. In these cases, instead of the χ² test, a method based on the calculation of the likelihood of recombination between two given loci is applied. The calculation of the linkage probability
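The four numbered steps can be reproduced in a few lines. This sketch uses the observed classes from the example; the critical value 7.815 (df = 3, p = 0.05) is read from Table 36.2:

```python
observed = {"AB": 145, "ab": 140, "Ab": 105, "aB": 110}
n = sum(observed.values())                    # 500 meiotic products
expected = n / len(observed)                  # 125 per class under H0

# Chi-squared statistic: sum of (N - E)^2 / E over all classes
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1                        # Eq. (36.2): 4 - 1 = 3

print(chi2)                                   # 10.0, as in Table 36.1
# Compare with the critical value for df = 3 at p = 0.05 (Table 36.2):
print(chi2 > 7.815)                           # True -> reject H0, assume linkage
```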

Table 36.2 Values of the χ² distribution. The row for df = 3 contains the values that apply to the example presented in the main text.

df    p = 0.995   0.975   0.900   0.500    0.100    0.050    0.025    0.010    0.005
 1        0.000   0.000   0.016   0.455    2.706    3.841    5.024    6.635    7.879
 2        0.010   0.051   0.211   1.386    4.605    5.991    7.378    9.210   10.597
 3        0.072   0.216   0.584   2.366    6.251    7.815    9.348   11.345   12.838
 4        0.207   0.484   1.064   3.357    7.779    9.488   11.143   13.277   14.860
 5        0.412   0.831   1.610   4.351    9.236   11.070   12.832   15.086   16.750
 6        0.676   1.237   2.204   5.348   10.645   12.592   14.449   16.812   18.548
 7        0.989   1.690   2.833   6.346   12.017   14.067   16.013   18.475   20.278
 8        1.344   2.180   3.490   7.344   13.362   15.507   17.535   20.090   21.955
 9        1.735   2.700   4.168   8.343   14.684   16.919   19.023   21.666   23.589
10        2.156   3.247   4.865   9.342   15.987   18.307   20.483   23.209   25.188
11        2.603   3.816   5.578  10.341   17.275   19.675   21.920   24.725   26.757
12        3.074   4.404   6.304  11.340   18.549   21.026   23.337   26.217   28.300
13        3.565   5.009   7.042  12.340   19.812   22.362   24.736   27.688   29.819
14        4.075   5.629   7.790  13.339   21.064   23.685   26.119   29.141   31.319


between two markers is carried out with the aid of computer programs. The resulting likelihoods are placed in the formula:

Z(θ) = log10 [L(θ) / L(0.5)]    (36.3)

Here L is the likelihood of a given θ. The likelihood of θ < 0.5 (linked inheritance) is compared to the likelihood of θ = 0.5 (non-linked inheritance). In linkage analysis, the result is often given as the logarithm of this quotient, also called the LOD score (Z(θ); logarithm (base 10) of the odds). Generally, loci are considered linked at a LOD score of 3 or more. At this LOD score, the likelihood that the loci are linked is 20 : 1; analogously, a LOD score of 4 indicates a probability of linkage of 200 : 1. At a LOD score of 2, the statistical basis of the experiment should be extended by investigating further pedigrees with regard to the two loci; alternatively, neighboring loci can be included. If further individuals are investigated and the LOD score decreases, it can be presumed that the markers are not linked.
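For a fully informative, phase-known data set such as the 500-meiosis example above (215 recombinants), L(θ) is proportional to θ^R (1 − θ)^(N−R), and equation 36.3 can then be evaluated directly. This is a sketch for that simple case only; real pedigree likelihoods are computed with dedicated programs:

```python
import math

def lod(recombinants, total, theta):
    """LOD score Z(theta) = log10[L(theta)/L(0.5)] for phase-known data,
    where L(theta) is proportional to theta^R * (1 - theta)^(N - R)."""
    non_rec = total - recombinants
    log_l = recombinants * math.log10(theta) + non_rec * math.log10(1 - theta)
    log_l0 = total * math.log10(0.5)
    return log_l - log_l0

z = lod(215, 500, 0.43)
print(round(z, 2))   # ~2.13
```

Interestingly, the LOD score of about 2.1 falls short of the conventional threshold of 3, even though the χ² test above was significant at p = 0.05: by the LOD criterion, more data would be needed before declaring linkage.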

36.1.4 Genetic Map of the Human Genome

A genetic map consists of polymorphic markers – RFLPs, polymorphic STSs, and SNPs/SNVs – at certain distances from each other. The primary benefit of a genetic map is that mutated loci responsible for certain diseases can be located in the genome by familial linkage analyses, usually without knowing the function of the gene. Decades ago it was already possible to combine the linkage of individual pairs of phenotypic (biochemical) markers, together with the determination of their sequential arrangement on the chromosomes, into a primitive genetic map. At present a "complete" map, covering the whole genome with a marker every 0.5–2.0 cM, is available (Figure 36.3; see also Section 36.2.3). Most markers are informative (Section 36.1.1) and clearly defined, such as polymorphic STSs (C/A repeat markers) and SNPs/SNVs. In the course of the Human Genome Project, numerous C/A repeats within the entire genome were analyzed. Markers are important for the mapping of certain loci within the genome. SNPs/SNVs, originally identified within the

Figure 36.3 Physical and genetic map of the end of the long arm of the human X chromosome. On the left, the cytogenetic banding pattern is shown as an ideogram. The positions of the genetic markers are given in centimorgans; the physical distances between the same markers are given in megabases. The starting point for both measures is located at the telomere of the short arm. Source: after Nagaraja, R. et al. (1997) Genome Res., 7, 210–222. Copyright © Cold Spring Harbor Laboratory Press. CC-BY-4.0.


Figure 36.4 Candidate genes within the genetic map of Xq28. This simplified genetic map shows some genetic markers of the region that have been used to localize certain disease genes (codes above the line). The positions of genetic defects resulting in certain diseases are shown below. MTM: myotubular myopathy; Barth: Barth syndrome; EDMD: Emery–Dreifuss muscular dystrophy; IP2: incontinentia pigmenti type 2; Happle: Happle syndrome; MAS: MASA syndrome.

HapMap Project and later by high-throughput sequencing, are now the most important tool for the identification of the genetic causes of disease.

36.1.5 Genetic Mapping of Disease Genes

Genetic predisposition: mutations in certain genes, which determine sensitivity of an individual to extra-genetic risk factors.

A genetic map of high marker density is of great value, since it allows the mapping of genetic diseases within a short time, provided a pedigree of sufficient size is available. Classical genetic mapping often took years to map a single gene. The mapping data, together with the DNA sequence information (physical map), are an important prerequisite for positional cloning (Section 36.2.4) and the isolation of the genes involved in disease (Figure 36.4).

Many very common and dangerous diseases have polygenic causes, such as cardiovascular diseases, cancer, and schizophrenia. They are referred to as multifactorial diseases, because they are caused by the interplay between genetic and epigenetic aberrations, as well as environmental influences. Genetic predisposition influences the sensitivity to extra-genetic risk factors. The investigation of individuals with an identical genetic predisposition enables the characterization of the susceptibility of an individual to extra-genetic risk factors and offers a promising approach towards preventive personalized medicine. Personalized medicine involves adjusting medications according to the genetic background of the patient.

The quality of mapping, or of the LOD scores, of multifactorial diseases is limited by the size of the pedigree and the marker density. The analysis of a possible genetic disease includes as many generations and affected relatives as possible. The DNA of all individuals is typed and linked inheritance is determined with the help of genetic markers. The linkage is analyzed via probability tests, as used for the generation of a genetic map (Section 36.1.3). Again, the quality of localization of the disease locus depends on the size of the pedigree and on how polymorphic the markers are. The locus is usually not located precisely but instead is described as lying between two known genetic markers (Figure 36.4).
Genetic mapping is often just as important to exclude genes as candidates for involvement in certain diseases as it is to include them as possible candidates. The phenotype of an affected individual is very often correlated with certain genes, such as, for example, for certain oncogenes frequently being mutated in cancer. If an oncogene is causal for a disease, no recombination is detected by linkage analysis between a marker for the disease phenotype and the candidate gene locus. This, however, does not mean that the same oncogene is relevant in all patients suffering from the same type of cancer, because different mutations, even in differing genes, may lead to the same phenotype.

36.2 Physical Mapping

Physical mapping includes all methods that describe the distance in nucleotides between two individual markers within the genome, as opposed to genetic maps, which are based on recombination frequencies between markers.

36.2.1 Restriction Mapping of Whole Genomes

One possible way to map chromosomes or a complete genome is to describe the order of the restriction fragments (top-down procedure, Figure 36.5a). After restriction mapping, certain regions can be cloned. This method was usually applied to small genomes such as Escherichia


Figure 36.5 Two techniques of physical mapping. (a) Top-down approaches start from a cytogenetic map or a restriction map of genomic DNA or of a single chromosome. Here, the restriction sites of rarely cutting restriction enzymes, which generate large DNA fragments, are shown (N, NotI; S, SplI; M, MluI). Since the genomic DNA was cloned prior to restriction digestion, "complete" maps arise from this procedure. These maps have a low resolution (>100 kb); therefore, it is not possible to directly isolate clones or genes. (b) Bottom-up methods start with the generation of a library of clones, from which clones are isolated. Overlapping clones are identified by chromosome walking and combined into contigs (continuous sets of clones). These contigs lead to a high-resolution map, which, however, in most cases does not cover a whole chromosome (partial maps). Therefore, both methods are often used in combination.

coli or viruses. Type II restriction endonucleases are used for restriction mapping. These enzymes recognize specific DNA sequences, often, but not always, consisting of 6 bp, and cut the DNA within these sequences. Enzymes that cleave rarely within the investigated genome are favored because they produce long DNA fragments after restriction digestion. The 4.5 Mb genome of Escherichia coli was separated into 21 DNA fragments with NotI, a restriction enzyme recognizing an 8 bp sequence. This allowed the sequential position of the DNA fragments within the circular genome to be determined. Known genes or markers can be assigned to certain restriction fragments by hybridization. If two markers are located on one restriction fragment, the physical distance between them cannot be greater than the length of the analyzed fragment. For this type of physical analysis, the genomic DNA is cleaved with a restriction enzyme, separated by pulsed-field gel electrophoresis (Section 27.2.3), blotted onto a membrane (Section 27.4.3), and hybridized to a cloned genomic DNA fragment.

This type of restriction mapping becomes more and more difficult with increasing complexity of the genome, because even infrequently cutting enzymes like NotI generate many restriction fragments of similar length that cannot be effectively separated and individually analyzed. In higher eukaryotes, cell hybrids allow the generation of limited restriction maps with lengths of up to several megabases (Mb). Several of these partial maps can then be integrated into complete maps. For radiation hybrid mapping, human-mouse or human-rat cell hybrids are used, each



containing a limited amount of the human genome. A portfolio of about 100 different cell hybrids, each containing a different – possibly overlapping – set of human chromosomes, allows a mapping accuracy of 1–5 Mb. Since the resolution of this mapping method is limited, pocket maps (bins) are generated instead of conventional linear maps. More detailed information is possible by detailed physical mapping (see below). Since whole genomes can now be decoded by high-throughput sequencing within a single day (Section 36.2.3), restriction maps are no longer of much importance.
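The "rare cutter" arithmetic behind the NotI example above can be checked quickly. Assuming equal, independent base frequencies (a simplification), a specific 8 bp recognition site is expected once every 4^8 ≈ 65.5 kb:

```python
genome_bp = 4.5e6          # E. coli genome size, as given in the text
site_len = 8               # NotI recognizes an 8 bp sequence (GCGGCCGC)

# Expected number of sites under the equal-frequency assumption
expected_sites = genome_bp / 4 ** site_len
print(round(expected_sites, 1))   # ~68.7 expected cuts

# Only 21 NotI fragments are actually observed: the GC-rich NotI site
# is under-represented in the genome, illustrating the limits of the
# equal-frequency assumption.
```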

36.2.2 Mapping of Recombinant Clones

For restriction mapping of whole genomes, as described in the previous section, native genomic DNA is used. Multiple copies of the entire genome are used in the analysis, such that all loci are represented at an identical copy number. Cloning the genomic DNA is required for further analysis and manipulation of individual genes or gene segments. Not all genomic regions are equally accessible to cloning, which makes it a challenge to achieve complete coverage of the genome. For example, to clone 1 Mb of genomic DNA into a cloning system that can hold up to 20 kb per clone, 500 clones are needed for a tenfold coverage (500 × 20 kb = 10 000 kb). The number of clones (N) that are necessary to clone a certain locus with a defined probability can be calculated:

N = ln(1 − p) / ln(1 − 1/n)    (36.4)

Here p is the probability that a certain locus will be contained in the genomic library. The variable n describes the relation between genome size (here: 1 Mb) and the average insert size of the cloning system (here: 20 kb). If the goal is to clone each part of the genome with a probability of at least 99%, this results in the following equation:

N = ln(1 − 0.99) / ln(1 − 1/(1000/20)) = 228    (36.5)
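As a numerical check, equation 36.4 can be evaluated with the example's values (a sketch; `library_size` and its arguments are names invented here):

```python
import math

def library_size(p, genome_kb, insert_kb):
    """Number of clones N needed so that a given locus is represented
    with probability p (Eq. 36.4); n = genome size / insert size."""
    n = genome_kb / insert_kb
    return math.ceil(math.log(1 - p) / math.log(1 - 1 / n))

print(library_size(0.99, 1000, 20))    # 228, as in Eq. (36.5)
# Doubling the insert size roughly halves the required library:
print(library_size(0.99, 1000, 40))    # 113
```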

Therefore, 228 clones are necessary to clone each locus of the genome with a probability of 99%. The equation also shows that the number of clones to be analyzed depends on three parameters: the target probability, the genome size, and the average insert size of the cloning system (the DNA vector used for cloning). However, the cloning system is the only real variable, because the genome size of the organism is constant, and the quality of the library correlates with the probability that a given clone can be isolated from it. Moreover, the calculation described above is based on the assumption that all regions can be cloned with the same efficiency, which has not been observed for any organism analyzed so far. Therefore, in practice, it is necessary to clone much higher numbers than calculated.

Cloning Systems for Genomic Libraries

For the generation of genomic libraries, different cloning systems are in use. The cloning vectors differ in capacity (maximum insert size that can be cloned), reproduction of cloned DNA (copy number per cell), stability of insert DNA (recombination rate), and manageability (DNA isolation, accessibility for DNA sequencing). A summary of the characteristics of the cloning systems in use is shown in Table 36.3.

Aside from yeast artificial chromosomes (YACs), all DNA vectors use prokaryotic hosts (usually Escherichia coli). Decisive for the size (clone number) of the library is the cloning capacity of the DNA vector, since the number of clones representing the whole genome decreases with increasing insert size. This may ease genome analysis; however, the quality of clones is also critical. In eukaryotic YAC vectors, a large number of clones undergo rearrangement by recombination. Depending on the library, 25–70% of the clones are affected, which is a big disadvantage for analysis. Moreover, YAC DNA has to be purified by pulsed-field gel electrophoresis (PFGE), which is very labor-intensive. In contrast, isolation of P1 DNA (artificial phage P1 chromosome) and bacterial artificial chromosomes (BACs) in high quality and quantity is much easier. The inserts in these cloning systems are very stable, because they are propagated in modified E. coli hosts with a partially inactivated recombination system.


Table 36.3 Common cloning systems.

System   Host                       Insert size (kb)   Copies per cell   DNA isolation a)   Direct sequencing   Rearrangement of inserts b)
Lambda   Escherichia coli           5–25               >250              Good               Good                Very rare
Cosmid   Escherichia coli           35–45              3–50              Excellent          Very good           Possible
P1       Escherichia coli           70–100             1–2               Very good          Good                Rare
PAC c)   Escherichia coli           70–300             1–2               Very good          Good                Rare
BAC d)   Escherichia coli           50–300             1                 Good               Good                Very rare
YAC e)   Saccharomyces cerevisiae   50–2000            1                 Difficult          Difficult           Frequent

a) Simplicity of DNA isolation and resulting purity.
b) Proportion of chimeric clones; frequency of deletions of the inserts.
c) P1 artificial chromosome.
d) Bacterial artificial chromosome.
e) Yeast artificial chromosome.

The bottom-up procedure for physical mapping starts with the generation of a genomic library. To generate a contiguous set of clones, referred to as a contig, the cloned fragments are sorted according to overlapping regions by hybridization (Figure 36.5b). Sequencing and computer analysis allow the construction of a continuous DNA sequence of the genome, provided it is represented by the genomic clones under analysis.

The shotgun strategy circumvents the sorting of clones of a genomic library. Instead, genomic DNA is fragmented by restriction enzymes or other means, cloned into plasmid vectors with small insert sizes (about 2 kb), and randomly sequenced. Computer analysis then allows the identification of overlapping clones to produce a continuous genomic sequence. Newer methods do not even require cloning, because the genomic DNA fragments are directly amplified and sequenced by high-throughput methods (Section 36.2.3 and Chapter 30).
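The identification of overlapping fragments mentioned above can be sketched with a naive greedy overlap-merge. Real assembly programs use efficient index structures and tolerate sequencing errors; this toy, with invented error-free reads, only illustrates the principle:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = overlap(reads[i], reads[j])
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:          # no overlaps left: disjoint contigs remain
            break
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads)
                 if n not in (best_i, best_j)] + [merged]
    return reads

# Three overlapping "reads" of the invented sequence ATGGCGTACGTTAG:
print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAG"]))
# ['ATGGCGTACGTTAG']
```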

36.2.3 Generation of a Physical Map

Using the clone-based strategy, the clones containing genomic DNA are ordered in sequence along the chromosomal DNA (Figure 36.5). The clones are sorted similarly to the pieces of a puzzle, except that in this case the pieces (the genomic DNA inserts of the clones) overlap. Since it is generally necessary to work with a huge number of clones, a quick and unambiguous method is necessary for mapping (see below). However, since many genomic sequences are accessible in public databases, physical maps of many genomes are readily available.

STS Mapping

In the 1980s, the concept of sequence tagged sites (STSs) was developed. These are 100–300 bp DNA sequences that are unique within the genome. The DNA pieces are amplified by PCR and used as markers for physical mapping; each PCR product describes a unique locus within the genome. In principle, every genomic locus qualifies for an STS sequence. First, the sequence is aligned with all sequences in databases to make sure that the primer sequences are not located within repetitive sequences. It does not matter if the region between the pair of primers is repetitive (Figure 36.2b). For physical mapping it is not necessary that the employed STSs are polymorphic; the amplification product only has to be unambiguously detectable by PCR. Possible STS markers are DNA segments of unknown function, polymorphic DNA markers, and genes (Section 36.1.2).

Yeast artificial chromosome (YAC) libraries, distributed as one clone per well in microtiter plates, can be used for chromosome mapping. PCR is used to determine which YAC clones can be amplified with an STS primer pair. Since YAC libraries consist of several thousand clones, and it would therefore be very labor-intensive to test each YAC individually for each STS, special methods have been developed to allow efficient screening. DNA from several clones is combined into pools, and pools are combined into superpools, over up to seven levels. By intelligent combination of the YACs it is thus possible to identify the positive YAC clone with only a few PCR reactions (Figure 36.6).
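The pooling arithmetic can be sketched as follows, using the layout of Figure 36.6 (30 720 clones in 320 plates of 8 rows × 12 columns, superpools of 8 plates). The consecutive clone numbering is an assumption made for this example:

```python
ROWS, COLS, PLATES, SP_SIZE = 8, 12, 320, 8   # layout from Figure 36.6

def screen(positive):
    """Identify clone `positive` (0..30719) via pool PCRs, counting the
    reactions; clones are assumed numbered consecutively plate by plate."""
    pcr = 0
    # Round 1: 40 superpools of 8 plates each
    for s in range(PLATES // SP_SIZE):
        pcr += 1
        if s * SP_SIZE * ROWS * COLS <= positive < (s + 1) * SP_SIZE * ROWS * COLS:
            superpool = s
    # Round 2: the 8 plate pools of the positive superpool
    for p in range(superpool * SP_SIZE, (superpool + 1) * SP_SIZE):
        pcr += 1
        if p * ROWS * COLS <= positive < (p + 1) * ROWS * COLS:
            plate = p
    # Round 3: 8 row pools and 12 column pools of the positive plate
    well = positive - plate * ROWS * COLS
    for r in range(ROWS):
        pcr += 1
        if well // COLS == r:
            row = r
    for c in range(COLS):
        pcr += 1
        if well % COLS == c:
            col = c
    return (plate, row, col), pcr

print(screen(12345))   # ((128, 4, 9), 68): 68 PCRs instead of 30 720
```

Every clone is found with 40 + 8 + 8 + 12 = 68 reactions, regardless of its position, which is the whole point of the hierarchical pools.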



Figure 36.6 STS screening. Here, a YAC library of the human genome is shown, consisting of 30 720 clones stored in 320 microtiter plates (96 clones per plate). Pool creation: in the three-step pool system shown in this figure, DNA samples from all clones are isolated in a central laboratory. The DNA of each clone is mixed with DNA from other clones of the library in different combinations (pools). The superpools contain DNA from eight microtiter plates (8 × 96 clones); in our example, this results in 40 superpools. Microtiter-plate pools contain the DNA of all clones from a single plate (here: 320 microtiter-plate pools). The row and column pools contain the DNA of the clones of one row (12 clones) or one column (8 clones), respectively. Hierarchical screening: in the first step, the 40 superpools serve as templates for the STS-PCR. By PCR and gel electrophoresis, the positive superpool is identified, followed by STS-PCR of the eight microtiter-plate pools that make up the positive superpool, in order to identify the microtiter plate of the positive clone. In the third round of screening, the twelve row pools and eight column pools of the positive microtiter plate are used for amplification to pinpoint the clone responsible for the positive response.

Mapping by Hybridization to Cloned Libraries

Physical mapping by hybridization of cloned genomic DNA is, in most cases, based on reference libraries. The principle is shown in Figure 36.7. Reference libraries are genomic libraries whose clones are sorted into microtiter plates. Each clone has an individual physical address and is therefore individually accessible. The clones are densely spotted onto nylon membranes in an ordered and reproducible pattern. The "master filters," produced by only one or a few reference laboratories, can then be provided to individual working groups worldwide. This allows mapping experiments to be carried out in different laboratories with the same primary material, which makes it easier to compare and integrate the data. Moreover, information about genome libraries has been collected in databases.

The starting point of such an experiment is the hybridization of a gene probe to nylon membranes containing genomic clones. With this screening approach it is possible to identify a set of clones representing a whole genome. The approach is also applicable to cross-hybridizations between evolutionarily conserved genomes (e.g., human and mouse). Oligonucleotide fingerprinting uses a series of short, 8–12 bp long oligonucleotides that are hybridized to the spotted membranes in numerous cycles. With this approach it is possible to assign a clone to a specific group of clones and eventually to a single clone, which is then added to a library.

Hybridization of libraries and mapping with the help of STSs are two techniques based on complementary principles for physically mapping a genome with recombinant clones. The first approach uses the clones themselves as references, whereas the second approach is based on the known information about STSs.
In many cases, both methods have been used in combination, that is, reference clones were partially sequenced to generate STS markers, and the resulting STS amplification products were employed as probes for the next round of hybridization with the reference libraries. The clones of many BAC and PAC libraries have been sequenced and are publicly available. Since the sequences have been localized on the chromosomes by the minimal tiling path approach, the specific application of genomic clones (e.g., FISH (fluorescence in situ hybridization), BAC arrays; see below) is possible without the need for an elaborate screening to isolate the clones.

Mapping by High-Throughput Sequencing

Clone-based mapping procedures are being replaced by direct sequencing without the need to clone the genomic DNA. This is made possible by next-generation sequencing. Most of these high-throughput methods start with a library


Figure 36.7 Reference system. The libraries used for mapping and gene isolation are provided by a central laboratory. The clones of the library are stored in the wells of microtiter plates. Clones can be transferred reproducibly and at high density onto nylon membranes using robots. These spotted membranes can be used by cooperating laboratories for hybridization. The coordinates of the hybridization signals can then be sent back to the central laboratory, which then sends the positive clones to the cooperating laboratory. Since the experimental results can be stored in public databases, laboratories all over the world have access, allowing for less redundancy of experiments. This practice has not only been successful for human material, but also for model organisms like mouse, rat, Drosophila, and yeast.

generation step, in which genomic DNA is fragmented by ultrasound or nebulization, ligated to standard oligonucleotide primers, and amplified by highly parallelized PCR in oil–water emulsion, or starts from immobilized DNA templates in array format. Each amplification product can then be individually sequenced in parallel. The sequences, with lengths of up to several hundred bases, are aligned by computerized algorithms to generate a complete genomic sequence. The alignment takes advantage of many sequences of different organisms already available in databases. Alternative approaches are sequencing by ligation or single-molecule sequencing.

36.2.4 Identification and Isolation of Genes

Since the invention of recombinant DNA technology, several thousand genes involved in disease have been isolated and analyzed. Typically, genes have been identified that lead to a certain disease due to a specific biochemical defect. At first, known protein sequences, or protein sequences derived from nucleic acid sequences, contributed to the identification of such genes. In some cases, antibodies against particular proteins allowed the isolation of a gene from a protein expression library. For some transforming oncogenes, the function could be used directly to identify the gene. All these techniques are summarized as functional cloning, since the gene product or the gene function serves as the basis for the isolation of the gene. The phenotype of a disease can, however, only very rarely be connected to a single protein or protein function. Hence, other strategies are necessary to isolate genomic regions responsible for a certain phenotype. Technologies have been developed that use the position of the gene as the basis for gene isolation. The specified gene locus is the key information and the isolation of


Part V: Functional and Systems Analytics

Positional cloning (reverse genetics) The cloning of a gene transcript of known position within the genome for use in functional studies.

the transcript is based on exact mapping within the genome. Originally, this approach was called reverse genetics; today the name positional cloning is more common. The latter more precisely describes the method, which starts with the position of the gene within the genome and leads to the description of the protein via the transcript. Other methods used to determine the possible functions of genes are described below.

Prediction of Gene Characteristics – Candidate Genes By exact investigation and monitoring of patients, it has in some cases been possible to describe the gene defects responsible for a certain disease. These predictions concern the potential functions of the affected gene, which might be directly linked to the disease. Tissues or organs affected by a disease might allow conclusions about cell-type-specific gene expression. Diseases linked to developmental disorders are most likely caused by genes expressed during certain developmental stages. A hereditary disease whose severity increases from one generation to the next (anticipation) permits conclusions about the mechanism of mutation, which in turn allows predictions concerning parts of the gene sequence. If all these criteria are considered, the number of candidate genes can be narrowed, sometimes to the point where only one or a few remain. In the past, this method allowed the identification of disease-associated genes without any mapping information.

Gene Modeling Based on genomic sequences, it is possible to predict the functions of genes. In lower eukaryotes such as yeast (Saccharomyces cerevisiae), the genomic sequence is mostly collinear with the translated, expressed sequence; that is, the genes are only rarely interrupted by introns. Therefore, after decoding of the complete genome, it was possible to determine open reading frames (ORFs), which are potentially protein-encoding regions, within the yeast genome.
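The ORF determination mentioned here can be sketched in a few lines. The scanning logic (start codon to the next in-frame stop) is standard, but the function name, the toy sequence, and the minimum-length cutoff are illustrative choices; a real annotation pipeline would also scan the reverse strand and use a much larger length threshold.

```python
# Hedged sketch: naive ORF detection on one strand of a DNA string, across the
# three reading frames. Start codon ATG; stop codons TAA/TAG/TGA.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of ORFs from ATG to the next in-frame stop codon."""
    orfs = []
    for frame in range(3):
        i = frame
        while i <= len(seq) - 3:
            if seq[i:i + 3] == "ATG":
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17)]
```

For a compact genome like that of yeast, a scan of this kind (plus a length cutoff of some 100 codons) already recovers most protein-coding genes, precisely because introns are rare.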
Afterwards, the predicted gene activities were confirmed experimentally. By this approach, more than 5000 yeast ORFs could be verified. The smallest prokaryotic genome known so far is that of Mycoplasma genitalium, which contains a mere 470 genes. The prediction of gene functions is much more difficult in higher eukaryotes, because their primary transcripts consist of exons and non-coding introns. The identification of exons in genomic sequences is often difficult, because the average exon is only about 150 bp long; some exons are as short as 15 bp. Single exons can be predicted with the help of computer programs based on neural networks. The reliability of these programs increases with the number of known genes that can be used as a training set. Therefore, predictions based on genomic sequences improve with new genomic sequencing data. However, it is still difficult to recognize complete coding gene sequences from the genomic sequence alone, since computer analysis often misses exons or predicts them incorrectly. Moreover, terminal non-coding regions might not be recognized by the computer programs. Therefore, it is also important to identify exons experimentally, using, for example, exon trapping. Here, certain chromosomal regions are cloned into a DNA vector that contains a “minigene” consisting of two exons and the associated splice acceptor (SA) and splice donor (SD) sites. If a chromosomal region containing complete exons and introns together with SA and SD sites is integrated into the vector, the splice pattern of the transfected cells changes, which can be detected by PCR.

Positional Candidate Genes This approach to identifying disease genes combines the strengths of the determination of gene characteristics with the exponentially increasing sequencing data for regions with no known function.
In the positional candidate approach, all available data about a genetic disease is used to learn as much as possible about the candidate gene. These gene predictions are compared with the genetic mapping of the disease gene. This means that all genes located in the genetic locus responsible for the disease are examined for suitable characteristics. The idea is that the characteristics of a disease mapped to a certain part of the genome are connected to genes located in this region (Figure 36.8). Candidate genes can then be investigated in patients for the presence of corresponding mutations. The success rate of this approach grows with the amount of sequence information available, which is steadily increasing. It is not enough to detect the expression of ORFs in order to recognize defects in gene expression. The sequences responsible for gene regulation (e.g., promoters, enhancers,

36 Physical and Genetic Mapping of Genomes

silencers) also have to be identified. The sequences themselves, as well as their localization, are very heterogeneous. Therefore, their identification based on the genomic sequence alone is difficult. One promising experimental approach is ChIP-chip (chromatin immunoprecipitation combined with chip analysis). Chromatin fragments, consisting of genomic DNA as well as proteins and RNAs bound to this DNA, are immunoprecipitated with antibodies against, for example, transcription factors. Afterwards, the DNA part of the precipitated complex is isolated and hybridized to a DNA microarray covering the whole genome (tiling array). The hybridization signal then corresponds to the DNA sequences bound by the respective transcription factor. Alternatively, the DNA sequences can be identified by high-throughput sequencing (ChIP-seq).


Figure 36.8 Positional candidate gene approach. A disease that has been mapped to a specific region of a chromosome can be assigned to a list of genes that have been physically localized to this chromosomal domain. The characteristics of these candidate genes can be compared to the known properties of the disease. A direct correlation of disease and gene is sometimes possible. Potential biochemical defects of a disease can be compared to protein domains encoded by the gene (Kallmann syndrome). The affected tissues or developmental stages can be correlated with the expression pattern of certain genes (X-linked agammaglobulinemia). If symptoms increase from one generation to the next (anticipation), they are often correlated with unstable DNA sequences (myotonic dystrophy). Finally, knowledge about an analogous disease in model animals can serve as the basis for a search in the human genome. Regions in which genes are represented in the same order in different organisms are called syntenic (homologous) regions. Source: figure modified after Bick, D.P. and Ballabio, A. (1993) Am. J. Neuroradiol., 14, 852–854.

36.2.5 Transcription Maps of the Human Genome

Only about 1% of the human genome is transcribed into mRNA. Starting in the 1990s, the number of transcription maps showing expressed genes localized to the genome has steadily increased. Physical mapping of expressed sequence tags (ESTs) has been important in this respect. Originally, mainly specific transcription maps covering only a few hundred kilobases up to several megabases were created. Now transcription mapping is performed genome-wide. Sequence comparison is used to map cDNA sequences to the genomic sequence. However, isolation of cDNAs was not always successful for many ESTs, which are expressed only under certain physiological conditions, at low copy number, or with a short half-life. Since the advent of high-throughput methods, transcripts are normally detected directly by hybridization to DNA microarrays or by high-throughput sequencing. For hybridization experiments, mRNA isolated from the tissue to be analyzed is reverse transcribed into cDNA by reverse transcriptase and converted into a labeled hybridization probe. The microarray to which this probe is hybridized contains oligonucleotides representing all possible exon sequences that could potentially be expressed in the respective tissue of the investigated organism. The hybridization signals then indicate which genes are expressed and at what copy number. Since it is possible to immobilize millions of oligonucleotides on a single array, whole chromosomes or even whole genomes can be displayed (tiling array), allowing detection of transcription activity in general. Hybridization experiments on these arrays showed that the number of transcripts is much higher than previously expected. Many short and long RNAs have been identified that are involved in gene regulation and enzymatic activities, but also many RNAs with so far unknown functions.
The ENCODE project even determined that every sequence in the genome is transcribed in at least one direction at some time during an organism's life cycle. Alternatively, expression is determined by direct sequencing of cDNAs. High-throughput sequencing technologies allow the characterization of a complete transcriptome in a single experiment. Again, mRNA is reverse transcribed into cDNA and ligated to standard linkers to

Analysis of Promoter Strength and Active RNA Synthesis, Chapter 33 Protein–Nucleic Acid Interactions, Chapter 32


prepare the DNA fragments for sequencing in a highly parallelized approach. The number of reads of a transcript serves as a measure for the copy number of the respective mRNA.
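The read-counting idea can be made concrete with a small normalization sketch. Counts per million is one common convention for making read counts comparable between libraries of different size; the gene names and counts below are purely hypothetical.

```python
# Sketch of turning raw sequencing read counts into a comparable expression
# measure (counts per million, CPM). Gene names and counts are illustrative only.

def counts_per_million(counts):
    """Normalize raw read counts by library size, scaled to one million reads."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

raw = {"geneA": 150, "geneB": 600, "geneC": 250}  # hypothetical read counts
cpm = counts_per_million(raw)
print(cpm["geneB"])  # 600000.0  (600 of 1000 reads)
```

More refined measures in actual use additionally divide by transcript length, since longer mRNAs yield proportionally more fragments and therefore more reads.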

36.2.6 Genes and Hereditary Disease – Search for Mutations

Quantitative trait loci (QTLs) A group of variant loci within the genome that contribute to the occurrence of a certain disease.

Methods of Fluorescence-labeled DNA Hybridization, Section 35.1

Gel Electrophoresis of DNA, Section 27.2.1

Genotyping, Section 37.2.1

The number of genes potentially responsible for a certain disease can be limited by the candidate gene approach (Section 36.2.4). However, the functional proof that a certain mutation is responsible for a disease phenotype is still a challenge. In addition, in many cases a disease cannot be traced back to a predisposition in only one gene. Instead, very often mutations at several genetic loci contribute to the characteristics of a certain disease (quantitative trait loci, QTLs). Several approaches can help to examine the effects of mutated genetic loci. In the simplest case, when patient data give evidence that exactly one gene is deleted and that this deletion is responsible for the disease, the genomic region affected by deletions or translocations, respectively, can be narrowed by cytogenetic methods (FISH, fluorescence in situ hybridization, for example, with BAC clones as hybridization probes). FISH analysis, however, only allows mapping of very large deletions. Another classical approach to detect smaller deletions is the hybridization of labeled chromosomal DNA to BAC arrays, containing BAC clones that represent known chromosomal regions (Sections 36.2.2 and 36.2.3) and were spotted onto glass slides. If the hybridization signal is weaker than in the control hybridization, this points to a deletion within this chromosomal region. Normally, however, several defects are involved, the affected sequences are very short (point mutations, small translocations, deletions, or amplifications), and these deviations have to be compared between many patients. To get an overview of these aberrations, several methods are available:

1. For the detection of single-strand conformation polymorphisms (SSCPs), short genomic sequences (150–300 bp) are amplified by PCR. By denaturing high-performance liquid chromatography (dHPLC), PCR fragments differing from the wild-type sequence can be detected.
This type of experiment was often performed with large patient cohorts to determine LOH loci (loss of heterozygosity loci).

2. Exon sequencing: Exons of selected genes are PCR amplified from a patient's DNA with the aid of flanking primers. The PCR products are sequenced and the sequence is compared to the wild-type version. This allows a specific search for altered DNA sequences that lead to changes in the protein sequence and potentially impair its function. Both dHPLC and exon sequencing have largely been replaced by high-throughput sequencing methods.

3. SNP/SNV analysis: By hybridization of dedicated microarrays, single nucleotide polymorphisms (SNPs)/single nucleotide variants (SNVs) and also copy number variants (CNVs) can be detected. The hybridization of millions of SNP variants in parallel on a single array allows the determination of a patient's whole-genome genotype, and genetic variants in patient cohorts can be detected in high throughput (genome-wide association studies, GWAS). Alternatively, high-throughput sequencing is used to identify genetic variants (compare Section 36.2.3). Statistical methods are used to determine which markers are inherited together more often than would be expected after random recombination.

The goal of these approaches is to identify genetic markers that are relevant for future diagnostics, but also to identify drug targets for the treatment of complex diseases. One example of a worldwide research initiative is the International Cancer Genome Consortium (ICGC), whose project includes the sequencing and comparison of thousands of cancer genomes. Many other diseases are under investigation to improve treatments for patients suffering from complex diseases.
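The statistical comparison of marker frequencies mentioned above can be illustrated with the simplest case: a chi-square statistic on a 2×2 table of allele counts in cases versus controls at a single SNP. The counts are invented, and a real GWAS must additionally correct the significance threshold for the millions of SNPs tested in parallel.

```python
# Hedged sketch of the core single-SNP association test: a chi-square statistic
# for the 2x2 table of allele counts in cases and controls (toy numbers).

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# allele counts: cases carry the risk allele 60/100 times, controls 40/100 times
stat = chi_square_2x2(60, 40, 40, 60)
print(round(stat, 2))  # 8.0 - above 3.84, the p < 0.05 cutoff for 1 d.f.
```

The per-SNP p < 0.05 cutoff shown here is only for illustration; genome-wide studies demand far smaller p-values before a marker is accepted as associated.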

36.3 Integration of Genome Maps

To get the most out of mapping approaches and functional analyses, all sorts of databases all over the world have been made available and linked. The combination of complex data allows new conclusions to be drawn from existing data.


Figure 36.9 Integrated map of a genomic region. The different mapping approaches (a)–(g) are explained in the main text.

If genetic and physical maps are compared, it becomes obvious that both types of map are collinear in most cases, as expected. In other words, the order of the markers is identical on both types of map. However, the relative distances of the genetic maps frequently do not correlate with the absolute distances of the physical maps. Figure 36.9 shows an integrated map resulting from different mapping approaches. The following list systematically addresses the different kinds of map in the figure:

a) The cytogenetic map is shown as a pictogram of a Giemsa staining, as seen under a light microscope. The resolution of this type of map is about 5 Mb. Therefore, on this level only substantial changes to the genome can be recognized. Besides an aberrant chromosome number (e.g., human trisomy 21), these include large deletions, insertions, and translocations.

b) The genetic map portrays the correct sequential arrangement of markers along this genomic region, as confirmed by sequencing data. The recombination frequency between neighboring markers is included as well. The resolution of classical genetic maps is 0.5 cM (about 0.5 Mb) to 2 cM. The huge number of SNPs/SNVs now available allows resolution down to about 1000 bp.

c) Before high-throughput chip and sequencing technologies were available, the genetic map was often generated with the same markers as the restriction map, to be able to use equivalent reference points. This allowed a detailed correlation of both types of map. Since rarely cutting restriction enzymes only provide a resolution of several hundred kilobases, additional markers (which need not necessarily be polymorphic) were used to achieve higher resolution.

d) The clone map is based on recombinant clones. With YACs the resolution is comparable to restriction maps. With prokaryotic clones, the resolution is up to about 10 kb.

e) The classical transcription maps referred to clones that were mapped by hybridization.
In addition, these maps help to isolate genes that were previously localized to certain clones. Mapping is now also possible by direct sequencing of reverse transcribed mRNA (see above).


f) Sequencing provides resolution down to the level of individual bases. Sequencing of transcripts (e) as well as the corresponding genomic region (d) allows exact mapping of mRNA (exons), but also of introns and different regulatory regions, for example, 5´ and 3´ untranslated regions (5´ UTR, 3´ UTR) and promoter regions.

g) After sequencing of numerous individuals, different sequence variants (SNPs/SNVs) become part of an integrated map. This serves as a basis for the analysis of patients suffering from certain diseases by SSCP, direct sequencing, or microarray hybridization, to detect sequence variants that correlate significantly with a disease and therefore might help to identify diagnostic or even, after functional analysis, potential therapeutic targets.

Methylation Analysis with the Bisulfite Method, Section 31.2
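The reason the relative distances of genetic maps (b) do not match the absolute distances of physical maps is that recombination frequency is not strictly proportional to physical distance. A classical correction for multiple crossovers is the Haldane mapping function; it is one of several such functions (Kosambi's is another) and is sketched here only as an illustration.

```python
# Sketch relating an observed recombination fraction r (< 0.5) to genetic map
# distance via the Haldane mapping function, d = -(1/2) ln(1 - 2r) Morgans.

import math

def haldane_cm(recombination_fraction):
    """Convert a recombination fraction (< 0.5) to centimorgans (Haldane)."""
    return -50.0 * math.log(1.0 - 2.0 * recombination_fraction)

# for small r the map distance is ~ r * 100 cM; larger r is inflated because
# double crossovers go undetected between widely spaced markers
print(round(haldane_cm(0.01), 2))  # 1.01
print(round(haldane_cm(0.20), 2))  # 25.54
```

The rough rule of thumb used in the list above (1 cM ≈ 1 Mb in the human genome) is only an average; local recombination rates vary widely along a chromosome.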

Moreover, efforts are underway to include and present functional data in appropriate databases. Besides data about gene expression (microarray hybridization or direct sequencing of cDNAs), epigenetic data, for example, are being integrated. DNA methylation profiles can be determined with the bisulfite method, in combination with hybridization to DNA microarrays or direct sequencing. The binding of DNA sequences to modified histones can be determined by chromatin immunoprecipitation combined with microarray hybridization or sequencing (ChIP-chip, ChIP-seq). In addition, the binding sites of many other proteins (e.g., transcription factors) or RNAs (e.g., microRNAs) can be determined. These data allow the nomination of sequences responsible for the regulation of gene expression for verification by functional analysis. The identification of variants of such sequences in patient cohorts can eventually serve as an important prerequisite for the development of innovative therapies against some diseases.

36.4 The Human Genome

Functional and Systems Analytics, Chapter 33

The Human Genome Project was only a milestone on the road to the understanding of human genetics. The availability of low-cost, high-throughput sequencing technologies has led to an explosion in the amount of genomic data available in public databases, and advances in bioinformatics have made it easy to work with the resulting data. More than 10 million polymorphisms (SNPs/SNVs, insertions or deletions (indels), CNVs) – about one per 1300 bp – have been discovered so far. Low-resolution genetic maps and clone-based physical maps have been replaced by high-resolution maps at the nucleotide level. Therefore, not only many monogenic but also polygenic diseases – diseases resulting from several genetic defects – have been mapped in great detail. The combination of genetic and physical maps greatly facilitates the study and analysis of the genetic roots of human disease. Gene expression and epigenetic data can be localized to loci thought to be associated with disease and thus enable a search for biomarkers that may help to diagnose a disease early in its course or perhaps even before a disease state has come about. Information from model organisms such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Mus musculus, and their mutants (e.g., gene knock-outs), can also be consulted to narrow the search for suitable candidates. The identification of genes involved in disease can be considered in the context of functional and signaling networks, which, under the right circumstances, could lead to new therapeutic approaches to treat the disease under consideration. Thus, genetic and physical mapping of genomes has become an essential part of the understanding of human genetics.

Further Reading

Adams, M.D. et al. (1995) Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377, 3–174.
Birney, E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799–816.
Cawley, S. et al. (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116 (4), 499–509.
Daly, A.K. (2010) Genome-wide association studies in pharmacogenomics. Nat. Rev. Genet., 11, 241–246.

Fiegler, H. et al. (2003) DNA microarrays for comparative genomic hybridization based on DOP-PCR amplification of BAC and PAC clones. Genes Chromosomes Cancer, 36 (4), 361–374.
Hawkins, R.D. et al. (2010) Next-generation genomics: an integrative approach. Nat. Rev. Genet., 11, 476–486.
Hoehe, M.R., Timmermann, B., and Lehrach, H. (2003) Human inter-individual DNA sequence variation in candidate genes, drug targets, the importance of haplotypes and pharmacogenomics. Curr. Pharm. Biotechnol., 4 (6), 351–378.
Hudson, T.J. et al. (2010) International network of cancer genome projects. Nature, 464, 993–998.
Imanishi, T. et al. (2004) Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol., 2 (6), e162.
Krebs, J.E., Goldstein, E.S., and Kilpatrick, S.T. (2009) Lewin’s Genes X, Jones & Bartlett, Burlington, MA.
Lander, E.S. (2011) Initial impact of the sequencing of the human genome. Nature, 470, 187–197.
Maier, E. (1994) Application of robotic technology to automated sequence fingerprint analysis by oligonucleotide hybridisation. J. Biotechnol., 35, 191–203.
McPherson, J.D. et al. (2001) A physical map of the human genome. Nature, 409, 934–941.
Metzker, M.L. (2010) Sequencing technologies – the next generation. Nat. Rev. Genet., 11, 31–46.
Nakamura, Y. et al. (1987) Variable number of tandem repeats. Science, 235, 1616–1622.
Ross, M.T. et al. (2005) The DNA sequence of the human X chromosome. Nature, 434, 325–337.
Strachan, T. and Read, A.P. (2010) Human Molecular Genetics, Taylor & Francis.
Venter, J.C. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
White, R. et al. (1986) Construction of human genetic linkage maps I: progress and perspectives, in Cold Spring Harbor Symposia on Quantitative Biology, vol. 51, Cold Spring Harbor Laboratory Press, pp. 29–38.
Yoo, S.M., Choi, J.H., and Yoo, N.C. (2009) Applications of DNA microarray in disease diagnostics. J. Microbiol. Biotechnol., 19, 635–646.
Zhang, W., Ratain, M.J., and Dolan, M.E. (2008) The HapMap resource is providing new insights into ourselves and its application to pharmacogenomics. Bioinform. Biol. Insights, 2, 15–23.


DNA-Microarray Technology

Jörg Hoheisel
German Cancer Research Center, Head, Functional Genome Analysis & Chairman of Scientific Council, Im Neuenheimer Feld 580, 69120 Heidelberg, Germany

Studying cellular processes at a global molecular level is a prerequisite for elucidating functional mechanisms in a cell, a tissue, or an organism as a whole. DNA-microarrays are typical of this kind of research; they were the first experimental format at the level of nucleic acids that allowed various types of comprehensive analysis of many samples. The large quantities of biological information that could be gathered from these platforms were critically important for the development of biology towards a complete rather than sketchy view of life. Only a global understanding at various molecular levels may eventually allow us to comprehend in its entirety the conversion of basic genetic information into cellular function. Originally conceived as a procedure for mapping and sequencing genomic DNA, microarray technology developed and proliferated quickly to cover a large variety of biological analysis types. It was also instrumental for a conceptual change in biological research. First, the enormous amount of data that was generated made it essential to automate data processing, replacing classical, mostly manual analysis procedures. This kind of biology is based on numbers and statistics rather than on individual observations and is thereby more quantitative. In addition, the analyses documented the high degree of dynamics that is intrinsic to biological systems. Owing to the comprehensive nature of the information gained from microarrays, experiments could be designed that were not based on prior hypotheses. In many cases, this approach can actually be advantageous for the advancement of science, since unexpected results, which are bound to happen, are easier to accept in the absence of a preconceived opinion. In fact, the data obtained frequently acted as the nucleus for the formulation of novel theories, turning the usual sequence of events on its head.
Microarray analyses also demonstrated that an investigation – never mind how comprehensive it may be – at only one molecular level, for instance a study of the transcriptome alone, is insufficient for understanding a biological system. A vertical component is essential, linking various molecule classes and in particular also taking into account the interactions between them. Consequently, methods and tools were required for combining and merging different data sets, along with a time perspective to be considered. This laid the basis for the development of a theoretical biology named systems biology, in a process very similar to the development in physics, which split into a theoretical and an experimental branch more than a century ago. Now, after about 25 years of use, many DNA-microarray analyses have become routine. In fact, with the advent of next-generation sequencing there is a technology that will soon supersede microarrays in many, especially analytical, applications. However, notably, several of the current sequencing procedures actually represent array-based processes; only some more recent methods are based on other principles. For a while, DNA-microarrays will remain, but eventually they will yield to sequencing for most of the applications described below. In compensation for this loss, however, microarrays could be successful as tools for structural analyses or as a platform for molecule production and combination (really synthetic biology), for instance. In addition, there are developments in fields other than the analysis of nucleic acids, such as proteomics, or for analytical processes that deal with more than one molecule class. For some of these new applications, microarrays could be the format of choice. In this chapter, an overview is given of the current main applications of DNA-microarrays, and possible directions are discussed along which microarray technology may develop.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

37.1 RNA Analyses

37.1.1 Transcriptome Analysis

For many scientists, microarray analysis is still synonymous with expression profiling, which stands for transcript analysis (Figure 37.1). This application was the first microarray format that had a strong impact on biological research. For a while, it formed the basis for studying the functions encoded in the genes that had been deciphered by the various genome-sequencing projects. This importance has clearly diminished. Apart from the fact that RNA sequencing is becoming competitive, it has also been recognized that RNA is a molecule class that may not be an optimal analyte in the first place. First, many RNAs are molecular intermediates. At the same time, mRNA especially is very dynamic in nature and regulated by very many processes simultaneously. Therefore, it is still difficult to design a model of mRNA expression, even though very large data sets on mRNA transcript profiles are available. Moreover, most RNA molecules are rather volatile and therefore less suitable for diagnostics; microRNA could actually be an exception, having been found to be stable even in various body fluids. For the identification of therapeutic approaches, finally, RNA may actually be too far removed from the actual activity, which is mostly mediated by proteins. Particularly in terms of quantification, sequencing could be superior to microarrays. For the latter, not all experimental biases could be excluded, such as the influence of mass transport, although common minimal requirements for data quality have been defined. However, standards for data analysis, starting from basic issues like normalization, do not exist. A typical and rather common mistake, for example, is the assumption that the variation in the abundance of an RNA molecule has to be by at least a factor of two in order to be relevant. This originated from early microarray analyses by a group at Stanford University.
In a concordance analysis of their data, they concluded that a change by a factor of two or more was significant for their particular experimental results. This threshold was subsequently applied to other studies, although a concordance analysis may have yielded a different value for each of them. Another flaw in most studies is the fact that, for both sequencing and microarray analysis, the focus is mostly – if not exclusively – on changes in transcript levels, and the higher they are the better. However, the degree of variation is not necessarily informative. For example, the transcript

Figure 37.1 Schematic view of one format of a transcriptional profiling analysis. RNA is isolated from two tissue samples (e.g., tumor and normal tissue of a kind). One isolate is labeled with a red fluorescent dye, while green is used for the other sample. After mixing, the two samples are hybridized onto an array of DNA-fragments or oligonucleotides that represent the genes. The red and green molecules compete for the identical binding sites. In consequence, the color at each spot provides information about the ratio of the respective green and red molecules. If similar amounts are present, the resulting spots are yellow. The same information can be obtained with one dye and an incubation on two microarrays.

37 DNA-Microarray Technology


Figure 37.2 Analyzing RNA splicing. Genes of higher organisms frequently consist of several exons. During RNA processing, different exon combinations may be produced. For analysis, the RNA isolates from different samples are labeled with fluorescent dyes. After hybridization to a microarray that contains representatives of each exon, the signals indicate the differences in RNA splicing. As for transcript level profiling, two arrays and labeling with one dye would produce the same information.

level of actin is very strongly up-regulated in tumors, but is unlikely to be causatively involved in the transformation of a normal cell into a tumor cell. In addition, a lack of any change in transcript levels is nearly as important for the understanding of cell biology as a significant change, but is largely ignored.
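The two-color ratio measurement of Figure 37.1 and the factor-of-two convention criticized here can be expressed in a few lines. The spot intensities are hypothetical, and a real analysis would first normalize the two channels against each other before computing ratios.

```python
# Sketch of the per-spot log2 ratio behind two-color expression profiling and
# the conventional (and debatable) two-fold cutoff. Intensities are invented.

import math

def log2_ratio(red, green):
    """Log2 ratio of the two channel intensities for one spot."""
    return math.log2(red / green)

spots = {"geneA": (2000, 1000), "geneB": (1100, 1000)}  # hypothetical intensities
for gene, (r, g) in spots.items():
    m = log2_ratio(r, g)
    flagged = abs(m) >= 1.0  # the two-fold rule: |log2 ratio| >= 1
    print(gene, round(m, 2), flagged)
```

As the text points out, such a fixed cutoff discards genes like geneB wholesale, although a small but reproducible change (or the absence of any change) can be just as informative.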

37.1.2 RNA Splicing

Along with variations in transcript levels, differential splicing is another process responsible for regulating function. An unexpected but not really surprising result of genomic sequencing was that it is not so much the mere number of genes or their sequences that is responsible for the differences between organisms; rather, the molecular modulation and interpretation of the encoded information is what matters. Information about splice variations is obtained by microarrays that contain at least one binding site for each exon of a gene (Figure 37.2). However, only exons that have been recognized by prior sequence annotation can be studied this way. Even for the relatively simple genome of the yeast Saccharomyces cerevisiae, several hundred exons were not recognized right away. For the more complex genome of the fruit fly Drosophila melanogaster, this number was originally more than 2000. One could argue that, with the enormous and redundant amount of sequence information that has become available in the meantime, there should no longer be such a problem. However, this may not be true after all. In view of the fact that the number of RNA molecules encoded in the human genome is probably in the range of 130 000 – mostly not protein-encoding genes, and many of them still not confirmed – rather than the only ca. 22 000 RNA molecules annotated originally, the number of not yet appropriately annotated exons may still be rather large.

37.1.3 RNA Structure and Functionality

The folding of RNA molecules – best known from the cloverleaf structures in which transfer RNA (tRNA) molecules are usually visualized – can act as a sensitive indicator of variation in the primary RNA sequence as well as in the function of the RNA. Already during the early phases of microarray-based analyses, Edwin Southern and colleagues studied the influence of RNA structure on binding to oligonucleotide microarrays. The array-bound oligonucleotides represented, in their entirety, a sequence complementary to the complete RNA molecule. Especially for RNA molecules that either directly exhibit an activity, such as ribozymes, or act as a structural component, such as the rRNA in the ribosome, it is very likely that a change in structure goes along with a change in activity or functionality. In comparison to a mere sequence analysis, the procedure has the advantage that the structural variation is tested directly, which makes the results much more relevant.


Part V: Functional and Systems Analytics

Figure 37.3 Identification of single-nucleotide polymorphisms (SNPs). Two DNA fragments, which differ in sequence by one base only, are hybridized to a microarray of oligonucleotides that represent the sequence variations. If hybridization is specific, the two DNA fragments bind to different complementary oligonucleotides. Alongside the schematic representation, examples of real data are shown. Source: adapted and modified from Hoheisel, J.D. et al. Ann. Biol. Clin., 50, 827–829.

37.2 DNA Analyses

37.2.1 Genotyping

An initial objective during the development of microarray technology was the establishment of a method for quickly deciphering DNA sequences. While this never materialized properly in the form of a hybridization-based process, it did work out with the help of enzymatic reactions (Section 37.2.3). For the identification of individual sequence variations (single nucleotide polymorphisms, SNPs), however, microarrays were and still are employed, although sequencing is bound to take over eventually. Technically, microarray-based SNP analyses are developed to a degree that meets the requirements for clinical application. One reason is the fact that qualitative information is sufficient to call the difference. In addition, controls for an immediate internal quality assessment are readily available in the form of oligonucleotides that each contain one of the four possible nucleotides at the position in question. The actual analysis is based on the differences in duplex stability of an analyte DNA upon hybridization to probe molecules on the microarray surface that represent all four sequence variations, or 16 in the case of a dinucleotide sequence (Figure 37.3). Even more accurate results are obtained by a continuous detection of the hybridization and dissociation process between analyte DNA and probes (dynamic allele-specific hybridization). The additional information gathered from the association and dissociation curves permits an optimal discrimination between fragments with fully homologous sequences and those in which a single nucleotide (or more) is not complementary. Alternatively, an enzymatic reaction can be utilized, for example, by adding to the reaction a polymerase and dideoxynucleotides that are labeled with a base-specific fluorophore (Figure 37.4). By combining the selectivity of DNA hybridization with the specificity of a polymerase, high discrimination is achieved even for sequences that are difficult to analyze.
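The allele-calling logic can be sketched in a few lines. This is a hypothetical illustration of calling a genotype from the signals of four allele-specific probes (one per possible base at the queried position); the intensities and the heterozygosity threshold are illustrative assumptions, not parameters of any real platform.

```python
# Hypothetical sketch: genotype calling from four allele-specific probe signals.
# Intensities and the ratio threshold are invented for demonstration.

def call_snp(intensities, ratio_threshold=0.5):
    """intensities: dict base -> background-corrected signal.
    A second allele is called if its signal reaches ratio_threshold
    times the strongest signal (heterozygous); otherwise homozygous."""
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top), (second_base, second) = ranked[0], ranked[1]
    if top > 0 and second / top >= ratio_threshold:
        return "".join(sorted(top_base + second_base))  # heterozygous
    return top_base * 2  # homozygous

# Homozygous G: only the dG-specific probe hybridizes strongly.
print(call_snp({"A": 120, "C": 90, "G": 2300, "T": 110}))  # GG
# Heterozygous A/G: both complementary probes give comparable signals.
print(call_snp({"A": 1900, "C": 80, "G": 2100, "T": 95}))  # AG
```

In practice, dynamic allele-specific hybridization adds the association and dissociation curves on top of such endpoint intensities, which is what sharpens the discrimination.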
Genotyping is frequently used for the identification of microorganisms in diverse areas, such as the health system, food quality control, or wastewater treatment plants. In biomedicine, the creation of a high-resolution map of the human genome is basically complete; microarrays with up to 10 million SNPs are used to define the position of genes that are associated with particular diseases.

37 DNA-Microarray Technology

Figure 37.4 SNP analysis by means of a polymerase extension reaction. (a) In one setting, an array-bound oligonucleotide primer is used that reaches to the sequence position just in front of the nucleotide in question. Upon addition of the four labeled dideoxynucleotides, only the complementary nucleotide is incorporated, so that the color indicates the sequence. Since the 3′-hydroxyl group is missing, only one nucleotide can be incorporated per DNA strand. In (b), typical results are shown. (c) Alternatively, the oligonucleotide primer includes the position that is being queried. Only one labeled deoxynucleotide is required, but four primers; for simplicity, only two primers are shown here. Only the fully complementary primer will be extended by the polymerase, thus creating a signal at the respective array position.

37.2.2 Methylation Studies

About 4% of all cytosines in the human genome – the ones that are part of a d(CG) dinucleotide sequence, frequently called the CpG dimer in consideration of the two consecutive bases and the phosphate group that lies in between – exhibit variation in methylation, which is regulated by enzymes adding or removing a methyl group at either strand of the double helix. The resulting epigenetic patterns represent one of many dynamic features of DNA, which is frequently but wrongly assumed to be of a rather static nature. The binding or progression of DNA-binding proteins is influenced by the degree and position of DNA methylation, thus affecting transcription, for example. Variations in the promoter regions of tumor-associated genes, for instance, play a major role in cellular transformation. Since methylation is a rather stable DNA modification, it is attractive as a biomarker. Currently, microarray analyses are still the method of choice for studying methylation at a global level but, nevertheless, with single-base resolution (Figure 37.5). For practical reasons, only about 1.5% of all possible methylation sites are looked at in most studies. For the analysis, genomic DNA is treated with bisulfite. While methylated cytosine is not affected by the chemical reaction, the bisulfite converts unmethylated cytosine into uracil, which turns into thymine upon PCR amplification of the material. This dC into dT conversion is nothing other than a chemically induced SNP, which is subsequently studied as described above (Section 37.2.1). As an alternative to this approach, antibodies exist that bind specifically to methylated sequences. Incubation of DNA with such antibodies, followed by DNA cleavage and antibody precipitation, co-isolates the methylated DNA, which is then incubated on microarrays that represent many genomic regions or analyzed by DNA sequencing for identification. While microarrays are unlikely to get much beyond the analysis of a few percent of the methylation sites, albeit with high accuracy, sequencing will allow an analysis of all the ca. 30 million sites in a single experiment.
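The bisulfite read-out described above can be simulated in a few lines. This is an illustrative sketch, with an invented sequence and an invented set of methylated positions: unmethylated cytosine is converted to uracil and reads as thymine after PCR, while methylated cytosine is protected.

```python
# Illustrative simulation of bisulfite treatment followed by PCR:
# C -> U -> T unless the cytosine is methylated (protected).
# Sequence and methylated positions are invented for demonstration.

def bisulfite_pcr(seq, methylated_positions):
    """Return the read-out after bisulfite conversion and PCR amplification:
    C -> T unless the position is in methylated_positions."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

seq = "ACGTCGAC"
# Only the cytosine of the second CpG (index 4) is methylated.
print(bisulfite_pcr(seq, {4}))  # ATGTCGAT
```

The converted positions then behave exactly like chemically induced SNPs, which is why the genotyping tools of Section 37.2.1 apply directly.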

Figure 37.5 Methylation studies. Treatment with bisulfite converts unmethylated cytosine into uracil, which is turned into thymine upon PCR amplification of the DNA. Hybridization to a microarray that consists of the respective complementary sequences in the form of oligonucleotides will make the conversion, or the lack of it, apparent in the hybridization pattern produced. In the analysis shown, the DNA binds mostly to the oligonucleotides that contain a dA at the relevant position(s) and are therefore complementary to a DNA fragment that was unmethylated originally.

37.2.3 DNA Sequencing

The objective of reading DNA sequences in high throughput was a starting point for the development of microarrays. In principle, sequencing is simply an extended form of genotyping. However, all possible sequence variants and all nucleotides of a DNA fragment need to be covered. One approach to achieve this was based on the hybridization of DNA to a comprehensive library of all ca. 65 000 octamer oligonucleotides. By splitting the octamer sequence into two halves of four nucleotides each and placing an unspecific dinucleotide in between – thus using an array of decamer oligonucleotides – a read-length of about 2000 nucleotides would have been possible. The accuracy of this process, however, was never sufficient to compete successfully with the gel-based sequencing methods of the early 1990s. As for genotyping, a polymerase reaction offers an alternative. The combination of DNA binding to the arrayed oligonucleotide primers and the selectivity of extending these primers in a polymerase reaction permits high accuracy. In principle, a process that could yield long reads was already known as early as 1994, but it took more than a decade to develop the idea into a working technology. The array-bound primers are extended with nucleotides that are labeled with a base-specific dye. Only one nucleotide can be incorporated at a time as long as the fluorescence label is not cleaved off. After detection of the fluorophores at all array positions, directly indicating which bases were incorporated, the dyes are removed and another cycle of extension begins (Figure 37.6). Since this reaction takes place on very many molecules in parallel, enormous amounts of sequence data can be accumulated.
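The per-cycle read-out can be sketched as a simple base-calling loop. This is a minimal illustration, with fabricated fluorescence channel data: in each cycle one labeled nucleotide is incorporated per strand, and the strongest of the four channels names the base.

```python
# Minimal sketch of base calling in cyclic sequencing-by-synthesis.
# Channel intensities are fabricated for illustration.

def call_bases(cycles):
    """cycles: one dict per extension cycle, mapping base -> intensity."""
    return "".join(max(cycle, key=cycle.get) for cycle in cycles)

cycles = [
    {"A": 40, "C": 900, "G": 30, "T": 25},
    {"A": 870, "C": 50, "G": 45, "T": 20},
    {"A": 35, "C": 25, "G": 910, "T": 40},
]
print(call_bases(cycles))  # CAG
```

Real base callers additionally correct for phasing and cross-talk between dye channels, but the core step is this per-cycle maximum.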

Figure 37.6 High-throughput sequencing. Surface-attached DNA primers are hybridized with genomic DNA fragments. After addition of fluorescently labeled nucleotides and a polymerase, the nucleotide is incorporated that is complementary to the respective genomic fragment. Color detection allows the base to be identified. The fluorescent dye is then cleaved off, thereby creating the 3′-end required for the incorporation of the next labeled nucleotide.


In a similar approach, the progress of the polymerase reaction is monitored by detecting the pyrophosphate that is released upon incorporation of an (unlabeled) nucleotide triphosphate (pyrosequencing). In this process, however, four times as many cycles are required, since only one of the four bases can be added at a time; otherwise, it would not be known whether incorporation of dATP, dGTP, dCTP, or dTTP is responsible for the release of the pyrophosphate. Pyrosequencing was the first process that yielded a next-generation sequencing device. The main obstacle in developing the process into a high-throughput method was miniaturization; the basic reaction had already been used for a while in a microtiter plate format. Instead of a polymerase, other enzymatic reactions also allow the reading of a DNA sequence. One example is DNA ligation. A mixture of short, synthetic DNA oligonucleotides, which in combination represent all possible sequence variations of this length, and a DNA ligase are added to a primer-bound DNA template, rather than a polymerase and nucleotide triphosphates. Only one of the short oligonucleotides will fit the sequence of the DNA template next to the primer molecule exactly and will be attached by DNA ligation. Rather than base by base, as in a polymerase reaction, small pieces of DNA are added consecutively. Detection is again via molecule-specific labels.
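The cyclic dispensation scheme of pyrosequencing can be made concrete with a toy simulation. Nucleotides are dispensed one base at a time in a fixed order, and the pyrophosphate-derived light signal is proportional to the number of bases incorporated, so homopolymer runs give correspondingly stronger flashes; the template and dispensation order below are illustrative choices, not taken from a real instrument.

```python
# Toy pyrosequencing simulation: one base is dispensed at a time; the signal
# equals the number of incorporations, so homopolymers flash more strongly.

def pyrosequence(template, dispensation_order="ACGT", max_rounds=20):
    """Return (dispensed base, signal) pairs for a template strand,
    where the template is the sequence to be synthesized."""
    pos, flashes = 0, []
    for _ in range(max_rounds):
        if pos >= len(template):
            break
        for base in dispensation_order:
            run = 0
            while pos < len(template) and template[pos] == base:
                run += 1
                pos += 1
            flashes.append((base, run))
    return flashes

# The double dA at the start yields a double-strength signal.
print(pyrosequence("AACG", max_rounds=1))
# [('A', 2), ('C', 1), ('G', 1), ('T', 0)]
```

The zero-signal flash for dT illustrates why four times as many cycles are needed: each base must be offered separately so the origin of the pyrophosphate is unambiguous.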

37.2.4 Comparative Genomic Hybridization (CGH)

Analyzing the copy number of particular genomic regions, termed comparative genomic hybridization (CGH), is an important diagnostic procedure for the detection of genetic aberrations. Usually, there are two copies (alleles) of a genome in each human cell, with the exception of most of the genetic content of the X-chromosome in a male person. However, the local copy number can vary greatly and differ between people or between healthy and diseased tissues. This is analyzed by hybridizing labeled genomic DNA onto a representation of the genome or parts thereof. A signal intensity that is stronger or weaker than that obtained with a reference DNA indicates regions in which amplifications or deletions have taken place. Especially for tumors, the connection between such variations and disease has been documented and is being used as a diagnostic tool. However, such variations also occur in healthy tissues as part of normal regulation processes. Originally, the analyses were performed on metaphase chromosome spreads. A microarray version, however, simplified the handling and improved resolution. Initially, the array-bound fragments had a length of about one million base pairs each; nowadays, microarrays of short oligonucleotides are used for high-resolution scanning down to a few kilobase pairs. Analysis by DNA sequencing pushes the resolution to its ultimate limit – one base pair – and produces more quantitative results, since the frequency of a regional sequence can actually be counted.
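The sample-versus-reference comparison is usually expressed as a log2 ratio per region. The sketch below shows this calling step; the ±0.3 threshold is a common rule of thumb, used here as an assumption rather than a platform specification.

```python
import math

# Sketch of calling copy-number state from array-CGH signals via the log2
# ratio of sample to reference intensity. Threshold is an assumed rule of thumb.

def cgh_call(sample, reference, threshold=0.3):
    ratio = math.log2(sample / reference)
    if ratio > threshold:
        return "amplified"
    if ratio < -threshold:
        return "deleted"
    return "balanced"

print(cgh_call(4000, 2000))  # amplified: four copies instead of two, log2 = 1
print(cgh_call(1000, 2000))  # deleted: one of two copies lost, log2 = -1
print(cgh_call(2100, 2000))  # balanced
```

Sequencing-based CGH replaces the intensity ratio with an actual read count per region, which is why it is the more quantitative read-out.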

37.2.5 Protein–DNA Interactions

A central component in the regulation of transcription is the binding of transcription factors to the promoter regions of genes. However, many other interactions of proteins with DNA are also important for the functioning of a cell. The position of this kind of interaction is frequently determined by chromatin immunoprecipitation and a subsequent analysis of the isolated DNA on genome-representing microarrays (ChIP-on-chip; Figure 37.7) or by DNA sequencing. To this end, all DNA-bound proteins are chemically crosslinked to the DNA. An antibody is used to isolate (precipitate) the protein in question. Because of the crosslink, the protein-bound DNA fragment is co-precipitated. After removal of the protein by an enzymatic digestion, the DNA can be analyzed. For this kind of analysis, both the specificity of the protein–DNA interaction and the selectivity of the antibody are critical. The procedure was instrumental in obtaining functional information about the cellular regulation of transcription in yeast, for example. Even in stationary cells, the entire protein machinery needed for transcription was found to be present, but in an inactive state. RNA polymerase II sits in the promoter regions of several hundred genes that are critically important for a quick response of the yeast cells to a change in environmental conditions, such as when nutrients become available. However, microarray analyses also provide quantitative information about the specificity and intensity of protein–DNA interactions. In fact, this aspect will be an important application of microarrays in the future, while the mere determination of binding sites will be done by sequencing.


Figure 37.7 ChIP-on-chip analysis. Protein–DNA complexes are crosslinked in a chemical reaction. A particular protein is then isolated by means of an antibody. The co-isolated DNA fragment is released by digesting the protein enzymatically. After labeling with a fluorescent dye and hybridization to an array that represents the genome or a portion thereof, the location of the DNA and thus the binding position of the protein is identified.

Most sequence-specific proteins actually bind to double-stranded DNA. Therefore, processes had to be established to produce double-stranded DNA on microarray surfaces. Apart from spotting pre-produced molecules (e.g., PCR products), it is also possible to create double-stranded DNA from single-stranded oligonucleotides in situ. One option is the synthesis of long oligonucleotides that are self-complementary in sequence and fold back on themselves, forming hairpin structures. Alternatively, a short terminal self-complementary sequence can act as a primer site for a polymerase reaction. Such an approach requires the attachment of the oligonucleotide to the solid support via its 5′-end, which in turn requires a chemistry that synthesizes DNA in the 5′→3′ direction of natural enzymatic DNA synthesis – the reverse of the 3′→5′ direction of standard chemical synthesis. To analyze the binding behavior of transcription factors, for example, microarrays have been produced that contain all possible 10 bp double-strand sequences.

37.3 Molecule Synthesis

37.3.1 DNA Synthesis

Principles of the Synthesis of Oligonucleotides, Section 27.6.1

Oligonucleotide synthesis is very much automated nowadays, based on a combination of solid-support synthesis protocols and phosphoramidite chemistry. For biomedical applications, there have been two opposing tendencies during recent years. For some applications, such as the use of oligonucleotides as therapeutic agents, gram to kilogram amounts of relatively few molecules are needed. In contrast, a few picomoles or even femtomoles of an oligonucleotide are sufficient in many areas of molecular biology. By design, oligonucleotide microarrays are an ideal tool for the synthesis of many oligonucleotides of different sequences in small quantities. Systems have been established that allow the synthesis of DNA stretches which, put together, represent complete genes or even entire microbial genomes. By programmable in situ synthesis – controlled by light induction, for example – any sequence can be produced. Since the stepwise yield of chemical synthesis on microarrays is close to quantitative, relatively long fragments are within reach. In addition, both synthesis directions are possible, so either the 3′-ends or the 5′-termini of the products are well defined. In a 5′→3′ synthesis, the means exist to chemically block all truncated molecule derivatives that fall short of the desired product, so that only full-length oligonucleotides act as substrates in subsequent reactions, such as a polymerase extension. For double-strand formation, the oligonucleotides are cleaved off the microarray surface after synthesis; the eluted material finds its complementary sequence in solution or by hybridization to another microarray, on which the complementary oligonucleotides have been synthesized.
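Why the stepwise yield governs the practical length can be checked with back-of-the-envelope numbers: the fraction of full-length n-mers is roughly yield^(n−1). The coupling efficiencies below are illustrative values, not specifications of any platform.

```python
# Back-of-the-envelope illustration: full-length fraction of an n-mer is
# roughly (stepwise coupling yield) ** (n - 1). Efficiencies are illustrative.

def full_length_fraction(step_yield, length):
    return step_yield ** (length - 1)

for step_yield in (0.90, 0.99, 0.995):
    frac = full_length_fraction(step_yield, 60)
    print(f"step yield {step_yield:.3f}: {frac:.1%} full-length 60-mers")
```

At 90% per step almost nothing of a 60-mer survives full length, while near-quantitative coupling leaves a majority of full-length product, which is exactly why capping of truncated derivatives matters for downstream enzymatic reactions.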


37.3.2 RNA Production

Functional studies are critical for understanding the information that is encoded in a genomic sequence. Knockout or knockdown experiments are essential to such ends. Technology in this field has advanced greatly in recent years with the advent of methods such as the CRISPR-Cas9 system, which can target basically any genomic region. Another successful technology is based on the use of short-hairpin RNA (shRNA) constructs, which are delivered to cells via a lentiviral system and act as inhibitory RNA (RNAi) molecules, knocking down the expression of their target genes. Besides such artificial systems, more and more natural RNAi molecules are being found, which regulate the transcription of a gene or a group of genes. Both natural and artificially designed inhibitory transcripts exhibit strong differences with regard to their effectiveness; inhibition is basically never complete. In addition, it is well established that even a complete deletion of a particular gene may not necessarily result in phenotypically identifiable changes, because of compensation processes at the molecular level. Therefore, a mixture of RNAi activities may be needed to produce a phenotypic result. To work out the effect of naturally occurring RNAi molecules individually and in combination, to identify the best gene-inhibiting molecule, or to establish the optimal RNAi mixture for obtaining a particular phenotype, microarrays provide a rather simple process for the production of inhibitory transcripts. This can be cheaper and more flexible than producing a comprehensive library, of which only small portions would ever be used. Concerning the production of RNAi pools, complexity is limited to no more than a few hundred molecules; otherwise, the viscosity of the resulting solution would be too high for it to be handled. For RNA production, DNA oligonucleotides are synthesized on the microarray.
In addition to the complement of an RNAi sequence, they contain the promoter sequence of an RNA polymerase, frequently T7 RNA polymerase. The promoter is put in place as a complete fragment in a single chemical synthesis step. The parallel synthesis of the individual molecules is then performed on this common promoter fragment. As a last step, again an entire fragment is added, this time a T7 terminator adapter. By on-chip PCR, all oligonucleotides are made double-stranded. These molecules are used as templates for RNA production. Either they are first released from the microarray, eluted, and used as templates in solution, or the templates remain bound to the microarray and only the enzymatically synthesized RNA is eluted. Because of the enzymatic production process with T7 RNA polymerase, amounts of RNA can be produced that are relatively large with respect to applications in molecular biology. Since RNA production has similar yields for each individual RNA molecule – the transcripts being short and transcription being initiated from the same promoter – the complexity of a mixture and the ratio of its components are defined by the frequency with which the respective oligonucleotides are present on the array surface.
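The template layout just described – shared promoter, variable insert, shared terminator adapter – can be sketched as simple concatenation. The promoter below is the canonical T7 sequence; the terminator adapter is a made-up placeholder, as is the insert.

```python
# Sketch of the on-chip template layout: common T7 promoter + variable
# RNAi complement + common terminator adapter. Adapter sequence is a
# hypothetical placeholder, not a validated design.

T7_PROMOTER = "TAATACGACTCACTATAG"          # canonical T7 promoter
TERMINATOR_ADAPTER = "GCTAGTTATTGCTCAGCGG"  # hypothetical adapter sequence

def build_template(rnai_complement):
    """Concatenate the shared flanks around the variable insert."""
    return T7_PROMOTER + rnai_complement + TERMINATOR_ADAPTER

template = build_template("ATGCCGTAAGGC")
print(template.startswith(T7_PROMOTER), template.endswith(TERMINATOR_ADAPTER))
```

Because the two flanks are common to all templates, they can each be added as a complete fragment in a single synthesis step, while only the middle stretch is synthesized position by position in parallel.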

37.3.3 On-Chip Protein Expression

Quite a few proteins do not express well in Escherichia coli or other cellular systems. In addition, the entire process of cloning, transfer, expression, and purification is rather complex and elaborate. As an alternative, several different cell-free systems have been established for protein expression. Their combination with in situ synthesis on template DNAs that are arranged on microarrays offers an avenue to produce very many proteins in parallel and make them accessible to subsequent analyses, such as interaction studies. Already during the synthesis process, each protein is modified, for example, by adding a terminal histidine tag, so that it binds to the microarray surface immediately next to the DNA template at the location of its synthesis. Thereby, complicated isolation and purification processes are avoided. Any modification in the DNA sequence will be readily translated into the respective variation of the protein. Even actual printing of protein arrays is possible. For this, the initial microarray with the PCR products is covered with a second, empty microarray. By virtue of appropriate surface chemistries, the in vitro synthesized proteins do not bind to the DNA-microarray but to the surface of the second microarray, which thereby carries a protein copy of the initial DNA pattern. Since the DNA-microarray is not affected in any way during the protein expression, the process of protein production can be repeated several times over, producing many protein arrays from a single DNA-microarray. The technology of producing protein microarrays by in situ protein expression (Figure 37.8) has been around for more than a decade now, and several technical modifications exist; however, a number of quality measures are still to be defined and established before the technology’s true potential can be utilized.

Figure 37.8 In situ protein expression. PCR products were placed onto two microarrays, each representing about 1600 human genes. After parallel in situ transcription and translation on each individual PCR fragment, the newly synthesized proteins were detected with two antibodies, labeled red and green, that bind specifically to the amino- or carboxy-terminus, respectively. From the color at the individual microarray positions it could be concluded that about 70% of the proteins were present in full length. Meanwhile, the technology has advanced and the rate of full-length protein is now usually in the range of 95%.

37.4 Other Approaches

37.4.1 Barcode Identification

For functional studies, more and more artificial DNA molecules are introduced into cells in order to determine the changes that occur upon addition or replacement of a particular genomic segment. Because of the complexity of regulative processes, but simultaneously driven by the capabilities made possible by new technical developments, these studies are based on the addition of different DNA constructs to individual cells, in very many combinations. To keep track of this, barcode sequences are frequently utilized. Barcodes are synthetic DNA sequences that do not occur in any known organism. With common primer binding sites to the left and right, they can be isolated from the genomic DNA by PCR. The first genome-wide use of this approach was a library of knockout yeast mutants. In each mutant, a particular gene was deleted; instead, two barcode sequences were introduced that were specific for the respective mutant. The library was incubated under various growth conditions, for example, and the barcode sequences were used as an indicator of whether a particular mutant cell was growing well or poorly in comparison to all the others. Another, more recent example of such an approach is shRNA libraries for genome-wide gene knockdown (Figure 37.9). Each shRNA sequence is accompanied by a specific barcode sequence of 60 bp, which is again amplifiable with a common primer pair. The shRNA constructs are transduced into host cells with the help of lentiviruses. Each construct integrates into the genome of its host cell, from which the RNAi is then transcribed at a very high and constant level. Currently, up to 55 000 different shRNA constructs can be used simultaneously (one construct per cell). Growing the cell mixture under any sort of selection pressure will increase the number of cells for which the particular RNAi-triggered knockdown is advantageous and decrease the cell number if the knockdown is a disadvantage. The frequency of all barcodes – which is equivalent to the number of cells containing the respective construct – is determined at the beginning of the experiment and at its end. Any significant variation between the two measurements indicates an effect of knocking down a particular target gene. From the barcode sequence, it is immediately known which shRNA, and thus which gene, is involved. Until recently, this read-out was performed by hybridization of the barcode PCR products to microarrays, with the signal intensity acting as a measure of frequency. Nowadays, next-generation sequencing permits better accuracy, since the frequency of each barcode is counted directly.

Figure 37.9 Schematic representation of a genome-wide gene inhibition analysis. A mixture of shRNA constructs is transduced into cells by means of lentiviruses. The red construct represents an shRNA whose target gene is essential for cell survival. The green construct knocks down a gene that is not essential. All cells that do not contain any of the constructs are killed by addition of antibiotics. The remaining cells are then incubated under selective conditions. At the beginning and end of the incubation period, the genomic DNA of the cells is isolated, from which the barcode sequences are PCR-amplified with a common primer pair. Hybridization to an array of the barcode sequences shows that the relative number of “green” barcode sequences, and thus “green” cells, remained the same, while the “red” cells have disappeared, as their target gene was essential for survival.
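The before/after comparison of barcode frequencies can be sketched as a fold-change calculation. This is a toy version of the read-out: counts, barcode names, and the two-fold threshold are invented for demonstration.

```python
# Toy barcode read-out: compare counts (a proxy for cells per construct)
# before and after selection; a strong drop flags the knocked-down gene as
# essential under the applied conditions. All values are invented.

def barcode_effects(before, after, fold_threshold=2.0):
    """Return barcode -> 'depleted', 'enriched', or 'unchanged'."""
    calls = {}
    for barcode, n_before in before.items():
        n_after = after.get(barcode, 0)
        # Pseudocount of 1 avoids division by zero for vanished barcodes.
        fold = (n_after + 1) / (n_before + 1)
        if fold < 1 / fold_threshold:
            calls[barcode] = "depleted"
        elif fold > fold_threshold:
            calls[barcode] = "enriched"
        else:
            calls[barcode] = "unchanged"
    return calls

before = {"red": 950, "green": 1020}
after = {"red": 12, "green": 990}
print(barcode_effects(before, after))
# {'red': 'depleted', 'green': 'unchanged'}
```

With microarray detection the fold change comes from signal intensities; with sequencing it comes from directly counted reads, which is what makes the latter more accurate.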

37.4.2 A Universal Microarray Platform

Most microarray platforms are designed for one particular type of assay. This means that a specific microarray has to be produced for each form of analysis and each organism. In addition, while the physical separation of reactions is useful for analyzing complex processes, the array surface is an inhibitory factor for many assay reactions; they work better in solution than on a solid support. To circumvent all this, a common basic microarray format could be created that is used only for the data read-out, that is, for isolating and separating a mixture of molecules. This molecular mixture is responsible for the actual assay, which is performed in solution prior to the array-based analysis. To such ends, a special kind of “barcode array” is required, usually called a zip-code microarray. Zip-codes are oligonucleotides that are unique in sequence and do not cross-hybridize. In addition, they exhibit very similar thermodynamic parameters, so that hybridization works equally well on all of them. The actual assay is done with a mixture of free oligonucleotides; in addition to the sequences needed for the assay, they contain tag sequences that are complementary to the zip-codes. After the assay is performed in solution, the material is incubated on the zip-code microarray, upon which the assay molecules bind to separate array positions and can thus be read out individually. To avoid any cross-hybridization between assay molecules and zip-codes, the use of enantiomeric L-DNA – the mirror-image form of natural DNA, forming a left-handed double helix – has been suggested for the zip-code oligonucleotides and the tag sequences, while the assay oligonucleotides consist of natural D-DNA. Since there is no interaction between L-DNA and D-DNA, the two oligonucleotide groups can only hybridize to each other via the complementary L-DNA sequences.
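The sorting step of such a zip-code read-out reduces to a reverse-complement lookup. The sketch below illustrates this; all sequences are invented, and the L-/D-DNA separation described above is not modeled here.

```python
# Minimal sketch of zip-code demultiplexing: each assay oligonucleotide
# carries a tag complementary to one zip-code spot, so its array position
# follows from the reverse complement of its tag. Sequences are invented.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def assign_to_spots(assay_tags, zip_codes):
    """Map each assay molecule (identified by its tag) to the matching spot."""
    spot_of = {zc: name for name, zc in zip_codes.items()}
    return {tag: spot_of.get(reverse_complement(tag)) for tag in assay_tags}

zip_codes = {"spot_1": "AAACGTGG", "spot_2": "GATTACAC"}
tag = reverse_complement("AAACGTGG")  # a tag that hybridizes to spot_1
print(assign_to_spots([tag], zip_codes))  # {'CCACGTTT': 'spot_1'}
```

On the actual array this lookup is performed physically by hybridization; the requirement that zip-codes be unique and non-cross-hybridizing corresponds to the dictionary keys being distinct.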

37.5 New Avenues

37.5.1 Structural Analyses

The analysis of structural variations of DNA is still an underdeveloped and underestimated field, although their role in regulative processes, and thus their functional consequences, is likely to be enormous. Even the helical twist angle in a DNA double helix varies between 30° and 40° per base pair. The actual structure depends on the sequence as well as on environmental parameters, and is of central importance for the activity of many DNA-binding proteins. In addition, DNA conformations exist that are very different from the typical right-handed helical image we all have in mind when thinking about DNA. Sequences of alternating purines and pyrimidines, for example, and particularly regions that consist of repetitions of the dinucleotide d(CG) – even more so when they are methylated – can flip over into a left-handed Z-DNA conformation under physiological conditions. Intriguingly, the number, position, and length of CpG stretches in promoter regions, and concurrently their methylation state, influence the DNA structure and thereby modulate, in a reversible manner, the binding behavior of proteins that are important for transcription. To date, no direct functional involvement of Z-DNA sequences has been documented, although Z-DNA-binding proteins do exist and Z-DNA does occur in living cells. However, the CpG sequences may not actually need to be in the Z-configuration to exhibit their function, but may only alter the overall topology of longer DNA stretches to make their influence matter. DNA topology is a feature that is only partly sequence dependent. Sequence may actually act more as a means for the fine-tuning of topologically induced variation, defining where and under what conditions structures such as cruciform DNA, bent DNA, or single-strand stretches occur. Basically any sequence could switch into the Z-conformation, for example, if enough topological stress is applied.
Correspondingly, a transition of Z-DNA into the B-conformation in even a small DNA fragment will change the superhelical status of long DNA stretches. Quite a few genes are known whose transcription is directly dependent on the degree of DNA superhelicity. All this can be studied in much detail on microarrays. Attachment of both ends of a DNA fragment to the array surface allows the enzymatic introduction or removal of any number of turns, in order to create DNA pieces with structural features at a level above the double helix, so-called supercoils. It is known that DNA has the capacity to store information not only in its sequence but also in its three-dimensional structure. Array-based studies could therefore lead to new insights into DNA-based regulation processes.
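The bookkeeping behind such supercoiling experiments can be made concrete with worked numbers: for a torsionally constrained duplex the linking number satisfies Lk = Tw + Wr, and the superhelical density is σ = (Lk − Lk0)/Lk0, with Lk0 taken as length / 10.5 bp per turn for relaxed B-DNA (the textbook value).

```python
# Worked supercoiling numbers: sigma = delta_Lk / Lk0, with
# Lk0 = length / 10.5 bp per turn for relaxed B-DNA (textbook value).

def superhelical_density(length_bp, delta_lk, bp_per_turn=10.5):
    lk0 = length_bp / bp_per_turn
    return delta_lk / lk0

# Enzymatically removing 25 turns from a 5250 bp fragment (Lk0 = 500):
print(superhelical_density(5250, -25))  # -0.05
```

A density of about −0.05 to −0.06 is the order of magnitude typically cited for negatively supercoiled cellular DNA, which is the regime in which transitions such as B-to-Z flipping become energetically accessible.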

37.5.2 Beyond Nucleic Acids

Many technical aspects that were developed for the microarray analysis of nucleic acids are being adapted to studies of other molecule classes or biological systems, such as proteins, tissue sections, or even living cells. Particularly in the area of proteome analysis, microarray-based analysis formats have advanced enormously. Proteome-based diagnostics is bound to be more informative than studying nucleic acids, since proteins are usually the direct effectors in cellular processes. In addition, about 80% of the proteins assemble into multiprotein complexes, and microarrays offer a platform for analyzing the very many interactions in such complexes in a quantitative manner. Processes exist to express all human proteins – assuming one gene encodes one protein – on a single array (Section 37.3.3), and this number is bound to increase in order to also cover protein derivatives. Antibody microarrays are used for the analysis of
expression variations as well as for studies on structural changes or post-translational modifications; all this can be done simultaneously. What is missing most for protein studies is comprehensive coverage with antibodies. Although nearly 2.95 million antibodies targeting 19,154 proteins are listed in Antibodypedia (status in November 2017), a database of available antibodies, coverage is far from sufficient. Many binders lack specificity or exhibit inadequate affinities; others bind only to denatured proteins or do not recognize modifications, such as attached polysaccharides. In addition, many antibodies, including monoclonal ones, are not a resource reliable enough for long-term reproducibility of their characteristics. Recombinant systems may be the better option in the long term, also because antibody identity is confirmed by sequence. The capability of revisiting microarray spots with high accuracy, which introduces the option of performing different types of analysis one after another – for example, analyzing the bound material by mass spectrometry after an optical measurement of signal intensities – enables the combined analysis of more and more molecule classes. Expansion of this development towards ever more complex technical platforms that generate data from ever more complex biological systems, in vitro or in vivo, is a forthcoming direction of microarray technology, while many of the purely DNA-based analyses will soon be replaced by sequencing. Integration of information is already an essential part of data analysis and interpretation; in the future this will not be limited to in silico studies but will be validated and utilized on novel, integrative experimental platforms.

Further Reading

Early, Conceptual Publications

Bains, W. and Smith, G.C. (1988) A novel method for nucleic acid sequence determination. J. Theor. Biol., 135, 303–307.
Cantor, C.R., Mirzabekov, A., and Southern, E. (1992) Report on the sequencing by hybridisation workshop. Genomics, 13, 1378–1383.
Drmanac, R., Labat, I., Brukner, I., and Crkvenjakov, R. (1989) Sequencing of megabase plus DNA by hybridisation: theory of the method. Genomics, 4, 114–128.
Khrapko, K., Lysov, Y., Khorlin, A., Shick, V., Florentiev, V., and Mirzabekov, A. (1989) An oligonucleotide hybridization approach to DNA sequencing. FEBS Lett., 256, 118–122.
Maskos, U. and Southern, E.M. (1992) Oligonucleotide hybridisations on glass supports: a novel linker for oligonucleotide synthesis and hybridisation properties of oligonucleotides synthesised in situ. Nucleic Acids Res., 20, 1679–1684.
Poustka, A. et al. (1986) Molecular approaches to mammalian genetics. Cold Spring Harbor Symp. Quant. Biol., 51, 131–139.

Other Relevant Publications

Berger, M.F. and Bulyk, M.L. (2009) Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat. Protoc., 4, 393–411.
Betzen, C., Alhamdani, M.S.S., Lueong, S., Schröder, C., Stang, A., and Hoheisel, J.D. (2015) Clinical proteomics: promises, challenges and limitations of affinity arrays. Proteomics Clin. Appl., 9 (3–4), 342–347.
Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat. Genet., 29, 365–371.
Hoheisel, J.D. (2006) Microarray technology: beyond transcript profiling and genotype analysis. Nat. Rev. Genet., 7, 200–210.
Moffat, J. et al. (2006) A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell, 124, 1283–1298.
Ramachandran, N. et al. (2004) Self-assembling protein microarrays. Science, 305, 86–90.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.
Tian, J. et al. (2004) Accurate multiplex gene synthesis from programmable DNA microchips. Nature, 432, 1050–1054.


38 The Use of Oligonucleotides as Tools in Cell Biology

Erik Wade and Jens Kurreck, Berlin University of Technology, Institute of Biotechnology, Department of Applied Biochemistry, TIB 4/3-2, Gustav-Meyer-Allee 25, 13355 Berlin, Germany

Since the beginning of the post-genomic era, marked by the completion of the sequencing of the human genome, the greatest challenge in applying this wealth of knowledge has been the analysis of the function of the approximately 20 000–25 000 human genes and their relevance to human diseases. One common way to investigate the role of a gene is to block its function and analyze the resulting loss-of-function phenotype. One of the best ways to do this is by using oligonucleotides that specifically bind to the mRNA and inhibit gene expression. Since they are complementary to the mRNA, which is, by definition, the sense strand, these molecules are referred to as antisense oligonucleotides. Expressed in pharmacological terms, the target sequence in the mRNA is the specific “receptor” for the antisense oligonucleotide. After hybridization to a target RNA, antisense oligonucleotides work in two ways: they inhibit translation by sterically blocking the ribosome, and they trigger the degradation of the bound RNA by cellular RNases. In the early 1980s it was discovered, quite surprisingly at the time, that not only proteins but also oligoribonucleotides (short RNA fragments) can have catalytic activity. These RNA enzymes are called ribozymes and can also be used to specifically inhibit the expression of a gene. Similar to antisense oligonucleotides, they bind to an mRNA through specific base pairing; they have the additional ability to completely inactivate the target mRNA by cleaving it into two or more parts, without the aid of proteins. A more recent and very promising discovery is the phenomenon of RNA interference (RNAi). This involves short double-stranded RNA molecules, called small interfering RNAs (siRNAs), that direct the cleavage and degradation of an mRNA with the aid of cellular enzymes. This particularly efficient mechanism of post-transcriptional gene silencing can be used, like those described above, to specifically inactivate a gene.
Small double-stranded RNA molecules are, however, not only artificial tools useful for research purposes and for therapies; they also play a very important role as natural regulators of gene expression. These molecules, denoted microRNAs (miRNAs), are involved in many normal physiological processes and pathological events and have increasingly become a focus of recent research. The use of oligonucleotides to inhibit gene expression promises levels of specificity unknown for small molecules. Most current medicines are small-molecule compounds, which bind to proteins and, for example, inhibit a catalytic center or block a receptor. Unspecific, or at least unintended, binding to other proteins can lead to side effects. Antisense oligonucleotides, ribozymes, and siRNAs can be designed to be significantly more specific for the target molecule. There is a very high statistical probability that a sequence of 15–17 base pairs appears only once in the human genome. An antisense molecule of this length hybridizes with this single type of mRNA and inhibits its expression. The described techniques can be categorized as anti-mRNA approaches, since the oligonucleotides bind to a target mRNA through complementary base pairing in each case. The oligonucleotides, or the ribonucleases (RNases) they recruit, work like molecular scissors and inhibit the expression of genes through the degradation of the mRNA. Oligonucleotides possess
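The statistical argument for the uniqueness of a 15–17-mer can be checked with a back-of-the-envelope calculation. The sketch below (Python) assumes a random genome of about 3.2 × 10⁹ bp with equiprobable bases, which is of course a simplification of the real, non-random human genome:

```python
# Rough estimate: how long must an oligonucleotide be so that its exact
# sequence is expected to occur at most once in the human genome?
# Assumes a random sequence of ~3.2e9 bp with equiprobable bases.

GENOME_BP = 3.2e9  # approximate haploid human genome size

def expected_occurrences(n):
    """Expected number of exact matches of a random n-mer,
    counting both strands of the genome."""
    return 2 * GENOME_BP / 4**n

for n in (13, 15, 16, 17):
    print(n, f"{expected_occurrences(n):.3f}")
```

Under these assumptions the expected number of hits drops below one between 16 and 17 nucleotides, which is why oligonucleotides of 15–17 nucleotides are, with high probability, unique in the genome.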


Part V: Functional and Systems Analytics

Figure 38.1 Mechanism and location of oligonucleotides used as tools in cell biology: (1) Triplex forming oligonucleotides bind to DNA double strands and block transcription. (2) Newly transcribed pre-mRNA contains introns (single lines) between the exons (rectangles) that need to be spliced out. Antisense oligonucleotides modulate splicing. (3) In the nucleus, antisense oligonucleotides induce the cleavage of pre-mRNA or spliced mRNA by RNase H. (4) In the cytoplasm, short, double-stranded RNA molecules (siRNAs) trigger the phenomenon of RNA interference (RNAi). A target RNA is cleaved by the RNA-induced silencing complex (RISC). (5) In addition, antisense oligonucleotides inhibit translation by blocking initiation of the ribosome (shown here) or elongation. (6) Ribozymes bind to the target RNA through Watson–Crick base pairing and cleave them, without the aid of proteins. (7) Aptamers bind target structures with high affinity. They are often directed against extracellular molecules (signaling molecules, membrane proteins), but can also be used intracellularly.

another interesting property, which was ignored for a long time: they form complex secondary and tertiary structures and can interact through these structures, similar to antibodies, with other macromolecules in cells, such as proteins, carbohydrates, lipids, and nucleic acids. In the early 1990s, a combinatorial approach was developed to identify an oligonucleotide with a structure that has a high affinity for a target, for example, a particular protein. In this procedure, a very large number of oligonucleotides with different sequences are created, and those with the highest affinity for the target are selected; this selection round is repeated until an oligonucleotide with the desired properties is found. Such an oligonucleotide is called an aptamer (from Latin aptus, fitting). As summarized in Figure 38.1, oligonucleotides can be used in different ways to investigate gene function in functional genomics. They are not only of interest to basic researchers; they are also used in drug discovery to determine whether a new drug target, for example, for a new cancer treatment, does in fact play a central role in the course of the disease (target validation). Finally, oligonucleotides can be used as therapeutics.

38.1 Antisense Oligonucleotides

Rous sarcoma virus is an avian retrovirus that can trigger the growth of tumors.

In 1978 the American scientists Paul Zamecnik and Mary Stephenson first described the principle that the expression of a gene can be specifically inhibited by a relatively short, synthetic oligodeoxyribonucleotide. They synthesized a 13-nucleotide-long antisense oligodeoxyribonucleotide that was complementary to a region of the Rous sarcoma virus (RSV) genome. Addition of the oligodeoxyribonucleotide to the cell culture slowed the growth of the virus. Notably, the authors already postulated that the principle could be used therapeutically, that the RNA or DNA sequence of a gene could serve as a receptor for a pharmacologically active oligonucleotide, and that the interaction should be specific. Antisense oligonucleotides hybridize with target sequences in the single-stranded mRNA that are complementary to their own sequence. They function by different mechanisms, of which the three most important will be described in more detail:

- induction of an RNA-degrading enzyme, ribonuclease H (RNase H),
- inhibition of translation,
- changes in RNA splicing.
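All three mechanisms rely on the oligonucleotide being the reverse complement of its mRNA target site. The design step can be sketched in a few lines (Python; the 13-mer target sequence below is invented for illustration, not the actual RSV site):

```python
# Designing an antisense oligodeoxyribonucleotide is, at its core,
# taking the reverse complement of the mRNA target site (RNA -> DNA).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}

def antisense_dna(mrna_site):
    """Return the DNA oligo (5'->3') complementary to an mRNA site (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna_site.upper()))

# A made-up 13-mer target, the same length as Zamecnik's original oligo:
print(antisense_dna("AAUGGUAACAAUC"))
```

The same reverse-complement relationship holds for the binding arms of ribozymes and for each strand of an siRNA.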

38.1.1 Mechanisms of Antisense Oligonucleotides

Induction of RNase H The most important mechanism that leads to degradation of a target mRNA is the induction of the cellular endonuclease RNase H. This enzyme is expressed in almost all cells

and plays an important role in DNA replication. The relevant characteristics of RNase H are its ability to recognize DNA–RNA hybrids and to degrade the RNA portion of such hybrids. Precisely this mechanism is exploited in antisense approaches (Figure 38.2): an antisense oligodeoxyribonucleotide hybridizes with its complementary mRNA target sequence and a DNA–RNA hybrid is formed. RNase H binds to this hybrid and destroys the RNA by endonucleolytic cleavage. In vitro experiments showed that the oligonucleotide must be at least four nucleotides long for the resulting hybrid to be recognized and degraded by RNase H. Corresponding to its physiological function in DNA replication, the majority of the cellular RNase H activity is found in the nucleus. Which part of the mRNA an oligonucleotide binds to is not important for triggering RNase H degradation; however, experiments have shown that the efficiency of this induction can vary a great deal. The current view is that mRNA forms complex secondary and tertiary structures and binds proteins, so that not all segments of the RNA are equally accessible for hybridization with oligonucleotides. Since the endonuclease activity of RNase H is only activated when both the oligonucleotide and the enzyme bind to the target RNA, it is understandable why certain RNA sequences are more sensitive to oligonucleotide-induced degradation than others.


Figure 38.2 Induction of RNase H by antisense oligonucleotides. When an antisense oligodeoxyribonucleotide binds to a complementary mRNA, a DNA–RNA hybrid is formed, which is recognized by ribonuclease H. This results in the cleavage and degradation of the RNA component of the heteroduplex. The oligonucleotide can dissociate and induce the degradation of another mRNA molecule.

Inhibition of Translation The position of the target sequence in the mRNA is very important for another antisense mechanism: the inhibition of translation by blockade of the ribosome. If translation is to be inhibited by blocking the binding of the ribosome to the mRNA, the sequence of the antisense oligonucleotide must be chosen such that it hybridizes in the 5′ region of the mRNA. If the antisense oligonucleotide binds to the translated portion of the mRNA instead, movement of the ribosome along the mRNA is blocked, thereby inhibiting elongation. It has been shown, however, that a ribosome that is reading an mRNA can, under certain circumstances, displace bound antisense molecules, so that inhibition of translation initiation is usually the more efficient way to inhibit gene expression.

Changes in RNA Splicing The transformation of the information contained in a DNA sequence into a corresponding protein sequence begins with the synthesis of the messenger RNA by RNA polymerase II. The resulting primary transcript is a pre-mRNA, which contains non-coding introns in addition to the coding exons. The maturation of the mRNA requires splicing, which takes place at specific, well-conserved sequences, the splice acceptor and splice donor sites. If an antisense oligonucleotide binds to one of these sequences, the splice site is no longer recognized by the proteins and RNAs of the spliceosome and the maturation of the mRNA is inhibited. Blocking a splice site makes it possible not only to inhibit the expression of a gene but also to deliberately change splicing and thereby induce therapeutic effects. For example, the β-globin mRNA is incorrectly spliced in one form of β-thalassemia, resulting in the lack of a functional protein capable of transporting oxygen (Figure 38.3). If the incorrect splice site is blocked, it is skipped and the mRNA is spliced correctly, such that functional β-globin is produced.
Clinical trials using oligonucleotides to mask a splice site, causing the spliceosome to splice out the defective exon, are underway for Duchenne muscular dystrophy. The oligonucleotide must be composed of modified components that do not induce RNase H activity since the mRNA should not be degraded by RNase H in this case (see Section 38.1.3).

38.1.2 Triplex-Forming Oligonucleotides

An interesting alternative to the previously described antisense oligonucleotides, which act at the level of the mRNA, are oligonucleotides that bind to a DNA double strand, thus forming triple helices. This is referred to as an anti-gene strategy, since triplex-forming oligonucleotides (TFOs) act directly on gene expression by blocking transcription rather than translation. TFOs bind to DNA stretches with a long run of purines (adenine and guanine). While the two complementary strands of the DNA duplex are held together by Watson–Crick hydrogen bonding, additional bonds to the third strand are formed via Hoogsteen base pairs (Figure 38.4). In the first example in the figure, a cytosine forms hydrogen bonds with a G/C base pair; in the second example, a thymine interacts with an A/T base pair. However, Hoogsteen base pairing is not restricted to the two types shown but can also take place between other nucleotides.
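Because TFOs require such homopurine stretches, candidate target sites can be located computationally before any binding experiment. A minimal sketch (Python; the example sequence and the minimum run length of 10 nucleotides are illustrative assumptions):

```python
# Scan one strand of a DNA duplex for homopurine (A/G) runs of a
# minimum length -- candidate target sites for triplex formation.
import re

def purine_runs(seq, min_len=10):
    """Return (start, run) tuples for purine stretches of >= min_len nt."""
    pattern = r"[AG]{%d,}" % min_len
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq.upper())]

# Invented example: a purine-rich island embedded in mixed sequence.
dna = "ttccGAAGAGGAGGAGGactg"
print(purine_runs(dna, min_len=10))
```

In practice the complementary strand would be scanned as well, since the purine tract may lie on either strand of the duplex.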

Figure 38.3 Alternative splicing of the human β-globin pre-mRNA. The exons are shown as rectangles, the introns as solid lines. The dotted lines show the various splice variants. Additional splice sites are present in β-globin as a result of mutations in β-thalassemia patients (labeled 3′ and 5′). The resulting mRNA contains a piece of the intron between exons 2 and 3 (left), so that a functional protein cannot be synthesized. An antisense oligonucleotide directed against the extra 5′ splice site (thick, short line) leads to the correct splicing of the pre-mRNA (right).


Figure 38.4 (a) Sequence example for an intermolecular triple helix, which is formed by the association of an oligonucleotide with the double-stranded DNA. (b) Two examples of Hoogsteen hydrogen bonding.

38.1.3 Modifications of Oligonucleotides to Decrease their Susceptibility to Nucleases

Figure 38.5 Possible positions for modifications of nucleotides.

PNA probes Section 28.2.3

Unmodified RNA or DNA oligonucleotides are completely degraded within a few minutes or hours in biological fluids, usually before they can reach the site at which they are intended to work. Therefore, oligonucleotides must be protected from nucleolytic degradation by the incorporation of modified nucleotides. There are three basic positions where nucleotides can be modified: the bases, the sugar, or the phosphodiester bond (Figure 38.5). Although it has been shown that modification of the bases can increase nuclease resistance, this strategy plays a secondary role in current research. The following section will focus on the more common approaches involving derivatives with modified phosphodiester bonds or a modified sugar. Numerous nucleases recognize single-stranded oligonucleotides and cleave their phosphodiester bonds. An obvious approach, therefore, is to change these bonds such that they no longer act as a substrate for the nucleases, which stabilizes the oligonucleotides for a long time in serum or in cells. One of the first and most common modifications is the phosphorothioate, in which one of the two oxygen atoms that are not directly involved in the phosphodiester bond is replaced by a sulfur atom (Figure 38.6). Oligonucleotides with phosphorothioate bonds are stable for hours in human serum. They hybridize through Watson–Crick base pairing with complementary RNA molecules and induce their degradation by RNase H, like oligonucleotides with normal phosphodiester bonds. However, phosphorothioates have a few disadvantages. For example, their affinity for the target mRNA is lower than that of an unmodified DNA oligonucleotide. In addition, oligonucleotides with phosphorothioate bridges bind to a number of proteins, particularly those that interact with polyanions, such as heparin-binding proteins, and can trigger toxic side effects as a result.
On the other hand, the binding of phosphorothioates to albumin in the serum of experimental animals or patients has the advantage of increasing the retention of the oligonucleotides in the blood, from which they are otherwise rapidly excreted. Owing to the disadvantages of phosphorothioates, nucleotides are also modified at other positions. One possibility is to attach functional groups to the C2′ position of the ribose (Figure 38.6). The most common substituents are methyl or methoxyethyl groups. These modifications also confer high nuclease resistance on the oligonucleotides. The toxicity of these second-generation oligonucleotides is lower than that of phosphorothioates. In addition, the lipophilic substituents improve membrane penetration and increase the affinity for the target RNA. Over the years, nucleic acid chemists have succeeded in improving the properties of oligonucleotides by creating hundreds of DNA analogs with differing properties. A very drastic step is to completely replace the sugar backbone of the DNA. An example is peptide


Figure 38.6 Frequently used modified components for antisense oligonucleotides.

nucleic acids (PNAs). In these oligonucleotides, the ribose groups and the phosphodiester bonds are replaced by amide bonds, similar to those between amino acids (Figure 38.7). Other examples shown in Figure 38.6 demonstrate the variety in use today. The N3′-P5′ phosphoramidates replace an oxygen bridge with an amino group. In 2′-fluoro-arabino nucleic acids and morpholino-phosphoramidates, the ribose of the nucleotides is replaced by a fluorine-substituted arabinose or by a six-membered ring, respectively. Locked nucleic acids are bicyclic compounds that are conformationally locked by a methylene bridge between the C2′ and C4′ atoms of the ribose. Many of these newer modifications have been used successfully in antisense experiments. They are characterized by high nuclease resistance, high affinity for the target mRNA, and low toxicity. A serious disadvantage of most oligonucleotides constructed from monomers with ribose modifications is that they are not recognized by RNase H as a substrate and therefore

Figure 38.7 Comparison of the backbone of an oligodeoxyribonucleotide (DNA) with that of a peptide nucleic acid (PNA). With this modification, the sugar backbone is replaced by amide bonds.


Figure 38.8 Example of a gapmer. The first five nucleotides on the 5′- and 3′-ends are modified monomers (2′-O-methyl RNA or locked nucleic acids), which protect the oligonucleotide against exonucleases. The gray-shaded monomers in the center are unmodified deoxyribonucleotides or phosphorothioates, which guarantee the activation of RNase H.

cannot induce degradation of the target mRNA. A solution is to employ chimeric oligonucleotides that combine different monomers. For example, the ends of the oligonucleotide can be protected from the exonucleases that are dominant in serum by using 2′-O-methyl components. A middle piece of 5–8 deoxynucleotides guarantees the induction of RNase H. Such oligonucleotides are referred to as gapmers, due to the “gap” between the protected bases (Figure 38.8). They can additionally be protected from endonucleases by replacing the phosphodiester bonds with phosphorothioates. Interestingly, the toxicity of the phosphorothioates in this combination is significantly reduced.
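The gapmer layout described above – protected wings around an RNase H-competent DNA core – can be sketched as follows (Python; the five-nucleotide wings and the example sequence are illustrative assumptions, and lowercase letters merely mark the modified monomers):

```python
# Annotate a gapmer design: lowercase wings = modified monomers
# (e.g. 2'-O-methyl RNA or LNA), uppercase gap = DNA/phosphorothioate
# core that supports RNase H cleavage of the bound mRNA.

def gapmer(antisense_seq, wing=5):
    """Mark the wing/gap architecture of an antisense oligonucleotide."""
    if len(antisense_seq) < 2 * wing + 5:
        raise ValueError("gap must leave at least ~5 DNA monomers")
    s = antisense_seq.upper()
    return s[:wing].lower() + s[wing:-wing] + s[-wing:].lower()

# Invented 19-mer antisense sequence:
print(gapmer("GATTGTTACCATTAGCTTA"))
```

The check enforces the rule of thumb from the text that the central DNA window should comprise at least about five deoxynucleotides.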

38.1.4 Use of Antisense Oligonucleotides in Cell Culture and in Animal Models

LNA Probes, Section 28.2.4

Amplification of RNA (RT PCR), Section 29.2.3

Antisense oligonucleotides have been used successfully in many studies in cell culture and animal models. One of the most important hurdles for antisense experiments is the transport of the antisense oligonucleotide to its site of action, usually referred to as delivery. The same problem arises with other types of oligonucleotides, such as ribozymes or small interfering RNAs, as described below. Oligonucleotides are negatively charged molecules that cross hydrophobic cellular membranes inefficiently. In cell culture experiments, cationic lipids are often used as transfection reagents to get the oligonucleotides across the membranes. Remarkably, oligonucleotides are taken up to a certain degree in vivo, though the mechanism is unknown. Improving transfection efficiency in vivo is essential if oligonucleotides are ever to become broadly applicable to a variety of diseases. Recent results indicate that antisense oligonucleotides can be drastically shortened, down to 12-mers, by using high-affinity nucleotides such as LNAs. These short oligonucleotides can be used in cell culture without transfection reagents to knock down the expression of a target gene. This better mimics the conditions of in vivo applications, in which transfection reagents are likewise not used. It is extremely important to prove the specificity of an oligonucleotide in every antisense experiment. The reduction of the expression of the target gene should be shown at the mRNA level with Northern blots or quantitative RT-PCR and at the protein level with Western blots. In addition, negative controls with an independent sequence need to be tested and shown to lack activity. In this way one can be sure that the observed phenotype is caused specifically by the inhibition of gene expression by the antisense oligonucleotide. Unspecific effects can be triggered, for example, by CpG motifs, since they induce an immune response.
These effects are undesirable for most antisense applications, but in some special cases they are used intentionally, as a component of cancer therapy. Thousands of studies have described cases in which the expression of target genes has been successfully inhibited with antisense approaches, and numerous questions have been investigated this way. While many biochemical techniques are suitable for the investigation of specific questions, such as the analysis of membrane proteins or kinases, a great advantage of the antisense strategies described here is that they are universal, since all mRNAs are structurally similar, even if they code for different proteins. At least theoretically, any desired gene can be downregulated by an antisense approach. For example, antisense strategies are used to investigate the function of closely related kinases that cannot be distinguished with classical biochemical or pharmacological methods. In this way it is possible to determine which isoform is particularly relevant for tumor growth. In other studies, the role of a particular receptor in the transmission of signals in different forms of pain was investigated. A special feature of antisense oligonucleotides is that they are not only suitable for basic research; they may also be used as therapeutics. They are suitable for the therapy of any disease that is caused by the expression of a harmful gene.

38.1.5 Antisense Oligonucleotides as Therapeutics The use of oligonucleotides that are specific only for certain selected mRNAs or DNAs is appealing, since it largely avoids unspecific and unwanted effects often seen for small molecules that can inhibit molecules other than their intended targets and thereby cause

side effects. The modulation of miRNAs also holds promise for modulating entire classes of related mRNAs. For natural selection to act on miRNAs, these must act on functional groups of mRNAs involved in some phenotype. In cases where the phenotype is disease-relevant, miRNAs represent natural nodes that allow tuning of gene activity. Human clinical trials with antisense oligonucleotides began in the 1990s, after their effectiveness and safety had been shown in experiments in different animal models. The first antisense molecule approved by the US Food and Drug Administration (FDA), in 1998, was used for the treatment of cytomegalovirus-induced retinitis. This progressive, infectious inflammatory disease leads to blindness, most often in immunocompromised AIDS patients. The phosphorothioate oligonucleotide is injected directly into the eye and prevents the loss of vision, conveniently bypassing the delivery problem that arises when trying to treat other tissues. It took another 15 years before a second antisense oligonucleotide was approved by the FDA: the 2′-methoxyethyl-modified antisense oligonucleotide mipomersen is directed against apolipoprotein B and is used for the treatment of hypercholesterolemia. About a dozen antisense oligonucleotides are currently in various stages of clinical development. A few of these molecules are designed to inhibit the replication of viruses. There is a particularly great need for the development of new therapeutic approaches in this area, since many of the currently available antiviral drugs, mostly polymerase inhibitors, are rather unspecific and have strong side effects. At the same time, the sequences of many viruses are now known, so that antisense oligonucleotides can be synthesized that are specifically directed against viral RNA. Antisense oligonucleotides are used in other studies for the treatment of cancer after conventional chemotherapy has failed.
Since oncogenes promote the growth of many tumors, inhibiting their expression with antisense strategies is an obvious approach. Other oligonucleotides are being tested in studies to treat inflammatory diseases, such as ulcerative colitis, and heart disease. The first clinical trials used phosphorothioate antisense oligonucleotides; more recent human clinical trials, however, increasingly employ oligonucleotides with modifications of the second and third generations. The expectation that antisense oligonucleotides bind completely specifically to their target sequence in the mRNA without causing unspecific effects on other cellular components has not been fully borne out. Nevertheless, the side effects observed in most cases have not been severe. The bigger problem for the use of antisense oligonucleotides is their limited therapeutic efficacy, which is, at least in part, due to delivery issues.

38.2 Ribozymes

38.2.1 Discovery and Classification of Ribozymes

Until the beginning of the 1980s, it was assumed that only proteins possess enzymatic activity. Then Thomas Cech and coworkers discovered that certain RNA molecules of the protozoan Tetrahymena thermophila splice themselves. This means that even in the absence of proteins these RNAs have catalytic ribonuclease activity. Cech named ribonucleic acids with enzymatic activity ribozymes. A short time later, the group led by Sidney Altman discovered that the ubiquitous RNase P, which processes the 5′ ends of tRNAs, is a ribozyme. RNase P is a complex of RNA and protein; interestingly, the RNA is catalytically active even without the protein component. An important feature of this ribozyme is the fact that it does not process itself but instead processes an independent substrate. It emerges unchanged from the reaction and thereby fulfills the classic definition of an enzyme. Cech and Altman were awarded the Nobel Prize in Chemistry in 1989 for their revolutionary discovery. In the meantime, different types of ribozymes have become known; they can be roughly categorized into large ribozymes of a few hundred up to 3000 nucleotides and small ribozymes of 30–150 nucleotides in length. Particularly well studied are the hammerhead ribozymes, which were discovered in plant pathogens as self-cleaving RNAs. For practical use it was important to develop a variant that is capable of cleaving an independent substrate in trans. The hammerhead ribozyme shown in Figure 38.9a consists of a catalytic center of 22 nucleotides and two substrate recognition arms of seven to nine nucleotides, which bind the target RNA. The binding arms can be designed to recognize any chosen sequence. In a metal ion-dependent step, the target RNA is


Cytomegalovirus DNA virus of the family Herpesviridae, which is endemic in most human populations. It may lead to blindness in immunocompromised patients.


Figure 38.9 Structure of a hammerhead ribozyme (a) and an RNA-cleaving DNA enzyme (b). The arrows show the position of the cleavage sites.

cleaved and the two cleavage products diffuse away, so that the ribozyme is available for cleavage of further RNAs (Figure 38.10). In recent years, catalytic activities of ribozymes in other biochemical processes have been discovered. These include a central process of the cell: the peptidyl transferase reaction during protein synthesis, which is catalyzed by a ribozyme. The elucidation of the structure of the ribosome showed that there are no proteins near the catalytic center. However, it is not definitively established whether protein components might still be involved in some way in the coupling of amino acids. Ribozymes have also been found in the 5´ untranslated regions of bacterial mRNAs; these are regulated by metabolites and thus control the expression of genes that are important for metabolism. Finally, a ribozyme has been discovered that catalyzes the separation of RNA polymerase II from the mRNA during termination of transcription of the globin gene in eukaryotes. The discovery of more and more new ribozymes supports the RNA world hypothesis, which holds that before the current world dominated by DNA and proteins, life forms existed in which RNA played the central role, since RNA can both transmit heritable information and catalyze reactions.

38.2.2 Use of Ribozymes

As with the selection of antisense oligonucleotides, a few important issues arise in the selection of ribozymes. Different ribozymes have differing preferences for cleavage sites;

Figure 38.10 Catalytic cycle of ribozymes: The ribozyme binds by sequence-specific Watson–Crick base pairing to its complementary target mRNA. In the presence of metal ions, the reaction moves through a transition state, resulting in the cleavage of the substrate. The two product strands are set free and the ribozyme is available for another round of target RNA cleavage.

38

The Use of Oligonucleotides as Tools in Cell Biology


hammerhead ribozymes cleave particularly efficiently after GUC or AUC triplets. These cleavage sites must, in addition, be readily accessible to the ribozymes, just like the binding region of antisense oligonucleotides. Since ribozymes consist of RNA, they are more sensitive to nucleases than DNA oligonucleotides. In systematic studies, ribozymes were protected from degradation by the incorporation of modified nucleotides, primarily 2´-O-methyl RNA. To preserve the catalytic activity, unmodified ribonucleotides must remain in five positions. This led to the idea of creating enzymes from DNA, which is less susceptible to nucleolytic degradation in biological media than RNA. This succeeded with the help of a combinatorial approach, as described in Section 38.4 on aptamers. The resulting DNA enzyme (Figure 38.9b) cleaves target mRNAs very efficiently. It is inexpensive and easily synthesized and is more resistant to nucleases than unmodified hammerhead ribozymes. Not only can ribozymes composed of RNA be chemically synthesized and transfected into cells or injected into animals (exogenous application), they can also be transcribed directly in cells (endogenous expression). For this purpose, the DNA sequence that codes for the ribozyme is cloned into a vector, which is introduced into the target cell; the ribozymes are then synthesized continually within the cell. Ribozymes have been successfully employed in animal models since the 1990s. Like antisense oligonucleotides, they have been tested in various indications, including viral infections, cardiovascular disease, cancer, and rheumatoid arthritis. Some ribozymes have been tested in early-phase clinical trials on patients. In one such trial, blood cells were withdrawn from HIV-infected patients, transduced ex vivo with retroviral ribozyme vectors, and subsequently readministered to the patients. The intent was to make the lymphocytes resistant to HIV.
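The first step in applying such a ribozyme or DNA enzyme, scanning a target mRNA for GUC/AUC triplets and deriving the complementary binding arms, can be sketched in a few lines of Python. The example sequence and the fixed arm length are illustrative assumptions, not values from the text:

```python
# Find potential hammerhead cleavage triplets (GUC/AUC) in a target mRNA and
# derive the two complementary binding arms flanking each site.
# Arm lengths of 7-9 nt follow the range given in the text; 8 is used here.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    return "".join(COMPLEMENT[nt] for nt in reversed(rna))

def find_cleavage_sites(target: str, arm_len: int = 8):
    """Return (position, arm_binding_3'side, arm_binding_5'side) per triplet.

    The ribozyme cleaves after the triplet; the binding arms are the reverse
    complements of the target regions on either side of the cleavage site.
    """
    sites = []
    for i in range(arm_len, len(target) - arm_len - 3):
        if target[i:i + 3] in ("GUC", "AUC"):
            left = target[i - arm_len:i + 3]          # includes the triplet
            right = target[i + 3:i + 3 + arm_len]
            sites.append((i, reverse_complement(right), reverse_complement(left)))
    return sites

target_mrna = "GGAAUCCGUCAGGCUACGUCAAUGGCAUCCUG"  # invented example sequence
for pos, arm_a, arm_b in find_cleavage_sites(target_mrna):
    print(f"triplet at position {pos}: arms {arm_a} / {arm_b}")
```

Accessibility of the site in the folded mRNA, which the text emphasizes, would still have to be checked separately, for example with an RNA secondary-structure prediction tool.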
In other studies, chemically synthesized ribozymes with modified nucleotides were given to hepatitis C or cancer patients. The ribozymes were usually well tolerated in the clinical trials, but their effectiveness was inadequate. A possible explanation is the dependence of ribozyme-catalyzed reactions on metal ions: in most in vitro studies the magnesium ion concentration is significantly higher than the intracellular concentration. Presumably the minimal hammerhead motif used in most studies (as shown in Figure 38.9) lacked important peripheral regions, which are necessary for high catalytic activity of the natural ribozyme at low magnesium ion concentrations.

38.3 RNA Interference and MicroRNAs

38.3.1 Basics of RNA Interference

A particularly promising method for the specific inhibition of the expression of a gene is RNA interference (RNAi). In 1998, Andrew Fire and Craig Mello discovered that double-stranded RNA molecules can be used to silence genes sequence-specifically in the nematode Caenorhabditis elegans. The mechanism has since been elucidated in detail (Figure 38.11): the double-stranded RNA is first cleaved by an RNase called Dicer into short RNA duplexes, which are called small interfering RNAs (siRNAs). Next, the siRNA is incorporated into a protein complex, the RNA-induced silencing complex (RISC), during which one of the two strands is discarded. The other strand guides RISC to the target mRNA and hybridizes with it by conventional base pairing. RISC contains an endonuclease that cleaves the mRNA at a defined position. The protein complex containing the antisense strand of the siRNA then dissociates from the substrate and becomes available to initiate a new cleavage. The cleaved ends of the mRNA are no longer protected by a cap or poly A tail and are quickly degraded by cellular RNases. As with antisense oligonucleotides and ribozymes, synthesis of the protein is thereby inhibited. RNAi is an evolutionarily conserved mechanism whose natural function is still not completely understood. There is evidence that it serves to defend against viruses and to protect cells from mobile genetic elements such as transposons. After its discovery, RNAi was used intensively for research in model organisms such as Caenorhabditis elegans and Drosophila melanogaster. The method could not initially be used in mammalian cells, however, because double-stranded RNA triggers an interferon response, which leads to a general block of protein expression. It was therefore an important

Interferon response: Interferons are species-specific proteins with antiviral and immunomodulatory properties.


Figure 38.11 Mechanism of RNA interference.

Figure 38.12 Typical example of a small interfering RNA (siRNA).

breakthrough when Thomas Tuschl and his coworkers showed that short, 21–23-nucleotide-long siRNA molecules block gene expression sequence-specifically in mammalian cells, since the interferon response is only triggered by double-stranded RNA molecules longer than about 30 nucleotides. Figure 38.12 shows a typical siRNA: it consists of two 21-nucleotide strands, which form a 19-mer duplex. Each end usually carries two overhanging nucleotides, in most cases deoxythymidines. For the practical use of RNAi, an efficient siRNA must first be generated, which requires a sophisticated design based on particular sequence criteria, such as the GC content and the relative stability of the two ends of the duplex. In addition, the structure of the target mRNA plays a role. Most researchers assume that a good siRNA is significantly more potent than an antisense oligonucleotide or a ribozyme, meaning that a lower concentration is required to turn the target gene off. Although RNA interference is a sequence-specific and very efficient method to inhibit gene expression post-transcriptionally, it is important to remember that it can also trigger unspecific side effects. Under certain conditions siRNAs can inhibit partially homologous RNAs other than their intended targets. In addition, some siRNAs, depending on their sequence, trigger an unspecific interferon response or induce other pathways of the innate immune system, including those downstream of Toll-like receptor 3. At high concentrations, the RNAi machinery can also disturb the microRNA pathway described below, which can have toxic effects. These results show that RNAi experiments need to be carefully planned with numerous controls, the results need to be interpreted with great care, and particular caution must be exercised in therapeutic approaches.
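Two of the sequence criteria mentioned above, GC content and the relative stability of the duplex ends, can be expressed as a simple candidate filter. The sketch below uses illustrative cutoff values and an invented mRNA, not validated design rules:

```python
# Enumerate 19-mer siRNA target sites in an mRNA and apply two simple
# heuristics: moderate GC content, and an AU-rich 3' end of the sense strand
# (favoring loading of the antisense strand into RISC). The cutoff values
# here are illustrative assumptions, not validated design rules.

def gc_content(seq: str) -> float:
    return sum(nt in "GC" for nt in seq) / len(seq)

def candidate_sirna_sites(mrna: str, gc_min=0.30, gc_max=0.60):
    """Return (position, 19-mer) tuples passing both filters."""
    hits = []
    for i in range(len(mrna) - 18):
        site = mrna[i:i + 19]
        # criterion 1: overall GC content within the chosen window
        if not gc_min <= gc_content(site) <= gc_max:
            continue
        # criterion 2: the last 5 nt of the sense strand should be
        # AU-richer (less stable) than the first 5 nt
        if gc_content(site[-5:]) < gc_content(site[:5]):
            hits.append((i, site))
    return hits

mrna = "AUGGCGGCGUUCAGAUAAUUCGGAUAUCCGGCAUUAA"  # invented example
for pos, site in candidate_sirna_sites(mrna):
    print(pos, site)
```

Real design tools additionally consider target-mRNA structure and homology to other transcripts, as the text notes.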

38.3.2 RNA Interference Mediated by Expression Vectors

The double-stranded siRNAs are unexpectedly stable for RNA molecules, since they are presumably protected from nucleolytic degradation by proteins. They can be stabilized further by the incorporation of modified nucleotides (Section 38.1.3). However, even with these measures, the effects of chemically synthesized siRNAs are transient and only last a few days. A longer-lasting blockade of gene expression can be accomplished with special plasmids encoding expression cassettes for short hairpin RNAs (shRNAs) (Figure 38.13). These self-complementary RNAs are expressed intracellularly and subsequently processed to a typical siRNA by the enzyme Dicer, as described above. The shRNA is encoded by the expression vector and


Figure 38.13 Vector expression of short hairpin RNA (shRNA). The shRNA is transcribed under the control of polymerase III promoters and intracellularly processed to siRNA.

transcribed under the control of an RNA polymerase III promoter, which is suitable for the expression of short RNAs without a cap structure or poly A tail. Owing to the continuous intracellular transcription of the double-stranded RNA, its effects are relatively long-lasting. It is even possible to stably transfect eukaryotic cells with such an expression vector so that the siRNA is permanently present. In addition, the expression cassettes, consisting of a promoter and the shRNA coding sequence, can be incorporated into viral vectors; such constructs have already been used in clinical trials. In particular, oncoretroviruses, lentiviruses, adenoviruses, and adeno-associated viruses are used, which are modified to transduce a therapeutically effective gene efficiently into cells and, for safety reasons, are incapable of replicating or spreading. Viral vectors can be used for the transduction of cells for research purposes; in addition, they offer the promise of therapeutic application of RNA interference.
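The layout of such an shRNA coding sequence (sense strand, loop, antisense strand, poly(T) terminator for RNA polymerase III) can be assembled programmatically. The loop sequence and terminator length below are common published choices but should be treated as assumptions, and the 19-mer is an invented example:

```python
# Assemble the DNA coding sequence for a short hairpin RNA (shRNA):
# sense strand - loop - antisense strand - poly(T) terminator for Pol III.
# Loop sequence and terminator length are assumptions (common choices),
# not values taken from the text.

DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement_dna(seq: str) -> str:
    return "".join(DNA_COMPLEMENT[nt] for nt in reversed(seq))

def shrna_coding_sequence(target_19mer: str, loop: str = "TTCAAGAGA") -> str:
    """DNA template for an shRNA against a 19-nt target (given as DNA)."""
    sense = target_19mer
    antisense = reverse_complement_dna(target_19mer)
    terminator = "TTTTTT"  # run of T residues terminates Pol III transcription
    return sense + loop + antisense + terminator

cassette = shrna_coding_sequence("GCAAGCTGACCCTGAAGTT")  # example 19-mer
print(cassette, len(cassette))
```

The transcribed RNA folds back on itself because sense and antisense strands are complementary, producing the hairpin that Dicer processes to an siRNA.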

38.3.3 Uses of RNA Interference

RNA interference (RNAi) is one of the most important developments in molecular biology since the development of PCR. In only a few years it has advanced to become a widespread standard method that has been employed in thousands of publications. The uses of RNAi range from functional studies of individual genes through genome-wide screens to therapeutic applications. In many research projects, the function of an individual gene is studied by inhibiting its expression; here RNAi is significantly faster than antisense oligonucleotides or ribozymes. For example, genes that are suspected of playing a role in cancer can be turned off by siRNA, and the effect on cell growth can subsequently be studied in cell culture or animal models. A completely new possibility opened by RNAi is carrying out genome-wide screens, in which every gene in an organism is turned off, one at a time. In this way, all the genes involved in a cellular or pathological process can be identified. Libraries have been created for both the human genome and model organisms that consist of siRNAs or vector-encoded shRNAs against every individual gene. Such screens have been used, for example, to identify hundreds of host factors that HIV-1 requires for infection and replication. In other experiments, previously unknown proteins have been identified that are responsible for the uncontrolled proliferation of tumor cells. After the elucidation of the human genome, RNAi thus provides a method for determining the function of the encoded proteins. A further step in the characterization of a gene is its investigation in animal models. For this purpose, however, the biggest problem in the use of RNAi must be solved, namely the efficient delivery of the siRNA into the cells of the target tissue. As a result, many methods have been developed for the transfer of oligonucleotides.
For example, siRNAs can be brought into cells with positively charged lipids or nanoparticles, similar to what has been described for antisense oligonucleotides. Alternatively, lipophilic molecules such as cholesterol can be coupled directly to the siRNA to ease passage across the membrane. If an siRNA needs to be targeted specifically to one cell type, antibodies and aptamers (Section 38.4), which recognize specific markers on the cells, can guide it to those cells and facilitate entry. In addition, viral vectors can be used for RNAi approaches: as described above, shRNA expression cassettes can be introduced efficiently into cells with the help of these vectors. Through the selection of suitable capsid proteins, it is possible to optimize this process for particular target tissues. RNAi developed very rapidly into a standard method for the analysis of gene function in molecular biology laboratories. At the same time its potential as a therapeutic technology was recognized, and only a few years after the discovery of RNAi, clinical trials were begun, some of which


have already reached advanced phases. The first trials targeted diseases that can be treated by local application, such as diseases of the eye, in order to circumvent the delivery problem. However, most of these trials were terminated owing to unspecific modes of action of the siRNAs. Clinical tests have since been initiated in other fields such as cancer, virus infections, and metabolic disease. These approaches involve systemic delivery and the application of viral vectors. For example, one clinical trial investigating the potential of RNAi as a new therapeutic strategy aims at treating infections with the respiratory syncytial virus (RSV). While this virus produces mild symptoms in healthy adults, it can lead to severe, and in many cases fatal, complications in newborns and premature babies. RSV infects the respiratory tract, so the therapeutic siRNA can simply be delivered by inhalation. Tests with infected adults showed significant antiviral effects. More than two dozen clinical trials based on RNAi have now been initiated, most of them aiming at eye diseases, virus infections, and cancer. In general, the RNAi treatments were found to be well tolerated; the therapeutic benefits now have to be demonstrated in later phases of clinical development.

38.3.4 microRNAs

The natural counterparts of siRNAs are intracellularly expressed, 21–23-nucleotide-long RNA molecules called microRNAs (miRNAs). The human genome codes for almost 2000 miRNAs. miRNAs are transcribed as part of a longer pri-miRNA (Figure 38.14). While still in the nucleus, the long transcript is processed by the RNase Drosha to the roughly 70-nucleotide-long pre-miRNA, which is exported into the cytoplasm by exportin-5. There, Dicer performs the final cleavage to the mature form. The fully processed miRNA is loaded into RISC, like an siRNA. The exact mechanism of the miRNA-induced post-transcriptional repression of gene expression has not been completely elucidated, and miRNAs possibly work by more than a single mechanism. When highly complementary to the target RNA, they induce its cleavage and degradation, like siRNAs. They can work, however, even when they are only partially complementary to their target RNA; in this case, they repress translation without causing cleavage of the mRNA. In addition, miRNAs can destabilize the target mRNA by

Figure 38.14 Schematic of the microRNA pathway described in the main text.


inducing the removal of the protective poly A tail and the cap at the 5´ end. The preferred binding region for miRNAs is the 3´ untranslated region of the target mRNA. According to current results, miRNAs control the expression of more than 60% of all protein-coding genes. It is therefore not surprising that they are involved in almost every cellular and pathological process investigated so far. In an early example of the importance of miRNAs in physiological processes, it was shown that they control the expression of genes that trigger the release of insulin. miRNAs are also involved in numerous disease processes: in many types of tumor, miRNAs are deregulated, which can lead to a dysregulation of genes involved in apoptosis or control of the cell cycle. There is also a complex relationship between cellular miRNAs and viruses, which is described in more detail below. To investigate the relevance of miRNAs in a disease process, the miRNA levels of cells in diseased tissue are compared with the levels in healthy control cells. DNA arrays containing a complete collection of probes complementary to all miRNAs can be used for this purpose. Alternatively, deep sequencing can be used: by massive sequencing of the short RNAs, conclusions can be drawn about the frequency of a miRNA and the level of its expression. With these methods, some miRNAs are usually identified that are particularly strongly or weakly expressed in a disease process. These results can then be verified with different methods. A frequently used technique for this purpose is quantitative RT-PCR. In contrast to conventional quantitative RT-PCR, however, very short, 21–23-nucleotide-long RNAs have to be reverse transcribed and amplified. Therefore, reverse transcription usually employs a stem-loop primer that lengthens the fragment. Systems have been developed that allow analysis of the complete expression pattern of all known human miRNAs by this technique.
Other methods that can be used to investigate the expression level of miRNAs are Northern blots, which must also be adapted to hybridization with very short RNAs, and RNase protection assays. Surprisingly, miRNAs can even be detected in the circulation. They are therefore seen as potential biomarkers for various cellular states and diseases. For example, it has been shown that the level of miRNA-208a, detected with quantitative RT-PCR, increases after a heart attack. A corresponding test would allow the early diagnosis of myocardial infarction. Many miRNAs are expressed in a tissue-specific manner; miRNA-122, for example, is exclusively expressed in the liver. Owing to their high affinity for complementary RNA, LNA-modified probes are particularly suitable for the detection of this sort of tissue specificity by in situ hybridization. Such detection can even be carried out in a whole organism, for example, in transparent zebrafish embryos. The identification of the cellular targets of miRNAs remains a major challenge. Various internet tools exist to predict the potential target RNAs for a given miRNA; however, purely bioinformatic approaches still have a high failure rate. Genome- or proteome-wide investigations of the regulatory effects of a miRNA with DNA arrays and proteomic methods are, in comparison, very labor-intensive. In general, the function of an upregulated miRNA is investigated in a further step with the aid of complementary antisense oligonucleotides. Modified antisense oligonucleotides that bind with high affinity to the miRNA are particularly useful for this purpose. This approach reduces or eliminates the effects of the miRNA, which allows conclusions to be drawn about its role in cellular processes. For example, it has been shown that replication of the hepatitis C virus depends on the expression of the liver-specific miRNA-122 mentioned above.
If the miRNA is blocked by an antisense oligonucleotide, the virus can no longer proliferate. This result was exploited in the first clinical trial involving miRNAs, in which an LNA-modified antisense oligonucleotide against miRNA-122 was used to treat hepatitis C infections. Conversely, if a miRNA is underexpressed in a pathophysiological process, chemically synthesized miRNA or endogenous expression of the miRNA can be used to restore its level.
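Bioinformatic target prediction of the kind mentioned above typically starts from complementarity to the miRNA seed region (nucleotides 2–8). A minimal sketch, using the mature miR-122 sequence and an invented UTR as input (real tools add conservation, site-context, and free-energy terms):

```python
# Minimal miRNA target-site scan: find perfect matches to the seed region
# (nucleotides 2-8 of the miRNA) in a 3' UTR. This is only the core idea;
# the UTR below is an invented example, not a real transcript.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_match_sites(mirna: str, utr: str):
    """Positions in the UTR complementary to the miRNA seed (nt 2-8)."""
    seed = mirna[1:8]                       # nucleotides 2-8, 5' -> 3'
    # the matching site on the mRNA is the reverse complement of the seed
    site = "".join(COMPLEMENT[nt] for nt in reversed(seed))
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

mir122 = "UGGAGUGUGACAAUGGUGUUUG"                  # mature miR-122
utr = "CCAUUGUCACACUCCAAAGGCACUCCAUUGUCACACUCC"    # invented UTR, two sites
print(seed_match_sites(mir122, utr))               # prints [8, 32]
```

The high failure rate of purely bioinformatic prediction mentioned in the text stems precisely from the shortness of this seed match, which occurs frequently by chance.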


Quantitative PCR, Section 29.2.5

38.4 Aptamers: High-Affinity RNA- and DNA-Oligonucleotides

38.4.1 Selection of Aptamers

In the techniques described up to this point, the critical property of oligonucleotides was their ability to hybridize with complementary sequences via Watson–Crick base pairing.

Aptamers and the SELEX Procedure, Section 32.9.2


Figure 38.15 SELEX strategy for the isolation of RNA aptamers with high affinity. PBS: primer binding site.

PCR, Chapter 29

However, oligonucleotides can also form complex three-dimensional structures through which they bind to target molecules with high affinity, similar to antibodies. In the early 1990s a procedure was developed that allows the selection, from a large pool of oligonucleotides, of those structures that specifically bind to a given target. These high-affinity oligonucleotides are called aptamers and can be directed against all sorts of target structures, such as ions, small organic molecules, amino acids, nucleotides, proteins, or even whole viruses or cells. The combinatorial approach used to isolate aptamers is called systematic evolution of ligands by exponential enrichment (SELEX). It is a cyclical process in which the sought-after molecules are successively enriched in the population (Figure 38.15). To begin with, a combinatorial library of single-stranded DNA molecules is chemically synthesized. These oligodeoxyribonucleotides consist of two flanking fixed sequences, which serve as binding sites for PCR primers, and a randomized section between them, which is the basis for the diversity of the oligonucleotide mix. The theoretical complexity of a library of oligonucleotides with n randomized positions is 4^n; for example, for a random sequence section 25 nucleotides long, 4^25 ≈ 10^15 different sequences are possible, which can form different structures. First, the oligodeoxyribonucleotides are amplified by PCR to generate double-stranded DNA molecules and to replicate the library. Owing to their additional hydroxyl group, RNA oligonucleotides were initially thought to form more complex three-dimensional structures, so DNA libraries were usually transcribed with T7 RNA polymerase into a mix of RNA oligonucleotides. More recently, however, it has been shown that DNA aptamers with properties and target affinities similar to those of RNA aptamers can be selected, and simple strategies based solely on DNA molecules have become the most common approach.
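The combinatorial arithmetic above is easy to verify; a one-function sketch:

```python
# Theoretical sequence complexity of a SELEX library with n randomized
# positions: each position can be A, C, G, or U/T, giving 4**n sequences.

def library_complexity(n_random_positions: int) -> int:
    """Number of distinct sequences for n fully randomized positions."""
    return 4 ** n_random_positions

n = 25  # randomized stretch of 25 nucleotides, as in the text
print(f"4^{n} = {library_complexity(n):.3e}")  # roughly 1.1e15 sequences
```

In practice the synthesized library (typically nanomole scale) samples only a fraction of this theoretical sequence space once n grows much beyond 25.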
The decisive step is the selection of those oligonucleotides that, owing to their structure, bind particularly tightly to the target. There are different procedures for this; for example, a target protein can be immobilized on a solid support in a column and the oligonucleotide mix run through the column. Non-binding oligonucleotides pass through, while those with a high affinity for the target are retained; after repeated washing, they are eluted. The binders are now enriched, but only a very small amount is available, so additional amplification steps are necessary. This is accomplished by reverse transcribing the RNA oligonucleotides into DNA and amplifying them by PCR, followed by further rounds of selection. After several SELEX rounds have been completed, the strongly binding oligonucleotides are highly enriched in the mixture, so that they can be cloned and sequenced. Sequence families are identified through bioinformatic analysis and their structures are predicted. Finally, the oligonucleotides can be optimized through further evolutionary procedures (e.g., reselection with intentionally error-prone PCR) and by shortening. The result of this process is an aptamer with high affinity for the chosen target molecule. Besides this in vitro procedure, other methods have been developed to select aptamers against cell-surface components of whole cells in culture or even tumor markers in living animals. Aptamers are comparable to antibodies in their function of binding a target molecule. Both types of molecule have similar binding constants; however, aptamers possess a


few advantages: antibodies are typically generated by immunizing animals, which can cause problems if the protein is poorly immunogenic and fails to trigger an antibody response. The standard SELEX procedure, in contrast, is a purely in vitro procedure in which aptamers can, at least in principle, be generated against any chosen target molecule. Aptamers can be reproduced at any time, easily and in high purity, by chemical synthesis. They are larger than small-molecule compounds but usually about an order of magnitude smaller than antibodies, so they penetrate dense tissue, such as a tumor, better than antibodies do. Another advantage is their high specificity: by a targeted procedure, so-called counter-SELEX, aptamers are obtained that can differentiate between very closely related chemical structures (e.g., theophylline and caffeine). In addition, SELEX has been standardized to such a degree that it can be carried out automatically by pipetting robots, which allows aptamers to be generated quickly and in parallel. For membrane proteins, aptamers selected by in vitro SELEX sometimes do not recognize their target in the cellular context, since the protein folds differently in its natural environment. To overcome this problem, sophisticated strategies have been developed to select aptamers against membrane proteins in whole cells. Even the selection of aptamers against tumors in living animals has been carried out successfully. So far, in vivo testing of aptamers has not found them to be toxic or immunogenic. However, like other oligonucleotides, aptamers must be protected against nucleases. Regions of the aptamer that are not directly involved in binding the target molecule can be stabilized with modified nucleotides, but the regions directly involved in binding cannot be modified this way after SELEX without risking loss of affinity. As a result, the modifications are often introduced during the SELEX itself.
They must, however, be accepted by the enzymes employed in SELEX (RNA polymerase, reverse transcriptase, DNA polymerase), which severely limits the choice of modifications. Frequently, the monomers carry an amino group or a fluorine atom at the 2´ position of the ribose (Figure 38.16). An alternative is the use of Spiegelmers, which consist of L-RNA instead of the naturally occurring D-RNA. These enantiomeric oligonucleotides are not recognized by nucleases and are therefore extremely stable. Their generation, however, requires a sophisticated selection procedure in which a normal aptamer is first raised against the mirror image of the target molecule; the enantiomer of this aptamer is then the sought-after Spiegelmer, which binds the normal target. As described above, a major challenge for the application of oligonucleotide-based strategies is their intracellular delivery. Aptamers differ from antisense oligonucleotides, ribozymes, and siRNAs in that they can also be directed against extracellular targets, such as growth factors and other signaling molecules in the blood stream, thereby avoiding the need to cross the cell membrane. However, aptamers can also be used to investigate the intracellular function of proteins; in this case, the aptamer either has to be transferred into the cell or can be transcribed intracellularly from an expression cassette.

38.4.2 Uses of Aptamers

Owing to the large number of possible target molecules, aptamers can be used in many applications. Even their potential as therapeutics has been investigated in clinical trials. A particularly successful example is an aptamer for the treatment of age-related macular degeneration. It is directed against a growth factor that controls the growth of blood vessels, the cause of the disease. This aptamer was approved as a drug by the FDA at the end of 2004. Aptamers for the treatment of cancer and for the inhibition of blood coagulation have also been tested clinically; roughly a dozen aptamers were or are being tested in clinical trials. High-affinity oligonucleotides can be used for diagnostic as well as therapeutic purposes. Aptamers have been developed that recognize tumor markers expressed in the extracellular matrix; if a radioactive isotope is coupled to such an aptamer, the precise position of a tumor can be determined. Besides these directly medical applications, aptamers have found many other biotechnological uses, which can only be mentioned briefly here. One application involves coupling high-affinity oligonucleotides to the surface of chips and sensors. These detection systems are used for various purposes, from protein analysis to the testing of soil samples for the presence of toxins. Interestingly, aptamers can also be used to control ribozymes allosterically. The aptamer is coupled to a hammerhead ribozyme with a spacer. Binding of the ligand to the


Figure 38.16 Modifications of the 2´ position of the ribose by fluorine (a) or an amino group (b), which are used to stabilize aptamers.


aptamer triggers a conformational change that activates or inactivates the ribozyme. These allosterically regulated ribozymes, called aptazymes, have proven to be very sensitive sensors. Aptamers are not only artificial products of biotechnology; they also occur naturally as regulatory systems. These biological sensors are called riboswitches and are usually found in the 5´ untranslated regions of various mRNAs. They "measure" the level of a metabolite and regulate gene expression through conformational changes: if the metabolite is present at high concentration, protein biosynthesis is inhibited. With the aid of analogs of the ligand, the function of a riboswitch can be analyzed more precisely and, possibly, the growth of pathogenic bacteria can be inhibited in this manner.

38.5 Genome Editing with CRISPR/Cas9

Oligonucleotide Synthesis, Section 27.6.1

Figure 38.17 Genome editing with the CRISPR/Cas9 system. A guide RNA directs the Cas9 protein to the target sequence in the double-stranded DNA. The endonuclease then cleaves both strands of the genomic DNA. X indicates cleavage site.

Another oligonucleotide-based method with great potential in molecular biology is the CRISPR/Cas9 technology. In prokaryotes, Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and their associated cas genes serve as an adaptive immune system that protects against infection by bacteriophages. This bacterial system has been adapted for use in eukaryotic cells and can be used for precise genome engineering. Cas9 is a DNA endonuclease with two active sites that each cleave one strand of double-stranded DNA. For technological applications, a guide RNA (gRNA) is designed that is complementary to the target region (Figure 38.17). However, the choice of target is not fully free, as CRISPR/Cas9 systems require a short sequence, called the Protospacer Adjacent Motif (PAM), located directly adjacent to the target sequence. For the most widely used Cas9 protein, from Streptococcus pyogenes, the PAM sequence is NGG. Another element of the gRNA is the scaffold derived from the tracrRNA (trans-activating CRISPR RNA), which forms the hairpin that interacts with the Cas9 protein (Figure 38.17). Like the double-stranded RNA triggering RNAi, the gRNA can either be synthesized chemically or expressed intracellularly. Many applications of the CRISPR/Cas9 technology have been developed in recent years. In the simplest version, the gRNA directs the Cas9 protein to a chosen target sequence and induces its cleavage. The double-strand break is repaired by a cellular mechanism known as Non-Homologous End Joining (NHEJ). This process, however, is error-prone and often leads to the introduction of mutations that inactivate the targeted gene. This approach can be used to produce and study loss-of-function phenotypes or to inactivate deleterious genes in therapeutic applications. A more sophisticated strategy aims at introducing point mutations or even larger fragments of genetic material.
In this case, a DNA fragment with ends homologous to the target region is required, in addition to the gRNA and the Cas9 protein. Following the double-strand break introduced by the Cas9 protein, the homologous DNA fragment is inserted by the Homology-Directed Repair (HDR) pathway. For further applications of the CRISPR/Cas9 technology that repress or activate transcription or influence other cellular processes, the interested reader is referred to the references given at the end of the chapter. Compared to the antisense and RNAi approaches discussed above, the CRISPR/Cas9 technology is extremely efficient and easy to use. In addition, the CRISPR/Cas9 system enables a full knockout of the target gene, while the other methods only result in a partial knockdown of gene expression. Finally, some studies suggest that the CRISPR/Cas9 system produces fewer off-target effects than the other methods. Within only a few years, the CRISPR/Cas9 technology has already found its way into thousands of research laboratories worldwide and has become an indispensable research tool. In addition, several clinical trials involving CRISPR/Cas9 have already been initiated and several more have been announced for the near future. As the technology makes gene editing in the germline technically viable, its employment in this manner raises serious bioethical concerns.
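The PAM requirement described above can be made concrete with a short script that enumerates candidate SpCas9 target sites (a 20-nt protospacer followed by NGG) in a DNA sequence. This is a minimal sketch for illustration only: the example sequence and function names are invented, and real guide design additionally scores specificity and off-target risk.

```python
# Minimal sketch: enumerate candidate SpCas9 target sites in a DNA
# sequence. The 20-nt protospacer must be followed directly by an
# NGG PAM (N = any base). Example sequence is invented.

def find_spcas9_sites(seq, protospacer_len=20):
    """Return (position, protospacer, PAM) tuples on the given strand."""
    seq = seq.upper()
    sites = []
    for i in range(len(seq) - protospacer_len - 2):
        pam = seq[i + protospacer_len : i + protospacer_len + 3]
        if pam[1:] == "GG":  # the NGG rule of S. pyogenes Cas9
            sites.append((i, seq[i : i + protospacer_len], pam))
    return sites

def reverse_complement(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

# Scan both strands; reverse-strand positions refer to coordinates
# within the reverse complement.
dna = "ATGCGTACGTTAGCCTAGGCAGTACGATCGGAATTCCGGTACGATCGATCGGCC"
for strand, s in (("+", dna), ("-", reverse_complement(dna))):
    for pos, proto, pam in find_spcas9_sites(s):
        print(strand, pos, proto, pam)
```

In practice one would filter these candidates further, for example against the rest of the genome to exclude guides with close off-target matches.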

38.6 Outlook

Antisense approaches possess a strong appeal due to a simple idea inherent in their nature: theoretically, any given target RNA in a cell can be specifically and selectively blocked or destroyed by the use of complementary oligonucleotides. These techniques have been used successfully for many years in experiments in which the function of a gene is determined by studying what happens when it is inhibited. Antisense strategies often achieve the desired results faster than the development of small molecule inhibitors of proteins or the generation of knockout animals; they are used by pharmaceutical companies for target validation before the real drug is sought in high-throughput screening of a large compound collection. Since the antisense principle is universally applicable, it can also be used against genes whose corresponding proteins cannot be addressed by small molecules due to their size, structure, or tissue distribution, often referred to as non-druggable targets. In many cases, however, the practical application of antisense oligonucleotides and ribozymes encounters significant difficulties, and the results of clinical studies have often been disappointing. Apparently there are problems in getting enough molecules into the cells of the correct tissue at the right time. Great hopes are therefore placed in the newer method of RNA interference, which appears to be significantly more efficient. It has already been employed with great success in numerous studies, and valuable new discoveries have been made in large screening campaigns in the areas of virology and tumor biology. Whether small interfering RNAs also prove suitable for therapeutic use remains to be seen. The vision could then become reality that the development of a therapeutic is significantly accelerated by the use of RNAi: new potential targets could be identified for the treatment of a disease by employing large libraries of siRNAs.
The siRNA used to identify the target can be used for target validation and further developed into a therapeutic without having to begin the search for a low-molecular-weight drug anew. The significance of miRNAs in the fine regulation of cellular processes is becoming increasingly evident. It is astounding that such a central regulatory mechanism of the cell could remain undiscovered for so long. Many of the most modern technologies of bioanalysis play a role in the investigation of the function of miRNAs. Deep sequencing, array technologies, and quantitative PCR are used to determine the expression levels of the miRNAs. Bioinformatics and modern proteomic methods are employed to determine the target mRNAs of the miRNAs, and oligonucleotides can be used to inhibit the miRNAs in order to investigate their function or to correct their deregulation in disease processes. The main challenge for all antisense-based strategies, the intracellular delivery of the oligonucleotides, can be avoided by using aptamers, since these can be directed against extracellular targets. Aptamers can also be used ex vivo for various purposes, from protein analysis to the monitoring of samples for toxins or biological warfare agents. In the coming years we can expect to see an expansion of the areas in which oligonucleotides are used for diagnostic and therapeutic purposes. Recently, the CRISPR/Cas9 technology has been developed as a new approach for precise RNA-directed genome editing. It has great potential in biomedical research as an easy method to disrupt or modify targeted genes. In addition, it may soon become a new therapeutic option for genetic disorders and viral infections.


Part V: Functional and Systems Analytics

Further Reading

Bennett, C.F. and Swayze, E.E. (2010) RNA targeting therapeutics: molecular mechanisms of antisense oligonucleotides as a therapeutic platform. Annu. Rev. Pharmacol. Toxicol., 50, 259–293.
Burnett, J.C. and Rossi, J.J. (2012) RNA-based therapeutics: current progress and future prospects. Chem. Biol., 19, 60–71.
Doudna, J.A. and Lorsch, J.R. (2005) Ribozyme catalysis: not different, just worse. Nat. Struct. Mol. Biol., 12, 395–402.
Fabian, M.R. and Sonenberg, N. (2012) The mechanics of miRNA-mediated gene silencing: a look under the hood of miRISC. Nat. Struct. Mol. Biol., 19, 586–593.
Haussecker, D. (2012) The business of RNAi therapeutics in 2012. Mol. Ther. Nucleic Acids, 1, e8.
Hille, F. and Charpentier, E. (2016) CRISPR-Cas: biology, mechanisms and relevance. Philos. Trans. R. Soc. Lond. B Biol. Sci., 371, 20150496.
Jackson, A.L. (2012) Developing microRNA therapeutics: approaching the unique complexities. Nucleic Acid Ther., 22, 213–225.
Keefe, A.D., Pai, S., and Ellington, A. (2010) Aptamers as therapeutics. Nat. Rev. Drug Discovery, 9, 537–550.
Kurreck, J. (ed.) (2008) Therapeutic Oligonucleotides, Royal Society of Chemistry Publishing, Cambridge.
Kurreck, J. (2009) RNA interference: from basic research to therapeutic applications. Angew. Chem. Int. Ed., 48, 1378–1398.
Lightfoot, H.L. and Hall, J. (2012) Target mRNA inhibition by oligonucleotide drugs in man. Nucleic Acids Res., 40, 10585–10595.
Mulhbacher, J., St-Pierre, P., and Lafontaine, D.A. (2010) Therapeutic applications of ribozymes and riboswitches. Curr. Opin. Pharmacol., 10, 551–556.
Thiel, K.W. and Giangrande, P.H. (2009) Therapeutic applications of DNA and RNA aptamers. Oligonucleotides, 19, 209–222.
Watts, J.K. and Corey, D.R. (2012) Silencing disease genes in the laboratory and the clinic. J. Pathol., 226, 365–379.
Wang, X., Huang, X., Fang, X., Zhang, Y., and Wang, W. (2016) CRISPR-Cas9 system as a versatile tool for genome engineering in human cells. Mol. Ther. Nucleic Acids, 5, e388.

Proteome Analysis

In 1975 O’Farrell and Klose independently published articles with spectacular images in which they showed that the combination of isoelectric focusing and SDS gel electrophoresis is able to separate extremely complex protein mixtures. This new procedure, two-dimensional gel electrophoresis, quickly became established as the most successful high-resolution technique for separating proteins. Attempts were soon made to use the information contained in the protein patterns to solve biochemical and medical questions. By comparing the protein patterns of different, defined states of a cell or a body fluid (e.g., sick or healthy, different metabolic conditions, etc.), changes in the protein patterns become visible that are characteristic for these conditions (Figure 39.1). This strategy for investigating biological questions is called a subtractive approach. However, the analytical methods for protein characterization (sequence analysis, amino acid analysis) were at that time not able to analyze such small amounts of protein as could be separated and visualized by 2D gel electrophoresis. To make matters worse, the proteins were embedded in a gel material that was incompatible with most protein chemistry techniques at the time. Therefore, the results of subtractive approaches were mostly descriptive in character. Significant changes in the protein pattern could be recognized, but the identity of the proteins involved remained unknown. This changed only when the protein-chemical methods improved and became much more sensitive. At the same time, methods were developed to cleave the proteins within the gel matrix and to extract the resulting peptides from the gel matrix. Alternative methods were developed in which proteins are transferred from the gel matrix to chemically inert membranes, where the immobilized proteins can then be directly sequenced or further cleaved with enzymes.
Triggered by the successes of this subtractive strategy in conjunction with the enhanced analytical techniques the idea was born to represent the protein pattern of a whole cell and interpret it quantitatively – this is the objective of proteome analysis. The term proteome was introduced in 1995 by the Australian Marc Wilkins, as the so-called “protein equivalent of a genome.” This term describes the complete protein pattern of a cell, an organism, or a body fluid.
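At the data level, the subtractive approach amounts to comparing spot intensities between two states. The following sketch flags spots whose intensity changes by more than a chosen fold-change threshold; the spot identifiers and intensity values are invented for illustration.

```python
# Sketch of the subtractive approach at the data level: compare spot
# intensities from two proteome states and flag spots whose abundance
# changes by more than a chosen fold-change threshold.

def differential_spots(state_a, state_b, fold_threshold=2.0):
    """Return {spot_id: fold change of B relative to A} for changed spots."""
    changed = {}
    for spot, a in state_a.items():
        b = state_b.get(spot, 0.0)
        if a == 0.0:
            if b > 0.0:                      # spot appears only in state B
                changed[spot] = float("inf")
            continue
        fold = b / a
        if fold >= fold_threshold or fold <= 1.0 / fold_threshold:
            changed[spot] = fold
    return changed

control  = {"spot01": 1200.0, "spot02": 300.0, "spot03": 80.0}  # invented
stressed = {"spot01": 1150.0, "spot02": 900.0, "spot03": 20.0}

print(differential_spots(control, stressed))
# spot02 is up threefold and spot03 down fourfold; spot01 is unchanged
```

A real analysis would additionally normalize intensities between gels and test the changes for statistical significance across replicates.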

39.1 General Aspects in Proteome Analysis

Friedrich Lottspeich
Peter-Dörfler-Straße 4a, 82131 Stockdorf, Germany

Although the terms genome and proteome sound very similar, they describe two fundamentally different things: a genome is a static entity, which is precisely defined by the type, order, and number of its nucleotides. However, in the life of a cell, at no point in time are all the genes turned on and translated into proteins. A proteome is therefore a tremendously dynamic object that is affected by a large number of parameters (Figure 39.2).

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

The delicate balance between protein


Proteome: the quantitative representation of the total protein expression pattern of a cell, an organism, or a body fluid under precisely defined conditions.


Figure 39.1 Subtractive approach: A cell (Escherichia coli) is brought from an initial state (a) by an event (e.g., other culture conditions) to another state (b). The protein patterns of both states are compared. The changes in protein patterns (indicated by gray arrows) can be attributed directly or indirectly to the triggering event.

Figure 39.2 Effect of different parameters on the protein expression. The current amount of a protein in a cell is determined by various factors and is highly sensitive to changes in these variables.

Analysis of Post-translational Modifications, Chapter 25

synthesis, protein degradation, and many highly dynamic protein modification events can be very different under different metabolic or environmental conditions. However, the sensitive dependence of the proteome pattern on various parameters also provides the possibility to use the protein pattern as a sensitive sensor and to detect specific network-like connections through small changes in the parameters. The great technical challenge in proteome analysis is not to change the quantitative ratios of the proteins as they are present in nature. This is especially true for all steps of sample preparation and protein isolation. Ideally, the analysis of a proteome provides the exact amount of each protein. This is information that in principle cannot be obtained using molecular biology techniques, since there is no strict correlation between the amount of an mRNA and the amount of the corresponding protein. Translational regulation, mRNA stability, protein stability, and protein degradation cannot be detected at the mRNA level. Therefore, it is not possible to predict from the nucleic acid level the amount of a protein existing in a cell. From these remarks it is clear that a non-quantitative proteome analysis is useful only in rare cases. A simple determination of the protein composition is performed only for very simple proteomes, subproteomes of organelles, or large protein complexes. In addition, this information is usually only an initial basis for further work. Another important piece of information that in principle cannot be obtained from nucleic acid sequence data concerns post-translational events. The active form of a protein almost always differs from the form that is present after the first translation of the mRNA. Very often the newly synthesized protein is processed, that is, amino acid residues or peptides from the N- or


C-terminal end of the translated protein are cleaved off. This is very common for enzymes, which are often converted from an inactive into an active form by cleavage of a peptide bond; conversely, an active enzyme may be inactivated by processing. This mechanism of so-called limited proteolysis plays a very important role in nature (e.g., in blood clotting) and is used for the regulation of entire reaction cascades. Other post-translational modifications also take place frequently. The most common among the more than 150 known post-translational modifications are phosphorylation, sulfation, acetylation, methylation, and the linkage of certain amino acids with lipids or glycans. Almost none of the post-translational modifications can be determined or predicted at the DNA level, but they decisively influence the biological function of a protein. The proteome provides – contrary to the analysis of DNA or RNA – information on the quantity and the post-translational modifications of each protein.

In addition to the quantity of the expressed proteins and their post-translational modifications, the location and the neighborhood of the individual proteins are also of great importance for their interactions and thus for their function. These aspects, which are often also expected directly from proteome analysis, are treated separately in Chapters 41 and 43. The main objective of proteomics is to unravel network-like complex functional relationships that are otherwise extremely difficult to access. This is generally achieved by quantitatively monitoring the changes in the protein (proteome) pattern when a disturbance is applied to a given biological state (perturbation analysis). All proteome analyses have in common four complex and critical areas:


 definition of the initial conditions and the question,
 sample preparation,
 quantitative analysis of the proteins,
 bioinformatics analysis.

39.2 Definition of Starting Conditions and Project Planning

If one considers that a proteome reacts extremely sensitively to any changes and is defined to a great extent by the environmental conditions (Figure 39.2), it soon becomes clear that a detailed description of all conceivable parameters has an extremely high priority. Indeed, the biological context and the environmental parameters under which a proteome analysis is applied are crucial aspects. Several issues must be considered before a proteome analysis is performed. Only the most important are mentioned here:

 The individual proteome stages must be clearly defined, which can only be achieved by taking into account a suitable (at least reproducible) sample preparation.

 It makes sense to keep the differences between the states small to keep the number of changes in the protein pattern manageable.

 The influence and the degree of genetic heterogeneity, polymorphisms, or mutations should be kept in mind in the evaluation of differential protein patterns.

 The dynamics of proteins in living systems must be considered: biological processes at the protein level are sometimes very rapid. Regulatory processes (such as phosphorylation, dephosphorylation, transport processes, degradation processes, etc.) often take place within seconds or minutes. This means that the composition of the protein network changes significantly within these short periods and must be analyzed accordingly – an extremely challenging task, especially for project planning and preparation of samples.

An important point for properly assessing the expected results when planning a proteome project is the amount of sample available. Almost all protein-chemical analysis techniques work routinely on the femtomole level. For clarification: 1 mol = 6 × 10^23


molecules; 10 fmol = 10^−14 mol = 6 × 10^9 molecules. Thus, to be able to successfully analyze a relatively rare protein (1000 copies per cell), one must start with at least 10^6 cells, which is a significant amount. For extremely rare proteins that are expressed in a few copies per cell, very large amounts of starting material (10^8–10^9 cells) must be available, which is often a problem in practice. Any improvement in the sensitivity of analytical methods is therefore of utmost importance, will lead to greater depth of analysis, and will enable a more comprehensive proteomic analysis. In some cases, certain classes of proteins can be selectively enriched during sample preparation via affinity chromatography, specially modified magnetic beads, or sample preparation arrays or chips. This has the advantage that the complexity of the original sample is significantly reduced and the subsequent analysis is thus considerably facilitated. However, it must be carefully considered during project planning whether the “subproteome” also contains the relevant information. Finally, one has to consider the statistical significance of the results: in proteomic analyses the experimental and biological variability for each set of samples has to be determined and included in the project planning and the evaluation of results. For a successful and meaningful protein-chemical determination of a protein (quantification, identification), today more than 10^9 molecules are necessary.
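The arithmetic behind this estimate can be written out explicitly. The sketch below assumes lossless sample preparation and uses 10 fmol as the required amount; both numbers merely illustrate the reasoning in the text.

```python
# Worked example of the sample-amount estimate: how many cells are needed
# so that a protein of a given copy number yields enough molecules for
# analysis (idealized: no losses during preparation).

AVOGADRO = 6.022e23  # molecules per mole

def cells_needed(copies_per_cell, required_moles=10e-15):  # default: 10 fmol
    required_molecules = required_moles * AVOGADRO
    return required_molecules / copies_per_cell

# 10 fmol corresponds to ~6 x 10^9 molecules; at 1000 copies per cell
# this requires on the order of 10^6 cells, as stated in the text.
print(f"{cells_needed(1000):.1e} cells")
```

For a protein present at only a few copies per cell, the same formula immediately gives the 10^8–10^9 cells quoted above.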

39.3 Sample Preparation for Proteome Analysis

The first step of a quantitative proteome analysis, immediately after sampling, must normally be sample preparation, which differs significantly from classical sample preparation procedures. The main objective of proteome analysis is to determine the quantitative ratios of proteins in different proteome states. In this respect, the usual steps of protein purification are inappropriate for proteome analyses. With classical methods, usually a specific protein (or a few proteins) is isolated from a complex matrix, with the focus on high purity and a good quantitative yield of the target protein(s). The other “uninteresting” proteins are treated as unimportant and are ignored. In classical sample preparation and purification, typically multistep techniques are used, all of which inevitably entail protein-specific losses. In addition, the separation of the proteins into individual fractions is by no means complete: proteins are mostly found in several fractions, so that the total amount of these proteins is extremely difficult to determine quantitatively. During sample preparation for proteome analysis, ideally all the proteins of a proteome must be brought into solution for subsequent quantitative analysis. The sample preparation may help to prepare a well-defined and thus meaningful proteome sample. There are virtually no limits to creative strategies, such as:

 cell sorting may be important to ensure that the cells examined are all in the same cell cycle stage;

 in studies of tumor tissue, tumor cells can be enriched by laser microdissection, after which the preparation is contaminated with only minor amounts of other cells;

 centrifugation or free-flow electrophoresis can be used for organelle separations to obtain preparations that are as homogeneous as possible.

Artificial changes in the protein composition by proteolysis or other modifications (e.g., oxidation) must be avoided. Therefore, in proteome analysis the length and reproducibility of the sample preparation play an important role. Every manipulation of proteins inevitably leads to losses, mainly due to adsorption onto any kind of surface as a result of the hydrophobic character of proteins. Unfortunately, different proteins exhibit different degrees of loss, and these are not predictable. Consequently, after the sample preparation steps the original quantitative composition of the sample is no longer guaranteed. It follows that, in general, the sample preparation for proteome analysis should consist of very few steps. For the next step – the quantitative determination of the amounts of protein – different approaches have emerged in recent years, which are dealt with in the following sections. All these strategies have significant limitations in common,


which are based mainly on two properties of a proteome: the enormous complexity and the large dynamic range of protein abundances, which make the determination of low abundance components so difficult:

 Complexity: For example, the human genome has probably only about 20 000 genes, but it
produces, by various processes on the way from the gene to the protein (e.g., alternative splicing, mRNA editing, processing, post-translational modifications, etc.), hundreds of thousands, indeed probably even millions, of protein species (also called proteoforms). Although these different protein species may be derived from a single gene, they are at the molecular level – and often functionally – clearly different. For a comprehensive proteome analysis, each of these proteoforms must be characterized and quantified individually. But this causes serious and fundamental difficulties because, to date, there are no techniques available that can unravel hundreds of thousands of components. Therefore, a concept called “reduction of complexity” is often adopted, which does not, as originally intended, analyze the pattern of all proteins (the proteome) but only a subset of the proteome, a subproteome. Such focusing on a subproteome is usually also associated with a restriction of the biological question that can be addressed. A typical example is the study of all phosphorylated proteins, the phosphoproteome, which is particularly important, for example, in signal transduction processes. Along the lines of this reduction of complexity is the analysis of a subproteome with a targeted proteomics strategy (Section 39.5), which has proved extremely successful in the study of molecular machines (e.g., the spliceosome, the ribosome, the proteasome, etc.) and in the elucidation of protein complexes, which are responsible for the execution of many biological functions. Despite the successful application of such strategies, one should be aware that an essential aspect of a holistic analysis, that is, to elucidate unexpected, far-reaching, or transient functional contexts without prior knowledge, is hardly possible this way.

 Low abundance proteins: A particular challenge in a proteome analysis is the very different amounts of the individual protein species.
There are, for example, in human blood plasma proteins such as albumin that occur in the concentration range of milligrams per milliliter. But there are also important proteins – for example, the prostate-specific antigen – that occur in concentrations of picograms per milliliter (Figure 39.3). This dynamic range of protein amounts, covering ten to twelve orders of magnitude, means that for a proteomic analysis we have to detect, identify, and quantify one protein molecule in the presence of one billion molecules of another protein species – an impossible task so far. Most analytical techniques can cover a dynamic range of 10^2–10^3, which is very far from the required minimum of 10^8. The only way to cover a large dynamic range is to separate the frequent proteins from the rare ones and analyze them separately. A further problem is that a small number of highly expressed proteins accounts for the major part of the sample material, while many regulatory or diagnostically interesting proteins occur only in very small amounts. For example, in plasma the 22 most abundant proteins account for about 99% of the total protein mass. Therefore, in plasma proteome analyses the few most abundant proteins, which account for 90% of the amount of protein in blood plasma, are sometimes removed via affinity methods

Figure 39.3 The dynamic range of the abundance of plasma proteins covers more than ten orders of magnitude. Albumin is present in blood plasma at a concentration of 50 mg ml^−1, while interleukins are present in concentrations as low as a few picograms per milliliter. Source: Anderson, N.L. et al. (2002) Mol. Cell. Proteomics, 1, 845–867. With permission, Copyright © 2002, by the American Society for Biochemistry and Molecular Biology.


(“depletion”). The proteomic analysis of the remaining material is then carried out with a more uniform sample. However, the completeness of the depletion is critical and very difficult to control, since even 0.1% of a highly abundant protein still appears in the depleted sample as an abundant protein, and its amount often varies quite strongly due to the depletion process. Additionally, there is always a risk that specific rare proteins bind to the abundant proteins and are co-depleted. In such cases, the quantitative ratios in the depleted sample may no longer reflect those of the original sample.
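The span of this dynamic range is easy to verify from the concentrations cited: albumin at 50 mg ml^−1 versus an interleukin at a few pg ml^−1. The 5 pg ml^−1 value below is an illustrative assumption in the spirit of Figure 39.3, not a measured figure.

```python
# Order-of-magnitude check of the plasma dynamic range: albumin at
# ~50 mg/ml versus a low-abundance interleukin at ~5 pg/ml (assumed,
# illustrative value) span about ten orders of magnitude.

import math

albumin_g_per_ml = 50e-3        # 50 mg/ml
interleukin_g_per_ml = 5e-12    # 5 pg/ml (illustrative)

orders = math.log10(albumin_g_per_ml / interleukin_g_per_ml)
print(f"dynamic range: about {orders:.0f} orders of magnitude")
```

Since typical analytical techniques span only two to three orders of magnitude, this simple ratio already shows why abundant and rare plasma proteins must be separated before analysis.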

39.4 Protein-Based Quantitative Proteome Analysis (Top-Down Proteomics)

Nicole A. Haverland, Owen S. Skinner, and Neil L. Kelleher
Northwestern University, Departments of Chemistry, Molecular Biosciences, and the Feinberg School of Medicine, 2145 Sheridan Road, Evanston, IL 60208-3113, USA

Quantitative proteomics is today carried out mainly by two strategies. The top-down strategy aims to keep the proteins intact as long as possible. Reduction of complexity, quantification, and mass spectrometric identification of the individual proteins are achieved at the protein level. A typical example of a top-down approach is classic gel-based proteomics (Section 39.4.1), in which 2D gel electrophoresis is used to separate the proteins with subsequent image-based quantification. Mass spectrometry is used here exclusively to identify the proteins. The difficulties and limitations of separation methods for proteins led to alternative or complementary techniques for quantitative proteomics, the bottom-up proteomics strategies (Section 39.5). Here, as a first step, all the proteins of a proteome are cleaved into peptides. The resulting immensely complex peptide mixture is fractionated using different separation methods and quantitatively analyzed by mass spectrometry. The peptides are identified and assigned to individual proteins by informatics.

39.4.1 Two-Dimensional-Gel-Based Proteomics

Most top-down proteome data that can be found in the literature today have been obtained by means of 2D gel electrophoresis. This strategy can be divided into the following sub-steps:


 sample preparation,
 separation of proteins,
 image analysis,
 quantification of the proteins and data analysis,
 identification and characterization of the proteins.

Sample Preparation In the first step, the proteins of the different proteome states have to be completely dissolved for 2D gel electrophoresis. To avoid interference with isoelectric focusing, no salts should be present, and only zwitterionic or non-ionic detergents are used in the sample buffer. In practice, many cell types can be dissolved directly in the buffer for 2D gel electrophoresis (8 M urea, 2% CHAPS). For difficult samples (e.g., tissue or poorly soluble cells, such as fibrous cells) special sample preparation protocols have been developed (French press, Ultra-Turrax, bead beater, etc.). Before separation by 2D gel electrophoresis, any undissolved material should be removed from the sample by high-speed centrifugation.

Separation of Proteins As described in Chapter 11, 2D gel electrophoresis is a classic separation technology that is able to provide a separation space for up to 10 000 protein species, which corresponds almost to the total number of proteins in simple cells. Two-dimensional gel electrophoresis has many features that make it particularly suitable for proteomics:

 high resolution, up to 10 000 components can be separated in a single gel;
 IEF pH ranges can be spread by using immobilized pH gradients (Immobilines) in the first dimension, which enables higher resolution in specific pH ranges;

39 Proteome Analysis

 it is compatible with detergents and therefore universally applicable for all proteins; in addition, hydrophobic proteins like membrane proteins can be separated;

 using new application techniques, milligram quantities of protein can be separated in a 2D semi-preparative gel;

 two-dimensional electrophoresis is relatively fast to carry out (1–2 days).

Despite these indisputable advantages over other methods, there are also some serious limitations of 2D gel electrophoresis:

 lack of automation, therefore limited reproducibility (regarding the position of the proteins in the gel as well as the amounts of individual proteins);

 difficult technical implementation;
 the transfer from the first (IEF) to the second dimension (SDS-PAGE) is not complete and involves a risk of protein loss;

 the matrix is not inert; the proteins must be retrieved from the gel matrix for further analysis;
 low dynamic range capacity, that is, proteins with low copy numbers cannot be displayed simultaneously with proteins with high expression rates;

 only a few good methods are available to quantify proteins in the gel matrix.

The last two points are certainly the most problematic for a proteome analysis. For proteomics, the main goal is to represent all the proteins. The low dynamic range of gel electrophoresis of a maximum of 10^3 is a serious limitation. Currently, one can only try to separate common and rare proteins as far as possible from each other (e.g., with spread pH gradients and/or by prefractionation) and to quantify them in separate analyses. It should be clearly pointed out here that, although the gel has an extremely high separation capacity, a single spot in no way means that only one protein is present in that spot. Even a purely statistical calculation, assuming about 30 000 proteins in a more complex cellular proteome and a separation capacity of 10 000, shows that on average three protein components must be present in each spot. This means that, most likely, several proteins are present in each spot of a 2D gel, although for analytical reasons (dynamic range, sensitivity, etc.) perhaps only one can be identified. The quantification of proteins is a central concern of proteome analysis. The proteins need to be stained for quantification in the gel matrix. Staining characteristics are different for each dye and can vary greatly from protein to protein and, unfortunately, with the amount of protein. In particular, very small amounts of protein can adsorb relatively more dye than large amounts of protein. Furthermore, small variations in the staining protocols may lead to different staining intensities. The most popular techniques for the visualization of proteins include staining with Coomassie Blue, with a rather low sensitivity and dynamic range (detection limit about 100 ng, dynamic range about 10^2). In addition, silver staining (detection limit 10 ng) is often used, which, however, hardly leads to correct quantitative values.
More sensitive are fluorescence staining, autoradiography (the detection of radiolabeled proteins), and immunological staining (if characterized antibodies are available). Owing to the limited reproducibility and the problems with protein staining, multiple determinations (5–10 gels of the same sample, preferably from independent workups) must always be performed to obtain statistically meaningful quantitative information. Today, standard deviations below 15% can usually be reached with technical replicates for many proteins in 2D electrophoresis. Nevertheless, there are specific proteins that show much larger variations even under optimal instrumental and operating conditions.

Imaging and Quantification of Proteins, Data Analysis After separation, the stained proteins must be quantified. For 2D gels this is done by densitometry with a laser densitometer or a scanner. Commercially available software evaluates the images so that, in a first step, the outlines of the protein spots are detected. Depending on the gel quality, the results can be influenced to some extent by changing the input parameters. Normally, not all spots are identified correctly even with an optimal set of parameters. A major limitation is that, with 2000 spots, an average error of even a few percent will result in 40–100 incorrectly captured proteins. Even though software support has improved markedly in recent years,


Part V: Functional and Systems Analytics

Figure 39.4 Excerpts of 25 gels from different experiments, which are compared with reference gels by computer assistance. The gels are arranged in groups according to the expression of a particular protein (blue). One can see the reproducibility of the protein pattern and the significant differences in the expression of the tagged protein.

significant, time-consuming editing work has to be carried out at this point. After the protein spots have been defined, they are automatically quantified relative to each other and the results are stored in databases. Powerful software tools make it possible to compare gels, to correct small distortions, and to display differences between the protein patterns of different gels. One form of presentation, based on a comparison of 25 two-dimensional gels with controls, is partly reproduced in Figure 39.4. Even at this level differential protein patterns and significant differences in the quantitative ratios can be recognized and analyzed. Above all, statistical data must be collected and averaged, and statistically validated reference gels of the individual states must be prepared.

Identification and Characterization of the Proteins

The protocols for the further analytical steps are optimized for high throughput and rely mainly on mass spectrometry. Proteins separated in a 2D gel may be analyzed in two principal ways. Either the intact protein is transferred from the gel matrix onto a chemically inert membrane and protein chemical analyses (e.g., sequence analysis) are performed directly on the membrane, or the protein is cleaved enzymatically into smaller fragments directly in the gel matrix and the resulting peptides are extracted and analyzed. Both approaches complement each other and often give very specific identifications (Figure 39.5). For analysis of the intact protein, in a first step all proteins are electroblotted from the gel matrix onto a PVDF membrane. The transfer may be almost quantitative for average proteins, while for large or hydrophobic proteins rather low transfer yields must be expected. Thus,

Figure 39.5 Analysis of proteins separated by gel electrophoresis. After separation in the 2D gel, a protein may be analyzed either directly, after electrotransfer to a membrane, or, after enzymatic cleavage, on the basis of the resulting peptide fragments.

39 Proteome Analysis

significant losses may occur and the quantification after electrotransfer certainly no longer reflects the original conditions. Direct amino acid sequence analysis on the membrane can be performed successfully; the minimum amount required is currently in the low picomole range. As a general rule, a visible Coomassie-stained spot in the gel or on the blot is sufficient for sequence analysis. From the amino acid sequence a protein can often be identified by a protein database search. Unfortunately, many naturally occurring proteins are N-terminally blocked and give no result. Identification and characterization of whole proteins by Edman amino acid sequence analysis (Chapter 14) has the disadvantage that it takes relatively long, is complex, and can only provide limited information. Therefore, today, proteome analyses mainly characterize and identify tryptic fragments by mass spectrometry, optimized for high-throughput analysis.

Analysis of Peptide Fragments

For the analysis of peptides the proteins must be cleaved enzymatically. This can take place directly in the gel matrix or after transfer of the proteins onto an inert membrane. Since the membrane transfer is an additional step that can be fraught with losses, cleavage is preferably carried out directly in the gel matrix. The enzymes used for this are in particular trypsin, endoprotease LysC, and endoprotease AspN. After cleavage, the resulting peptide fragments are eluted from the gel matrix, mostly using organic solvents and acids. Complete recovery of all peptides from the gel matrix cannot be expected; especially hydrophobic or large peptides often give poor yields or are not eluted at all. The extracted peptides are then analyzed further, in particular by mass spectrometric techniques. The individual steps are shown in the overview in Figure 39.5.
For mass spectrometric identification the peptide mixture, previously desalted on a reversed-phase mini-cartridge, is analyzed by MALDI-MS or ESI (nanospray) MS. In both cases the output is a list of the mass-to-charge values of all detected peptides. The measured values are then compared with a list of peptide masses generated in silico by theoretical cleavage of all proteins of an organism. Ideally, all measured peptide masses should be found in the theoretical peptide mass list of one protein. In practice, typically only 30–70% of the measured values can be assigned directly to a single protein, but this is virtually always sufficient to identify the protein unambiguously. The remaining mass values may stem from oxidation products, unexpected fragmentations, modifications, or peptides of contaminating proteins. An alternative to analyzing the peptide mass pattern is mass spectrometric sequencing of the proteolytic peptides, which provides higher confidence of identification. With mass spectrometric fragmentation techniques (using MALDI-TOF/TOF or ESI-MS/MS instruments) tandem MS spectra can be generated. Using special search algorithms, the measured spectra are compared with a database of computer-generated theoretical MS/MS (= MS2) spectra of all peptides of the same organism. The automatic interpretation of MS/MS spectra is increasingly also carried out for de novo sequence analysis of unknown peptides, where the limiting step is the interpretation of the fragment spectra. If mass spectrometric identification fails or is ambiguous, classical protein chemistry methods must be applied (Figure 39.5). Here again mass spectrometric methods play a prominent role, but they are used in conjunction with HPLC or CE separation of peptides, with Edman sequence analysis, or with other analytical techniques.
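The peptide mass fingerprint comparison described above can be sketched in a few lines of Python. This is a deliberately minimal illustration, not a real search engine: the residue mass table covers only a handful of amino acids, the trypsin rule (cleavage C-terminal to K or R, but not before proline) is simplified, and the function names (`tryptic_digest`, `match_fraction`) are hypothetical.

```python
# Minimal peptide-mass-fingerprint sketch (illustrative only).
# Monoisotopic residue masses for a few amino acids; one water per peptide.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
           "F": 147.06841, "E": 129.04259, "D": 115.02694, "T": 101.04768}
WATER = 18.01056

def tryptic_digest(seq):
    """Cleave C-terminal to K or R, except before proline (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(pep):
    """Monoisotopic peptide mass as the sum of residue masses plus water."""
    return sum(RESIDUE[aa] for aa in pep) + WATER

def match_fraction(measured, protein_seq, tol=0.01):
    """Fraction of measured peptide masses explained by the in silico digest."""
    theo = [peptide_mass(p) for p in tryptic_digest(protein_seq)]
    hits = sum(any(abs(m - t) <= tol for t in theo) for m in measured)
    return hits / len(measured)
```

In a real search the match fraction (or a derived score) would be computed against every protein of the organism, and the unmatched masses would be tolerated, exactly as described above for the 30–70% direct assignment rate.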
With these classical methods the throughput is, of course, much smaller than with the typical proteomics workflow. When a protein has been identified via peptide sequences in a (DNA) database, it is analyzed further only if post-translational modifications are suspected. This may be the case if the measured molecular weight or the observed isoelectric point of the protein deviates from the theoretically predicted values. Significant deviations (more than about 0.3 pH units) are most probably due to one or more post-translational modifications. If very accurate information is required (such as in the characterization of recombinant therapeutic proteins), each amino acid residue has to be covered by experimental analytical data even without evidence of a modification. Most often the protein must then be cleaved in independent experiments with different enzymes. Even today such analyses, in which any modification and even the


Mass Spectrometry, Chapter 15

Cleavage of Proteins, Chapter 9

Chromatographic Separation Techniques, Chapter 10
Capillary Electrophoresis, Chapter 12
Amino Acid Sequence Analysis, Chapter 14

Sequence Data Analysis, Chapter 33
Analysis of Post-translational Modifications, Chapter 25


position of a modification has to be determined exactly, are extremely challenging and time consuming, and often the whole arsenal of classical protein chemistry, including mass spectrometric techniques, must be used.

39.4.2 Two-Dimensional Differential Gel Electrophoresis (2D DIGE)

DIGE, Section 11.6.6

Two-dimensional DIGE is a variant of 2D gel electrophoresis that offers some significant advantages over the classical technique. Two (or three) protein extracts are labeled individually with different fluorescent reagents (Cy dyes with a reactive N-hydroxysuccinimide group). The reagents bind covalently to the ϵ-amino group of lysine residues in the protein. A positive charge on the dye compensates for the loss of the positive charge of the lysine residue caused by the derivatization. In addition, the masses of the individual reagents are kept rather small to avoid different migration behavior of labeled and unlabeled protein. The masses and electrophoretic properties of the different dyes are almost identical, so that the migration behavior of a given protein in the 2D gel is almost the same whichever fluorescent dye it carries. Two versions of the DIGE technique are in use. In “minimal labeling” the reaction conditions are chosen so that only a small percentage of the lysine residues of a protein molecule is labeled. In “maximal labeling” the cysteine residues of the proteins are fully derivatized with the dye. Cysteine is used here because it is a rather rare amino acid, so derivatization with the hydrophobic reagent (dye) does not lead to insoluble proteins; complete modification of a frequently occurring amino acid (e.g., lysine) would often render proteins hydrophobic and insoluble. The individual fluorescent dyes have different spectra. One can therefore combine the differently labeled protein extracts and separate them in a single, conventional 2D gel. The proteins of the individual samples are visualized separately by imaging with the appropriate excitation and emission wavelength filters, and the protein patterns are compared quantitatively using special software programs.
With this technique the serious gel-to-gel variations of classical 2D gel electrophoresis can be avoided. This allows comparative, quantitative proteomics with fewer gels, less material consumption, higher accuracy, and in less time. The difficult spot matching between different gels is eliminated because the entire analysis is carried out in a single gel. The fluorescent dyes used exhibit very high detection sensitivity and a higher dynamic range, covering more than five orders of magnitude with a linear calibration curve. Since the protein spots are visible only under UV light, an automatic spot picker is usually necessary to transfer protein spots to mass spectrometric identification. Because several proteins can migrate in the same spot, and the small fraction of fluorescently labeled protein migrates at a slightly higher mass than the unlabeled protein, there is a risk that the picked protein spot (identified by MS) is not identical to the protein imaged and quantified. As with all proteomic techniques, the statistical experimental error of 2D DIGE has to be determined and taken into account when interpreting the results. The experimental error in 2D DIGE arises primarily from sample preparation, variation in protein labeling, and errors in image processing (poor spot detection and background problems due to different fluorescence characteristics of the acrylamide at different wavelengths).

39.4.3 Top-Down Proteomics using Isotope Labels

The main challenge in protein-based proteomics is to preserve the quantitative ratios of the individual protein species across different proteome states. Proteins, as rather large molecules, are sensitive at all levels of structure (primary, secondary, tertiary, and quaternary). Small changes, damage, or different environmental parameters (very common when comparing two or more proteome states over a long time) may lead to differential behavior of certain proteins during sample workup and fractionation. Interactions with surfaces, chromatographic column material, electrophoresis gels, or other proteins cause losses of proteins, unfortunately in an entirely unpredictable manner. It is therefore rather difficult in a top-down proteomics approach, when comparing two or more different


samples and performing independent steps to fractionate and reduce complexity, to keep the quantitative ratios of the proteins unchanged. On the other hand, owing to the diversity and complexity of a proteome, this reduction of complexity is a prerequisite for high-quality results with almost any analysis technique. Independent direct proteome analyses of the same sample (even without sample preparation or fractionation steps) at present do not lead to exactly the same output. Introducing the concept of isotopic labeling and multiplexing may overcome several of the problems associated with independent, multiple label-free analyses. A detailed evaluation of the concept and the available stable isotope methods for top-down proteomics is given in Section 39.6.1.

39.4.4 Top-Down Proteomics using Intact Protein Mass Spectrometry

Top-down proteomics refers to the comprehensive analysis of intact proteins by mass spectrometry. Whereas bottom-up proteomics relies on peptides derived from enzymatic digestion of proteins, top-down proteomics analyzes the intact protein, including its modifications and other forms of sequence variation. The advantage of this approach is the ability to detect and measure single-nucleotide polymorphisms, isoforms, splice variants, post-translationally modified proteins, and even whole protein complexes. Unsurprisingly, everything changes when enzymatic digestion is excluded. In particular, top-down proteomics avoids the imperfect process of reconstructing a set of related protein forms from a collection of peptides. This chapter covers the basic concepts and terminology of intact protein mass spectrometry: measuring the intact mass (MS1), fragmentation and the measurement of product ions (MS2), data analysis, and high-throughput top-down proteomics. To begin, a classic example of top-down mass spectrometric analysis is presented: the 76-amino-acid protein ubiquitin. Although a multitude of approaches are available for intact protein mass spectrometry, the discussion here centers on high-resolution data acquired by electrospray ionization on a Fourier-transform mass spectrometer operating in positive ion mode.

39.4.5 Concepts in Intact Protein Mass Spectrometry

At first glance it is easy to notice the differences between the mass spectrum of an intact protein (Figure 39.6a) and that of a peptide. The set of major peaks shown in Figure 39.6 actually represents a single protein, ubiquitin. This contrasts with a peptide MS1 spectrum, in which distinct peptides are typically observed as 1+, 2+, and/or 3+ ions. The set of major peaks in the protein MS1 represents the different charge states of ubiquitin that were desolvated and transferred into the gas phase by electrospray ionization. Although the interpretation of an intact protein mass spectrum may seem intimidating, the basic concepts presented in this chapter provide the framework needed to analyze and understand top-down proteomics data. In Figure 39.6b, an enlargement of the mass spectrum shows that the 11+ charge state peak is not a single signal but consists of a distribution of peaks. Known as isotopomers, each of these peaks has the same chemical formula and charge but differs in the number of heavy isotopes present in the molecule. This multitude of different masses for the same protein leads to two important concepts: the monoisotopic mass and the average mass. The monoisotopic mass is calculated using the exact mass of the most abundant isotope of each element present in the molecule. In small peptides, such as the one shown in Figure 39.7a, the monoisotopic mass is the most abundant mass in the spectrum. In larger proteins (Figure 39.7b), however, the monoisotopic mass can have a very low abundance. This is because the likelihood that at least one atom is a heavy isotope increases with the overall number of atoms. This leads to the average mass, the abundance-weighted average of the atomic masses of all isotopes of each element present in the molecule.
The average masses for the peptide and the protein are indicated in Figure 39.7. In proteins, the relative abundance of each isotopomer is influenced by all of the

Mass Spectrometry, Chapter 15


Figure 39.6 A top-down mass spectrum of ubiquitin. (a) The charge state distribution of the intact protein and (b) the isotopic distribution of the 11+ charge state with the monoisotopic and average mass indicated.

elements that make up the molecule. However, because carbon has the most abundant naturally occurring heavy isotope, it has the largest effect on the overall isotopic distribution. The number of heavy isotopes in each peak is therefore indicated by the subscript next to the 13C; for example, 13C5 represents the protein with exactly five heavy isotopes. The difference between average and monoisotopic mass can be remarkably large and increases with protein size. Table 39.1 highlights this point with examples of large, medium, and small proteins compared with a peptide. As previously highlighted, one of the most distinctive differences between the mass spectrum of an intact protein and that of a peptide is the presence of multiple charge states. These charge states arise from solution equilibria and electrospray ionization, which in positive ion mode protonates basic residues found throughout the protein. Unlike peptides, proteins often contain multiple basic residues that can be protonated, which creates a distribution of charge states. The number of accessible, ionizable residues influences the number of charge states for a given protein. As a general rule of thumb, the total number of


Figure 39.7 Isotopic distributions and monoisotopic and average mass. A comparison of the isotopic distributions of a small peptide (a) and a protein (b). The monoisotopic and average masses are indicated for each.
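The shift of the envelope away from the monoisotopic peak with increasing molecular size, as illustrated in Figure 39.7, can be sketched with a simple binomial model. The sketch below deliberately considers carbon only (natural 13C abundance of about 1.07%), ignoring the smaller contributions of H, N, O, and S; the carbon count of 378 used for ubiquitin corresponds to its elemental formula, and the function name is hypothetical.

```python
from math import comb

P13C = 0.0107  # natural abundance of 13C; carbon dominates the envelope

def carbon_envelope(n_carbons, n_peaks=6):
    """Relative abundance of peaks with 0, 1, 2, ... heavy carbons (binomial)."""
    return [comb(n_carbons, k) * P13C**k * (1 - P13C)**(n_carbons - k)
            for k in range(n_peaks)]

# A small peptide (~25 carbons): the all-12C (monoisotopic) peak dominates.
small = carbon_envelope(25)
# Ubiquitin has roughly 378 carbons: the monoisotopic peak is now minor
# and the envelope maximum sits several heavy isotopes higher.
ubi = carbon_envelope(378)
```

Running this confirms the qualitative picture of Figure 39.7: for 25 carbons the k = 0 peak is the largest, while for 378 carbons the envelope peaks near four heavy isotopes and the monoisotopic abundance drops to a few percent.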

Table 39.1 The average and monoisotopic masses of four human proteins and a peptide.

Uniprot accession number   Name                               Amino acid length   Monoisotopic mass (Da)   Average mass (Da)   Difference (Da)
Q8WZ42                     Titin                              34350               3813651.757              3815992.986         2341.229
P02787                     Serotransferrin                    679                 75146.569                75194.920           48.351
P62979 [1-76]              Ubiquitin                          76                  8559.617                 8564.757            5.140
Q8IVG9                     Humanin                            24                  2685.482                 2687.240            1.758
P01858                     Phagocytosis-stimulating peptide   4                   500.307                  500.594             0.286

observable charge states is roughly equal to the intact mass of the protein divided by 1000. In Figure 39.6a, ten distinct charge states are observed for ubiquitin, ranging from 14+ to 5+. To determine the average or monoisotopic neutral mass of ubiquitin from the experimental data, we employ the equation:

M = m·z − MH·z

where M is the average or monoisotopic neutral mass of the protein, m is the observed mass-to-charge (m/z) ratio, z is the observed charge, and MH is the mass of a proton, generally taken as 1.00727 Da. For theoretical work, this equation can be rearranged to give the m/z ratio for a given charge state or set of charge states:

m/z = (M + MH·z) / z

As provided in Table 39.1, the monoisotopic mass and the average mass of ubiquitin are 8559.617 Da and 8564.757 Da, respectively. Using this equation, we can determine the theoretical monoisotopic and average m/z values for each of the ten charge states that were observed for ubiquitin (Figure 39.6a; Table 39.2). The charge state can be directly measured using a high-resolution mass spectrometer that is capable of resolving the peaks of a protein’s individual isotopomers. A collection of isotopomers is known as the isotopomer envelope and measuring the distance between isotopic peaks within an envelope can directly indicate the charge state. Because isotopes differ by one neutron, the spacing between isotopic peaks will be ∼1 divided by the charge of the ion. For example, Figure 39.6b presents the isotopic distribution for the 11+ charge state of ubiquitin.
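The two relations above, together with the ∼1/z isotope spacing rule, translate directly into code. The short sketch below uses the proton mass quoted in the text and reproduces a few entries of Table 39.2; the helper names are our own.

```python
PROTON = 1.00727  # Da, proton mass as used in the text

def mz(neutral_mass, z):
    """Theoretical m/z of the [M + zH]^z+ ion: (M + MH*z) / z."""
    return (neutral_mass + z * PROTON) / z

def neutral_mass(observed_mz, z):
    """Invert the relation: M = m*z - MH*z."""
    return observed_mz * z - z * PROTON

def charge_from_spacing(delta_mz):
    """Isotopic peaks are ~1/z apart in m/z, so z is about round(1/spacing)."""
    return round(1 / delta_mz)
```

For the 11+ charge state of ubiquitin, `mz(8559.617, 11)` gives the monoisotopic value 779.154 listed in Table 39.2, and `charge_from_spacing(0.091)` recovers the charge of 11 from the observed isotopic spacing.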


Table 39.2 Monoisotopic and average m/z values for ten observed charge states from ubiquitin.

Charge state       Monoisotopic m/z   Average m/z
M (neutral mass)   8559.617           8564.757
5                  1712.931           1713.959
6                  1427.610           1428.467
7                  1223.810           1224.544
8                  1070.959           1071.602
9                  952.076            952.647
10                 856.969            857.483
11                 779.154            779.622
12                 714.309            714.737
13                 659.439            659.835
14                 612.408            612.776

The distance between the isotopic peaks is 0.091 m/z, corresponding to the expected spacing of ∼1/11. Taken together, high-resolution mass spectrometry makes it possible to isotopically resolve a single ion species and thereby determine its charge state. Furthermore, the observation of a protein with isotopic resolution provides the means to calculate a much more accurate mass value.

SILAC requires a high isotopic purity (>98%) of the isotope-labeled amino acids. The proteome of autotrophic organisms can be labeled only incompletely and with difficulty by SILAC. Therefore, for MS-based quantitative proteomics studies of green plants, such as Arabidopsis thaliana, metabolic 14N/15N labeling or chemical labeling methods may be used. At this point it must be emphasized that, for a quantitative analysis of two different experimental conditions, the experiments must be repeated independently several times in order to assess the technical and biological variance of the data. This applies equally to the methods described below.

Chemical Stable Isotope Labeling

Chemical stable isotope labeling is a universal method, since it can be applied to almost any sample, such as cultured cells, body fluids, or tissue biopsies from any organism. The introduction of stable isotopes by a suitable chemical reaction can be carried out directly on the intact proteins and is thus suited to the top-down proteomics strategy (see above, Metabolic Labeling). These reagents are, however, generally also suitable for labeling the proteolytic cleavage products after enzymatic digestion of the proteins in bottom-up proteome strategies (Section 39.5). Among the strategies currently available are cysteine-specific labeling with isotope-coded affinity tags (ICATs) and amino-group-specific labeling of proteins with the isotope-coded protein label (ICPL).
ICAT – Isotope-Coded Affinity Tag Method

Initial results with this analysis strategy were published by Aebersold as early as 1999. Figure 39.22 shows the ICAT reagent and Figure 39.23 the principle of the method. All proteins of one proteome state are derivatized with the “light” version of the ICAT reagent on the SH groups of their cysteine residues. The proteins of the other proteome state are labeled with the “heavy” version of the ICAT reagent. The two reagents are chemically identical but differ in eight hydrogen atoms that are replaced by deuterium atoms. After derivatization, a protein molecule with, for example, one cysteine residue from the first (“light”) proteome state therefore has a molecular mass 8 Da lower than the same

Figure 39.22 The ICAT reagent, X = hydrogen in light reagent, deuterium in heavy reagent.

39 Proteome Analysis

1017

Figure 39.23 Schematic representation of the ICAT technique. The cysteine residues of all proteins in the control (e.g., healthy tissue) are reacted with the light ICAT label, the cysteines of the condition to be tested (e.g., diseased tissue) with the heavy label. The differently labeled proteomes are mixed in a 1 : 1 ratio, enzymatically digested, and the ICAT-labeled cysteine containing peptides are isolated over an affinity column and subsequently identified by LC/MS/MS and quantified relative to each other.

protein from the second (“heavy”) state. The two isotopically labeled proteome states are mixed in a 1 : 1 ratio and enzymatically cleaved into peptides. Since only the cysteine-containing peptides are modified, only these peptides occur in a light and a heavy form, and only such peptide pairs can be assigned to the different proteome states. Because the ICAT reagent carries an affinity tag (biotin), all modified peptides can be isolated on a streptavidin affinity column. After optional further separation steps, the co-eluting ICAT peptide pairs are analyzed by mass spectrometry. Using special software, the peptide pairs differing by 8 Da (or a multiple of 8 Da if more than one cysteine is present in the peptide) are quantified, and identified on the basis of MS2 data. The signal intensities or peak areas of the peptide pairs in the MS spectra reflect the relative amounts of the peptides and therefore of their parent proteins in the individual proteome states. The basic weakness of the ICAT method is that the isotopic labeling takes place on a rare amino acid. After derivatization and proteolytic cleavage of the proteins, only the cysteine-containing peptides are isolated by affinity chromatography. This generally results in very low sequence coverage, and only proteins that contain cysteine in their primary sequence can be quantified at all. Protein isoforms, degradation products, or post-translational modifications that are not located on a cysteine-containing peptide are not recognized. In addition, the ICAT reagent is relatively large, does not react completely, and shows unspecific reactions owing to the rather long reaction times. Moreover, the thioether is chemically labile toward atmospheric oxygen, which may lead to uncontrolled removal of the label by β-elimination, and the relative quantification of a protein is usually based on only a few peptides.
Today a new generation of ICAT reagents with 12C/13C isotopes and acid-labile linkers is available. An interesting application of these ICAT reagents is, for example, the quantitative study of oxidative thiol modifications in proteins.
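The pairing and quantification step of the ICAT workflow can be illustrated with a small sketch that scans a centroided peak list for masses differing by a multiple of 8 Da (the light/heavy shift per labeled cysteine of the original d0/d8 reagent). A real implementation would work on charge-deconvoluted masses and validate candidate pairs chromatographically; this toy version only shows the principle, and all names are hypothetical.

```python
# Illustrative search for light/heavy ICAT peptide pairs in a peak list
# of (neutral_mass, intensity) tuples. A peptide with n labeled cysteines
# appears as a pair separated by n * 8 Da.
DELTA = 8.0  # Da per labeled cysteine (d0/d8 ICAT)

def find_pairs(peaks, max_cys=3, tol=0.02):
    """Return (light_mass, heavy_mass, light/heavy ratio) for peaks
    separated by n * 8 Da within the mass tolerance."""
    pairs = []
    for m_light, i_light in peaks:
        for m_heavy, i_heavy in peaks:
            for n in range(1, max_cys + 1):
                if abs((m_heavy - m_light) - n * DELTA) <= tol:
                    pairs.append((m_light, m_heavy, i_light / i_heavy))
    return pairs
```

The returned intensity ratio corresponds to the relative amount of the peptide, and hence its parent protein, in the two proteome states, exactly as read from the peak areas in the MS spectra described above.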


Figure 39.24 The four ICPL reagents. The heavy stable isotopes 13C and 2H (D, deuterium) are shown in gray.

ICPL – Isotope-Coded Protein Label

Another protein-based proteome analysis employing isotopes is the ICPL method. For a comparative proteome analysis the proteins of the different proteome states are fully labeled at their numerous free amino groups with the different ICPL isotopologues (Figure 39.24). After cleavage, all lysine-containing peptides of a protein can thus be used for quantification. The ICPL reagents are currently available in four isotopic variants. They change the properties of the proteins but still allow fractionation with all separation techniques established in protein chemistry. The principle of the method is depicted in Figure 39.25. After labeling, the proteomes labeled with the different isotopic reagents are combined and thereby the relative proportions of the individual proteins of the different states are

Figure 39.25 All proteins of the proteomes under investigation are labeled on all amino groups with the various ICPL reagents. The labeled proteomes are combined and separated at the protein level as efficiently as possible (preferably also multidimensionally). The proteins of the now relatively low-complexity fractions are enzymatically digested and the peptides (if necessary after further separation steps) are analyzed by mass spectrometry, quantified, and identified.


fixed. To reduce complexity at the protein level, electrophoretic techniques (isoelectric focusing, 1D-PAGE, 2D-PAGE), chromatographic separation methods, or combinations thereof can be used. The information on the relative proportions of the corresponding proteins from the various states is retained even during multidimensional fractionation. The strategic goal of this step is to separate the proteome into a large number of fractions, each containing only a small number of proteins. The proteins in the individual fractions are then cleaved enzymatically into peptides. With the ICPL approach, tryptic cleavage becomes arginine-specific because all lysine residues are derivatized and no longer accessible to cleavage. Double cleavage with two proteases (e.g., trypsin and Glu-C) is therefore recommended to obtain smaller peptides, which are easier to analyze by mass spectrometry. After cleavage, relatively simple peptide mixtures are obtained that, since only a few proteins should be present, can be analyzed by mass spectrometry and quantified automatically. Dedicated software programs such as ICPLQuant, MaxQuant, or ProteinScape (Bruker) can handle the multiplexed spectra of the ICPL workflow and recognize the peptide pairs. The analysis at the MS1 level already indicates which peptides/proteins differ quantitatively between the proteome states; only those peptides must be further analyzed by MS2 to identify them and deduce the corresponding proteins. Protein isoforms, degradation products, and post-translationally modified proteins can be separated at the protein level, in which case each of these protein species is recorded and quantified separately. ICPL technology thus enables comprehensive quantitative proteome analysis of various samples (body fluids, tissues, etc.) and can generate differential protein patterns quickly and efficiently, with quantification obtained at the MS1 level.

39.6.2 Stable Isotope Labeling in Bottom-Up Proteomics

To circumvent the quantification problems of the label-free techniques, isotope labeling can also be used with bottom-up proteome strategies (see Sections 39.5.5 and 39.5.7 and Figure 39.20).

Non-isobaric Labeling

With non-isobaric labels the labeled peptides differ in molecular mass; quantification can therefore be performed at the MS1 level.

18O Labeling

Chemical labeling with 18O is achieved enzymatically (e.g., by trypsin) in a reaction with H2 18O carried out directly during the cleavage reaction. Two 18O atoms are incorporated into the C-terminal carboxyl group, producing a +4 Da shift that is easily detected in the MS1 analysis. The mass shift from 18O incorporation alters neither the chromatographic separation nor the ionization efficiency of the labeled peptides. Loss of the isotopic label has been reported, since the mechanism of 18O incorporation is reversible. Immobilized trypsin should therefore be used, so that the trypsin can be removed from the sample after the labeling step.

Labels with Different Numbers of Stable Isotope Atoms (13C or 2H) in the Reagent (e.g., ICPL, see above)

Mostly amino-group-specific reagents that react with all ϵ-amino groups of

lysines and with the N-terminal amino groups are used to introduce the labels, which usually contain 13C or 2H isotopes. Hydroxysuccinimide esters are predominantly used as the reactive group, and most of the peptides in a tryptic or LysC digest are labeled. Here, too, the isotopic peptide pairs can be identified and quantified already in the MS1 analysis. Higher multiplexing is possible; however, the complexity of the spectrum increases accordingly. After introducing the label, the various proteome states can be pooled, as in top-down proteomics (see the section above on Metabolic Labeling). Evaluating the introduced mass difference ensures the assignment of the peptides to the individual proteome states. Although theoretically almost all peptides should be labeled, so that high sequence coverage in the mass spectrometric analysis seems attainable, in practice such expectations are not met. This is mainly due to the enormous complexity generated by the enzymatic protein cleavage: tens of thousands of proteins are cleaved into several hundred thousand peptides. In addition, by labeling with reagents of different masses the complexity of the peptide


Chromatographic Separation Methods, Chapter 10 Electrophoretic Techniques, Chapter 11 Cleavage of Proteins, Chapter 9

Mass Spectrometry, Chapter 15


Part V: Functional and Systems Analytics

Figure 39.26 Four isobaric iTRAQ reagents (Sciex/Sigma-Aldrich).

mixture is further multiplied (e.g., quadrupled for an experiment with four different isotope reagents). This complexity exceeds the capacity of today's chromatographic and electrophoretic methods, so that only a portion of the proteome can be quantified when such mixtures are subsequently analyzed by high-resolution mass spectrometry. Multidimensional peptide separations are therefore usually necessary for a comprehensive peptide-based proteome analysis. The development of modern mass spectrometers and the corresponding software in recent years has been tailored to this peptide-based workflow. As a consequence, an automatic, sensitive, and quantitative high-throughput analysis of peptides, partially including mass spectrometric identification, is routine today.

Isobaric Labeling The currently most attractive peptide-based approaches using isotopic labeling are carried out with isobaric reagents, such as iTRAQ (isobaric tags for relative and absolute quantitation; Sciex/Sigma-Aldrich) (Figure 39.26) or TMT (tandem mass tag; Thermo Scientific). Here, isobaric labels are introduced onto the amino groups of the various proteome states after enzymatic cleavage. The corresponding peptides of a protein from the various proteome states thus have the same mass. Only in the MS/MS analysis are reporter groups released that are specific for each reagent, so that the quantitative relations between the different proteome states can be recognized (Figure 39.27).

Figure 39.27 In the iTRAQ method four enzymatically cleaved proteome states are each labeled with one of the isotopic iTRAQ reagents. The derivatized peptides from all states are isobaric and co-elute (e.g., in a chromatographic separation). In the MS/MS analysis of such a peptide mixture, the four reporter groups (114–117) are released, and their relative intensities reflect the relative amounts of the peptide in the single proteome states. Source: Ross, P.L. et al. (2004) Mol. Cell. Proteomics, 3, 1154–1169.
With permission, Copyright © 2004, by the American Society for Biochemistry and Molecular Biology.

39 Proteome Analysis

Some major advantages of this method are:

 a high intensity of the peptide signal in the MS spectra, since all the signals of a certain peptide from all states co-migrate, which usually allows high quality MS/MS spectra of the peptide for identification;
 up to ten different forms of an isobaric reagent are commercially available, from which it follows that up to ten different samples can be analyzed in a single experiment – this allows increased throughput associated with reduced costs;
 despite multiplexing there is no increase in MS1 complexity.

One has to keep in mind that methods using isobaric labels in combination with the bottom-up strategy have all the drawbacks of peptide-based approaches given above (Section 39.5.2 and Chapter 14, Introduction). In addition, all peptides must be analyzed by MS/MS techniques, because the quantification is carried out by releasing the reporter ions and the quantitative relationships are thus only recognizable in the fragment spectra. Therefore, all peptides, including those that do not change in their amount and are often not of interest, have to be analyzed by MS/MS. However, this is done quickly and automatically with the latest generation of mass spectrometers. Owing to the high complexity of the spectra, there is a risk that the precursor selected for fragmentation includes not only the signals of a single peptide. During MS2 fragmentation all co-isolated peptides liberate reporter ions, but not all of these peptides are identified in the MS/MS analysis, which in the end may give false quantifications.
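The reporter-ion readout described above can be sketched in a few lines of Python. The peak list, nominal reporter masses, and tolerance window below are simplified illustration values, not vendor specifications:

```python
# Sketch of iTRAQ 4-plex reporter-ion quantification from a single MS/MS
# spectrum. Peak list and tolerance are hypothetical illustration values.

ITRAQ_REPORTERS = [114.1, 115.1, 116.1, 117.1]  # nominal reporter m/z values

def reporter_ratios(peaks, tolerance=0.05):
    """Return relative reporter intensities (summing to 1.0) for the four
    proteome states; peaks is a list of (mz, intensity) tuples."""
    intensities = []
    for target in ITRAQ_REPORTERS:
        # sum all peaks falling inside the tolerance window around the reporter
        total = sum(i for mz, i in peaks if abs(mz - target) <= tolerance)
        intensities.append(total)
    grand_total = sum(intensities)
    if grand_total == 0:
        return [0.0] * len(ITRAQ_REPORTERS)
    return [i / grand_total for i in intensities]

# hypothetical fragment spectrum: reporter region plus two sequence ions
spectrum = [(114.11, 1000.0), (115.11, 2000.0), (116.10, 1000.0),
            (117.12, 4000.0), (300.2, 5000.0), (450.3, 2500.0)]
print(reporter_ratios(spectrum))  # relative amounts across the four states
```

In a real workflow the same extraction would be repeated for every MS/MS spectrum, and the per-spectrum ratios aggregated per peptide and per protein.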

Further Reading

Sections 39.1–39.3 and 39.6
Kellermann, J. and Lottspeich, F. (2012) Isotope-coded protein label. Methods Mol. Biol., 893, 143–153.
Liebler, D.C. (ed.) (2002) Introduction to Proteomics – Tools for the New Biology, Humana Press, Totowa.
Lottspeich, F. and Kellermann, J. (2011) ICPL labeling strategies for proteome research. Methods Mol. Biol., 753, 55–64.
Posch, A. (ed.) (2008) 2D PAGE: Sample Preparation and Fractionation, vols 1 + 2, Springer Verlag, Berlin.
Vogt, A., Fuerholzner, B., Kinkl, N., Boldt, K., and Ueffing, M. (2013) Isotope coded protein labeling coupled immunoprecipitation (ICPL-IP): a novel approach for quantitative protein complex analysis from native tissue. Mol. Cell. Proteomics, 12, 1395–1406.
Von Hagen, J. (ed.) Proteomics Sample Preparation, Wiley-VCH Verlag GmbH, Weinheim.
Walker, J.M. (2002) Protein Protocols Handbook, Methods in Molecular Biology, Springer Verlag, Berlin.
Wilkins, M.R., Williams, K.L., Appel, R.D., and Hochstrasser, D.E. (eds) (1997) Proteome Research: New Frontiers in Functional Genomics, Springer, Heidelberg.

Section 39.4
Chait, B.T. (2006) Chemistry: mass spectrometry: bottom-up or top-down? Science, 314 (5796), 65–66.
Han, X., Jin, M., Breuker, K., and McLafferty, F.W. (2006) Extending top-down mass spectrometry to proteins with masses greater than 200 kilodaltons. Science, 314 (5796), 109–112.
Kelleher, N.L. (2014) A cell-based approach to the human proteome project. J. Am. Soc. Mass Spectrom., 23, 1617–1624.
Kelleher, N.L. (2004) Top-down proteomics. Anal. Chem., 76 (11), 197A–203A.
Reid, G.E. and McLuckey, S.A. (2002) 'Top down' protein characterization via tandem mass spectrometry. J. Mass Spectrom., 37 (7), 663–675.

Section 39.5
Barnidge, D.R., Dratz, E.A., Martin, T., Bonilla, L.E., Moran, L.B., and Lindall, A. (2003) Absolute quantification of the G protein-coupled receptor rhodopsin by LC/MS/MS using proteolysis product peptides and synthetic peptide standards. Anal. Chem., 75, 445–451. First SRM proteomics experiment.
Desiderio, D.M. and Kai, M. (1983) Preparation of stable isotope-incorporated peptide internal standards for field desorption mass spectrometry quantification of peptides in biologic tissue. Biomed. Mass Spectrom., 10, 471–479. First stable isotope labeled reference peptides for robust sample comparison.


Ebhardt, H.A., Root, A., Sander, C., and Aebersold, R. (2015) Applications of targeted proteomics in systems biology and translational medicine. Proteomics, 15 (18), 3193–3208. SRM literature.
Liu, Y., Buil, A., Collins, B.C., Gillet, L.C., Blum, L.C., Cheng, L.Y., Vitek, O., Mouritsen, J., Lachance, G., Spector, T.D., Dermitzakis, E.T., and Aebersold, R. (2015) Quantitative variability of 342 plasma proteins in a human twin population. Mol. Syst. Biol., 11, 786. Application of SWATH-MS for a longitudinal twin study.
Picotti, P., Bodenmiller, B., Mueller, L.N., Domon, B., and Aebersold, R. (2009) Full dynamic range proteome analysis of S. cerevisiae by targeted proteomics. Cell, 138, 795–806. First perturbed protein network quantification using SRM.

Chapter 40: Metabolomics and Peptidomics

Peter Schulz-Knappe and Hans-Dieter Zucht
Protagen AG, Otto-Hahn-Str. 15, 44227 Dortmund, Germany

Over the last decade the bioanalytical sciences have evolved from firmly established areas such as genomics to a whole set of novel "omics"-type technologies. These disciplines are often defined by their aim to comprehensively investigate a certain class of biomolecules. For example, whilst transcriptomics covers the function and expression of messenger RNA molecules, proteomics and peptidomics have emerged with a focus on the comprehensive evaluation of proteins and biologically processed peptides. The research area referred to as metabolomics is dedicated to the analysis of the multitude of low-molecular-weight, organic, non-polymeric molecules (metabolites) in order to assess biological phenotypes or biochemical states. Non-targeted metabolite profiling involves the identification and characterization of a large number of metabolites and their precursors. In contrast, targeted metabolite profiling focuses on quantitative changes in metabolites of interest (e.g., amino acids, carbohydrates, steroids, and fatty acids) based on a priori knowledge of the biological function or metabolic pathway. The driving forces for innovation in metabolomics research are the improvements in analytical technology and instrumentation, facilitating the quantitative and highly specific multiplex analysis of thousands of different molecules. A second factor is the revolution in information technology, providing tools to handle large volumes of measurement data. Hence bioinformatics has become an integral part of metabolomics. The following terminology is used in this field, splitting it into certain aspects:

 Metabolomics: comprehensive analysis of all metabolites of a cell, an organism, or a body fluid under well-defined conditions.

 Metabonomics: analysis of biochemical alterations under the influence of a disease, a drug intervention, or a toxin.

 Metabolite profiling: selective analysis of a certain subset of metabolites such as amino acids or fatty acids.

 Metabolic profiling: comprehensive survey analysis of a large set of biomolecules, often quantitatively and in a time-dependent manner. Applications are the clinical and pharmaceutical analysis of a drug and its metabolites, including its kinetics, conversion into intermediates, and degradation/clearance; often used in conjunction with the term metabolite profiling.
 Metabolic fingerprinting: classification of samples with respect to biological relevance or origin without identification of individual metabolites; this involves rapid, high-throughput global analysis to discriminate between samples of different biological status or origin.
 Metabolite target analysis: description of the relevance of a target by analyzing its substrate, for example the analysis of metabolites of a distinct enzymatic cascade that is altered by abiotic or biotic influences.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Metabolomics The comprehensive analysis of metabolites of a cell, an organism, or a body fluid under defined conditions.


Figure 40.1 Representation of complex structures of biochemical pathways. Adapted and modified from “Metabolism of water-soluble vitamins and cofactors” (reactome.org; (Homo sapiens); Jassal B., D’Eustachio P., Stephan R.; 10.3180/REACT_11238.1).

Metabolome The multitude of metabolites in a biological system. The metabolic state can be described mathematically by considering the metabolic entities, their concentrations, spatial distribution, and time-dependent alterations.

The terminology listed here is somewhat ambiguous; several definitions overlap in their meaning and their use throughout the scientific literature. Such a classification aims to outline the different aspects used to analyze the metabolome. The analysis of all "omics" classes has the goal of providing a comprehensive, encyclopedic picture of a biological system, such as a cell or an organ, and its actual state under the influence of environmental factors or drugs. The aim of "omics" approaches is to understand in a more holistic way how genetic information translates into function and to provide a broad, hopefully sufficient, data basis for the interpretation of an emerging phenotype. Thus, in systems biology thinking, transcriptomics, proteomics, peptidomics, and metabolomics span the entire path from genotype to phenotype. Historically, the study of metabolic synthesis and degradation processes in cells was initially dedicated to understanding their basic organizational principles, characteristically mediated through enzymatic catalysis. Milestone achievements have been the discovery of common basic pathways for providing metabolic building blocks, handling energy, and regulating these processes. Common pathways such as the Krebs cycle, the pentose phosphate pathway, and glycolysis are profound examples of such basic organizational principles. They illustrate the tight relationship between metabolites and their processing enzymes. The regulatory principles are mainly based on molecular interactions, and enzyme kinetics is able to describe the individual steps. The identification of all the members of such pathways was the basis on which to create directed graphs mapping the multitude of interactions between the compounds, depicted in textbooks as a kind of knowledge representation (Figure 40.1). Such textbook images provide a handle for a qualitative general understanding and remain valuable educational tools today.
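The directed-graph view of such a pathway can be sketched in a few lines of Python. The pathway fragment below (upper glycolysis, heavily simplified) and the traversal are illustrative only, not a faithful reconstruction of Figure 40.1:

```python
# Minimal sketch of a metabolic pathway as a directed graph, with a traversal
# that lists every metabolite reachable downstream of a starting compound.
# The pathway fragment (upper glycolysis) is simplified for illustration.

pathway = {
    "glucose": ["glucose-6-phosphate"],
    "glucose-6-phosphate": ["fructose-6-phosphate", "6-phosphogluconolactone"],
    "fructose-6-phosphate": ["fructose-1,6-bisphosphate"],
    "fructose-1,6-bisphosphate": ["glyceraldehyde-3-phosphate"],
}

def downstream(graph, start):
    """Depth-first search returning all metabolites reachable from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for product in graph.get(node, []):
            if product not in seen:
                seen.add(product)
                stack.append(product)
    return seen

print(sorted(downstream(pathway, "glucose")))
```

Databases such as KEGG and Reactome store essentially this kind of structure, enriched with enzymes, stoichiometry, and cross-references.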
However, they do not allow for a comprehensive representation of the quantitative, qualitative, spatial, and temporal information produced by modern "omics"-type approaches. This explains the important role that computational methods play in today's knowledge representation of metabolomics. The knowledge base is heavily supported by computational methods such as dedicated databases, networking tools, and visualization software for interacting with metabolomics databases. Examples are the KEGG (Kyoto Encyclopedia of Genes and Genomes) and Reactome (http://www.reactome.org) databases or wiki-based systems like WikiPathways (http://www.wikipathways.org/index.php/WikiPathways), which serve the metabolomics researcher. Today, many organisms, including humans, bacteria, and some plants, have been thoroughly studied, and the catalogue of known metabolites can be considered fairly complete. Surprisingly, the number of metabolites in humans appears to be quite small. Current estimates for humans are in the range of 2500 catalogued metabolites (excluding gut microbial metabolites), whereas the number in plants exceeds 100 000. How can such a relatively small number of metabolites account for the known complex phenotypes? One possible explanation might be the highly dynamic changes in concentrations and the concerted, time- and location-dependent regulation of serial as well as parallel reactions. Highly complex regulation can emerge even with a seemingly restricted repertoire of active players (= metabolites). Interestingly, the number of protein and peptide species in humans is quite large. Estimates of the number of individual protein species in humans range from 100 000 to 1 000 000. The reason for this diversity of proteins lies in the fact that posttranslational modifications within


those polymers can create a large number of variants based on permutation. As peptides result from multiple, protease-mediated cleavages of proteins (Greek peptein = "to digest"), their number is estimated to be even higher, and several million different species might be present (Figure 40.2). Peptides, as processing products of their respective precursor proteins, can be considered in some ways to be "metabolites" of proteins: their molecular mass typically ranges from 1 to 15 kDa (equivalent to chain lengths of 7–150 amino acids).

40.1 Systems Biology and Metabolomics The overarching goal of metabolomics is the detailed understanding of the metabolic rules, integration, and regulation of entire organisms. Over recent years a novel research field has evolved that is known as systems biology. The "omics" disciplines, complemented by systems theory, form the integral parts of systems biology approaches. Systems biology describes metabolic systems in a qualitative, quantitative, and time- and space-resolved manner. Supporting disciplines are systems theory, mathematics, biostatistics, computing, and physics. Life can be defined as the coordinated and complex interaction of (mostly organic) chemical molecules in a living system in the context of the local and distant environment. Metabolomics delivers highly relevant data and knowledge about interactions of metabolites, equilibria, and substance fluxes as responses to the environment (Figure 40.3). A common interpretation in systems biology is that self-organizing biological systems exhibit distinct properties and reactivities. These are adapted to rapid and slow changes in the environment through regulation and evolution. Systems biologists try to simulate such systems using computer simulations of metabolites and certain heuristic models. A growing number of genomes are now known (including those of many mammals, bacteria, viruses, and plants). Whereas a genome represents a static description of an organism, the other biological levels manifest themselves as qualitatively as well as quantitatively highly dynamic, interdependent data. Importantly, external influences regularly affect many, if not most, parameters in a "dose-dependent" way, feeding back on gene expression and on all the many downstream systems, especially proteins and peptides, but also sugars, lipids, and

Figure 40.2 Organizational structure from genotype to phenotype.

Figure 40.3 Hierarchical representation of the complexity of living systems. Genomics, transcriptomics, proteomics, peptidomics, and metabolomics generate the basis for the formulation of biochemical pathways. The organizational hierarchy over five layers is shown in this sketch adapted and modified from Oltvai, Z. N. and Barabasi, A.-L. (2002) Science 298(25), 763.


metabolites. All this is necessary to allow for growth, differentiation, repair, and reproduction, at the right time and in the correct location.

40.2 Technological Platforms for Metabolomics Metabolites are chemical entities that are – in contrast to genes, proteins, peptides, and polysaccharides – small, non-polymeric molecules. The listing below gives an overview of metabolites. Metabolites are analyzed using all the methods of conventional chemical analysis and analytical instrumentation. Highly resolving spectroscopy and spectrometry methods have been developed in past years that are able to exploit small specific characteristics of biomolecules. Especially useful and common are methods of molecular spectroscopy such as NMR, IR, and Raman spectroscopy, and spectrometric methods, especially mass spectrometry, which is often coupled to some kind of sample preparation. The coupling of analytical instruments to automated molecular separation technology such as high-performance liquid chromatography or gas chromatography enhances the resolution power and selectivity of an analytical workflow, providing procedures that can handle thousands of molecules in one analysis run. Most technologies need extraction procedures (e.g., MS, LC-MS, and GC-MS) and are restricted to snapshot analysis but some technologies (e.g., Raman and NMR) can even provide real time data of living organisms. The analytical platforms offer often both qualitative as well as quantitative data thus enabling cataloging of survey analysis and kinetics. Metabolites:

 amino acids and their derivatives
 sugars
 ATP and other signaling molecules
 cholesterol and derivatives
 fatty acids, hydroxy acids, dicarboxylic acids, polyamines

NMR, Chapter 18

Infrared Spectroscopy, Section 7.4

For a comprehensive analysis of metabolomes as complex mixtures of diverse components it is necessary to utilize a series of different technologies. To do this in a meaningful manner, it is necessary to distinguish methods of low and high specificity. Methods with low specificity usually address general features of a sample that are not based on individual molecules alone. Nuclear magnetic resonance (NMR) and infrared (IR) spectroscopy are most often used. These methods display chemical moieties of groups of metabolites. The results generated form a molecular fingerprint, and high numbers of samples can be analyzed in a short time. This is used for the classification of samples, but in terms of sensitivity or structural identification of individual components these methods have clear limitations. Biofluids such as blood plasma, serum, urine, or cerebrospinal fluid can be assessed easily with NMR and IR, as the sample preparation and measurement effort are very low. In NMR spectroscopy, signals are generated from the nuclei of certain atoms such as 1H or 13C. After excitation, they emit signals whose frequency and strength vary according to the chemical context within the molecular structure. This leads to defined alterations of the nuclear magnetic resonance spectra. NMR spectroscopy yields many different signals from a single analyte, because organic biomolecules contain many functional groups with distinct signatures. When analyzing complex compound mixtures or whole organisms, organs, or body fluids such as blood plasma by NMR, the resulting spectra will be dominated by the main components present in the sample. An example from a liver tissue extract is depicted in Figure 40.4, which shows in a fairly simple way the difference between treated and untreated liver tissue. The multitude of signals derived from a single biomolecule explains why only a limited number of analytes are readily accessible by NMR and why minor components are frequently missed.
The big advantage of the methodology lies in its ease of use, relatively low running costs, simple sample preparation, and high-throughput capability, though instrument acquisition costs are still high. Screening biological samples for differences against reference samples is achieved without a prior hypothesis. For this reason NMR has an established place in areas such as food quality control, toxicology, and drug abuse testing. NMR spectroscopy is an ideal tool for the classification of biological samples, being a non-destructive test procedure with the ability to analyze fully functional, living organisms.
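Such fingerprint-based classification can be sketched as follows: each spectrum is reduced to a vector of binned intensities, and an unknown sample is assigned to the reference class with which it correlates best. All spectra below are hypothetical toy data, not real NMR measurements:

```python
# Sketch of spectral fingerprint classification: spectra are reduced to
# fixed-width intensity bins and compared by Pearson correlation, assigning
# each sample to the most similar reference class.
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def classify(sample, references):
    """Assign the sample fingerprint to the best-correlating reference class."""
    return max(references, key=lambda label: pearson(sample, references[label]))

references = {
    "control": [1.0, 0.2, 0.8, 0.1, 0.5],
    "treated": [0.2, 1.0, 0.1, 0.9, 0.4],
}
sample = [0.9, 0.3, 0.7, 0.2, 0.5]  # binned intensities of an unknown sample
print(classify(sample, references))  # -> "control"
```

Real chemometric workflows use many more bins and multivariate methods such as PCA or PLS-DA, but the principle, comparing whole-spectrum fingerprints without identifying individual metabolites, is the same.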


Figure 40.4 Differential 1H NMR spectroscopic analysis of liver tissue. Differential comparison of an untreated control (a) with a sample retrieved after application of a liver-active compound (b). Adapted and modified from Coen, M. and Kuchel, P. (2004) Chem. Aust., 71 (6), 13–17.

40.3 Metabolomic Profiling Higher specificity in the bioanalytical sciences is regularly achieved using high-resolution technologies such as mass spectrometry. Resolution, sensitivity, and selectivity are significantly enhanced if the analysis methods are complemented by separation technologies such as gas chromatography (GC), high-performance liquid chromatography (HPLC), or capillary zone electrophoresis (CE) (Figure 40.5). By coupling mass spectrometry to such separation technologies, quantitative and qualitative data can be retrieved for individual molecular species, even if the starting sample is a very complex mixture. This strategic approach is referred to as "profiling". Depending on the class of analytes to be investigated, the analysis methods can be prioritized; the necessary level of detail determines which technology or combination of technologies is selected. The resulting data, whether from metabolic, proteomic, or peptidomic profiling, have a high degree of similarity, as they represent a qualitative and (semi-)quantitative description of a large number of components. In an ideal scenario, identification of analytes is achieved simultaneously with quantification. In general, when analyzing any of the above-mentioned substance classes, one relevant requirement is the unbiased, robust, and reproducible extraction of class members from their respective biological matrix. This is even more relevant wherever high-resolution analytical technologies are utilized. Especially when, in addition to mere presence, the relative or absolute concentrations of many analytes are desired, careful selection and validation of sample collection, storage, and sample preparation are mandatory to avoid the selective enrichment or depletion of compounds, which would strongly bias the retrieved data. Specific know-how is required here to correctly conserve the status of the metabolome of a biological system until it is analyzed.
With the ongoing improvement in resolution and sensitivity, the number of analytes to be assessed simultaneously is steadily increasing. Modern mass spectrometry allows the quantification of hundreds of components from a single sample. Combining MS with suitable

Mass Spectrometry, Chapter 15 Chromatographic Separation Methods, Chapter 10 Capillary Electrophoresis, Chapter 12

Spectroscopy, Chapter 7

Figure 40.5 General strategies and techniques used to analyze a metabolome.


Figure 40.6 Coupling of analytical techniques with separation technologies allows for high separation and resolution power even in complex samples. (a) IR spectrometry, (b) chromatography, and (c) coupling of chromatography with mass spectrometry.

separation technologies, thousands of components can be quantified in a single LC/MS experiment (Figure 40.6). A disadvantage of the methods used is their inability to analyze the "in situ" state of a large number of metabolites in the living organism. Such "in situ" analysis would be the ideal for systems biology approaches.

40.4 Peptidomics

Cleavage of Proteins, Chapter 9

The central aim of research in proteomics is to fully capture analytical data on the proteins in biological samples and to turn these data into knowledge. The main technologies used have evolved from 2D gel electrophoresis to protein-digest-based methods that are more compatible with high-performance mass spectrometry. Several sub-forms of proteomics have evolved, dealing with glycoproteins, protein complexes, antibodies, and native peptides. The term peptidomics describes the analysis of native peptides in the mass range 1–15 kDa, thus closing the molecular mass gap between metabolomics and proteomics, with a certain degree of overlap on both sides. Peptidomics is today a novel research area in functional genome analysis, comprehensively analyzing the peptides and small proteins of a defined biological state at a defined time point. As already noted, peptides represent true metabolites in the sense that they are products of protein metabolism, usually generated by complex protease activities with temporal and spatial resolution. Almost every peptide in an organism is generated by (specific) cleavage by a peptidase or protease. Approximately 5% of the genes in the human genome are assumed to encode peptidases and proteases, and around 1000 peptidases plus around 200 homologues have been characterized to date (source: MEROPS, http://merops.sanger.ac.uk). They are located within cells, in body fluids such as blood plasma, urine, and cerebrospinal fluid, and in the extracellular matrix, and are able to generate an enormous multitude of peptides. Peptide species with distinct and relevant biological functions have long attracted the attention of a broad spectrum of researchers, from basic science to medicine. Prime examples are peptide hormones such as insulin from pancreatic endocrine cells or parathyroid hormone from the parathyroid gland.
For many other peptides precise knowledge of their biological function is currently lacking, and it is expected that, alongside those that are just degradation products ready for clearance or re-uptake, a large number of important, bioactive peptides are still to be discovered. For example, collagenous proteins account for 25% of total body protein, and they undergo permanent degradation, remodeling, and alteration processes. They are usually cleaved


Figure 40.7 Analysis of a peptidome. Following removal of contaminating components (proteins, non-proteinaceous materials such as salts, sugars, lipids, metabolites) the complexity of a peptidome is reduced by sample fractionation techniques. Mass spectrometric analysis is then capable of describing thousands of peptides in a qualitative and quantitative manner. Comparative analysis between sample cohorts facilitates a differential peptide display finally resulting in structural identification of peptide biomarkers.

by collagenases, giving rise to molecules such as endostatins, which supposedly alter the vascularization of tissues. Of high relevance are intracellular protein processing events, shuttling peptides to cell surfaces for immune system priming or converting pre-prohormones into active peptide hormones. The analysis of a peptidome usually requires very detailed, robust, and reproducible methods for qualitative as well as quantitative analysis. Such a process (Figure 40.7) is composed of several subsequent steps that have to be adapted to the biological source and the scientific rationale behind the study. The starting point for a peptidomics profiling study is the adequate collection and storage of suitable samples. It is here that the success or failure of subsequent work is already decided, since inadequate sample pretreatment and storage can never be corrected, however sound the science applied at later stages. The second relevant procedure is sample preparation. Biological samples such as body fluids contain sets of analytes with the potential to interfere with the analysis of proteins and especially peptides. One main goal of sample preparation is the removal of such interferences: salts, sugars, lipids, metabolites, and especially highly abundant proteins such as albumins, immunoglobulins, and others. This is achieved via technologies such as size separation by gel chromatography, electrophoresis or ultrafiltration, affinity extraction, and solid-phase adsorption. After isolation, the native peptides are fractionated and subjected to mass spectrometric analysis. Detailed analysis of the mass spectrometric data, usually in combination with other attributes such as the clinical data of a patient, is then performed. To generate knowledge, a diverse set of bioinformatic and biostatistical tools is required.
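The differential peptide display at the end of this workflow can be sketched as a comparison of per-peptide intensities between two sample cohorts. The peptide names, intensities, and fold-change threshold below are hypothetical illustration values; real studies would add proper statistical testing and multiple-testing correction:

```python
# Sketch of a differential peptide display: per-peptide intensity lists from
# two cohorts are compared by mean fold change, and peptides exceeding a
# chosen threshold (in either direction) are flagged as biomarker candidates.

def mean(values):
    return sum(values) / len(values)

def differential_display(cohort_a, cohort_b, fold_threshold=2.0):
    """Return {peptide: fold_change} for peptides whose mean intensity differs
    by at least fold_threshold between the two cohorts."""
    candidates = {}
    for peptide in cohort_a:
        fold = mean(cohort_b[peptide]) / mean(cohort_a[peptide])
        if fold >= fold_threshold or fold <= 1.0 / fold_threshold:
            candidates[peptide] = fold
    return candidates

healthy = {"peptide_1": [100.0, 120.0, 110.0],
           "peptide_2": [500.0, 480.0, 520.0],
           "peptide_3": [80.0, 90.0, 100.0]}
diseased = {"peptide_1": [400.0, 420.0, 380.0],   # strongly up-regulated
            "peptide_2": [510.0, 490.0, 500.0],   # essentially unchanged
            "peptide_3": [30.0, 25.0, 35.0]}      # down-regulated
print(differential_display(healthy, diseased))
```

The flagged candidates would then go on to structural identification by MS/MS, as shown in Figure 40.7.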

40.5 Metabolomics – Knowledge Mining Knowledge discovery in data is (according to Fayyad et al.): ". . . the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". The collection of large amounts of data is not yet useful information. For that, models have to be created that can be applied for predictions, or alternatively some kind of theory that can later be used for setting up experiments. A useful model is one that is accurate in the context of the utility of its predictions. A common challenge in omics research is to turn data into knowledge in a systematic way, to integrate previous knowledge and data, and to differentiate between data-driven and model-driven approaches. It is therefore necessary to understand the underlying principles of knowledge and model generation. Profiling technologies such as metabolomics usually produce large data sets that are mostly uninformative. A key responsibility of the researcher lies in prioritizing and extracting the correct, relevant information and in deciding on a proper experimental design so that metabolic profiles can be annotated. The experimental design already


Figure 40.8 Research cycle based on a circular system of hypotheses and knowledge in systems biology. Adapted and modified from Kitano H. (2002) Systems biology: a brief overview. Science, 295, 1662–1664.

contains expectations or prior information. Examples of basic experimental designs are comparisons of data derived from two drastically different states, either between samples from different individuals (healthy versus diseased) or within an individual (pre- versus post-treatment), in order to create classifiers. The factors under investigation have to be separated from experimental artifacts or nuisance factors, otherwise no relevant and transferable knowledge is generated. The metabolomics researcher has to integrate hypothesis-driven, data-driven, and model-driven research, and thus create hypotheses based on previous knowledge, extract valid patterns from data, and create and validate models, which have to be compact and communicable. Historically, in classic biochemistry hypothesis-driven deductive work was the center of activities. With the availability of large profiling data sets, data-driven inductive methods became popular. In purely inductive, data-driven approaches the goal is to model data without reference to pre-formulated hypotheses, in order to identify the relevant individual components in an unbiased way. However, the shortcomings of this approach became apparent; one underlying issue is that data from fingerprinting and profiling exercises frequently carry measurement errors. To address research questions adequately and create knowledge, a combination of inductive and hypothesis-driven approaches is essential. Existing models are therefore challenged by experiments designed to test and validate them (Figure 40.8).

40.6 Data Mining

Figure 40.9 Comparison of fundamentally different network architectures. Members of a random network have identical priorities and are therefore connected in a non-hierarchical manner. In scale-free networks, some nodes are highly connected and therefore superior in relevance to less connected members; they are called scale-free because zooming in on any part of the distribution does not change its shape. Hierarchical networks show a clear hierarchical organization. This type of network is typical of metabolic pathways, which serve different functional tasks in all organisms.

Analysis of metabolites increasingly delivers high-dimensional, multivariate data. In this respect, data on single metabolites, clinical parameters, and properties of analytes represent the different dimensions of an experiment. The data set usually consists of large tables covering thousands of analytes in individual sample cohorts. Network analysis aims to derive a model of the correlations in such data by extracting the patterns and topology of interaction networks. These networks usually include information on the concentrations of whole sets of analytes. The network topology allows us to recognize the relevant organizational hierarchy and central molecules, so-called "hubs" (Figure 40.9). A good example of a numeric procedure to analyze correlations and associations between metabolites is the correlation-associated network. This procedure exploits the fact that different metabolites have distinct, well-defined relations between their concentrations in biological samples. If sets of experiments performed under different, carefully selected experimental conditions are correlated, the concentration levels of different metabolites can be put into relation to each other; this maps known as well as novel metabolites onto each other and yields such network topologies.

Data mining is the computational process of systematically handling complex data sets to extract compact patterns or rules. It can be understood as "mining" in the sense of an enrichment procedure that utilizes methods from statistics, machine learning, and database systems. The typical data matrix investigated by data mining methods contains the metabolites and annotation data as attributes and the experimental measurements as observations, forming an n × m matrix. The number of formally independent conclusions depends only on the number of observations, which in "omics" studies is unfortunately far outnumbered by the large number of attributes.
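The construction of such a correlation-associated network can be illustrated with a minimal sketch. The metabolite names, the simulated concentration data, and the correlation threshold below are assumptions chosen for illustration only: pairwise correlation coefficients are computed across samples, and metabolite pairs whose absolute correlation exceeds the threshold become edges of the network.

```python
import numpy as np

# Sketch of a correlation-associated network. The metabolite names,
# the simulated concentrations, and the 0.8 threshold are assumptions
# made for illustration only.

def correlation_network(data, names, threshold=0.8):
    """Return edges (name_i, name_j, r) for metabolite pairs whose
    absolute Pearson correlation across samples exceeds the threshold."""
    r = np.corrcoef(data, rowvar=False)  # m x m correlation matrix
    edges = []
    m = data.shape[1]
    for i in range(m):
        for j in range(i + 1, m):
            if abs(r[i, j]) >= threshold:
                edges.append((names[i], names[j], round(float(r[i, j]), 3)))
    return edges

rng = np.random.default_rng(0)
glucose = rng.normal(5.0, 1.0, 50)              # simulated sample values
g6p = 0.9 * glucose + rng.normal(0.0, 0.2, 50)  # tightly coupled metabolite
lactate = rng.normal(2.0, 0.5, 50)              # unrelated metabolite
data = np.column_stack([glucose, g6p, lactate])
edges = correlation_network(data, ["glucose", "G6P", "lactate"])
print(edges)  # only the glucose-G6P pair should exceed the cutoff
```

Note that with thousands of metabolites but only a handful of samples, far more metabolite pairs are tested in this way than there are independent observations.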
This so-called curse of dimensionality, or sparsity of data, is problematic for any method that requires statistical significance. Examples of typical methods used in metabolomic data mining are multiple hypothesis testing based on multiple ANOVAs, multiple regression methods, step-wise linear regression, linear mixed-effect modeling, multiple logistic regression, and the like. More sophisticated methods are derived from machine learning algorithms such as neural networks, support vector machines, Gaussian mixture models, Kohonen network analysis, or other artificial intelligence algorithms. Searching data often relies on detecting areas where objects form groups with similar properties. Conclusions drawn from underpowered studies are often compromised by random observations. This represents a real challenge for biostatisticians and needs careful consideration in a cycle of experimental replication. In addition, the number of investigated analytes has to be reduced step by step during verification studies to escape this curse of dimensionality. Powerful methods for data reduction and for creating compact models are, for example, the chemometric projection methods of multivariate statistics. They usually aim to extract a few so-called latent variables from the whole set of measurement data, variables that are shared between all the measured analytes as common metabolic phenomena. A well-known, broadly used chemometric method is principal component analysis (PCA). Here, the measurement data are transformed in a systematic way to identify so-called principal components, which are related to latent variables (Figure 40.10) and can be interpreted as such. Latent variables constitute metabolic properties that usually cannot be observed directly. They represent intrinsic, powerful features important for understanding general, relevant sets of parameters of metabolism. PCA thus has the power to identify general factors and biological phenomena and provides a compact model that can be applied to novel data. Each new metabolic profile can be visualized in a coordinate system, and the profile is subsequently used for correlation to known metabolic data. Similar metabolic states will group together even if the biological specimens are quite diverse (Figure 40.11).

Figure 40.10 Principal component analysis (PCA) subdivides complex data matrices into different projection vectors based on their relative importance (power of the component). The product of the most important vectors (rows and columns of the first principal component) is utilized to reconstruct the data matrix according to the first component. During PCA the most important components are utilized to denote dominant vectors, leaving behind the influence of the residual vector matrix not represented by the more important vectors.
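A minimal PCA sketch, computed here via singular value decomposition, illustrates how samples are projected onto the first principal components. The two sample groups, the loadings, and the noise level are simulated assumptions; a real study would use measured metabolic profiles.

```python
import numpy as np

# Minimal PCA sketch via singular value decomposition. The two sample
# groups, the loadings, and the noise level are simulated assumptions.

def pca(X, n_components=2):
    """Center the data, compute the SVD, and return the sample scores
    and the explained-variance ratio of the leading components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    var = s ** 2 / (len(X) - 1)
    return scores, var[:n_components] / var.sum()

rng = np.random.default_rng(2)
# latent "disease" factor: 30 healthy-like and 30 diseased-like samples
latent = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(4.0, 1.0, 30)])
loadings = np.array([1.0, -0.5, 0.8, 0.3, -1.2])  # 5 metabolites share the factor
X = np.outer(latent, loadings) + rng.normal(0.0, 0.3, (60, 5))
scores, ratio = pca(X)
print(ratio)  # the first component should capture most of the variance
```

The first principal component recovers the latent factor: the two groups separate along it, even though the factor itself was never measured directly.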

Figure 40.11 Visualization of experimental data of the first two principal components in a simplified coordinate system. Every point represents the measurement of a complex metabolomics profile. Such a simplified coordinate system allows alignment of individual samples along two main, prioritized vectors pointing towards the most important variables in the data space. This can be utilized to interrogate from a top view the degree of self-similarity of different metabolic profiles.
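The multiple hypothesis testing mentioned earlier in this section illustrates why correction procedures are indispensable for such underpowered data sets. The following sketch uses simulated p-values and the Benjamini–Hochberg procedure, chosen here as one common correction (not a method prescribed by the text), to show how many false positives uncorrected testing of 1000 null metabolites produces.

```python
import numpy as np

# Sketch of the multiple-testing problem: 1000 simulated null metabolites
# tested at p < 0.05 yield dozens of false positives; the Benjamini-
# Hochberg procedure (one common correction, an assumption of this
# sketch) controls the false discovery rate instead.

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of p-values significant at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    significant = np.zeros(m, dtype=bool)
    significant[order[:k]] = True
    return significant

rng = np.random.default_rng(1)
p_null = rng.uniform(size=1000)                # no real effects present
uncorrected_hits = int((p_null < 0.05).sum())  # roughly 5% false positives
corrected_hits = int(benjamini_hochberg(p_null).sum())
print(uncorrected_hits, corrected_hits)
```

With purely null data, naive testing at p < 0.05 flags on the order of 50 "hits", while the FDR-controlled count stays near zero.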



40.7 Fields of Application

The technologies outlined here have shown growing utility in addressing a diverse set of bioanalytical questions.

In biology:

- basic research in the physiology and pathophysiology of plants, microorganisms, and animals.

In medicine:

- basic research in medicine;
- target identification, verification, and validation;
- measurement of biochemical changes during disease;
- mapping of multiple known as well as novel analytes and correlation to established biochemical pathways;
- generation of novel research hypotheses;
- identification of key enzymes, proteins, and metabolites;
- preclinical studies and toxicological analyses;
- mode-of-action studies;
- comparison of metabolic profiles with known toxic profiles;
- prediction of toxic effects;
- dose–effect relation mapping;
- extrapolation of species relationships;
- clinical trials;
- classification of subgroups by side effects and adverse events;
- reduction of non-responder populations;
- diagnostics and biomarkers;
- identification of single diagnostic molecules and multiplex panels.

40.8 Outlook

Metabolomics and peptidomics are emerging scientific fields with a high impact on diagnostics and systems biology. Both fields utilize a set of specific analytical, computational, and bioinformatics technologies, thus merging different aspects of functional genomics. Metabolomics and peptidomics improve the analysis and predictability of living systems as they respond to internal and external factors. Significant impact is expected on progress in medicine, drug development, diagnostics, and basic science. Systems theory and progress in information theory will, it is hoped, provide a better toolbox for a basic understanding of the miracle of living systems as such.

Further Reading

Beckonert, O., Keun, H.C., Ebbels, T., Bundy, J., Holmes, E., Lindon, J.C., and Nicholson, J.K. (2007) Metabolic profiling, metabolomic and metabonomic procedures for NMR spectroscopy of urine, plasma, serum and tissue extracts. Nat. Protoc., 2, 2692–2703.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (1996) Advances in Knowledge Discovery and Data Mining, MIT Press.
Gaenshirt, D., Harms, F., Rohmann, S., and Schulz-Knappe, P. (eds) (2005) Peptidomics in Drug Development, Editio Cantor Verlag.
Goodacre, R., Vaidyanathan, S., Dunn, G., Harrigan, G., and Kell, D. (2004) Metabolomics by numbers: acquiring and understanding global metabolite data. Trends Biotechnol., 22 (5), 245–252.
Kell, D. (2004) Metabolomics and systems biology: making sense of the soup. Curr. Opin. Microbiol., 7, 296–307.
Nicholson, J. and Wilson, I. (2003) Understanding global systems biology: metabonomics and the continuum of metabolism. Nat. Rev. Drug Discov., 2, 668–676.
Roche Applied Science, Biochemical Pathways, Elsevier, Heidelberg.

41 Interactomics – Systematic Protein–Protein Interactions

Thomas O. Joos,¹ Stefan Pabst,² and Markus F. Templin¹

¹ NMI Natural and Medical Sciences Institute at the University of Tübingen, Markwiesenstraße 55, 72770 Tübingen, Germany
² MorphoSys AG, Protein Sciences & CMC, Semmelweisstraße 7, 82152 Planegg, Germany

To gain a more complete understanding of protein-mediated processes in cells, a comprehensive characterization of the proteins involved and the identification of their interaction partners are an absolute prerequisite. A major goal of such investigations is not only to find out whether a protein of interest is present, but also to glean information on its functional state. This relates in particular to the protein's localization, potential modifications, and interactions with other proteins. The localization of proteins usually involves the application of immunohistological methods and cell fractionation. Protein modifications can be identified using classical approaches such as mass spectrometry and ELISA (enzyme-linked immunosorbent assay). The identification of protein–protein interactions is also of major importance. The central role of such interactions has only been recognized in recent years. However, fundamental biological processes such as DNA replication, transcription, translation, transport, cell cycle control, and signal transduction can only be explained with detailed knowledge of the underlying protein–protein interactions. Consequently, the study of protein–protein interactions has become a central theme of protein analysis that falls within the field of interactomics. Interactomics deals with all possible interactions between proteins in a given situation. This takes into account permanent multienzyme complexes as well as the association and dissociation of regulatory factors in protein complexes. Protein interactions can be studied using a broad range of different technologies, including two-hybrid interaction analysis, the separation of protein complexes using affinity chromatography, the mass spectrometric analysis of associated factors, and protein microarrays. The following sections focus on microarrays and discuss their potential for use in interactome analysis.

41.1 Protein Microarrays

Array systems are analytical systems that enable scientists to perform a large number of measurements simultaneously. An array is an orderly arrangement of microscopic spots, often in rows and columns. The term microarray refers to the miniaturized array format that enables the highly parallel execution of experiments. Protein microarrays can be used to simultaneously identify and quantify a large number of proteins in a single experiment. In addition to studying protein expression, protein microarrays can also be used for global interaction studies and functional analyses in a miniaturized format. The outstanding power of microarray technology comes from the high sensitivity of the measurements along with the ability to analyze tens to hundreds of relevant measurement parameters in a single experiment and in tiny sample quantities. In a planar protein microarray, a collection of capture molecules (e.g., antibodies, antigens, or complex probes such as the lysates of cells or tissues) is immobilized on a solid surface – or solid phase, which is why these microarrays are referred to as solid-phase assays. Parallelization is achieved by immobilizing the different capture molecules or probes on planar carriers (e.g., coated glass slides) in so-called microspots (spot diameter: 250 μm) that are arranged in rows and columns, that is, in arrays. On planar surfaces, the identity of the immobilized capture molecules can be determined from their x,y coordinates (Figure 41.1). In analogy to the orderly spatial separation on planar microarrays, the ability to analyze different immunoassays in bead-based systems requires the use of other codes for their unique identification. For example, miniaturized and parallelized ligand binding assays employ color-coded or size-coded microspheres, where the probes are attached to spectroscopically distinguishable fluorescence-coded beads only a few micrometers in diameter (e.g., polystyrene beads). Arrays are created with microsphere beads of different sizes or colors to which different probes are attached (suspension arrays). The molecules that are attached to the beads can be identified from the microspheres' unique identity based on variations in fluorescence (Figure 41.1). The different color codes can be effectively identified using a technology that enables the separation of the beads in suspension on the basis of their optical properties (FACS, fluorescence-activated cell sorting). This technique was originally used for cell analysis. Cells in suspension or, in the case of suspension arrays, color- or size-coded microspheres in solution are aligned in a stream of fluid so that they pass single file through the detection cell. Two lasers of different wavelengths classify the different beads and determine the quantity of ligand bound. The bound ligands are usually identified using fluorescence-based methods.

Two-Hybrid System, Section 16.1
Affinity Chromatography, Section 10.4.8
Mass Spectrometry, Chapter 15
FACS, Section 5.5

Figure 41.1 Planar and bead-based arrays. In a planar microarray, the different probes (shown in blue) can be precisely identified from their x- and y-axis coordinates. In bead-based assays, the distinction of the probes on the beads relies on different bead types that can be identified by color (fluorescence) code or size. In planar arrays, the labeled binding partner (gray with asterisk) is incubated directly on the planar chip surface; in bead-based arrays, the beads are mixed with labeled binding partners in microtiter plate wells. Different methods are used for the detection of the bound binding partners; bead-based arrays usually involve fluorescence-based detection methods.
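The decoding step of such a bead-based readout can be sketched as follows. The analyte names, color-code intensities, and reporter signals are invented for illustration: one channel assigns each bead to its color code, the other quantifies the bound, fluorescently labeled binding partner.

```python
import statistics

# Hypothetical sketch of a bead-based array readout: the classification
# channel identifies the bead's color code, the reporter channel
# quantifies the bound, fluorescently labeled binding partner. Analyte
# names, code intensities, and signals are invented for illustration.

BEAD_CODES = {"IL-6": 100.0, "TNF-a": 300.0, "CRP": 600.0}

def classify(events):
    """Assign each (classification_signal, reporter_signal) event to the
    nearest bead code and return the median reporter signal per analyte."""
    per_analyte = {name: [] for name in BEAD_CODES}
    for code_signal, reporter in events:
        name = min(BEAD_CODES, key=lambda n: abs(BEAD_CODES[n] - code_signal))
        per_analyte[name].append(reporter)
    return {n: statistics.median(v) for n, v in per_analyte.items() if v}

# five simulated bead events from one microtiter well
events = [(98, 520), (105, 480), (301, 40), (296, 55), (610, 900)]
print(classify(events))  # median reporter signal per bead type
```

Taking the median over many beads of the same type is what makes the readout robust against individual outlier beads.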

41.1.1 Sensitivity Increase through Miniaturization – Ambient Analyte Assay

The array systems described here have two principal characteristics in common: a high degree of parallelization and the associated extreme reduction of the area available for analyte detection. This miniaturization was the major priority of the first experiments done in this field, which nowadays are referred to as microarrays. In the 1980s, Roger Ekins was looking for conditions under which immunoassays/ligand binding assays would achieve the best possible sensitivity. His ambient analyte assay theory outlines the consequences of miniaturization. Capture molecules (e.g., antibodies) are immobilized on a small area – the microspot – on a solid phase. Although the total number of capture molecules (antibodies) per microspot is relatively low, they are nevertheless arranged at a very high density. These capture molecules form complexes with their target molecules. The number of target molecules bound is relatively low, as it is limited by the number of capture molecules contained in a microspot. The concentration of free target molecules in the sample therefore hardly changes. This is also the case with a small number of target proteins. Assays in which a microspot contains 0.1/K or less of capture molecules are referred to as ambient analyte assays (K is the affinity constant of the binding reaction). The proportion of bound target protein correlates directly with the concentration of


target protein in the sample. In addition, under such conditions, the results are insensitive to the volume of sample used. Ekins’ ambient analyte assay theory therefore relates to the increased sensitivity of miniaturized immunoassays. This increased sensitivity is due to two phenomena:

- the complex-formation reaction occurs at the maximum target molecule concentration;
- capture–target molecule complexes are formed on an extremely small area – the microspot – which leads to a high local signal intensity (Figure 41.2).

The following illustrates this relationship. On a surface, microspots of increasing size are produced from a capture molecule solution of constant concentration. The total number of immobilized capture molecules therefore increases with the spot area. As a consequence, the overall signal of the respective microspots also increases. However, as the target molecule is not present in unlimited quantities in the sample, the signal density will decrease as the spot area increases. The formation of complexes consisting of capture molecule and target molecule reduces the quantity of free target protein in the sample, while at the same time the complexes formed are distributed across a larger area. A larger spot therefore exhibits a lower maximum signal density. Although lower overall signals are observed for smaller microspots, the signal density (signal intensity per area) increases (Figure 41.2). Below a certain spot size, the signal density reaches an optimum value and does not increase further. Under ambient analyte assay conditions, the quantity of target protein does not represent a limiting factor.
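A simple mass-action calculation illustrates the ambient analyte condition numerically. The affinity constant and the concentrations used below are illustrative assumptions: when the capture-molecule concentration is far below 0.1/K, only a small fraction of the target is bound and the free target concentration barely changes, whereas a large excess of capture molecules depletes the sample.

```python
import math

# Numerical illustration of the ambient analyte condition. The affinity
# constant K and the concentrations below are illustrative assumptions.

def equilibrium_complex(capture_total, target_total, K):
    """Solve the 1:1 mass-action equilibrium K*(B0-C)*(A0-C) = C for the
    complex concentration C (all concentrations in mol/l)."""
    # quadratic: K*C^2 - (K*B0 + K*A0 + 1)*C + K*B0*A0 = 0
    a = K
    b = -(K * capture_total + K * target_total + 1.0)
    c = K * capture_total * target_total
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # physical root

K = 1e9        # affinity constant in l/mol
A0 = 1e-12     # 1 pM target concentration
ambient = equilibrium_complex(0.01 / K, A0, K)  # capture well below 0.1/K
excess = equilibrium_complex(10.0 / K, A0, K)   # capture far above 0.1/K
print(ambient / A0, excess / A0)  # bound fraction of the target
```

In the ambient regime roughly 1% of the target is bound, so the measured signal reports the undisturbed ("ambient") analyte concentration; with a large excess of capture molecules most of the target is removed from solution.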

41.1.2 From DNA to Protein Microarrays

The development of DNA chip technologies has made it possible to analyze large numbers of probes in a single experiment, which is achieved by immobilizing a large number of different capture molecules on a carrier. In addition, the search for conditions that enable immunoassays to exhibit maximum sensitivity led to the miniaturization of the system. Ekins and his colleagues showed that the use of miniaturized systems enables the sensitive detection of TSH (thyroid-stimulating hormone) or HBsAg (hepatitis B surface antigen) even at


Figure 41.2 Sensitivity increase through miniaturization – signal density and overall signal in microspots (adapted and modified from Ekins, R. and Chu, F. (1992) Ann. Biol. Clin., 50, 337–353). The course of the signal density (signal intensity/area, grey bars) and the overall signal (signal intensity, black triangles) was determined for microspots with increasing quantities of capture molecules. As all microspots have the same capture molecule density, larger microspots contain larger quantities of capture molecules. The overall signal therefore increases with increasing spot size and reaches a maximum when all target molecules contained in the sample are bound in the microspot. For microspots with fewer capture molecules and smaller spot size, the signal density (signal intensity/area) increases and reaches an almost constant value when the capture molecule concentration falls below 0.1/K (K = affinity constant). These conditions characterize ambient analyte assays. The concentration of free target molecules in the sample barely changes, despite the fact that they form complexes with the capture molecules immobilized in the microspot. The figure depicts the signal intensity and the derived signal density (signal intensity per area).



Table 41.1 Comparison of DNA and proteins in terms of their microarray suitability.

Characteristic          | DNA                                             | Protein
Structure               | Uniform; hydrophilic; stable                    | Varies; hydrophilic and/or hydrophobic domains; very sensitive to stable
Functional state        | Denatured, no activity loss → can be stored dry | 3D structure is crucial for activity; denaturation should be prevented → needs to be kept moist (in a buffer)
Interaction sites       | 1:1 interaction                                 | Several interaction sites
Interaction affinity    | High                                            | Varies from protein to protein; very low to high
Interaction specificity | High                                            | Varies from protein to protein; very low to high
Prediction of activity  | Easily possible, based on nucleotide sequence   | Not (yet) possible, but bioinformatic methods are being developed (based on sequence and structural homologies)
Amplification method    | Established (PCR)                               | Not (yet) available

femtomolar concentrations (on the order of 10⁶ molecules per ml). These investigations showed that miniaturized systems have huge potential for application in basic research and medical diagnostics. As DNA microarray technologies have long been established as important and well-functioning laboratory methodologies, many attempts have been made in recent years to transfer this technology to protein microarrays. It has been established that the transfer of the technology from DNA to proteins is theoretically possible, but solutions still need to be found for numerous issues. The associated challenges are more of an intrinsic than of a methodological nature, and arise from the particular requirements of proteins. As far as the equipment used to produce DNA and protein microarrays is concerned, the same devices, or devices with just minor technical modifications, can be used for printing planar protein microarrays on a microscope slide. Different systems, including needle-based contact printers (split-needle or pin-and-ring systems) and contactless microdispensing systems (inkjet systems), can be used to automatically deposit nano- or picoliter sample volumes in columns and rows on suitable carrier materials. Protein interactions can be detected using confocal laser scanners such as those used for the analysis of DNA microarrays. While the challenges associated with the generation of protein arrays are thus less technical, the properties of proteins make global approaches involving thousands of capture molecules far more difficult than for DNA chips. DNA molecules are a homogeneous class of molecules with fairly similar physicochemical properties due to DNA's defined sequence of four nucleotides joined together by an acidic sugar backbone. In contrast, proteins are much more complex (Table 41.1). They consist of 20 different amino acid building blocks and can have different conformations (secondary, tertiary, and quaternary structures), different charges (no net charge, positive, or negative net charge), and different solubilities (hydrophilic or hydrophobic). In addition, proteins are far less robust than DNA molecules and have a low level of stability. It is therefore important to immobilize the proteins on a substrate without compromising their function, which is an important prerequisite for analyzing protein–protein interactions. In contrast to DNA–DNA interactions, which can be accurately predicted thanks to the complementary pairing of the nucleotides, it is currently impossible to predict from amino acid information alone which amino acid sequence a potential interaction partner needs in order to bind specifically to the immobilized capture molecule. In addition, the generation of the required capture molecules is frequently a time-consuming and laborious process; no PCR equivalent is available that would allow the amplification of proteins. And last but not least, the interactions of proteins depend on a large number of buffer conditions, including the pH value, salt concentration, and required cofactors. It is therefore quite difficult to develop a universal protein array system, in analogy to DNA chips, that takes into account the individual requirements of proteins and makes it possible to analyze as many interactions as possible under physiological conditions. Care must therefore be taken to interpret protein microarray results while keeping in mind the particular experimental methodology used.
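The sensitivity figure quoted above can be checked with a short unit conversion: a 1 fM solution contains roughly 6 × 10⁵ molecules per milliliter, that is, on the order of 10⁶ molecules per milliliter for concentrations of a few femtomolar.

```python
# Arithmetic check of the sensitivity figure quoted above: converting a
# molar concentration to molecules per milliliter via Avogadro's number.

AVOGADRO = 6.022e23  # molecules per mole

def molecules_per_ml(molar):
    """Convert a concentration in mol/l to molecules per milliliter."""
    return molar * AVOGADRO / 1000.0  # 1000 ml per liter

print(f"{molecules_per_ml(1e-15):.2e} molecules per ml at 1 fM")
```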



41.1.3 Application of Protein Microarrays

Protein microarrays can provide qualitative as well as quantitative information. They indicate whether a protein is present in a biological sample. For example, tissue samples of different origin (tumor tissue and normal tissue) can be investigated to find out in which of the two samples a certain protein is present or absent. In most cases, however, quantitative information is also required, for example the difference in the amount of a marker protein between the two samples. In principle, protein microarrays can be used to identify any molecular interaction in which two partner molecules specifically recognize one another. In addition to protein–protein interactions, these include antigen–antibody, enzyme–substrate, and ligand–receptor interactions. Microarrays can be produced from complex samples such as cell and tissue lysates as well as from whole, intact cells. The following subsections deal with different protein microarray technologies that highlight the enormous versatility of miniaturized assay systems (Figure 41.3).

Protein–Protein Interaction Protein microarrays are excellent tools for the parallel analysis of protein–protein interactions. First, they are used to identify new interaction partners of a specific protein (screening tool). Numerous protein microarrays for analyzing protein–protein interactions are commercially available. One of these is a planar array developed by Invitrogen: a standard-size nitrocellulose-coated glass slide carrying more than 12 000 protein spots, each around 150 μm in diameter, on which more than 4000 different proteins of the yeast Saccharomyces cerevisiae and more than 2000 control proteins are immobilized in duplicate. A single experiment with this array therefore enables the identification of potential interaction partners for almost the entire yeast proteome (approximately 6500 proteins).
An example of an interaction partner identified with the ProtoArray protein microarray (Thermo Fisher Scientific) is shown in Figure 41.4. The yeast protein MOG1, which is involved in nuclear transport, was used as the probe. MOG1 was biotinylated prior to use in order to enable the subsequent detection of the reaction with fluorophore-conjugated streptavidin. MOG1 interacted well with the protein GSP1, a GTP-binding protein (the yeast homologue of mammalian Ran) that is also involved in nuclear transport.

Antibody–Antigen Interaction Protein microarrays are often used for analyzing antigen–antibody interactions. An antibody microarray usually contains a broad range of antibodies with different specificities. The chip is then probed with a mixture of proteins that are labeled with a dye (e.g., a fluorophore-labeled tissue lysate). This enables a large number of antibodies to be tested simultaneously with a very limited amount of sample. Miniaturized sandwich immunoassays have proven to be especially suited to the sensitive analysis of analytes. Using the same principle as a classical ELISA, capture molecules are spotted and fixed on a solid surface. The analyte in a liquid sample (liquid phase) is bound by the capture molecules and detected by a second antibody that binds another site on the analyte. Miniaturization has the advantage that several analytes can be analyzed and quantified simultaneously with the same sensitivity as classical ELISAs.

Immunological Techniques, Chapter 5

Figure 41.3 Schematic representation of different protein microarray applications. (a) Set-up used for the identification of protein–protein interactions. A protein (dark grey) that is immobilized in the microspot reacts specifically with another protein (light grey) which is labeled with a fluorophore (asterisk) for detection. (b) An antibody array in which the antibody is immobilized on the chip surface and captures a labeled analyte from the solution. (c) A sandwich immunoassay, where the analyte is captured by an immobilized antibody, and detected with a second antibody that has another binding site on the analyte. (d) Illustration of a peptide array such as those used for the characterization of peptide-specific antibodies. Protein microarrays can also be used for the detection of enzyme–substrate interactions (e) and ligand–receptor interactions (f). Reverse-phase protein microarrays can be produced with cell and tissue lysates (g), tissue sections (h), or whole cells (i).



Figure 41.4 Planar Protoarray microarray, which contains more than 4000 different yeast proteins. The proteins on the microarray are printed in 48 subarrays that are equally spaced vertically and horizontally (dimensions: four columns × twelve rows). Each subarray consists of 256 microspots, arranged in 16 columns and 16 rows. The entire array contains more than 12 000 protein microspots. The protein array was incubated with the yeast protein MOG1, which plays a crucial role in nuclear transport. The protein was biotinylated prior to use, enabling bound MOG1 to be detected using fluorophore-conjugated streptavidin. The left-hand side of the figure shows the entire slide with all subarrays; the right-hand side shows two enlarged subarrays (# 02 and # 47, with a dark grey frame). Signals that occur in both subarrays are controls. The signals in the dark grey circle depict the interaction of MOG1 with GSP1 (YLR293C), a nuclear import protein that was immobilized on the chip surface.

Instead of antibodies, it is also possible to spot and fix antigens on a solid surface and identify them using a labeled antibody. Whole protein molecules, short protein fragments, or peptides can be used as antigens. Peptide arrays can be used for the identification of antibody epitopes (antigenic determinants), that is, the residues of an antigen that are crucial for antibody binding. The antigen sequence needs to be known for this purpose. The process starts with the chemical synthesis of peptides (10–20 amino acids long) that together represent the entire antigen sequence. Ideally, peptides with partially overlapping sequences are generated. The peptides are immobilized on a suitable surface, and the peptides that interact with an antibody of interest are identified. A further prerequisite of this method is that the antibody binds its binding site with sufficiently high affinity. The method is therefore better suited to the mapping of continuous (linear) epitopes than to the identification of discontinuous (conformational) epitopes.
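The design of overlapping peptides for such epitope mapping can be sketched as follows. The antigen sequence, the 12-residue peptide length, and the 4-residue offset are invented for illustration.

```python
# Sketch of the overlapping-peptide design for epitope mapping. The
# antigen sequence, the 12-residue peptide length, and the 4-residue
# offset are invented for illustration.

def tile_peptides(sequence, length=12, offset=4):
    """Return overlapping peptides that together cover the whole antigen."""
    peptides = [sequence[i:i + length]
                for i in range(0, len(sequence) - length + 1, offset)]
    # add a final window if the tiling does not reach the C-terminus
    if (len(sequence) - length) % offset != 0:
        peptides.append(sequence[-length:])
    return peptides

antigen = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # invented 33-residue antigen
peps = tile_peptides(antigen)
print(len(peps), peps[0], peps[-1])
```

Each residue of the antigen is then covered by several peptides, so that a linear epitope can be localized from the subset of peptides bound by the antibody.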

Enzyme Activity Testing, Chapter 3

Enzyme–Substrate Interaction Enzyme–substrate interactions are studied using enzyme substrates that are spotted and fixed in arrays on suitable surfaces. The array is then incubated with the enzyme under investigation. Such microarrays can be used to analyze different substrates for their protein kinase specificity (substrate profiling). Substrate phosphorylation is usually detected using a phospho-specific antibody that only binds phosphorylated substrates, or using radioactively labeled ATP and a phosphorimager.

Ligand–Receptor Interaction To study ligand–receptor interactions, membrane fractions containing receptors or low-molecular-weight organic substances are immobilized on a microarray and incubated with labeled binding molecules. The binding behavior of substances towards receptors can thus be analyzed simultaneously under identical conditions. Such array systems are of major interest to the pharmaceutical industry in its search for new drugs (drug screening), as miniaturization and parallelization minimize the consumption of reagents while increasing the number of parameters that can be studied per experiment.

Reverse Phase Microarrays Reverse-phase protein microarrays contain protein fractions that are generated by cell lysis or tissue microdissection. Each microspot represents either the proteome of a tissue sample that correlates with a particular disease stage or the proteome of healthy tissue. Suitable antibodies are used to study these proteomes for the presence of molecular particularities associated with a certain disease. The results of such investigations might be

41 Interactomics – Systematic Protein–Protein Interactions

of particular benefit in the future as they may make it possible to tailor patient therapy to individual requirements. Microarrays can also be produced from different tissues. Tissue arrays allow the rapid and effective investigation of protein expression patterns with suitable binding molecules. In addition, they enable the effective profiling of antibodies for their suitability for application in histological investigations or as disease markers. In addition to lysates or tissue sections, it is also possible to immobilize whole cells on solid carriers. Such cell arrangements (cell arrays) are excellently suited to the characterization of antibodies that are directed against cell surface molecules (e.g., MHC molecules) and for the comparative analysis of cell surface molecules of different cell lines.
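The substrate-profiling readout described above ultimately reduces to ranking array spots by background-corrected signal. A minimal sketch, with invented substrate names and intensity values:

```python
def rank_substrates(spots: dict[str, float], background: float) -> list[tuple[str, float]]:
    """Rank kinase substrates by background-corrected spot intensity.

    Intensities at or below background are clipped to zero, i.e., treated
    as showing no detectable phosphorylation.
    """
    corrected = {name: max(signal - background, 0.0)
                 for name, signal in spots.items()}
    return sorted(corrected.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical phospho-antibody signals (arbitrary units) for four peptides:
signals = {"substrate_A": 1250.0, "substrate_B": 310.0,
           "substrate_C": 95.0, "substrate_D": 2040.0}
ranking = rank_substrates(signals, background=100.0)
print(ranking)  # substrate_D ranks first; substrate_C is clipped to 0
```

A real analysis would additionally normalize across replicate spots and arrays, but the ranking principle is the same.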

Further Reading

Ekins, R.E. (1989) Multi-analyte immunoassay. J. Pharm. Biomed. Anal., 7, 155–168.
Ekins, R. and Chu, F. (1992) Multianalyte microspot immunoassay. The microanalytical ‘compact disk’ of the future. Ann. Biol. Clin., 50, 337–353.
Templin, M.F., Stoll, D., Schwenk, J.M., Pötz, O., Kramer, S., and Joos, T.O. (2003) Protein microarrays: promising tools for proteomic research. Proteomics, 3, 2155–2166.


Chapter 42: Chemical Biology
Daniel Rauh and Matthias Rabiller, Technische Universität Dortmund, Chemische Biologie, Otto-Hahn-Straße 6, 44227 Dortmund, Germany

42.1 Chemical Biology – Innovative Chemical Approaches to Study Biological Phenomena

The sequencing of the human genome, as well as of the genomes of a growing number of other organisms, and the development of powerful “-omics” technologies have provided a new basis for the study of cellular processes. Science is now confronted with the daunting task of translating these technological breakthroughs and the resulting genomic and proteomic data into useful knowledge. In this context, considerable hope is placed in biology to develop new approaches to address the challenges of the twenty-first century: energy and food supply, innovative materials, and drug development. The first successes have already signaled biology’s potential: crops can be genetically modified to preserve yields despite declining water supplies, and microorganisms can be engineered to produce large amounts of biofuel or hydrogen. The treatment of human diseases such as cancer has also seen significant successes based on the decoding of genetic information, which enables new therapeutic approaches such as personalized medicine. For example, about 10% of patients suffering from non-small cell lung cancer (NSCLC) carry a mutation in which leucine 858 in the intracellular kinase domain of the epidermal growth factor receptor (EGFR) is replaced by an arginine residue, leading the mutated EGFR to continuously promote cell division. Cellular mechanisms that should inhibit uncontrolled growth fail: a tumor starts growing. With the understanding that this mutation is the cause of the tumor, and that cells harboring it are highly dependent on this receptor signal for their continued growth (a dependence referred to as oncogene addiction), came the possibility of drugs that specifically block the function of the mutated receptor once sequencing has established its presence in an affected patient.
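The point-mutation logic behind L858R can be illustrated with a toy variant check (1-based residue numbering; the padded placeholder sequences merely stand in for the real EGFR sequence):

```python
def classify_egfr_variant(protein_seq: str, position: int = 858,
                          reference: str = "L") -> str:
    """Report the residue at a 1-based position relative to a reference.

    For EGFR, leucine (L) at position 858 is wild type; arginine (R) at
    the same position is the activating L858R mutation discussed in the text.
    """
    observed = protein_seq[position - 1]
    if observed == reference:
        return "wild type"
    return f"{reference}{position}{observed}"

# Toy sequences: 'X' padding stands in for the surrounding real sequence.
wild_type = "X" * 857 + "L" + "X" * 300
mutant = "X" * 857 + "R" + "X" * 300
print(classify_egfr_variant(wild_type))  # wild type
print(classify_egfr_variant(mutant))     # L858R
```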
Even though many of these promising approaches are still far from general use, they pave the way for future developments. What all of these approaches have in common, however, is that the function of the native or mutated gene products, the proteins, must be understood in detail at the cellular level. Only then is it possible to analyze and utilize the connections between the sequenced genotype and the observed phenotype. The rapid developments of the last few years in the life sciences have led to more and more biological phenomena being studied on the molecular and cellular levels. The challenge inherent in this task is apparent. The key is the realization that many questions can only be addressed successfully at the interface between chemistry and biology, with the aid of their often different, but fundamentally complementary, approaches and methods. The goal of chemical biology is to solve these fundamental biological problems through the development and application of innovative methods from the broad array of techniques available in chemistry (Figure 42.1). Often, chemical biology approaches have their roots in the analysis of structural biological data, which are the manifestation of biological phenomena.

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Part V: Functional and Systems Analytics

Figure 42.1 Chemical biology is a cyclic process in which biological phenomena are addressed. Most biological phenomena arise from orchestrated protein functions that regulate the processes taking place inside the cell. Chemical biology uses custom-made molecules to perturb these interactions (often rooted in molecular networks) specifically and analyzes their role. The development of organic molecules for use in biological experiments confronts their designers with the challenge of identifying chemical processes that yield the desired molecular properties while respecting the highly sensitive intracellular environment and/or its isolated components.

From the fundamentals of biochemistry and molecular cell biology, we know that proteins and their dynamic regulation are central elements of living cells. As a result, biological phenomena can often be traced back to the proteins that are responsible for them. Against this background, proteins and their post-translational modifications, as well as sugars and lipids, are the particular focus of chemical biology. Protein–protein interactions, or the interaction of a given protein with a particular effector or inhibitor molecule, can serve as a useful starting point for chemical biology research. Based on this information, new approaches have been developed with which chemical biology is used to study the biological phenomenon, usually by the use of preparative organic chemistry to generate the suitable reagents or molecular probes. An almost infinite repertoire of molecular probes is conceivable. Such probes can be used as reporter groups and as tags to mark proteins in cells, or as small biologically active molecules that serve to modulate protein function specifically. The probe molecules developed are characterized by means of biochemistry and biophysics before they are used in mostly cell-biological experiments. The knowledge gained by the application of these tools not only actively supports further development cycles within the chosen chemical biological method and contributes to a better understanding of the biological phenomena defined at the outset, it also has the potential to provide starting points for modern drug development. It becomes clear that chemical biology is a multidisciplinary approach that combines methods from related sciences such as chemistry, biology, and physics. Practically, these interactions take place through collaboration, either within a laboratory or between complementary laboratory groups. The spectrum of methods used in chemical biology covers a very broad range and is becoming ever larger.
A few of the key aspects, some of which will be addressed later in this chapter, are listed briefly here:

• Methods for the identification and organic synthesis of biologically active molecules that are suitable for use as modulators and molecular probes for perturbing biological systems.

• Methods for the design and development of semi-synthetic proteins and reporter molecules, which can be switched on or off by the use of drug-like substances or external light.

• Methods for the development of meaningful biochemical and cellular assay systems to allow the screening of substance libraries; phenotypic assays and high-content screening are a particular focus. The use of modern fluorescence techniques such as time-resolved fluorescence (FRET/FLIM) and fluorescence correlation spectroscopy (FCS) allows the visualization and analysis of time- and space-dependent effects in living cells.

• Identification of the targets of a given biologically active molecule. In particular, powerful mass spectrometry (ICAT and SILAC), as well as phage and ribosome display, are methods of choice.

Fluorescence Spectroscopy, Section 7.4, 32.5
FRET, Section 16.7, 32.5.3
Stable Isotope Labeling in Proteomics (ICAT, SILAC), Section 39.6

• Methods or techniques for the validation of previously identified proteins in the full breadth of their biology.

• Powerful methods and techniques of chemo- and bioinformatics, for example, to analyze the correlation between a given chemotype and the observed phenotype.

In addition to classic organic chemistry, structural biology, and mass spectrometry, methods from nano- and micro-technology are increasingly coming into use. As a result of the wide scope of materials and methods, this chapter can only offer an introduction to the broad field of chemical biology, and its utility is illustrated by selected examples. The reader is referred to the Further Reading at the end of the chapter for additional information.

42.2 Chemical Genetics – Small Organic Molecules for the Modulation of Protein Function

Over the last two decades, numerous methods for the investigation of biological phenomena on the molecular and cellular levels have been developed. They are based on the targeted manipulation of the structure, and therefore the activity, of proteins. The discovery of the pleiotropic functions of proteins, particularly within complex molecular networks, was made possible through the use of genetic methods such as mutagenesis, gene knock-out and knock-in experiments, and molecular reporters such as the green fluorescent protein (GFP). The post-genomic era has also shown that classic molecular biological approaches are often not able to close the gap between a measured genotype and an observed phenotype. This is because proteins in cells are rarely independent functional units. Instead, proteins usually manifest their functions as parts of complex structures like cascades and networks, which often interact with and regulate one another. In addition, identical proteins can carry out different functions, depending on their context in different protein networks. In such networks, a large number of proteins interact with one another physically and chemically. The complex temporal and spatial composition of such systems, as well as their dynamic regulation, represents the individual characteristics of a biological process, such as cell division and differentiation, and is essential for life. It is therefore apparent that the quantification of the information that flows through such networks is one of the great challenges in modern cell biology. Hopefully, such analyses will not only provide a better picture of which proteins interact with one another but also reveal how their dynamic regulation is resolved in time and how the corresponding functions emerge from them.
Clearly the purely genomic or proteomic information of biological processes is insufficient to describe the complete function of a protein in the context of a cellular network of a living cell, a complete organism, or even just a disease process. One of the most powerful approaches to the decoding of complex biological questions is the targeted perturbation of protein functions and the differential analysis of perturbed and unperturbed states (Figure 42.2). This is comparable to the approach of an engineer, who first takes an unknown machine apart and then puts it back together, leaving out individual parts, in order to understand the function of the apparatus and its components. In modern cell biology, functional perturbations are achieved through techniques such as mutagenesis, gene knock-out/-in, or through siRNA. These methods have often proven their utility and allowed the analysis of entire genomes in modern high-throughput approaches. Techniques to control

GFP, Section 7.3.4

Figure 42.2 Study of the function of proteins. The functions of proteins can be investigated on various levels with the help of perturbations. All the methods shown have their advantages and disadvantages. While gene knock-out and RNA interference physically remove the protein from the cell by controlling its translation or transcription, small organic molecules allow the study of the target protein in its native physiological environment.
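The differential analysis of perturbed and unperturbed states can be sketched numerically as a per-protein log2 fold-change comparison; all protein names and abundance values below are invented for illustration:

```python
import math

def log2_fold_changes(perturbed: dict[str, float],
                      control: dict[str, float]) -> dict[str, float]:
    """log2(perturbed / control) for every protein measured in both states."""
    return {p: math.log2(perturbed[p] / control[p])
            for p in perturbed.keys() & control.keys()}

def hits(changes: dict[str, float], threshold: float = 1.0) -> set[str]:
    """Proteins whose abundance shifts at least 2-fold in either direction."""
    return {p for p, lfc in changes.items() if abs(lfc) >= threshold}

# Invented abundance measurements (arbitrary units):
control = {"kinase_X": 100.0, "adapter_Y": 50.0, "enzyme_Z": 80.0}
perturbed = {"kinase_X": 25.0, "adapter_Y": 52.0, "enzyme_Z": 320.0}
changes = log2_fold_changes(perturbed, control)
print(hits(changes))  # -> {'kinase_X', 'enzyme_Z'} (order may vary)
```

Real experiments add replicates and statistics on top of this, but the core of the differential readout is exactly this comparison.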


Table 42.1 The use of biologically active molecules for the modulation of protein functions has decisive advantages relative to other methods.

Fast action/temporal control: Small biologically active molecules work rapidly and can be used at any time point in the development of a cell or organism.

Spatial control: Through the use of suitable physical (e.g., microfluidic systems) or chemical modifications (e.g., affinity to membranes, pro-drug approaches), the effects can be temporally and spatially confined to defined compartments.

Reversibility: Substances are subject to pharmacological processes in a biological system, such as diffusion, metabolism, and excretion. The biological effect of a small biologically active molecule is therefore usually reversible.

Dosing: By testing different concentrations of the small molecule, graded phenotypes can be controlled and quantified.

Differentiation: Substances allow the differentiation of the functions of protein variants that come from the same gene.

Minimally invasive: Substances allow studies on native systems and are therefore considered minimally invasive.

RNA Interference, Section 38.3

proteins on the transcriptional level are, however, not trivial and, particularly when animal studies are the focus, are still a great challenge. The development of a transgenic mouse or a knock-out mouse takes months, and the use of RNA interference and the targeting of cells by siRNA require optimization steps for each sequence used and for every cell type or organism. In addition, RNA interference eliminates the synthesis of proteins but does not influence the pool of the target protein that has already been synthesized in the cell. The targeted “switching off” of a protein function (e.g., of a given enzyme) therefore depends on its natural turnover rate and, in the most unfortunate cases, can take hours or days. The resulting temporal delay between the induced genotype and the observed phenotype allows the cell time to adapt and to compensate for the provoked changes, for example, through the activation of alternative network structures. A further fundamental problem of RNA interference and knock-down techniques is that the changes are chronic, and the protein under investigation is permanently physically removed from the cell. Particularly in the case of enzymes, not only is the originally targeted enzyme activity lost; other important secondary functions, such as scaffolding and participation in protein–protein interactions, are lost as well. Many proteins only manifest their functions through the formation of complex structures with other proteins. Permanent removal of one of these components from the cell can endanger the integrity of an entire protein complex. In addition, a detailed study of protein function over a defined time frame, or at a particular time point in the development of an organism, is difficult to achieve. Analogous problems arise when the gene product under investigation plays a critical role in the developmental process of the cell or the organism and modification of the gene is lethal in the embryonic phase.
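The delay between blocking synthesis and losing the protein follows directly from first-order turnover: after synthesis stops, the pre-existing pool decays as N(t) = N0 · (1/2)^(t/t½). A minimal sketch (the 24 h half-life is a hypothetical value):

```python
import math

def remaining_fraction(t_hours: float, half_life_hours: float) -> float:
    """Fraction of the pre-existing protein pool left t hours after
    synthesis stops, assuming first-order degradation."""
    return 0.5 ** (t_hours / half_life_hours)

def time_to_fraction(fraction: float, half_life_hours: float) -> float:
    """Hours until only `fraction` of the original protein pool remains."""
    return half_life_hours * math.log(1.0 / fraction, 2)

# A stable protein with a hypothetical 24 h half-life:
print(remaining_fraction(24.0, 24.0))  # 0.5 after one half-life
print(time_to_fraction(0.10, 24.0))    # ~79.7 h to deplete to 10%
```

Depleting the pool to 10% always takes log2(10) ≈ 3.3 half-lives, which is why knock-down of a long-lived protein lags its transcriptional silencing by days.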
Modern cell biology, therefore, needs methods that allow for the perturbation of protein functions in a biological system, without disturbing the spatial and temporal distribution of the protein of interest. As we will see, the use of small organic molecules as probes for the targeted perturbation of protein function is an attractive and often complementary approach to the genetic methods (Table 42.1). In addition, chemical and genetic approaches can be combined to improve selectivity in targeting a particular protein function.

42.2.1 Study of Protein Functions with Small Organic Molecules

From a pharmacological and chemical point of view, drug-like small molecules are ideal tools for the accurate perturbation of a target protein or of its interactions with other partner molecules inside a living cell. For example, enzyme inhibitors allow interference with the activity of the enzyme at any chosen time point in the development of a cell or an organism. With the aid of chemical perturbation of biological systems, not only can the connections in a protein network be discovered; it also becomes possible to study highly dynamic processes such as cytoskeleton reorganization, where classical genetic approaches no longer function due to the inherently

dynamic nature of the system. The rapid onset of the pharmacological effect of a biologically active compound also helps to get around the lack of an observable phenotype that is occasionally seen with single gene knock-outs due to time-dependent transcriptional compensation. In addition, the biological effect of a substance is frequently reversible, and the use of different concentrations allows for the induction of graded phenotypes; together, this allows for transient temporal control of a system. This perspective shows the necessity of developing universal methods with which the function of virtually every gene in the genome can be modulated by the use of small organic molecules. Ideally, these methods should be complementary to genetic approaches. The systematic use of biologically active molecules for the study of protein functions is, therefore, a central discipline of chemical biology. Chemical biology and the perturbation of complex systems with small molecules can be traced back to the use of poisonous natural substances for the study of biological processes. A few of these molecular probes, like brefeldin A (study of vesicular transport in cells), okadaic acid (phosphatase inhibitor), or the alkaloid colchicine from the autumn crocus (study of cell division), had a fundamental impact on entire fields within biology. The discovery of these natural substances established a new paradigm involving the use of molecular probes.
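The concentration-dependent, graded effects described above are commonly modeled with a Hill-type dose-response equation; the IC50 below is a hypothetical value chosen only for illustration:

```python
def fractional_inhibition(conc: float, ic50: float, hill: float = 1.0) -> float:
    """Fractional inhibition of a target by a reversible inhibitor,
    modeled with the Hill equation (0 = no effect, 1 = full inhibition)."""
    if conc <= 0.0:
        return 0.0
    return conc ** hill / (conc ** hill + ic50 ** hill)

# Hypothetical inhibitor with an IC50 of 100 nM:
for c in (10.0, 100.0, 1000.0):
    print(f"{c:7.1f} nM -> inhibition {fractional_inhibition(c, ic50=100.0):.2f}")
```

At the IC50 the target is half inhibited; a ten-fold change in concentration on either side spans roughly 9–91% inhibition for a Hill coefficient of 1, which is the quantitative basis of the "graded phenotypes" in Table 42.1.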
Advances in the fields of organic chemistry (e.g., automated parallel and solid-phase synthesis and new approaches for the total synthesis of complex natural substances), screening technologies (automation and miniaturization of the screening of substance libraries, powerful visualization technologies, as well as the development of meaningful phenotypic assays), protein technologies (design of conditional systems for the activation and inactivation of mutants), and computer sciences (chemoinformatics, bioinformatics, and the use and management of large data sets such as those from screening campaigns) have expanded this paradigm into a systematic approach to the search for, design, and development of drug-like molecules that are suitable for the study of biological systems. The success of chemical genetic approaches depends on the discovery and availability of organic molecules that bind to a protein or protein complex as a ligand and thereby modulate its function. Suitable molecules are usually found by screening large substance libraries that are designed and synthesized for, for example, biological relevance (focused libraries based on biologically active precursors), similarity to drugs, or structural chemical diversity (to cover as much chemical space as possible), or that are designed on the basis of structural information. Table 42.2 provides an overview of the approaches used to create substance libraries. The substance libraries are tested in several biochemical and/or phenotypic assay systems for their biological activity. The two fundamental approaches are forward chemical genetics and reverse chemical genetics (Figure 42.3). In the latter approach substance libraries

Table 42.2 Sources of substance libraries. Besides several different possibilities for the design and synthesis of substance libraries, the following three complementary approaches have proven to be particularly useful for chemical biology research.

Biology-oriented synthesis (BIOS): The design of substance libraries based on natural substances. In the course of evolution, Nature has created a series of structurally complex and potent natural substances that control the function of proteins and downstream biology. Adrenalin and estrogen are good examples; they carry out their functions as biological messengers that interact with cognate receptors. Plant components such as alkaloids, which play an important role in the defense against pests, also rate as good starting points for the synthesis of substance libraries.

Diversity-oriented synthesis (DOS): This approach focuses design and synthesis on covering as much chemical space as possible.

Structure-based ligand design (SBLD): This approach takes advantage of structural information (e.g., 3D structural information from protein X-ray crystallography) available about the target protein to guide design and synthesis, which allows the rational development of small-molecule modulators (e.g., inhibitors) and offers starting points for orthogonal, chemical-genetic approaches like the bump-and-hole approach (Figure 42.6).
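The "chemical space coverage" that DOS aims to maximize is often quantified via pairwise fingerprint similarity, for example with the Tanimoto coefficient. A minimal sketch with toy fingerprints represented as sets of on-bit indices (real fingerprints such as ECFP are far larger and generated by cheminformatics software):

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    each represented as the set of its on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Invented structural-feature fingerprints for three compounds:
compound_a = {1, 4, 7, 9, 12}
close_analogue = {1, 4, 7, 9, 15}
unrelated = {2, 5, 20, 33}
print(tanimoto(compound_a, close_analogue))  # 4 shared / 6 total bits ≈ 0.67
print(tanimoto(compound_a, unrelated))       # no shared bits: 0.0
```

A diverse (DOS-style) library keeps pairwise Tanimoto values low; a focused (BIOS-style) library around one scaffold shows high values.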


Figure 42.3 Chemical genetics. Forward (to find the protein responsible for a particular function) and reverse (to find the functions of a particular protein) chemical genetics approaches start by screening substance libraries for biologically active compounds. Substance libraries can be acquired by various means. Besides preparative organic synthesis and the isolation of natural substances from plants, microorganisms, or animals, today libraries of millions of compounds can be purchased.

Figure 42.4 Chemical genetics in the case of the protein–protein interaction stabilizer fusicoccin. (a) The influence of individual members of a substance library on the growth of plants is investigated. (b) A few of the tested substances show an obvious phenotype. (c) The biologically active molecule discovered this way was chemically modified and attached to a suitable carrier to enable the successful identification of its target protein by means of two-dimensional electrophoresis and mass spectrometry. (d) Structural biology enabled a detailed understanding of the atomic interactions between the biologically active compound and the identified target protein. (e) In this example, the X-ray structure identified the active substance as a stabilizer of protein–protein interactions. The complex natural substance discovered (transparent space-filling model) stabilizes the protein–protein interaction between an adapter protein (gray surface) and the target protein (blue space-filling model). The complex structure of the ligand and protein served as a starting point for the rational design of new small molecules with optimized characteristics (affinity, selectivity, etc.). Through the coordinated efforts of chemistry and biology, new compounds can be identified and optimized which, due to their wilting activity, for example, can be further developed to combat weed growth.

are usually screened against a predefined target structure (e.g., for the inhibition of an enzyme activity or of a protein–protein interaction), and the resulting substances are then used as probes for the investigation of the function of the target protein in physiological systems. In forward chemical genetics, the target structures are unknown at the outset. Substances responsible for a desired effect or for a new, interesting phenotype found while screening in cell-based assays are later used for the identification of their target proteins (e.g., by means of pull-down experiments and mass spectrometry). Figure 42.4 outlines this approach with a stabilizer of a protein–protein interaction, namely fusicoccin.

42.2.2 Forward and Reverse Chemical Genetics

Genetic analysis can be divided into forward approaches, in which certain phenotypes are analyzed, and reverse approaches, in which the phenotypic consequences of mutations are in focus. Forward approaches aim to identify the modulated gene product that is the cause of the phenotype. Analogously, forward chemical genetics is based on phenotypes induced by small organic molecules that interact with certain biological macromolecules like proteins, RNA, or DNA. Reverse chemical genetics, in contrast, often starts with a known protein (it may have been identified by genetics, molecular biology, or earlier chemical biology experiments) and seeks to investigate its function using perturbation experiments. To this purpose, an assay based on the activity of the protein of interest is developed to screen a compound library for active molecules. The screening hits serve as starting points for the design of molecular probes, which are then used to investigate the behavior of the protein in its environment. Such an analysis involves a three-step process that begins with the development of a suitable phenotypic assay that allows the readout of the desired biological effect. For example, this can be the degree of differentiation of a cell line. Frequently, reporter gene assays are used in this


Figure 42.5 Reporter gene assay. The genetic information for the enzyme luciferase is cloned downstream of the promoter of the gene under investigation. The resulting constructs are transfected into suitable cells and, by measuring the luciferase activity, the effect of small organic molecules on the upstream signaling cascade can be read out directly.

context (Figure 42.5). To enable the testing of as many substances as possible for their biological activity, in the second step the assays are miniaturized to microtiter plate format (96, 384, or 1536 wells). After the discovery of active compounds, the identification and biological validation of the cellular targets, whose modulation leads to the signal in the assay, is often the greatest challenge. Once a new compound has been identified as a modulator of the function of a gene product, reverse chemical genetics becomes possible.
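The quality of such miniaturized screening assays is commonly summarized with the Z′-factor of Zhang and coworkers, Z′ = 1 − 3(σ₊ + σ₋)/|μ₊ − μ₋|, computed from positive and negative control wells; the luciferase readouts below are invented values:

```python
import statistics

def z_prime(pos_controls: list[float], neg_controls: list[float]) -> float:
    """Z'-factor assay-quality metric (Zhang et al., 1999).

    Z' = 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|.
    Z' > 0.5 is conventionally regarded as an excellent screening assay.
    """
    mu_p, mu_n = statistics.mean(pos_controls), statistics.mean(neg_controls)
    sd_p, sd_n = statistics.stdev(pos_controls), statistics.stdev(neg_controls)
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mu_p - mu_n)

# Invented luciferase readouts (relative light units) from control wells:
positives = [980.0, 1010.0, 995.0, 1005.0]
negatives = [110.0, 95.0, 105.0, 90.0]
print(round(z_prime(positives, negatives), 3))
```

Wide separation between control means and tight well-to-well scatter push Z′ toward 1, which is what makes an assay robust enough for 384- or 1536-well screening.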

42.2.3 The Bump-and-Hole Approach of Chemical Genetics

As we have seen in the previous section, the modulation of protein functions with small organic molecules is a very effective method and a core discipline of chemical biology. The development of molecules that can selectively switch off the function of a particular target protein is, however, very difficult. Inhibitors, for example, often affect other proteins as well as the intended target, particularly those that are structurally or functionally related. That the development of a new drug often takes a dozen years and enormous financial resources shows how difficult this task is; the development of a perfectly selective drug without undesirable side effects, which in many cases result from interactions with other proteins, is almost impossible. To get around this central problem, at least for the analysis of biological systems, and to combine the advantages of small organic molecules (Table 42.1) with the high precision of genetic technology, the bump-and-hole method was developed. It involves enlarging the binding pocket of the targeted protein by targeted mutagenesis to create a "hole" to which a more sterically demanding ligand, carrying the "bump", binds selectively, without disturbing the function of the (non-mutated) wild-type or related proteins. The ligand can be an inhibitor, an agonist, or a chemically modified cofactor, making it possible, for example, to analyze the substrate specificity of the target enzyme in the context of a living cell (Figure 42.6). The bump-and-hole approach was first used to investigate the interaction between the elongation factor EF-Tu and the ribosome and has since been applied to many enzyme classes. Particularly in the laboratory of Kevan Shokat at the University of California, San Francisco, this method has been developed into a powerful chemical genetics approach to determine the function of protein kinases in complex systems.
This approach is also called analogue-sensitive kinase alleles (ASKAs).

ASKA – the Combination of Chemical and Genetic Methods for the Selective Inhibition of Protein Kinases As early as 1839, scientists had discovered that the element phosphorus is present in proteins. However, it took another 120 years before the enzymatic transfer of phosphorus to proteins in the form of phosphate groups was discovered, thereby laying the basis for the understanding of this key post-translational modification. Today we know that protein kinases transfer the γ-phosphate group of ATP to protein substrates and thereby play key roles in the control of protein functions and the complex


Figure 42.6 The ASKA (analogue-sensitive kinase alleles) approach creates highly selective kinase/inhibitor pairs (schematic views on the left-hand side; structure of the ATP binding pocket in complex with the ligand on the right-hand side). (a) Inhibition of a protein kinase by an unspecific, ATP-competitive inhibitor. (b) After modification with a bulky substituent, the inhibitor forms unfavorable interactions with the gatekeeper amino acid (highlighted in blue) in the back of the ATP binding pocket. (c) A point mutation of the gatekeeper residue enlarges the ATP pocket, enabling it to bind the sterically demanding bumped inhibitor.

regulation of signaling pathways. The human genome encodes over 500 kinases, and the physiological and pathophysiological significance of each member of this enzyme family is of particular interest to current research. However, the complex regulation of these enzymes, as well as their often overlapping substrate specificities, makes the identification of individual kinase activities in biological processes extremely difficult. Answering the question of which kinase is active, at what time point, in which network in the development of an organism, and how this activity influences downstream signaling or processes, requires particularly powerful methods. With what we have learned in the previous sections, the answers should be relatively easy to find with the help of perturbation experiments in which kinase inhibitors block the enzymatic transfer of phosphate groups to substrates and thereby disrupt biological systems. However, this requires extremely selective kinase inhibitors that inhibit only the kinase of interest. The development of potent and selective inhibitors is the central problem of kinase research. Every kinase uses ATP as a cofactor, and most known kinase inhibitors compete directly with ATP for the binding site in the catalytic center of the kinase domain. Since this catalytic domain and, in particular, the ATP binding site are highly conserved, the development of mono-selective inhibitors is an almost impossible task. Nevertheless, to study the function of kinases with the help of inhibitors, clever approaches have been developed that combine chemical and genetic methods. The ATP binding site of the kinase under study is enlarged by a point mutation such that only a specially developed, sterically demanding inhibitor binds to the mutated kinase and blocks the phosphotransfer.
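The steric logic of this enlarged-pocket trick can be caricatured in a few lines: the bulky inhibitor binds only if the gatekeeper side chain leaves enough room in the ATP pocket. The side-chain volumes are rounded literature values, but the "pocket budget" and inhibitor bulk are invented, purely illustrative numbers, not crystallographic measurements:

```python
# Approximate amino acid residue volumes in cubic angstroms (rounded
# literature values), used here as a toy proxy for space in the ATP pocket.
SIDE_CHAIN_VOLUME = {"G": 60, "A": 89, "S": 89, "T": 116, "C": 109,
                     "V": 140, "L": 167, "I": 167, "M": 163, "F": 190}

def inhibitor_fits(gatekeeper: str, inhibitor_bulk: int,
                   pocket_budget: int = 230) -> bool:
    """Toy bump-and-hole rule: the bumped inhibitor binds only if the
    gatekeeper residue plus inhibitor fit within the pocket budget."""
    return SIDE_CHAIN_VOLUME[gatekeeper] + inhibitor_bulk <= pocket_budget

BULKY_ANALOGUE = 120  # invented extra bulk of an NM-PP1-like bumped inhibitor

print(inhibitor_fits("M", BULKY_ANALOGUE))  # wild-type Met gatekeeper: False
print(inhibitor_fits("G", BULKY_ANALOGUE))  # M->G analogue-sensitive mutant: True
```

The real selectivity arises from shape complementarity in three dimensions, but this one-dimensional volume budget captures why the bumped inhibitor spares the wild-type kinase.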
The artificially created complementarity of the ligand and protein structures can reach a surprisingly high level of selectivity, and the approach can be transferred one-to-one to almost any kinase in a number of organisms. This is made possible by the exchange of the gatekeeper, a conserved, often large, hydrophobic amino acid located in the ATP binding site. The exchange of this sterically demanding amino acid (phenylalanine, methionine, etc.) for a smaller residue such as alanine or glycine creates additional space in the pocket that is not present in the wild-type kinase. If one selects a relatively unspecific kinase inhibitor, such as the pyrazolo-pyrimidine PP1 (Figure 42.7), and modifies its structure such that the additional steric bulk of the new analogue (NM-PP1) is complementary to the enlarged ATP pocket of the mutated kinase, the mutated (analogue-sensitive) kinase is inhibited, while the wild-type kinase is unaffected (Figure 42.6). By repeated use of the same sterically demanding inhibitor and mutation of the gatekeeper in the corresponding target kinase, it is relatively simple to generate a number of analogue-sensitive kinase alleles and to inhibit them selectively. Remarkably, although the gatekeeper is located in the immediate proximity of the ATP binding site, the exchange usually does not affect the function of the kinase. The use of inhibitors that are suitable for ASKAs opens other excellent

42 Chemical Biology


Figure 42.7 Inhibitors and cofactors of analogue-sensitive kinases. (a) Based on classical ATP-competitive kinase inhibitors, sterically demanding analogues that are complementary to analogue-sensitive kinases can be generated. The group highlighted in blue prevents binding to the wild-type kinase. A further modification of the inhibitor (e.g., with a Michael acceptor (gray)) allows the covalent modification of cysteines introduced into the target kinase by site-directed mutagenesis. (b) Modified ATP analogues are selectively recognized by analogue-sensitive kinases and can be used to transfer radioactively labeled phosphate groups (∗, ³²P label).

opportunities for the experimental analysis of the cellular function of kinases. Since the previously mentioned inhibitors NM-PP1 and NA-PP1 are readily taken up by cells from the medium or can be added to animal feed, they enable a rapid and dose-dependent perturbation of the analogue-sensitive kinases. In addition, the subsequent analysis of the induced phenotype is not limited in any way, since the process is performed in an almost native system. Of particular interest is the comparison of changes in gene expression profiles (microarray technologies) that arise from the selective inhibition of analogue-sensitive kinases. The resulting information can contribute to the elucidation of the cellular networks that are regulated by the kinase of interest, or to the identification of new inhibitors. In the latter case, the expression profile of cells that were treated with the new compound is compared to the profiles of cells treated with the analogue-sensitive kinase’s complementary inhibitor. The advantages of ASKA technology relative to purely biological methods can be illustrated in a series of examples. Bruton's tyrosine kinase (BTK) plays an important role in the immune response. Knock-out experiments that interrupted the function of BTK did not result in a clear phenotype because other kinases in the network associated with BTK compensated for its loss. The speed of the chemical perturbation of the analogue-sensitive variant of BTK, on the other hand, does not leave the cells sufficient time to compensate for the loss of BTK activity. With the use of ASKA-BTK, a much clearer picture of the regulation of cellular processes by BTK was possible. It is often interesting to correlate the strength of the activity of a particular kinase with the observed effect in an animal model. The classical approach to this problem would require the generation of a

DNA-Microarray Technology, Chapter 37


Part V: Functional and Systems Analytics

large number of mouse lines, each of which expressed a different amount of the kinase of interest. ASKA, on the other hand, only requires a single mouse line and enables the regulation of the kinase activity by the simple adjustment of the dose of the corresponding inhibitor.

The Gatekeeper Residue, an Important Amino Acid in the Kinase Domain As mentioned in the previous section, the gatekeeper amino acid is one of the few amino acids near the ATP binding pocket that can be mutated without seriously affecting the activity of the enzyme. In addition, the side chain of the gatekeeper is one of the most important determinants for the development of selective therapeutic kinase inhibitors. While the polarity of the side chain, for example in the case of threonine, allows direct interaction with ligands via hydrogen bonds, the size of the residue controls access to the deeper reaches of the ATP binding pocket. It is therefore not surprising that, during the treatment of cancer patients with kinase inhibitors, clinically relevant resistance mutations emerge at exactly this spot, preventing the binding of the inhibitor. The gatekeeper residue is, therefore, an important amino acid, which, on the one hand, enables innovative approaches to the characterization of biological systems and, on the other hand, presents a great challenge to targeted tumor therapy. The realization that the gatekeeper is a hot spot for the emergence of resistance came only years after the first analogue-sensitive kinase alleles had been created in laboratories. However, the principle behind the emergence of resistance by steric blocking can also be used for targeted chemical-biological experiments. To validate a particular kinase as the biologically relevant target of a newly discovered or newly developed inhibitor, larger amino acids can be deliberately introduced at the gatekeeper position to create an artificial resistance.
As a result, the wild type is inhibited but not the mutated kinase.
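The two complementary uses of the gatekeeper described above, enlarging the pocket to create an analogue-sensitive allele and enlarging the residue to create artificial resistance, both reduce to a simple steric-fit rule. The following toy model illustrates that logic; the residue "sizes" and inhibitor "bulk" values are arbitrary invented units, not measured quantities, and the threshold rule is a deliberate simplification of real binding energetics.

```python
# Toy model of gatekeeper-controlled inhibitor selectivity (arbitrary units).
# Assumption: an inhibitor binds only if its bulky group fits into the space
# left free by the gatekeeper side chain -- a deliberate simplification.

GATEKEEPER_SIZE = {"Gly": 1, "Ala": 2, "Thr": 4, "Met": 6, "Phe": 7}
POCKET_CAPACITY = 8  # notional total space at the gatekeeper position

def binds(inhibitor_bulk: int, gatekeeper: str) -> bool:
    """True if the inhibitor's bulk fits next to the gatekeeper side chain."""
    return inhibitor_bulk <= POCKET_CAPACITY - GATEKEEPER_SIZE[gatekeeper]

PP1_BULK = 2      # promiscuous parent inhibitor (small)
NM_PP1_BULK = 5   # bulky analogue tailored to the enlarged pocket

# Wild-type kinase (Met gatekeeper) vs analogue-sensitive allele (Gly):
assert binds(PP1_BULK, "Met")          # PP1 is unselective: hits the wild type
assert not binds(NM_PP1_BULK, "Met")   # bulky analogue is excluded by Met
assert binds(NM_PP1_BULK, "Gly")       # ...but inhibits the as-allele

# Artificial resistance: enlarging the gatekeeper to Phe excludes the
# inhibitor, which validates the kinase as the inhibitor's relevant target.
assert not binds(NM_PP1_BULK, "Phe")
```

Real selectivity arises from binding free energies rather than a single size threshold; the sketch only encodes the qualitative "bump-and-hole" complementarity.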

42.2.4 Identification of Kinase Substrates with ASKA Technology

To understand the biological function of a protein kinase in detail, knowledge about the corresponding substrate proteins is of central importance. Over the years, methods have been developed that allow the identification of kinase–substrate pairs in their biologically relevant context by the use of radioactive ATP (the γ-phosphate group of ATP is marked with the radioactive phosphorus isotope ³²P). However, the analysis is complicated by the fact that protein kinases often have overlapping substrate specificities, and experiments often contain a complex mix of radioactively labeled proteins, which makes the association of a particular kinase with a phosphorylation event difficult or even impossible. The previously mentioned analogue-sensitive methods allow the use of sterically demanding, radioactively labeled ATP analogues that can only be recognized and enzymatically processed by the correspondingly mutated kinases. To adapt the kinase to the modified ATP analogues, the gatekeeper residue near the hinge region of the kinase domain is again mutated to a smaller amino acid (e.g., alanine or glycine). The complementary ATP analogues are chemically modified at the N6 position of the adenine ring system, for example through the introduction of a benzyl or cyclohexyl group (Figure 42.7b). Correspondingly, the γ-phosphate of the modified ATP analogue is transferred to the appropriate substrates. This approach can be combined with the use of an inhibitor complementary to the analogue-sensitive kinase; the changes in the phosphorylation pattern caused by the ASKA kinase and the corresponding inhibitor are then analyzed. With the aid of this approach, it was possible to identify several previously unrecognized substrates. Frequently, however, the low copy number of the substrate proteins makes this difficult.
To improve the situation, special ATPγS analogues were developed, which are used by analogue-sensitive kinases and transfer a thiophosphate group instead of a phosphate group to the substrate. The special reactivity of the thiophosphate group attached to the substrate allows its alkylation with p-nitrobenzyl mesylate to form a thiophosphate ester. This conjugate is recognized as a hapten by specially created monoclonal antibodies. With this method, not only can the transfer of the phosphate group (thiophosphate group) be followed without the need for radioactivity, it also allows an enrichment of the substrate by immunoprecipitation. With the aid of this approach it was possible to identify the nucleoporin Tpr, which plays an important role in the transport of proteins into the nucleus, as a substrate of the MAP kinase Erk2.
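At the level of data analysis, the core of this substrate-identification experiment is a background subtraction: only the analogue-sensitive kinase can use the bulky ATP analogue, so any labeled protein that also appears in a sample containing only the wild-type kinase is background. A minimal sketch of that comparison (the protein names are invented placeholders, not experimental data):

```python
# Sketch of the comparison logic behind ASKA substrate labeling.
# Only the analogue-sensitive (as) kinase accepts the bulky N6-modified
# ATP analogue; label found with the wild-type kinase is background.

def candidate_substrates(labeled_as_kinase: set, labeled_wild_type: set) -> set:
    """Proteins labeled only in the as-kinase sample are substrate candidates."""
    return labeled_as_kinase - labeled_wild_type

lysate_with_as_kinase = {"substrate_A", "substrate_B", "sticky_background"}
lysate_with_wt_kinase = {"sticky_background"}  # cannot use the analogue

print(sorted(candidate_substrates(lysate_with_as_kinase, lysate_with_wt_kinase)))
# -> ['substrate_A', 'substrate_B']
```

In practice the sets would come from gel bands or mass-spectrometric identifications, and quantitative thresholds would replace the simple set difference.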


Figure 42.8 Caging gives temporal and spatial control (through external stimuli) over the release of molecular probes. The activity of the probe is initially masked or caged. Exposure to UV light, for example, uncages the probe, which is released to bind to its target structure and modulate its biological function.

42.2.5 Switching Biological Systems on and off with Small Organic Molecules

The presented methods and techniques have shown how protein function can be controlled with the aid of small organic molecules. The expression of target genes can also be modulated in this manner. The best-known example is the Tet system, in which expression of the target gene is placed under the control of doxycycline. The tetracycline doxycycline is a broad-spectrum antibiotic that binds specifically to the Tet repressor (TetR), which plays a central role in bacterial resistance to tetracycline. To make the expression of chosen genes controllable by doxycycline, TetR, a doxycycline-inducible DNA-binding unit, can be fused to eukaryotic gene regulation domains. With the aid of this system, selected oncogenes can be brought under the inducible control of doxycycline in transgenic mice, which allows the pathological effects of these genes in the development of cancer to be studied, for example. This elegant method allows the temporal and spatial control of gene expression, which is often interesting for organ-specific expression in animal studies. An alternative is caging, which modifies the chemical probe such that an interaction with the corresponding target structure is impossible and the probe is initially inactive – the biological activity is, in effect, trapped in a cage. Only an external stimulus, like a temperature change, pressure, or irradiation with UV light, breaks open the cage and allows the probe to become active (Figure 42.8). The activation (uncaging) of a caged probe by photolysis is particularly attractive. In a transparent organism or a cell, even the smallest regions can be addressed under a microscope and the activity of the freed probe can be investigated in an organ, a cell compartment, or a patch of a membrane.
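The dose-dependent control offered by doxycycline-regulated systems is commonly described by a sigmoidal dose-response curve. The sketch below uses a generic Hill function with made-up EC50 and Hill-coefficient values, purely to illustrate how expression can be tuned by dose; it is not a fitted model of any particular Tet construct.

```python
# Illustrative Hill-type dose-response for doxycycline-controlled expression.
# EC50 (100 nM) and Hill coefficient (2) are invented illustrative values.

def relative_expression(dox_nM: float, ec50_nM: float = 100.0, n: float = 2.0) -> float:
    """Fraction of maximal target-gene expression at a given doxycycline dose."""
    return dox_nM ** n / (ec50_nM ** n + dox_nM ** n)

for dose in (0.0, 10.0, 100.0, 1000.0):
    print(f"{dose:7.1f} nM doxycycline -> {relative_expression(dose):.2f} of maximum")
```

By construction, expression is half-maximal at the EC50 and approaches saturation at high doses, which is the behaviour exploited when titrating inhibitor or inducer doses in animal studies.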
For example, a photosensitive doxycycline derivative was used this way for the light-inducible control of expression of target proteins in the brains of mouse embryos. Further approaches to the direct control of gene expression are based on photoactivatable antisense DNA or RNA molecules, which only bind their target structures under the influence of light and thereby cause their degradation in the cells. To make oligonucleotides controllable with light, the photolabile group can be attached either directly to the bases or the phosphate backbone. In principle, caging can also be applied to gain spatial and temporal control of the activity of any molecule – inhibitors, cofactors, or substrates (Figure 42.9). A similar approach involves the

Line of a Ribozyme, Section 38.2

Figure 42.9 Temporal control of the expression of a target protein. The coding sequence of a ribozyme is fused upstream of the genetic information of the target gene. The chosen ribozyme is catalytically active and degrades the fusion transcript after transcription – translation does not take place. Release of a ribozyme inhibitor by light blocks the catalytic activity of the ribozyme, which enables the transcription and translation of the target protein.


use of molecular probes that can be switched between a biologically active and a biologically inactive form with monochromatic light. Because the isomerization of the probe is reversible and depends on the wavelength of the monochromatic light, macromolecules such as ion channels or other receptors become switchable and can therefore be investigated conditionally in biological systems.
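For such a reversibly photoswitchable probe, the fraction of molecules in the active form at the photostationary state follows from the wavelength-dependent forward and backward switching rates. A minimal two-state sketch (the rate constants are arbitrary illustrative values, not measured photochemical data):

```python
# Two-state photoswitch sketch: a probe interconverts between an inactive
# and an active isomer under monochromatic light. The photostationary state
# is set by the wavelength-dependent forward and backward rates.

def photostationary_active_fraction(k_forward: float, k_backward: float) -> float:
    """Steady-state fraction of the active isomer for first-order switching."""
    return k_forward / (k_forward + k_backward)

# One wavelength drives the probe mostly active, another mostly inactive:
uv_on = photostationary_active_fraction(k_forward=9.0, k_backward=1.0)
blue_on = photostationary_active_fraction(k_forward=1.0, k_backward=9.0)
print(f"UV: {uv_on:.1f} active, blue: {blue_on:.1f} active")
```

Switching the illumination wavelength thus toggles the population between predominantly active and predominantly inactive states, which is the basis for conditionally switching ion channels or receptors.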

42.3 Expressed Protein Ligation – Symbiosis of Chemistry and Biology for the Study of Protein Functions

One of the great surprises of early genome analysis was the recognition that the size of a genome is not representative of the complexity of an organism. For example, the genomes of the roundworm Caenorhabditis elegans (ca. 20 000 genes) and the fruit fly Drosophila melanogaster (ca. 14 000 genes) are only marginally smaller than the genome of modern humans (ca. 25 000 genes). Yet the morphological and functional differences between humans and a fly or a worm are fundamental. If it is not the number of gene products, what, then, is responsible for the complexity of an organism and of life in general? Craig Venter, one of the scientists who played a leading role in decoding the human genome, stated the following in one of his highly regarded articles in the journal Science: “The finding that the human genome contains fewer genes than previously predicted might be compensated for by combinatorial diversity generated at the level of post-translational modification of proteins.” Post-translational modifications of proteins are specific, enzyme-catalyzed covalent modifications that change the information content of proteins. Indeed, post-translational modifications play a central role in the regulation of all metabolic processes. Nature can thereby control the function of proteins by, for example, a change in stability, charge, cellular localization, three-dimensional structure, or interaction with other molecules. In cells, post-translational modifications such as phosphorylation, conjugation with lipids, or glycosylation are usually subject to a finely tuned, reversible exchange that allows a previously introduced modification to be removed again, returning the system to its initial state.
Regulatory mechanisms based on post-translational modifications are highly dynamic, which makes them interesting for the elucidation of biological questions but, at the same time, difficult to handle. Often, the modified protein preparations necessary for a detailed investigation of function cannot be obtained by purely biological methods, such as targeted mutagenesis or recombinant protein expression. The development and application of combined chemo-biological techniques for the chemoselective modification of proteins have proven to be a superb tool for the study of protein function on the molecular level and represent another important focus of chemical biology.

42.3.1 Analysis of Lipid-Modified Proteins

The originally purely synthetic method of native chemical ligation (NCL) has proven to be extremely useful for protein chemistry. This method allows the synthesis of large peptides by the condensation of peptide fragments, one bearing a C-terminal thioester and the other an N-terminal cysteine. The semisynthetic version of NCL, known as expressed protein ligation (EPL), combines chemical synthesis with biological techniques. This method allows the fusion of synthetically manufactured peptides with recombinantly produced proteins. EPL permits the site-specific modification of proteins with a large number of probes, such as fluorophores, spin labels, stable isotopes, unnatural amino acids, or post-translational modifications, and has already been applied successfully to a large number of questions of protein design. A previously selected reporter group (e.g., a fluorophore or an unnatural amino acid) is chemically built into the peptide. The ligation of the peptide with the recombinant protein fragment allows, for example, structural biological investigations of the previously mentioned post-translationally modified proteins (Figure 42.10). The example shown in Figure 42.10 demonstrates that EPL could be used to obtain preparative amounts of mono- and diprenylated variants of the Rab-GTPase Ypt1, which


Figure 42.10 Expressed protein ligation (EPL). (a) A protein splicing element (intein) is recombinantly overexpressed, together with an affinity tag such as the chitin-binding domain (CBD), as a fusion with the C-terminus of the target protein. The N-terminal cysteine of the intein domain initiates an N,S-acyl shift, after which exogenous thiols cleave off the intein by thiolysis, yielding the reactive thioester of the recombinant target protein. The final, semisynthetic target protein is formed by the reaction of this C-terminal thioester with the N-terminal cysteine of the synthetic peptide in a ligation reaction. Chitin beads allow the effective separation and purification of the formed construct; commercially available systems for intein-mediated purification with immobilized chitin have been developed. There are numerous examples of biological questions that could be addressed by use of EPL. One example is the X-ray structure of the monoprenylated Ypt:RabGDI complex (PDB code 1ukv). (b) EPL made the isolation of preparative amounts of mono- and diprenylated variants of the Rab-GTPase Ypt1 possible. The prenyl-modified and synthetically produced dipeptide (gray) is ligated with the C-terminus of the recombinantly obtained Ypt1 (blue). (c) After successful folding, the crystallization and structure determination of the semi-synthetically modified Ypt1 in complex with its physiological modulator, the RabGDP-dissociation inhibitor (RabGDI), was possible for the first time. Rab proteins are important regulators of vesicular membrane transport and mediate numerous events, such as the docking and fusion of membranes as well as their intracellular mobility. Post-translational modifications like prenylation are therefore essential for protein function. The X-ray structure of the Ypt1:RabGDI complex showed a conformational change of RabGDI induced by binding of the prenylated Ypt1 and an associated formation of a hydrophobic binding pocket that accepts the prenyl side chain of Ypt1.
In the uncomplexed state the prenyl moiety is anchored in the plasma membrane. The illustration shows the surfaces and secondary structure of RabGDI (light gray) bound to Ypt (blue) with the prenyl moiety (dark gray). The close-up shows the prenyl moiety of Ypt1 bound in the hydrophobic pocket of RabGDI. Source: from Rauh, D. and Waldmann, H. (2007) Angew. Chem., Int. Ed. Engl., 46, 826–829. With permission, Copyright © 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

enabled their crystallization and structure determination in a complex with their physiological modulator, the RabGDP-dissociation inhibitor (RabGDI). Rab proteins are GTPases of the Ras superfamily and play a central role as regulators of vesicular membrane transport. They modulate numerous events, such as the docking and fusion of membranes and their intracellular mobility. Post-translational modifications are essential for protein function and the course of physiological processes. The X-ray structure of the Ypt1:RabGDI complex showed a conformational change in RabGDI induced by the binding of the prenylated Ypt1 and an associated formation of a hydrophobic binding pocket, which accepts the prenyl side chain of Ypt1. In the uncomplexed state, the prenyl moiety is anchored in the plasma membrane of the cell and therefore sequestered. The group of Roger Goody at the Max Planck Institute in Dortmund was the first to determine the position of the prenyl binding site, which led to the elucidation of the molecular mechanism of the membrane insertion and extraction of Rab proteins by factors such as RabGDI. Interestingly, mutations in RabGDI can cause mental retardation in humans. With prenylated proteins obtained by EPL, comprehensive biophysical and structural biological studies could be carried out, which showed that prenylated Rab proteins such as Ypt1 are extracted from the membrane less efficiently by the mutated RabGDI. Thus, a disruption of membrane transport is the likely molecular cause of this genetic disease.
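A practical constraint of NCL/EPL is that the ligation junction must supply an N-terminal cysteine on the synthetic fragment. The following planning helper scans a target sequence for cysteines that could serve as junctions; the sequence and modification site are invented for illustration and are not from any published protocol.

```python
# Planning sketch for an EPL/NCL junction: the synthetic (C-terminal) fragment
# must begin with a cysteine. This helper lists candidate junction positions.

def epl_junctions(sequence: str, modification_index: int) -> list:
    """Indices of cysteines at or before the residue to be modified,
    i.e. positions where the synthetic peptide could begin."""
    return [i for i, aa in enumerate(sequence)
            if aa == "C" and i <= modification_index]

seq = "MKTAYICGQDSWTCPRFGK"   # hypothetical target protein sequence
mod = seq.index("F")           # residue chosen to carry the synthetic probe

sites = epl_junctions(seq, mod)
best = max(sites)              # latest cysteine -> shortest synthetic peptide
print(sites, "-> synthesize peptide", seq[best:])
```

Choosing the cysteine closest to the modification site keeps the synthetic peptide short, which is desirable because solid-phase synthesis of long peptides is increasingly inefficient; where no native cysteine is suitable, one is often introduced by mutagenesis.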


42.3.2 Analysis of Phosphorylated Proteins

Posttranslational Modifications, Chapter 25

As discussed in the previous sections, the enzymatic transfer of phosphate groups to the side chains of serine, threonine, and tyrosine is a central modification of proteins and is essential for the maintenance of almost all biological systems. On the cellular level, kinases oppose the action of phosphatases, which hydrolytically remove the phosphate groups. The orchestrated balance between phosphorylation and dephosphorylation forms the basis of signal transduction, which allows, for example, the exchange of information between different compartments of the cell. A detailed biological understanding of these complex processes, as well as the recognition that dysregulation of the interplay between the two enzyme classes can be causal in the genesis and progression of diseases like cancer, autoimmune diseases, diabetes, or neurological defects, has led these signal-transducing proteins to be considered promising target proteins in modern drug research. However, the detailed biochemical and biological characterization of the influence of phosphorylation patterns on the function of proteins has proven to be very difficult owing to the tendency towards dephosphorylation, particularly under physiological conditions. These difficulties can be elegantly avoided by the semi-synthetic incorporation of, for example, phosphonomethylene-L-phenylalanine (Pmp) as a well-characterized, non-hydrolysable phosphotyrosine mimic for the production of homogeneous protein preparations. This phosphonate imitates the functional phosphorylation at the essential positions in the protein under investigation. EPL techniques can also be applied to study protein phosphorylation in vivo: semi-synthetically produced proteins equipped with non-hydrolysable phosphotyrosine mimics can be microinjected into living cells.
Besides these examples of semi-synthetic methods for the study of lipid modification and phosphorylation, a series of other ligation techniques have been developed, which, among other things, allow the transfer of complicated glycosylation patterns to proteins.

42.3.3 Conditional Protein Splicing

The utility of protein ligation for the study of protein functions is apparent, and there is therefore great interest in the development of minimally invasive techniques that allow an analysis of target proteins in the complex environment of all proteins in vivo. One of these methods is conditional protein splicing. This is an in vivo technique that makes possible the reconstitution of two inactive protein fragments into the functionally active target protein under the control of a cell-permeable, small-molecule ligand (Figure 42.11). The method of conditional protein splicing is based on the fusion of inactive fragments of the target protein with the two halves of a split intein. The two intein halves are fused, in turn, to one of two rapamycin-binding proteins (FKBP and the rapamycin-binding domain FRB). Under the influence of rapamycin, the two rapamycin-binding proteins dimerize. The resulting complex formation of the intein fragments initiates the autocatalytic splice reaction to reconstitute the functional and active target protein. This elegant method enables the activation of protein functions under the

Figure 42.11 Conditional protein splicing. As an important development of chemical-biological ligation techniques, this method allows protein functions to be turned on by small organic molecules in cellular systems. (a) The biologically inactive protein fragments (1 and 2) of the target protein to be reconstituted are cloned as fusions of the split intein halves (N-intein and C-intein) and the rapamycin-binding proteins (FRB and FKBP). The fused constructs are expressed in cells. (b) Under the control of the small-molecule, cell-permeable natural substance rapamycin, the rapamycin-binding domains of FRB and FKBP interact, which leads to dimerization of the two constructs and results in the reconstitution of the intein. (c) The functional intein initiates the splice reaction that joins protein fragments 1 and 2 into the biologically active target protein. Source: from Rauh, D. and Waldmann, H. (2007) Angew. Chem., Int. Ed. Engl., 46, 826–829. With permission, Copyright © 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.


control of small organic molecules in in vivo systems and represents an example of the development of ligation techniques in the field of chemical genetics. The ligation methods briefly described here are innovative tools for the study of protein functions in living cells and make it possible to better understand and follow the function of proteins in the dynamic and variable environment of biological processes.

Further Reading

Baker, A.S. and Deiters, A. (2014) Optical control of protein function through unnatural amino acid mutagenesis and other optogenetic approaches. ACS Chem. Biol., 9, 1398–1407.

Bishop, A.C., Ubersax, J.A., Petsch, D.T., Matheos, D.P., Gray, N.S., Blethrow, J., Shimizu, E., Tsien, J.Z., Schultz, P.G., Rose, M.D., Wood, J.L., Morgan, D.O., and Shokat, K.M. (2000) A chemical switch for inhibitor-sensitive alleles of any protein kinase. Nature, 407, 395–401.

Dar, A.C. and Shokat, K.M. (2011) The evolution of protein kinase inhibitors from antagonists to agonists of cellular signaling. Annu. Rev. Biochem., 80, 769–795.

Hasson, S.A. and Inglese, J. (2013) Innovation in academic chemical screening: filling the gaps in chemical biology. Curr. Opin. Chem. Biol., 17, 329–338.

O’Connor, C.J., Laraia, L., and Spring, D.R. (2011) Chemical genetics. Chem. Soc. Rev., 40, 4332–4345.

Rak, A., Pylypenko, O., Durek, T., Watzke, A., Kushnir, S., Brunsveld, L., Waldmann, H., Goody, R.S., and Alexandrov, K. (2003) Structure of Rab GDP-dissociation inhibitor in complex with prenylated YPT1 GTPase. Science, 302, 646–650.

Waldmann, H. and Janning, P. (2014) Concepts and Case Studies in Chemical Biology, Wiley-VCH Verlag GmbH, Weinheim.


43 Toponome Analysis

“Life is Spatial”

The hierarchy of cellular functionalities consists of at least four functional levels: genome, transcriptome, proteome, and toponome. They exist as interacting spatial systems and subsystems, whose proper functions require correct, topologically determined compositions of their molecular components. Moreover, there is no doubt that a biological function depends not only on the amount of the participating molecules but in particular on their local context – the molecular neighborhood inside cellular or tissue structures plays an important role. For example, it has been shown that the relative concentration and differential relative spatial order of more than 20 proteins at the cell surface membrane enable a tumor cell to enter an exploratory state preceding and inducing the migratory state. Furthermore, similar investigations in clinical samples have shown that the co-mapping of a large number of molecular components was key to finding a hierarchy of molecular networks and predicting a disease-specific target molecule whose downregulation leads to clinical effects, as shown for amyotrophic lateral sclerosis. These examples show that the in vivo/in situ topology of large molecular systems, with their inherent topological hierarchies controlled by lead proteins, is important for the finding of clinically relevant target molecules. If this topological context is destroyed by tissue homogenization or by the isolation of components, the spatial context of the biomolecules – the toponome – is lost. Hence, toponome in situ technologies are needed when it comes to the spatial analysis of the large molecular systems (the toponome) driving cellular function or dysfunction in disease – the so-called disease robustness networks (Schubert 1995; 2015). Toponomics addresses the structure and (dys)function of molecular systems in situ.
In this chapter the basic methods of toponome analysis are described, which have already been shown to provide new insight into the operations and functions of large molecular systems inside cells and tissues with their predictive power in the field of diagnostics and clinical medicine.

43.1 Antibody Based Toponome Analysis using Imaging Cycler Microscopy (ICM) Walter Schubert1,2 1 Molecular Pattern Recognition Research Group, University of Magdeburg, Leipziger Straße 44, 39120 Magdeburg, Germany 2 Human Toponome Project, TopoNomos Ltd., Margaretenstraße 20, 81373 Munich, Germany

Considering the definition of the term toponome, we anticipate that most cellular functionalities are based on the interaction of proteins and other molecular components of the cell. These

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.

Toponome By the term toponome we understand the spatial arrangement of the molecular networks in a cell or a tissue, formally described by a colocalisation and anti-colocalisation code.

Toponomics Analysis of the modes and rules of the quantitative combinatorial arrangements of the molecular components of a biological structure or system under defined conditions.

Biological systems Biological systems imaged in toponomics possess emergent (also called relational or systemic) properties that are possessed only by the system as a whole and not by any isolated part of the system.


Figure 43.1 Scheme: toponome reading process with ICM leading to the spatial decoding of molecular networks of a cell or a tissue. Using pattern recognition, topological in situ mapping, or experimental and/or clinical analysis (clinical trial), the rules of the toponome “grammar” are found and then integrated as functional knowledge into the next reading processes (progressive decoding of the spatial molecular networks in health and disease – a pattern-cognition-like procedure).

Imaging cycler microscope (ICM) The fundamental automated labeling principle overcoming the spectral fluorescence resolution limit to detect an arbitrary number of molecular components in the identical, morphologically intact biological structure (subcellular structure, cell, or tissue) in a single experiment. Stages of development included multi-epitope-ligand cartography (MELK) and the toponome imaging system (TIS).

interactions can involve more or less strong direct physical crosstalk of proteins or can be based on indirect crosstalk (e.g., by means of diffusible molecules). In any case, every element of such interactions (protein or other molecular component) must be present at the right time point, at the right concentration, and at the right location inside a cell or on an extracellular matrix so that a concrete molecular network can be formed and be operationally active. Hence, molecular networks are characterized by a specific spatial context of their single elements: every molecular network is based on a highly non-random topology of its molecular elements. As with every system, large molecular systems in vivo have intrinsic relational (emergent) properties, which are possessed only by the system as a whole and not by any of its isolated parts. In other words, the topology of these systems is the direct expression of the molecular compartmentalization rules of the cell or the tissue. Toponomics aims to analyze and map these co-compartmentalization and topological association rules of molecular components, as well as their functional network architecture, in morphologically intact fixed cells or tissue sections. For this purpose the fully automated imaging cycler technology is applied: with the aid of large tag libraries (mostly antibody libraries), a large number of molecular components is co-localized and visualized in the direct context of cell functions. These data serve to construct functional toponome maps. Progressively, such toponome maps reveal the modes and rules of functional constellations of the molecular components of cells and tissues, which are assembled within a so-called toponome grammar (Figure 43.1).
While the basic principles of this technology were developed in the late 1980s, only the development of highly automated imaging cycler microscopes and powerful software in the last 20 years has opened up the possibility of systematically establishing the research field of toponomics.
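Conceptually, an imaging cycler run reduces each pixel to a binary presence/absence code over the tag library; the distinct codes correspond to combinatorial molecular phenotypes (CMPs), whose frequencies and locations can then be mapped. The following toy sketch, with an invented three-tag library and four pixels, illustrates that tabulation step only; real data sets comprise dozens of tags and millions of pixels.

```python
# Sketch of the combinatorial readout of imaging cycler data: each pixel
# carries a binary presence/absence code over the tag library, and the
# distinct codes define the combinatorial molecular phenotypes (CMPs).
from collections import Counter

tags = ("CD4", "CD8", "CD45")   # hypothetical tag library
pixels = [                      # per-pixel binarized signals (toy data)
    (1, 0, 1),
    (1, 0, 1),
    (0, 1, 1),
    (0, 0, 0),
]

cmp_frequencies = Counter(pixels)   # CMP -> number of pixels expressing it
for code, count in sorted(cmp_frequencies.items(), reverse=True):
    present = [t for t, bit in zip(tags, code) if bit]
    print(code, count, "pixel(s):", "+".join(present) or "background")
```

With n tags there are in principle 2ⁿ possible CMPs, which is why imaging cycler technology, rather than conventional multicolor fluorescence, is needed to read such codes in a single intact specimen.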

43.1.1 Concept of the Protein Toponome

Toponome map Structural representation of toponome modes (CMPs, CMP groups, CMP motifs) of biological structures (cell or tissue), which are specifically expressed in certain biological regions (e.g., a single data point – pixel or voxel – a subcellular compartment or partial compartment, a whole cell, or a whole tissue compartment).

Given that most proteins inside and on the surface of cells are organized as networks exerting the different cellular functionalities, we can also assume that these proteins are not stochastically distributed but are highly organized in both time and space. This is analogous to written language, in which letters are assembled into words and sentences following syntactic and semantic rules; likewise, the proteins, organized as networks, are topologically precisely determined in every cell. The cell itself thereby functions as a protein pattern formation apparatus, or can be seen as a protein co-compartmentalization machine. As described in the introduction, most of the molecular components of a cell or a tissue, including proteins, must be at the right place at the right concentration in the cell to interact with other proteins following similar rules. It follows that every given cellular functionality can be detected as a specific contextual protein pattern directly in the corresponding biological structure and, further, that blocking the molecular component(s) hierarchically and topologically controlling this entire

43 Toponome Analysis


structure will result in disassembly of that structure and loss of the corresponding (dys)functionality. One important task is to find, by toponomic analyses, the most relevant biomolecule controlling the overall structure of the corresponding molecular system in a disease, block it therapeutically, and thereby successfully interfere with the disease. The entirety, or fractions, of all protein networks is termed the toponome – a purely descriptive term derived from the ancient Greek nouns topos (τόπος, place) and nomos (νόμος, law). The term simply states that the natural in situ protein network code is topologically determined and, as such, represents a spatially determined functional code exerting the cellular functionalities. The corresponding spatial data sets are amenable to quantitative mathematical analyses and combinatorial geometric measures (e.g., for functional genomics), with the interesting perspective that, technically, a direct alignment of next-generation genome sequencing and toponome mapping is in the pipeline. To identify protein networks directly in a given biological structure (e.g., inside a single cell), a microscopic "reading technology" is needed, which must fulfil the following conditions (as provided by imaging cycler technology, below):

- A large, quasi-random number of distinct molecular components (e.g., proteins) must be co-mappable independently of each other inside the same structure in a single experiment.
- The resulting combinatorial molecular patterns that specifically characterize the corresponding cell type, cell function (or dysfunction), or tissue must be visualizable as a whole, and also in selected fractions of the whole, by applying an appropriate interrogation algorithm.
- The biological meaning of these patterns need not be understood immediately, but can serve as an important specific toponome feature (e.g., of a disease process), and should be functionally interrogatable by aligning a newly found pattern with a memory store that contains the modes and rules of topological protein associations of the cell (toponome grammar) found in preceding toponome mappings (Figure 43.1).

The driving force is a vision of the complete grammar of the protein networks of the cells in human, animal, or plant tissues. While the imaging cycler technology fulfils the criteria and principles of co-localization and anti-colocalization of an extremely large number of different molecular components (Figures 43.2–43.4) and is capable of directly visualizing these co-mapped patterns inside or on the surface of a given cell (Figures 43.4e and 43.2) or in a tissue section (Figures 43.3 and 43.4a), the future challenge is to gain insight into the precise spatial coding function of protein networks on a proteome-wide scale to discover all essential topological protein associations. For this purpose, toponome mapping and biological experiments must be linked. This has been shown clinically for amyotrophic lateral sclerosis (ALS) and experimentally for tumor cells. In both cases it was shown that blocking the hierarchical lead proteins of large protein networks leads to a halt in disease progression and to loss of the ability to migrate, respectively. Hence, in both cases disease robustness networks were found that could be broken by inhibiting the lead proteins. Basic cancer-specific lead-protein-connected toponomes were found in prostate cancer and in colon cancer.

43.1.2 Imaging Cycler Robots: Fundament of a Toponome Reading Technology

Toponome reading requires highly automated procedures. It was established as a system of so-called imaging cycler workstations (for their setup and operation see www.huto.toposnomos.com). These robots consist of a multi-pipetting unit, a fluorescence microscope, and a CCD camera. The robots can be fully programmed to incubate large tag libraries (e.g., antibodies) on fixed cells and tissue sections to co-localize their molecular components in one single experiment. A large series of experiments over 20 years established the following as the most stable procedure: each tag is conjugated to the same fluorochrome (e.g., FITC, fluorescein isothiocyanate) and is applied and incubated sequentially on the biological sample, which is placed on the object table of the robot microscope. Every labeling step is followed by an imaging step and then by a bleaching step, so that a large number of iterative labeling–imaging–bleaching rounds can be performed to overcome the spectral resolution limit in fluorescence microscopy. The result of this cyclical process is a two- or three-dimensional combinatorial protein fingerprint at
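As a minimal sketch, the iterative labeling–imaging–bleaching procedure can be written as a simple loop. The Robot and Camera classes and the tag names below are hypothetical placeholders for illustration, not a real instrument interface:

```python
# Minimal sketch of the imaging cycler loop described above. Robot and
# Camera are illustrative stand-ins for an instrument interface; the tag
# names are arbitrary examples.
class Robot:
    def incubate(self, tag): pass   # apply the fluorochrome-conjugated tag
    def wash(self): pass            # remove unbound tag
    def bleach(self): pass          # soft-bleach so the next cycle starts dark

class Camera:
    def __init__(self):
        self.frame = 0
    def acquire(self):              # placeholder for recording one fluorescence image
        self.frame += 1
        return f"image_{self.frame}"

def run_imaging_cycles(robot, camera, tag_library):
    """One labeling-imaging-bleaching round per tag; all tags carry the same
    fluorochrome (e.g., FITC), so the channels are separated in time."""
    images = []
    for tag in tag_library:
        robot.incubate(tag)
        robot.wash()
        images.append(camera.acquire())  # tag-specific fluorescence image
        robot.bleach()
    return images

stack = run_imaging_cycles(Robot(), Camera(), ["tag_A", "tag_B", "tag_C"])
print(len(stack))  # one aligned image per co-mapped component: 3
```

The returned stack of aligned single-tag images is the raw material for per-pixel combinatorial protein fingerprints.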

Chemical Modifications of Proteins, FITC, Section 6.2.1


Part V: Functional and Systems Analytics

Figure 43.2 Schematic illustration of the toponome theory, confirmed experimentally and clinically (Schubert, W. et al. (2006) Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. Nat. Biotechnol., 24 (10), 1270–1278. Copyright © 2006, Rights Managed by Nature Publishing Group; Schubert, W. (2015) Advances in toponomics drug discovery: imaging cycler microscopy (ICM) correctly predicts a therapy method of amyotrophic lateral sclerosis (ALS). Cytometry Part A, 87A, 696–703. doi:10.1002/cyto.a.22671). (a) Distinct spatial protein toponome patterns in situ specifically designate different cells (right-hand side). By extraction of the corresponding proteins and their annotation as ex vivo protein profiles (left-hand side), the natural in situ/in vivo differences between the cellular states "normal" and "abnormal" can no longer be detected, because the concentration of these proteins is identical in the two cells. This can be the source of fundamental errors in interpretation and downstream conclusions (e.g., in therapeutic management). (b) Two cells express distinct selectivities on their cell surfaces by differentially combining the identical proteins as higher order units. These different local protein constellations (1–4) can be: (i) a simple co-existence of these proteins without any direct cross interaction, (ii) a non-direct interaction via other, not co-localized biomolecules, or (iii) a direct physical interaction with weak or strong binding forces. Together, and independently of these alternative functional constellations, all four distinct local constellations (1–4) are clear-cut multimolecular domains (MMDs), which – in the present case – encipher high selectivity for cell–cell or cell–matrix interaction and – most importantly – can couple these differential MMD information units with corresponding MMD-specific activation of intracellular mechanisms.
Hence, these four domains “encipher” specific supramolecular functionalities (exclusion principle), which together form the ground structure of a functional molecular network. By using imaging cycler microscopes (ICMs), such networks are detectable as specific toponome patterns. ICM generates multi-dimensional vectors of these different protein ensembles that can be collected as toponome lists of different CMPs (combinatorial molecular phenotypes) (b, bottom). Every CMP is a geometric object (one or several data points in x/y or x/y/z). All CMPs together reveal a toponome map of a cell or a tissue. The different CMPs (b, bottom) express both distinct and common features in the two cells, which together reveal distinct CMP motifs. These motifs represent the higher order feature of distinct functional codings of the cell surface.

each subcellular data point in a corresponding cell or tissue sample (scheme in Figure 43.5; biological and clinical examples in Figures 43.3 and 43.4e, respectively). The resulting high-dimensional context of combinatorial protein information, which would be lost in cell homogenates or lists of proteins, supports the conclusion that protein networks are spatially determined functional units in every cell (different cell functions = different protein networks = unique combinatorial cellular protein pattern). The scheme in Figure 43.2 shows that single normal and abnormal cells are characterized by specific relative locations of the single proteins as a cellular fingerprint, which specifically characterizes these cells. Sometimes these features have a periodic order. If, however, these cells are homogenized, the resulting quantitative profiles of the corresponding proteins no longer show any difference (Figure 43.2a, left-hand side). These latter ex vivo profiles (without their corresponding spatial combinatorial topologies) might prompt us to conclude that the identified proteins have no relevant function in the corresponding disease, because their abundance is not changed. However, the co-mapping of the same proteins leads to the contrary conclusion (Figure 43.2a, right-hand side). Such constellations have been shown and published for tumor biology and singled out in several comprehensive reviews. The current observations made with ICM (imaging cycler microscopy) support the working hypothesis that every cell contains a quasi-infinite "toponome space" to spatially encipher the myriad of different cellular functionalities. The combinatorial patterns generated by cells in vivo on a large scale are, as expected, highly restrictive and non-random. From a purely mathematical point of view, the gain of information using the ICM technology, as compared to other methods, is significant. For example, given the toponome acquisition shown in Figure 43.3, the co-mapping of 100 biomolecules in 1 million data points allows for the simultaneous discrimination of any biologically expressed combination of the 100 biomolecules per data point, when a non-threshold-based approach termed similarity mapping is applied to the 100 cycles. Hence the power of combinatorial molecular discrimination (PCMD) per data point is 256^100 (using an 8-bit CCD camera) or 65 536^100 (using a 16-bit CCD camera). Compared to ex vivo protein profiling, the gain of information in this example (using the 16-bit CCD) is 65 536^100 × n million data points per image. The information content hence increases exponentially using ICM technology. If the number of co-mapped biomolecules exceeds 100, the information content increases further and determines corresponding "reading frames." A specific example is shown in Figure 43.3a–d, illustrating the dermo-epithelial junction in a histological skin tissue section analyzed by 100-component molecular profiling: the three layers of the basal lamina, hitherto only seen in electron microscopic images (Figure 43.3c), can be clearly distinguished by ICM (Figure 43.3a), but not by traditional three-parameter fluorescence (Figure 43.3b). By using deconvolution algorithms it is possible to resolve many distinct optical levels with high voxel resolution, so that all three-dimensional data points together reveal the three-dimensional reality of protein network assemblies of many multiprotein complexes at a time (Figure 43.3e–g). These measurements can be extended considerably by parallel measurement of cell or tissue arrays. Multiple ICM robots can cooperate strategically to measure very large numbers of samples as a high-throughput/high-content screening procedure.
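The quoted discrimination figures can be checked with simple arithmetic. This is only a numerical illustration of the 256^100 and 65 536^100 values from the text:

```python
import math

# Back-of-the-envelope check of the PCMD figures for 100 co-mapped
# biomolecules, as quoted above.
n = 100
pcmd_8bit = 256 ** n          # 8-bit CCD: 256 grey levels per protein channel
pcmd_16bit = 65_536 ** n      # 16-bit CCD: 65 536 grey levels per channel

# The numbers themselves are astronomically large; counting their decimal
# digits illustrates the exponential growth of information content with the
# number of co-mapped biomolecules.
digits_8bit = math.floor(n * math.log10(256)) + 1
digits_16bit = math.floor(n * math.log10(65_536)) + 1
print(digits_8bit, digits_16bit)   # 241 482
```

Doubling the camera bit depth doubles the number of digits per data point; adding one more co-mapped biomolecule multiplies the PCMD by another factor of 256 or 65 536.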
By using optical tools (e.g., navigation algorithms) a complete protein network in every cell can be visualized and explored (walking through large toponome fractions with subcellular resolution). As substantiated by power law, the supramolecular order of


Figure 43.3 Functional super-resolution of large molecular networks. (a)–(d) Dermo-epithelial junction in human tissue: imaging cycler microscopy based discovery of molecular networks in situ. (a) Direct real-time protein profiling in a 100-dimensional ICM data set using an algorithm based on the similarity mapping approach (Dress, A.W.M., Lokot, T., Pustyl'nikov, L.D., and Schubert, W. (2005) Poisson numbers and Poisson distributions in subset surprisology. Ann. Comb., 8 (4), 473–485. Copyright © 2004, Birkhäuser Verlag, Basel; Dress, A.W.M., Lokot, T., Schubert, W., and Serocka, P. (2008) Two theorems about similarity maps. Ann. Comb., 12 (3), 279–290. Copyright © 2008, Birkhäuser; Schubert, W., Gieseler, A., Krusche, A., Serocka, P., and Hillert, R. (2012) Next-generation biomarkers based on 100-parameter functional super-resolution microscopy TIS. N. Biotechnol., 29 (5), 599–610. Copyright © 2011 Elsevier B.V. All rights reserved). Each data point has a power of combinatorial molecular discrimination (PCMD) of 256^100. Note the sharp images at the junctional area discriminating between the lamina fibroreticularis (LF), lamina densa (LD, green profile in (d)), lamina lucida (LL), and the basal keratinocyte layer (BC, red profile in (d)), as known from transmission electron microscopy (c). (b) Same area as in (a), displayed by traditional triple fluorescence imaging. (e) Three-dimensional ICM imaging of distinct 32-component multiprotein complexes on the cell surface of a blood T-lymphocyte (Friedenberger, M., Bode, M., Krusche, A., and Schubert, W. (2007) Fluorescence detection of protein clusters in individual cells and tissue sections by using toponome imaging system: sample preparation and measuring procedures. Nat. Protoc., 2 (9), 2285–2294. Copyright © 2007, Rights Managed by Nature Publishing Group). Multiprotein complexes are composed of differential combinations of the 32 proteins/glycotopes listed in (f). (g) Examples are marked with asterisks (numbers 1–3) and detailed as combinatorial molecular phenotypes (CMPs), with proteins present (1) or absent (0), together characterized as individual CMPs. Bars: 10 μm (a, b), 50 nm (c), and 1 μm (e).



Figure 43.4 Discovery of disease-specific 100-dimensional protein profiles, simultaneously and in real time, in morphologically intact tissue sections at a PCMD of 256^100 per pixel, exemplified in human skin. (a) List of 100 co-mapped biomolecules and selected protein profiles (PPPs: 0, 1, 3, 32–35) specific for diseased (d, f) and normal skin (e, g). (b, c) Diseased (b) and normal skin (c) are highlighted by pseudo-coloring as a histological stain for morphological orientation. Note that, by moving the cursor over the pixels, the software recognizes directly, in real time, which protein profiles are specific for the diseased (d, f) or normal skin (e, g). For example, pixel protein profiles (PPPs) with numbers 0 and 1 are specific for the normal skin (e), and PPP 3, as well as PPPs 32–35 (d, f, respectively), are specific for the diseased skin. For many similar applications in real time see the webpage of the human toponome (HUTO) project (www.huto.toposnomos.com). (h) Power law (Zipf's law) substantiates highly organized protein systems, as seen in (d)–(g). If 49 molecules are co-mapped, Zipf's law applies (blue line), but it does not apply if fewer than 15 molecules are co-mapped (red and green lines). This is revealed by plotting the log–log relationship of thousands of distinct protein assemblies in toponome data sets (Schubert, W. et al. (2006) Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. Nat. Biotechnol., 24 (10), 1270–1278. Copyright © 2006, Rights Managed by Nature Publishing Group). Bar: 100 μm (b)–(g).

molecular systems is revealed by co-mapping more than 20 molecules, but not by co-mapping lower numbers (Figure 43.4h). Since finding the critical lead proteins that control molecular networks depends on the discovery of supramolecular order (see below), this fact is essential for any drug target discovery study. A topologically determined ensemble of proteins that exerts a given cell function is defined as a protein network motif (combinatorial molecular phenotype motif; CMP motif). In addition, primary ICM data sets always contain the precise fluorescence intensity at every given data point for every single co-mapped molecular component. In routine work, an effective procedure has been to map every single molecular component relative to a threshold intensity, as present or absent (present = 1; absent = 0; 1 bit), for CMP detection. The primary image information, consisting of distinct grey value distributions for every protein, can then be described as a relatively simple geometric structure by using combinatorial binary vectors in X/Y/Z (3D) or X/Y (2D) (Figure 43.2b). Each CMP represents the corresponding topologically determined arrangement of proteins (or of other molecular components) in every single compartment of the cell. These vectors can be directly transformed into geometric objects and into a functional map of the whole cell. Such maps are termed toponome maps. Figure 43.2b illustrates schematically the CMP lists of two cells. Distinct CMPs can have one (or several) protein(s) in common (= 1; so-called lead proteins); one or several proteins can be absent (= 0 = anti-colocated); and some proteins can be variably present together with the lead protein (= ∗ = wild-card proteins). These common features of different CMPs are called the CMP motif. As revealed by direct comparison of the cells in Figure 43.2b, these cells use the same proteins to generate distinct CMP motifs. Following the toponome theory, the cells thereby encipher distinct selectivities of the cell surface that can play important differential roles in cell–cell or cell–matrix interactions, or within cells encipher a complex directed action, such as driving a tumor cell from a spherical state into an exploratory state preceding migration. A toponome mapping example of three-dimensional multiprotein complexes on the cell surface of a human blood T cell (3D CMPs), corresponding to the scheme in Figure 43.2, is shown in Figure 43.4e. Similarly, non-threshold-based real-time 100-component biomolecular profiling within an intact tissue section across the human skin reveals the cell-surface-coupled 100-dimensional supermolecule linking the cell surface of the basal keratinocytes with the extracellular matrix (Figure 43.4, PPP 3) in a disease (psoriasis) and in the non-involved skin (PPP 0) of the same patient. The remarkable point here is that the disease-specific supermolecule is directly visualized in context with the normal case (same patient, same genome), and serves as proof of principle for such approaches. These examples show that toponomics provides direct access to the functional organization of large molecular systems in situ.
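The threshold-based CMP construction described above (present = 1, absent = 0 per protein per data point) can be sketched as follows. The grey-value stack and the uniform threshold are synthetic assumptions, not real ICM data:

```python
import numpy as np

# Sketch of threshold-based CMP detection: each co-mapped protein image is
# binarized (present = 1, absent = 0), and the per-pixel bit vector across
# all proteins is that pixel's CMP. Data and thresholds are synthetic.
rng = np.random.default_rng(2)
n_proteins, h, w = 4, 3, 3
stack = rng.random((n_proteins, h, w))    # one grey-value image per protein
thresholds = np.full(n_proteins, 0.5)     # one intensity threshold per protein

binary = (stack > thresholds[:, None, None]).astype(np.uint8)
cmps = binary.reshape(n_proteins, -1).T   # rows: pixels, columns: proteins

# Each row is a combinatorial binary vector, e.g. [1, 0, 1, 1]; the distinct
# rows are the distinct CMPs occurring in this toy image.
unique_cmps = np.unique(cmps, axis=0)
print(cmps.shape)   # (9, 4): 9 pixels, one 4-bit CMP each
```

Mapping each distinct row to a color and painting it back onto the pixel grid would yield a (toy) toponome map in the sense used above.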

CMP Combinatorial molecular phenotype.

CMP motif Annotation (1; 0; ∗) describing a characteristic feature of a cluster of distinct CMPs: lead protein (1; = L = a molecule that is present in all CMPs of a cluster); a molecule that is always absent in the motif (0; = A = anti-colocated); molecules present in at least one, but not in all, CMPs (∗; = W = wild card, variably associated with L).
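The 1/0/∗ annotation can be derived mechanically from a cluster of CMPs. The cluster below is an invented toy example following the definition above:

```python
# Derive the CMP motif (lead / absent / wildcard) of a CMP cluster,
# following the L/A/* convention defined above. The cluster is a toy example.
def cmp_motif(cmp_cluster):
    """For each protein position: 'L' if 1 in all CMPs (lead protein),
    'A' if 0 in all CMPs (anti-colocated), '*' otherwise (wildcard)."""
    motif = []
    for values in zip(*cmp_cluster):     # iterate over protein positions
        if all(v == 1 for v in values):
            motif.append("L")
        elif all(v == 0 for v in values):
            motif.append("A")
        else:
            motif.append("*")
    return motif

cluster = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
]
print(cmp_motif(cluster))  # ['L', 'A', '*', '*']
```

Here protein 1 is the lead protein (present in every CMP of the cluster), protein 2 is anti-colocated, and proteins 3 and 4 are wildcards.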

43.1.3 Summary and Outlook

Toponomics is a non-destructive technology essential for direct access to natural large molecular systems in tissues and cells, thereby introducing a "molecular systems histology" and "molecular systems cell biology." The ICM technology for toponome mapping is mature and has been established within the human toponome project in Europe and in the USA (www.huto.toposnomos.com). Cyclical imaging, the principle of ICM toponomics, has been applied to address various biological problems, and mathematics/informatics approaches to ICM data have been published. Several editorials and a research highlight have featured toponome research. Toponomics reveals the experimentally and clinically validated fact that natural in situ molecular systems have their own topological rules that cannot be derived from the isolated parts: the systems' structure and function cannot be predicted from their isolated parts, because the relational (emergent) properties possessed by the system as a whole are not possessed by any of its isolated components. The toponomics discovery of disease-specific hierarchies within large in situ molecular systems controlled by lead protein(s), and the successful therapeutic targeting of such proteins resulting in positive clinical effects, has paved the way to a systematic, large-scale toponomic search for the right lead proteins across major diseases. The driving force is the finding that therapeutic downregulation of a disease-specific lead protein leads not only to disassembly of the disease-driving molecular robustness network, as shown for tumor cells, but also, as far as is known, to a halt of disease progression in many patients, as shown for neurologic disease.
With both the experimental and clinical validation of toponomics and novel, highly efficient affinity reagents, such as recombinant human antibodies recognizing multiple protein domains per protein, mapping thousands of proteins in one toponome experiment is now feasible and will be established as an efficient target-finding strategy and as a means of decoding disease mechanisms in situ.

Acknowledgements This research was supported by the Klaus Tschira Foundation (KTS) through projects Toponome Atlas I and II, the Deutsche Forschungsgemeinschaft (DFG Schu627/10-1), the BMBF (grants CELLECT, NBL3, NGFN2, and NGFNplus), the DFG-Innovationskolleg (INK15) as well as the EU project IMAGINT (Health-F5-2011-259881), and the human toponome project (www.huto.toposnomos.com). I thank ToposNomos Ltd. for providing access to the Imaging Cycler® reference laboratory and Andreas Krusche and Reyk Hillert for helping with the figures and formatting the manuscript.

PPP Pixel protein profiling. As an extension of threshold-based CMPs (1 bit per protein), direct in situ protein profiling using an ICM similarity mapping approach represents the realization of large-scale in situ proteomics with a power of combinatorial molecular discrimination (PCMD) per data point of up to 65 536^k, where 65 536 is the number of grey value levels (output of a 16-bit CCD camera) and k is the number of co-mapped biomolecules and/or subdomains per biomolecule.



43.2 Mass Spectrometry Imaging

Bernhard Spengler, Justus Liebig University at Giessen, Heinrich-Buff-Ring 17, 35392 Giessen, Germany

43.2.1 Analytical Microprobes

Figure 43.5 Simple scheme of the cellular organization as the fundament for toponome mapping by ICM: protein "spaces" (angled symbols) and aqueous "spaces" (free spaces, aqueous compartments) are highly conserved in a cell (the principle of the city of Venice, with houses and water channels). After gentle fixation of this state using appropriate methods, which at the same time make it possible for macromolecules to diffuse across cellular membranes, a quasi-random, large number of distinct fluorochrome-conjugated (e.g., FITC) antibodies (or any specific affinity reagents) can be applied for incubation on the sample sequentially (repetitive incubation–imaging–bleaching cycles) – a procedure that is called "Venicing" in laboratory jargon. After diffusion across the aqueous compartments, the dye-conjugated antibodies recognize and bind to their cognate epitopes. Procedure (e.g., 1–4 in the figure): a fluorochrome-conjugated antibody is incubated to bind the corresponding epitope, then the corresponding specific fluorescence signals are registered, followed by soft bleaching of the signal at low, sunlight-like natural stray-light energy (avoiding lasers): the sunlight hypothesis of ICM. The Venice principle and the sunlight hypothesis are implemented as natural biophysical conditions in an ICM to enable large-scale biomolecule in vivo/in situ profiling without interfering with the tissue and biomolecule structures.

Mass Spectrometry, Chapter 15

Methods in high-throughput proteomics and metabolomics have demonstrated the power and efficiency of fast and accurate detection and identification of compounds in complex samples. The vast amount of data generated with this strategy, however, has also revealed that the bulk molecular composition of a sample is unable to give final and convincing answers to striking questions about cellular processes. Besides qualitative and quantitative analysis, the localization of specific compounds and markers in organs, tissues, and cells has been found to be of paramount importance for understanding natural states and processes of life. Such compounds, being indicators of a specific biochemical or physiological context, might be inorganic ions (e.g., Ca2+, Fe2+), trace elements (e.g., selenium, cobalt), pharmacologically active ingredients, lipids, oligosaccharides, peptides, proteins, or metabolites of biochemical reaction chains. The spatial localization of such compounds or "targets" can be attributed to "toponomics." The toponome describes the set of spatial concentration distributions of members of a certain class of substances (e.g., all proteins in a biological cell). In a more specific definition, a toponome represents the visualization of an interaction or co-localization of two or more substances in, for example, heterogeneous protein complexes, and the principles of their appearance (Section 43.2.6). The goal of describing biochemical processes not only qualitatively and quantitatively but also toponomically is not new. Fluorescence microscopy combined with immunochemical labeling, for example, is a well-established technique for visualizing physiological situations. It is an intrinsic disadvantage of such labeling-based imaging techniques that the target molecules have to be known and selected prior to analysis, to allow an appropriate and selective marker (such as an antibody) to be defined.
The laborious and restricting labeling step can be avoided if the chemical composition of a targeted biological sample region can be determined without specific sample preparation steps and without substance-specific knowledge. This, in contrast to labeling-based imaging techniques, is possible by using mass spectrometry as the source of analytical information. Soon after the introduction of high-power laser systems, the first attempts at spot-wise analysis of biological material by mass spectrometric laser-induced desorption of ions were reported in the 1970s, using ultraviolet, pulsed laser radiation. This technique, called laser microprobe mass analysis (LAMMA), is still being used in a number of laboratories, for example, for the detection of trace elements in biological samples. Similarly, secondary ion mass spectrometry (SIMS) was able quite early on to detect atoms and small molecules on technical or biological surfaces. The two methods have a lateral resolution in the range of about 1 μm, but initially they were not able to raster samples and visualize concentration images; they provided only punctual information about samples. Only with the introduction of the so-called ion imaging mode of SIMS did concentration images of elements and small molecules become available. In the field of mass spectrometric detection of intact biomolecules, the development of matrix-assisted laser desorption/ionization (MALDI) by Franz Hillenkamp and Michael Karas and of electrospray ionization (ESI) by John Fenn in the 1980s directed the focus towards bulk analysis of large molecules, such as proteins or protein complexes. Then, in the 1990s, interest in a direct analytical visualization of native samples returned. The proof of principle of MALDI mass spectrometry imaging of biomolecules was first described in 1994 by Spengler et al. for high-lateral-resolution imaging of peptide standards.
In the following years, numerous applications of this principle were reported (Figure 43.6).

43.2.2 Mass Spectrometric Pixel Images

The generation of microanalytical pixel images of biological samples requires several computation-intensive processing steps. The minimal procedure includes the following (Figure 43.7): a highly focused, pulsed laser beam (or primary ion beam, in the case of SIMS) is directed to an exactly positioned sample spot. This results in the formation and desorption of ions, which are



Figure 43.6 Highly resolved visualization of the spatial distribution of a peptide within a standard MALDI preparation (a); distribution of salt impurities (b); distribution of the matrix 2,5-dihydroxybenzoic acid (c). The figure demonstrates the exclusion principle of ion formation, as peptide ions are formed only from micro-areas that are free from alkali ions. Courtesy of B. Spengler, Giessen.

then mass analyzed, and their signals stored as a mass spectrum for this sample spot. Spot diameters are typically in the micrometer range, pulse durations in the nanosecond range. The focus diameter of the laser beam (which results in an ablation spot of similar size) is typically taken as the step size of the sample manipulation stage and thus as the pixel size of the resulting image. The step size can also be larger or smaller than the focus diameter, in order to create an overview image with a lower lateral resolution, or to enhance the achievable lateral resolution by so-called oversampling data acquisition, with step sizes smaller than the ablation spot. The next step after acquisition of a spot-related mass spectrum is to move the sample stage to the next position. In SIMS, it is not the sample that is moved; instead, the ion beam is scanned across the sample surface. When the complete area has been scanned and mass analyzed, the resulting two-dimensional data set can be searched for useful mass signals that represent sample components and are suitable for visualization as an image. All these signals can now be transformed into gray-scale images, with signal intensities coded as gray levels between black and white for each pixel of the image. Color coding can be used to visualize a broader dynamic range, since the human eye can differentiate more colors than gray values. Colors can also code different components in one image. Co-localization of compounds, differential display of concentration distributions, or complex correlations can be displayed with colors. Red-green-blue (RGB) images are used in this context to visualize three selected compound distributions in one image without information loss (Figure 43.8). With all visualization procedures it is especially important for a powerful imaging software to always retain the original data and to provide easy access to the raw data, correlated with the final visual representation.
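The raster-scan procedure just outlined (one mass spectrum per spot, then extraction of selected mass signals into a gray-scale image) can be sketched as follows. The synthetic spectra and the window-summing extraction are illustrative assumptions, not instrument code:

```python
import numpy as np

# Sketch of the minimal MSI pipeline: one mass spectrum per raster position,
# then extraction of a chosen m/z signal into a gray-scale ion image.
# Synthetic random spectra stand in for real instrument data.
rng = np.random.default_rng(0)
nx, ny, n_mz = 8, 8, 500              # raster grid and number of m/z bins
spectra = rng.random((nx, ny, n_mz))  # spectra[i, j] = spectrum at spot (i, j)

def ion_image(spectra, mz_index, half_window=2):
    """Sum intensities in a small m/z window for every pixel."""
    lo, hi = mz_index - half_window, mz_index + half_window + 1
    return spectra[:, :, lo:hi].sum(axis=2)

img = ion_image(spectra, mz_index=250)
# Scale to 8-bit gray levels (black = 0, white = 255) for display.
gray = np.uint8(255 * (img - img.min()) / (img.max() - img.min()))
print(gray.shape, gray.min(), gray.max())
```

The same extraction, repeated for several m/z values, yields the multiple component images that are later combined by color coding.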

43.2.3 Achievable Spatial Resolution

The spatial resolution that can be achieved with an imaging method is always a key parameter in toponomics. A lateral resolution of 25–50 μm can easily be obtained with MALDI instrumentation, without much technical effort. Reaching a cellular resolution down to 1 μm, however, is hindered by fundamental limitations resulting from the physical laws of optics and the properties of light. To distinguish between low-resolution and high-resolution MALDI imaging techniques, the term "scanning microprobe MALDI (SMALDI)" was introduced for the high-resolution method. Differences between the two are found not only in the analytical goal (regarding the size of the structures to be resolved) but especially in the underlying mechanisms of material evaporation and ionization. The usability of MALDI for desorption, ionization, and imaging of biomolecules in the laser focus range of a micrometer was first demonstrated in 1994 (Figure 43.6). As with normal MALDI mass spectrometry of biomolecules, it is necessary for SMALDI that matrix and analyte mix in the liquid (solution) phase and that analyte molecules are embedded in the growing matrix crystals upon desolvation of the matrix droplets. (There are, however, certain substance classes of smaller biomolecules, such as phospholipids and carbohydrates, that do not necessarily require this embedding process, even if the limit of detection is improved by it.) Optimization of this embedding process in normal MALDI analyses is of interest primarily with respect to achievable signal intensity (i.e., limit of detection). On the micrometer scale, however, this process has a dramatic influence on the morphology and topology of the sample components. High-resolution images of dried-droplet MALDI samples clearly visualize these phenomena (Figure 43.6). The topologic distribution of

Figure 43.7 Generation of a microanalytical image from position-related mass spectrometric data.

MALDI-MS, Section 15.1.1

1066

Part V: Functional and Systems Analytics

Figure 43.8 Different ways of visually coding mass spectrometric position information. The spatial distribution of three components A, B, C can first be visualized as gray scale images ((a), middle column). Here, white codes high signal intensity and black low intensity. Up to three components can furthermore be coded in parallel in one RGB (red, green, blue) image without information loss, if each of the three components is assigned to one of the three native color channels ((a), right-hand side). The local coincidence of several components can, on the other hand, be visualized by logical operations on the signal values of a large number of components and displayed with a large number of chosen colors (b). In this case it is not the signal intensities of the individual components that are coded directly, but the presence of signal intensities above a certain threshold ((b), right). This transformation is lossy: it allows a large number of channels to be visualized in one image, but with reduced information content.
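The threshold-based, lossy coding in part (b) of Figure 43.8 amounts to recording, per pixel, which components exceed an intensity threshold. A minimal sketch (the bit-mask representation of the per-pixel component combination is our own illustration, not a prescribed format):

```python
import numpy as np

def presence_code(channels, thresholds):
    """Lossy combinatorial coding: each pixel receives an integer whose
    bit i records whether component i exceeds its intensity threshold.
    The resulting code values can then be mapped to arbitrary display colors."""
    code = np.zeros(np.asarray(channels[0]).shape, dtype=np.int64)
    for bit, (img, thr) in enumerate(zip(channels, thresholds)):
        # add 2**bit wherever component `bit` is present above threshold
        code += (np.asarray(img) > thr).astype(np.int64) * (1 << bit)
    return code
```

Pixels with the same code value share the same combination of present components, so one color per code value visualizes co-localization patterns across many channels at once.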

the peptide substance P in the dried-droplet MALDI sample reveals its predominant localization within the crystals of the matrix 2,5-dihydroxybenzoic acid (DHB). In contrast, alkali ions, which are present in the sample as impurities, are found outside these crystals and not within them. Highly polar or ionic substances are known to hinder the detection of peptide ions in MALDI-MS in general. Matrix preparation can thus be understood as a micro-separation step that cleans the microregions of the individual matrix crystals of interfering ionic substances. Desorption and ionization of peptides can subsequently take place from alkali-free microcrystal regions under highly focused laser irradiation. In low-resolution MALDI instruments, this effect of mutual exclusion cannot be observed with the same clarity, as the large laser focus always covers multiple matrix crystals and alkali-enrichment areas at the same time. In SMALDI-MS, on the other hand, the mutual exclusion of biomolecular ion signals and ionic impurity signals is very distinct and clearly observed.

Inclusion of analyte molecules in matrix crystals is a necessary prerequisite for sensitive detection of peptides and proteins. For the detection of such biomolecules from biological tissues or cells by MALDI or SMALDI mass spectrometry it is therefore essential to allow matrix and analyte to mix in solution, at least for a short while, and thus to allow doped matrix crystals to grow. This process, however, necessarily blurs originally localized analyte concentrations and, owing to segregation effects, creates artifactual concentration peaks of certain analytes (Figure 43.9). This situation presents a dilemma for the goal of highly sensitive and highly laterally resolved tissue analysis. It can be resolved only by a compromise, choosing acceptable conditions in terms of achievable sensitivity and lateral resolution.
Such a compromise has been found, leading to an effective lateral resolution of 3 μm at a sufficiently low limit of detection for biological tissue, by spraying or vapor-depositing matrix with dedicated protocols. There is also a search for alternative methods of mass spectrometry imaging that work without application of a dedicated matrix compound. One of these methods is infrared laser desorption/ionization mass spectrometry, which is currently under development.

43 Toponome Analysis

1067

Figure 43.9 SMALDI mass spectrometry imaging of a dried-droplet preparation of a peptide mixture of substance P, melittin, and insulin. The images show that analyte components are dislocated within the sample, as a result of the matrix application. The originally homogeneous solution of peptide components is turned into an inhomogeneous sample after matrix addition and crystallization. Such artifact formation has to be taken into account in any MALDI analysis.

43.2.4 SIMS, ME-SIMS, and Cluster SIMS Imaging: Enhancing the Mass Range

In contrast to MALDI, the detection sensitivity in secondary ion mass spectrometry is a clear function of analyte mass, leading to a loss of detectability for larger biomolecules. Recent developments, however, have demonstrated significant improvements. The application of a matrix to a tissue sample is one possible way to extend the mass range in SIMS, by achieving a MALDI-like desorption/ionization process. This method, called matrix-enhanced SIMS (ME-SIMS), has been used for mass spectrometry imaging, but is limited in spatial resolution by the same matrix-related effects as SMALDI imaging. Another improvement of SIMS came from the development of cluster ion sources, using C60 ions or large clusters of gold or argon. These primary ion sources do not reach the same high lateral resolution as classical SIMS sources, but at least allow desorption and ionization of small biomolecules with a lateral resolution in the low micrometer range. Still, SIMS methods are unable (in bulk or imaging mode) to detect intact larger biomolecules such as proteins.

43.2.5 Lateral Resolution and Analytical Limit of Detection

Matrix crystallization is only one of the mechanistic limitations on the achievable lateral resolution. Independent of the individual ionization process (i.e., for MALDI just as for SIMS), there is a fundamental restriction on further improving lateral resolution, based on the decreasing number of accessible analyte molecules with decreasing focus area. Assuming six billion peptide molecules in one monolayer under a focus diameter of 200 μm, this number is reduced to only about 200 000 molecules in a 1 μm spot. Depending on the instrument type, a mass spectrometer requires between 1000 and 100 000 analyte molecules in the sample to generate a useful mass spectrum. The limitation is eased to some extent by the fact that in mass spectrometry it is not only the absolute signal intensity that defines the quality of a mass spectrum but, even more so, the signal-to-noise ratio. Fortunately, the intensity of noise signals (especially chemical noise) also decreases with decreasing sampling area. Nevertheless, there remains a final limit of detection that cannot be overcome, owing to signal statistics at very low molecule numbers. One also has to take into account that, typically, it is not complete monolayers of a single analyte component that are analyzed but complex mixtures of a large number of different components. Low-abundance molecules in biological tissue or on cell surfaces are thus inherently problematic for highly resolved mass spectrometry imaging. For fundamental reasons, a useful lateral resolution significantly below 1 μm will therefore be achievable by MALDI or SIMS imaging only for abundant biomolecules.
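The scaling argument above follows directly from the ratio of spot areas: the available molecule count drops with the square of the focus diameter. A quick back-of-the-envelope check using the figures quoted in the text (the helper function is our own sketch):

```python
def molecules_in_spot(d_um, n_ref=6e9, d_ref_um=200.0):
    """Number of analyte molecules available in a monolayer under a laser spot,
    assuming the count scales with the spot area, i.e. with the diameter squared.
    Reference point: ~6e9 peptide molecules under a 200 um spot (see text)."""
    return n_ref * (d_um / d_ref_um) ** 2

# Pure area scaling gives ~1.5e5 molecules for a 1 um spot, the same order
# of magnitude as the ~2e5 quoted above -- near the 1e3..1e5 molecules an
# instrument needs for a useful spectrum.
```

This makes explicit why sub-micrometer imaging only works for abundant species: at 0.5 μm the count drops by another factor of four.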


43.2.6 Coarse Screening by MS Imaging

Figure 43.10 MALDI images of a glioblastoma tissue section, covered with sinapinic acid as matrix and scanned with a step size of 100 μm. The images indicate an accumulation of thymosin-β4 in the proliferating area of the tumor (d), while other proteins were found mostly in the ischemic and necrotic areas ((b) and (c)). Courtesy of R. Caprioli, Vanderbilt University.

Mass spectrometry imaging is currently still in a state of methodological evolution. Early published applications looked promising but mostly did not include proofs of validity, reproducibility, or biomedical information content of the analytical images. Mass spectrometric data revealed molecular topologies within tissue images, but the identity and homogeneity of the underlying substance signals could not be unveiled. Owing to these methodological and technological limitations, MALDI imaging and SIMS imaging initially remained purely qualitative and descriptive methods, unable to meet analytical-chemistry standards. Results were reported using MALDI in the low lateral-resolution range and using SIMS in the low mass range. Protein imaging on a qualitative and unvalidated level was reported at first, with a lateral resolution of about 100 μm. One early example describes the analytical imaging of a mouse brain (Figure 43.10). Variations in the lateral distributions were found for various expected proteins in the different brain regions. In the vicinity of a glioblastoma, a significant increase of a signal attributed to thymosin-β4 was found. Identification of the suspected signal could not be performed from the mass spectrometric imaging data, but was obtained by classical LC-MS/MS analysis of trypsin-digested tumor tissue in a parallel study. The identity of the proteins determined in the LC-MS/MS runs with those of the corresponding m/z signals in the imaging experiments was tacitly assumed. Bioanalytical applications at high lateral resolution in the micrometer or sub-micrometer range were mostly limited to SIMS methodology, in a rather low mass range. The high spatial resolution of secondary ion mass spectrometry allows us to visualize processes in the cell membrane, provided that the investigated analytes are detectable with sufficient sensitivity.
A remarkable example is the visualization of the mating of the protozoan Tetrahymena thermophila, which involves the conjugation of cell membranes for the exchange of genetic material. The zone of conjugation showed a significant reduction of phosphatidyl choline concentrations relative to the total concentration of phospholipids (Figure 43.11). This indicates a severe reordering of lipids in the conjugation zone. In this investigation, again, the identification of the signals as phosphatidyl choline fragments had to be performed by separate tandem mass spectrometric analysis. It was not possible to distinguish and characterize the imaged phosphatidyl choline species directly in the membranes, since only the head group of the phosphatidyl cholines was detected as a fragment and not the intact molecules with their isomeric fatty acids.

43.2.7 Accurate MALDI Mass Spectrometry Imaging

Figure 43.11 SIMS images of two protozoa during mating (conjugation). A decrease of phosphatidyl choline was found in the conjugation zone (c), while the total concentration of phospholipids stayed constant (b). The scheme (a) indicates the assumed distribution of phosphatidyl cholines (white circles) and phosphatidyl ethanolamines (black circles). Courtesy of A.G. Ewing, Pennsylvania State University.

Not only with SIMS but also with SMALDI mass spectrometry imaging, cellular and subcellular resolution is achievable. In contrast to SIMS, the mass range is not a prominent limitation of biomolecular analyses in SMALDI imaging. Cell cultures of a human renal cell carcinoma have been analyzed with a lateral pixel resolution of 1 μm, showing mass signals up to 15 000 u. Individual cells revealed a reasonable topological structure of analyte signals. A significant leap in the quality of mass spectrometry imaging was obtained by the recent combination of high lateral resolution, high mass resolution and accuracy, high imaging selectivity, on-tissue structure analysis by MS/MS imaging, and semi-physiological tissue analysis by atmospheric-pressure SMALDI. With that, mass spectrometry imaging was improved and extended into a technique of high validity and high specificity, providing molecular histology at a high informational level. SMALDI imaging on high-resolution mass spectrometers now allows tissue subtypes to be distinguished precisely at a lateral resolution of 2–3 μm. Label-free, non-targeted MALDI mass spectrometry imaging generates hundreds of characteristic tissue signals, extending the performance of bioanalytical tissue analysis considerably beyond that of histochemical, targeted (label-dependent) methods. Figure 43.12 shows an example of SMALDI imaging of mouse urinary bladder tissue, visualizing three selected tissue-specific signals, which clearly enhance the information obtained from histological staining.


Figure 43.12 SMALDI images of a mouse urinary bladder tissue section. (a) 10 μm step size, overlay of ion images of m/z = 741.5307 (blue, muscle tissue, sphingomyelin SM(34:1)), m/z = 798.5410 (green, urothelium, phosphatidyl choline PC(34:1)), and m/z = 743.5482 (red, lamina propria). (b) Overlay of ion images of m/z = 798.5410 (blue, urothelium, PC(34:1)), m/z = 812.5566 (green, umbrella cells, phosphatidyl ethanolamine PE(38:1)), and m/z = 616.1767 (red, blood vessels, heme b, M+.). (c) Microscope image of the tissue after staining with toluidine blue. Reproduced with permission from Römpp, A., Guenther, S., Schober, Y., Schulz, O., Takats, Z., Kummer, W., and Spengler, B. (2010) Angew. Chem. Int. Ed., 3834–3838. Copyright Wiley-VCH Verlag GmbH.

For peptide imaging at a lateral resolution of 5 μm, experiments were reported on neuropeptides in mouse pituitary gland tissue, differentiating individual peptide-containing cells and identifying peptide sequences and post-translational modifications by MS/MS imaging.

43.2.8 Identification and Characterization of Analytes

Mass spectrometry images were initially limited in their information content to the visualization of an intensity distribution within a certain m/z window. Identification and characterization of the imaged substances directly from the image data was not possible. Furthermore, the analytical homogeneity (i.e., the substance specificity of the chosen mass window) was not guaranteed with the low-resolution mass analyzers employed. It was thus rarely possible to perform an analytical validation of the method prior to any biomedical validation. Instead, in parallel to an imaging analysis, classical identification methods were employed on homogenized material, with subsequent correlation to the imaging data. This limitation has been eliminated, or at least strongly reduced, by modern methods of high-resolution and high-accuracy mass spectrometry directly from tissue, which allow direct validation of imaging data (Figure 43.13). Coupling imaging ion sources to high-resolution (FT-ICR or orbital trapping) mass spectrometers now provides direct identification of individual tissue components, using mass-based classification, combinatorial data evaluation, and MS/MS imaging.
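Mass-based classification relies on small relative mass errors: the deviation between a measured and a theoretical m/z is conventionally expressed in parts per million (ppm). A minimal sketch (the numeric example below is generic and not taken from the figures):

```python
def ppm_error(measured_mz, theoretical_mz):
    """Relative mass deviation in parts per million (ppm).
    High-resolution instruments (FT-ICR, orbital traps) routinely reach
    errors of a few ppm or below, enabling mass-based classification."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6
```

At m/z 800, a 1 ppm tolerance corresponds to only 0.0008 u, which is what makes four-decimal m/z values such as those in Figure 43.12 substance-specific in practice.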

Mass Analysers, Section 15.2


Figure 43.13 SMALDI MS/MS image of mouse spinal cord: (a) 10 μm step size, Orbitrap (R = 30 000), positive ions, MS ion image of m/z = 772.5253 (phosphatidyl choline PC(32:0)). (b) MS/MS ion image of m/z = 713.4518 (precursor ion m/z = 772.5253; PC(32:0)). (c) Molecular structure of PC(16:0/16:0). (d) Microscope image of the tissue after immunostaining. Courtesy of B. Spengler.

Further Reading

Section 43.1
Bode M, Irmler M, Friedenberger M, May C, Jung K, Stephan C, Meyer HE, Lach C, Hillert R, Krusche A, Beckers J, Marcus K, Schubert W. Interlocking transcriptomics, proteomics and toponomics technologies for brain tissue analysis in murine hippocampus. Proteomics 2008; 8(6): 1170–1178. doi: 10.1002/pmic.200700742.
Dress AWM, Lokot T, Pustyl'nikov LD, Schubert W. Poisson Numbers and Poisson Distributions in Subset Surprisology. Ann Comb. 2005; 8(4): 473–485. doi: 10.1007/s00026-004-0234-2.
Dress AWM, Lokot T, Schubert W, Serocka P. Two Theorems about Similarity Maps. Ann Comb. 2008; 12(3): 279–290. doi: 10.1007/s00026-008-0351-4.
Friedenberger M, Bode M, Krusche A, Schubert W. Fluorescence detection of protein clusters in individual cells and tissue sections by using toponome imaging system: sample preparation and measuring procedures. Nat Protoc. 2007; 2(9): 2285–2294. doi: 10.1038/nprot.2007.320.
Hillert R, Gieseler A, Krusche A, Humme D, Röwert-Huber HJ, Sterry W, Walden P, Schubert W. Large molecular systems landscape uncovers T cell trapping in human skin cancer. Sci Rep. 2016; 6: 19012. doi: 10.1038/srep19012.
Nattkemper TW, Ritter HJ, Schubert W. A neural classifier enabling high-throughput topological analysis of lymphocytes in tissue sections. IEEE Trans Inf Technol Biomed. 2001; 5(2): 138–149. doi: 10.1109/4233.924804.
Ostalecki C, Lee JH, Dindorf J, Collenburg L, Schierer S, Simon B, Schliep S, Kremmer E, Schuler G, Baur AS. Multiepitope tissue analysis reveals SPPL3-mediated ADAM10 activation as a key step in the transformation of melanocytes. Sci Signal. 2017; 10(470): eaai8288. doi: 10.1126/scisignal.aai8288.
Schubert W. Multiple antigen-mapping microscopy of human tissue. In: Burger G, Oberholzer M, Vooijs GP, editors. Excerpta Medica. Adv Anal Cell Pathol. Elsevier, Amsterdam; 1990. p. 97–98.
Schubert W. Topological Proteomics, Toponomics, MELK-Technology. Adv Biochem Eng Biotechnol 2003; 83: 189–209. doi: 10.1007/3-540-36459-5_8.
Schubert W, Bonnekoh B, Pommer AJ, Philipsen L, Böckelmann R, Malykh Y, Gollnick H, Friedenberger M, Bode M, Dress AWM. Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. Nat Biotechnol. 2006; 24(10): 1270–1278. doi: 10.1038/nbt1250.
Schubert W. A three-symbol code for organized proteomes based on cyclical imaging of protein locations. Cytometry A 2007; 71A(6): 352–360. doi: 10.1002/cyto.a.20281.

Schubert W, Gieseler A, Krusche A, Hillert R. Toponome Mapping in Prostate Cancer: Detection of 2000 Cell Surface Protein Clusters in a Single Tissue Section and Cell Type Specific Annotation by Using a Three Symbol Code. J Proteome Res. 2009; 8(6): 2696–2707. doi: 10.1021/pr800944f.
Schubert W. On the origin of cell functions encoded in the toponome. J Biotechnol. 2010; 149(4): 252–259. doi: 10.1016/j.jbiotec.2010.03.009.
Schubert W, Gieseler A, Krusche A, Serocka P, Hillert R. Next-generation biomarkers based on 100-parameter functional super-resolution microscopy TIS. N Biotechnol. 2012; 29(5): 599–610. doi: 10.1016/j.nbt.2011.12.004.
Schubert W. Toponomics. In: Dubitzky W, Wolkenhauer O, Cho KH, Yokota H, editors. Encyclopedia of Systems Biology. New York: Springer; 2013. p. 2191–2212. doi: 10.1007/978-1-4419-9863-7_631. ISBN 978-1-4419-9862-0.
Schubert W. Systematic, spatial imaging of large multimolecular assemblies and the emerging principles of supramolecular order in biological systems. J Mol Recognit. 2014; 27(1): 3–18. doi: 10.1002/jmr.2326.
Schubert W. Advances in toponomics drug discovery: Imaging cycler microscopy (ICM) correctly predicts a therapy method of amyotrophic lateral sclerosis (ALS). Cytometry A 2015; 87A: 696–703. doi: 10.1002/cyto.a.22671.

Section 43.2
Bouschen, W., Schulz, O., Eikel, O., and Spengler, B. (2010) Matrix vapour deposition/recrystallization and dedicated spray preparation for high-resolution scanning microprobe MALDI imaging mass spectrometry (SMALDI-MS) of tissue and single cells. Rapid Commun. Mass Spectrom., 24, 355–364.
Guenther, S., Römpp, A., Kummer, W., and Spengler, B. (2011) AP-MALDI imaging of neuropeptides in mouse pituitary gland with 5 μm spatial resolution and high mass accuracy. Int. J. Mass Spectrom., 305, 228–237.
Hillenkamp, F., Unsöld, E., Kaufmann, R., and Nitsche, R. (1975) Laser microprobe mass analysis of organic materials. Nature, 256, 119–120.
Hummon, A.B., Sweedler, J.V., and Corbin, R.W. (2003) Discovering new neuropeptides using single-cell mass spectrometry. Trends Anal. Chem., 22, 515–521.
Jespersen, S., Chaurand, P., van Strien, F.J.C., Spengler, B., and van der Greef, J. (1999) Direct sequencing of neuropeptides in biological tissue by MALDI-PSD mass spectrometry. Anal. Chem., 71, 660–666.
Luxembourg, S.L., McDonnell, L.A., Duursma, M.C., Guo, X., and Heeren, R.M.A. (2003) Effect of local matrix crystal variations in matrix-assisted ionization techniques for mass spectrometry. Anal. Chem., 75, 2333–2341.
Ostrowski, S.G., Van Bell, C.T., Winograd, N., and Ewing, A.G. (2004) Mass spectrometric imaging of highly curved membranes during Tetrahymena mating. Science, 305, 71–73.
Schröder, W.H. and Fain, G.L. (1984) Light-dependent calcium release from photoreceptors measured by laser micro-mass analysis. Nature, 309, 268–270.
Spengler, B., Hubert, M., and Kaufmann, R. (1994) MALDI ion imaging and biological ion imaging with a new scanning UV-laser microprobe. In: Proceedings of the 42nd ASMS Conference on Mass Spectrometry and Allied Topics, Chicago, Illinois, 29 May to 3 June 1994, p. 1041. American Society for Mass Spectrometry.
Spengler, B. and Hubert, M. (2002) Scanning microprobe matrix-assisted laser desorption ionization (SMALDI) mass spectrometry: instrumentation for sub-micrometer resolved LDI and MALDI surface analysis. J. Am. Soc. Mass Spectrom., 13, 735–748.
Stoeckli, M., Chaurand, P., Hallahan, D.E., and Caprioli, R.M. (2001) Imaging mass spectrometry: a new technology for the analysis of protein expression in mammalian tissue. Nat. Med., 7, 493–496.
Yanagisawa, K., Shyr, Y., Xu, B.J., Massion, P.P., Larsen, P.H., White, B.C., Roberts, J.R., Edgerton, M., Gonzalez, A., Nadaf, S., Moore, J.H., Caprioli, R.M., and Carbone, D.P. (2003) Proteomic patterns of tumour subsets in non-small-cell lung cancer. The Lancet, 362, 433–439.


Appendix 1: Amino Acids and Posttranslational Modifications

Table 1 Amino acids: mass, pK values, and pI values (mass values given are the unmodified amino acid mass minus water).

Symbols | Name and formula | Monoisotopic mass | Average mass | pK | pI
Ala, A | alanine, C3H5NO | 71.03711 | 71.07794 | 2.35; 9.69 | 6.02
Arg, R | arginine, C6H12N4O | 156.10111 | 156.18584 | 2.17; 9.04 (α-amino); 12.48 (guanidino) | 10.76
Asn, N | asparagine, C4H6N2O2 | 114.04293 | 114.10272 | 2.02; 8.8 | 5.41
Asp, D | aspartic acid, C4H5NO3 | 115.02694 | 115.08744 | 2.09 (α-carboxy); 3.86 (β-carboxy); 9.82 | 2.87
Cys, C | cysteine, C3H5NOS | 103.00919 | 103.14394 | 1.71; 8.33 (sulfhydryl); 10.78 (α-amino) | 5.02
(Cys)2 | cystine, C6H10N2O3S2 | 222.01328 | 222.28728 | 1.65; 2.26 (carboxy); 7.85; 9.85 (amino) | 5.06
Gln, Q | glutamine, C5H8N2O2 | 128.05858 | 128.12930 | 2.17; 9.13 | 5.65
Glu, E | glutamic acid, C5H7NO3 | 129.04259 | 129.11402 | 2.19 (α-carboxy); 4.25 (γ-carboxy); 9.67 | 3.22
Gly, G | glycine, C2H3NO | 57.02146 | 57.05136 | 2.34; 9.6 | 5.97
His, H | histidine, C6H7N3O | 137.05891 | 137.13940 | 1.82; 6.0 (imidazole); 9.17 | 7.58
Ile, I | isoleucine, C6H11NO | 113.08406 | 113.15768 | 2.36; 9.68 | 6.02
Leu, L | leucine, C6H11NO | 113.08406 | 113.15768 | 2.36; 9.60 | 5.98
Lys, K | lysine, C6H12N2O | 128.09496 | 128.17236 | 2.18; 8.95 (α-amino); 10.53 (ε-amino) | 9.74
Met, M | methionine, C5H9NOS | 131.04048 | 131.19710 | 2.28; 9.21 | 5.75
Phe, F | phenylalanine, C9H9NO | 147.06841 | 147.17390 | 1.83; 9.13 | 5.98
Pro, P | proline, C5H7NO | 97.05276 | 97.11522 | 1.99; 10.60 | 6.10
Ser, S | serine, C3H5NO2 | 87.03203 | 87.07734 | 2.21; 9.15 | 5.68
Thr, T | threonine, C4H7NO2 | 101.04768 | 101.10392 | 2.63; 10.43 | 6.53
Trp, W | tryptophan, C11H10N2O | 186.07931 | 186.20998 | 2.38; 9.39 | 5.88
Tyr, Y | tyrosine, C9H9NO2 | 163.06333 | 163.17330 | 2.20; 9.11 (α-amino); 10.07 (phenol) | 5.65
Val, V | valine, C5H9NO | 99.06841 | 99.13110 | 2.32; 9.62 | 5.97
Sec, U | selenocysteine, C3H5NOSe | 150.95363 | 150.03794 | 5.3 (selenol) |

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.
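The residue masses in Table 1 can be used directly to compute peptide masses: a peptide's monoisotopic mass is the sum of its residue masses plus one water (18.010565 u) for the two termini. A minimal sketch (the dictionary values are the monoisotopic column of Table 1; the helper function itself is our own illustration):

```python
# Monoisotopic residue masses (amino acid minus water), from Table 1, in u.
MONO = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
        'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
        'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
        'K': 128.09496, 'E': 129.04259, 'M': 131.04048, 'H': 137.05891,
        'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
WATER = 18.010565  # mass of H2O added back for the N- and C-termini

def peptide_mono_mass(sequence):
    """Monoisotopic mass of an unmodified peptide given in one-letter code."""
    return sum(MONO[aa] for aa in sequence) + WATER
```

For example, free glycine (sequence 'G') comes out at 57.02146 + 18.010565 ≈ 75.0320 u, matching the residue-plus-water convention of the table.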


Table 2 Posttranslational modifications and mass changes of the modified peptide.

Modification | Monoisotopic mass | Average mass
5'-Adenylation | 329.0525 | 329.2091
Acetylation | 42.0106 | 42.0373
N-Acetylhexosamine (GalNAc, GlcNAc) | 203.0794 | 203.1950
N-Acetylneuraminic acid (sialic acid, NeuAc, NANA, SA) | 291.0954 | 291.2579
ADP-Ribosylation (from NAD) | 541.0611 | 541.3052
Biotinylation (via amide bond to Lys) | 226.0776 | 226.2994
Carboxylation (of Asp and Glu) | 43.9898 | 44.0098
C-terminal amide (from Gly) | 0.9840 | 0.9847
Cysteinylation | 119.0041 | 119.1442
Deamidation (of Asn and Gln) | 0.9840 | 0.9847
Deoxyhexoses (Fuc, Rha) | 146.0579 | 146.1430
Disulfide bridge | 2.0157 | 2.0159
Farnesylation | 204.1878 | 204.3556
Formylation | 27.9949 | 28.0104
Geranylgeranylation | 272.2504 | 272.4741
Glutathionylation | 305.0682 | 305.3117
N-Glycolylneuraminic acid (NeuGc) | 307.0903 | 307.2573
Hexosamines (GalN, GlcN) | 161.0688 | 161.1577
Hexoses (Fru, Gal, Glc, Man) | 162.0528 | 162.1424
Homoserine (from Met, by CNBr treatment) | 29.9928 | 30.0935
Hydroxylation | 15.9949 | 15.9994
Lipoic acid (via amide bond to Lys) | 188.0330 | 188.3147
Methylation | 14.0157 | 14.0269
Myristoylation | 210.1984 | 210.3598
Oxidation (of Met) | 15.9949 | 15.9994
Palmitoylation | 238.2297 | 238.4136
Pentoses (Ara, Rib, Xyl) | 132.0423 | 132.1161
4-Phosphopantetheine | 339.0780 | 339.3294
Phosphorylation | 79.9663 | 79.9799
Proteolysis (of a peptide bond) | 18.0106 | 18.0153
Pyridoxal phosphate (via Schiff base to Lys) | 231.0297 | 231.1449
Pyroglutamic acid (from Gln) | 17.0265 | 17.0306
Stearoylation | 266.2610 | 266.4674
Sulfation | 79.9568 | 80.0642
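The mass shifts in Table 2 are simply added to (or, for losses such as pyroglutamate formation or amidation, subtracted from) the unmodified peptide mass. A minimal sketch using a few of the monoisotopic shifts from the table (the dictionary selection and helper function are illustrative only):

```python
# A few monoisotopic mass shifts from Table 2, in u.
SHIFTS_MONO = {
    'phosphorylation': 79.9663,
    'acetylation': 42.0106,
    'methylation': 14.0157,
    'oxidation (Met)': 15.9949,
}

def modified_mass(unmodified_mass, *modifications):
    """Monoisotopic mass of a peptide carrying the given (additive) modifications."""
    return unmodified_mass + sum(SHIFTS_MONO[m] for m in modifications)
```

For instance, a singly phosphorylated peptide appears 79.9663 u above its unmodified counterpart, which is how such modifications are recognized in mass spectra.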

Appendix 2: Symbols and Abbreviations

2DPP-mapping  two-dimensional phosphopeptide mapping
4-NA  4-nitroanilide
5caC  5-carboxylcytosine
5fC  5-formylcytosine
5hmC  5-hydroxymethylcytosine
5mC  5-methylcytosine
5mCG  CpG-methylation
α  selectivity
ACE  affinity capillary electrophoresis
ACQ  6-aminoquinoyl-N-hydroxysuccinimidyl carbamate
AD  activation domain
AFM  atomic force microscopy
AFM  atomic force microscope
ALS  amyotrophic lateral sclerosis
AMV  avian myeloblastosis virus
AO  atomic orbital
AOBS  acousto-optical beam-splitter
AOTF  acousto-optic tunable filter
AP  alkaline phosphatase
AP-MS  affinity purification mass spectrometry
APSY  automated projection spectroscopy
ARM  arginine-rich motif
ASKA  analogue-sensitive kinase allele
ATH  amino acid thiohydantoin
ATP  adenosine triphosphate
ATR  attenuated total reflection
ATZ  anilinothiazolinone
β2m  β2-microglobulin
BACTH  bacterial adenylate cyclase-based two-hybrid system
BCA  bicinchoninic acid
BCIP  5-bromo-4-chloro-indoxyl phosphate
bDNA  branched DNA amplification
BFP  blue fluorescent protein
BIO  biotin:avidin (or streptavidin)
BIOS  biology-oriented synthesis
BLAST  Basic Local Alignment Search Tool
BNPS-skatole  3-bromo-3-methyl-2-(2-nitrophenylthio)-3H-indole

BOD5  biochemical oxygen demand in 5 days
BSA  bovine serum albumin
BSTFA  bis(trimethylsilyl)trifluoroacetamide
BTK  Bruton tyrosine kinase
C5mCWGG  Dcm-methylation
CAD  collision-activated dissociation
CAP  catabolite activator protein
CAT  chloramphenicol acetyltransferase
CBP  calmodulin-binding peptide
CCD  charge-coupled device
CD  circular dichroism
CD  cyclodextrin
CDR  complementarity-determining region
CE  capillary electrophoresis
CE  collision energy
CEC  capillary electrokinetic chromatography
CEC  capillary electrochromatography
CEkF  capillary electrokinetic fractionation
CEM  channel electron multiplier
CET  cryo-electron tomography
CF  cystic fibrosis
CFP  cyan fluorescent protein
CGE  capillary gel electrophoresis
CGH  comparative genomic hybridization
CHEF  contour-clamped homogeneous electric field
ChIP  chromatin immunoprecipitation
ChIP-chip  chromatin immunoprecipitation and chip analysis
CI  chemical ionization
CID  collision-induced dissociation
CIEF  capillary isoelectric focusing
CISS  chromosomal in situ suppression
cLSM  confocal laser scanning microscopy
CMC  critical micelle concentration
CMP  combinatorial molecular phenotype
CNV  copy number variant
CoBRA  combined bisulfite restriction analysis
co-IP  co-immunoprecipitation
COSY  correlation spectroscopy
CRISPR  clustered regulatory interspaced short palindromic repeats



CRM  charged-residue model
CSP  cold-shock protein
CTAB  cetyltrimethylammonium bromide
CTF  contrast transfer function
CZE  capillary zone electrophoresis
ESR  electron spin resonance
EST  expressed sequence tag
ETD  electron transfer dissociation
ETR  electron transfer reaction

DABS-Cl  4-dimethylaminoazobenzene-4-sulfonyl chloride
DBD  DNA-binding domain
DBE  direct blotting electrophoresis
DDA  data-dependent acquisition
DECP  diethyl pyrocarbonate
DFP  diisopropyl fluorophosphate
DFT  density functional theory
DGGE  denaturing gradient gel electrophoresis
DHR  dihydro-rhodamine
DIA  data-independent acquisition
DIC  N,N'-diisopropylcarbodiimide
DIG  digoxigenin:anti-digoxigenin
DIGE  difference gel electrophoresis
DMPC  dimyristoyl-phosphatidylcholine
DMS  dimethyl sulfate
DMT  dimethoxytrityl
DNMT  DNA methyltransferase
DNP  2,4-dinitrophenol
DO  dissolved oxygen
DOS  diversity-oriented synthesis
DPTU  diphenylthiourea
DR  dummy residues
ds  double-stranded
DSC  differential scanning calorimeter
DSC  differential scanning calorimetry
dsRBD  double-stranded RNA binding domain
EBV  Epstein–Barr virus
ECD  electron capture dissociation
EDS  energy-dispersive X-ray spectrometer
EDTA  ethylenediaminetetraacetic acid
EELS  electron energy loss spectroscopy
EGFR  epidermal growth factor receptor
EI  electron impact
EI  electron impact ionization
EIA  enzyme-immune assay
EIC  extracted ion chromatogram
ELISA  enzyme-linked immunosorbent assay
ELSD  evaporative light scattering detector
EMBL  European Molecular Biology Laboratory
EMBOSS  European Molecular Biology Laboratory Open Software Suite
EMSA  electrophoretic mobility shift analysis
ENDOR  electron nuclear double resonance
ENU  ethylnitrosourea
EOF  electroosmotic flow
EOF  endoosmotic flow
EOM  ensemble optimization method
EPL  expressed protein ligation
EPO  erythropoietin
EPR  electron paramagnetic resonance
ER  endoplasmic reticulum
ESEEM  electron spin echo envelope modulation
ESI  electrospray ionization
ESI  electron spectroscopic imaging
ESI-MS  electrospray ionization mass spectrometry
F  flow rate
FAB  fast atom bombardment
FACS  fluorescence-activated cell sorting
FCM  fluorescence correlation microscopy
FDA  Food and Drug Administration
FDR  framework-determining region
Fe-BABE  Fe 1-(p-bromoacetamidobenzyl)ethylenediaminetetraacetic acid
FEL  free electron laser
FET  field effect transistor
FFT  fast Fourier transform
FIB  focused ion beam
FID  free induction decay
FID  flame ionization detector
FIGE  field inversion gel electrophoresis
FIR  far-infrared
FISH  fluorescence in situ hybridization
FITC  fluorescein isothiocyanate
FLIM  fluorescence lifetime imaging microscopy
FLUGS  fluorescein:anti-fluorescein
FMLP  N-formyl-Met-Leu-Phe
FMOC  fluorenylmethoxycarbonyl
FRAP  fluorescence recovery after photobleaching
FRET  Förster resonance energy transfer
FRET  fluorescence resonance energy transfer
FSM  fluorescent speckle microscopy
FT  Fourier transform
FT-ICR-MS  Fourier-transform ion cyclotron resonance mass spectrometer
FT-IR  Fourier transform infrared
FT-NMR  Fourier-transform NMR
FWHM  full-width at half-maximum
GCC  graphitized carbon column
GFP  green fluorescent protein
GN6mATC  Dam-methylation
GOD  glucose oxidase
gRNA  guide RNA
GSD  ground-state depletion
GST  glutathione-S-transferase
GWAS  genome-wide association studies
HAT  acetyl transferase
HBsAg  hepatitis B surface antigen
HCD  higher-energy collisional dissociation
HCV  hepatitis C virus
HDAC  histone deacetylase
HDR  homology-directed repair
HETP  height equivalent to one theoretical plate
HGP  Human Genome Project
HIC  hydrophobic interaction chromatography
HIV  human immunodeficiency virus
HLH  helix-loop-helix protein
HOBt  1-hydroxybenzotriazole
HP-AC  high-performance affinity chromatography
HP-AEX  high-performance anion exchange chromatography

Appendix 2: Symbols and Abbreviations

HPAEC-PAD

HP-ANPC HP-CEX HP-GPC HP-HIC HP-HILIC HP-IEX HPLC HP-NPC HP-RPC HP-SEC HPTLC HSQC HTH HVTEM

high performance anion exchange chromatography with pulsed amperometric detection high-performance aqueous normal-phase chromatography high-performance cation exchange chromatography high-performance gel permeation chromatography high-performance hydrophobic interaction chromatography high-performance hydrophilic interaction chromatography high-performance ion exchange chromatography high-performance liquid chromatography high-performance normal-phase chromatography high-performance reversed-phase chromatography high-performance size exclusion chromatography high-performance thin-layer chromatography heteronuclear single-quantum coherence helix-turn-helix structure high-voltage transmission electron microscope

ICAT – isotope-coded affinity tag
ICM – imaging cycler microscopy
ICPL – isotope-coded protein label
IDP – intrinsically disordered protein
IEC – ion exchange chromatography
IEF – isoelectric focusing
IEM – ion evaporation model
IEX – ion exchange chromatography
IMAC – immobilized metal chelate chromatography
IMS – ion mobility spectrometry
IP – isoelectric point
IP – immunoprecipitation
IPG – immobilized pH gradient
IRE – internal reflection element
IRMPD – infrared multiphoton dissociation
IRS – interspersed repetitive sequence
ISD – in-source decay
ISH – in situ hybridization
ITC – isothermal titration calorimeter
ITC – isothermal titration calorimetry
ITP – isotachophoresis
iTRAQ – isobaric tags for relative and absolute quantitation

k – retention factor
K – specific conductivity
KLH – keyhole limpet hemocyanin
KTS – Klaus Tschira Foundation

LADH – liver alcohol dehydrogenase
LAMMA – laser microprobe mass analysis
LC – liquid chromatography
LCAO – linear combinations of atomic orbitals

LC-MS/MS – liquid chromatography tandem mass spectrometry
LCR – ligase chain reaction
LCR – locus control region
LD – linear dichroism
LDA – low-density amorphous ice
LDA – linear discriminant analysis
LDI – laser desorption/ionization
LDI-MS – laser desorption/ionization mass spectrometry
LED – light-emitting diode
LIF – laser-induced fluorescence
LIMS – laboratory information management system
LINC – light-induced co-clustering
LM – beam path in the light microscope
LNA – locked nucleic acid
LOC – lab-on-a-chip
LOD – lactate oxidase
LSM – laser scanning microscope
LTR – long terminal repeat
LVSEM – low-voltage scanning electron microscope
μeff – effective mobility
MALDI – matrix-assisted laser desorption/ionization
MALDI-TOF-MS – matrix-assisted laser desorption/ionization time-of-flight mass spectrometry
MALDI-TOF – matrix-assisted laser desorption/ionization time-of-flight
MBD – methyl-CpG-binding domain
MCE – microchip electrophoresis
MCP – microchannel plate
MD – molecular dynamics
MD-HPLC – multidimensional (multistage, multicolumn) high-performance liquid chromatography
MeCP – methyl-CpG-binding protein
MeDIP – methylated DNA immunoprecipitation
MEF – mouse embryonal fibroblast
MEKC – micellar electrokinetic chromatography
MELC – multi-epitope-ligand cartography
MES – minimal ensemble search
ME-SIMS – matrix-enhanced SIMS
MFM – magnetic force microscope
MHC – major histocompatibility complex
MIP – molecularly imprinted polymer
MIR – mid-infrared
MIR – multiple isomorphous replacement
MIRA – methylated-CpG island recovery assay
miRNA – microRNA
MMD – multimolecular domain
MMLV Env – Moloney murine leukemia virus envelope
MO – molecular orbital
MOPS – 3-(N-morpholino)-1-propanesulfonic acid
MPE-Fe – methidiumpropyl-EDTA-Fe
MPS – massive parallel sequencing
MR – molecular replacement
MRM – multiple reaction monitoring
MS – mass spectrometry
MSA – multistage activation
MSP – methylation-specific PCR
MTF – modulation transfer function

N – plate number
N6mA – N6-methyladenine
NA – Avogadro's constant
NA – numerical aperture
NAD+ – nicotinamide adenine dinucleotide
NAGK – N-acetyl-L-glutamate kinase
NAR – Nucleic Acids Research
NASBA – nucleic acid sequence-based amplification
NAT – N-acetyltransferase
NBS – N-bromosuccinimide
NBT – nitro-blue tetrazolium salt
NCL – native chemical ligation
NCS – N-chlorosuccinimide
NCS – noncrystallographic symmetry
NGS – next-generation sequencing
NHEJ – non-homologous end joining
NIR – near-infrared
NLS – nuclear localization signal
NMR – nuclear magnetic resonance
NOESY – nuclear Overhauser effect spectroscopy
NP-LC – normal-phase liquid chromatography
NSCLC – non-small cell lung cancer
NSOM – near-field scanning optical microscopy

OD OLA OMA ONPG OPA OP-Cu ORD ORF PA PACE PAcIFIC PAD PAGE PALM PAM PAP PAT PBPC PBS PC PCA PCA PCMD PCR PDB PDMS PED PEG PELDOR PFGE PID PITC PMF PMMA

PMSF – phenylmethylsulfonyl fluoride
PNA – peptide nucleic acid
PNGase F – peptide N-glycosidase F
PPC – pressure perturbation calorimetry
PPI – protein–protein interaction
PQC – piezoelectric quartz crystals
PRE – paramagnetic relaxation enhancement
PRM – parallel reaction monitoring
PS – power spectrum
PSD – post-source decay
PSF – point spread function
PSSM – position-specific scoring matrix
PTC – phenylthiocarbamoyl
PTH – phenylthiohydantoin
PTM – post-translational modification
PTR – proton transfer reaction
PVDF – poly(vinylidene fluoride)
PVP – polyvinylpyrrolidone

QCL – quantum cascade laser
QD – quantum dot
QSRR – quantitative structure–retention relationship
QY – quantum yield

OD – optical density
OLA – oligonucleotide ligation assay
OMA – optical multichannel analyzer
ONPG – o-nitrophenyl-β-D-galactopyranoside
OPA – ortho-phthaldialdehyde
OP-Cu – bis(1,10-ortho-phenanthroline)-copper(I) complex
ORD – optical rotatory dispersion
ORF – open reading frame

Rs – resolution
RAM – restricted access material
RCF – relative centrifugal force
RCR – repair chain reaction
RDC – residual dipolar coupling
RF – replicative form
RFLP – restriction fragment length polymorphism
RGB – red-green-blue
RIA – radioimmunoassay
RISC – RNA-induced silencing complex
RMSD – root-mean-square deviation
RNAi – RNA interference
RNP – ribonucleoprotein
RP – reversed phase
RPA – ribonuclease protection assay
RPC – reversed-phase chromatography
RP-LC – reversed-phase liquid chromatography
rpm – rotations per minute
RRM – relative resolution map
rRNA – ribosomal RNA
RSV – Rous sarcoma virus
RSV – respiratory syncytial virus
RT – reverse transcriptase

S/MRM – selected or multiple reaction monitoring
SAA – serum amyloid A
SAXS – small-angle X-ray scattering
SBLD – structure-based ligand design
SCAM – scanning cysteine accessibility method
SCX – strong cation exchange
SD – standard deviation
SDA – strand displacement amplification
SDS – sodium dodecyl sulfate
SDS-PAGE – SDS-polyacrylamide gel electrophoresis
SEC – size-exclusion chromatography
SELEX – systematic evolution of ligands by exponential enrichment
SEM – scanning electron microscope
SEV – secondary electron multiplier

PA – photoactivatable
PACE – programmable autonomously controlled electrode
PAcIFIC – precursor acquisition independent from ion count
PAD – pulsed amperometric detection
PAGE – polyacrylamide gel electrophoresis
PALM – photoactivated localization microscopy
PAM – protospacer adjacent motif
PAP – peroxidase–anti-peroxidase
PAT – process analytical technology
PBPC – polar bonded-phase chromatography
PBS – phosphate-buffered saline solution
PC – peak capacity
PCA – protein fragment complementation assay
PCA – principal component analysis
PCMD – power of combinatorial molecular discrimination
PCR – polymerase chain reaction
PDB – Protein Data Bank
PDMS – polydimethylsiloxane
PED – pulsed electrochemical detection
PEG – poly(ethylene glycol)
PELDOR – pulsed electron–electron double resonance
PFGE – pulsed-field gel electrophoresis
PID – photon-induced dissociation
PITC – phenyl isothiocyanate
PMF – peptide mass fingerprint
PMMA – poly(methyl methacrylate)

SFM – scanning force microscopy
SFX – serial femtosecond crystallography
SHAPE – selective 2′-hydroxyl acylation analyzed by primer extension
shRNA – short hairpin RNA
SI – secondary ion ionization
SICM – scanning ion conductance microscope
SIDT – single ion in droplet theory
SILAC – stable isotope labeling by amino acids in cell culture
SIM – single-ion monitoring
SIMS – secondary ion mass spectrometry
SIR – single isomorphous replacement
siRNA – small interfering RNA
SMALDI – scanning microprobe MALDI
SNIM – scanning near-field infrared microscopy
SNOM – scanning near-field optical microscopy
SNP – single-nucleotide polymorphism
snRNA – small nuclear RNA
SNV – single-nucleotide variant
SOG – singlet oxygen generator
SPE – solid-phase extraction
SPFS – surface plasmon fluorescence spectroscopy
SPM – scanning probe microscope
SPR – surface plasmon resonance
SPRI – solid-phase reversible immobilization
SREBP – sterol response element-binding protein
SRM – single reaction monitoring
ss – single-stranded
SSB – single-stranded binding
SSCP – single-strand conformation polymorphism
SSIM – saturated structured illumination
STED – stimulated emission depletion
STEM – scanning transmission electron microscope
STET – singlet oxygen triplet energy transfer
STM – scanning tunneling microscope
STORM – stochastic optical reconstruction microscopy
STRP – short tandem repeat polymorphism
STS – sequence-tagged site
SWATH-MS – sequential windowed acquisition of all theoretical fragment ion mass spectra
τc – rotational correlation time
TAD – topologically associating domain
TAE – tris acetate
TAP – tandem affinity purification
TBAB – tetrabutylammonium bromide
TBE – tris borate
TE – tris-HCl/EDTA
TEA – triethylamine
TEAA – triethylammonium acetate
TEM – transmission electron microscope

TEMED – N,N,N′,N′-tetramethylethylenediamine
TEV – tobacco etch virus
TFA – trifluoroacetic acid
TFO – triplex-forming oligonucleotide
TGGE – temperature gradient gel electrophoresis
TIC – total ion current
TIR – total internal reflection
TIRFM – total internal reflection fluorescence microscopy
TIS – toponome imaging system
TLC – thin-layer chromatography
TLCK – L-1-chloro-3-(4-tosylamido)-7-amino-2-heptanone hydrochloride
Tm – melting point
TMA – transcription-mediated amplification
TMT – tandem mass tag
TOCSY – total correlation spectroscopy
TOF – time-of-flight
TPA – 1-oxyl-2,2,5,5-tetramethylpyrroline-3-acetylene
TPCK – L-1-chloro-3-(4-tosylamido)-4-phenyl-2-butanone
tR – retention time
TRAP – trp RNA-binding attenuation protein
TRF – time-resolved fluorescence
TROSY – transverse relaxation optimized spectroscopy
TSH – thyroid-stimulating hormone
TSS – transcription start site
t0 – void time

UAS – upstream activating sequence
Ub – ubiquitin
UHPLC – ultrahigh pressure liquid chromatography
UNG – uracil-N-glycosylase
UV – ultraviolet
UV-DAD – UV diode array detection

V0 – void volume
VR – retention volume
VSV-G – vesicular stomatitis virus glycoprotein

W – heat
WLC – worm-like chain

YAC – yeast artificial chromosome
YFP – yellow fluorescent protein

ZMW – zero-mode waveguide
ZPCK – carbobenzoxy-L-phenylalanine chloromethyl ketone

Appendix 3: Standard Amino Acids (three- and one-letter codes)

[Structural formulas of the standard amino acids, grouped as: nonpolar, aliphatic; aromatic; polar, uncharged]

Bioanalytics: Analytical Methods and Concepts in Biochemistry and Molecular Biology, First Edition. Edited by Friedrich Lottspeich and Joachim Engels. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2018 by Wiley-VCH Verlag GmbH & Co. KGaA.


[Structural formulas of the standard amino acids, grouped as: positively charged; negatively charged]

Appendix 4: Nucleic Acid Bases

[General structure of nucleotides; structures of the nucleic acid bases]



[Structures of modified nucleic acid bases]

Index

A Abbe, Ernst 181, 187, 485, 493 absolute molecule mass 4 absolute quantification (AQUA) of modified peptides 660 absorption bands, of most biological molecules 139 measurement 140–142 of photon 135 spectroscopy 131 acetic acid 228 acetonitrile 227 N-acetyl-α-D-glucosamine (αGlcNAc) 579 acetylated proteins detection of 651 separation and enrichment 649 acetylation 224, 645–647 sites, in proteins 656 identification of 655 N-acetyl-β-D-glucosamine (βGlcNAc) 579 N-acetylgalactosamine 129 achromatic objectives 183 acid-base properties 5, 234 acidic and basic acrylamide derivatives 261 acidic native electrophoresis 257 acidic peptides 562 acidification 3 activation domain (AD) fusion proteins 385 activation energy 37, 38 acylation 109, 110 adeno-associated viruses 913 adenosine triphosphate (ATP) binding pocket 1050 site 1048

Aequorea victoria 182, 192 aerosols, danger of contamination 770 affinity capillary electrophoresis (ACE) 285–286 binding constant, determined by 285 changing mobility 286 complexation of monovalent protein–ligand complexes 285 affinity chromatography 91, 268, 650 affinity purification mass spectrometry (AP-MS) 381, 1003 agarose concentrations DNA fragments, coarse separation of 692 migration distance and fragment length 692 agarose gels, advantages of 260 agglutination 72 aggregation number 288 AK2-antibodies 102 alanine 562 alanine-scanning method 870 albumin 3, 995 alcohol dehydrogenase (ADH) gene promoter 382 alkaline phosphatase (AP) 87, 741, 746, 753 alkylation 32 of cysteine residues 210 allele-specific hybridization probes 774 allergies 65 allotype determinations 80 all-trans conformation 48 amidation 110 amino acid analysis 118, 207, 301, 313 biophysical properties of 887 identification of 316–317 liquid chromatography with optical detection systems 303

post-column derivatization 303–305 pre-column derivatization 305–308 reagents used for 310 sample preparation 302 acidic hydrolysis 302 alkaline hydrolysis 303 enzymatic hydrolysis 303 using mass spectrometry 309 L-α-amino acid residues/termini 225 amino acid sequence analysis milestones in 319 problems 322 amino acids 323–324 background 324 initial yield 324 modified amino acids 324 purity of chemicals 324 sample to be sequenced 322–323 sensitivity of HPLC system 325 state of the art 325 6-aminoquinoyl-N-hydroxysuccinimidyl carbamate (ACQ) 307 ammonium sulfate 10 Amoeba dubia 785 AMPD anion 747 amyotrophic lateral sclerosis (ALS) 1059 analogue sensitive kinase alleles (ASKAs) approach 1047, 1048 kinases 1050 analogue-sensitive kinases, inhibitors and cofactors 1049 analytical method, development of 234–235 analysis of fractionations 238 column efficiency, optimization of 235




analytical method, development of (continued ) fractionation 237–238 retention factors, optimization of 235–236 scaling up to preparative chromatography 236–237 selectivity, optimization of 235 analytical ultracentrifugation 409–410 basics of centrifugation 411–412 photoelectric scanner in 410 principles of instrumentation 410 sedimentation velocity experiments 412 experimental procedures 413–414 N-acetyl-L-glutamate-kinase and signal protein PII, interaction between 414–415 physical principles 412–413 sedimentation–diffusion equilibrium experiments 415–416 anilinothiazolinone (ATZ) amino acid 316 antibiotic resistance genes 914 antibodies 63, 64, 1034 allotype 65 binding 87 in conjunction with use of natural Fcγ-effector functions 104 engineering 99–102, 101 Fc-receptor 65 handling of 68–69 microarrays 956 production 98 properties of 64–65 IgA 64 IgD 65 IgG 64, 65 IgM 64, 65 as reagents 64 types of 98–99 antibody-directed cellular cytotoxicity (ADCC) 104 antigen–antibody-complexes in vitro, reversible separation of 65 antigen–antibody reaction 64, 71–72 antigenic system 74 antigen interaction at combining site 67–68 antigens 69–71 antisense oligonucleotides 959, 961, 963, 964 antisense probe in vitro synthesis of 899 ApA-wedge/B-junction models 832 aperture diaphragm 183 apochromatic objectives 183 apolipoprotein B 3 apoptosis 645

aptamers 869, 971 high-affinity RNA/DNAoligonucleotides 971–974 selection of 971–973 Selex procedure 869 uses of 973–974 aqueous DNA solutions phenolic extraction of 666 Ardenne, Manfred von 486 arginine-rich motif (ARM) 859 aromatic amino acid 225, 561 Arrhenius’ equation 44 aryl azides 125 aryl(trifluoromethyl)diazirines 126 asparagine 579 Aspergillus orizae 862, 863, 896, 899 atomic force microscope (AFM) 486, 519 detection of ligand binding and function of 526 determining protein complex assembly and function by 524 functional states and interactions of individual proteins 526–527 gap junctions 523 imaging with 521 interaction between tip and sample 521–522 mapping biological macromolecules 522–524 preparation procedures 522 schematic representation of 520 single molecules 524–526 atomic orbitals (AOs) 134 attenuated total reflection (ATR) 166 automated projection spectroscopy (APSY) 463 autoradiogram 791 autoradiography 652 avian myeloblastosis virus (AMV) 761 avidin–biotin complex formation (ABC system) 87 azido salicylic acid 125 B Bacillus amyloliquefaciens 683 Bacon, Roger 181 bacterial artificial chromosomes (BACs) 788, 934 bacterial suspension 672 Bacto tryptone 671 band broadening 222, 223 barcode array 955 base-catalyzed β-elimination 651 base pairing, complementarity of 720 Baumeister, Wolfgang 486 B-cell 99, 100 Beckman Optima 410

benzophenone derivatives 123 photolabels 127 benzoyl cyanide (BzCN) 865 p-benzoyl-L-phenylalanine 123 bicinchoninic acid assay 27–28 bifunctional reagents 121, 122 binding tests 73, 85, 86, 99 binocular tubes 184 biochemical pathways complex structures, representation of 1024 oxygen demand 425–426 bioinformatics analysis 219 biological functions, alteration of 97–98 biologically relevant lipids, classification of 614 biological starting materials, disruption methods 9 bioluminescence resonance energy transfer (BRET) 408 BioMed Central Bioinformatics 877 biomimetic recognition elements 419 biomimetic sensors 427 aptamers 428 molecularly imprinted polymers 427–428 biophysical methods 131 biopolymers 131 biosensors 419, 428 anti-interference principle 424 concept of 420–421 construction/function 421–423 coupled enzyme reactions in 424 enzymatic analyte cycles 424 enzyme electrodes generation 423–424 sequence/competition 424 BIO system detection system 746 labeled nucleotides 742 biotin biotinylation, reagents for 128 biotinyl groups 129 disadvantage of 746 labeled dNTPs structure 743 biphasic column reactor, sequencers with 321–322 bis(1,10-ortho-phenanthroline)-copper (I) complex (OP-Cu) 849 bispecific antibodies 100 bis(trimethylsilyl)trifluoroacetamide (BSTFA) 621 bisulfite methylation analysis 819, 821 bisulfite PCR enzymes for restriction analysis 822 RASSF1A-promoter 822 restriction analysis 820–822, 821 bisulfite-treated DNA 820


amplification and sequencing of 819–820 biuret assay 26 BLOSUM 62, 887 mutation data matrix 888 blotting 22. see also electroblotting; nucleic acid blotting capillary blots 706 dot blotting unit 707 membranes 705 blue native polyacrylamide gel electrophoresis 259 Bolton–Hunter reagent 109 bonding energy 37 Borries, Bodo von 486 “bottom up” protein analysis 314 bovine serum albumin (BSA) 24, 54 Bradford assay 28 branched DNA (bDNA) 752 amplification method 782 Bravais lattices 535 brilliant blue 251 Broglie, Louis de 486 5-bromo-4-chloro-indoxyl phosphate (BCIP) NBT, coupled optical redox reaction 748 5-bromo-UTP (BrU) structures of 906 bruton tyrosine kinase (BTK) 1049 bufadienolide K-strophanthin 745 buffer systems 255 substance 43–44 bump-and-hole method 1047 t-butyldimethylsilyl ether (TBDMS) 710 C Caenorhabditis elegans 942, 967, 1015, 1052 caesium chloride (CsCl) density gradient 703, 721 solutions 14 caged compounds 111 Ca2+ imaging 205 calcium phosphate 913 calibration curve 24, 25 calorimetry 47 Candida albicans 768 cap analysis of gene expression (CAGE) 868 capillary blots 706 capillary columns 621 capillary electrochromatography (CEC) 288–289 capillary electrophoresis (CE) 244, 275, 281–295, 296, 299, 649, 650, 827 basic principles 277 detection methods 279–281

fluorescence detection 280 mass spectrometry detection 280–281 UV-detection 280 engine, electroosmotic flow 278–279 historical overview 275–276 Joule heating 279 sample injection 277 electrokinetic injection 277 hydrodynamic injection 277 stacking 277 schematic view 276 setup 276–277 capillary 276–277 instrumental setup 276 voltage unit 276 capillary gel electrophoresis (CGE) 290–291, 701, 711 sifting media for 290 capillary isoelectric focusing (CIEF) 291–294 focusing with chemical mobilization 292–293 pressure/voltage mobilization 292 one-step focusing 292 capillary zone electrophoresis (CZE) 281–285, 561, 1027 buffer additives 284–285 capillary coating 284 electrodispersion 282 ionic strength 283 optimization of separation 283 peak broadening 282 pH-value of buffer 283 temperature 283 carbene forming reagents 126 carbodiimides 125 carbohydrates 572 5-carboxylcytosine (5caC) 817 carboxymethylaspartic acid 233 carboxypeptidases 328 cleaved amino acids, detection 328 polypeptides, degradation 327 specificity of 327 carotenoids 638 carrier ampholyte IEF advantages/disadvantages 261 properties for ideal carrier ampholyte 261 catalysts 37 cationic detergent electrophoresis 258 cDNA libraries 385 cell adhesion 64 cell arrangements 1039 cell disruption 7 cell isolation 96 cell sensors 425 cellular immunology 95–97 cellulose esters 16


centimorgans (cM) 926, 929 centrifugation 9, 11, 411 basic principles 12 density gradient 14 CsCl solutions 14 sucrose 14 fractionation of separated bands 14 rotors for 11 techniques 12 differential 12–13 isopycnic 14 zonal 13–14 ceramides 641 cetyl(trimethyl)ammonium bromide (CTAB) 258, 279, 670 chain termination method 789, 790 channel electron multiplier (CEM) 356 chaotropic reagents 7 charged coupled device (CCD) cameras 144, 497, 544, 1059 sensor 808 chemical biology 1041 innovative chemical approaches to study biological phenomena 1041 multidisciplinary approach 1042 chemical crosslinking, reagents properties 866 chemical diversity 224 chemical genetics 1046 protein–protein interaction stabilizer fusicoccin 1046 small organic molecules, for protein function modulation 1042 ASKA technology 1050–1051 bump-and-hole approach 1047–1050 cyclic process 1042 forward and reverse 1046–1047 study of 1044–1046 switching biological systems 1051–1052 chemical labeling reactions 734 chemical nucleases, structure of 849 chemical protein 107 chemical reactions, rate of 36–37 chemical shifts 459 chemiluminescence 652 detection, of hydroperoxy lipids 621 substrates 745 chimeric antibodies 100 chiral MEKC 289 principle 290 chiral separations 289 chloramphenicol acetyltransferase (CAT) 911, 914 N-chlorobenzenesulfonamide 33 chlorobutane 316



4-chloro-7-nitrobenz-2-oxa-1, 3-diazole 118 chlorophyll–protein complexes 152 cholesterol, lipophilic molecules 969 chromatin epigenetic modifications analysis 828 chromatin immunoprecipitation (ChIP) assay 826, 850, 895 chip analysis (ChIP chip) 939 ChIP-on-chip analysis 952 chromatin-immuno-precipitation sequencing (ChipSeq) 813, 828 chromatin modifications 828 chromatograms 220, 221 chromatographically incompletely separated components ESI spectra 716 chromatographic dimension, peakpicking module 1008 chromatographic material, binding to 17 chromatographic separation modes 220, 224 employed in peptide and protein separation and 233 high-performance affinity chromatography (HP-AC) 233 high-performance aqueous normal phase chromatography (HP-ANPC) 230 high-performance hydrophilic interaction chromatography (HP-HILIC) 229–230 high-performance hydrophobic interaction chromatography (HP-HIC) 230–232 high-performance ion exchange chromatography (HP-IEX) 232–233 high-performance normal-phase chromatography (HP-NPC) 228–229 high-performance reversed-phase chromatography (HP-RPC) 227–228 high-performance size exclusion chromatography 227 for peptides and proteins 225–226 chromatographic traces 1007 chromophores in biological macromolecules electronic absorption properties of 154 chromoproteins 148 chromosomal in situ suppression (CISS) 919 chromosome breakage sites, physical markers 927 conformation capture technique chromosomal interactions 829

distance between genes 925 interaction analyses 828–829 chymotrypsin 24, 207 circular dichroism (CD) 178–180 cyclic peptides 567 15mer peptides with typical α-helical conformation 566 spectroscopy 564 circular polarized light 133 Clark, L.C. 429 clone-based mapping procedures 936 cloned gene, riboprobe creation 900 cloning systems 934, 935 positional cloning 938 reference system 937 cloud 875 CLUSTAL alignment 892 clustered regulatory interspaced short palindromic repeats (CRISPR) 974 CMC. see critical micellar concentration (CMC) CNBr cleavage 108 coefficients of variation (CVs) 1003 coherent light 184 co-immunoprecipitation (co-IP) 398–399 cold-shock proteins (CSPs) 860 ColE1 multi-copy plasmids, of enterobacteriaceae family 671 collision energy (CE) 1005 collision induced dissociation (CID) 357–358, 999 column efficiency 222, 234 combinatorial molecular phenotypes (CMPs) motif 1063 with proteins 1061 combined bisulfite restriction analysis (CoBRA) 820 comparative genomic hybridization (CGH) 747, 917, 921–924, 951 chromosomal 922 exemplary demonstration 923 hybridization and data acquisition 923 intermixture 921–922 microarray 920, 922–924 normalization 921 probes 918 competitive inhibitors 40–41 competitive (RT) PCR 765, 766 competitive radioimmunoassay (RIA) 82, 83 complementarity determining regions (CDRs) 101 complementary DNA (cDNA) 761, 843 complementary target sequences 758 complement fixation 94–95

complete MOTIFs 885 complex 3D data sets, analysis of 514 combination of EM and x-ray data 514–515 flexible fitting 515 hybrid approach 514 identifying protein complexes in cellular tomograms 515–516 rigid body docking 515 segmenting tomograms and visualization 515 complex protein mixtures, quantification of 24 computer-aided analysis 413 concentration 17 condenser 184 conditional protein splicing 1054 confocal high-speed-spinning disk systems (Nipkow systems) 199 confocal laser scanning microscopy (cLSM), principle 198 confocal spinning disk microscopy (Nipkow) 199 principle 199 trans conformation 48 contour-clamped homogenous electric field (CHEF) method 698, 699 contrast transfer function (CTF) 499–501, 501–503 cooling curves 54 cooling rates 49 Coomassie Brilliant Blue G250 28 Coomassie staining 252 spot analysis 985 Coons, Albert 182 copper ion 151 Cotton effect 564 coumarins 119 coupling liquid chromatography (LC) and mass spectrometry (MS), advantage 375 cover slip 184 CpG adjuvants 709 CpG dimer 948 CpG island 688 CpG-methylation 817 critical micellar concentration (CMC) 18, 59, 288 crosslinked protein 129 crosslinking factor 250, 251 crossover-electrophoresis 78 cryo-electron microscopy 496, 497 cryo-electron tomography (CET) 486,517 cryopreservation 69 crystallization 3 crystallographic R-factor 543 crystallographic unit cell and crystal packing 534 crystallography 529


crystals, and x-ray diffraction 533–538 C-terminal sequence analysis 325 chemical degradation methods 325 degradation of polypeptides with carboxypeptidases 327 detection of the cleaved amino acids 328 specificity of carboxypeptidases 327 peptide quantities and quality of chemical degradation 327 Schlack–Kumpf degradation 325–327 cw-EPR spectroscopy 468–469 cyan fluorescent protein 406 cyanoethyl adducts, formation 711 cycle sequencing 795 cyclic peptides 224 cyclodextrins 289 cyclotron-movement 349 Cy-dyes 986 cysteines 656 acetylated 655 residues, chemical modification of 210 cystic fibrosis (CF) 767 cytochromes 149–151 cytogenetic methods 917–924 cytolytic T-cells effector cells, activation of 105 recruitment of 105 cytomegalovirus (CMV) 772 DNA virus 965 cytoplasmic RNA 677 migration of 698 cytosine methylation 817 cytosine, to uracil 819 cytotoxicity 64 D dabsyl chloride (DABS-Cl) 116, 307 dansyl derivatives, optical properties of 117 dark field microscopy 188 data dependent acquisition (DDA) 1000 peptide quantification 1002 data independent acquisition (DIA) 1000, 1008 data interpretation 369 data mining 1030–1031 DAVID database 878 Dcm-methylation 817 deamidation 224 decipher regulatory cis-elements 907 van Deemter-Knox equation 222 delta restrictions cloning 787 denaturation denaturing high-performance liquid chromatography (dHPLC) 940

DNA sequencing, gel-supported methods 792 proteins 7 in situ hybridization 919 Denny–Jaffe reagent 126 deoxynucleotide triphosphates 759 depletion 982 depurination reaction 710 detection methods 189 chemical staining 189 direct and indirect immunofluorescence labeling 190–191 fluorescence labeling 189–190 for live cell imaging 191 fluorescence microscopy, fluorophores/ light sources 193–194 histological staining 189 incandescent lights 194 labeling with quantum dots 192 lasers 195 mercury vapor lamps (HBO lamps) 195 mercury–xenon vapor lamps 195 physical chemistry of staining (electro-adsorption) 189 physical staining 189 types of light sources 194 in vitro labeling with fluorescent fusion proteins (GFP and variants) 192–193 with organic fluorophores 191, 192 xenon vapor lamps 195 detergents 19 chromatographic support materials for separation of 21–22 ionic 20 micelles, formation of 18 nonionic 20 properties of 18 and removal 18 removal of 20–22 zwitterionic 20 diafiltration 16 dialysis 15, 17 diamagnetic biomolecules 466 diastereomers 710 diatomic molecule, vibration properties of 163 diazopyruvoyl compounds 126 dideoxy method 789 2´ ,3´ -di-deoxynucleotide 789 dielectric coefficient 247 diethyl pyrocarbonate (DEPC) 864 chemical formula/mechanism 676 difference gel electrophoresis (DIGE) 252, 270–271, 986 internal standard 271 minimal labeling 270–271


principle 271 saturation labeling 271 different extraction methods 21 differential centrifugation 12–13 differential equation 36 differential interference contrast 186 differential RNA-seq (dRNA-seq) 868 differential scanning calorimeters (DSCs) 47, 48, 54 curve 50–53 design 49 instrument requirements 48–49 diffraction phenomena and imaging 187 Digitalis lanata 745 Digitalis purpurea 745 digoxigenin detection system 745 digoxigenin:anti-digoxigenin (DIG) hapten 753 system 744 structure of 744 4-(N,N-dihexadecyl) amino-7-nitrobenz2-oxa-1,3-diazole (NBDdihexadecylamine) 619 dihydrorhodamine (DHR) 96 2,5-dihydroxybenzoic acid (DHB) 1066 N,N´ -diisopropylcarbodiimide (DIC) 558 dilution 15 4,5-dimethoxy-2-nitrobenzyl chloroformate 111 dimethylformamide (DMF) 558 dimethyl sulfate (DMS) 846 modification 849 dimyristoyl-phosphatidylglycerol 60 dinitrophenol (DNP) 749 2,4-dinitrophenol (DNP) 742 diode array photometer 142 dioxetane chemiluminescence reaction 748 diphenylthiourea (DPTU) 315, 316 diphosphatase 793 dipolar coupling 473 direct agglutination 72 direct blotting electrophoresis (DBE) 741 disc electrophoresis 255–257 discontinuous polyacrylamide gel electrophoresis (disc PAGE) 244 principle of 256 discovery proteomics 996 disruption method 9 disulfide bonds , cleavage of 209–210 dithioerythritol (DTE) 685 dithiothreitol (DTT) 126, 209, 269, 685 DNA analysis 957 adenine methyltransferase 851 associated protein modifications 828



DNA analysis (continued ) backbone 801 binding motifs 835 binding proteins 828, 951 calcium phosphate crystals 913 copying enzyme 846 double helix 720 footprint analysis 844 scheme of 849 fragments 721, 833 helix parameters 832 hybridization 161, 948 hydrolysis method 827 methylation analysis 827–828 length standards 693 with methylation specific restriction enzymes 823–825 molecule 689 non-viral introduction 912 oligonucleotides 730, 732, 736, 843, 896, 953, 962 DNA chip technologies 429, 1035 DNA complexity, cot value 721 DNA Data Bank of Japan (DDBJ) 875 DNA:DNA hybrids 722 DNA–DNA interactions 1036 DNA library 675 DNA methylation 818, 895, 949, 950 with methyl-binding proteins 826 specific restriction enzymes 824 DNA methyltransferases (DNMTs) 817 DNA microarray technologies 939, 945, 953, 1036 barcode identification 954–955 beyond nucleic acids 956–957 DNA analyses comparative genomic hybridization (CGH) 951 genotyping 948 methylation studies 948–949 protein–DNA interactions 951–952 sequencing 949–951 DNA synthesis 952 on-chip protein expression 953–954 RNA synthesis 953 RNA analyses 946 splicing 947 structure/functionality 947–948 transcriptome analysis 946–947 structural analyses 956 universal microarray platform 955–956 DNA-modifications, analysis of 818 DNA polymerase 207, 429, 798 advantage of 756 DNA polymorphisms 927 DNA–protein binding partner 870

DNA-protein complex 837, 839, 867 chemical reagents diethyl pyrocarbonate (DEPC) 847 dimethyl sulfate (DMS) 846 halo-acetaldehydes 847 hydroxyl radical footprint 847–848 KMnO4/OsO4 846–847 N-ethyl-N-nitrosourea (ENU) 847 gel analysis of 838 interactions 857 DNA curvature 832–833 DNA topology 833–834 double-helical structures 831–832 DNA-RNA hybrid 961 DNase-I digest 845, 918 DNA sequencing 785 automated capillary 799 device 799 energy transfer dyes 797 fluorescent markings 796 gel-free methods 806 adaptations to library preparation/ sequencing 810 classic pyrosequencing 807 illumina sequencing 809–810 indexed libraries 811 mate-pair library creation 810–811 multiplex DNA sequencing 805 paired-end-sequencing 810 pyrosequencing, principle of 807 454-technology (Roche) 807–808 microdrop PCR genomic sequencing/ resequencing 812–813 semiconductor sequencing 811–812 sequencing technologies, applications 812 tag counting–RNASeq/ ChipSeq 813 target (exome) enrichment 811 next generation sequencing (NGS) 786 principle of 799 single molecule sequencing 813 nanopore sequencing 814 native hemolysin pore, in membrane 814 PacBio RS single molecule 814 single molecule real time (SMRT) 814 DNA sequencing, gel-supported methods 786 autoradiogram 791 chain termination method, to Sanger 790

chemical cleavage, according to Maxam/Gilbert cleavage reactions 801–804 end labeling 800–801 A+G cleavage reaction 802 Maxam–Gilbert method 800 multiplex DNA sequencing 804–805 RNA sequencing 805–806 solid phase process 804 α-thionucleotide analogues 804 thymine- and cytosine-specific fission reaction 803 cycle sequencing principle of 795 with thermostable DNA polymerases 795–796 denaturation/neutralization 792 2′-deoxynucleotide 790 deoxynucleotide analogues structure, in dideoxy sequencing reactions 794 dideoxy method 789 2′,3′-dideoxynucleotide 789 diphosphatase 793 final denaturation 795 labeling techniques/verification methods automated sample preparation 798–800 energy transfer dyes 797 fluorescent labeling/online detection 797 internal labeling, with labeled deoxynucleotides 798 isotopic labeling 796–797 labeled terminators 798 online detection systems 798 primer labeling 797–798 primer hybridization 792–793 strand synthesis and additives 794 cofactors 794 nucleotide analogs 794–795 pyrophosphorolysis 793 T7 DNA polymerase 791–792, 796 dot blotting unit 707 dot immunoassay 88 double beam spectrometer 143 double-stranded RNA-binding domain (dsRBD) 860 Down’s syndrome 719 doxorubicin 104 doxycycline-inducible DNA-binding unit 1051 Drosophila embryos 720, 729

Index

Drosophila melanogaster 785, 942, 947, 967, 1016, 1052 dry chemistry 420 DryLab G/plus 236 dual-beam photometer 142 Duchenne muscular dystrophy 767 dye energy transfer 797 laser-induced emission 797 dynodes 356 E Eddy diffusion 222 Edman degradation 116, 210, 313, 315, 564, 654 amino acids, identification of 316–317 cleavage reaction of 316 length of amino acid sequence 319 phosphorylated and acetylated amino acids, localization of 654 quality of 317–319 reactions of 315–316 repetitive yield 317–319 Edman, Pehr 313 Edman sequencing 108 EDTA-containing buffer 845 Egger, David 182 Einstein’s formula 412 Ekins, Roger 1034 elastic scattering 171 electric field vector 134 electric ion traps 348–349 electroblotting 272 blot membranes 273 blot systems 272 semidry blotting 272–273 tank blotting 272 transfer buffers 273 electrochemical luminescence markers 742 electrodiffusion 78 electroelution from gels 263–264 electroendosmosis 247, 248 electrokinetic injections 277–278 electromagnetic radiation 132, 133 electromagnetic waves 132, 134 electron crystallography 486 electronic DNA biochips 428–429 electronic transition 133 electron microscope, imaging process 183, 485, 492, 498 analytical electron microscopy 494–495 electron beam with object, interactions of 493–495 electron energy loss spectroscopy 494

imaging and information content of 493–494 mass determination 494 Fourier transformation 499–501 perspectives of 516–517 with phase plate 495–496 pixel size 498–499 resolution 492–493 scanning electron microscopy 494–495 electron nuclear double resonance (ENDOR) 477–478 experiments 477–478 electron paramagnetic resonance (EPR) 119 applications for 479 local pH values 480 mobility 480–481 quantification of spin sites and binding constants 479–480 basics 467 comparison EPR/NMR 481–482 cw-EPR spectroscopy 468 electron spin and resonance condition 467–468 electron spin nuclear spin coupling 469–470 g-value 469 hyperfine anisotropy 470–472 significance 481 spectroscopy of biological systems 466 electron spin echo envelope modulation (ESEEM) 475–476 problems 475 three-pulse spectrum 476 electron spin–electron spin coupling 472–473 anisotropic component 473 electron spin nuclear spin coupling (hyperfine coupling) 469–470 cw-X-band EPR spectrum of TPA in liquid solution 470 Fermi contact interaction 469 electron tomography of individual objects 512–514 electron transfer dissociation (ETD) 999 fragmentation sources 992 electropherogram 296 electrophoresis 243, 690–691 capillary gel electrophoresis (CGE) 701–702 DNA fingerprinting 696 DNA, gel electrophoresis of agarose gels 691 circular DNA, separation of 692 denaturing agarose gels 693–694 double-stranded DNA fragments 691–692


double-stranded DNA, nondenaturing PAGE of 694 low melting agarose/sieving agarose 694 native polyacrylamide gels, abnormal migrational behavior 694–695 oligonucleotide purification 696–697 polyacrylamide gels 694 practical considerations 692–693 protein–DNA complexes, nondenaturing gels 694 single-stranded DNA (SSCP) 695–696 gel media for 250 pulsed-field gel electrophoresis (PFGE) 698–700 applications 699–700 principle 698–699 RNA, gel electrophoresis 697 formaldehyde gels 697 RNA standards 697–698 two-dimensional gel electrophoresis (2D gel electrophoresis) 700 of DNA 700–701 of RNA 700 temperature gradient gel electrophoresis (TGGE) 701 electrophoretic mobilities 253 electrophoretic mobility shift analysis (EMSA) 836 electrophoretic separation techniques 243 historical review 244 media 243 separation principle 243 theoretical fundamentals 245–248 electrospray ionization (ESI) 335, 998, 1064 charged residue model 337 interfaces, types of 281 ion emission model (IEM) 337 ionization, principle of 335–338 macroscopic/microscopic 336 mass spectra, properties of 339–341 mass spectrometry 220, 234 MS/MS instruments 377 MS spectrum, of metabolically labeled 1015 sample preparation 341 schematic structure and variants of electrospray sources 339 source and interface 338–339 spectra 715 chromatographically incompletely separated components 716 components detection 716



electrospray ionization mass spectrometry (ESI-MS) 228, 711 electrospray mass spectrometry hydrophobic transmembrane peptide fragment 565 purified 16mer peptide amide 565 electrostatic attraction 60 electrostatic interactions 60 ellipticity 179 Ellman’s reagent 112 EMBL Nucleotide Sequence Database 786 emulsion-PCR reaction (emPCR) 807 ENCODE project 882, 883, 939 endoplasmic reticulum 129, 579 endoprotease AspN 985 endoprotease LysC 985 endoproteinases 328 endothermic processes 48 endothermic transition 48, 49 energy levels, transitions between 137 energy transfer group systems 797 enthalpy 35, 49, 59, 147 Entrez database system 878 entropy 35, 36, 147 enzymatic activity 44 enzymatic digestion 313 enzymatic DNA synthesis 952 enzymatic labeling 735 of DNA 736 nick translation 736 PCR amplification 736 random priming 736 reactions 735 reverse transcription 736 of RNA 736–737 terminal transferase reaction 736 enzymatic methods 651 as catalysts 37–38 enzyme-catalyzed DNA polymerization 794 enzyme-controlled reactions, rate of 38 enzyme diaphorase (DP) 753 enzyme immunoassays (EIA) 91–93 Biacore technique 93–94 enzyme-linked immunosorbent assay (ELISA) 745, 1033 signal amplification 751 enzyme–substrate complex 39 enzyme–substrate interactions 1038 enzyme sulfurylase catalyzes 807 epidermal growth factor 119 epidermal growth factor receptor (EGFR) 1041 epifluorescence microscope 188 epigenetic modifications, analysis of 817 DNA replication 817 epitope mapping 70

Epstein-Barr virus (EBV) 785, 913 Er-YAG-laser 331 erythropoietin (EPO) 581 amino acid sequence with 588 GluC peptides, theoretical/experimental masses of 589, 590 RP-HPLC/ESI-MS, total ion current 588 schematic representation 582 Escherichia coli 385, 683, 785, 933, 934, 953, 978 cell 856 cytosine 825 DNA polymerase holoenzyme 736 DNA polymerase I 736 promoter sequences 883 ethidium bromide 253 chemical structure of 702 geometric properties of circular DNA 703 ethyl acetate 316 ethylcyanoglyoxylate-2-oxime (Oxyma) 558 ethyl trifluorothioacetate 110 eukaryotic cells cytoplasmic RNA 697 eukaryotic DNA 721 eukaryotic mRNA species 678 European Bioinformatics Institute (EBI) 875 European Molecular Biology Laboratory Open Software Suite (EMBOSS) analysis package 881 PEPSTATS routine 882 sequence analysis, on web 881–882 E-value 890 evaporative light scattering detectors (ELSD) 620 exon sequencing 940 exonuclease III (Exo III) 845 exothermic heat 54 expressed protein ligation (EPL) 1052, 1053 expressed sequence-tagged (EST) 776 extracted ion chromatogram (EIC) 713, 717 F Fab fragment 745 Faraday cup 357 schematic design and operating principle 357 far-Western blot 399–400 FASTA format 879, 880 Fayyad, U. 1029 Fc-engineering 101 Fenton reaction hydroxyl radicals 847

Fe 1-(p-bromoacetamidobenzyl)ethylenediaminetetraacetic acid (Fe-BABE) structure of 850 Ferguson plot 253 field inversion gel electrophoresis (FIGE) 698 filamentous phages 676 finger-printing methods 787 Fischer, Emil 573 five-stranded β-sheet 460 flame ionization detectors (FID) 620 fluorescence resonance energy transfer (FRET) 851 flow cell 266 flow velocities 222 9-fluorenylmethoxycarbonyl (FMOC) chloride 306–307 fluorescamine 304 fluorescein 119, 797 fluorescein:anti-fluorescein (FLUGS) system 742, 746 fluorescein cadaverine 119 fluorescein isothiocyanate (FITC) 116, 1059 fluorescein-tagged phosphoramidites 851 fluorescence 652 action spectroscopy 157 coded beads 1034 detection methods 653 detector 220 DNA chips 429 energy transfer 117 labeling 116, 119 with dansyl chloride 118 with fluorescamine 118 with phthaldialdehyde in thiols 118 methods 651 microscope 486 (see also fluorescence microscopic analysis, special) principles of 188 fluorescence correlation spectroscopy (FCS) 160, 202–203, 856, 1042 measurement 856 fluorescence resonance energy transfer (FRET) 749 fluorescence in situ hybridization (FISH) 729 fluorescence labeling, with fluorescein isothiocyanate (FITC) 118 fluorescence lifetime imaging microscopy (FLIM) 160, 204–205, 406 fluorescence loss in photobleaching (FLIP) 202 fluorescence microscopic analysis, special 186, 188, 197–198, 1064


confocal high-speed-spinning disk systems (Nipkow systems) 199 light microscopic super resolution below Abbe limit 200–201 live cell imaging 199–200 measurement of movement of molecules 202 Ca2+ imaging 205 fluorescence correlation spectroscopy (FCS) 202–203 fluorescence lifetime imaging (FLIM) 204–205 fluorescence loss in photobleaching (FLIP) 202 fluorescence recovery after photobleaching (FRAP) 202 fluorescent-speckle microscopy (FSM) 203 Förster resonance energy transfer (FRET) 203–204 Raster image correlation spectroscopy (RICS) 203 spectral unmixing 204 multiphoton fluorescence microscopy 198–199 NSOM/SNOM (near-field scanning optical microscopy) 202 PALM (photoactivated localization microscopy) 201 (S)SIM ((saturated) structured illumination microscopy) 201 stimulated emission depletion (STED) 201 stochastic optical reconstruction microscopy (STORM) 201 total internal reflection fluorescence microscopy (TIRFM) 202 fluorescence quenching 119 fluorescence recovery after photobleaching (FRAP) 108, 160, 202 fluorescence resonance energy transfer (FRET) 381, 402–403, 852–853 bioluminescence resonance energy transfer (BRET) 408 fluorescent probes for 406–407 FRET based on fluorescence lifetime measurements 406 FRET estimation based on sensitized acceptor emission 404–405 FRET estimation using acceptor photobleaching 405–406 genetically encoded FRET pairs— caveats and challenges 407–408 instrumentation for intensity-based FRET measurements 404 key physical principles 403 methods of measurements 403–406 fluorescence spectroscopy 154

emission and action spectra 156–157 frequent mistakes in 161–163 fluorescence of complexes 162–163 FRET overdone 162 GFP overdone 162 incomplete/wrong labeling with fluorophores 161 quantum dots overdone 162 shading and inner filter effect I 162 shading and inner filter effect II 162 principles 154–156 fluorescence staining 252 fluorescence studies using intrinsic and extrinsic probes 157–158 fluorescence techniques 131 fluorescent activated cell sorter (FACS) apparatus 96 fluorescent antibodies 116 fluorescent detection 744 fluorescent DNA hybridization 917 absorption vs. emission-spectra 920 DNA probes 918 fluorescent hybridization signals, evaluation of 920 labeling strategy 917–918 DNA probes 918–919 nick translation 919 PCR labeling 919 random priming 919 signals, absorption vs. emissionspectra 920 in situ hybridization 919 denaturation 919 hybridization 919 pre-annealing 919 stringency 919 fluorescent in situ hybridization (FISH) 742, 917, 936, 940 of genomic DNA 920–921 interphase-FISH/fiber-FISH 921 metaphase 921 multicolor 920–921 fluorescent labeling 918 2´ -deoxynucleotides (fluorescein-15-dATP) 798 terminators 798 fluorescent-speckle microscopy (FSM) 203 1-fluoro-2,4-dinitrobenzene 313 fluorophores 119, 193, 194, 950 4-methylumbelliferone (4-MU) 915 5-fluorouridine (FU) structures of 906 in vivo labeling of nascent RNA 907 foam formation 617 FokI 683 advantage of 828 Folin–Ciocalteau phenol reagent 26


formic acid 560, 561 5-formylcytosine (5fC) 817 Förster distance 852 Förster resonance energy transfer (FRET) 108, 160–161, 203–204 comparison between PELDOR and 479 components 749 efficiency (E) 406 principle 728 rate 161 Fourier analysis 500 Fourier, Joseph 499 Fourier transformation 350, 351, 499–501 mass spectrometer 219 fragmentation techniques 296, 357 collision induced dissociation (CID) 357–358 generation of free radicals (ECD, HECD, ETD) 360–362 photon-induced dissociation (PID, IRMPD) 360 prompt and metastable decay (ISD, PSD) 358–360 Franck–Condon principle 137, 154 Fraunhofer scattering 142 free amino acids, analysis of 303 free electron laser (FEL) 529, 550 free energy 35 free flow electrophoresis 266–267 principle 266 free working distance 183 frozen-hydrated specimens, imaging procedure for 496–497 fundamentally different network architectures 1030 fusion proteins 182 G galactocerebrosides 641 Galilei, Galileo 181 GAL4 system 869 DNA-binding domain 868 yeast Gal4 protein 381 g anisotropy 470–472 gas chromatography (GC) 1027 gas-phase fragmentation approaches 992 gas phase sequencer 321 gauche-conformers 48 gel chromatography 16 gel electrophoresis 248, 806, 839, 982, 984 gel retardation, background 836–838 nucleic acids 841 proteins separated 984



gel electrophoresis (continued) two-dimensional principle and practical application 700 gel filtration methods 21, 666, 667, 708 gel-free sequencing methods 806 gel phase 48 gel retardation 837, 839, 840 gel-supported DNA sequencing methods 789 GenBank flat-file formatted database 880 gene defects 719 generation of free radicals (ECD, HECD, ETD) 360–362 genetic defects, detection of 773–776 allele-specific PCR 775 denaturing gradient gel electrophoresis (DGGE) 776 length variation mutations 773–774 oligonucleotide-ligation assay (OLA) technique 775 restriction fragment length polymorphisms 774 reverse dot blot 774 sequencing 774 single-strand conformational polymorphism (SSCP) 775–776 genetic engineering 571 genetic mapping 925 disease genes 932 genetic markers 927 human genome 931–932, 942 integration of 940–942 linkage analysis 929 χ2 test 929–930 for plausibility 930–931 microsatellites 927 physical mapping cloning systems 934, 935, 936 genes, identification/isolation of 937–939 hereditary disease 940 high-throughput sequencing 936–937 recombinant clones 934–935 restriction of whole genomes 932–934 STS mapping 935 transcription maps, of human genome 939–940 recombination 925–927 restriction fragment length polymorphisms (RFLPs) 927 single nucleotide polymorphisms (SNPs)/single nucleotide variants (SNVs) 927–929 genetic predisposition 932 genome 610–611

genome-level regulatory information 882 genome sequencing 610 genome-wide association studies (GWAS) 940 genome-wide gene inhibition analysis, schematic representation 955 genomic loci 906 genomic region, integrated map 941 genotyping 948, 950 to phenotype, organizational structure 1025 Gibbs free energy 241 Gibbs–Helmholtz equation 61 Giemsa staining 941 G-less cassettes, for in vitro transcription 909 β-globin gene 765 glucose disaccharides 572 glucose enzyme electrodes 428–429 D-glucose, stereochemistry of 573 glutamate dehydrogenase 42 glutathione S-transferase (GST) analyzing interactions in vitro, GST-pulldown 397–398 protein 398 N-glycans chemical shift 604 composition analysis 600 core structures 581 coupling constants 605–606 exoglycosidase digestion 602–604 1H NMR spectroscopy 604 1H NMR spectrum 607 individual, analysis of 581–610 isolation of individual N-glycans 599–600 linkage with peptide backbone 579 mass spectrometry 604 methylation analysis 600–602 nuclear Overhauser effect (NOE) 607–608 pentasaccharide Man3-GlcNAc2 580 pool, release and isolation of 590–599 spatial interaction of sugar residues 609–610 structural reporter groups 606 structure 580 types of 580 O-glycans structure 580–581 glycerophospholipids (GP) 614 glycine-glycine removal 994 glyco-analytical studies 582 glycolipids (GL) 614 glycolysis 1024 glycome 610–611 goals 610

glycopeptides, mass spectrometric analysis on basis of 588–590 glycoproteins 251, 258, 571, 572, 579, 611 intact, analysis on basis of 582 electrophoretic analyses 582–584 monosaccharide components 585 neutral monosaccharide components 585–587 sialic acid 587 using lectins 584–585 purification 572 glycoproteome research, special challenge of 610 glycosidic bond 572, 574–578 anomeric configuration 575, 576 exo-anomeric effect 577–578 linkage direction 577 glycosylation 224, 329, 571, 579 O-glycosylation 579 glycosyltransferases 579, 610 glyoxal gels 697 good laboratory practice (GLP) conditions 611 good manufacturing practice (GMP) 611 G-protein coupled receptors 107 guide RNA (gRNA) 974 Gquery interface 878 gradient elution 227, 236 green fluorescent protein (GFP) 158–159, 182, 406, 911, 915, 1043 variants 193 g-value 469 H Hahn, Erwin 475 Halobacterium salinarium 149, 523 HapMap Project 932 hapten antibodies 99 dNTP 733 structure of 743 H-bond donors 832 heat capacity 49 heating rates 49 heavy metals 25 height equivalent to one theoretical plate (HETP) 222 HEK293T cells 913 helical B-DNA, structure of 832 helical DNA conformations 832 helix-loop-helix proteins (HLHs) 835 helix-turn-helix structures (HTHs) 835 hemagglutination 72, 73 Henderson–Hasselbalch equation 44 hepatitis B surface antigen (HBsAg) 1035


hepatitis B virus (HBV) 772 hepatitis C virus (HCV) 752, 764, 772, 967 heptafluorobutyric acid 228 Herpes simplex 514 heterogeneous amplification systems 724 heteronuclear NMR-spectroscopy 219 hexosaminidase 129 higher-energy collisional dissociation (HCD) 990, 999 high field/high-frequency EPR spectrometer 471–472 high-performance anion exchange chromatography (HP-AEX) 232 high-performance aqueous normal phase chromatography (HP-ANPC) 230 high performance cation exchange chromatography (HP-CEX) 232 high-performance gel permeation chromatography (HP-GPC) 227 high-performance hydrophilic interaction chromatography (HP-HILIC) 229–230 high-performance hydrophobic interaction chromatography (HP-HIC) 230–232 effect of salt cations 231 protein–hydrophobic surface interactions, enhanced by 232 protein separations, selectivity of 230 salts used in mobile phase 231 high-performance ion exchange chromatography (HP-IEX) 232–233 high performance liquid chromatography (HPLC) 219, 220, 827, 1027 chromatogram of peptide mixture 564 column, operating ranges 237 narrow bore 643 separation of peptides by 561 high-performance normal-phase chromatography (HP-NPC) 228–229 high-performance reversed-phase chromatography (HP-RPC) 227–228 high-performance size exclusion chromatography (HP-SEC) 227 high-resolution two-dimensional electrophoresis 267–268 difference gel electrophoresis (DIGE) 270–271 internal standard 271 minimal labeling 270–271 saturation labeling 271

first dimension, IEF in IPG strips 269–270 workflow 267 prefractionation 268 affinity chromatography 268 fractionation according to charge 269 subcellular components 268 proteins, detection and identification of 270 sample preparation 268 second dimension, SDS polyacrylamide gel electrophoresis 270 high-throughput technology (hSHAPE) 866 high voltage transmission electron microscopes (HVTEMs) 493 HiSeq 810 histologic stains 190 histone acetyl transferases (HATs) enzymes 647 internal lysine residues 646 histone deacetylases (HDACs) 646–647 enzymes 647 histone H4 proteoforms 997 Hoechst 33258 669 Hofmeister series 10 homing endonuclease genes (HEGs) 683 homogenization 7 Hoogsteen base pairing 961 Hooke, Robert 181, 485 horizontal electrophoresis system 249 house-keeping gene 764 HPLC buffer 713 HPLC chromatography 715 HP-RPC purification 234 Human Genome Project (HGP) 719, 785 human immunodeficiency virus (HIV) 772 HIV-Rev protein 869 provirus genomes 765 trans-activator response element (TAR) 869 humanized antibody 100 human proteins monoisotopic masses 989 human X chromosome physical and genetic map 931 Huntington gene locus 773 Huntington’s disease (HD) 773 trinucleotide (CAG) expansion 773 HybProbes 725, 728 hybrid antibodies 100 hybrid instruments 351


hybridization methods heterogeneous systems for qualitative analysis 723 heterogeneous systems for quantitative analysis 723–724 homogeneous systems for quantitative analysis 725 FRET system 728 intercalation assay 728–729 molecular beacon system 728 in situ systems 729 TaqMan/5´ nuclease methods 725–728 hybridoma technique 99 hydrogen bonds 169, 177 hydrolysis methods, DNA molecule 844 DNA polymerases, exonuclease activity of 845 DNase I 844–845 exonuclease III (Exo III) 845 λ exonuclease 845 hydrolytic cleavage 313 hydrophilic contaminants 15 hydrophilic peptide 316 hydrophilic proteins 20 hydrophobic molecules 59 hydrophobic peptides 562 1-hydroxybenzotriazole (HOBt) 558 hydroxyethyl cellulose (HEC) 701 hydroxyl radical reactions 847, 848 5-hydroxymethylcytosine (5hmC) 817 hydroxypropylmethyl cellulose (HPMC) 701 N-hydroxysuccinimide (NHS) 125, 270 N-hydroxysuccinimidyl-4-azidosalicylic acid 125 hypercholesterolemia 767 hyperfine anisotropy 472 hyperfine sublevel correlation experiment (HYSCORE) 476–477 pulse sequence 476 splitting scheme 476 theoretical HYSCORE spectrum 477 hypochromism 139 hysteresis effects 54 I identification, detection, and structure elucidation 368 identification 368–369 structure elucidation 369–375 verification 369 IgG antibody 828 IgG immunoglobulins antigen interaction at combining site 67–68 functional structure 66 schematic structure and function 66



Illumina sequencing systems 809, 810 exome enrichment 812 Ilyobacter tartaricus 523 image analysis 498 image cycler microscopy (ICM) 1058, 1060 antibody based 1057 image isomerism, D-glucose/ L-glucose 576 imaging cycler robots 1059–1063 multi-pipetting unit, a fluorescence microscope, and CCD camera 1059 imidazole 33 iminodiacetic acid 233 immobilization 23, 88 immobilized metal-chelate affinity chromatography (IMAC) 233, 649 on a membrane 17 pH gradients 244 reagents, analysis using 419 immune agglutination 73 immune binding 84–88 immune conjugates 104–105 immune defense 63–64 immune fixation 80 immunoagglutination 72, 73 application 72–73 direct agglutination 72 immunochemical detection 651, 652 immunocompromised AIDS patients 965 immunocytochemistry 90 immunodetection 88 immunodiffusion 75, 78 one-dimensional simple immunodiffusion of Oudin 75–76 two-dimensional immunodiffusion of Ouchterlony 77–80 immunoelectrophoresis 79 immunofluorescence labeling 191, 196 direct and indirect 190 immunoglobulins 995 VL variable domain 892 immunohistochemistry 90 immunopharmaceuticals, strategies for production 103 immunoprecipitating systems 73–75, 98, 649 immunosensors 426–427 immunotherapeutics 64 industrial-scale sequence production 875 inelastic scattering 171 informative markers 927 infrared spectroscopy (IR) 163, 1026 molecular vibrations 164–165

principles 163–164 proteins, infrared spectra of 168–171 technical aspects 165–168 inhibitors 40 competitive 40–41 non-competitive 41 RNA (RNAi) molecules 953 in situ hybridization (ISH) 917 in situ protein expression 954 insoluble proteins 6 instrumentation 319–322 intact nuclei, isolation of 906 Intelligent Systems in Molecular Biology Conference (ISMB) 877 interactomics–systematic protein–protein interactions DNA to protein microarrays 1035–1036 protein microarrays 1033–1034 antibody–antigen interaction 1037–1038 application of 1037–1039 enzyme–substrate interaction 1038 ligand–receptor interaction 1038 reverse phase microarrays 1038–1039 sensitivity through miniaturization–ambient analyte assay 1034–1035 interference 184 contrast microscopy 188–189 interfering substances 24 International HapMap Project 927 interspersed repetitive sequences (IRSs) 918 hybridization with singular probes 918 intrinsically disordered proteins (IDP) 464 in vivo proteolysis 207 iodinations 33 iodoacetamides 119, 125 ion chromatograms 1008 ion detectors 355 ion exchange chromatography (IEX) 240 ionic detergents 20 ionic retardation 21 ionic strength 43, 60 ionization methods 329, 330 ion mobility spectrometry (IMS) 378 ion pairing reagent 236 ion pair reversed-phase HPLC 712 separation 714 ion-trap mass spectrometer 219 isobaric tags for relative and absolute quantitation (iTRAQ) 1020 enzymatically cleaved proteome states 1020 isobaric reagents 1020 isocratic elution, optimization of 236 isocyanate 316

isoelectric focusing (IEF) 244, 259–260 carrier ampholytes 261 cathodal drift 261 electrode solutions 261 immobilized pH gradients 261–262 preparation of 262–263 measuring pH gradient 260 pH gradients, kinds of 260 principle 260 separation media 260 separator IEF 261 titration curve analysis 263 isopycnic centrifugation 14 isoschizomer EcoRII I-BstNI 825 isoschizomers 824 isotachophoresis (ITP) 293–295 order of electrolytes and 294 isothermal titration calorimeters (ITCs) 48, 54, 58, 61 curve 55, 59, 60 scheme 55 isothiocyanate, reaction with 111 isotope-coded affinity tag 650 isotope coded protein label (ICPL) 1018–1019 ICPLQuant 1019 reagents 1018 isotopically labeled linker (ICAT) 660 reagent 1016, 1017 technique 1017 isotopic distributions 989 isotopomer envelope 989 isotopomers 987 J Jablonski diagram 137–140 JASPAR database 884, 885 Joule heating 279 K karyograms 917 drawbacks of 917 Kepler, Johannes 181 Ketenes 126 K-homology domain (KH) 859 RNA-binding motif 860 Kjeldahl method 25 Klebsiella pneumoniae 768 Klenow fragment 788, 791 of DNA polymerase I 843 Knoll, Max 486 Koch, Robert 182, 485 Köhler, August 485 Köhler illumination optimization 184 Krebs cycle 1024 Kyoto Encyclopedia of Genes and Genomes (KEGG) 1024


L labeled oligonucleotides 729 label probes radioisotopes, characteristics of 739 RNA probes, synthesis of 731 LabExpert 236 laboratory information management system (LIMS) 800 LOD score 931 Lambert–Beer law 25, 140–142, 253 Lamm’s differential equation 412, 415 Larmor frequencies 474, 475 laser, for fluorescence excitation 280 laser microprobe mass analysis (LAMMA) 1064 N-Lauryl-sarcosine 678 law of mass action 36, 416 lectin-affinity chromatography 611 Leeuwenhoek, Antoni van 181 leucine zipper proteins 835 Leu2 gene 382 ligase chain reaction (LCR) 751, 779–780 LightCycler® 728, 756 profiles 729 light emitting diodes (LEDs) 143, 144, 157 light field diaphragm 184–185 light field microscopy 187 light induced co-clustering (LINC) 408–409 light microscopic technologies 183 light scattering 144 consequence for absorption measurement 141 linear combinations of atomic orbitals (LCAO) 134 linear dichroism (LD) 177 IR spectroscopy 178 linear discriminant analysis (LDA) 1008 linear gradient gel, casting of 255 linear polarized light 132 linear polyacrylamide 701 linear-solvent-strength theory 236 Lineweaver–Burk plot 40, 41 linkage analysis 929 linoleic acid 621 lipidation 224 lipidome analysis 640–642, 641 advantages of ESI-MS based 641 biological sample, experimental strategy for mass spectrometric analysis 641 lipids 48, 613 analysis of selected lipid classes 626 fatty acids 627–628 nonpolar neutral lipids 628–629 polar ester lipids 630–633

whole lipid extracts 626–627 biological functions 613 combining analytical systems 623 coupling of gas chromatography and mass spectrometry (GC/MS) 625–626 coupling of HPLC and mass spectrometry (LC/MS) 625 coupling of HPLC and UV/Vis spectroscopy 623–624 tandem mass spectrometry (MS/MS) 626 extraction 615 hormones and intracellular signaling molecules 633–638 liquid phase extraction 616 membranes 4 methods for analysis 618 chromatographic methods 618–622 1H, 13C, and 31P NMR spectroscopy 623 immunoassays 622–623 mass spectrometry 622 UV/Vis spectroscopy 623 modified proteins, analysis 1052–1053 perspectives 642–643 solid phase extraction 616–617 structure and classification 613–615 vitamins 638–640 lipofection 913 lipoproteins 251 liposomes 48, 50 liquid chromatography (LC) 144, 220, 995, 998 liquid chromatography tandem mass spectrometry (LC-MS/MS) 999 analysis 713 bottom-up workflow 999 high-throughput analysis 996 instruments 376–377 oligonucleotides 709 mass spectrometric investigation 712–714 phosphorothioate oligonucleotide, IP-RP-HPLC-MS investigation of 714–717 principles of 709–711 purity investigation/characterization 711–712 liquid handling system 798 liquid-ordered state 48 liquid phase sequenator 320 Listeria monocytogenes 700 live cell imaging 199–200 liver alcohol dehydrogenase (LADH) 174 liver tissue


differential 1H NMR spectroscopic analysis 1027 living systems, hierarchical representation of complexity 1025 loading capacity 228 locked nucleic acids (LNAs) 732 probes 733 locus control regions (LCRs) 895 Lowry assay 26–27 LTYYTPEYETK 1015 luciferase enzymes 748 luciferin bioluminescence reaction 748 luminescence 744 Luria broth 671 lymphocytes 65 resistant to HIV 967 lysate 672 LysC digest 1019 lysine residues 108 lysis 105 M macromolecular crowding 515 MAFFT 891 magnetic bead isolation 679 magnetic force microscope (MFM) 519 forces between AFM tip and object 520 principle 520 magnetic ion trap 349–355 magnification of lens 185 total, of a microscope 186 major histocompatibility complex (MHC) 63 maleimides 119 Malpighi, Marcello 181 mammalian cells, in vivo analysis 571 DNA transfer 912 chemical-based transfection 913 electroporation 913 lipofection 913 non-chemical methods 913 transient/stable expression 913 viral transduction 913 gene-regulatory cis-elements 911 promoters 911–912 promoter activity 911 marker chromosome 920 mass analyzer 329, 341–343 time-of-flight analyzers (TOF) 343–345 mass determination 362 calculation of mass 362 calibration 365 derivation of mass 366 determination of number of charges 365–366 influence of isotopy 362–365



mass determination (continued ) problems 366–367 signal processing and analysis 366 massively parallel sequencing (MPS) 785, 786, 806 mass spectrometry (MS) 108, 219, 220, 315, 329, 394, 711, 982, 1064 accurate MALDI mass spectrometry imaging 1068–1069 achievable spatial resolution 1065–1067 amino acid analysis using 309 aTRAQ-LC-MS-MS 309 CE-MS 309 direct infusion MS-MS 309 GC-MS 309 HILIC-MS 309 ion pair-LC-MS-MS 309 analysis 397 analytical microprobes 1064 basic principles 1001 coarse screening, by MS imaging 1068 components of 330 glioblastoma tissue section, MALDI images 1068 identification/characterization of 1069–1070 lateral resolution/analytical limit of detection 1067 mass spectrometric pixel images 1064–1065 mouse spinal cord, SMALDI MS/MS image of 1070 mouse urinary bladder tissue section, SMALDI images 1069 peptide within, standard MALDI preparation 1065 phosphorylated/acetylated proteins, detection of 653 position information microanalytical image 1065 visual coding 1066 protozoa during mating, SIMS images of 1068 quantification 378–379 sequencing of peptides, principle of 371 SIMS/ME-SIMS/cluster SIMS imaging 1067 SMALDI mass spectrometry imaging dried-droplet preparation 1067 techniques 866 toponome mapping, by ICM 1064 mate-pair sequencing 811 Mathieu equations 346, 347, 349 matrix-assisted laser desorption ionization (MALDI) 329, 1064, 1068 instrument 345

mass spectra, characteristics of 332 mass spectrometry (MALDI-MS) 330–331, 345, 643 in biochemical analytics, typical matrix substances for 331 protein analysis 333 sample preparation 332–335 matrix solution 296 mouse spinal cord 1070 peptide mass fingerprint (PMF) 1015 proof-of-principle 1064 sample preparation 334 time of flight (MALDI-TOF) analysis 840 mass spectrum, chemically acetylated histone H4 653 spectrometry 867 spectrum, monoclonal antibody 333 spectrum, peptide angiotensin II 333 Maxam–Gilbert method 789, 800, 804, 820 DNA sequence reaction 897 reaction 846 sequencing reaction 847 MaxQuant 1019 Mbp1 protein 878 mean proteins 24 meiosis, recombination/crossing over 926 melting 47 membrane-bound DNA probe 906 membrane lipids 58 membrane proteins 6, 82 membrane–water interface 119 2-mercaptoethanol 209, 685 β-mercaptoethanol 126 Merrifield, Bruce 555 messenger RNA 128 metabolic fingerprinting 1023 metabolic labeling 660, 1015, 1016 metabolic profiling 1023 metabolite target analysis 1023 metabolomics 1023, 1024 analytic techniques, coupling with separation technologies 1028 application 1032 general strategies and techniques 1027 knowledge mining 1029–1030 profiling 1027–1028 and systems biology 1025–1026 technological platforms 1026–1027 metabonomics 1023 metalloproteins 151 metarhodopsin I (MI) 149 metarhodopsin II (MII) 149

methidiumpropyl-EDTA-Fe (MPE-Fe) 849 methionine 563 methoxyamine hydrochloride 621 methylated-CpG island recovery assay (MIRA) 825 methylated DNA immunoprecipitation (MeDIP) 826 methylation analysis with bisulfite method 819 by DNA hydrolysis/nearest neighbor-assays 827–828 by methylcytosine-binding proteins 825–826 by methylcytosine-specific antibodies 826–827 principle of 820 methylation specific PCR (MSP) 822–823 principle of 822 RASSF1A-promoter 823 methylation specific restriction enzymes, DNA analysis 823–825 methyl binding domain (MBD) 825 5-methylcytosine 817 methylcytosine-binding proteins, methylation analysis 825–826 methylcytosine-specific antibodies, methylation analysis 826–827 N,N′-methylenebisacrylamide 250 MethyLight method 823 4-methylumbelliferyl-β-D-galactopyranoside (4-MUG) 915 micellar electrokinetic chromatography (MEKC) 286–288 capacity factor 286 electrophoretic mobility 286 principle 287 time window 287 micelle building agents 288 Michael addition 651 Michaelis–Menten equation 39–41 Michaelis–Menten theory 38–39 microarray analyses 945 microarray suitability DNA vs. proteins 1036 microbial sensors 425–426 microcalorimetry 47 microchannel plate 356–357 microchip electrophoresis (MCE) 297 miniaturization 298 separation of amino acid derivatives by chip-MEKC 298 microelectromechanical systems (MEMSs) 680 micro RNAs (miRNAs) 856, 959, 970–971

Index

microsatellites 927. see also polymorphic sequence tagged sites microscopic stages 184 microscopy techniques, biological structures and suitable 485 microspots sensitivity 1035 micro-total analysis systems (μTAS) 297 Mie scattering 142 migration behavior, of superhelical 693 MIRA, recombinant GST-tagged MBD2b protein 825 miRNA-122, liver-specific 971 mixed mode HILIC/cation-exchange chromatography (HILIC/CEX) 229 MMLV RTase 761 mobile phase 221, 222, 618 salts used for HP-HIC 231 modern LC-ESI-MS systems 376 modern microscopic techniques 485 modulation transfer function (MTF) 497 molar mass 4 molar ratio 58 molar transition enthalpy 51 molecular orbitals (MOs) 134 molecular probes caging gives temporal and spatial control 1051 molecular replacement (MR) 541–542 molecular vibrations 164–165 molecules binding, to membranes 58 insertion and peripheral binding 58–61 molecules, energy levels of 135–137 molecules properties 134–135 monochromatic light 141 monochromator 143, 144 monoclonal antibodies 99 optimized, constructs with effector functions for therapeutic application 102–105 monoisotopic mass 987 monolithic polymer 288 monomer–dimer equilibrium 413 monomeric proteins 413 monosaccharide building blocks 574 sequence 572 3-N-morpholino-1-propane sulfonic acid (MOPS) 697 mouse embryonal fibroblasts (MEFs) 914 mRNA stability 978 MS-based proteomics 997 MSn-spectra of peptide, measured with ESI ion trap 374 multichannel spectrometers 144

multidimensional HPLC 238 design of scheme 240–241 fractionation of complex peptide and protein mixtures by 239 purification of peptides and proteins by 238–239 strategies for 239–240 multiphoton fluorescence microscopy 198–199 multiple anomalous dispersion (MAD) 540–541 multiple cloning site (MCS) 911 multiple isomorphous replacement (MIR) 538–540 multiplicity 135 MUSCLE tool 891 Mus musculus 942 mutated proteins 120 Mycobacterium tuberculosis 1009, 1011 Mycoplasma genitalium 938 m/z values 345, 351, 354 N N-acetyl-L-glutamate-kinase (NAGK) 414 Naja naja oxiana 863 nano-crystals 529 nano-HPLC flow rates 375 nano-RP-HPLC 650 National Center for Biotechnology Information (NCBI) 875 BLAST (Basic Local Alignment Search Tool) 890 yeast Mbp1 protein 889 native chemical ligation (NCL) 1052 native proteins 207 natural killer cells 104 natural light 132 nearest neighbor-analysis 827 nearest neighbor-assays, methylation analysis 827–828 near-field scanning optical microscopy (NSOM) 202 Neisseria gonorrhoeae 723 Neisseria meningitidis 723 nephelometry 84 Nernst function 151 nested PCR 766–767 one tube 767 neutral fragment masses 991 neutral loss scan 352 new antibody techniques 99–102 next generation sequencing (NGS) 108, 785 application areas 786 N-hydroxysuccinimide-activated haptens 737


nick translation 736 ninhydrin 304 reagent 25 nitrilotriacetic acid 233 nitro-blue tetrazolium salt (NBT) coupled optical redox reaction 748 nitrocellulose 652 membranes 706 nitrogen laser 331 o-nitrophenol (ONP) 915 o-nitrophenyl-β-D-galactopyranoside (ONPG) 915 p-nitrophenyl diazopyruvate 126, 127 nitroxides 480 protonated/unprotonated, structure 480 N6-methyladenine 817 N-nitroso-alkylating reagent 865 NOESY spectra 459 Nomarski, Georges 182 non-coding RNAs (ncRNAs) 856 non-competitive inhibitors 41 non-crystallographic symmetry (NCS) 540 nonionic detergents 20 triton X-100 673 nonlinear anisotropic diffusion 513 nonpolar gas chromatographic columns 621 nonpolar ligands 227 non-protein nitrogen 25 non-radioactive detection systems 742 non-radioactive labeling 739 bioluminescence 748–749 biotin system 745–746 chemiluminescence 747–748 digoxigenin system 744–745 dinitrophenol system 746 direct detection systems 740–742 electrochemiluminescence 748 fluorescein:anti-fluorescein (FLUGS) system 746 fluorescence detection 749 FRET detection 749 indirect detection systems 742–747 luminescence detection 747 optical detection 747 signal-generating reporter group 740 in situ detection 749 non-radioactive modifications 733 non-radioactive reporter groups 741 non-small cell lung cancer (NSCLC) 1041 N-terminal sequence analysis 315 nuclear magnetic resonance (NMR) 1026 spectroscopy 107, 120, 314, 433–434, 486, 530 Bloch equations 436–437 limitations 433


nuclear magnetic resonance (NMR) (continued ) nuclear spin and energy quantization 434–435 populations and equilibrium magnetization 436 pulsed Fourier transformation spectroscopy 437–438 relaxation 437 speeding-up 463 theory 434–438 nuclear-run-on assays 905 nuclear-run-on transcription 906 5´ nuclease reaction format (TaqMan) 726 coupled amplification/detection 727 homogeneous detection systems 727 nucleic acid blotting 704 colony/plaque hybridization 707–708 dot/slot-blotting 707 membrane, choice of 704–705 Northern blotting 706–707 Southern blotting 705–706 nucleic acids 705 amplification systems 750 signal amplification 752–753 target amplification 751–752 blotting (see nucleic acid blotting) detection systems 738 by hybridization 722 non-radioactive systems 739–749 radioactive systems 738–739 staining methods 738 electrophoresis (see electrophoresis) fragments isolation, purification 729 using electroelution 708–709 using gel filtration/reversed phase 708 using glass beads 708 in gel matrix 691 hybridization 719 basic principles of 720 heterogeneous systems for qualitative analysis 723 heterogeneous systems for quantitative analysis 723–724 homogeneous systems 725, 729 intercalation assay 728–729 molecular beacon system 728 practice of 721–722 in situ assays 729 specificity 722–723 TaqMan/5´ nuclease amplification detection 725–728 isolation of fragments (see nucleic acid fragments, isolation of) labeling methods 733 chemical labeling 737–738 enzymatic labeling 735–737

photochemical labeling reactions 737 positions 733–735 mixture 700, 721 photoactive substances for detection of 737 probes for 729 DNA probes 730–731 LNA probes 732–733 PNA probes 732 RNA probes 731–732 restriction analysis 681 historical overview 682 principle of 681–682 restriction enzymes biological function 682 classification of 682–683 isoschizomeres 685 recognition sequences 683–685 type II 683 staining (see staining methods) stringency, specificity 722–723 in vitro restriction/applications complete restriction 685 genetic fingerprint 689 genomic DNA, restriction analysis of 686–688 incomplete/partial restriction 686 methylated bases, detection of 688 multiple restriction enzymes, combination of 686 partial restriction 686 restriction fragment length polymorphisms (RFLP) 688–690 restriction mapping 686, 687, 688 nucleic acid sequence-based amplification (NASBA) 751, 755, 778 nucleic acids, isolation/purification of alkaline lysis, principle of 672 ampicillin-containing media 670 carrier 668 cetyltrimethylammonium bromide (CTAB ) 670 CsCl density gradient 673 cytoplasmic RNA cultivated cells 677 tissue/cultivated cells 677–678 determination of concentration 668–669 DNA yield after anion exchange purification 673 double-single-stranded DNA absorption curves of 668 with ethanol 667 eukaryotic low molecular weight DNA 674 eukaryotic viral DNA 675 gel filtration 666–667

genomic DNA 669 additional steps 670 cell membranes/protein degradation, lysis of 669 enzymes/lysis reagents 669 phenolic extraction/subsequent ethanol precipitation 670 precipitation 670 lab-on-a-chip (LOC) system 680 low molecular weight DNA anion exchange chromatography 673 bacterial culture 671 density gradient centrifugation 673–674 lysis of bacteria 671–673 plasmid from bacteria 670–671 magnetic particles 679–680 optical density (OD) 668 phenolic purification 665–666 photometric determination 668 plasmid DNA by CsCl density gradient centrifugation 674 precipitation with ethanol 667–668 protein containing contaminations 665 RNA isolation 676 poly(A) 678–679 sensitive quantification method 669 single-stranded DNA 676 double-stranded DNA 676 M13 phage DNA 676 small RNA, isolation of 679 Tris-HCl or TE (Tris-HCl/ EDTA) 665 viral DNA, phage DNA 674–675 Nucleic Acids Research (NAR) 877 nucleophiles 125 nucleotide sequence 876 management, in laboratory 881 nucleotide triphosphate 951 numeric aperture (NA) 186, 187 O object characteristics 501–502, 504 n-octyl-β-D-thioglucopyranoside (OSPG) 673 octylglucoside 59, 60 off-gel isoelectric focusing, principle of 266 Ogston sieving effect 691 Ohm’s law 293 oil-immersion objective 187 okadaic acid 1045 oligodeoxyribonucleotide, comparison of 963 oligonucleotides 738, 793, 960 antisense 960, 963


cell culture/animal models 964 human β-globin pre-mRNA 961 intermolecular triple helix 962 mechanisms 960–961 RNase H 960–961 RNA splicing, changes 961 as therapeutics 964–965 translation inhibition 961 aptamers, high-affinity RNA/DNA oligonucleotides 971–974 arrays 731 CGE electropherogram of 702 fingerprinting 936 gapmer 964 high-affinity 973 interferon response 967 intermolecular triple helix 962 mechanism and location 960 micro-RNA pathway 970 nucleotides modifications 962 oligosaccharide binding fold (OB fold) 860 positions for nucleotides modifications 962 probes 918 disadvantage of 722 ribose, by fluorine 2′ position 973 RNA interference, mechanism of 968 SELEX-strategy, for RNA aptamers isolation 972 short hairpin RNA (shRNA) vector expression 969 susceptibility to nucleases 962–964 synthesis 709, 710, 952 phosphoramidites 737 triplex forming oligonucleotides (TFOs) 961–962 uses 959 oligosaccharides 572 omics 1023, 1024 one-dimensional diffusion 76 one-dimensional NMR spectroscopy 437–438 chemical shift 439 1D experiment 438 line width 442 scalar coupling 439–442 spectral parameters 438–439 online detection systems 798 online sample concentration 295–296 one buffer stacking system 295 two buffer stacking system 295–296 open reading frames (ORFs) 789 o-phosphoserine (OPS) 233 o-phthaldialdehyde (OPA) 305, 306 optical enzymatic detection systems 747 optical multichannel analyzers (OMAs) 144

optical rotation dispersion (ORD) 178 optical spectroscopic techniques 131, 132 physical principles 132 optical tweezers 856 orbital ion trap 349–350, 350–351 Orbitrap 350 linear ion trap with 354–355 organic radicals 466 organic solvents 10 ortho-phthaldialdehyde (OPA) 305, 306 oscillating dipole moment 172 Ouchterlony immunodiffusion technique 78 P PacBio RS single molecule real time DNA sequencing 814 paired-end-sequences, alignment of 811 paired-end tag sequencing 829 pair-end sequencing 810 Pancreatic DNase I 827 paraffin slices 197 parallel reaction monitoring (PRM) 1012 paramagnetic centers 467 paramagnetic dipole interaction 119 parfocal distance 185 partition coefficient 59, 60 passive immunoagglutination 72 Pauli principle 135 PCR. see also bisulfite PCR peak capacity 241 peak dispersion 223 peak distance 221 peak variances 223 peak widths 221 pentose phosphate pathway 1024 pepsin 207 peptide. see also peptide based quantitative proteome analysis analysis 329 antibodies 99 bonds 23, 30, 224 hydrolysis of 302 detection of 651 ESI-MS spectrum 373 fragmentation 562 general structure 313 monoisotopic masses 989 separation and enrichment 649 sequences 313 Peptide Atlas projects 1008 Peptide Atlas SRM Experimental Library (PASSEL) 1004 peptide based quantitative proteome analysis 998 bottom-up proteomics 998 data analysis/interpretation 999


ionization 999 mass measurement/ fragmentation 999 peptide separation 998 proteolysis using trypsin 998 bottom-up proteomic strategies 1000 DDA, principle of 1000 DIA, principle of 1001 SRM, principle of 1000–1001 data dependent analysis (DDA) challenge 1003 principle/intended use 1002 strength/weaknesses 1002 typical applications 1003 extensions 1012 MSE 1013 MSX 1013 parallel reaction monitoring (PRM) 1012 precursor acquisition independent from ion count (PAcIFIC) 1012–1013 peptide quantification 1001–1002 proteome, complexity of 1000 selected reaction monitoring (SRM) 1003 analysis software 1007–1009 clinical studies 1010 data analysis 1007 eukaryotic model organisms 1009–1010 identification 1005–1006 method 1004–1005 microbial 1009 principle of 1003–1004 quantification 1006–1007 strength/weaknesses 1009 typical applications 1009 SWATH-MS principle/intended use 1010–1011 strength/weaknesses 1011 typical applications 1011–1012 peptide libraries, analytics of 567–569 approach to identifying 568 characterization 569 divide–couple–combine method 567 Edman degradation 567 peptide nucleic acid (PNA) 730 oligomers 732 peptide synthesis 555–560 common side products and side reactions during 560 approach 568 characterization by ESI mass spectrometry 569 divide–couple–combine method 567 Edman degradation 567 Fmoc- or Boc-strategy 557


peptide synthesis (continued) and identification 555 important reaction mechanisms in 558 principles 555 structure and abbreviation of selected agents used in 557 synthesis on solid support 556 peptide synthesizer 559 peptidome analysis of 1029 collagenases 1029 peptidomics 1023, 1028–1029 peroxidase–anti-peroxidase (PAP) complex 87 Petran, Mojmir 182 PFGE gels 699 Pfu DNA polymerases 759 phagocytosis 64, 105 phase contrast condenser 185 phase contrast microscopy 187 phase contrast objective 185 phase diagram of aqueous protein solution 532 phase transitions 53, 54 phenol/chloroform/isoamyl alcohol (PCIA) 665 phenolic extraction 666 phenotypic/biochemical markers 931 phenylalanine 225 phenyl isothiocyanate (PITC) 306, 654 phenylthiocarbamoyl (PTC) peptide 315 phosphatase/deacetylase-treated sample 652 phosphatidylcholines 50, 641 phosphatidylethanolamines 641 phosphodiester bonds 856 phospholipases 615 phospholipid dimyristoylphosphatidylcholine 59 phosphomethylene-L-phenylalanine (Pmp) 1054 phosphopeptides 650 phosphoric acid (H3PO4) 650 phosphorothioates, disadvantages of 962 phosphorylated proteins 650 analysis of 1054 detection of 651 separation and enrichment 649 phosphorylation 224, 329, 369, 561, 645–646, 979 phosphoserine 648 phosphotyrosine-containing proteins, generic detection of 652 photoactivatable groups 127 photoactivated localization microscopy (PALM) 201, 486

photoaffinity labeling 121, 123 photobleaching 199 photodetectors 133 photolabeling peptides, with benzophenone 127 photolysis 125, 126 caged probe 1051 photometric measurement 142, 143–144 frequent errors in 143 main sources of error in 142 principle of 132–133 with circular polarized light 133 with linear polarized light 132 photon emission 915 photon energies 133 photon-induced dissociation (PID, IRMPD) 360 photoreceptors 133 phototoxicity 200 Phred quality 808 pH sensors 429 pH values 561 phycoerythrins 741 physical-chemical systems 35 pico-Newtons (pN) 855 piezoelectric quartz crystals (PQC) 429 pigment–protein complexes 140, 152 pI value 561 pixel protein profiling (PPP) 1062, 1063 plan achromatic objectives 185 plan apochromatic objectives 185 planar/bead-based arrays 1034 planar invitrogen microarray 1038 planar protein microarray 1033 plasma proteins 981 dynamic range 981 plasma proteome analyses 981 plasmid vector 788 ColE1 origin 671 copy numbers 671 plasminogen 995 plate number 222, 223, 235 point spread function (psf) 185 polarimeter 178 polarity 224, 225 polarization 132 microscopy 188 plane 178 polarized light 132 linear dichroism 175–178 methods using 175 polyacrylamide gel electrophoresis (PAGE) 711, 901 polyacrylamide gels 244, 694, 704 advantages of 260 separation of oligonucleotides 696 range 695

structure 250 vs. agarose 694 polycystic kidneys 767 polydimethylsiloxane (PDMS) 277 polyethersulfone 16 poly(ethylene glycol) (PEG) 675 poly(ethylene oxide) (PEO) 701 polyketides (PK) 614 poly-L-lysine 180 polymerase chain reaction (PCR) 721, 725, 751, 755, 918, 935, 950 alternative amplification procedures 777 nucleic acid sequence-based amplification (NASBA) 777 transcription-mediated amplification (TMA) 777 Alu PCR 770 amplification 819 avian myeloblastosis virus (AMV) RTase 761 enzymes 761 Moloney murine leukemia virus RTase 761 primers 762 procedure 762 RNA (RT-PCR) 761 in single reaction tubes 762 Tth DNA polymerase 762 in two reaction tubes 762 applications 772 genetic defects, detection of 773–776 human genome project 776–777 infectious diseases, detection of 772–773 branched DNA amplification (bDNA) method 782 contamination problems 770 avoiding 770–771 decontamination 771–772 digital PCR 769 DNA amplification 758, 763 additives 763 buffer 759 cycles 758–759 enzyme 759 hot start PCR 763 magnesium ions 763 nucleotides 759 primers 759–760 probe preparation 758 RNA amplification 763 template 763 templates 760 for DNA brand marking 794 DOP PCR 770 helicase-dependent amplification (HDA) 777–779


instruments 756–757 inverse PCR 769 ligase chain reaction (LCR) 779–780 master mixes 759 mutagenesis techniques 870 optimization of reaction 763 polymerization of DNA 756 possibilities of 755–756 PRINS PCR 770 prospects of 782 Qβ amplification 780–781 quantitative PCR 763 competitive (RT) 765–766 external standard 764–765 internal standardization 765 RACE PCR 769 repair chain reaction (RCR) 780 RT and Taq DNA polymerase 762 schematic of 757 special techniques 766 asymmetric PCR 767 cycle sequencing 768 degenerate primers, use of 767 homogeneous detection procedures 768–769 multiplex PCR 767–768 nested PCR 766–767 quantitative amplification procedures 769 in situ PCR 769 in vitro mutagenesis 768 strand displacement amplification (SDA) 777 temperature/time profile of 758 typical course 764 vectorette PCR 769 polymerization 251 poly(methyl methacrylate) (PMMA) 277, 680 polymorphic markers 927, 931 polymorphic sequence tagged sites 927 Polyoma/SV40 nucleic acids 675 polypeptide 147 poly(vinylidene fluoride) (PVDF) 16, 652 membrane 984 membranes 273 polyvinylpyrrolidone (PVP) 14, 701 pore size 251, 255 porosity 288 gradient gels 254–255 positional candidate gene approach 939 positional cloning 938 position-specific scoring matrix (PSSM) 883 post-translational modification (PTM) 314, 315, 369, 645, 647, 648, 659, 998

amino acids, localization/identification 653–654 analysis, based on 648 future perspective 661 quantitative analysis of 659–660 potential energy 35 power of combinatorial molecular discrimination (PCMD) 1061, 1063 33P phosphates 733 precipitation 3, 9, 17, 64 of nucleic acids 10–11 using organic solvents 10 using trichloroacetic acid 10 precursor acquisition independent from ion count (PAcIFIC) 1012 precursor ion analyses 352 prenols (PR) 614 preparative immunoprecipitation 81–82 preparative techniques 263 electroelution from gels 263–264 isoelectric focusing 265 between isoelectric membranes, principle of 265 preparative IEF between isoelectric membranes 265–266 preparative zone electrophoresis 264 preparative zone electrophoresis 264 pressure perturbation calorimetry (PPC) 61 primers 819 extension assay, principle of 902 oligo(dT) primers 760 for RT-PCR 762 secondary structures of 760 types of 759 principal component analysis (PCA) 506–508, 1031 approach of 506 factorial map 508 independent structures 507 mathematical procedure 507 motivation for 508 of prealigned projections of the protein complex 508 real EM data 508 real EM images, number of pixels 507 representation of two-pixel images in coordinate systems 507 sensitivity of 508 product ion analysis 352 prompt and metastable decay (ISD, PSD) 358–360 spectrum of angiotensin 372 Prosite Motif PS00029 882 prosthetic groups 23 protease 588 protease inhibitors 8


protein aggregation 49 protein based quantitative proteome analysis 982 intact protein mass spectrometry, concepts 987–997 closing remarks/perspective 996–997 data analysis 991–995 high-throughput top-down proteomics 995 mass spectrometry, to measure intact proteins 990–991 top-down proteomics using intact protein mass spectrometry 987 using isotope labels 986–987 two-dimensional differential gel electrophoresis (2DDIGE) 986 two-dimensional-gel-based proteomics 982 peptide fragments, analysis of 985–986 proteins, imaging/ quantification 983–985 proteins separation 982–983 sample preparation 982 protein co-compartmentalization machine 1058 protein complexes 530 protein crosslinking 121 protein crystallization using the hanging drop method 532 Protein Data Bank (PDB) 543 protein degradation 978 protein determination 23 method for 24 staining methods for 26 protein–DNA interactions 951 protein dynamics, determination of 463–464 protein equivalent of a genome 977 protein expression 978 protein folding/misfolding 464 protein fragmentation 990, 992, 993 protein functional groups, chemical modification of 108–116 acylation 109–110 amidination 110 arginine residues 113 caged compounds 111 cysteine residues 112 glutamate and aspartate residues 112–113 histidine residues 116–117 lysine residues 108 methionine residues 114–115 reaction with isothiocyanate 111 reductive alkylation 110–111 tryptophan residues 114 tyrosine residues 113–114


protein functions biologically active molecules for modulation 1044 switching off 1044 protein glycosylation 572, 579 analysis of 581 protein imaging 1068 protein interactions 1036 protein–ligand interactions 299 protein localization 6 protein mass spectrometry 987 protein microarray applications, schematic representation 1037 protein microarrays 1037 protein–nucleic acid complexes by gel electrophoretic methods 836 molecular beacons 853 protein–nucleic acids interactions 831 dissociation constants, determination of 839–840 DNA footprint analysis 841 chemical nucleases 849–850 chemical reagents for modification 846–848 genome-wide 850–851 hydrolysis methods 844–845 interference conditions 848–849 labeling 843 primer extension reaction 843–844 DNA–protein complex dynamics, analysis of 840–841 filter binding 836 gel electrophoresis, background to retardation 836–838 genetic methods aptamers/Selex procedure 869 directed mutations, within binding domains 870 tri-hybrid method 868–869 physical analysis methods fluorescence correlation spectroscopy (FCS) 856 fluorescence methods 851 fluorescence resonance energy transfer (FRET) 852–853 fluorophores procedures 851–852 labeling procedures 851–852 molecular beacons 853 optical tweezers 855 scanning force microscopy (SFM) 854–855 surface plasmon resonance (SPR) 853–854 RNA interactions (see RNA–protein interactions) proteinogenic amino acids 876 protein profiles disease specific 100-dimensional discovery 1062

protein-protein interactions (PPIs) 381, 555, 1046 protein purification 23, 219, 315 goal of 5 proteins chemical modification 107 complexity and individuality of 23 complex structures, three-dimensional reconstruction of 509 denaturation of 209 alkylation of cysteine residues 210 cleavage of disulfide bonds 209 cysteine residues, chemical modification of 210 disulfide bonds and alkylation 209–210 2D gel 984 immunogenicity of 70 localization of 1033 peptides, separation methods of 4 properties of 3 size 3 splicing, conditional 1054–1055 staining 983 ProteinScape 1019 protein structure, determination 457, 462 high molecular weight systems and membrane protein structure and dynamics of 465–466 in-cell NMR spectroscopy 466 intrinsically disordered proteins 464–465 NMR spectroscopy 462–463 NOE signal intensity and respective proton distance, relationship between 457 protein dynamics, determination 463–464 protein folding and misfolding 464 protein–ligand complexes thermodynamics and kinetics of 464 residual dipolar couplings (RDCs) 458 secondary structure, determination of 458–461 structure calculation, constraints for 457–458 tertiary structure, calculation of 461–462 distance geometry method 461 root-mean-square deviation (RMSD) 462 simulated annealing 461 protein toponome, concept of 1058–1059 protein transfer 88 proteoforms 981

proteolytic enzymes 207–208 cleavage on membranes 208 in SDS-polyacrylamide gels 208–209 in solution 208 strategy 208 proteome analysis 108, 610–611, 977 based diagnostics 956 coverage 1012 general aspects 977 protein based quantitative analysis (see protein based quantitative proteome analysis) ProteomeXchange 1010 proteomic databases 611 sample preparation 980–982 SDS gel electrophoresis 977 starting conditions/project planning 979–980 prozone effect 72 protospacer adjacent motif (PAM) 974 PTEN knockout 1010 pulsed electron double resonance (PELDOR) 478 pulsed EPR experiments 473 basics 474 electron nuclear double resonance (ENDOR) 477–478 electron spin echo envelope modulation (ESEEM) 475–476 hyperfine sublevel correlation experiment (HYSCORE) 476–477 pulsed electron double resonance (PELDOR) 478 relaxation 474 spin echoes 474–475 pulsed-field gel electrophoresis (PFGE) 698, 934 pulsed laser beam 1064 pulsed liquid sequencer 321 pure protein solutions 24 purification techniques 4, 6 Pwo DNA polymerases 759 pyridoxal-5´ -phosphate 111 pyrimidines, 5,6-double bonds of 846 pyrophosphorolysis 793 pyrosequencing 807, 820 Q quadrupole analyzer 345–348 quadrupole-TOF (Q-TOF) 354 analyzers 377 quantitative determination, by staining tests 25–26 quantitative immunoprecipitation 73 quantitative structure–retention relationships (QSRRs) 241 quantitative trait loci (QTLs) 940


quantum dots as fluorescence labels 159–160 labeling with 192 quasi-isothermal conditions 48 R RabGDP-dissociation inhibitor (RabGDI) 1053 Rab-GTPase Ypt1 1052 radioactive detection methods 652 radioactive labeling 31–33, 109, 733 of DNA sequencing 797 exchange positions 735 radioactive methods 651 radioactive nucleic acids direct autoradiography 739 fluid emulsions for cytological/cytogenetic in situ applications 739 fluorography 739 indirect autoradiography, with intensifier screens 739 pre-exposed X-ray film, for direct autoradiography/fluorography 739 radioimmunoassays (RIA) 31, 82–83 radioisotopes 739 radiolabeling 32 Raman spectroscopy 171, 1026 principles 171–172 Raman experiments 172–173 Raster image correlation spectroscopy (RICS) 203 Rayleigh–Gans–Debye scattering 141 Rayleigh scattering 142 Rd1-SP-adapter ligation 812 Reactome 1024 reagents for introducing fluorophores 117 real-time RT-PCR (RT-qPCR) 904, 905 quantification of gene expression 904 REases 683 type-II restriction enzymes 684 recombinant antibody 100 recombinant DNA technologies 120 recombinant glycoproteins 571 recombinant proteins 7 recombinant retroviruses 395 recombinase A (recA−), strains deficient of 671 recombination fraction 929 red-green-blue (RGB) images 1065 reductive alkylation 110–111 RefSeq 890 regular expression functions 882 regularly arrayed macromolecular complexes, three-dimensional reconstruction 511–512

relative centrifugal force (RCF) 12 relative molecular mass 4 relative resolution map (RRM) 236 Renilla luciferase 749 reporter gene assay 1047 reporter gene vector cis-acting sequences principle of mapping 912 resolution 185 capacity 342 optimization 224 power of isoelectric focusing 260 range, different methods for 529 resonance assignment 452 heteronuclear 3D spectra, analysis of 454 selective amino acid labeling 454 sequential assignment of homonuclear spectra 452–453 from triple-resonance spectra 454–457 resonance methods 134 resonance Raman spectroscopy 173–174 restricted access materials (RAMs) 238 sorbents materials 229 restriction analysis, methylation sensitive enzymes/insensitive isoschizomers 824 restriction enzymes 682 cleavage 685 restriction fragment length polymorphisms (RFLPs) 681, 927, 928 retention 228 factor 221, 223, 235 times 220, 221 volume 221 reveal bacteria 671 reversed-phase chromatography (RPC) 16–17, 240 reversed-phase HPLC 21 reverse immunoagglutination 72 reverse phase protein microarrays 1038 reverse transcriptases (RTases) 761, 896, 901 RNA-dependent DNA polymerase 901 Rev protein 859 rhodamine-tagged phosphoramidites 851 Rhodobacter capsulatus 157 rhodopsins 149, 176 ribonuclease-protection assay (RPA) 898–901, 900 ribonucleases (RNases) 959 ribonucleoprotein (RNP) 856 domain 859 ribose by fluorine 2´ position of 973


ribosomal RNA 904 ribosomes 530 transfer-RNA (tRNA) molecules 947 ribozymes 479, 959, 965, 967 catalytic cycle of 966 discovery/classification 965–966 structure of 966 use of 966–967 RNA aptamers, SELEX-strategy 972 characteristic structural elements 858 DNA, helical grooves between 857 mimic fragments 765 molecules gel electrophoresis, two-dimensional 701 splicing, analyzing 947 RNA-binding motifs, characteristic 859–860 RNA-binding proteins 859 RNA-dependent DNA polymerases 761 RNA-DNA hybrid molecules 896, 901 RNA/DNA sequences cloning/PCR amplification 788 electrophoresis 788 error correction/sequence data analysis 789 nucleic acid, isolation/purification of 788 purification 788 reconstitution 788–789 RNA electrophoresis 697 RNA-induced silencing complex (RISC) 967 RNA interference (RNAi) 959, 967–971, 968, 969 basics of 967–968 mediated by expression vectors 968–969 uses of 969–970 RNA isolation 679 RNAi-triggered cellular processes 968 RNAi-triggered knockdown 955 RNA-modifying reagents reagents for 864 structural formula 864 RNA polymerases 899, 905, 908 analysis of binding 855 RNA polymerases transcribe 908 RNA-protein complexes 866 analysis of 860 chemical modification 863–866 chemical crosslinking 866–867 CMCT (1-cyclohexyl-3-(2-morpholinoethyl)carbodiimide metho-p-toluolsulfonate) 864–865 customary RNases 862–863 diethyl pyrocarbonate (DEPC) 864


RNA-protein complexes (continued ) dimethyl sulfate (DMS) 864 ENU (ethylnitrosourea) 865 Fe-BABE (Fe 1-(p-bromoacetamidobenzyl) ethylenediaminetetraacetic acid) 865 hydroxyl radicals 865 in-line probing 865 kethoxal (α-keto-β-ethoxybutyraldehyde) 863–864 labeling methods 861–862 limited enzymatic hydrolyses 861 nuclease S1 863 photoreactive nucleotides, incorporation of 867 primer extension analysis 862 RNase CL3 863 RNase CVE 863 RNase T1 862 RNase T2 863 RNase U2 862 selective 2´ -hydroxyl acylation analyzed by primer extension (SHAPE) analysis 865–866 transcription start sites (TSS), genome-wide identification of 867–868 RNA-protein interactions 857 dynamics of 857–859 functional diversity 856–857 secondary structure parameters/unusual base pairs 857 RNA–protein recognition tri-hybrid system for the in vivo characterization of 868 RNA quantification, by Northern blot 903 RNA-RNA hybrid molecules 899 RNase H activity 777, 961 RNase inhibitors 676 RNAse protection assays 761 RNASeq 813, 946 RNases, specificities of 860, 861 RNA transcripts dot- and slot-blot analysis 903–904 Northern blot 902–903 nuclease S1 analysis 896, 897 quantitative 896 reaction principle 896 RNA 5´ /3´ ends 897–898 overview 895–896 primer extension assay 901–902 reporter gene expression 914 β-galactosidase (β-Gal) 915 chloramphenicol acetyltransferase (CAT) assay 914

green fluorescent protein (GFP) 915 luciferase assay 915 transcripts from transfected cells 915 reverse transcription polymerase chain reaction (RT-PCR) 904–905 ribonuclease-protection assay (RPA) 898–901 in vitro transcription, in cell-free extracts 907 additional techniques to analyze 911 G-less cassette 908–909 run-off transcription assays 909–910 template DNA/detection 908 transcription assay 907–908 transcription-competent cell extracts/ protein fractions, generation of 908 in vivo analysis nascent RNA labeling with 5fluoro-uridine (FUrd) 906–907 nuclear-run-on assay 905–906 ROBETTA server 893 robot-assisted microdispension system 610 robotic system 50 rocket immunoelectrophoresis 80–81 Rohrer, Heinrich 486 root-mean-square deviation (RMSD) 462 Rosetta program 893 rotors, for centrifugation 11 Rous sarcoma virus (RSV) 960 Royal Society of London 181 RP-LC-ESI-MS proteolytic digest of protein 377 RT-PCR techniques 904 schematic portrayal 761 run-off transcription reaction 910 Ruska, Ernst 486 S Saccharomyces cerevisiae 785, 938, 942, 947, 1009, 1037 salts removal 15 sample concentration 295 sample preparation 195, 249, 332–335 creation of paraffin slices 197 embedding 196–197 fixation 196 frozen slices 197 creation of frozen slices (rapid slices) 197 embedding 197 sealing 197

isolated cells 195–196 paraffin samples 196 proteome analysis 22 tissue biopsies 196 Sanger method 785, 786, 805 Sanger’s reagent 313 scanning calorimetry 47 scanning cysteine accessibility method (SCAM) 120 scanning electron microscope (SEM) 486 scanning force microscopy (SFM) 854–855 scanning ion conductance microscope (SICM) 519 scanning microprobe MALDI (SMALDI) 1065, 1068 mouse spinal cord 1070 mouse urinary bladder tissue section 1069 scanning near-field infrared microscopy (SNIM) 486 scanning near-field optical microscopy (SNOM) 486, 519 scanning probe microscopies (SPM) 486 scanning tunneling microscope (STM) 486 scattering artefacts, correction for 142 scFv-Antibodies 100 Schiff base 111, 149 Schlack–Kumpf degradation 326 Schleiden, Matthias J. 182 Schwann, Theodor 182 secondary electron multiplier (SEV) 356 channel electron multiplier 356 constructions with discrete dynodes 356 microchannel plates 356 secondary ion mass spectrometry (SIMS) 1064 ion imaging mode 1064 secretory proteins 886 sedimentation coefficient 12, 412 sedimentation–diffusion equilibrium 415 selected or multiple reaction monitoring (S/MRM) 1000 mass spectrometer 1004 quantification 1002 selective 2´ -hydroxyl acylation analyzed by primer extension (SHAPE) reagents 866 selectivity 221, 239 selenium 25 Selex procedure 869 separated proteins zones 251


detection and quantification 251–252 imaging 252–253 separation efficiency 4 sequence composition 882 sequence data analysis 875 abstraction for biomolecules 876 and bioinformatics 875–876 homology based methods basic local alignment search tool (BLAST) 890 identity 887–888 optimal sequence alignment 888–890 PSI-BLAST algorithm 890–891 similarity 887–888 threshold 891 internet databases/services 877 data contents/file format 879–880 nucleotide sequence management, in laboratory 881 sequence retrieval, from public databases 878–879 multiple alignment/consensus sequences 891–892 procedures 893 structure prediction 892–893 sequence logo 883, 886 sequence patterns 882–886 coding regions, identification of 885–886 protein localization 886 transcription factor binding sites 884–885 sequence tagged sites (STSs) 776, 927 physical markers 927 screening 936 serial femtosecond crystallography (SFX) 550 serum amyloid A (SAA) 96 short hairpin (SH) groups modification 112 with p-chloromercury benzoate, analysis 112 short-hairpin RNA (shRNA) 953, 968 vector expression 969 shotgun method 786 sickle cell anemia 688 signal amplification 751 branch structures 752 coupled signal cascades 753 cyclic ADH 753 enzyme catalysis 752–753 signal-to-noise ratio 144, 504–506 averaging single particles 505–506 correlation averaging 504–505 filter approaches for crystal data 504 filtering in Fourier space 504 silica coated magnetic beads 679 silver staining 251, 252, 704

simplified coordinate system visualization of experimental data 1031 single beam photometer 142 single crystals, of amylase C 533 single ion monitoring (SIM) scan mode 348 single isomorphous replacement (SIR) 538–540 single molecule spectroscopy 174–175 single nucleotide polymorphisms (SNPs) 927, 940 identification of 948 microarray-based 948 polymerase extension reaction 949 SNV analysis 940 single nucleotide variants (SNVs) 927, 940 single particle analysis 516 three-dimensional reconstruction 509–511 single reaction monitoring (SRM) and MRM-analysis 352–353 SRMAtlas project 1008 SRMCollider 1005, 1009 single-strand conformational polymorphism (SSCP) 695, 775–776 single-stranded binding (SSB) 742 single-stranded DNA(ssDNA) 775 single-stranded RNA probes 731 singlet oxygen generator (SOG) 409 singlet oxygen sensitizer 409 singlet oxygen triplet energy transfer (STET) 409 size exclusion chromatography (SEC) 240 Skyline software 1008 slot-blotting 707 small angle X-ray scattering (SAXS) 529, 543–544 data analysis 547 de novo structure determination 547–548 method developments 549 machine setup 544–545 theory 545–547 SMART server domain 884 snake venom phosphodiesterase 827 sodium bisulfite catalyzes 819 sodium dodecyl sulfate (SDS) electrophoresis, for low molecular weight peptides 258 PAGE, analytical ultracentrifugation 409 polyacrylamide gel electrophoresis 257–258 sodium saline concentration (SSC) 723


software tools 984 solid phase sequencer 320 soluble macromolecules 529 somatic cells 925 spatial frequency 499 spectral absorption coefficients 25 spectral unmixing 204 spectroscopic methods 28–29 fluorescence method 31 measurements in UV range 29–30 protein determination 29 spectral range 134 spectroscopy. see atomic force microscope (AFM) sphingolipids (SP) 614 sphingomyelins 641 spin label 119 reagents for 120 splice acceptor (SA) 938 splice donor (SD) sites 938 splitting scheme, for unpaired electron 469 spot patterns, phosphorylation status of proteins 652 (S)SIM ((saturated) structured illumination microscopy) 201 stable isotope labeling 1014 by amino acids in cell culture (SILAC) 1015 bottom-up proteomics isobaric labeling 1020–1021 non-isobaric labeling 1019 18 O labeling 1019 reagents 1019–1020 peptide standards 1006 in quantitative proteomics 1013 top-down proteomics 1013 chemical stable isotope labeling 1016 isotope-coded affinity tag method (ICAT) 1016–1018 isotope coded protein label (ICPL) 1018–1019 metabolic labeling 1014–1015 stable isotope labeling by amino acids in cell culture (SILAC) 1015–1016 staining methods characteristics 983 fluorescent dyes 702 DNA geometry, influence 703 ethidium bromide (3,8-diamino-5ethyl-6-phenylphenanthridinium bromide) 702–703 fluorescent dyes 703–704 silver staining 704 standard free energy 61 Staphylococcus aureus 768 stationary phase 221, 222, 618



stereoisomers 572 sterols (ST) 614 stimulated emission depletion (STED) 201 microscopy 486 stochastic optical reconstruction microscopy (STORM) 201 Stokes radius 254 strand displacement amplification (SDA) 779 Streptomyces avidinii 746 strong cation exchange (SCX) chromatography 649, 650 structural proteins 254 substance libraries, source of 1045 N-succinimidyl-3 [4-hydroxyphenyl] propionate 109 D-sugars series 572 composition starting from D-glyceraldehyde 572 L-sugars series 574 sulfonyl chloride 116 sulfuric acid 25 surface plasmon fluorescence spectroscopy (SPFS) 402 surface plasmon resonance (SPR) 853–854 Biacore technique measurement of antibody binding to antigen using 94 measuring device, principle of 854 spectroscopy 400–402 surface tension 227, 231 Svedberg equation 412, 415 Svedberg, Theodor 244 Svensson–Rilbe’s concept of “natural” pH gradients 244 SWATHAtlas 1010 SWATH-MS 1011 SYBR Green 728, 905 synthetic peptides characterization/identity of 562–564 peptide nucleic acids (PNAs) 732 purity of 561–562 structure, characterization of 564–567 systematic evolution of ligands by exponential enrichment (SELEX) 972, 973 systems biology, hypotheses/knowledge circular system 1030 454-system workflow 808 T tandem affinity purification (TAP) 394–395, 395–397 advantages 395 limitations 397

limitations of 397 mass spectrometric analysis 397 purification 395–397 retroviral transduction 395 tagging and purification of protein complexes 394 tandem mass spectrometry phosphorylated and acetylated amino acids, localization of 654–659 tandem-TOF (TOF-TOF) 353–354 Taq DNA polymerases 730, 736, 758, 759 TaqMan® 756, 759 PCR 765 probes 725, 905 target amplification 751 elongation 751 transcription 751 in vivo amplification 751 targeted proteomics 1004 target protein, temporal control 1051 TATA binding protein (TBP) 835 TAT protein 481 t-butyl trifluoroacetate 559 T-cells 96 T-Coffee 892 T7 DNA polymerase-catalyzed sequencing reaction 791, 792 T-effector cells (TE) 105 temperature gradient gel electrophoresis (TGGE) 701, 840, 841 terminator exonuclease (TEX) 868 test strips 420 test system set-up 41–42 controls 45 detection system 42 pH value 43 physiological function, analysis of 42 selecting buffer substance and ionic strength 43–44 selecting substrates 42 substrate concentration 44 temperature 44 time dependence 43 TET enzymes 817 tetrabutylammonium bromide (TBAB) 711 Tetrahymena thermophila 965, 1068 N,N,N´ ,N´ -tetramethylethylenediamine (TEMED) 292 tetramethylrhodamine (TAMRA) 749 tetramethylrhodamine isothiocyanate 116 tet-repressor (TetR) 1051 therapeutic glycoproteins 571 thermal degradation 25 thermal denaturation 61 DNA species 720 thermal shift assay 531 thermograms 49, 51

thermosensors 49 thermostable DNA polymerases 795 thin-layer chromatography (TLC) 561, 618, 696, 911, 914 thiols 126 three-dimensional electron microscopy 508 three-dimensional NMR spectroscopy 449 HCCH-TOCSY and HCCH-COSY experiments 449–450 HNCA experiment 451–452 NOESY-HSQC and TOCSY-HSQC experiments 449 nomenclature of triple-resonance experiments 451 triple-resonance experiments 450–451 threonine 648 thymine/cytosine-specific fission reaction 803 thyroid stimulating hormone (TSH) 1035 time-of-flight analysis 345 time-resolved fluorescence (TRF) 749 time-resolved spectroscopy 144–147 TiO2 coated magnetic particles 649 titer 73 titration curve 60 analysis 263 TMHMM 886 TOPCONS 886 toponome cell/tissue, molecular networks of 1058 defined 1057 imaging cycler microscopy (ICM) antibody based 1057 map, structural representation 1058 reading technology, fundament of 1059–1063 theory, schematic illustration 1060 toponomics analysis 1057 biological systems 1057 co-compartmentalization and topological association rules 1058 total internal reflection fluorescence microscopy (TIRFM) 160, 202 total ion current (TIC) chromatogram 713 T7 phage DNA polymerases 845 T4-polynucleotide kinase (PNK) 901 phosphorylates 809 transcriptional profiling analysis 946 transcription factors 5 transcription-mediated amplification (TMA) 751, 755 transcription start site (TSS) 907, 909 transfer-RNA (tRNA) molecules 947

Index

transglutaminase 119 translocation 920 transmission electron microscope (TEM) 487–488 approaches to preparation 488 labeling of proteins 492 metal coating by evaporation 491–492 native samples in ice 488–490 negative staining 490 beam path 487 images of vitrified lipid vesicles 498 instrumentation 487–488 object holder, grids, and plunger for biological cryosamples 488 phase contrast 495 resolution 492–493 transmitted fluorescence microscope 188 transposon-mediated DNA sequencing 788 1,4,7-triazacyclononane 233 tributylphosphine 209 trichloroacetic acid (TCA) 10 precipitations 738 triethylamine (TEA) 711 triethylammonium acetate (TEAA) 711 trifluoroacetates 561 trifluoroacetic acid (TFA) 228, 333, 558, 560 trifluoromethyl diazirinobenzoyllysine 129 1-trifluoromethyl-1phenyldiazirine 126 trigonometric operation 501 triple-quadrupole (triple-quad) 351–352 triplex forming oligonucleotides (TFOs) 961 tripropylamine (TPA) 742 tris acetate (TAE) 692 tris borate (TBE) 692 tris-buffers 863 tris-(carboxy methyl)ethylene-diamine 233 Triticum aestivum 785 Trp1 gene 382 trypsin 588, 985, 1019 tryptophan 225, 303 Tth DNA polymerases 759 tumor tissue, differential methylation levels 826 T values 258 twin supercoiled domain model 834 two-dimensional NMR spectroscopy 443 COSY spectrum 445 2D experiment, general scheme of 443–445 heteronuclear NMR experiments 446–447

homonuclear 2D NMR experiments of proteins 446 HSQC – heteronuclear single quantum coherence 447–449 NOESY spectrum 445–446 TOCSY spectrum 445 two-hybrid system 381 AD fusion proteins and cDNA libraries 385–386 bacterial two-hybrid system (BACTH or B2H) 389, 391 bait proteins, used in Y2H screen 385 biochemical and functional analysis of interactions 393 biological relevance 394 computational analysis of interaction data 394 independent verification 393 localization 394 protein domains and motifs 394 carrying out Y2H screen 386–391 construction of bait and prey proteins 382–385 elements of 382 modifications 392 and extensions of technology 391–393 principle of 381–382 two-state model, curves 52 tyrosine 33, 225, 648 U ubiquitin monoisotopic and average m/z values 990 proteoforms, theoretical trypsin cleavage sites 994 top-down mass spectrum 988 ultrafiltration 16, 17 ultrahigh pressure liquid chromatography (UHPLC) 302 ultraviolet (UV) 133 HPLC chromatogram 717 spectroscopy 220, 224 UV-diode array detection (UV-DAD) 224 VIS/NIR spectroscopy 146 chlorophylls 152–153 chromoproteins 147–147 cytochromes 149–151 metalloproteins 151–152 principles 146–147 rhodopsins 148–149 uniaxial orientation samples 177 UniProt 879 unmethylated bases, chemical modifications 818 Ustilago sphaerogena 862


V Van-Deemter-Knox plots 222, 223 van’t Hoff equation 61 van’t Hoff transition enthalpy 52 V-genes 101 vibrational modes 169 of peptide bond 169 vibration cell mills 8 virtual image 185 viscosity 247 vitamin A 638 vitamin D 638–639 vitamin E 640 vitamin H 746 vitamin K 640 von Helmholtz, Hermann 493 W Watson–Crick base pairing 832, 857, 861, 862, 962, 971 Watson–Crick hydrogen bonding 961 wavelength 538 interference of light 187 wavenumbers 133 wave-particle dualism 133 WebMOTIF 884 Web server based modeling 893 Western blot analysis 88, 89, 252, 652 autoradiograph-based after 2D-PAGE 652 whole protein molecules 1038 Wiki pathways 1024 Wilkins, Marc 977 Wolff-rearrangement 126 X X-chromosome 951 inactivation 817 Xenopus laevis 466 XML/JSON 879 Xq28, candidate genes 932 x-ray crystallography 107, 120, 131, 314, 486, 529, 530, 567 crystallization 531–538 model building and structure refinement 542–543 phase problem 538–542 x-ray diffraction basics of 536 image of crystal 537 x-ray free electron LASER (XFEL) 549 detection and analysis 550 machine setup and theory 549–550 principle 549 samples 550



Y

yeast artificial chromosome (YAC) libraries 935
yeast Mbp1 transcription factors 879
Y2H interactions 399
Y2H screen 386–391
Y2H system 381
limits 390

Z

Z-DNA sequences 956
Zeeman splitting 468
Zernike, Frits 182, 485
zero mode waveguide (ZMW) 814
zinc-finger proteins 835
zirconium oxide 649
zonal centrifugation 13–14
zone band broadening 223
zone electrophoresis 253–254, 278
zwitterionic detergents 20
zwitterionic/non-ionic detergents 982
zymogens 207
activation 208
