DNA Barcodes: Methods and Protocols (Methods in Molecular Biology, 858) 9781617795909, 1617795909

A DNA barcode in its simplest definition is one or more short gene sequences taken from a standardized portion of the ge

English Pages 485 [471] Year 2012

Table of contents :
DNA Barcodes
Foreword
Preface
Contents
Contributors
Part I: Introduction
Part II: DNA Barcodes for the Tree of Life
Part III: Generating DNA Barcode Data
Part IV: Applications of DNA Barcode Data
Part V: Case Studies Using DNA Barcodes
INDEX

METHODS IN MOLECULAR BIOLOGY™

Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651

DNA Barcodes Methods and Protocols Edited by

W. John Kress and David L. Erickson Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA

Editors W. John Kress, Ph.D. Department of Botany National Museum of Natural History Smithsonian Institution Washington, DC, USA

David L. Erickson, Ph.D. Department of Botany National Museum of Natural History Smithsonian Institution Washington, DC, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic) ISBN 978-1-61779-590-9 ISBN 978-1-61779-591-6 (eBook) DOI 10.1007/978-1-61779-591-6 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2012931933 © Springer Science+Business Media, LLC 2012 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Humana Press is part of Springer Science+Business Media (www.springer.com)

Foreword The diversity of life in a hectare of reef, a county of grassland, or a shipload of imports challenges biologists called to identify the species comprising biodiversity, functioning as ecosystems, or invading ports. The sequences of black-and-white barcodes that empower a newly hired clerk to wave a wand over a cart full of goods swiftly, print an itemized receipt infallibly, and order replacements invisibly call forth a vision of an analog for identifying species. The resemblance of barcodes on commercial products to sequences of DNA shown as black-and-white bars on electrophoretic gels reinforced the vision back in 2003 in the founding meetings of the barcode of life movement. This book edited by early adopters of DNA barcodes, John Kress and David Erickson, proves the barcode of life has arrived in environmental science. In less than a decade, they and the other authors in this volume have realized the vision of a short DNA sequence on a uniform locality of the genome to identify species rapidly and accurately. Because the currency in biology is species, their identification is no academic diversion. Biologists count the rise and fall of biodiversity in species. Regulators designate endangered species by their identified populations and reserve land where they identify the endangered. Governments appraise the success of preservation in the currency of species. Inspectors define quarantines in identified species. Biologists carry the weight of these consequences as they select an exact name from almost two million known species names or conjecture that a specimen may belong to one of millions more unknown species. Written as a sequence of four discrete nucleotides—CATG—along a uniform locality on a genome, a barcode of life provides a “digital” identifying feature, supplementing the more analog gradations of shapes, colors, and behaviors. 
A library of digital barcodes will provide an unambiguous reference that will facilitate identifying species invading and retreating across the globe and through centuries. Making a difficult task harder, many species metamorphose into different forms as they cycle through stages in their lives. Eggs may become caterpillars and caterpillars become butterflies but, of course, all remain the same species carrying the same genes. Different species may resemble one another or be too small to distinguish easily but each carries different alleles and thus barcodes, which can unmask their identity. Furthermore, an inspector of unloaded cargo on a dock or an analyst of the remains of diets in a stomach may be called to identify species from only a snippet, a hair, or a fin. The fragment may be unrecognizable, but it will faithfully carry the identifying barcode of the source. Since Carl Linnaeus (1707–1778) developed systematic naming, ranking, and classifying of organisms, biologists have produced master keys to all knowledge about a species in the form of binomial names. Biologists use distinguishing features, such as shape, color, or number of legs in taxonomic keys first to assign binomials, like Homo sapiens, and then to associate the names of the organisms with biological knowledge about the species and its relatives. Of course, the bank of names suffers some problems, such as when several names are applied to one species [1]. And, biologists continuously debate criteria for species. The diversity of life from bacteria to whales renders any single rule inadequate for defining all species. Nevertheless a few basic criteria, such as that distinct species do not interbreed and meld their genetic sequences, serve for many groups.

Since Charles Darwin (1809–1882) proposed a branching pattern of evolution in On the Origin of Species, biologists have sought to arrange a phylogenetic system of species on an evolutionary tree of life. A tree of life illustrates every introductory biology text. Barcoding will reveal whether a newly collected specimen belongs to a species already on the tree. Or if a specimen is a truly new species, barcoding will help place it as a new leaf among known species on the proper branch of the tree of life. Whatever the criteria for defining and recognizing species, their inheritance and their genes must differ to maintain species distinctions generation after generation. Since the molecular discoveries of the mid-twentieth century, genes intimate a code comprising sequences of the four nucleotides that constitute DNA. Even before the barcoding movement now embodied in the 200 member organizations from 50 countries of the Consortium for the Barcode of Life (CBOL), scientific revisions of species boundaries included DNA analysis, and the ability to distinguish new species included DNA divergences. The product barcode analogy lent impetus to the continuing matching of species and genetic differences. Commercial barcodes must be uniform across shelves and warehouses. For animals, concentration on the single segment of the mitochondrial COI gene across the far wider shelves of life imparted the necessary uniformity to avoid a Tower of Babel. Conceiving the series of nucleotides CATG as bars and their presence and order as digital bars opened the door to rapid and unambiguous connection of specimens. Instead of connecting biological specimens to shelves and suppliers, the DNA barcode of life connects them to curated collections in museums and herbaria, lifting their utility. It would also connect specimens to the biological literature of binomial names.
DNA barcodes offer a globally consistent way to propose provisional or candidate species that experts have not yet honored with a full description and binomial name. Worries at first evoked by DNA barcoding have not been realized. It has heightened the nuance of the species concept, not diminished it. It has widened humanity’s view of diversity, not reduced diversity to ciphers. It has excited wonder at the knowledge hard-won through earlier techniques and accessible through the master key of binomial names. It has enhanced the need for systematists to match the flood of barcodes with a sound array of binomial names. Barcoding is not a mere slogan and an inadequate analogy. It is a now proven tool for understanding biodiversity. Recurring to the need for uniformity to avoid a Tower of Babel, the choice of a segment of the mitochondrial COI gene has excelled for almost all animal taxa. This barcode region meets four basic specifications: the locality must be present in all barcoded species; it must be shaved as short as possible; the locality must have sequences stable within a species through many generations; and it must nevertheless have sequences variable enough to distinguish species. As this book reports, botanists have now also found barcode regions that are proving successful from carrots and chamomile to oats and pines. Fungal barcodes are not far behind. Some observers do ask a single, searching question about the barcode of life arriving in environmental science: When will it be small, cheap, and convenient enough for nonexperts, even children? In particular, when will the needed equipment shrink to the size of a laptop or a handheld barcoder? In fact, even today the key machines have shrunk until they fit comfortably on a desk or tabletop. The analogy of the newly hired clerk faultlessly pricing the cart of goods suggests the ability to make taxonomic expertise go further, and very far if a handheld barcoder were present. 
Clues that such a goal will be achieved lie in reports of students detecting endangered marine species on sale in supermarkets, identifying insect traces in their homes, and analyzing tea leaves with inexpensive equipment. As well as enabling specialized scientists to do more and lift the value of specimen collections, barcoding promises to enable laymen to appreciate the diversity of life. The array of opportunities offered by DNA barcodes must rest on a sound foundation of binomial names with associated, vouchered, and identified specimens, housed in readily accessible museum collections. A sound foundation of binomials based on new and existing natural history collections stands as the first priority for the success of DNA barcodes. Fortunately, the Global Names Architecture project associated with the Encyclopedia of Life has already amassed 19 million common and scientific names and is reconciling them for the two million or so species estimated to be known already. Within 5 years, we could celebrate the achievement of the international Barcode of Life (iBOL) project: access to the barcodes of an array of five million specimens sequenced from 500,000 species. Voucher specimens, which are prepared, curated, databased, often digitally imaged, and stored in natural history collections, will support this effort. Already, in just a handful of years, the DNA barcode of life database (www.boldsystems.org) has soared above 1.2 million specimens from about 150,000 species. Already, as the chapters in this book show, the library of barcodes linked to names and curated specimens is multiplying the knowledge of a marine ecologist about a reef, the quality of surveillance for invasive species, and the accuracy of labeling of food products. Such successes will motivate and sustain the further building of the reference library of barcodes and the removal of obstacles for its quick, frugal realization and use.
Our vision, first inspired by a barcode wand in the hand of a supermarket clerk, is comparable magic for an ichthyologist on a research vessel with featureless fish larvae, a child on a woodland trail, or an inspector at a port infallibly identifying a species. Reading this book, we learn that science can make magic.

New York, NY, USA
Jesse H. Ausubel
Alfred P. Sloan Foundation

Reference 1. Patterson DJ, Cooper J, Kirk PM, Pyle RL, Remsen DP (2010) Names are key to the big new biology. Trends Ecol Evol 25:686–691.

Preface

The use of a universally accepted short DNA sequence for identification of species has been proposed for application across all forms of life. Such a “DNA barcode,” a term first coined less than a decade before the publication of the present book, in its simplest definition is one or more short gene sequences ( Data management > Data submission protocol > download blank data submission template).
3. Open the template in Microsoft Excel.
4. On the Voucher Info sheet, enter Sample IDs (see Note 8).
5. Museum ID and Collection Code are optional (see Note 9).
6. Institution Storing must be completed for each sample. This is the location of the voucher specimen, not of subsampled pieces of tissue; it can be a private collection (see Table 2), a museum, or a university.

DNA Barcodes for Insects

Table 2 Minimum data required for submission of specimen records to BOLD

Voucher info sheet                                              Taxonomy sheet   Collection data sheet
Sample ID   Field ID   Institution Storing                      Phylum           Country
QUI001      QUI001     Research collection of Carlisle Cullen   Arthropoda       USA
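The minimum fields in Table 2 can be checked before a spreadsheet is sent to BOLD. The following is a minimal sketch, not part of the BOLD tooling: it assumes the sheets have been exported to a single CSV file with one row per specimen, and the function name `find_incomplete_records` and the second example record (QUI002) are hypothetical.

```python
import csv
import io

# Column names taken from Table 2. The flattened CSV export is an assumption:
# the BOLD template itself is an Excel workbook with several sheets.
REQUIRED_COLUMNS = ["Sample ID", "Field ID", "Institution Storing", "Phylum", "Country"]


def find_incomplete_records(rows):
    """Return (record_number, missing_columns) for each record lacking a required value."""
    problems = []
    for number, row in enumerate(rows, start=1):
        missing = [c for c in REQUIRED_COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            problems.append((number, missing))
    return problems


# Hypothetical export: the first record is the example from Table 2,
# the second is an invented incomplete record.
example_csv = """Sample ID,Field ID,Institution Storing,Phylum,Country
QUI001,QUI001,Research collection of Carlisle Cullen,Arthropoda,USA
QUI002,QUI002,,Arthropoda,
"""

records = list(csv.DictReader(io.StringIO(example_csv)))
for number, missing in find_incomplete_records(records):
    print(f"record {number}: missing {', '.join(missing)}")
```

Only the presence of a value is tested here; whether BOLD accepts a given value is decided by BOLD on submission.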

7. Proceed to the Taxonomy sheet and complete as fully as possible.
8. Proceed to the Specimen Details sheet and enter all known information (see Note 10).
9. Proceed to the Collection Data sheet and enter all known information (see Note 11).
10. Save the file (File > Save).
11. Spreadsheets can be uploaded by sending them through e-mail to [email protected] and must contain data in the Sample ID, Field ID, Institution Storing, Phylum, and Country columns (Table 2).

3.4. Specimen Imaging

1. Make a pedestal by mounting a 15 cm piece of drinking straw vertically in a wooden base. Plug the top of the straw with modeling clay to allow single specimens to be pinned in the top (14).
2. Take pictures using the high-quality mode on your camera. If a fairly wide aperture (for shallow depth of field) is employed, background shadows will be negligible (14).
3. The specimen should be centered in the image frame.
4. Photos should be taken as close-up to the specimen as possible, leaving very little gap around the edges.
5. Use Landscape orientation.
6. Use 2 × 3 aspect ratio if possible. This will ensure that the images are not skewed when viewed in the BOLD image library.
7. If desired, a measurement scale may be included in the image to provide a size reference.
8. Use a standardized orientation (Table 3) as this makes it much easier to compare specimens within a project.

3.5. Submitting Images to BOLD

1. Create a folder on your desktop called Images and place in it all the image files (in .jpg format) you would like to upload.
2. To create a list of the files in the Images folder, open a terminal window (in Windows, click Start > Run and type “cmd”; a black window appears), navigate to the Images folder (see Note 12), and then run one of the following commands: Windows: dir > list.txt; MacOS: ls > list.txt.

J.J. Wilson

Table 3 Common standardized animal orientations for specimen imaging

Orientation   Explanation
Dorsal        The anterior of the specimen should be facing the top of the image frame. The specimen should be face-down, with the dorsal aspect of the head visible.
Lateral       The anterior of the specimen should be facing the left side of the image frame. The specimen should be oriented with the feet towards the bottom of the image.
Ventral       The anterior of the specimen should be facing the top of the image frame. The specimen should be face-up, with the ventral aspect of the head visible.

3. Download a blank image submission template from BOLD (from www.boldsystems.org click Documentation > Data management > Image submission protocol > please click here to download a blank image submission template). Save the file (ImageData.xls) in the Images folder on the desktop.
4. Open ImageData.xls in Microsoft Excel.
5. Next open list.txt (see Note 13) and move the data into the Image File column in ImageData.xls. Each cell in this column should contain the name of an image file including the extension (.jpg).
6. In the Original Specimen column type yes for original or no for not original.
7. In the View Metadata column choose one of the standard options from Table 3.
8. In the Caption column type any information you wish to appear by the image on BOLD.
9. Obtain the Sample IDs and Process IDs from BOLD by clicking on Data Spreadsheets under the Downloads menu on the left side of your Project Console (see Note 4). Choose to download the Progress Report, open the file bold.xls, and copy the data from the Sample ID and Process ID columns into the appropriate columns in ImageData.xls.
10. Once you have filled in all the mandatory columns (see Note 14), save the file (File > Save).
11. The folder Images needs to be zipped before submission to BOLD. Most modern operating systems have built-in functionality for zipping (see Note 15), so this simply requires right-clicking on the folder and selecting Compress “Images” or something similar.
12. Navigate to your BOLD project’s Project Console and under the Uploads menu on the left click Specimen Images. Browse through to Images.zip and click Submit (see Note 16).
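The file listing in step 2 and the zipping in step 11 can also be scripted. The sketch below is one possible way to do this with the Python standard library, not a BOLD requirement; the function names and the placeholder file names are hypothetical, and the archive keeps the Images folder as its top-level entry on the assumption that this matches what the operating system's Compress command produces.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def list_images(folder):
    """Return the sorted names of the .jpg files in `folder` (the intended content of list.txt)."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".jpg")


def zip_images(folder):
    """Zip `folder` next to itself (Images -> Images.zip) and return the archive path."""
    folder = Path(folder)
    return Path(shutil.make_archive(str(folder), "zip",
                                    root_dir=str(folder.parent), base_dir=folder.name))


# Demonstration on a throwaway folder holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "Images"
    images.mkdir()
    for name in ("QUI001_dorsal.jpg", "QUI001_lateral.JPG", "notes.txt"):
        (images / name).touch()

    names = list_images(images)      # values for the Image File column
    archive = zip_images(images)     # <tmp>/Images.zip
    with zipfile.ZipFile(archive) as zf:
        archived = zf.namelist()

print(names)
```

Unlike the shell redirection in step 2, `list_images` returns only image files, so neither list.txt nor ImageData.xls ends up in the Image File column.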

3.6. Tissue Subsampling

1. Make sure that your specimens are correctly organized. This protocol is appropriate for pinned specimens stored in a drawer, specimens stored in ethanol tubes, and specimens stored in glassine envelopes. Print off a list of specimens and Specimen IDs so you can double-check they are going into the correct well as you go (e.g., QUI001 goes in A1, QUI002 goes in B1, …).
2. Use clean gloves.
3. Clean workspace and wipe bench with Eliminase.
4. Work on top of a KimWipe.
5. Get a microplate. These instructions are for a 96-well microplate; however, the procedure is essentially the same when working with single tubes. This is going to be the microplate in which the lysis takes place. Make sure the microplate is in the correct orientation, e.g., A1 is in the top right corner of the plate (Fig. 1).
6. It is worth assigning two wells as control wells (Fig. 1) at this point.
7. Put cap-strips on all rows.
8. Turn on the gas slowly and light the Bunsen burner, so that there is a small blue flame. When it is not possible or is dangerous to use a Bunsen or gas burner, an alternative Eliminase dip protocol can be just as effective (see Note 17) and can be easily adapted for use in the field.

Fig. 1. Diagram detailing how to organize and fill a microplate with tissue, including designation of control wells.
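The printed checklist of specimens and wells can be generated rather than typed. This is a minimal sketch under stated assumptions: wells are filled column by column, following the example in the text (QUI001 in A1, QUI002 in B1); the positions of the two control wells are not recoverable from the text, so G12 and H12 are hypothetical choices passed in as a parameter, as is the function name `plate_layout`.

```python
from string import ascii_uppercase

ROWS = ascii_uppercase[:8]    # plate rows A-H
COLUMNS = range(1, 13)        # plate columns 1-12


def plate_layout(sample_ids, control_wells):
    """Map wells to Sample IDs, filling column by column and skipping control wells."""
    wells = [f"{row}{col}" for col in COLUMNS for row in ROWS]   # A1, B1, ..., H12
    unknown = set(control_wells) - set(wells)
    if unknown:
        raise ValueError(f"not on a 96-well plate: {sorted(unknown)}")
    free = [w for w in wells if w not in control_wells]
    if len(sample_ids) > len(free):
        raise ValueError("more samples than free wells")
    return dict(zip(free, sample_ids))


# Hypothetical run: 94 specimens, with G12 and H12 reserved as controls.
samples = [f"QUI{n:03d}" for n in range(1, 95)]
layout = plate_layout(samples, control_wells=("G12", "H12"))
print(layout["A1"], layout["B1"], layout["A2"])   # QUI001 QUI002 QUI009
```

If your control wells sit elsewhere on the plate, pass those positions instead; the remaining wells are still filled in the same order.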

9. Remove the first cap-strip (row A) from the microplate, and place it on a KimWipe. Take out the first specimen.
10. Take the forceps and dip them in ethanol (carefully shaking off any excess, not near the flame) and put them in the flame for a few seconds to burn off the ethanol.
11. Remove a small piece of tissue from the specimen (about a 2–3-mm-long piece of insect leg) and place it in the first well (A1) in the lysis plate.
12. It is also possible to use whole specimens in the lysis (17) or whole abdomens, in the case of a combined lysis/genitalia dissection (18) (see Note 18).
13. Continue on to the next specimen and well, making sure to sterilize the forceps between each sample with ethanol and flame. Place each specimen back in its drawer/tube/envelope before moving onto the next one.
14. Put the cap-strip back on as you finish a row and then carefully take the cap-strip off the next row.
15. Return your specimens to the freezer or cabinet.
16. For specimens in ethanol, the subsampled tissue needs to be completely dried before moving on to the next stage. Incubate at 56°C for 30 min, with the cap-strips slightly loosened, to evaporate residual ethanol.

3.7. Tissue Lysis

1. For one plate, mix 5 ml of Insect Lysis Buffer and 0.5 ml of Proteinase K (20 mg/ml) in a sterile container (see Note 19 for single tubes).
2. Carefully remove all the cap-strips from the microplate (prepared as above). These instructions are for a microplate of tissue containing 96 wells (Fig. 1) but the procedure is the same for single tubes.
3. Add 50 μl of Lysis Mix to each well using a multichannel pipette. If you are careful not to touch the microplate with the tips, you can use the same tips right across the microplate.
4. Cover microplate with cap-strips.
5. Incubate at 56°C for a minimum of 6 h or overnight to allow digestion. It is not necessary to shake the microplate during incubation.
6. Centrifuge at 1,500 × g for 15 s to remove any condensate from the cap-strips (see Note 20).
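For partial plates, the Lysis Mix can be scaled from the figures above: 50 μl per well, and a 10:1 ratio of Insect Lysis Buffer to Proteinase K (5 ml to 0.5 ml for one plate). The sketch below is an illustration only; the function name `lysis_mix` and the default 10% overage are assumptions, and Note 19 remains the reference for single tubes.

```python
def lysis_mix(n_wells, per_well_ul=50.0, overage=0.10):
    """Return (buffer_ul, proteinase_k_ul) for `n_wells`, keeping the 10:1 plate ratio.

    `overage` is a hypothetical allowance for pipetting loss, not a value
    from the protocol.
    """
    total_ul = n_wells * per_well_ul * (1.0 + overage)
    proteinase_k_ul = total_ul / 11.0          # 1 part enzyme ...
    buffer_ul = total_ul - proteinase_k_ul     # ... to 10 parts buffer
    return buffer_ul, proteinase_k_ul


buffer_ul, pk_ul = lysis_mix(24)
print(f"24 wells: {buffer_ul:.0f} ul buffer + {pk_ul:.0f} ul Proteinase K")
```

For comparison, a full plate needs 96 × 50 μl = 4.8 ml, so the 5.5 ml plate recipe carries roughly 15% spare volume.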

3.8. High-Throughput DNA Extraction (Ivanova et al. (19))

1. Retrieve your microplate from the lysis stage above and remove cap-strips. Add 100 μl of Binding Mix to each sample using a multichannel pipette. Cover plate with new cap-strips. Shake vigorously for 10–15 s and centrifuge at 1,000 × g for 20 s to remove any sample from the cap-strips.

2. Remove cap-strips and transfer the lysate (about 150 μl) from the wells into the wells of a GF plate placed on top of a square-well block using a multichannel pipette. Seal the plate with a self-adhering cover.
3. Centrifuge at 5,000 × g for 5 min to bind DNA to the GF membrane.
4. First wash step: Add 180 μl of Protein Wash Buffer to each well of the GF plate (see Note 21). Seal with a new cover and centrifuge at 5,000 × g for 2 min.
5. Second wash step: Add 750 μl of Wash Buffer to each well of the GF plate (see Note 22). Seal with a new self-adhering cover and centrifuge at 5,000 × g for 5 min.
6. To avoid incomplete Wash Buffer removal, open the sealing cover, close it, and centrifuge the GF plate again for 5 min at 6,000 × g.
7. Remove the self-adhering cover. Place the GF plate on the lid of a tip box (see Note 23). Incubate at 56°C for 30 min to evaporate residual ethanol.
8. Position a PALL collar on a new microplate and place the GF plate on top. Dispense 30–60 μl of ddH2O (pre-warmed to 56°C) directly onto the membrane in each well of the GF plate and incubate at room temperature for 1 min. Seal plate.
9. Place the assembled plates on a clean square-well block to prevent cracking of the collection plate and centrifuge at 5,000 × g for 5 min to collect the DNA eluate. Remove the GF plate and discard it.
10. Cover the DNA microplate with cap-strips or aluminum PCR foil. This is your DNA; it can be stored temporarily at 4°C, or at −20°C for long-term storage. Label it well.

3.9. Archival Specimen DNA Extraction (See Note 24)

1. Vortex the sample from lysis stage (about 150 μl) for 15 s (see Note 25).
2. Add 200 μl Buffer AL and vortex it (a white precipitate will most likely form). Add 200 μl EtOH 96% and vortex until it is homogeneous (there should be a lot less white precipitate).
3. Pipette the liquid (set the pipette to 650 μl) into a spin column, make sure to label the cap, and centrifuge it at 6,000 × g for 1 min.
4. Discard the collection tube (see Note 26) and put the spin column into a new tube. Add 500 μl Buffer AW1 and centrifuge at 6,000 × g for 1 min.
5. Discard the collection tube and put the spin column into a new tube. Add 500 μl Buffer AW2 and centrifuge at 20,000 × g for 5 min.

6. Discard the collection tube and remove the spin column carefully so as not to let it touch the liquid. Put the spin column into a 1.5 or 1.7-ml microcentrifuge tube (see Note 27) and label it.
7. Add 100 μl Buffer AE (elution buffer) and let it sit at room temperature for 1 min. Centrifuge at 6,000 × g for 1 min.
8. Pipette the DNA out of the bottom of the microcentrifuge tube (about 100 μl), place it back into the spin column, and place the spin column back in the microcentrifuge tube. Centrifuge for an additional 1 min at 6,000 × g.
9. The liquid in the bottom of the microcentrifuge tube is your DNA. It can be stored temporarily at 4°C, or at −20°C for long-term storage. Label it well.

3.10. Designing PCR Primers

1. Primers should be between 20 and 30 nt in length.
2. Avoid complementarity within and between primers.
3. The GC content should be approximately 50%.
4. Avoid mono- or dinucleotide repetition within primers.
5. The primer should end on a G or a C.
6. Primers should end on the second (or first if necessary) position of a codon.
7. The melting temperatures of primer pairs should be within 5°C of one another.
8. To design COI primers for a particular taxonomic group, try aligning as many COI genes from closely related taxa as possible (try surfing GenBank, http://www.ncbi.nlm.nih.gov/genbank/) for the desired species group. Design primers that are situated in regions that are conserved across all taxa.
9. Remember to target the “barcode” region [i.e., overlapping with the region targeted by the Folmer primers (20) (Table 4)].
10. Primers can be tailed with M13 tails (Table 4) to improve amplification success (16) and facilitate high-throughput sequencing protocols (21). However, some tailed versions can form strong primer dimers, reducing PCR efficiency (e.g., LepF1 and LepR1).
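Several of these guidelines (length, GC content, the 3' base, simple repeats, and matched melting temperatures) are easy to screen automatically. The sketch below is a rough illustration under stated assumptions: the melting temperature uses the simple Wallace rule (2°C per A/T, 4°C per G/C), which is only approximate for primers of this length; the repeat thresholds and the GC tolerance are hypothetical choices; and complementarity and codon position are not checked.

```python
import re

LEPF1 = "ATTCAACCAATCATAAAGATATTGG"    # sequences from Table 4
LEPR1 = "TAAACTTCTGGATGTCCAAAAAATCA"


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in deg C: 2 per A/T plus 4 per G/C (Wallace rule)."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 2 * (len(primer) - gc) + 4 * gc


def check_primer(primer, gc_tolerance=0.10):
    """Return the guideline violations found for one primer (empty list if none)."""
    primer = primer.upper()
    issues = []
    if not 20 <= len(primer) <= 30:
        issues.append("length outside 20-30 nt")
    if abs(gc_content(primer) - 0.5) > gc_tolerance:       # tolerance is an assumption
        issues.append("GC content far from 50%")
    if primer[-1] not in "GC":
        issues.append("3' end is not G or C")
    if re.search(r"([ACGT])\1{4,}", primer):               # 5 or more identical bases
        issues.append("mononucleotide run")
    if re.search(r"([ACGT]{2})\1{3,}", primer):            # 4 or more dinucleotide repeats
        issues.append("dinucleotide repeat")
    return issues


def tm_compatible(forward, reverse, max_difference=5):
    """True if the two Wallace-rule melting temperatures are within `max_difference` deg C."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_difference


for name, seq in (("LepF1", LEPF1), ("LepR1", LEPR1)):
    print(name, f"GC={gc_content(seq):.2f}", f"Tm~{wallace_tm(seq)}", check_primer(seq))
print("pair within 5 deg C:", tm_compatible(LEPF1, LEPR1))
```

Run on LepF1 and LepR1, the screen flags low GC content for both, which shows that the guidelines describe an ideal rather than a strict filter: primers in routine use for AT-rich insect mitochondrial DNA do not meet all of them.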

3.11. PCR Set-Up

1. Prepare PCR master mix either for a single tube or 96-well microplate following the recipe in Table 5, where details on ingredient preparation are also provided.
2. We use LepF1 and LepR1 (Table 4) as the primer pair in a first amplification attempt (see Note 28).
3. Remember as above to wear clean gloves, clean benches with Eliminase, and work on top of a KimWipe. Also work in a cold block if possible.

Table 4 Common primers used for DNA barcoding insects

Name         Sequence                     Use with   Direction   References
LCO1490      GGTCAACAAATCATAAAGATATTGG    HCO2198    F           (20)
HCO2198      TAAACTTCAGGGTGACCAAAAAATCA   LCO1490    R           (20)
LepF1        ATTCAACCAATCATAAAGATATTGG    LepR1      F           (24)
LepR1        TAAACTTCTGGATGTCCAAAAAATCA   LepF1      R           (24)
MLepF1       GCTTTCCCACGAATAAATAATA       LepR1      F           (25)
MLepR1       CCTGTTCCAGCTCCATTTTC         LepF1      R           (25)
M13F (-21)   TGTAAAACGACGGCCAGT                      F           (26)
M13R (-27)   CAGGAAACAGCTATGAC                       R           (26)

Table 5 Basic recipe for PCR

Amounts of ingredient are in μl.

Ingredient                         Single tube   96-well microplate   Ingredient preparation
10% Trehalose                      6.25          625                  Dissolve 5 g D-(+)-trehalose dihydrate in 50 ml of total volume of molecular grade ddH2O. Store at −20°C
ddH2O                              2             200                  Store at 4°C
10× Buffer                         1.25          125                  10× PCR Buffer for Platinum Taq. Store at −20°C
50 mM MgCl2                        0.625         62.5                 50 mM MgCl2. Store at −20°C
10 mM dNTPs                        0.0625        6.25                 10 mM dNTPs mix. Store at −20°C in 100 μl aliquots
10 μM F Primer working solution    0.125         12.5                 Add 20 μl of 100 μM primer stock to 180 μl ultrapure ddH2O. Store at −20°C
10 μM R Primer working solution    0.125         12.5                 Add 20 μl of 100 μM primer stock to 180 μl ultrapure ddH2O. Store at −20°C
Taq (5 U/μl)                       0.06          6                    Platinum Taq polymerase. Store at −20°C in 50 μl aliquots
Total                              10.5          1,050
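The 96-well column of Table 5 is the single-tube recipe multiplied by 100, which leaves a small surplus over 96 wells. The sketch below scales the recipe to any number of reactions. It takes the volumes as microlitres, consistent with the 10.5 μl of mix dispensed per 12.5 μl reaction in the set-up steps; the dictionary keys and the function name `scale_mix` are hypothetical.

```python
# Single-reaction volumes in microlitres, from Table 5.
PCR_MIX_UL = {
    "10% trehalose": 6.25,
    "ddH2O": 2.0,
    "10x buffer": 1.25,
    "50 mM MgCl2": 0.625,
    "10 mM dNTPs": 0.0625,
    "10 uM forward primer": 0.125,
    "10 uM reverse primer": 0.125,
    "Taq (5 U/ul)": 0.06,
}


def scale_mix(recipe, n_reactions):
    """Multiply every single-reaction volume by `n_reactions`."""
    return {name: volume * n_reactions for name, volume in recipe.items()}


plate_mix = scale_mix(PCR_MIX_UL, 100)    # reproduces the 96-well column of Table 5
for name, volume in plate_mix.items():
    print(f"{name:<22}{volume:>8.2f} ul")
print(f"{'total':<22}{sum(plate_mix.values()):>8.2f} ul")
```

The exact total is 1,049.75 μl, which the table rounds to 1,050 μl. The same function can be applied to a cycle sequencing recipe by passing a different dictionary.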

4. Label your mix tube and microplate (see Note 29).
5. Return PCR ingredients to the freezer.
6. Mixes in tubes can be stored at −20°C for up to 3 months (1–3 freeze–thaw cycles do not affect performance). The content of a tube should be mixed by pipetting before use.
7. For microplate (see Note 30): Aliquot 1/8 of total mix volume to each of the tubes in an 8-tube PCR strip (see Note 31). Dispense desired volume (10.5 μl for 12.5 μl reactions) into each well of the 96-well plate using a multichannel pipette.
8. Retrieve your DNA plate/tube from the fridge. Add 1–2 μl of DNA extract (see Note 32) to each tube/well (see Note 33). Seal and return DNA. Seal microplate with self-adhering aluminum foil (for PCR) or close the tube.
9. Centrifuge the microplate/tube at 1,000 × g for 20 s and start thermocycling.

3.12. PCR Thermocycle Program

1. Typical conditions for COI amplification include the initial denaturation at 94°C for 1 min; five cycles of 94°C for 30 s, annealing at 45–50°C for 40 s, and extension at 72°C for 1 min; followed by 30–35 cycles of 94°C for 30 s, 51–54°C for 40 s, and 72°C for 1 min; with a final extension at 72°C for 10 min, followed by indefinite hold at 4°C.
2. Centrifuge the microplate/tube at 1,000 × g for 20 s.
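Writing the program down as data makes it easier to compare variants and to estimate how long a run occupies the thermocycler. The sketch below encodes the conditions above. The class and function names are hypothetical, the default annealing temperatures (45°C and 51°C) are simply the lower ends of the stated ranges, and the estimate counts programmed hold times only, ignoring ramping and the final 4°C hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    temp_c: float
    seconds: int


@dataclass(frozen=True)
class Stage:
    steps: tuple
    cycles: int = 1

    def seconds(self):
        """Programmed hold time of the whole stage, in seconds."""
        return self.cycles * sum(step.seconds for step in self.steps)


def coi_program(low_anneal_c=45, high_anneal_c=51, main_cycles=35):
    """Build the two-phase COI program described in the text."""
    return [
        Stage((Step(94, 60),)),                                             # initial denaturation
        Stage((Step(94, 30), Step(low_anneal_c, 40), Step(72, 60)), cycles=5),
        Stage((Step(94, 30), Step(high_anneal_c, 40), Step(72, 60)), cycles=main_cycles),
        Stage((Step(72, 600),)),                                            # final extension
    ]


program = coi_program()
total_seconds = sum(stage.seconds() for stage in program)
print(f"programmed hold time: {total_seconds / 60:.1f} min")
```

With 35 main cycles the programmed holds add up to 5,860 s, a little under 98 min; real run times are longer because of temperature ramping.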

3.13. High-Throughput PCR Check

1. Precast agarose gels (E-gels) and the docks (E-bases) they run on are available from Invitrogen™. This system is bufferless, so exposure to Ethidium Bromide is minimized. However, gloves should be worn when handling and loading the gel.
2. The recommended program for 2% Agarose E-gel® 96 gel is EG and the run time is 6 min.
3. Plug the Mother E-Base™ into an electrical outlet. Press and release the pwr/prg (power/program) button on the base to select program EG.
4. Remove gel from the package and remove plastic comb from the gel.
5. Slide gel into the two electrode connections on the Mother or Daughter E-Base™.
6. Load 16 μl of ddH2O into wells with a 12-channel pipettor.
7. Load 4 μl of sample from your PCR microplate.
8. To begin electrophoresis, press and release the pwr/prg button on the E-Base™. The red light changes to green.
9. At the end of run (signaled with a flashing red light and rapid beeping), press and release the pwr/prg button to stop the beeping.

10. Remove gel cassette from the base and capture a digital image of the gel on a UV transilluminator equipped with a digital camera.
11. As a rough guide, set the filter to two for Ethidium Bromide and the exposure time to 2 s.
12. Analyze the image and align or arrange lanes in the image using the E-editor™ 2.0 software available at: http://tools.invitrogen.com/egels/.
13. White bands indicate product; square slots are the loading wells.

3.14. PCR Check: Important and Old Specimens

1. This protocol requires you to make the gel yourself, which is more time consuming but cheaper on materials, and it produces gels that are more sensitive to product.
2. Gel should be ~5 mm thick: measure the size of the gel tray and determine the volume of liquid you will need to make a 5-mm thick gel (e.g., for a gel tray measuring 10 × 20 cm you would need to start with 20 × 10 × 0.5 = 100 ml of 1× TAE buffer, see Note 34).
3. Make sure that your tray is on a flat surface with tape securely on the sides.
4. Tape the edges of the tray so that it will hold liquid.
5. Measure out the agarose powder onto a piece of weigh paper using a metal spatula. The amount of agarose powder that you need depends on the percentage of the gel. Generally, 2% gels are best (e.g., to make a 2% agarose gel for the tray that takes 100 ml of buffer you need 2 g of agarose powder, since 2% w/v is 2 g per 100 ml).
6. Add the agarose to a large beaker or Erlenmeyer flask. Add the 1× TAE buffer to the agarose.
7. Place the flask in the microwave on high power for 30 s. Gently swirl the flask using heat resistant gloves. Heat for another few seconds until the agarose has dissolved.
8. Be very careful because the agarose could burn you.
9. Let the flask sit for 5 min on the lab bench at room temperature to cool.
10. Place a comb of desired well width into the tray. Pour the hot liquid into the middle of the tray, trying to avoid creating bubbles. Push any large bubbles to the edges of the tray using a clean pipette tip.
11. Allow the gel to cool for 30 min. Remove the tape from the edges.
12. Set up the gel rig.
13. The liquid in the base should be 1× TAE buffer (see Note 35).
14. The gel should now be set and you can remove the tape from the edges of the gel tray.
15. Slowly lower the tray into place in the gel rig.

16. Add more 1× TAE buffer to the gel rig until the gel is completely submerged.
17. Carefully remove the comb from the gel by rocking it back and forth while pulling up slowly.
18. Cut a piece of parafilm and place flat on the lab bench. For every PCR product you will be adding to the gel, place a 1 μl drop of loading dye (see Note 36) onto the parafilm. Take your PCR product and add 6 μl to one of the droplets of dye. Using the same pipette tip draw up the PCR product/dye droplet.
19. With a steady hand, add this to a well in the gel. With multiple samples be sure to keep track of which well is holding which sample.
20. The loading dye makes the product heavy so it will sink to the bottom of the well. You can hold the tip directly above the well without entering the gel. When you add the samples to the wells be careful not to poke a hole in the gel.
21. Always run a DNA ladder in a well beside your samples. A ladder of 100 bp would be appropriate and you should add 1 μl of ladder for every 5 mm of well width.
22. Close the top of the gel rig.
23. Remember to have the black electrode near the wells and the red electrode at the opposite end of the gel. DNA runs towards the positive electrode.
24. Run the gel with the rig set to 150 V.
25. The loading dye forms two bands that you can see; wait until they have moved close to the bottom of the gel, then turn off the rig (approximately 20 min).
26. Carefully transfer the gel from the tray into a plastic container for staining. Pour in diluted GelRed (see Note 37) until it covers the gel. Let it sit with moderate manual mixing for 20 min.
27. Pour the GelRed back into the bottle carefully using the funnel.
28. Capture a digital image of the gel on a UV transilluminator equipped with a digital camera, usually located in your institution’s dark room.
29. As a rough guide set the filter to two for Ethidium Bromide and the exposure time to 2 s.
30. Print and save the image.

3.15. Cycle Sequencing Set-Up

1. When sequencing PCR product, you sequence in both forward and reverse directions. This is done with two different reactions and each reaction mix should include only a forward or reverse primer, not both. For example, for each microplate of PCR product, two microplates must be set up for sequencing, one with the forward primer and one with the reverse primer (see Note 38).


Table 6
Basic recipe for cycle sequencing

                                  Amount of ingredient (μl)
Ingredient                        Single tube    96-well microplate
Dye terminator mix v3.1           0.25           26
5× ABI sequencing buffer          1.875          195
10% Trehalose                     5              520
10 μM Primer working solution     1              104
ddH2O                             0.875          91
Total                             9              936
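The 96-well column of Table 6 appears to be the single-tube recipe multiplied by 104, i.e., 96 reactions plus an overage of 8 to allow for pipetting losses. The Python sketch below reproduces that scaling so the recipe can be adapted to other batch sizes. The function name and the default overage are assumptions made for illustration and are not part of the published protocol.

```python
# Single-tube volumes (in μl) taken from Table 6.
SINGLE_TUBE_UL = {
    "Dye terminator mix v3.1": 0.25,
    "5x ABI sequencing buffer": 1.875,
    "10% Trehalose": 5.0,
    "10 uM primer working solution": 1.0,
    "ddH2O": 0.875,
}


def scale_master_mix(reactions, overage=8):
    """Return volumes (μl) for `reactions` wells plus `overage` spare reactions.

    The overage of 8 is inferred from Table 6 (936 / 9 = 104 = 96 + 8);
    treat it as an assumption and adjust to your own pipetting losses.
    """
    factor = reactions + overage
    return {name: volume * factor for name, volume in SINGLE_TUBE_UL.items()}


plate = scale_master_mix(96)
for name, volume in plate.items():
    print(f"{name:32s}{volume:8.1f}")
print(f"{'Total':32s}{sum(plate.values()):8.1f}")
```

Calling scale_master_mix(96) gives the 96-well column of Table 6; a smaller batch, e.g., scale_master_mix(24, overage=2), is scaled in the same way.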

2. Prepare cycle sequencing master mix either for a single tube or a 96-well microplate following the recipe in Table 6 and the details on ingredient preparation in Table 5.
3. Remember, as above, to wear clean gloves, clean benches with Eliminase, and work on top of a KimWipe. Also work in a cold block if possible.
4. Label your mix tube and any microplates (see Note 29).
5. Return cycle sequencing ingredients to the freezer.
6. Mixes in tubes (or pre-made plates, see Note 39) can be stored at −20°C for up to 3 months (see Note 40). The content of a tube should be mixed by pipetting before use.
7. For a microplate: aliquot 1/8 of the total mix volume (115 μl) into each of the tubes in an 8-tube PCR strip (see Note 31). Dispense the desired volume (9 μl) into each well of the 96-well plate using a multichannel pipette, changing tips after every row (see Note 41).
8. Retrieve your PCR product from the fridge. Add 1.5 μl of PCR product (see Note 42) to each tube/well (see Note 33). Seal and return the PCR plate. Seal the cycle sequencing microplate with self-adhering aluminum foil (for PCR) or close the tube.
9. Centrifuge the microplate/tube at 1,000 × g for 20 s and start thermocycling.

3.16. Cycle Sequencing Thermocycle Program

1. Denaturation at 96°C for 2 min. 2. Thirty cycles of 96°C for 30 s, annealing at 55°C for 15 s. 3. Additional extension at 60°C for 4 min. 4. Indefinite hold at 4°C (see Note 43).
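For scheduling, the program above can be written down as data and its minimum duration added up. This is a sketch only: it encodes the four steps exactly as listed, ignores the ramp times between temperatures (which depend on the thermocycler), and all variable names are illustrative.

```python
# Each step is (label, temperature in °C, hold time in seconds).
INITIAL = [("denaturation", 96, 120)]
CYCLE = [("denaturation", 96, 30), ("annealing", 55, 15)]
N_CYCLES = 30
FINAL = [("extension", 60, 240)]


def hold_seconds(steps):
    """Sum the hold times of a list of steps."""
    return sum(seconds for _label, _temp, seconds in steps)


# Total programmed hold time before the indefinite 4 °C hold.
total_s = hold_seconds(INITIAL) + N_CYCLES * hold_seconds(CYCLE) + hold_seconds(FINAL)
print(f"Programmed hold time before the 4 °C hold: {total_s / 60:.1f} min")
```

The real run will take longer than this figure because of heating and cooling between steps.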


3.17. Sequencing Clean-Up and Analysis (Ivanova and Grainger (22))

1. Sequencers should be operated by specially trained technicians. Many sequencing facilities exist, and they will often require the cycle sequencing microplate, a supply of sequencing primer, and a plate record (Table 7).
2. Measure dry Sephadex® G-50 with the MultiScreen® Column Loader into the Acroprep™ 96 Filter plate with 0.45 μm GHP membrane.
3. Hydrate the wells with 300 μl of ddH2O.
4. Let the Sephadex® hydrate overnight in the fridge or for 3–4 h at room temperature before use.
5. Put the Acroprep™ plate together with a MicroAmp® Optical 96-well Reaction Plate and secure with at least two rubber bands.
6. Make sure the two sets weigh the same (adjust weight by using different rubber bands).
7. Centrifuge at 750 × g for 3 min to drain the water from the wells. Discard the water from the MicroAmp® plates (these plates can be reused for the same procedure without autoclaving).
8. Add the entire volume of the cycle sequencing reaction to the center of the Sephadex® columns.
9. Add 25 μl of 0.1 mM EDTA pH 8.0 to each well of the new (or autoclaved) MicroAmp® plate.

Table 7
Example of a plate record for a 3730xl DNA Analyzer (Applied Biosystems)

Container name   Plate ID     Description    Container type   App type   Owner   Operator   Plate sealing   Schedule pref
LOP Plate1       LOP Plate1   COI-Barcodes   96-Well          Regular    CCDB    CCDB       Septa           1234

AppServer   AppInstance

Well   Sample name   Comment   Results group 1   Instrument Protocol 1   Analysis Protocol 1
A01    LOP001-11     LepF1     CC                FolA700                 3730BDTv3-KBDeNovo_v5.1
B01    LOP013-11     LepF1     CC                FolA700                 3730BDTv3-KBDeNovo_v5.1


10. To elute DNA, attach the MicroAmp® plate to the bottom of the Acroprep™ plate; secure them with tape and with rubber bands.
11. Make sure the sets weigh the same (adjust weight by using different rubber bands).
12. Centrifuge at 750 × g for 3 min.
13. Remove the MicroAmp® plate and cover its top with Septa.
14. Place the MicroAmp® plate into the black plate base and attach the white plate retainer.
15. Stack the assembled plate in the 3730xl DNA Analyzer (Applied Biosystems); do not forget the barcode and plate record.
16. Discard the Sephadex® from the Acroprep™ plate.
17. Using the Plate Manager of the Data Collection software (Applied Biosystems), import the plate record(s) for the plate being run.
18. Begin the run within Run Scheduler.

3.18. Uploading Raw Sequences to BOLD

1. The sequencing run outputs a folder of files. The files you are interested in have the extension .ab1, e.g., LOP001-11_F.ab1. These raw files (traces) can be edited into the form in which we are used to seeing DNA sequences represented, i.e., a string of letters. However, as editing can be a subjective task, BOLD also requires that the raw files (traces) be uploaded as part of a barcode's collateral data.
2. To add trace files of your new sequences to the appropriate records in BOLD, create a folder on your desktop called Traces and place in it all the .ab1 files that you would like to upload.
3. To create the list of files in the Traces folder, open a terminal window (in Windows, go to Start > Run and type "cmd"; a black box will appear), navigate to the Traces folder (see Note 44), and then run one of the following commands: Windows: dir > list.txt; MacOS: ls > list.txt.
4. Download a blank trace submission template from BOLD (from www.boldsystems.org click Documentation > Data management > Trace submission protocol > please click here to download a blank trace submission template). Save the file (data.xls) in the Traces folder on the desktop.
5. Open data.xls in Microsoft Excel.
6. You can then open list.txt and move the data into the Filename (.ab1) column in data.xls (see Note 45). Each cell in this column should contain the name of a trace file including the extension (.ab1) (see Note 46).
7. In the FORWARD PCR PRIMER column, enter the registered name of the forward primer used during the PCR. Copy it down through the entire column to the end of your file list.


Table 8
Example of file data.xls

A                  B                    C                    D                    E                   F                G            J
Filename (.ab1)    Score file (.phd1)   Forward PCR Primer   Reverse PCR Primer   Sequencing Primer   Read direction   Process ID   Marker
LOP001-11_F.ab1                         LepF                 LepR                 LepF                Forward          LOP001-11a   COI-5P
LOP001-11_R.ab1                         LepF                 LepR                 LepR                Reverse          LOP001-11b   COI-5P

a Formula typed into this cell is =left(A2,9)
b Formula in this cell is =left(A3,9)

8. In the REVERSE PCR PRIMER column, enter the registered name of the reverse primer used during the PCR. Copy it down through the entire column to the end of your file list.
9. In the SEQUENCING PRIMER column, enter the registered name of your sequencing primer. It will alternate between forward and reverse. For example, LepF1 should line up with the read direction Forward (Table 8).
10. In the Read Direction column, enter Forward or Reverse depending on the direction of the .ab1 file it refers to (see Note 47).
11. In the Process ID column, type in the formula "=left(A2, $)" where A2 is the cell containing your first .ab1 file name and $ is the number of characters in the Process ID. For example, LOP001-11 has nine characters ($ = 9). $ may be more or less depending on the number of letters in the Project Code.
12. Press Control and A to select the entire page. Press Control and C to copy the page, then go to Edit > Paste Special, choose Values, and press OK. This removes the formulas from your sheet.
13. Save this file under the name data.xls in your Traces folder.
14. Delete list.txt from your Traces folder.
15. The Traces folder needs to be zipped before submission to BOLD. See Subheading 3.5 for details on how to do this. Save as Traces.zip.
16. Navigate to your BOLD project's Project Console and, under the Uploads menu on the left, click Trace Files. Browse through to Traces.zip and click Submit (see Note 48).
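The file listing and the derived columns described in the steps above can also be scripted. The Python sketch below lists the .ab1 files in a folder and fills in the Sequencing Primer, Read Direction, and Process ID columns from file names of the form LOP001-11_F.ab1. It is an illustration under stated assumptions, not part of the BOLD workflow: it writes a CSV file whose contents would still need to be pasted into the BOLD template, it splits the name at the underscore rather than taking a fixed number of characters as the =left() formula does, and the function names, the output file name, and the default primer names (LepF1, LepR1) are hypothetical choices.

```python
import csv
from pathlib import Path

# Column headings as they appear in Table 8; check positions against the template.
COLUMNS = ["Filename (.ab1)", "Score file (.phd1)", "Forward PCR Primer",
           "Reverse PCR Primer", "Sequencing Primer", "Read direction",
           "Process ID", "Marker"]


def trace_row(filename, fwd="LepF1", rev="LepR1", marker="COI-5P"):
    """Build one sheet row from a name such as 'LOP001-11_F.ab1'."""
    stem = Path(filename).stem                   # 'LOP001-11_F'
    process_id, _, tag = stem.rpartition("_")    # split at the last underscore
    if tag not in ("F", "R") or not process_id:
        raise ValueError(f"unexpected trace file name: {filename}")
    forward = tag == "F"
    return [filename, "", fwd, rev, fwd if forward else rev,
            "Forward" if forward else "Reverse", process_id, marker]


def write_trace_sheet(folder, out_name="data.csv"):
    """Write one row per .ab1 file in `folder`; return the number of rows."""
    folder = Path(folder)
    rows = [trace_row(path.name) for path in sorted(folder.glob("*.ab1"))]
    with open(folder / out_name, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(rows)
    return len(rows)


print(trace_row("LOP001-11_R.ab1"))
```

Splitting at the underscore means the script does not need to know how many letters the Project Code has, but it does rely on every file following the Process ID plus direction naming pattern; any other name raises an error rather than producing a wrong row.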

3.19. Sequence Editing

1. Open CodonCode (http://www.codoncode.com), choose Create a new project, and press OK.
2. Go to File > Import > Add Folder > Traces then press Import.
3. To see the files you just imported, press ► beside the Unassembled Samples folder.
4. Your .ab1 files should be of the form "LOP001-11_F" where the first part "LOP001-11" refers to the Process ID and the second part "F" refers to the sequence direction, i.e., Forward.
5. Sort files by quality by double-clicking on Quality. Highlight any sequences that are of very poor quality or of short length and click the trashcan to delete them.
6. Next select the Contig menu and move the cursor over Advanced Assembly. From the options that appear, select Assemble in Groups.
7. A window will appear asking if you would like to Define sample name parts? Choose Define names… to bring up another small window.
8. There are two parts to our filenames. The first will be the Process ID and, for your purposes, the option in the Meaning menu can be left as Clone. Since the Process ID is followed by an underscore, choose _ (underscore) in the Delimiter menu for Clone.
9. For the second part, choose Direction in the Meaning menu. We can ignore the Delimiter for the Direction part because there is no actual delimiter following the direction.
10. Delete all the additional parts that may appear in this window.
11. Next click Preview… to check how the aligner is interpreting the sample names. Click Close to exit the preview.
12. Click OK to return to the Assemble in Groups window.
13. In order to assemble our files according to direction, choose Direction in the Name Part section. Then click Assemble.
14. You should now have two folders, one called F with the forward sequences and one called R with the reverse sequences.
15. Next you need to cut the primers from your sequences. Highlight the R folder and reverse-complement the sequences using the button with three black arrows on it.
16. Double click the R folder to open it. For the reverse sequences, you need to find the forward primer motif (e.g., LepF1) and delete it from the beginning of the consensus sequence. You will find the primer around 50 nucleotides from the end of the raw sequence. For example, in Fig. 2a, you would need to delete the sequence marked in bold and everything to the left of it.


Fig. 2. An example of sequence editing.

17. When you have located the primer, highlight it on the consensus sequence at the bottom of the window and press the Delete key.
18. Next go to the opposite end of the consensus, the far right, and delete the consensus sequence from the point where many Ns appear all the way to its very right-hand edge. For example, in Fig. 2b, you would delete the sequence marked in bold and everything to the right of it. Close the window.
19. Double click the F folder to open it. Go to the far right of the consensus sequence and find the reverse complement of the reverse primer (e.g., LepR1) at the very end. This means that at the right end of the forward sequences, you will find the complement of the reverse primer backwards (e.g., if the reverse primer is ATGC then you will find GCAT at the end of your forward sequence). This should be around position 690–700 bp on the consensus sequence. For example, in Fig. 2c, you would need to delete the sequence marked in bold and everything to the right of it.
20. When you have located the primer, highlight it on the consensus sequence at the bottom of the window and press the Delete key.
21. Next go to the opposite end of the consensus, the far left, and delete the consensus sequence from the point where many Ns appear all the way to its very left-hand edge. For example, in Fig. 2d, you would delete the sequence marked in bold and everything to the left of it.
22. Dissolve both folders by clicking on the button marked with a red X.
23. Highlight all sequences and press the button marked with a black N. This time, in order to assemble our files according to Process ID, choose Clone in the Name Part menu. Then click Assemble.
24. Specimens which only sequenced successfully in one direction will have files which remain in the Unassembled Samples folder (see Note 49).
25. Open each folder (contig) by double-clicking and make sure that forward and reverse sequences have the correct orientation, i.e., the forward sequence is in black with the arrow pointing to the right and the reverse in red with the arrow pointing to the left. If they are backwards, reverse-complement the two files in the folder by closing the window, highlighting the folder, and clicking the button with three black arrows.
26. Correct ambiguous positions ("N"s) and gaps ("-"s) in consensus sequences by checking the original trace chromatograms, which are present in the CodonCode project. This is done by double-clicking on the consensus sequence. Always open both trace files (forward and reverse) and compare them.
27. Generally, if reads conflict (i.e., different colored peaks appear in the same location on the forward and reverse chromatograms), you can decide which is more reliable based on sequence quality (e.g., less background noise, cleaner peaks, taller peaks).
28. Correct bases in contigs first, and then check the single sequences in the Unassembled Samples folder. This is a good idea because not all contigs will be kept; some will be dissolved or deleted.
29. Make sure single sequences are also in the correct orientation before uploading to BOLD.
30. To export the consensus sequences, select all the folders using shift click, go to File > Export > Consensus sequences…, and choose Current selection. Open the Options and check Include gaps in FASTA but uncheck all other options. Press Export. Save the file to the desktop as sequences.fas (see Note 49).

3.20. Sequence Aligning

1. Open the file sequences.fas in BioEdit (see Note 50).
2. Make sure Mode: is set to Edit using the drop-down menu.
3. Another drop-down menu will become visible to the right of the Edit drop-down. Make sure this is set to Insert.
4. Sequences that have ended up in the FASTA file in the wrong orientation may be corrected by clicking on the sequence name to highlight it, opening the Sequences menu at the top of the screen, moving the cursor down to Nucleic Acid, and clicking Reverse complement.
5. Sequences all need to be 658 bp long and aligned to each other before being uploaded to BOLD. This can be done by typing additional Ns at the beginning and end of your sequences in the BioEdit Edit mode. Be sure to check across the whole alignment of your sequences that you have added the correct number of Ns.
6. In Fig. 3, which features a 50-bp barcode for simplification, LOP001-11 is of full length. LOP002-11 needs 6 Ns added to the left side of the sequence to become aligned, while LOP003-11 needs 4 Ns added to the right side of the sequence to be 50 bp long. LOP005-11_F needs to be reverse complemented to be in the same orientation as the other sequences.
7. Sequences which were not part of a consensus (i.e., when one direction failed but the single sequence is of sufficient length and quality for submission to BOLD) may appear in the FASTA file still tagged with the direction. This tag needs to be deleted, e.g., the sequence named LOP005-11_F should be renamed LOP005-11 (Fig. 3).
8. If you are having trouble with the alignment, a good quality (i.e., 658[0n]) sequence can be downloaded from BOLD and imported into your BioEdit file as a guide, e.g., MHAHC824-05 (Fig. 3). Be sure to delete this sequence before saving the file.
9. Save the file (File > Save).

Fig. 3. An example of an unaligned and aligned FASTA file.

3.21. Sequence Upload and Publication

1. Open the file sequences.fas in a text editor. This can usually be done by double-clicking on the file icon on the desktop.
2. Under the Edit menu, click Select All, then under Edit, click Copy.
3. Navigate to your BOLD project's Project Console and, under the Uploads menu on the left, click Sequences.
4. Right click on the box Paste sequences in fasta format: and click Paste.
5. Select the Markers: as COI-5P and select or register your Run site.
6. Click Submit.
7. Once you are happy with all your data, make your project public on BOLD by adding Public as a user (see Subheading 3.1).
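Before pasting the FASTA text into BOLD, the requirements from Subheading 3.20 (every sequence 658 characters long, no direction tag left in the name, all sequences in the same orientation) can be checked with a short script. The Python sketch below parses FASTA text, strips a trailing _F or _R from sequence names, reports records of the wrong length, and provides a reverse-complement helper that handles the IUPAC ambiguity codes. All function names are illustrative; nothing here is part of BOLD, BioEdit, or CodonCode, and the script does not decide on which side the Ns should be added.

```python
BARCODE_LENGTH = 658

# IUPAC nucleotide codes and their complements; gaps are left unchanged.
COMPLEMENT = str.maketrans("ACGTRYMKWSBDHVN-", "TGCAYRKMWSVHDBN-")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def read_fasta(text):
    """Return {name: sequence} from FASTA text, dropping _F/_R name tags."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            if name.endswith(("_F", "_R")):
                name = name[:-2]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def wrong_length(records, expected=BARCODE_LENGTH):
    """Return {name: length} for records that are not `expected` long."""
    return {n: len(s) for n, s in records.items() if len(s) != expected}


# Made-up example data: one full-length record and one short single read.
example = ">LOP001-11\n" + "A" * 658 + "\n>LOP005-11_F\n" + "ACGT" * 160 + "\n"
records = read_fasta(example)
print(wrong_length(records))       # records that still need Ns added
print(reverse_complement("ATGC"))  # the ATGC/GCAT example from Subheading 3.19
```

Any name returned by wrong_length still needs Ns added in BioEdit before upload; an empty result means every record has the expected length, although the alignment itself should still be checked by eye.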

4. Notes

1. The user who creates the project becomes the Project Manager. The Project Manager is the only user who can add new users. The Project Manager can be changed by contacting the BOLD team at [email protected].
2. We recommend avoiding the letters A, C, G, T, U, R, Y, M, K, W, S, B, D, H, V, and N. These are IUPAC nucleotide codes, including the ambiguity codes which will appear in your sequences when a sequence-editing program is unable to make a base call. If you use these letters you may have difficulties later on during sequence editing and manipulation.
3. BOLD can store data for other regions besides COI-5P. The extracted DNA obtained by following these methods can be stored in a freezer and subsequently used as a template for other gene regions. For information about amplification of other regions frequently sequenced for insects see refs. 16 and 21.
4. To return to the Console of your new project, simply log into BOLD and click on the project name in your list of projects.
5. If you will be creating multiple projects it may be worth launching a Campaign. To start a Campaign contact [email protected]. Your Campaign can then be selected on the Create New Project page.
6. We have had instances where specimens that should have sequenced successfully failed unexpectedly. After consultation with the collector, we were able to attribute the failure to storage conditions.
7. DNA leaks into the storage fluid (23). For large specimens which may be damaged by ethanol, the whole specimen can be stored dry and legs can be removed into ethanol.
8. If you enter Sample ID on any other sheet it will overwrite important macros. When copying and pasting information into the spreadsheet, use Paste Special and select Values to avoid overwriting formulas. Avoid using the Project Code in Sample IDs if possible, because the Project Code will form the basis of Process IDs. We advise Sample ID = Field ID where possible. Also keep in mind that each project in BOLD can have a maximum of 999 samples. Alternatively, Sample ID can refer directly to the catalog number of a museum collection voucher.


9. Museum abbreviations should follow standard registers for biorepositories, e.g., http://www.biorepositories.org/.
10. In the Sex column use M for male, F for female, or H for hermaphrodite. Reproduction refers to the type of lifecycle (use S for sexual, A for asexual, or P for parthenogenic). For Life Stage use either I for immature or A for adult. Extra info will show up on taxonomic identification trees generated by BOLD. Notes will not be seen on the tree, but they will appear on BOLD on the Specimen Page.
11. Latitude (North–South) and Longitude (East–West) must be in decimal format. A useful website for this conversion is http://www.calculatorcat.com/latitude_longitude.phtml. Elevation must be in meters, but it is not necessary to put "m" for meters beside the number.
12. To navigate to the folder, type cd desktop, press enter, type cd Images, press enter.
13. In Excel, go to File > Open… (change Enable: to All files), navigate to Desktop > Images > list.txt, and click Open. A window will open called Text Import Wizard. Select Fixed width and click Next >. Scroll down until you can see your first file named .jpg and move the arrowed line beside it over to the right a little so that it touches the file name, then click Next > and Finish.
14. For information on the optional columns (Measurement, Copyright, etc.), please refer to the BOLD handbook (http://www.boldsystems.org > Documentation > Image Submission Protocol).
15. If your machine does not have built-in zipping functionality, try downloading a free program (WinZip: http://www.winzip.com; WinRar: http://www.rarsoft.com; or MacZipIt: http://www.maczipit.com).
16. BOLD will give a message advising if the submission was successful. If problems have occurred, refer to the Image Submission – Tips and Troubleshooting section of the BOLD handbook (http://www.boldsystems.org > Documentation > Image Submission Protocol).
17. Eliminase protocol:
(a) Get four clean jars. Label the first one "Eliminase" and add about 10 cm3 of Eliminase.
(b) Label the next three jars "Wash 1," "Wash 2," and "Wash 3" and fill them all with 10 ml or more of ddH2O.
(c) Dip your forceps into the Eliminase, shake them a bit, remove them, and wipe off excess liquid using a clean KimWipe.
(d) Dip your forceps and shake them slightly in Wash 1, then Wash 2, and finally in Wash 3. Dry them off using a clean KimWipe.


18. In these cases the tissue must be retrieved from the lysis buffer prior to undertaking the DNA extraction.
19. For single tube lysis add 50 μl of lysis buffer and 5 μl of ProK.
20. This is high speed, so place the lysis plate on a clean square-well block to prevent cracking.
21. You will need to put about 18 ml of Wash Buffer in the reservoir for a full plate.
22. You will need to put about 75 ml of Wash Buffer in the reservoir for a full plate.
23. Square-well blocks can be washed with ELIMINase® (or with any other DNA removing detergent), autoclaved, and reused.
24. The Qiagen kits can be stored at room temperature (15–25°C). Some ingredients may need storage at lower temperatures.
25. The Qiagen protocol suggests that you use a mortar and pestle to crush part of the insect into a powder, but we have found that this is unnecessary.
26. The following buffers contain toxic components that must not be thrown in the regular garbage: AL, ATL, AW1. Be sure that all liquid waste from DNA extraction is kept in a well-labelled glass jar.
27. This tube is not from the kit. It is one with a cap (see Subheading 2.5).
28. We find this has a very high success rate (99%) with recently collected specimens (…
45. In Excel, go to File > Open… (change Enable: to All files), navigate to Desktop > Traces > list.txt, and click Open. A window called Text Import Wizard will open. Select Fixed width and click Next >. Scroll down until you can see your first file named .ab1 and move the arrowed line beside it over to the right a little so that it touches the file name, then click Next > and Finish.


46. Score files are not compulsory and the column can be left blank; however, for more information on these and how you can include them, see the BOLD handbook.
47. Usually the file names should be of the form Process ID (LOP001-11) followed by direction (F), e.g., LOP001-11_F.ab1, which should make this easier.
48. BOLD will give a message advising if the submission was successful. If problems have occurred, refer to the Trace Submission – Tips and Troubleshooting section of the BOLD handbook (http://www.boldsystems.org > Documentation > Trace Submission Protocol).
49. To export the single direction sequences, go to File > Export > Samples… and choose Current selection. Open the Options and select Include gaps in FASTA but deselect all other options. Press Export. Save the file to the desktop as sequences.fas.
50. BioEdit can be downloaded for free from http://www.mbio.ncsu.edu/bioedit/bioedit.html.

Acknowledgment

Heather Braid compiled the "Barcoding in the Hanner Lab" protocols (http://barcoding.wikia.com/wiki/Barcoding_in_the_Hanner_Lab_Wiki), which greatly aided with the structuring and content of this chapter, and also provided Fig. 1.

References

1. Floyd R, Wilson JJ, Hebert PDN (2009) DNA barcodes and insect biodiversity. In: Foottit RG, Adler PH (eds) Insect biodiversity: science and society. Blackwell Publishing, Oxford, pp 417–431
2. Ratnasingham S, Hebert PDN (2007) BOLD: The barcode of life data system (www.barcodinglife.org). Mol Ecol Notes 7:355–364
3. Janzen DH, Hajibabaei M, Burns JM et al (2005) Wedding biodiversity inventory of a large and complex Lepidoptera fauna with DNA barcoding. Phil Trans R Soc Lond B 360:1835–1845
4. Janzen DH, Hallwachs W, Blandin P, Burns JM, Cadiou J-M, Chacon I et al (2009) Integration of DNA barcoding into an ongoing inventory of complex tropical biodiversity. Mol Ecol Res 9:1–25
5. Zhou X, Robinson JL, Geraci CJ, Parker CR et al (2011) Accelerated construction of a regional DNA-barcode reference library: caddisflies (Trichoptera) in the Great Smoky Mountains National Park. J Nor Amer Benth Soc 30:131–162
6. iBOL (2010) Barcoding blitz targets Australian Lepidoptera. Barcode Bulletin 1(4):5
7. Vaglia T, Haxaire J, Kitching IJ et al (2008) Morphology and DNA barcoding reveal three cryptic species within the Xylophanes neoptolemus and loelia species-group (Lepidoptera: Sphingidae). Zootaxa 1923:18–36
8. Hausmann A, Hebert PDN, Mitchell A et al (2009) Revision of the Australian Oenochroma vinaria Guenée, 1858 species-complex (Lepidoptera, Geometridae, Oenochrominae): DNA barcoding reveals cryptic diversity and assesses status of type specimen without dissection. Zootaxa 2239:1–21
9. Wilson JJ, Landry JF, Janzen DH et al (2010) Identity of the ailanthus webworm moth, a complex of two species: evidence from DNA barcoding, morphology and ecology. Zookeys 46:41–60
10. Dincă V, Zakharov EV, Hebert PDN, Vila R (2011) Complete DNA barcode reference library for a country's butterfly fauna reveals high performance for temperate Europe. Proc R Soc Lond B 278:347–355
11. Virgilio M, Backeljau T, Nevado B, de Meyer M (2010) Comparative performances of DNA barcoding across insect orders. BMC Bioinformatics 11:4567–4573
12. Vogler AP (2006) Will DNA barcoding advance efforts to conserve biodiversity more efficiently than traditional taxonomic methods? Front Ecol Environ 4:270–272
13. Pons J, Barraclough TG, Gomez-Zurita J et al (2006) Sequence-based species delimitation for the DNA taxonomy of undescribed insects. Sys Biol 55:595–609
14. Winter WD Jr (2000) Basic techniques for observing and studying moths & butterflies (Memoir No 5). The Lepidopterist's Society, Cambridge
15. Hanner R (2005) Proposed standards for BARCODE records in INSDC (BRIs). http://barcoding.si.edu/PDF/DWG_data_standards-Final.pdf
16. Regier JC (2008) Protocols, concepts, and reagents for preparing DNA sequencing templates. Version 12/4/08. http://www.umbi.umd.edu/users/jcrlab/PCR_primers.pdf
17. Porco D, Rougerie R, Deharveng L, Hebert PDN (2010) Coupling non-destructive DNA extraction and voucher retrieval for small soft-bodied Arthropods in a high-throughput context: the example of Collembola. Mol Ecol Res 10:942–945
18. Knölke S, Erlacher S, Hausmann A et al (2005) A procedure for combined genitalia extraction and DNA extraction in Lepidoptera. Insect Syst Evol 35:401–409
19. Ivanova NV, deWaard J, Hebert PDN (2006) An inexpensive, automation-friendly protocol for recovering high-quality DNA. Mol Ecol Notes 6:998–1002
20. Folmer O, Black M, Hoeh W, Lutz R, Vrijenhoek R (1994) DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol Mar Biol Biotech 3:294–299
21. Wilson JJ (2010) Assessing the value of DNA barcodes and other priority gene regions for molecular phylogenetics of Lepidoptera. PLoS ONE 5:e10525
22. Ivanova N, Grainger C (2006) Protocols: Sequencing. Canadian Centre for DNA Barcoding CCDB Protocols. http://www.dnabarcoding.ca
23. Shokralla S, Singer GAC, Hajibabaei M (2010) Direct PCR amplification and sequencing of specimens' DNA from preservative ethanol. BioTechniques 48:232–234
24. Hebert PDN, Penton EH, Burns J, Janzen DH, Hallwachs W (2004) Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly, Astraptes fulgerator. Proc Nat Acad Sci USA 101:14812–14817
25. Hajibabaei M, Janzen DH, Burns JM, Hallwachs W, Hebert PDN (2006) DNA barcodes distinguish species of tropical Lepidoptera. Proc Nat Acad Sci USA 103:968–971
26. Messing J (1983) New M13 vectors for cloning. Meth Enzymol 101:20–78

Chapter 4

DNA Barcoding Methods for Invertebrates

Nathaniel Evans and Gustav Paulay

Abstract

Invertebrates comprise approximately 34 phyla, while vertebrates represent one subphylum and insects a (very large) class. Thus, the clades excepting vertebrates and insects encompass almost all of animal diversity. Consequently, the barcoding challenge in invertebrates is that of barcoding animals in general. While standard extraction, cleaning, PCR methods, and universal primers work for many taxa, taxon-specific challenges arise because of the sheer genetic and biochemical diversity present across the kingdom, and because problems arising as a result of this diversity, and solutions to them, are still poorly characterized for many metazoan clades. The objective of this chapter is to emphasize general approaches and give practical advice for overcoming the diverse challenges that may be encountered across animal taxa, but we stop short of providing an exhaustive inventory. Rather, we encourage researchers, especially those working on poorly studied taxa, to carefully consider the methodological issues presented below when standard approaches perform poorly.

Key words: DNA barcoding, Invertebrates, CO1, Cytochrome c oxidase subunit I

1. Introduction

DNA barcoding as a tool for species-level identification was developed in zoology and remains most facile for animals (1), reflecting the unusually rapid rate of mitochondrial DNA (mtDNA) sequence evolution that characterizes most Metazoa (2). As a result, relatively short mtDNA sequences, as generated from single, routine PCR reactions, are sufficient for species delineation and identification in many taxa. To date, efforts have focused primarily on the initial ~650 base pair "Folmer" region of cytochrome c oxidase subunit I (COI), which typically accumulates several percent sequence difference between related animal species (1). In contrast, rates of mtDNA sequence evolution in other eukaryotes (protists, plants, and fungi) are much slower (2). Barcoding approaches in these

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_4, © Springer Science+Business Media, LLC 2012


groups require alternate, longer, or multiple gene regions, or are limited to supraspecific levels of differentiation. Thus, COI barcoding remains largely a zoological proposition. "Other invertebrates" encompasses almost all animal diversity. That is, invertebrates comprise approximately 34 phyla, while vertebrates represent one subphylum and insects a (very large) class. Consequently, the barcoding challenge in invertebrates is that of barcoding animals in general. While standard extraction, cleaning, PCR methods, and universal primers work for many taxa, taxon-specific challenges arise because of the sheer genetic and biochemical diversity present across the kingdom, and because problems arising as a result of this diversity, and solutions to them, are still poorly characterized for many metazoan clades. The objective of this chapter is to emphasize general approaches and give practical advice for overcoming the diverse challenges that may be encountered across animal taxa, but we stop short of providing an exhaustive inventory. Rather, we encourage researchers, especially those working on poorly studied taxa, to carefully consider the methodological issues presented below when standard approaches perform poorly. Challenges for barcoding across the Metazoa can be grouped into two broad classes: intrinsic, genetic issues of species delimitation, evolutionary rate, and behavior of potential markers; and extrinsic, methodological issues from sample processing to sequence generation.

1.1. Intrinsic Challenges

Species delineation, by any technique, requires that differences in the character(s) used for delineation be greater between members of different species than among members of the same species. For markers to be useful at the species level, they need to accumulate measurable divergence over the time frame of speciation. Markers that show limited interspecific divergence, or comparable levels of intraspecific variation, do not perform well for species delineation. Thus, the general utility of any given molecular marker for species-level “barcoding” requires a fairly rapid, but broadly conserved, rate of sequence evolution. Yet the mode and tempo of speciation and molecular evolution are certainly not constrained, and thus any “universal” marker may fail for those clades whose evolutionary dynamics deviate from that expected. This has clearly been borne out for DNA barcoding in Metazoa. The relatively high, but comparatively conserved, rate of sequence evolution of mtDNA has made COI the marker of choice in animals. Yet it has been repeatedly demonstrated that mtDNA sequence divergence is generally too slow for species delineation in most lower, non-bilaterian animals, much as it is in plants or protists (see below, and papers in this volume). Conversely, unusually rapid sequence evolution has also been detected within a number of metazoans, including pulmonate land snails (3), minute animals such as many meiofauna (4), and some parasitic taxa (e.g., (5)).
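The requirement stated above, that differences between species exceed differences within species, can be illustrated with a small calculation. The following is a minimal sketch, not a method taken from this chapter: it assumes sequences are already aligned to equal length, uses the uncorrected proportion of differing sites (p-distance) as the divergence measure, and the function names `p_distance` and `barcode_gap` are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        # Only unambiguous bases are compared; gaps and ambiguity codes are skipped.
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def barcode_gap(samples: dict) -> tuple:
    """samples maps a sample id to (species label, aligned sequence).

    Returns (largest within-species distance, smallest between-species distance).
    """
    intra, inter = [], []
    for (_, (sp1, s1)), (_, (sp2, s2)) in combinations(samples.items(), 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra, default=0.0), min(inter, default=float("nan"))


# Toy, made-up 10-bp "alignment" for illustration only.
toy = {
    "a1": ("species A", "ACGTACGTAC"),
    "a2": ("species A", "ACGTACGTAT"),
    "b1": ("species B", "ACGAACCTTC"),
}
max_intra, min_inter = barcode_gap(toy)
print(f"max intraspecific = {max_intra:.2f}, min interspecific = {min_inter:.2f}")
```

A marker behaves well for delineation in a given clade when the second value is clearly larger than the first; when the two overlap, the situations described in the text (slow markers, or unusually variable ones) are worth considering.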

1.1.1. Species Delineation

DNA Barcoding Methods for Invertebrates


Species delineation involves characterization of diagnostic differences among species. Characters used for species delineation can be broadly grouped into three classes: morphological, genetic, and isolating. While morphological and genetic characters can be assessed directly, isolation is usually inferred from other sources of evidence and is rarely directly tested through mating/fertilization experiments. Rate variation among characters has important consequences for species delineation; these consequences are especially important to consider when characters are of different classes (Table 1). When the rate of morphological differentiation is slow relative to genetic differentiation and isolation, cryptic species result; a potentially common situation in some taxa (6). Genetic taxonomy has the most to contribute to species delineation in these cases. When the rate of genetic differentiation is slow relative to morphological divergence and isolation, as for example in some African rift lake cichlid species flocks (7), simple barcoding is less powerful than morphological species delineation. When isolation evolves more rapidly than divergence in DNA sequences or morphology, for example with polyploid speciation (8) or when selection acts directly on isolating mechanisms (e.g., on gamete recognition proteins; (9)), species recognition becomes especially challenging. Finally, when isolation evolves slowly relative to the accumulation of morphological or genetic variation within species, intraspecific polymorphism will result. Variation among rates of morphological divergence, genetic divergence, and emergence of isolating mechanisms implies that a priori criteria for species recognition, such as predefined levels of morphological character state changes or genetic thresholds, are vulnerable to error, especially when applied across broad taxonomic groups (10).

Table 1
Consequences of variation in character evolution

Character type: Morphological | Genetic | Isolating | Consequence
Fast | Slow | Slow | Polymorphism
Fast | Fast | Slow | Polymorphism
Fast | Slow | Fast | Morphospecies, poorly substantiated by barcoding
Slow | Fast | Fast | Cryptic species, barcoding powerful
Slow | Fast | Slow | Genetic polymorphism
Slow | Slow | Fast | Species recognition challenging


Because speciation in animals is predominantly allopatric (11), initial differentiation usually takes place in isolation. Sympatry is usually secondary and can follow only when isolation is sufficient to prevent fusion of lineages. As a result, co-occurring taxa are typically well differentiated, making species recognition usually straightforward in sympatry. Reciprocal monophyly in two or more independent genetic or morphological characters implies reproductive isolation in a sympatric setting; thus such populations meet the criteria of the Biological Species Concept (12). Note that because all mitochondrial genes behave as a single locus, mitochondrial markers in themselves are insufficient to test for reproductive isolation, although a pattern of deep divergence between clades is suggestive of it. In an allopatric setting, reproductive isolation cannot be readily assessed, so less stringent, more subjective species concepts are usually applied. Reciprocal monophyly in two or more independent characters (genetic, morphological, or geographic) can be used to define Evolutionarily Significant Units (ESUs) or phylogenetic species (cf. (13)). While threshold-based definitions of species are problematic, it is useful to compare the average levels of divergence between sympatric species and allopatric ESUs in any taxon. ESUs that are separated by sequence divergence at least as deep as that between sympatric species in the same taxon make good species hypotheses.

1.1.2. Marker Choice

The ideal marker is easily amplified, exists in a single form (single copy, or multiple but identical copies) per cell or organism, and exhibits sufficient sequence variation to distinguish species. The “Folmer” region at the 5′ end of COI (14) was proposed as an ideal DNA barcode for these reasons (1), serves well for the majority of animals, and is the most widely used gene region for barcoding. The rapidly growing number of COI barcode sequences across the Metazoa, partly a consequence of large-scale efforts (CBOL, IBOL, BOLD, etc.), is producing a large and growing library of barcodes from identified, vouchered specimens for comparison. For most taxa, the COI barcoding region remains an ideal choice, and is the focus of this chapter. Nevertheless, additional or alternative molecular markers may be considered when COI sequences are insufficient to distinguish recognized species, when they conflict with interpretations of morphological or isolating characters, or when amplification is challenging (see below). Because of slow rates of sequence evolution, the Folmer region of COI tends to be insufficient for species delineation across much of the Porifera, Cnidaria, Ctenophora, and Placozoa (15–21). Other single gene regions have been explored with varied success, but effective DNA taxonomy in these basal phyla is moving toward multiple gene region approaches (e.g., (17, 22)). Nevertheless, certain clades among non-bilaterians do exhibit rapid rates of mtDNA evolution and can be resolved at the species level using single mitochondrial gene regions (e.g., (23)).

1.2. Methodological Challenges

Methodological challenges in barcoding are those of isolating, amplifying, and sequencing DNA in general. These are briefly outlined here and are dealt with below under their respective protocols.

First, tissues need to be preserved so that DNA does not degrade. There is substantial variation among animals in how rapidly tissue and DNA break down and in how easily liquid fixatives penetrate; this influences specimen handling and preservation.

Second, depending on the DNA extraction protocol used, various other metabolites may coextract with the DNA, and some of these may cause PCR inhibition. Inhibitors tend to be clade specific: they are inconsequential for many large taxonomic groups, but important in others. Inhibition can be addressed by changing extraction protocols to minimize coextraction of the inhibitor, by cleaning the DNA extract to remove the inhibitor, or by diluting the extract to a level where the effect of the inhibitor is lost but sufficient DNA remains for amplification.

Third, primers may fail to amplify the marker, or may amplify unintended additional markers. The former results when the sequence at the annealing site has diverged too far from that of the primer used, and can be addressed by making PCR conditions less specific (using a lower annealing temperature, a higher concentration of MgCl2, or degenerate primers), by designing better (i.e., taxon-specific) primers, or by changing to alternate markers. The latter can be addressed by more stringent PCR conditions in some cases, but becomes challenging when it is the result of gene duplication, whether as nuclear copies of mitochondrial genes (NUMTs), separate male and female mitochondrial lineages, or heteroplasmy. Such multiple copies pose challenges as well as opportunities in some taxa and are addressed in more detail below.

2. Materials

Researchers should have access to standard field and molecular laboratory supplies and equipment. To prevent contamination in either setting, researchers should have and use materials that enable sterile techniques. These include latex or nitrile gloves, filter pipette tips, Kimwipes, Bunsen burners to flame reusable dissecting instruments, and diluted bleach.

2.1. Tissue Subsample Preservation

1. 95–100% ethyl alcohol (EtOH) or DMSO–EDTA–salt buffer: 20% DMSO, 0.25 M sodium-EDTA, and NaCl to saturation, pH 7.5.
2. 2.0 ml Screw Cap Microtubes or 96-well plates with caps.

2.2. Preparation of DNA

1. DNAzol® genomic DNA isolating reagent (Molecular Research Center, Inc).
2. Proteinase K solution, 20 mg/ml: Combine and mix 100 mg Proteinase K with 5 ml sterile dH2O (or 2.5 ml dH2O and 2.5 ml glycerin). Store aliquots at −20°C.
3. Sterile polypropylene pellet pestles.
4. 100% and 75% EtOH (preferably ice cold).
5. 1.7 ml polypropylene microcentrifuge tubes.
6. TE buffer: 10 mM Tris–HCl pH 8.0, 1 mM EDTA (ethylenediaminetetraacetic acid).

2.3. DNA Quantification (Optional; See Subheading 3.4)

1. Spectrophotometer (e.g., NanoDrop™ variety by Thermo Fisher Scientific Inc.).
2. Mass DNA Ladder.

2.4. PCR Amplifications

1. PCR tubes or 96-well PCR plates.
2. Primers (10 μM): see Tables 6 and 7 for a list of universal and taxon-specific primers for the “Folmer” region of COI, and a few alternative markers.
3. Deoxynucleotide (dNTP) solution mix at a concentration of 10 mM for each of the four nucleotides.
4. A variety of Taq polymerase enzymes and PCR reagents are available separately or in kits, and most will be equally suitable (reagents are listed in Table 3). Be aware that concentrations and properties may vary between different varieties of PCR reagents. Also, variation can exist in the efficacy of different Taq polymerases. Two that consistently work well are Taq DNA Polymerase (New England Biolabs) and Platinum Taq DNA Polymerase (Invitrogen). The latter is a heat-activated, “hot-start” polymerase and, though more expensive, is more tolerant of reaction assembly at room temperature.
5. 15% Trehalose (~0.4 M): Dissolve 7.5 g trehalose dihydrate in 50 ml sterile dH2O. Heating may be needed. Store aliquots at −20°C. The final concentration in a PCR cocktail should be approximately 0.2 M (24).
6. Bovine serum albumin (BSA) 2.5 μg/μl solution: Combine and mix 1 ml Ultrapure BSA (at 50 mg/ml; Invitrogen) with 20 ml sterile dH2O. Store aliquots at −20°C. Use at 0.2–0.4 μg/μl final concentration in a PCR cocktail (25).
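The stock recipes in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the molar mass of trehalose dihydrate (378.33 g/mol) is an assumption not stated in the text, and the small volume change on dissolving a solid is ignored.

```python
def mass_per_volume(mass_mg: float, volume_ml: float) -> float:
    """Concentration in mg/ml (numerically equal to ug/ul)."""
    return mass_mg / volume_ml


def molarity(mass_g: float, molar_mass_g_per_mol: float, volume_l: float) -> float:
    """Concentration in mol/l for a solute dissolved to the given volume."""
    return mass_g / molar_mass_g_per_mol / volume_l


def diluted(stock_conc: float, stock_vol: float, diluent_vol: float) -> float:
    """Concentration after mixing a stock with a volume of diluent (same units in and out)."""
    return stock_conc * stock_vol / (stock_vol + diluent_vol)


# Proteinase K: 100 mg in 5 ml
print(mass_per_volume(100, 5))                 # mg/ml

# Trehalose dihydrate: 7.5 g in 50 ml; molar mass is an assumed value
print(round(molarity(7.5, 378.33, 0.050), 3))  # mol/l

# BSA: 1 ml of 50 mg/ml stock mixed with 20 ml water
print(round(diluted(50, 1, 20), 2))            # mg/ml
```

Under these assumptions the proteinase K and trehalose recipes reproduce the stated 20 mg/ml and ~0.4 M, while the BSA recipe as written works out slightly below the nominal 2.5 μg/μl (about 2.4 μg/μl), a difference that is small relative to the 0.2–0.4 μg/μl working range.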

3. Methods

3.1. Field Methods

1. Here, we consider mostly specimen processing and data tracking; more detailed treatments of biodiversity survey methods are provided by Templado et al. and Eymann et al. (26, 27). The Consortium for the Barcode of Life (CBOL) has defined Barcode Data Standards that include required as well as recommended data fields for barcode records. Required categories include a unique identifier (usually a collection catalog number) for the voucher specimen in a biorepository, an identification, and a country code. Recommended categories include latitude, longitude, collector, and collection date. As deposition of a voucher in a biorepository is required, additional data should also be collected for each specimen to meet the basic data standards of collection databases. These include depth/elevation, a hierarchy of location fields (e.g., state, county, specific locality), habitat/microhabitat, host association if any, notes, fixative, and preservative. Numerous other data fields are also used by various collections or for specific taxa. Finally, it is important to note the existence and unique identifiers of photo, tissue, or extraction samples taken from the specimen.
2. Recording these data and tracking specimens are best accomplished in a series of data tables in a Field Information Management System (FIMS). A FIMS can be set up in spreadsheet or relational database format; specifically designed FIMS databases have been created for a number of field biodiversity/barcoding projects (e.g., CReefs, Moorea Biocode). Spatiotemporal, habitat, and collector data are typically kept in the station table, and referred to in the specimen table through a unique station number. Specific notes about the specimen, such as microhabitat (unless this is parsed into the station table), host, fixation procedures, and references to photos and tissue/extraction subsamples, are kept in a specimen table. Tables on photos and subsamples complete a basic FIMS (Table 2).

Table 2
Main fields/field types needed in a FIMS

Station table | Specimen table | Photo table | Subsample table
Station # | Field # | Photo # | Subsample #
Locality | Identification | Field # | Field #
Habitat | Station # | Photographer | Tissue type
Elevation/depth | Microhabitat | Date | Plate/well #
Coordinates | Fixative | Notes |
Date | Photo taken? | |
Collectors | Tissue taken? | |
Notes | Notes | |
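The four tables of Table 2 map naturally onto a small relational schema. The following is a minimal sketch using Python's built-in `sqlite3` module; the column names, data types, splitting of coordinates into latitude and longitude, and the `create_fims` helper are illustrative assumptions rather than the layout of any existing FIMS.

```python
import sqlite3

# Each table carries its own unique identifier; specimen, photo, and subsample
# records point back to their parent record through a foreign key.
SCHEMA = """
CREATE TABLE station (
    station_no      TEXT PRIMARY KEY,
    locality        TEXT,
    habitat         TEXT,
    elevation_depth TEXT,
    latitude        REAL,
    longitude       REAL,
    date            TEXT,
    collectors      TEXT,
    notes           TEXT
);
CREATE TABLE specimen (
    field_no       TEXT PRIMARY KEY,
    identification TEXT,
    station_no     TEXT NOT NULL REFERENCES station(station_no),
    microhabitat   TEXT,
    fixative       TEXT,
    photo_taken    INTEGER,
    tissue_taken   INTEGER,
    notes          TEXT
);
CREATE TABLE photo (
    photo_no     TEXT PRIMARY KEY,
    field_no     TEXT NOT NULL REFERENCES specimen(field_no),
    photographer TEXT,
    date         TEXT,
    notes        TEXT
);
CREATE TABLE subsample (
    subsample_no TEXT PRIMARY KEY,
    field_no     TEXT NOT NULL REFERENCES specimen(field_no),
    tissue_type  TEXT,
    plate_well   TEXT
);
"""


def create_fims(path: str = ":memory:") -> sqlite3.Connection:
    """Create an empty FIMS database and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce links between tables
    conn.executescript(SCHEMA)
    return conn


conn = create_fims()
conn.execute("INSERT INTO station (station_no, locality) VALUES ('ST-001', 'hypothetical reef flat')")
conn.execute("INSERT INTO specimen (field_no, station_no) VALUES ('F-0001', 'ST-001')")
conn.execute("INSERT INTO subsample (subsample_no, field_no, plate_well) VALUES ('S-0001', 'F-0001', 'P1-A01')")

# Trace a tissue subsample back to where its specimen was collected.
row = conn.execute(
    "SELECT st.locality FROM subsample su "
    "JOIN specimen sp ON su.field_no = sp.field_no "
    "JOIN station st ON sp.station_no = st.station_no "
    "WHERE su.subsample_no = 'S-0001'"
).fetchone()
print(row[0])
```

With foreign keys enforced, a specimen cannot be entered against a station number that does not exist, which catches one common class of field data-entry error early.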


Note that each table has a unique identifier (for station, specimen, photo, and subsample) for sample tracking; additional unique identifiers (e.g., collection catalog number assigned to voucher specimen) may be added and linked to these.
3. Samples collected from a station are handled in the field appropriately for the method and taxa involved, so that specimens remain in good condition for preparation of vouchers, tissue subsamples, and photographs. Specimens may be fixed immediately in the field after collection, or transported live to the field lab for further processing. Bulk field fixation is more time-expedient, but prevents immediate taking and differential preservation of tissue subsamples, photographing live/fresh animals, or tracking specimen-level information. Live transport to the field lab allows specimens to be handled individually for photography, subsampling, and specific fixation protocols, but is more time consuming. Combining these approaches can be useful. Specimens that die and deteriorate rapidly (e.g., sponges) can be photographed and subsampled in the field, while specimens for which lab photography and fixation are especially useful (e.g., opisthobranch mollusks, flatworms) can be transported live to the field lab for further processing. In contrast, bulk samples too large to process in the field and field lab (e.g., plankton samples) can be fixed immediately after collection. Some bulk collection methods include collection of the substratum (e.g., leaf litter, marine sediments, reef rock) from which specimens are extracted in the field lab. The use of chemicals that damage DNA (e.g., formalin) for extracting specimens should be avoided.
4. Live samples taken to the field lab should be sorted to morphospecies by people sufficiently knowledgeable about the taxon to make this relatively accurate. Three to five specimens are useful to aim for in taxa where morphospecies accurately reflect genetic species, while more specimens are useful when cryptic complexes are expected. It is also informative to take samples from across the geographic range of species when possible, as geographically differentiated cryptic complexes are common. For animals that are too small to provide usable morphological and genetic samples from the same individual, two sets of specimens can be prepared, one for morphological vouchers and one for DNA sequencing. Detection of only one species in each set by subsequent analysis lends confidence that the specimens pertain to the same species. Retention of the extracted specimen as the voucher is possible in microfauna where identification characters are cuticle-based (e.g., most arthropods), by gently digesting soft tissues for DNA (e.g., with proteinase K) and preserving the remaining cuticle “shell” as the voucher.
5. Voucher preparation usually involves relaxing, killing, fixing, and preserving the specimen based on taxon-specific protocols (26).


Some taxa require fixation for morphological study in fixatives (e.g., formalin, glutaraldehyde) that are incompatible with DNA preservation; taking subsamples for genetic analysis prior to fixation is essential for these. Subsampling of ethanol-fixed specimens can be delayed until return to the home lab for field-expediency.
6. Photodocumentation can provide online access to the voucher and captures information often lost in fixed specimens. Photos should capture characters that allow identification. They can be of the whole organism, close-ups of diagnostic features, or various preparations. Thus, photo efforts should be guided by someone knowledgeable about the taxon. For some organisms (e.g., many sessile invertebrates), in situ photos can be especially informative, as even collection will disrupt their appearance. Photographs of fresh and relaxed specimens record living color and morphological features that may be altered by preservation, and can facilitate taxonomy as much as genetic data. In some taxa (e.g., decapod crustaceans, opisthobranchs), the most closely related species differ mainly in color pattern rather than in structural morphology. In contrast, images of preserved or prepared specimens are as useful as, or more useful than, images of live specimens for other taxa (e.g., SEMs for bryozoans, sections for helminths).

3.2. Subsample Preservation

1. Subsamples should be taken from body parts that are not important for morphological identification, have a low probability of contamination (from symbionts, environment, food, etc.), and are rich in mitochondria (to minimize potential NUMT coamplification). 1–3 mm³ of tissue provides an ideal subsample, but much smaller amounts are also sufficient. Subsampling should be done on a clean surface, and the tools used should be flamed to prevent contamination.
2. Subsamples can be preserved by freezing, drying, EtOH, DMSO–EDTA–salt buffer, or proprietary buffers, or placed directly into extraction buffers (see Chapter 14). Use at least 5× as much preservative as the volume of the tissue sample. Placing subsamples directly into extraction buffer followed by DNA extraction is efficient and provides high-quality genomic DNA, but uses up the subsample, preventing future alternative extraction procedures. Ethanol provides an ideal preservative that often doubles as a preservative for the morphological voucher. Ethanol concentrations between 70 and 100% all work well, but for voucher fixation only 70–75% should be used, as higher concentrations can make specimens dehydrated and brittle, and can lead to preservation artifacts. DMSO–EDTA–salt buffer is easier to transport and can leave higher quality genomic DNA in some taxa (28), but as tissue disintegrates in this solution, it is inappropriate for voucher preservation. Keeping preserved subsamples in a fridge or freezer until extraction slows DNA degradation.
3. Subsamples are ideally placed in 96-well plate format (or in tubes arranged in 96-well format, such as Matrix Storage Tubes (Thermo Scientific)), and worked in that format through to sequencing. When subsamples are collected in small numbers, or for replicate samples of especially important specimens, small tissue vials can be used. As the possibilities of sample mix-up or cross-well contamination are substantial for 96-well plates, extra care needs to be taken to prevent them. Staggering samples of the same species among non-neighboring wells facilitates detection of contamination.

3.3. Preparation of DNA

There is a diversity of DNA preparation methods and most are suitable for any metazoan (see Note 1). For larger scale projects, we suggest the high-throughput, silica-based DNA extraction protocol described by Ivanova et al. (29) (see Note 2). Here, we present a manual extraction protocol using DNAzol®. This simple and reliable method works well with most metazoans, produces high-quality extractions, and is suitable for even decade-old specimens. The manufacturer’s protocol can be carried out in minimal lab settings at room temperature in less than 30 min. However, we suggest the following modified approach to improve DNA yields and quality.

1. Place ~1–2 mm piece of tissue on parafilm. Remove extra storage buffer or EtOH by evaporation or blotting with a Kimwipe. Mincing tissue with a sterile blade can improve digestion and increase yield. Shaving off the outer tissue layer can reduce inclusion of contaminants.
2. Transfer tissue to 1.7-ml polypropylene microcentrifuge tube.
3. Add 750 μl DNAzol® and 5 μl proteinase K (20 mg/ml). For challenging tissues, let it stand for ~10 min then add an additional 5–10 μl proteinase K.
4. Grind tissue with sterile pellet pestle (see Note 3). Alternatively, for particularly soft-bodied taxa (e.g., medusozoans) a simple vortex is sufficient.
5. Allow tissue to digest for ~24 h at room temperature, preferably on a rocking shaker. For fresh, high-quality tissue a 1-h digest may be sufficient, while for poor quality samples digestion can be extended for >24 h.
6. Centrifuge sample at ~12,000 × g for 15 min.
7. Carefully pipette supernatant into new 1.7 ml tube avoiding disturbance or transfer of pellet at bottom of original tube. Leaving some (5 min. Orient tubes in same direction (e.g., all cap hinges facing out), to facilitate locating and thus avoiding DNA pellet (angled at bottom of tube) during subsequent pipetting.
12. Carefully pipette or decant nearly all the supernatant without disturbing DNA pellet. If pellet is disturbed repeat from step 9.
13. Rinse sample twice, adding ~1 ml of 75% EtOH, centrifuging at 12,000 × g, and decanting (or pipetting) as above.
14. Carefully remove any remaining EtOH from tube, by pipetting or evaporation, without disturbing DNA pellet. Residual EtOH can be evaporated by leaving tube open but protected from contamination (e.g., covered with a Kimwipe), for up to a few hours.
15. Add 30–50 μl TE buffer or sterile dH2O, then mix by flicking tube or gently vortexing. Leave at room temperature for several hours or overnight to dissolve DNA pellet. If the extraction is noticeably viscous, the concentration is likely high and additional TE buffer or sterile dH2O should be added. Alternatively, if yields are consistently low, consider adding less TE/dH2O in future extractions.
16. The resulting genomic DNA extraction is ready for use in PCR. However, quantification and dilution of the extraction should be considered.

3.4. DNA Quantification

The ideal DNA concentration for PCR falls broadly around 50–100 ng/μl and extractions tend to approximate this range. DNA quantification is optional, but we recommend it when PCR success is variable (see Note 4). In these cases genomic DNA extractions can be reliably quantified with a spectrophotometer. However, cruder estimates can be successfully made by electrophoresis of a 1-μl aliquot of DNA on a 0.8% agarose gel in parallel with a properly mass-quantified DNA ladder.
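When a quantified extract is more concentrated than the 50–100 ng/μl range mentioned above, the volumes needed for a working dilution follow from C1·V1 = C2·V2. The helper below is a minimal sketch; the function name and the example concentrations are hypothetical.

```python
def dilution_volumes(stock_ng_per_ul: float, target_ng_per_ul: float,
                     final_volume_ul: float) -> tuple:
    """Return (stock volume, diluent volume) in ul for a working dilution.

    Uses C1 * V1 = C2 * V2; the diluent would be TE buffer or sterile dH2O.
    """
    if stock_ng_per_ul <= 0 or target_ng_per_ul <= 0 or final_volume_ul <= 0:
        raise ValueError("all values must be positive")
    if target_ng_per_ul > stock_ng_per_ul:
        raise ValueError("target concentration exceeds stock; dilution cannot concentrate DNA")
    stock_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_ul, final_volume_ul - stock_ul


# Example: a 400 ng/ul extract diluted to 100 ng/ul in a 20 ul working aliquot.
stock_ul, diluent_ul = dilution_volumes(400, 100, 20)
print(stock_ul, diluent_ul)
```

Preparing a separate working aliquot in this way also limits how often the original extract has to be thawed.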

3.5. DNA Storage

Genomic DNA will degrade unless it is kept cold. Freezing is preferable but repeated freeze–thaw cycles also result in degradation of DNA (see Note 5). For repeated use over short term (days) store DNA at 4°C, but for longer term (weeks to a few years) keep DNA extracts at least at −20°C. Only temperatures at or below −80°C effectively halt degradation and should be used for long-term storage.


3.6. PCR Amplifications

1. Calculate reagent volumes needed for the PCR cocktail based on the volumes in Table 3 multiplied by the number of reactions. A negative control reaction (i.e., no DNA template) should always be included; a positive control (of easily/previously amplified DNA) is also useful. To allow for pipetting errors, it can be useful to add extra reaction volumes (we suggest approximately one additional reaction volume per 24 samples).
2. When including PCR enhancers, replace the corresponding volume of ddH2O with enhancer solution (Table 4). Inclusion of PCR enhancers provides a powerful and cheap method to improve PCR success, especially for poor quality DNA extracts or those that contain inhibitory compounds (see Note 6).

Table 3
Standard PCR cocktail for one 25 μl reaction

Reagents | 25 μl rxn
ddH2O (a) | 19.5 μl
10× buffer | 2.5 μl
50 mM MgCl2 (b) | 1.25 μl
10 μM forward primer | 0.25 μl
10 μM reverse primer | 0.25 μl
10 mM dNTPs | 0.125 μl
Taq polymerase (5 units/μl) | 0.125 μl
PCR cocktail total (1 rxn) | 24 μl
DNA template | 1 μl

See Tables 6 and 7 for appropriate primer pairs
(a) May adjust ddH2O volume to accommodate additional reagents, DNA template, or PCR enhancing additives
(b) Yields 2.5 mM MgCl2; a range between 0.5 and 3 mM is recommended, with higher concentrations making primer annealing less stringent
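Scaling the per-reaction volumes of Table 3 to a full run is simple arithmetic, but easy to get wrong at the bench. The sketch below multiplies the Table 3 volumes by the number of samples plus controls, adds roughly one extra reaction per 24 samples as suggested in step 1, and swaps ddH2O for an optional enhancer volume as described in step 2. The function name `master_mix` and its parameters are hypothetical.

```python
import math

# Per-reaction volumes in ul, taken from Table 3 (DNA template is added separately).
PER_RXN_UL = {
    "ddH2O": 19.5,
    "10x buffer": 2.5,
    "50 mM MgCl2": 1.25,
    "10 uM forward primer": 0.25,
    "10 uM reverse primer": 0.25,
    "10 mM dNTPs": 0.125,
    "Taq polymerase (5 units/ul)": 0.125,
}


def master_mix(n_samples: int, n_controls: int = 1, enhancer_ul: float = 0.0) -> dict:
    """Total volume (ul) of each reagent for a PCR cocktail.

    n_controls counts negative (and any positive) control reactions.
    enhancer_ul is the per-reaction enhancer volume that replaces ddH2O.
    """
    overage = math.ceil(n_samples / 24)  # about one spare reaction per 24 samples
    n_total = n_samples + n_controls + overage
    per_rxn = dict(PER_RXN_UL)
    per_rxn["ddH2O"] -= enhancer_ul
    if per_rxn["ddH2O"] < 0:
        raise ValueError("enhancer volume exceeds the ddH2O available per reaction")
    if enhancer_ul:
        per_rxn["PCR enhancer"] = enhancer_ul
    return {reagent: vol * n_total for reagent, vol in per_rxn.items()}


# Example: 23 samples plus one negative control.
mix = master_mix(23)
for reagent, vol in mix.items():
    print(f"{reagent}: {vol} ul")
```

Each reaction tube then receives 24 μl of this cocktail plus 1 μl of template, so the total cocktail volume divided by 24 gives the number of reactions it can supply.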

Table 4
Optional PCR enhancing additives. To include, appropriately decrease ddH2O volume in 25 μl PCR cocktail

PCR enhancers |
15% Trehalose stock | 12.6 μl
BSA 2.5 μg/μl stock | 2–4 μl


3. Thaw, on ice, DNA samples and all necessary PCR reagents, except for Taq polymerase. Taq is especially sensitive to thermal degradation; it is usually suspended in glycerin, remains liquid at −20°C, and can therefore be pipetted directly from “frozen” tubes.
4. Combine PCR reaction cocktail on ice (see Notes 7 and 8).
5. Vortex and spin cocktail.
6. Pipette 24 μl of PCR cocktail into each PCR reaction tube.
7. Pipette 1 μl of appropriate DNA template into each reaction tube, except negative control.
8. Firmly seal reaction tubes and place in thermal cycler machine.
9. Start appropriate PCR thermal cycling profile program.

3.7. PCR Thermal Cycling Profiles
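The three approaches listed in Table 5 differ only in whether five initial cycles are run about 5°C below or above the annealing temperature (Ta) before the main cycles. As a hedged illustration, the sketch below builds each profile as plain data from a given Ta; the function names are hypothetical, the Ta of 50°C in the example is a placeholder rather than a value from Table 7, and ramping time between temperatures is ignored when totalling the programmed time.

```python
def cycling_profile(ta_c: float, approach: str = "standard") -> list:
    """Return a profile as a list of (label, repeats, [(temp_C, seconds), ...])."""
    offsets = {"standard": None, "step-up": -5, "step-down": 5}
    if approach not in offsets:
        raise ValueError(f"unknown approach: {approach}")

    def cycle(anneal_c):
        # denaturation, annealing, extension
        return [(94, 30), (anneal_c, 45), (72, 60)]

    profile = [("initial denaturation", 1, [(94, 300)])]
    offset = offsets[approach]
    if offset is None:
        profile.append(("main cycles", 35, cycle(ta_c)))
    else:
        profile.append(("initial cycles", 5, cycle(ta_c + offset)))
        profile.append(("main cycles", 30, cycle(ta_c)))
    profile.append(("final extension", 1, [(72, 300)]))
    return profile  # the final 4 C hold is open-ended and not listed


def programmed_minutes(profile: list) -> float:
    """Sum of programmed step times in minutes, excluding ramping and the hold."""
    return sum(reps * sum(sec for _, sec in steps) for _, reps, steps in profile) / 60


for name in ("standard", "step-up", "step-down"):
    prof = cycling_profile(50, name)  # 50 C is a placeholder Ta
    print(name, programmed_minutes(prof), "min")
```

All three profiles contain 35 cycles in total, so their programmed durations are identical; only the stringency of the first five annealing steps changes.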

PCR profiles vary greatly and researchers should make an initial effort to test and optimize reactions for their particular thermal cycler machine, targeted marker, primer pair, and taxa of interest. Suggested thermal cycling profiles appear in Table 5 (see Note 9 for detailed explanations). These profiles, adjusted to the suggested annealing temperatures for the primers used (Table 7), should provide a starting point for amplification of sequences less than approximately 800 bp in length. Longer sequences will require longer extension times and may require multiple amplifications and primer pairs. Use a “heated lid” option for the thermal profile program whenever available, to prevent PCR reactions from condensing on the tube lids.

Table 5
Three thermal cycling approaches

Approach: Standard
Utility: Well suited for taxon-specific primers
Thermal cycling profile (a): 94°C for 5 min, 35 cycles (94°C for 30 s, Ta [given in Table 7] for 45 s, 72°C for 1 min), 72°C for 5 min, hold at 4°C

Approach: Step-up
Utility: Decreases annealing specificity, appropriate for universal primers. Risk: co-amplification of contaminants or nontarget sequence
Thermal cycling profile (a): 94°C for 5 min, 5 cycles (94°C for 30 s, ~5°C below Ta [given in Table 7] for 45 s, 72°C for 1 min), 30 cycles (94°C for 30 s, Ta [given in Table 7] for 45 s, 72°C for 1 min), 72°C for 5 min, hold at 4°C

Approach: Step-down
Utility: Increases annealing specificity, eliminating co-amplified products. Risk: no amplification of target sequence
Thermal cycling profile (a): 94°C for 5 min, 5 cycles (94°C for 30 s, ~5°C above Ta [given in Table 7] for 45 s, 72°C for 1 min), 30 cycles (94°C for 30 s, Ta [given in Table 7] for 45 s, 72°C for 1 min), 72°C for 5 min, hold at 4°C

(a) See Note 9 for description and Table 7 for appropriate annealing temperatures

3.8. PCR Product Confirmation

Successful PCR amplification should be confirmed by electrophoresis of 2.5–5 μl of the amplicon and a molecular ladder in neighboring wells in a 1% agarose gel, made (or stained) with a dilute (~0.5 μg/ml) ethidium bromide (EtBr) solution and visualized under UV light (see Notes 10 and 11). Successful amplicons will appear as single, distinct bands. Faint or additional, spurious bands suggest that PCR or primer optimization is needed. Those conducting high-throughput efforts may consider commercially available 96-well precast gel systems (e.g., Invitrogen E-gel 96 system).

3.9. Sequencing

Given the infrastructure, resources, and expertise needed to do sequencing, this step is usually contracted out to commercial or university core facilities. It is prudent to “shop around” for sequencing services (even internationally), comparing everything from volumes of PCR reactions requested, to average completion times. In our experience, researchers who can guarantee a high volume of samples may be able to negotiate a better price. We also recommend working with those facilities capable of affordably handling raw PCR products. However, in some cases it may still be more cost effective to “clean” or purify PCR products before submitting them. A variety of methods can be found in this volume and elsewhere (see Note 12).

3.10. Sequence Data Processing and Verification

1. Construct bidirectional contigs: Bidirectional sequence data (i.e., from forward and reverse sequencing primer reactions) should be assembled into a single contig to provide a reliable consensus sequence (see Note 13). This can be accomplished with various software including Geneious, Sequencher, UGENE, and MEGA (see Note 14).
2. Check sequence identity: Sequence identity should be checked to reduce the likelihood of proceeding with contaminant sequences, misidentified samples, or pseudogenes (discussed below). This can be carried out by conducting a BLAST query (http://blast.ncbi.nlm.nih.gov/Blast.cgi) or a “Taxon ID Tree” in BOLD (http://www.boldsystems.org/) (see Note 15). The quality of these queries is directly related to the library of available sequences. For understudied clades (especially highly divergent ones), these queries perform less favorably. Be cautious if results suggest unexpected taxon affinities.
3. Create alignment: Before comparative analyses can proceed, sequences (of the same gene region) should be assembled into a multiple sequence alignment. There are a number of free programs capable of performing this (e.g., MAFFT, MUSCLE, ClustalW) (see Note 16).
4. Detection of contaminant and pseudogene sequences: Poor quality, contaminant, or nontarget sequences can also be identified when they fail to properly align to other sequences, or if they possess unique indels or atypical sequence regions. For protein coding genes (e.g., COI), nucleotide sequences should be translated into amino acid data to confirm an open reading frame (i.e., no stop codons). Stop codons indicate that a nonfunctional pseudogene region was amplified (a serious concern), but their absence is not a guarantee that the sequence is functional. For mtDNA genes, be careful to choose the correct mitochondrial genetic code for inferring amino acid sequences, as this varies among some animal phyla (see Note 17). Indels and introns, while rare, do exist within metazoan mitochondrial protein coding genes (21, 30–33). If translational frameshifts result, such sequences could be incorrectly identified as pseudogenes (30).

3.11. Alternative or Additional Markers

Additional or alternative molecular markers should be considered when CO1 data is insufficient to distinguish recognized species, when it conflicts with interpretations of morphological or isolating characters, or when CO1 amplification remains challenging (but see Notes 18–20). A brief overview of the most commonly used alternative species level markers can be found below (see Note 21 for mitochondrial, and Note 22 for nuclear markers). A limited selection of universal primers and additional references are provided in Tables 6 and 7. Successful amplification of many of these markers will require additional research and troubleshooting (see Notes 19 and 20).
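Returning to the open reading frame check described in Subheading 3.10, step 4, the sketch below shows why the choice of genetic code matters. It is a self-contained illustration, not the procedure of any particular software package: the two codon strings follow the NCBI translation tables 1 (standard) and 5 (invertebrate mitochondrial), the function names are hypothetical, and the 12-bp test sequence is made up. Other animal groups use other mitochondrial codes, so table 5 should not be assumed for every phylum.

```python
from itertools import product

BASES = "TCAG"  # NCBI codon ordering: first base varies slowest
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
INVERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"


def codon_table(amino_acids: str) -> dict:
    """Map each of the 64 codons to an amino acid (or '*' for stop)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(seq: str, table: dict, frame: int = 0) -> str:
    """Translate in one reading frame; codons with gaps or ambiguities become 'X'."""
    seq = seq.upper()
    return "".join(table.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3))


def stop_positions(seq: str, table: dict, frame: int = 0) -> list:
    """Codon indices (0-based) that translate to a stop in the given frame."""
    return [i for i, aa in enumerate(translate(seq, table, frame)) if aa == "*"]


fragment = "ATGTGAAGATTT"  # made-up example, not a real COI sequence
print(translate(fragment, codon_table(STANDARD)))
print(translate(fragment, codon_table(INVERT_MITO)))
print(stop_positions(fragment, codon_table(INVERT_MITO)))
```

The same fragment shows an apparent stop codon under the standard code but reads through cleanly under the invertebrate mitochondrial code, which is how a genuine mitochondrial sequence can be mistaken for a pseudogene when the wrong table is applied. For a barcode fragment whose reading frame is unknown, checking all three frames (`frame` = 0, 1, 2) is a reasonable precaution.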

4. Notes

1. Preparation of DNA from a tissue subsample can be accomplished by either extraction protocols or cruder DNA "release" methods (34). DNA "release" approaches digest the tissue such that DNA is quickly brought into solution, but they stop short of isolating it from the lysate. These methods are both fast and inexpensive but do not produce DNA samples suitable for archiving and can contain PCR inhibiting compounds. However, if organisms or tissue samples are exceedingly small (e.g., meiofauna) and extraction protocols fail to produce suitable DNA yields, a DNA release approach ensures that no DNA is discarded (a problem of varying degrees for most DNA extraction protocols). We recommend the DNA release protocol described by deWaard et al. (35), which utilizes Chelex® 100 (Bio-Rad). DNA extraction methods are more appropriate for isolating DNA from older tissues, from organisms known to possess PCR inhibiting compounds, and when high-quality, stable DNA is desired for molecular work or archiving purposes. Organic DNA extractions remain the cheapest and often most effective extraction methods but require the use of toxic materials (e.g., phenol and chloroform) and, unless automated, are labor intensive. Where both tissue and funds are sufficient, commercially available silica-based DNA extraction kits provide both supplies and simple protocols that yield high-quality extractions. QIAGEN DNeasy® and Clontech NucleoSpin® tissue kits are among the most recommended, and are available in individual and 96-well plate formats.
2. This affordable high-throughput protocol has been adopted by the Canadian Centre for DNA Barcoding and can also be found on their website (http://www.ccdb.ca/pa/ge/research/protocols).
3. To clean polypropylene pestles before reuse, soak them in bleach for >30 min, rinse with water, wrap in foil, and autoclave.

Table 6 Phylum-specific strategies, caveats, and references for barcoding invertebrate fauna

| Phylum | Alternative CO1 "Folmer" region primers | Caveats | Notable references |
|---|---|---|---|
| Acanthocephala | | 1,2 | (61–63) |
| Acoelomorpha | | 1 | (64, 65) |
| Annelida | | 2,3 | (42, 66–71) |
| Arthropoda (excl. Hexapoda) | Chelicerate-F1, Chelicerate-R1, Chelicerate-R2, HCOoutout, CrustDF1, CrustDR1, CrustF1, CrustF2 | 2,3 | (5, 40, 60, 72–83) |
| Brachiopoda | Cohen-Fwd, Cohen-Rev | 1,2 | (84, 85) |
| Bryozoa | | 1,2 | (86) |
| Cephalochordata | AmphL109, AmphH1325 | | (87) |
| Chaetognatha | | | (88) |
| Cnidaria: Anthozoa | MCOIF, MCOIR | 2,3,4,5 | (16, 17, 31, 32, 89–93) |
| Cnidaria: Medusozoa | LCOjf | 2,4 | (94–97) |
| Cnidaria: Myxozoa | | 1,2,4 | (98, 99) |
| Ctenophora | | 1,2,4,5 | (18, 100, 101) |
| Cycliophora | CycF, CycR | | (102) |
| Echinodermata | COIceF, COIceR | 2,4 | (41, 103) |
| Entoprocta | | 1 | (104) |
| Gastrotricha | | 1 | (105) |
| Gnathostomulida | COI-7, COI-D | 1 | (106) |
| Hemichordata | | 1,2,4 | (107, 108) |
| Kinorhyncha | | 1 | |
| Loricifera | | 1 | |
| Micrognathozoa | | 1 | (109) |
| Mollusca | dgLCO, dgHCO | 2,3 | (10, 13, 110–119) |
| Nematoda | | 2,3,4,5 | (48, 55, 56, 82, 120, 121) |
| Nematomorpha | | 1 | |
| Nemertea | HCOoutout | 1,2 | (122–124) |
| Onychophora | | 2 | (125–127) |
| Orthonectida | | 1 | |
| Phoronida | | 1 | (128) |
| Placozoa | | 2,3,4 | (20, 21, 129) |
| Platyhelminthes | | 2,4,5 | (65, 130–132) |
| Porifera | dgLCO, dgHCO | 2,3,4,5 | (30, 47, 59, 133, 134) |
| Priapulida | | 1 | |
| Dicyemida | | 1,3 | (135) |
| Rotifera | | 2,3 | (136–138) |
| Sipuncula | | 1,2 | (139, 140) |
| Tardigrada | HCOoutout | 1,2 | (141–144) |
| Urochordata | Tun_fwd, Tun_rev2 | 1,2,4 | (145, 146) |
| Xenoturbellida | | 1,2 | (119, 147) |

Caveats: 1. Limited or no DNA barcoding completed for this clade. 2. Additional/alternative markers, primer sets, or PCR strategies reported (see references). 3. Peculiar genetics or biology warrant caution (see references). 4. Genetic code may deviate from the standard invertebrate mtDNA code (see Note 17 or references). 5. "Folmer" CO1 region may be insufficient for DNA barcoding (see references).
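The open reading frame check used to screen for pseudogenes, and the reason the choice of genetic code matters (caveat 4 of Table 6; see Note 17), can be illustrated with a short Python sketch. It scans the three forward reading frames of a nucleotide sequence for stop codons under a chosen NCBI translation table. The function name, the example fragment, and the restriction to forward frames are assumptions made for illustration only; the stop-codon sets follow the NCBI genetic code listing for the tables named in the comments. In practice, an alignment editor or a dedicated library would normally be used for this step.

```python
# Stop codons for selected NCBI translation tables (see Note 17).
# Table 1 (the standard code) is included only for contrast.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},   # standard code
    4: {"TAA", "TAG"},          # mold, protozoan, and coelenterate mitochondrial
    5: {"TAA", "TAG"},          # invertebrate mitochondrial
    9: {"TAA", "TAG"},          # echinoderm and flatworm mitochondrial
    13: {"TAA", "TAG"},         # ascidian mitochondrial
    14: {"TAG"},                # alternative flatworm mitochondrial
}


def stops_by_frame(seq, table=5):
    """Return {frame: [start positions of stop codons]} for frames 0, 1, 2."""
    seq = seq.upper().replace("-", "")  # drop alignment gaps
    stops = STOP_CODONS[table]
    result = {}
    for frame in range(3):
        # Step through complete codons only.
        result[frame] = [
            i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops
        ]
    return result


# Hypothetical fragment: TGA occurs twice in frame 0.
fragment = "ACATGAGGTATTTGACTAGCA"
print(stops_by_frame(fragment, table=1))  # {0: [3, 12], 1: [16], 2: []}
print(stops_by_frame(fragment, table=5))  # {0: [], 1: [16], 2: []}
```

Under the standard code, frame 0 of the example appears interrupted by two stop codons, whereas under Translation Table 5 the same frame is open because TGA encodes tryptophan. A sequence could therefore be misclassified as a pseudogene if the wrong table were applied.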

Table 7 Universal and phylum-specific primer pairs, sequences, and annealing temperatures

| Marker | Primer name (forward/reverse) | 5′–3′ Forward primer sequence | 5′–3′ Reverse primer sequence | Ta | Taxon utility | Amplicon size | References |
|---|---|---|---|---|---|---|---|
| CO1 (5′ "Folmer" region) | LCO1490/HCO2198 | GGTCAACAAATCATAAAGATATTGG | TAAACTTCAGGGTGACCAAAAAATCA | 40–55°C | Universal Metazoa | 650 bp | (14) |
| | dgLCO/dgHCO | GGTCAACAAATCATAAAGAYATYGG | TAAACTTCAGGGTGACCAAARAAYCA | 40–44°C | Degenerate universal Metazoa | 650 bp | (148) |
| | CycF/CycR | CGRATGGARCTYTCTCAYCC | TTAAAATTACGRTCTGTYAAAAG | 45°C | Cycliophora | 487 bp | (102) |
| | Cohen-Fwd/Cohen-Rev | ATTYTBCCNGGRTTTGG | TACCCYCGNCAAAAAC | NA | Brachiopoda | 663 bp | (84) |
| | Tun_fwd/Tun_rev2 | TCGACTAATCATAAAGATATTAG | AACTTGTATTTAAATTACGATC | 50°C | Urochordata | 586 bp | (145) |
| | AmphL109/AmphH1325 | ATTCGNGCNGAAYTNTCNCAGCC | TCNGAATAYCGNCGWGGTATNCC | 45–60°C | Cephalochordata | 1 kb | (87) |
| | COIceF/COIceR | ACTGCCCACGCCCTAGTAATGATATTTTTTATGGTNATGCC | TCGTGTGTCTACGTCCATTCCTACTGTRAACATRTG | 51°C | Echinodermata | 655 bp | (41) |
| | COI-7/COI-D | ACNAAYAARCAYGAYATYGGNAC | TCTGGGTGTCCRAARAAYCARAA | 40°C | Gnathostomulida, Annelida | 657 bp | (149) |
| | LCO1490/HCOoutout | See above | GTAAATATATGRTGDGCTC | 42 and 50°C | Tardigrada, Chelicerata, Myriapoda | 742 bp | (77, 144, 150, 151) |
| | Chelicerate-F1/Chelicerate-R1 | TACTCTACTAATCATAAAGACATTGG | CCTCCTCCTGAAGGGTCAAAAAATGA | 45 and 50°C | Arachnida | 660 bp | (79) |
| | Chelicerate-F1/Chelicerate-R2 | See above | GGATGGCCAAAAAATCAAAATAAATG | 45 and 50°C | Arachnida | 660 bp | (79) |
| | CrustDF1/CrustDR1 | GGTCWACAAAYCATAAAGAYATTGG | TAAACYTCAGGRTGACCRAARAAYCA | 45 and 51°C | Crustacea | 650 bp | (80) |
| | CrustF1/HCO2198 | TTTTCTACAAATCATAAAGACATTGG | See above | 45 and 51°C | Crustacea | 650 bp | (73) |
| | CrustF2/HCO2198 | GGTTCTTCTCCACCAACCACAARGAYATHGG | See above | 42°C | Crustacea | 650 bp | (73) |
| | MCOIF/MCOIR | TCTACAAATCATAAAGACATAGG | GAGAAATTATACCAAAACCAGG | 55°C | Scleractinia | 650 bp | (32, 152) |
| | LCOjf/HCO2198 | GGTCAACAAATCATAAAGATATTGGAAC | See above | 45–51°C | Medusozoa | 650 bp | (153, 154) |
| 16S rDNA (mitochondrial LSU) | 16sar-L/16sbr-H | CGCCTGTTTATCAAAAACAT | CCGGTCTGAACTCAGATCACGT | 45–55°C | Universal Metazoa | 500 bp | (155) |
| 18S rDNA (nuclear SSU) | 18S-A/18S-B | AACCTGGTTGATCCTGCCAGT | TGATCCTTCCGCAGGTTCACCT | 47°C | Universal Metazoa | 1.8 kb (a) | (156, 157) |
| 28S rDNA (nuclear LSU) D1-D2 region | LSU D1D2 fw1/LSU D1D2 rev1 | AGCGGAGGAAAAGAAACTA | TACTAGAAGGTTCGATTAGTC | 52.5°C | Universal Metazoa | 0.8–1.3 kb | (57) |
| | LSU D1D2 fw1/LSU D1D2 rev2 | See above | ACGATCGATTTGCACGTCAG | 52.5°C | Universal Metazoa | 0.8–1.3 kb | (57) |
| ITS1, 5.8S, ITS2 | ITS5/ITS4 | GGAAGTAAAAGTCGTAACAAGG | TCCTCCGCTTATTGATATGC | NA | Universal | >600 bp | (158) |

Ta = reported annealing temperatures. Two temperatures indicate a Step-up or Step-down approach; a range indicates that reported Ta commonly varies.
(a) 18S can and often should be amplified and sequenced using multiple internal primers (see references).
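Several primers in Table 7 contain IUPAC degenerate bases (e.g., R, Y, N), and Note 20 gives guidelines on primer length, GC content, and total degeneracy. The Python sketch below shows one way such properties could be tallied for a candidate primer. The function names and the returned dictionary are hypothetical, the thresholds are taken from Note 20, and the GC-clamp test looks only at the final 3′ base, which is a simplification. Melting temperature and self-complementarity are not evaluated here.

```python
# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total


def gc_range(primer):
    """Minimum and maximum GC fraction over all encoded sequences."""
    primer = primer.upper()
    # A position is always G/C if every allowed base is G or C,
    # and possibly G/C if at least one allowed base is G or C.
    low = sum(all(b in "GC" for b in IUPAC[base]) for base in primer)
    high = sum(any(b in "GC" for b in IUPAC[base]) for base in primer)
    return low / len(primer), high / len(primer)


def check_primer(primer, max_degeneracy=10):
    """Compare a primer with the length, GC, and degeneracy guidelines."""
    low, high = gc_range(primer)
    fold = degeneracy(primer)
    return {
        "length_ok": 18 <= len(primer) <= 30,
        "gc_ok": low >= 0.40 and high <= 0.60,
        "degeneracy": fold,
        "degeneracy_ok": fold <= max_degeneracy,
        "ends_in_gc": all(b in "GC" for b in IUPAC[primer.upper()[-1]]),
    }


# dgLCO forward primer sequence from Table 7.
print(check_primer("GGTCAACAAATCATAAAGAYATYGG"))
```

For the dgLCO sequence the sketch reports fourfold degeneracy and a GC fraction between 0.32 and 0.40, so a widely used published primer can fall outside the guideline GC window. The guidelines are best read as starting points for new designs.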


4. Quantification of genomic DNA extractions is optional, because both low and high concentrations of quality DNA can provide good PCR amplifications. However, it is not uncommon to recover extractions as high as 500–1,000 ng/μl, and such high concentrations may inhibit PCR and should be avoided. Highly viscous extracts are indicative of high DNA concentration. Creating and testing serial dilutions (e.g., 1:10, 1:100, and 1:1,000 dilutions of DNA in dH2O) can overcome PCR inhibition from high DNA concentration as well as from inhibiting contaminants present in an extraction.
5. Frost-free freezers often function by cycling temperatures and thus can promote sample degradation. Their use should generally be avoided for tissue, DNA, and reagent storage.
6. For problematic amplifications, including those from samples suspected to contain inhibitory compounds, a variety of PCR-enhancing additives can be used (24, 25, 36, 37). Betaine, BSA, dimethyl sulphoxide (DMSO), and trehalose are among the most successfully used. Most enhancers function by stabilizing contaminants, enzymes, and PCR products in the reaction, and many lower the melting temperature of GC-rich templates. We recommend both trehalose and BSA (Table 4). Given their consistent positive effects and minimal cost, we suggest always including one of these additives in a PCR reaction. Although we found no record of these reagents being used together, we suspect they would not interfere with one another. If PCR enhancers and dilutions (see Note 4) fail to provide reliable amplification, it may be appropriate to try removing PCR inhibitors from DNA samples with commercially available DNA purification kits, or to try alternate DNA extraction methods. Unfortunately, purification kits can negatively affect final DNA concentrations or quality (through size exclusion).
7. Prepared PCR cocktails are thermally unstable and should always be mixed just prior to use. However, addition of trehalose to the cocktail mix will enable storage and use of the cocktail mix for a few months, although repeated freeze–thaw cycles should be avoided.
8. Double (50 μl) or half (12.5 μl) this reaction volume is also commonly used. All of these typically still include ~1 μl of DNA template.
9. A "Standard" PCR profile is widely applicable, especially for taxon-specific primers. However, degenerate or universal primers (e.g., the "Folmer" CO1 primers) may be more successful with a "Step-up" PCR approach in which lower annealing temperatures in the initial cycles facilitate less specific but more successful primer annealing. This creates a greater pool of target sequences for subsequent amplification cycles, which run at higher, more optimal annealing temperatures. The downside to this approach is that lower initial annealing temperatures also facilitate amplification of nontarget regions or contaminant DNA. Some of these concerns can be avoided by not setting annealing temperatures lower than ~45°C. When amplification of nontarget sequences is of specific concern (e.g., when symbionts are known to be present), higher annealing temperatures can help and even a "Step-down" approach may be employed. "Step-down" profiles, like the related "Touchdown" approach (38), can avoid spurious PCR amplicons by beginning at higher, suboptimal annealing temperatures to increase primer specificity and the pool of target sequences before cycling at lower temperatures at which co-amplification can occur. Both "Step-up" and "Step-down" approaches can be powerful tools for troubleshooting.
10. TAE (Tris–Acetate–EDTA) or TBE (Tris–Borate–EDTA) buffers work equally well for gel preparation and as electrophoresis buffers. If products are to be cut out of the gel for further molecular work or preparations, TAE is recommended.
11. Given that EtBr is a known mutagen, use of gloves is necessary (nitrile gloves provide better protection than latex). Review MSDS information and handle with care.
12. For ease of use, we suggest ExoSAP-IT (Affymetrix) for cleaning PCR products when this is not handled by the sequencing facility. The manufacturer's protocol works well, but using a 1:10 dilution of ExoSAP-IT is more affordable and still effective when the 37°C incubation period is extended by 15–30 min. Be aware that this method can make subsequent quantification of PCR products with a spectrophotometer unreliable.
13. If assembly of quality sequences into a bidirectional contig is not possible or several polymorphic sites are present, check the identity of each read for contaminants (see appropriate section). Tracking down the source of this error is important and may involve reevaluating original specimens, DNA extractions, PCR reactions, or sequencing efforts. Additional troubleshooting may be required, including cloning, resequencing, repeating PCR amplification or extractions, or applying alternative primers.
14. Geneious Pro has a number of tools specifically developed to automate or speed up sequence editing and checking, and provides helpful video tutorials (http://www.geneious.com/).
15. Be aware that GenBank (the database queried in a BLAST search) is known to harbor sequences from misidentified species as well as pseudogene and contaminant sequences, so caution should be taken when interpreting BLAST queries (39).
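The dilution arithmetic behind Note 4 can be written out in a few lines. The Python sketch below computes the concentration of each dilution of a stock extract and the resulting template concentration in the reaction. The function names are hypothetical, and the defaults of 1 μl of template in a 25 μl reaction are assumptions inferred from Note 8 rather than values stated in that note.

```python
def serial_dilution(stock_ng_per_ul, factors=(10, 100, 1000)):
    """Concentration (ng/ul) of each 1:factor dilution of a stock extract."""
    return {f"1:{factor}": stock_ng_per_ul / factor for factor in factors}


def final_template_conc(conc_ng_per_ul, template_ul=1.0, reaction_ul=25.0):
    """Template concentration (ng/ul) in the assembled PCR reaction."""
    return conc_ng_per_ul * template_ul / reaction_ul


# A stock at the lower end of the high range mentioned in Note 4.
stock = 500
print(serial_dilution(stock))       # {'1:10': 50.0, '1:100': 5.0, '1:1000': 0.5}
print(final_template_conc(stock))   # 20.0
```

Testing the undiluted extract alongside each dilution in the same PCR run shows quickly whether inhibition, rather than primer mismatch, is the likely cause of a failed amplification.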


16. For protein coding genes with few or no indels (e.g., CO1), we suggest employing simpler, faster algorithms. For more complex markers such as ribosomal DNA (rDNA) or those with large introns, alignment methods employing multiple iterative refinement steps are necessary.
17. NCBI (http://www.ncbi.nlm.nih.gov/) maintains an updated list of known alternative genetic codes under TAXONOMY > TOOLS > Genetic Codes. NCBI's Translation Table 5 describes the standard invertebrate mitochondrial code. Relevant taxa with known deviations from this are:
(a) Cnidaria, Ctenophora, and Placozoa (NCBI's Translation Table 4).
(b) Porifera (NCBI's Translation Tables 4 and 5).
(c) Echinodermata and Hemichordata (NCBI's Translation Table 9).
(d) Urochordata (NCBI's Translation Table 13).
(e) Nematoda (NCBI's Translation Tables 5 and 14).
(f) Platyhelminthes (NCBI's Translation Tables 5, 9, 14, and 21).
Understanding alternative stop codon assignments is the greatest concern when inferring the presence of pseudogenes from amino acid alignments.
18. When amplification is attempted in new or diverse sets of taxa (where primer efficacy has not been evaluated), primer fidelity can be a challenging problem. In these cases, taxon-specific primers can be developed for groups where universal primers do not work well (e.g., see refs. 40, 41).
19. Taxon-specific primers can be designed from alignments of sequences (partial or complete) from closely related taxa (try searching BOLD or GenBank). Focus on developing primers from highly conserved regions on either end of the desired gene. See Hoareau and Boissin (41) for a thorough example of this approach.
20. When developing new primers it is wise to adhere to the following guidelines:
(a) Primers should be between 18 and 30 nucleotides, with a 40–60% GC content.
(b) Melting temperatures of a primer pair should be between 52°C and 65°C and within 5°C of each other.
(c) Avoid nucleotide repeats and base pair complementarities between and within primers. These can promote the formation of hairpin loops, self-annealing, and primer dimers, and will negatively affect target sequence amplification.
(d) Try including a G and/or C at or near the 3′ end to provide a "GC clamp."
(e) For protein coding genes, avoid ending a primer on a third base position of a codon.
(f) Primers should not exceed a total tenfold degeneracy (e.g., twofold = two nucleotides substituted in one position).
21. Mitochondrial markers: Most metazoans share a highly conserved repertoire of approximately 37 mitochondrial genes (13 protein subunits, 2 rDNAs, 22 tRNAs) (42). In addition to this they often carry one or more non-coding "control" regions or intergenic spacers (IGS). Though sequences of both tRNAs and noncoding regions have been successfully used for species-level work (especially the latter), concerns with duplication or lack of true homology make these markers problematic (5, 43–45). Furthermore, gene order rearrangement in mtDNA is common across Metazoa, even within closely related clades. As a result, some caution should be taken when attempting to amplify through multiple gene regions (42, 46).
I3-M11 partition of CO1: This region of CO1 (approximately 450 bp) is just downstream of the Folmer region and has been demonstrated to be a more variable, and thus more successful, species-level marker for several clades including Porifera, Anthozoa, and Nematoda (47, 48).
Ribosomal DNA (rDNA or rRNA): mtDNA includes sequences for a large ribosomal RNA subunit (termed LSU or 16S rDNA) and a small ribosomal RNA subunit (termed SSU or 12S rDNA). These sequences possess highly conserved, easily alignable domains interspersed with fairly divergent, difficult to align regions. Though care must be taken with alignment, these genes are often highly informative for both higher level and species-level phylogenetic analyses. Many taxonomic communities prefer the dual utility of these genes, particularly the 5′ end of 16S.
Additional mtDNA protein coding genes: Twelve other protein coding genes are typically encoded in metazoan mtDNA (NADH 1-6, NADH 4L, CO 2, CO 3, Cyt b, ATP6, ATP8). These genes have each been used with varying success across Metazoa, but a thorough review of their clade-by-clade species-level utility is outside the scope of this chapter. However, we refer readers to more universal protocols presented in the supplemental material of reference (49).
22. Nuclear markers: Development of nuclear, species-level markers for Metazoa is limited by a number of intrinsic challenges. Nuclear protein coding genes typically evolve at a much slower rate than mitochondrial ones, and can exist as members of complex gene families that make homology difficult to infer. Even closely related taxa can display unique gene duplication and extinction patterns. Thus, single copy nuclear genes are often not conserved as such across larger taxonomic scales (50). Furthermore, highly conserved nuclear genes often possess exons with little variation but can have introns that are highly variable and difficult to align. Some researchers have advocated taking advantage of this by developing exon primed intron crossing (EPIC) primers. EPIC markers amplify putative homologous intron regions that should be informative at or below the species level and thus may uniquely complement mtDNA barcoding data. Recent work by Chenuil et al. (51) provides both protocols for developing EPIC markers in Metazoa and a list of broad "universal" primers. Though effective species-level nuclear protein coding genes do exist for some metazoan clades, none have emerged as a likely candidate for DNA barcoding across the kingdom. Traditional nuclear rDNAs that have long been used for multiple levels of analyses remain the best Metazoa-wide alternatives (52). Nuclear rDNA comprises three subunits separated by two internal transcribed spacers (ITS). These are arranged in a single, tandem repeating unit from 5′ to 3′: 18S, ITS1, 5.8S, ITS2, and 28S. These high copy, repeating units typically maintain a single intragenomic sequence identity through concerted evolution (53), although this process can nevertheless maintain multiple unique rDNA copies. In addition, some clades are vulnerable to mobile elements that target rDNA and create a number of pseudogenes. For these reasons, many advocate caution when using these markers (54). These concerns can usually be addressed by intensive cloning and sequencing directed at a small fraction of the samples as well as amplification of larger sequences followed by nested PCRs.
18S (or SSU) rDNA: This large gene region (~1.8 kb) is generally better suited for higher level metazoan phylogenetics because of slow divergence rates. However, because it can be readily amplified it has served as a sort of higher taxon level DNA barcode. It is also popular in "species" level barcoding of nematodes (but see refs. 48, 55) and meiofauna in general, even though 18S may not show sufficient variation to distinguish between closely related species (4, 56).
28S (or LSU) rDNA D1-D2 region: At approximately 3.2 kb, 28S is even larger than 18S; however, it appears to have more informative, hypervariable regions, including the ~0.80–1-kb region of D1-D2, that some advocate using for lower taxonomic level analyses (57, 58). This marker certainly will be phylogenetically informative but its broad utility at the species level has not been well documented (but see refs. 55, 59, 60).
ITS1 and ITS2: Internal transcribed spacers 1 and 2 are thought to be under significantly less selective pressure than the rDNA subunit sequences (53). Their high sequence divergence rates are consistent with this, and have long made them important species-level markers (52). However, this also likely explains why multiple unique copies are maintained in some metazoans. Yet these markers have played an important role in augmenting or replacing CO1 data in some clades where mtDNA performs poorly, especially many non-bilaterians. The length of ITS1 and ITS2 can vary significantly at the interspecific, intraspecific, and sometimes the intragenomic level, posing further challenges for species-level analyses. This should be taken into consideration when troubleshooting these markers.

References

1. Hebert PDN, Cywinska A, Ball SL, Dewaard JR (2003) Biological identifications through DNA barcodes. Proc R Soc Lond B Biol Sci 270:313–321 2. Lynch M, Koskella B, Schaack S (2006) Mutation pressure and the evolution of organelle genomic architecture. Science 311:1727–1730 3. Davison A, Blackie RLE, Scothern GP (2009) DNA barcoding of stylommatophoran land snails: a test of existing sequences. Mol Ecol Resour 9:1092–1101 4. Creer S, Fonseca VG, Porazinska DL et al (2010) Ultrasequencing of the meiofaunal biosphere: practice, pitfalls and promises. Mol Ecol 19(Suppl 1):4–20 5. Hassanin A (2006) Phylogeny of Arthropoda inferred from mitochondrial sequences: strategies for limiting the misleading effects of multiple changes in pattern and rates of substitution. Mol Phylogenet Evol 38:100–116 6. Knowlton N (1993) Sibling species in the sea. Ann Rev Ecol Syst 24:189–216 7. Verheyen E, Salzburger W, Snoeks J, Meyer A (2003) Origin of the superflock of cichlid fishes from Lake Victoria, East Africa. Science 300:325–329 8. Gregory TR, Mable BK (2005) Polyploidy in animals. In: Gregory TR (ed) The evolution of the genome. Academic, Waltham, MA, pp 428–501 9. Landry C, Geyer LB, Arakaki Y, Uehara T, Palumbi SR (2003) Recent speciation in the Indo-West Pacific: rapid evolution of gamete recognition and sperm morphology in cryptic species of sea urchin.
Proc R Soc Lond B Biol Sci 270:1839–1847 10. Meyer CP, Paulay G (2005) DNA barcoding: error rates based on comprehensive sampling. PLoS Biol 3:e422 11. Coyne J, Orr H (2004) Speciation. Sinauer Associates, Sunderland, MA, p 545

12. Mayr E (1963) Animal species and their evolution. Harvard University Press, Cambridge, p 797 13. Meyer C, Geller J, Paulay G (2005) Fine scale endemism on coral reefs: archipelagic differentiation in turbinid gastropods. Evolution 59:113–125 14. Folmer O, Black M, Hoeh W, Lutz R, Vrijenhoek R (1994) DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol Mar Biol Biotechnol 3:294–299 15. Chen I-P, Tang C-Y, Chiou C-Y et al (2009) Comparative analyses of coding and noncoding DNA regions indicate that Acropora (Anthozoa: Scleractina) possesses a similar evolutionary tempo of nuclear vs. mitochondrial genomes as in plants. Mar Biotechnol 11:141–152 16. Huang D, Meier R, Todd PA, Chou LM (2008) Slow mitochondrial COI sequence evolution at the base of the metazoan tree and its implications for DNA barcoding. J Mol Evol 66:167–174 17. McFadden CS, Benayahu Y, Pante E et al (2010) Limitations of mitochondrial gene barcoding in Octocorallia. Mol Ecol Resour 11:1–13 18. Ortman BD (2008) DNA barcoding the medusozoa and ctenophora. Ph.D. Dissertation, University of Connecticut, Storrs, CT 19. Shearer TL, Coffroth MA (2008) DNA BARCODING: barcoding corals: limited by interspecific divergence, not intraspecific variation. Mol Ecol Resour 8:247–255 20. Signorovitch AY, Dellaporta SL, Buss LW (2006) Caribbean placozoan phylogeography. Biol Bull 211:149–156 21. Signorovitch AY, Buss LW, Dellaporta SL (2007) Comparative genomics of large mitochondria in placozoans. PLoS Genet 3:e13


22. Wörheide G, Erpenbeck D, Menke C (2008) The Sponge Barcoding Project: aiding in the identification and description of poriferan taxa. In: Custódio M, Lôbo-Hajdu G, Haidu E, Muricy G (eds) Porifera research: biodiversity, innovation and sustainability. Museu Nacional de Rio de Janiero Book Series. Rio de Janeiro, Brazil, pp 123–128 23. Dawson MN, Jacobs DK (2001) Molecular evidence for cryptic species of Aurelia aurita (Cnidaria, Scyphozoa). Biol Bull 200:92 24. Spiess A-N-L, Mueller N, Ivell R (2004) Trehalose is a potent pcr enhancer: lowering of DNA melting temperature and thermal stabilization of Taq polymerase by the disaccharide trehalose. Clin Chem 50:1256–1259 25. Kreader CA (1996) Relief of amplification inhibition in PCR with bovine serum albumin or T4 gene 32 protein. Appl Environ Microbiol 62:1102–1106 26. Templado J, Paulay G, Gittenberger A, Meyer C (2010) Sampling the marine realm. In: Eymann J, Degreef J, Häuser C et al (eds) Manual on field recording techniques and protocols for all taxa biodiversity inventories. vol 8. ABC Taxa. Belgian National Focal Point for the GTI, Brussels, pp 273–307 27. Eymann J, Degreef J, Häuser C, Monje JC, Samyn Y, Van den Spiegel D (eds) (2010) Manual on field recording techniques and protocols for all taxa biodiversity inventories. vol 8. ABC Taxa. Belgian National Focal Point for the GTI, Brussels 28. Gaither M, Szabó Z, Crepeau M et al (2011) Preservation of corals in salt-saturated DMSO buffer is superior to ethanol for PCR experiments. Coral Reefs 30:329–333 29. Ivanova NV, Dewaard JR, Hebert PDN (2006) An inexpensive, automation-friendly protocol for recovering high-quality DNA. Mol Ecol Notes 6:998–1002 30. Rosengarten RD, Sperling EA, Moreno MA, Leys SP, Dellaporta SL (2008) The mitochondrial genome of the hexactinellid sponge Aphrocallistes vastus: evidence for programmed translational frameshifting. BMC Genomics 9:33 31. 
Sinniger F, Pawlowski J (2009) The partial mitochondrial genome of Leiopathes glaberrima (Hexacorallia: Antipatharia) and the first report of the presence of an intron in COI in black corals. Galaxea 11:21–26 32. Fukami H, Chen CA, Chiou C-Y, Knowlton N (2007) Novel group I introns encoding a putative homing endonuclease in the mitochondrial cox1 gene of Scleractinian corals. J Mol Evol 64:591–600

33. Milbury CA, Gaffney PM (2005) Complete mitochondrial DNA sequence of the eastern oyster Crassostrea virginica. Mar Biotechnol 7:697–712 34. Hajibabaei M, DeWaard JR, Ivanova NV et al (2005) Critical factors for assembling a high volume of DNA barcodes. Philos Trans R Soc Lond B Biol Sci 360:1959–1967 35. DeWaard J, Ivanova N, Hajibabaei M, Hebert P (2008) Assembling DNA barcodes. Analytical protocols. In: Martin C (ed) Methods in molecular biology. Humana, Totowa, pp 275–293 36. Bickley J, Hopkins D (1999) Inhibitors and enhancers of PCR. In: Saunders GC, Parkes HC (eds) Analytical molecular biology: quality and validation. Royal Society of Chemistry, Cambridge, UK, pp 81–102 37. Ralser M, Querfurth R, Warnatz H-J et al (2006) An efficient and economic enhancer mix for PCR. Biochem Biophys Res Comm 347:747–751 38. Hecker KH, Roux KH (1996) High and low annealing temperatures increase both specificity and yield in touchdown and stepdown PCR. Biotechniques 20:478–485 39. Siddall ME, Fontanella FM, Watson SC et al (2009) Barcoding bamboozled by bacteria: convergence to metazoan mitochondrial primer targets by marine microbes. Syst Biol 58:445–451 40. Schubart C (2009) Mitochondrial DNA and decapod phylogenies: the importance of pseudogenes and primer optimization. In: Martin JW, Crandall KA, Felder DL (eds) Decapod crustacean phylogenetics. CRC, Boca Raton, FL, pp 47–64 41. Hoareau TB, Boissin E (2010) Design of phylum-specific hybrid primers for DNA barcoding: addressing the need for efficient COI amplification in the Echinodermata. Mol Ecol Resour 10:960–967 42. Gissi C, Iannelli F, Pesole G (2008) Evolution of the mitochondrial genome of Metazoa as exemplified by comparison of congeneric species. Heredity 101:301–320 43. Chen C, Chiou CY, Dai CF, Chen CA (2008) Unique mitogenomic features in the scleractinian family Pocilloporidae (Scleractinia: Astrocoeniina). Mar Biotech 10:538–553 44.
Rawlings TA, Collins TM, Bieler R (2003) Changing identities: tRNA duplication and remolding within animal mitochondrial genomes. Proc Natl Acad Sci USA 100: 15700–15705 45. Walther E, Schofl G, Mrotzek G et al (2011) Paralogous mitochondrial control region in the


giant tiger shrimp, Penaeus monodon (F.) affects population genetics inference: a cautionary tale. Mol Phylogenet Evol 58:404–408 46. Machida R, Miya M, Nishida M, Nishida S (2006) Molecular phylogeny and evolution of the pelagic copepod genus Neocalanus (Crustacea: Copepoda). Marine Biol 148:1071–1079 47. Erpenbeck D, Hooper JNA, Wörheide G (2006) CO1 phylogenies in diploblasts and the "Barcoding of Life" – are we sequencing a suboptimal partition? Mol Ecol Notes 6:550–553 48. Derycke S, Vanaverbeke J, Rigaux A et al (2010) Exploring the use of cytochrome oxidase c subunit 1 (COI) for DNA barcoding of free-living marine nematodes. PLoS One 5:e13716 49. Simon C, Buckley TR, Frati F, Stewart JB, Beckenbach AT (2006) Incorporating molecular evolution into phylogenetic analysis, and a new compilation of conserved polymerase chain reaction primers for animal mitochondrial DNA. Ann Rev Ecol Syst 37:545–579 50. Simpson R, Wilding C, Grahame J (2005) Intron analyses reveal multiple calmodulin copies in Littorina. J Mol Evol 60:505–512 51. Chenuil A, Hoareau TB, Egea E et al (2010) An efficient method to find potentially universal population genetic markers, applied to metazoans. BMC Evol Biol 10:276 52. Hwang UW, Kim W (1999) General properties and phylogenetic utilities of nuclear ribosomal DNA and mitochondrial DNA commonly used in molecular systematics. Korean J Parasitol 37:215 53. Eickbush TH, Eickbush DG (2007) Finely orchestrated movements: evolution of the ribosomal RNA genes. Genetics 175:477–485 54. Harris DJ, Crandall KA (2000) Intragenomic variation within ITS1 and ITS2 of freshwater crayfishes (Decapoda: Cambaridae): implications for phylogenetic and microsatellite studies. Mol Biol Evol 17:284 55. Derycke S, Fonseca G, Vierstraete A et al (2008) Disentangling taxonomy within the Rhabditis (Pellioditis) marina (Nematoda, Rhabditidae) species complex using molecular and morphological tools. Zool J Linn Soc 152:1–15 56.
Bhadury P, Austen MC (2010) Barcoding marine nematodes: an improved set of nematode 18S rRNA primers to overcome eukaryotic co-interference. Hydrobiologia 641: 245–251 57. Sonnenberg R, Nolte A (2007) An evaluation of LSU rDNA D1-D2 sequences for their use in species identification. Front Zool 4:6


58. Markmann M, Tautz D (2005) Reverse taxonomy: an approach towards determining the diversity of meiobenthic organisms based on ribosomal RNA signature sequences. Philos Trans R Soc Lond B Biol Sci 360:1917–1924 59. Cárdenas P, Rapp HT, Schander C, Tendal OS (2010) Molecular taxonomy and phylogeny of the Geodiidae (Porifera, Demospongiae, Astrophorida)–combining phylogenetic and Linnaean classification. Zoolog Scripta 39: 89–106 60. McLain DK, Li J, Oliver JH (2001) Interspecific and geographical variation in the sequence of rDNA expansion segment D3 of Ixodes ticks (Acari: Ixodidae). Heredity 86:234–242 61. Benesh DP, Hasu T, Suomalainen L-R, Valtonen ET, Tiirola M (2006) Reliability of mitochondrial DNA in an acanthocephalan: the problem of pseudogenes. Int J Parasitol 36:247–254 62. Martínez-Aquino A, Reyna-Fabián ME, Rosas-Valdez R, Razo-Mendivil U, de León GP-P, García-Varela M (2009) Detecting a complex of cryptic species within Neoechinorhynchus golvani (Acanthocephala: Neoechinorhynchidae) inferred from ITSs and LSU rDNA gene sequences. Int J Parasitol 95: 1040–1047 63. Steinauer ML, Nickol BB, Ortí G (2007) Cryptic speciation and patterns of phenotypic variation of a highly variable acanthocephalan parasite. Mol Ecol 16:4097–4109 64. Sikes JM, Bely AE (2008) Radical modification of the A-P axis and the evolution of asexual reproduction in Convolutriloba acoels. Evol Dev 10:619–631 65. Telford MJ, Herniou EA, Russell RB, Littlewood DT (2000) Changes in mitochondrial genetic codes as phylogenetic characters: two examples from the flatworms. Proc Natl Acad Sci USA 97:11359–11364 66. Chang C-H, Rougerie R, Chen J-H (2009) Identifying earthworms through DNA barcodes: pitfalls and promise. Pedobiologia 52:171–180 67. Aguado MT, Nygren A, Siddall ME (2007) Cladistics analysis of nuclear and mitochondrial genes. Cladistics 23:552–564 68. Carr CM (2010) The polychaeta of canada: exploring diversity and distribution patterns using DNA barcodes. 
MSc Thesis, University of Guelph, Guelph, ON 69. James SW, Porco D, Decaëns T et al (2010) DNA barcoding reveals cryptic diversity in Lumbricus terrestris L., 1758 (Clitellata): resurrection of L. herculeus (Savigny, 1826). PLoS One 5:e15629

74

N. Evans and G. Paulay

70. Zhou H, Zhang Z, Chen H et al (2010) Integrating a DNA barcoding project with an ecological survey: a case study on temperate intertidal polychaete communities in Qingdao, China. Chin J Oceanol Limnol 28:899–910 71. Bely AE, Weisblat DA (2006) Lessons from leeches: a call for DNA barcoding in the lab. Evol Dev 8:491–501 72. Costa FO, Henzler CM, Lunt DH et al (2009) Probing marine Gammarus (Amphipoda) taxonomy with DNA barcodes. Syst Biod 7:365 73. Costa FO, DeWaard JR, Boutillier J, Ratnasingham S, Dooh RT, Hajibabaei M, Hebert PDN (2007) Biological identifications through DNA barcodes: the case of the Crustacea. Can J Fish Aquat Sci 64:272–295 74. Böttger-Schnack R, Machida RJ (2010) Comparison of morphological and molecular traits for species identification and taxonomic grouping of oncaeid copepods. Hydrobiologia 666:111–125 75. Bradford T, Adams M, Humphreys W, Austin A, Cooper S (2010) DNA barcoding of stygofauna uncovers cryptic amphipod diversity in a calcrete aquifer in Western Australia’s arid zone. Mol Ecol Resour 10:41–50 76. Goolsby JA, DE Barro PJ, Makinson JR, Pemberton RW, Hartley DM, Frohlich DR (2006) Matching the origin of an invasive weed for selection of a herbivore haplotype for a biological control programme. Mol Ecol 15:287–297 77. Murienne J, Edgecombe GD, Giribet G (2010) Including secondary structure, fossils and molecular dating in the centipede tree of life. Mol Phylogenet Evol 57:301–313 78. Navajas M, Navia D (2010) DNA-based methods for eriophyoid mite studies: review, critical aspects, prospects and challenges. Exp Appl Acarol 51:257–271 79. Barrett RDH, Hebert PDN (2005) Identifying spiders through DNA barcodes. Can J Zool 83:481–491 80. Radulovici AE, Sainte-Marie B, Dufresne F (2009) DNA barcoding of marine crustaceans from the Estuary and Gulf of St Lawrence: a regional-scale approach. Mol Ecol Resour 9:181–187 81. 
Ros VID, Breeuwer JAJ (2007) Spider mite (Acari: Tetranychidae) mitochondrial COI phylogeny reviewed: host plant relationships, phylogeography, reproductive parasites and barcoding. Exp Appl Acarol 42:239–262 82. Hurst GDD, Jiggins FM (2005) Problems with mitochondrial DNA as a marker in population, phylogeographic and phylogenetic studies: the effects of inherited symbionts. Proc R Soc Lond B Biol Sci 272:1525–1534

83. Engelstädter J, Hurst GDD (2009) The ecology and evolution of microbes that manipulate host reproduction. Ann Rev Ecol Evol Syst 40:127–149 84. Cohen BL, Bitner MA, Harper EM et al (2011) Vicariance and convergence in Magellanic and New Zealand long-looped brachiopod clades (Pan-Brachiopoda: Terebratelloidea). Zoolog J Linn Soc 162. doi: 10.1111/j.1096-3642.2010.00682.x 85. Lüter C, Cohen B (2002) DNA sequence evidence for speciation, paraphyly and a Mesozoic dispersal of cancellothyridid articulate brachiopods. Mar Biol 141:65–74 86. Gómez A, Wright PJ, Lunt DH et al (2007) Mating trials validate the use of DNA barcoding to reveal cryptic speciation of a marine bryozoan taxon. Proc R Soc Lond B Biol Sci 274:199–207 87. Kon T, Nohara M, Nishida M et al (2006) Hidden ancient diversification in the circumtropical lancelet Asymmetron lucayanum complex. Mar Biol 149:875–883 88. Jennings RM, Bucklin A, Pierrot-Bults A (2010) Barcoding of arrow worms (Phylum Chaetognatha) from three oceans: genetic diversity and evolution within an enigmatic phylum. PLoS One 5:e9949 89. Sinniger F, Reimer JD, Pawlowski J (2008) Potential of DNA sequences to identify zoanthids (Cnidaria: Zoantharia). Zoolog Sci 25:1253–1260 90. Concepcion GT, Crepeau MW, Wagner D et al (2007) An alternative to ITS, a hypervariable, single-copy nuclear intron in corals, and its use in detecting cryptic species within the octocoral genus Carijoa. Coral Reefs 27:323–336 91. Coleman AW, van Oppen MJH (2008) Secondary structure of the rRNA ITS2 region reveals key evolutionary patterns in acroporid corals. J Mol Evol 67:389–396 92. Chiou CY, Chen IP, Chen C et al (2008) Analysis of Acropora muricata Calmodulin (CaM) indicates that scleractinian corals possess the ancestral exon/intron organization of the eumetazoan CaM gene. J Mol Evol 66:317–324 93. 
Flot J-F, Magalon H, Cruaud C et al (2008) Patterns of genetic structure among Hawaiian corals of the genus Pocillopora yield clusters of individuals that are compatible with morphology. C R Biol 331:239–247 94. Miranda LS, Collins AG, Marques AC (2010) Molecules clarify a cnidarian life cycle – the “hydrozoan” Microhydrula limopsicola is an early life stage of the Staurozoan Haliclystus antarcticus. PLoS One 5:e10182

4

DNA Barcoding Methods for Invertebrates

95. Moura CJ, Harris DJ, Cunha MR, Rogers AD (2008) DNA barcoding reveals cryptic diversity in marine hydroids (Cnidaria, Hydrozoa) from coastal and deep-sea environments. Zoolog Scripta 37:93–108 96. Dawson MN (2004) Some implications of molecular phylogenetics for understanding biodiversity in jellyfishes, with emphasis on Scyphozoa. Hydrobiologia 530–531: 249–260 97. Ortman BD, Bucklin A, Pagès F, Youngbluth M (2010) DNA barcoding the Medusozoa using mtCOI. Deep Sea Res II 57: 2148–2156 98. Henderson M, Okamura B (2004) The phylogeography of salmonid proliferative kidney disease in Europe and North America. Proc R Soc Lond B Biol Sci 271:1729 99. Whipps CM, Kent ML (2006) Phylogeography of the cosmopolitan marine parasite Kudoa thyrsites (Myxozoa: Myxosporea). J Eukaryot Microbiol 53:364–373 100. Podar M, Haddock SH, Sogin ML, Harbison GR (2001) A molecular phylogenetic framework for the phylum Ctenophora using 18S rRNA genes. Mol Phylogenet Evol 21: 218–230 101. Gorokhova E, Lehtiniemi M, Viitasalo-fro S, Haddock SHD (2009) Molecular evidence for the occurrence of ctenophore Mertensia ovum in the northern Baltic Sea and implications for the status of the Mnemiopsis leidyi invasion. Limnol Oceanogr 54:2025–2033 102. Obst M, Funch P, Giribet G (2005) Hidden diversity and host specificity in cycliophorans: a phylogeographic analysis along the North Atlantic and Mediterranean Sea. Mol Ecol 14:4427–4440 103. Lessios HA (2008) The great American schism: divergence of marine organisms after the rise of the Central American Isthmus. Ann Rev Ecol Evol Syst 39:63–91 104. Fuchs J, Iseto T, Hirose M, Sundberg P, Obst M (2010) The first internal molecular phylogeny of the animal phylum Entoprocta (Kamptozoa). Mol Phyl Evol 56:370–379 105. Todaro MA, Kånneby T, Dal Zotto M, Jondelius U (2011) Phylogeny of thaumastodermatidae (gastrotricha: macrodasyida) inferred from nuclear and mitochondrial sequence data. PLoS One 6:e17892 106. 
Sørensen MV, Sterrer W, Giribet G (2006) Cladistics four molecular loci and morphology. Cladistics 22:32–58 107. Cannon JT, Rychel AL, Eccleston H, Halanych KM, Swalla BJ (2009) Molecular phylogeny of hemichordata, with updated

75

status of deep-sea enteropneusts. Mol Phyl Evol 52:17–24 108. Smith SE, Douglas R, Burke K, Swalla BJ (2003) Morphological and molecular identification of Saccoglossus species (Hemichordata: Harrimaniidae) in the Pacific Northwest. Can J Zool 141:133–141 109. Giribet G, Sorensen MV, Funch P et al (2004) Investigations into the phylogenetic position of Micrognathozoa using four molecular loci. Cladistics 20:1–13 110. Doucet-Beaupré H, Breton S, Chapman EG et al (2010) Mitochondrial phylogenomics of the Bivalvia (Mollusca): searching for the origin and mitogenomic correlates of doubly uniparental inheritance of mtDNA. BMC Evol Biol 10:50 111. Campbell DC, Johnson PD, Williams JD et al (2008) Identification of “extinct” freshwater mussel species using DNA barcoding. Mol Ecol Resour 8:711–724 112. Ghiselli F, Milani L, Passamonti M (2011) Strict sex-specific mtDNA segregation in the germ line of the DUI species Venerupis philippinarum (Bivalvia: Veneridae). Mol Biol Evol 28:949–961 113. Allcock AL, Barratt I, Eléaume M et al (2010) Cryptic speciation and the circumpolarity debate: a case study on endemic Southern Ocean octopuses using the COI barcode of life. Deep Sea Res II 114. Kelly RP, Sarkar IN, Eernisse DJ, Desalle R (2007) DNA barcoding using chitons (genus Mopalia). Mol Ecol Notes 7:177–183 115. Dinapoli A, Klussmann-Kolb A (2010) The long way to diversity–phylogeny and evolution of the Heterobranchia (Mollusca: Gastropoda). Mol Phylogenet Evol 55:60–76 116. Barr NB, Cook A, Elder P et al (2009) Application of a DNA barcode using the 16S rRNA gene to diagnose pest Arion species in the USA. J Molluscan Stud 75:187–191 117. Ladoukakis ED, Theologidis I, Rodakis GC, Zouros E (2011) Homologous recombination between highly diverged mitochondrial sequences: examples from maternally and paternally transmitted genomes. Mol Biol Evol 28:1–40 118. 
Breton S, Stewart DT, Shepardson S et al (2011) Novel protein genes in animal mtDNA: a new sex determination system in freshwater mussels (Bivalvia: Unionoida)? Mol Biol Evol 28:1645–1659 119. Bourlat SJ, Nakano H, Åkerman M et al (2008) Feeding ecology of Xenoturbella bocki (phylum Xenoturbellida) revealed by genetic barcoding. Mol Ecol Resour 8:18–22

76

N. Evans and G. Paulay

120. Bhadury P, Austen MC, Bilton DT et al (2007) Exploitation of archived marine nematodes – a hot lysis DNA extraction protocol for molecular studies. Zoolog Scripta 36:93–98 121. De Ley P, De Ley IT, Morris K et al (2005) An integrated approach to fast and informative morphological vouchering of nematodes for applications in molecular barcoding. Philos Trans R Soc Lond B Biol Sci 360: 1945–1958 122. Mateos E, Giribet G (2008) Exploring the molecular diversity of terrestrial nemerteans (Hoplonemertea, Monostilifera, Acteonemertidae) in a continental landmass. Zoolog Scripta 37:235–243 123. Maslakova S, Norenburg J (2008) Revision of the smiling worms, genus Prosorhochmus Keferstein, 1862, and description of a new species, Prosorhochmus belizeanus sp. nov. (Prosorhochmidae, Hoplonemertea, Nemertea) from Florida and Belize. J Nat History 42:1219–1260 124. Sundberg P, Thuroczy Vodoti E, Strand M (2010) DNA barcoding should accompany taxonomy – the case of Cerebratulus spp (Nemertea). Mol Ecol Resour 10:274–281 125. Daniels SR, Ruhberg H (2010) Molecular and morphological variation in a South African velvet worm Peripatopsis moseleyi (Onychophora, Peripatopsidae): evidence for cryptic speciation. J Zool 282:171–179 126. Trewick SA (2000) Mitochondrial DNA sequences support allozyme evidence for cryptic radiation of New Zealand Peripatoides (Onychophora). Mol Ecol 9:269–281 127. Podsiadlowski L, Braband A, Mayer G (2008) The complete mitochondrial genome of the onychophoran Epiperipatus biolleyi reveals a unique transfer RNA set and provides further support for the ecdysozoa hypothesis. Mol Biol Evol 25:42–51 128. Santagata S, Cohen BL (2009) Phoronid phylogenetics (Brachiopoda; Phoronata): evidence from morphological cladistics, small and large subunit rDNA sequences, and mitochondrial cox1. Zool J Linn Soc 157:34–50 129. Voigt O, Collins AG, Pearse VB et al (2004) Placozoa – no longer a phylum of one. Current Biol 14:944–945 130. 
Sanna D, Lai T, Francalacci P et al (2009) Population structure of the Monocelis lineata (Proseriata, Monocelididae) species complex assessed by phylogenetic analysis of the mitochondrial Cytochrome c Oxidase subunit I (COI) gene. Gen Mol Biol 32:864–867 131. Moszczynska A, Locke SA, McLaughlin JD et al (2009) Development of primers for the

mitochondrial cytochrome c oxidase I gene in digenetic trematodes (Platyhelminthes) illustrates the challenge of barcoding parasitic helminths. Mol Ecol Resour 9:75–82 132. Zarowiecki MZ, Huyse T, Littlewood DTJ (2007) Making the most of mitochondrial genomes–markers for phylogeny, molecular ecology and barcodes in Schistosoma (Platyhelminthes: Digenea). Int J Parasitol 37:1401–1418 133. Pöppe J, Sutcliffe P, Hooper JNA et al (2010) CO I barcoding reveals new clades and radiation patterns of Indo-Pacific sponges of the family Irciniidae (Demospongiae: Dictyoceratida). PLoS One 5:e9950 134. Wang X, Lavrov DV (2008) Seventeen new complete mtDNA sequences reveal extensive mitochondrial genome evolution within the Demospongiae. PLoS One 3:e2723 135. Watanabe KI, Bessho Y, Kawasaki M, Hori H (1999) Mitochondrial genes are found on minicircle DNA molecules in the mesozoan animal Dicyema. J Mol Biol 286:645–650 136. Derry AM, Hebert PDN, Prepas EE (2003) Evolution of rotifers in saline and subsaline lakes: a molecular phylogenetic approach. Limn Oceanograph 48:675–685 137. Fontaneto D, Kaya M, Herniou EA, Barraclough TG (2009) Extreme levels of hidden diversity in microscopic animals (Rotifera) revealed by DNA taxonomy. Mol Phy Evol 53:182–189 138. Gómez A, Serra M, Carvalho GR, Lunt DH (2002) Speciation in ancient cryptic species complexes: evidence from the molecular phylogeny of Brachionus plicatilis (Rotifera). Evolution 56:1431–1444 139. Du X, Chen Z, Deng Y, Wang Q (2009) Comparative analysis of genetic diversity and population structure of Sipunculus nudus as revealed by mitochondrial COI sequences. Biochem Genet 47:884–891 140. Kawauchi GY, Giribet G (2010) Are there true cosmopolitan sipunculan worms? A genetic variation study within Phascolosoma perlucens (Sipuncula, Phascolosomatidae). Marine Biol 157:1417–1431 141. Blaxter M, Mann J, Chapman T, Thomas F, Whitton C, Floyd R, Abebe E (2005) Defining operational taxonomic units using DNA barcode data. 
Philos Trans R Soc Lond B Biol Sci 360:1935–1943 142. Sands CJ, Convey P, Linse K, McInnes SJ (2008) Assessing meiofaunal variation among individuals utilising morphological and molecular approaches: an example using the Tardigrada. BMC Ecol 8:7

4

DNA Barcoding Methods for Invertebrates

143. Schill RO (2007) Comparison of different protocols for DNA preparation and PCR amplification of mitochondrial genes of tardigrades. J Limnol 66:164–170 144. Cesari M, Bertolani R, Rebecchi L, Guidetti R (2009) DNA barcoding in Tardigrada: the first case study on Macrobiotus macrocalix Bertolani & Rebecchi 1993 (Eutardigrada, Macrobiotidae). Mol Ecol Resour 9:699–706 145. Stefaniak L, Lambert G, Gittenberger A et al (2009) Genetic conspecificity of the worldwide populations of Didemnum vexillum Kott, 2002. Aquat Invasion 4:29–44 146. Nydam ML, Harrison RG (2007) Genealogical relationships within and among shallow-water Ciona species (Ascidiacea). Marine Biol 151:1839–1847 147. Bourlat SJ, Nielsen C, Lockyer AE et al (2003) Xenoturbella is a deuterostome that eats molluscs. Nature 424:925–928 148. Meyer CP (2003) Molecular systematics of cowries (Gastropoda: Cypraeidae) and diversification patterns in the tropics. Biol J Linn Soc 79:401–459 149. Kojima S, Segawa R, Hashimoto J, Ohta S (1997) Molecular phylogeny of vestimentiferans collected around Japan, revealed by the nucleotide sequences of mitochondrial DNA. Marine Biol 127:507–513 150. Prendini L (2005) Systematics of the group of African whip spiders (Chelicerata: Amblypygi): Evidence from behaviour, morphology and DNA. Organ Div Evol 5:203–236 151. Schwendinger PJ, Giribet G (2005) The systematics of the south-east Asian genus

77

Fangensis Rambla (Opiliones: Cyphophthalmi: Stylocellidae). Invertebr Syst 19:297–323 152. Fukami H, Budd AF, Levitan DR et al (2004) Geographic differences in species boundaries among members of the Montastraea annularis complex based on molecular and morphological markers. Evolution 58:324–337 153. Dawson MN (2005) Incipient speciation of Catostylus mosaicus (Scyphozoa, Rhizostomeae, Catostylidae), comparative phylogeography and biogeography in south-east Australia. J Biog 32:515–533 154. Martínez DE, Iñiguez AR, Percell KM et al (2010) Phylogeny and biogeography of Hydra (Cnidaria: Hydridae) using mitochondrial and nuclear DNA sequences. Mol Phyl Evol 57:403–410 155. Palumbi SR, Martin A, Romano S et al (2002) The simple fool’s guide to PCR, Version 2.0. Department of Zoology and Kewalo Marine Laboratory, Honolulu, HI 156. Apakupakul K, Siddall ME, Burreson EM (1999) Higher level relationships of leeches (Annelida: Clitellata: Euhirudinea) based on morphology and gene sequences. Mol Phylogenet Evol 12:350–359 157. Medlin L, Elwood HJ, Stickel S, Sogin ML (1988) The Characterization of enzymatically amplified eukaryotic 16S-like rRNA-coding regions. Gene 71:491–499 158. White T, Bruns T, Lee S, Taylor J (1990) Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. In: Innis M, Gelfand D, Sninsky J, White T (eds) PCR Protocols: a guide to methods and applications. Academic, New York, pp 315–322

Chapter 5
DNA Barcoding Amphibians and Reptiles
Miguel Vences, Zoltán T. Nagy, Gontran Sonet, and Erik Verheyen

Abstract
Only a few major research programs are currently targeting COI barcoding of amphibians and reptiles (including chelonians and crocodiles), two major groups of tetrapods. Amphibian and reptile species are typically old, strongly divergent, and contain deep conspecific lineages, which might lead to problems in species assignment when reference databases are incomplete. As far as is known, there is no single pair of COI primers that will guarantee a sufficient rate of success across all amphibian and reptile taxa, or within major subclades of amphibians and reptiles, which means that the PCR amplification strategy needs to be adjusted depending on the specific research question. In general, many more amphibian and reptile taxa have been sequenced for 16S rDNA, which for some purposes may be a suitable complementary marker, at least until a more comprehensive COI reference database becomes available. DNA barcoding has successfully been used to identify amphibian larval stages (tadpoles) in species-rich tropical assemblages. Tissue sampling, DNA extraction, and amplification of COI are straightforward in amphibians and reptiles. Single primer pairs are likely to have a failure rate between 5 and 50% if taxa of a wide taxonomic range are targeted; in such cases the use of primer cocktails or the subsequent hierarchical use of different primer pairs is necessary. If the target group is taxonomically limited, many studies have followed a strategy of designing specific primers, which then allow an easy and reliable amplification of all samples.

Key words: Amphibia, Testudines, Crocodylia, Sphenodontia, Squamata, COI primers

1. Introduction
In contrast to numerous other taxa, especially fishes and birds among vertebrates, DNA barcoding of amphibians and reptiles is at a very early stage. We here use the term amphibians as encompassing all Lissamphibia, i.e., frogs, salamanders, and caecilians (as of February 2012, totaling 6,922 species: 6,115 frogs, 618 salamanders, and 189 caecilians) (1). Reptiles are a paraphyletic group, and we use the term here to include all nonavian extant taxa of the Testudines, Crocodylia, Sphenodontia, and Squamata (as of February 2008, 8,734 species: 313 turtles, 23 crocodiles, 2 tuataras, and 8,396 squamates) (2).

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_5, © Springer Science+Business Media, LLC 2012


Only a few DNA barcoding campaigns on reptiles have been initiated recently, e.g., DNA barcoding of the South African reptile fauna (see also the International Barcode of Life web site; www.ibol.org). To date, the number of studies and publications dedicated to DNA barcoding of reptiles in general is very limited. An exception is the small, manageable number of marine turtle species of high conservation concern, for which good progress in DNA barcoding was recently achieved (3, 4). Also in the context of conservation biology and genetics, DNA barcoding was recently applied to identify species targeted by bushmeat practices, among them alligators and crocodiles (5, 6). In amphibians, several test cases of COI DNA barcoding have been published (7–9), and an extensive DNA barcoding program is currently being carried out on Central and South American taxa and has already led to remarkable results (10). From our own work in progress, rich data sets on the amphibians and reptiles of Madagascar are available (taxon coverage ca. 90 and 80%, respectively), and research is continuing toward complete taxon coverage; ongoing field surveys will enable us to initiate similar barcoding efforts for the frogs of the Congo basin and of Cuba. Given the critical conservation status of many amphibians in particular, the implementation of larger amphibian DNA barcoding programs would be very useful. Such programs would make it possible to delimit the distribution area and habitat use of endangered species more efficiently, including on the basis of larvae or juveniles, which currently cannot be reliably identified. Integration of molecular assessment would help to accelerate the pace of species discovery and improve the quality of species hypotheses (11, 12).
Until 2010, the vast majority of amphibian and reptile COI sequences were not produced in the framework of the global DNA barcoding initiative; most resulted from phylogenetic or phylogeographic studies in which COI was used as one of the genetic markers. In addition, numerous COI sequences in GenBank originated from sequencing strategies in which a stretch containing full or partial ND1 and ND2 genes, intervening tRNAs, and only a short section (100–200 bp) of the 5′ terminus of the COI gene is obtained for phylogenetic analysis (e.g., for amphibians see refs. 13, 14). We have not considered the studies involving this fragment in the primer tables given herein. Beyond investigations on DNA barcoding and phylogeny, a growing number of mitogenomic studies have yielded COI sequences. Among those with stronger impact or including several species are (15–21) for reptiles and (22–25) for amphibians. These studies have certainly contributed to the number of available COI sequences, but are otherwise not related to the DNA barcoding effort as such. However, the available coverage of higher taxa such as orders and families in mitogenomic studies is of crucial importance because it allows the design of primers for a variety of regions of the mitochondrial genome (26),


including targeted COI primers for particular taxonomic groups or species in which universal primers may fail. A common theme in amphibian and reptile DNA barcoding is that no single pair of primers guarantees a sufficient rate of success across all taxa, which means that the amplification strategy needs to be adjusted to the specific research question. As far as is known, there are also no primers that are universal within major amphibian or reptile subgroups, such as salamanders, frogs, snakes, or lizards. In our experience, for amplifying and sequencing large numbers of samples from a restricted taxonomic group (a single species or a complex of closely related species), it is most convenient to design specific primers. If a wide array of taxa is to be screened, either a primer cocktail or a hierarchical approach is advisable (first using one pair of universal primers, and subsequently using a different set of primers for samples that failed to amplify in the first attempt). A first compilation of mitochondrial DNA primers used in amphibians was published in 1999 (27) but included only a few COI primers. Although not comprehensive, Tables 1 and 2 give a representative overview of the primers and annealing temperatures used so far in studies that involved sequencing of COI in a larger number of samples of amphibians or reptiles, respectively. The specificity of the primers and the targeted fragment size vary from case to case; the position of the primers in the COI gene, and relative to the Folmer region (28), is shown in Figs. 1 and 2. When barcoding amphibians and reptiles, it should be kept in mind that many species and species complexes are evolutionarily old and contain cryptic candidate species and deep conspecific lineages (refs. 7, 29; see also Note 1). This situation appears to be more commonly encountered in the tropics.
In temperate regions, on the one hand, species are better studied, so that new cryptic lineages are discovered less frequently; on the other hand, these species have often expanded from glacial refuges in the Pleistocene, so that similar mitochondrial haplotypes can be encountered over vast geographic ranges and divergences within species are less deep. Altogether, DNA barcoding of amphibians and reptiles based on COI is not fundamentally different from that in other animal groups and holds the same promise. The specifics to be kept in mind are mainly the old age of many species and the potential presence of very deeply diverged mitochondrial lineages within species, which (a) make very complete COI reference databases necessary for successful species identification and (b) accentuate the problem of primer failure in single samples, even within species or species complexes. Below we give a brief overview of laboratory methods for tissue sampling and for extracting DNA as well as amplifying and sequencing COI from amphibian and reptile specimens. These methods,

Table 1 Selection of primers used for amplifying COI (fragments) in phylogenetic or phylogeographic studies of amphibians with details on taxon specificity and PCR conditions

| Primer name | Specificity/origin | Position | Sequence (5′–3′) | Direction | Annealing temperature (°C) | Primer reference | Used for | Studies |
|---|---|---|---|---|---|---|---|---|
| COIf | Universal | 6,047 | CCTGCAGGAGGAGGAGAYCC | F | 45; 57 | (46) | Tungara frogs (Physalaemus); dirt frogs (Craugastor) | (49, 51) |
| COIa | Universal | 6,707 | AGTATAAGCGTCTGGGTAGTC | R | 45; 57 | (46) | Tungara frogs (Physalaemus); dirt frogs (Craugastor) | (49, 51) |
| COIa2 | Universal | 6,662 | CCTGCYARYCCTARRAARTGTTGAGG | R | 45 | (50) | Tungara frogs (Physalaemus) | (48) |
| LCO1490 | Universal | 5,406 | GGTCAACAAATCATAAAGATATTGG | F | 50; 49–50 | (28) | Poison frogs (Oophaga); Malagasy frogs (Mantellidae) | (7, 8, 51) |
| HCO2198 | Universal | 6,089 | TAAACTTCAGGGTGACCAAAAAATCA | R | 50; 49–50 | (28) | Poison frogs (Oophaga); Malagasy frogs (Mantellidae) | (7, 8, 51) |
| VF2_t1 | Fishes | 5,392 | TGTAAAACGACGGCCAGTCAACCAACCACAAAGACATTGGCAC | F | 52, used in cocktail (tailed) | (52) | Clawed frogs (Xenopus) | (53) |
| FishF2_t1 | Fishes | 5,391 | TGTAAAACGACGGCCAGTCGACTAATCATAAAGATATCGGCAC | F | 52, used in cocktail (tailed) | (52) | Clawed frogs (Xenopus) | (53) |
| FishR2_t1 | Fishes | 6,086 | CAGGAAACAGCTATGACACTTCAGGGTGACCGAAGAATCAGAA | R | 52, used in cocktail (tailed) | (52) | Clawed frogs (Xenopus) | (53) |
| FR1d_t1 | Fishes | 6,086 | CAGGAAACAGCTATGACACCTCAGGGTGTCCGAARAAYCARAA | R | 52, used in cocktail (tailed) | (54) | Clawed frogs (Xenopus) | (53) |
| VF1-d | Fishes | 5,405 | TTCTCAACCAACCACAARGAYATYGG | F | 45 and 51 | (55) | Various frog and salamander taxa | (9) |
| VR1-d | Fishes | 6,089 | TAGACTTCTGGGTGGCCRAARAAYCA | R | 45 and 51 | (55) | Various frog and salamander taxa | (9) |
| LepF1 | Butterflies | 5,406 | ATTCAACCAATCATAAAGATATTGG | F | 45 and 51 | (56) | Various frog and salamander taxa | (9) |
| LepRI | Butterflies | 6,089 | TAAACTTCTGGATGTCCAAAAAATCA | R | 45 and 51 | (56) | Various frog and salamander taxa | (9) |
| BirdF1 | Birds | 5,408 | TTCTCCAACCACAAAGACATTGGCAC | F | 49–50 | (57) | Malagasy frogs (Mantellidae) | (8) |
| BirdR1 | Birds | 6,129 | ACGTGGGAGATAATTCCAAATCCTG | R | 49–50 | (57) | Malagasy frogs (Mantellidae) | (8) |
| BirdR2 | Birds | 6,129 | ACTACATGTGAGATGATTCCGAATCCAG | R | 49–50 | (57) | Malagasy frogs (Mantellidae) | (8) |
| "Desmognathus-forward" | Dusky salamanders | 5,370 | CGGCCACTTTACCYRTGATAATYACTCG | F | 52 | (58) | Dusky salamanders (Desmognathus) | (58) |
| "Desmognathus-reverse" | Dusky salamanders | 6,005 | GTATTAAGATTTCGGTCTGTTAGAAGTAT | R | 52 | (58) | Dusky salamanders (Desmognathus) | (58) |
| MVZ_201 | Arboreal salamanders (Aneides) | 5,408 | TCAACAAAYCATAAAGATATTGGCACC | F | NA | (7) | Arboreal salamanders (Aneides) | (7) |
| MVZ_202 | Arboreal salamanders (Aneides) | 6,695 | GCGTCWGGGTARTCTGAATATCGTCG | R | NA | (7) | Arboreal salamanders (Aneides) | (7) |
| PP6 | Physalaemus | 6,302 | TCTGCAACAATAATYATYCGCAATTCCAAC | F | Internal sequencing primer | (50) | Tungara frogs (Physalaemus) | (48) |
| PP7 | Physalaemus | 6,302 | GTTGGAATTGCRATGATTATTGTTGCAGA | R | Internal sequencing primer | (50) | Tungara frogs (Physalaemus) | (48) |
| PP8 | Physalaemus | 6,467 | TCTCTAGAYATTGTATTACATGA | F | Internal sequencing primer | (50) | Tungara frogs (Physalaemus) | (48) |
| PP9 | Physalaemus | 6,467 | TCATGTAATACAATGTCTAGAGA | R | Internal sequencing primer | (50) | Tungara frogs (Physalaemus) | (48) |
| COI-1 | Fire-bellied toads | 5,412 | CAAATCACAAAGACATTGGCACCCT | F | NA | (38) | Fire-bellied toads (Bombina) | (38) |
| COI-2 | Fire-bellied toads | 6,503 | GATACGACATAGTGGAAGTGGGCTAC | R | NA | (38) | Fire-bellied toads (Bombina) | (38) |
| COI-3 | Fire-bellied toads | 6,503 | GACAGAACATAGTGGAAGTGAGCTAC | R | NA | (38) | Fire-bellied toads (Bombina) | (38) |
| COI-4 | Fire-bellied toads | 5,903 | CCAGCAATGTCACAATACCAAAC | F | NA | (38) | Fire-bellied toads (Bombina) | (38) |
| COI-5 | Fire-bellied toads | 5,984 | TGGTAATTCCTGCAGCAAGAAC | R | NA | (38) | Fire-bellied toads (Bombina) | (38) |
| COI-6 | Fire-bellied toads | 5,840 | GCAGGGGTGTCCTCAATTCTAG | F | NA | (38) | Fire-bellied toads (Bombina) | (38) |
| Cox | Australian Litoria frogs | 6,089 | TGATTCTTTGGGCATCCTGAAG | F | NA | (59) | Australian Litoria frogs | (59–61) |
| Coy | Australian Litoria frogs | 6,695 | GGGGTAGTCAGAATAGCGTCG | R | NA | (59) | Australian Litoria frogs | (59–61) |
| COI-smallF | Australian frogs (Litoria aurea) | 6,222 | TTGGCCTGCTAGGTTTTATTG | F | Step-down profile: 60, 58, 56, 54 | (37) | Australian frogs (Litoria aurea) | (37) |
| COI-smallR | Australian frogs (Litoria aurea) | 6,526 | CAAATACGGCCCCCATAGAT | R | Step-down profile: 60, 58, 56, 54 | (37) | Australian frogs (Litoria aurea) | (37) |
| KLPf | Australian frogs (Litoria) | 6,137 | AAAGAACCTTTTGGTTACATGGG | F | NA | (62) | Australian frogs (Litoria) | (62) |
| HmCO1 | South American hylid frogs (Dendropsophus minutus) | 5,162 | CGTCACTCAGTACCAAACCCCC | F | NA | (63) | South American hylid frogs (Dendropsophus minutus) | (63) |
| CO1AXen-H | South American hylid frogs (Dendropsophus minutus) | 6,707 | TGTATAAGCGTCTGGGTAGTC | R | NA | (63) | South American hylid frogs (Dendropsophus minutus) | (63) |
| CO1h-L | Toads (Bufonidae) | 5,908 | GGAATTATTTCCCAYGTWGTAAC | F | NA | (27) | Toads (Bufonidae) | (27) |
| CO1g-L | Toads (Bufonidae) | 6,176 | TTCATACGTGGTAACATTTTAGTCAAG | F | NA | (27) | Toads (Bufonidae) | (27) |

Position is given relative to the complete mitochondrial genome sequence of Discoglossus galganoi (GenBank accession number: AY585339). When multiple annealing temperatures are given, they refer to alternative temperatures used in different studies for the same primer or primer combination
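Several primers in Table 1 (for example VF1-d, COIa2, and MVZ_202) contain IUPAC ambiguity codes such as R, Y, and W, so each of these entries stands for a pool of oligonucleotides rather than a single sequence. The following Python sketch is not part of the original protocols; it only illustrates, under the standard IUPAC code definitions, how the number of variants in such a pool can be counted and how a reverse complement can be written out. The function names (degeneracy, expand, reverse_complement, is_variant) are hypothetical helpers introduced here for illustration.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of every IUPAC code (e.g. R = A/G pairs with Y = T/C).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def degeneracy(primer):
    """Number of distinct unambiguous sequences encoded by a primer."""
    return prod(len(IUPAC[base]) for base in primer.upper())


def expand(primer):
    """List every unambiguous sequence encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(primer):
    """Reverse complement, keeping ambiguity codes consistent."""
    return primer.upper().translate(_COMPLEMENT)[::-1]


def is_variant(primer, sequence):
    """True if an unambiguous sequence is one of the primer's variants."""
    primer, sequence = primer.upper(), sequence.upper()
    return len(primer) == len(sequence) and all(
        base in IUPAC[code] for code, base in zip(primer, sequence)
    )


if __name__ == "__main__":
    # Sequences copied from Table 1.
    primers = {
        "VF1-d": "TTCTCAACCAACCACAARGAYATYGG",
        "COIa2": "CCTGCYARYCCTARRAARTGTTGAGG",
        "LCO1490": "GGTCAACAAATCATAAAGATATTGG",
    }
    for name, seq in primers.items():
        print(name, len(seq), "nt,", degeneracy(seq), "variant(s)")
        print("  reverse complement:", reverse_complement(seq))
```

Counting variants in this way can help when comparing a degenerate primer with its non-degenerate counterpart; the sketch says nothing about annealing behaviour, which depends on the PCR conditions listed in the table.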

HCO2198

C1-J-1718

C1-J-2191

CO1a

CO1f

COIcXen

COIfXen

COIaXen

COIeXen

Universal

Universal

Universal

Vertebrata

Vertebrata

Vertebrata

Vertebrata

Vertebrata

Vertebrata

6,398

6,539

5,307

5,787

5,898

6,539

5,939

5,466

5,921

CCAGTAAATAAC GGGAATCAGTG

TGTATAAGCGTC TGGGTAGTC

CCTGCCGGAGG AGGTGACCC

TCGTTTGATCAG TATTAATCAC

CCTGCAGGAGGA GGAGAT(orY)CC

AGTATAAGCGTCT GGGTAGTC

CCCGGTAAAATTAAAA TATAAACTTC

GGAGGATTTGGAAA TTGATTAGTTCC

TAAACTTCAGGGT GACCAAAAAATCA

GGTCAACAAATCAT AAAGATATTGG

LCO1490

Universal

5,262

Primer name Position Sequence (5¢–3¢)

Specificity

R

R

F

F

F

R

R

F

R

F

Direction

47

47

47

47

45–58

45–58

42

42

42–45

42–45

Annealing temperature (°C)

(45)

(45)

(45)

(45)

(45)

(45)

(67)

(67)

(28)

(28)

Primer reference

Anolis

Anolis

Anolis

Anolis

Turtle, tortoise, iguana, skink, crocodile

Turtle, tortoise, iguana, skink, crocodile

Lizard

Lizard

Lizard, turtle, gecko

Lizard, turtle, gecko

Used for

(74)

(74)

(74)

(74)

(68–73)

(68–73)

(66)

(66)

(62–64)

(62–64)

Studies

Table 2 Primers used for amplifying COI (fragments) in phylogenetic or phylogeographic studies of reptiles with details on taxon specificity and PCR conditions

86 M. Vences et al.

FishF2_t1

FishR2_t1

FR1d_t1

M13F (221)

M13R (227)

VF1

VR1

RepCOI-F

RepCOI-R

M72

Vertebrata (COI-3 cocktail)

Vertebrata (COI-3 cocktail)

Vertebrata (COI-3 cocktail)

Universal (COI-3 cocktail)

Universal (COI-3 cocktail)

Vertebrata

Vertebrata

Squamata

Squamata

Testudines

5,946

5,921

5,256

5,921

5,262

NA

NA

5,918

5,918

5,265

TGATTCTTCGGTCACCCA GAAGTGTA

ACTTCTGGRTGKCC AAARAATCA

TNTTMTCAACNAACC ACAAAGA

TAGACTTCTGGGTGGCC AAAGAATCA

TTCTCAACCAACCACAAA GACATTGG

CAGGAAACAGCTATGAC

TGTAAAACGACGGCCAGT

[M13R]ACCTCAGGGT GTCCGAARAAYCARAA

[M13R]ACTTCAGGGT GACCGAAGAATCAGAA

[M13F]CGACTAATCAT AAAGATATCGGCAC

[M13F]CAACCAACCAC AAAGACATTGGCAC

VF2_t1

Vertebrata (COI-3 cocktail)

5,265

Primer name Position Sequence (5′–3′)

Specificity

F

R

F

R

F

R

F

R

R

F

F

Direction

48 or 55

48.5

48.5

52

NA

(69)

(77)

(77)

(55)

(55)

51.1 5×, then 56.9 (54) 30×

51.1 5×, then 56.9 (54) 30×

51.1 5×, then 56.9 (54) 30×

51.1 5×, then 56.9 (54) 30×

Side-necked turtle

Squamata

Squamata

Boelen’s python, watersnake

Boelen’s python

Crocodile

Crocodile

Crocodile

Crocodile

Crocodile

51.1 5×, then 56.9 (54) 30×

Used for Crocodile

Primer reference

51.1 5×, then 56.9 (54) 30×

Annealing temperature (°C)

(continued)

(69)

(77)

(77)

(75, 76)

(76)

(5)

(5)

(5)

(5)

(5)

(5)

Studies


L-330COI

H-610COI

H-715COI

L-turtCOIc

H-turtCOIc

L-turtCOI

H-turtCOI

H-turtCOIb

L-COIint

H-COIint

Testudines

Testudines

Testudines

Testudines

Testudines

Testudines

Testudines

Testudines

Testudines

Testudines

5,634

5,792

6,119

6,059

5,968

6,066

5,234

5,946

5,843

5,564

TAGTTAGGTCTACAG AGGCGC

TGATCAGTACTTATCAC AGCCG

GTTGCAGATGTAAAA TAGGCTCG

CCCATACGATGAA GCCTAAGAA

ACTCAGCCATCTTA CCTGTGATT

TGGTGGGCTCATAC AATAAAGC

TACCTGTGATTTTAA CCCGTTGAT

GCCAAATCCTGGTAA GATTAAGAT

GTATTTAGGTTTCGGT CAGTGAG

TACTTTTACTCCTAGCC TCCTCAG

CCTATTGATAGGACGTA GTGGAAGTG

M73

Testudines

6,342

Primer name Position Sequence (5′–3′)

Specificity

Table 2 (continued)

R

F

R

R

F

R

F

R

R

F

R

Direction

(79)

(79)

(79)

(79)

(79)

(78)

(78)

(78)

(69)

Primer reference

For sequencing only (79)

For sequencing only (79)

56–58

56–58

56–58

56

56

50–54

50–54

50–54

48 or 55

Annealing temperature (°C)

(78)

(78)

(79)

(78)

(78)

(78)

(78)

(78)

(69)

Studies

Yunnan box turtle

(78)

Yunnan box turtle, (74, 78) anoles

Yunnan box turtle

Yunnan box turtle

Turtle

Yunnan box turtle

Yunnan box turtle

Yunnan box turtle

Yunnan box turtle

Yunnan box turtle

Side-necked turtle

Used for


CoxIH2

COIf-ot1

COIr-ot2

COIf-ot2

COIr-ot1

L7354

H7794

rTrp–1L

rCOI−1H

LCOI5973

HCOI6576

LCOI5982

HCOI6570

NA

Crocodylia

Crocodylia

Crocodylia

Crocodylia

Crocodylia

Squamata

Squamata

Squamata

Squamata

Squamata

Squamata

Squamata

Squamata

Serpentes

5,222

5,864

5,317

5,921

5,262

6,332

4,879

6,365

5,925

5,871

5,654

5,595

5,891

6,042

R

F

F

R

F

R

R

F

Direction

TCAGCCATACTACCTG TGTTCA

TGCTGGGTCGAAGAA GGTNGT

GGTATAACCGGAACA GCCCTNAGY

TAAACTTCAGGGTGA CCAAAAAATCA

GGTCAACAAATCATAAA GATATTGG

F

R

F

R

F

TAGTGGAARTGKGCTACTAC R

TAAACCARGRGCCTTCAAAG F

ATAATGGCAAATACTGCCCC

TACCAACACCTATTCTGATT

CGAAACYTAAACACTACCTT

CAGCAAGATGAAGGG AGAAGAT

CGCCGGTACAGGATGAAC

TTGGTATAGRATTGGA TCYCC

CCTAAGAAGCCAATTG ATATTATGC

GGCTACTGCCACTAA TAATCGC

CoxIL2

Crocodylia

5,478

Primer name Position Sequence (5′–3′)

Specificity

52

50

50

46–50

46–50

48 5×, 58 35×

48 5×, 58 35×

47–55

47–55

NA

(65)

(65)

(65)

(65)

(15)

(15)

(68)

(68)

50–46 touchdown (6)

50–46 touchdown (6)

50–46 touchdown (6)

Snake

Gecko

Gecko

Gecko

Gecko

Gecko, Komodo dragon

Gecko, Komodo dragon

Iguana, lizard

Iguana, lizard

Dwarf crocodile

Dwarf crocodile

Dwarf crocodile

Dwarf crocodile

50–46 touchdown (6)

Dwarf crocodile

Used for

Dwarf crocodile

(6)

Primer reference

(6)

50

50

Annealing temperature (°C)


(continued)

(75)

(65)

(65)

(65)

(65)

(15, 81)

(15, 81)

(68, 80)

(68, 80)

(6)

(6)

(6)

(6)

(6)

(6)

Studies


COI(−)bdeg

COI(+)b

Serpentes

Serpentes

TAAATAATATAAGCTTCT GACTGCTACCACC

ATTATTGTTGCYGCT GTRAARTAGGCTCG F

R

F

Direction

56.5

56.5–65

56.5–65

Annealing temperature (°C)

(83)

(82)

(82)

Primer reference

Snake

Snake

Snake

Used for

(83)

(82)

(82)

Studies

Position is given relative to the complete mitochondrial genome of Furcifer oustaleti (GenBank accession number: NC_008777). When multiple annealing temperatures are given it refers to alternative temperatures used in different studies for the same primer or primer combination

5,535

6,119

AAGCTTCTGACTNCTA CCACCNGC

COI(+)deg1

Serpentes

5,538

Primer name Position Sequence (5′–3′)

Specificity

Table 2 (continued)


[Figure 1 graphic: primer positions plotted along an axis running from 5500 to 6500, with the « Folmer region » marked. Group labels: Anura, Urodela, Vertebrata, universal. Primer labels: HmCO1/CO1AXen-H; Cox/KLPf/COI-smallF/COI-smallR/Coy; COI-1/COI-2&COI-3/COI-6/COI-4/COI-5; CO1g-L/CO1h-L; PP6&PP7/PP8&PP9; MVZ_201/MVZ_202; Desmognathus-forward/-reverse; BirdF1/BirdR1/BirdR2; LepF1/LepRI; VF1-d/FR1d-t1&VR1-d; FishF2-t1&VF2-t1/FishR2-t1; LCO1490/HCO2198; COIf/COIa2/COIa.]

Fig. 1. Some primers used to amplify COI in amphibians sorted according to their specificity (for details, see Table 1). Black triangles represent forward, empty squares reverse primers, respectively. Numbers on the axis refer to the position on the complete mitochondrial genome of Discoglossus galganoi (GenBank accession number: AY585339).

however, are straightforward and similar to those established in other vertebrates (see Note 2). We also provide an overview of selected primers that have thus far been used to amplify COI from amphibians and reptiles, and which should be helpful to design amplification strategies in future DNA barcoding studies targeting these animals. To obtain this compilation of primers, we focused on studies where the COI gene as a molecular genetic marker was targeted, in particular, the standard animal barcoding region, the so-called Folmer region (28).

2. Materials

2.1. DNA Extraction and Preservation (See Note 3)

1. For routine DNA barcoding, we recommend a salt extraction protocol.
2. Extraction buffer: 0.01 M Tris–HCl (pH 8.0), 0.1 M NaCl, 0.01 M EDTA (pH 8.0) in dH2O.


[Figure graphic: primer labels grouped by specificity. Group labels: Serpentes, Squamata, Testudines, Crocodylia, Reptiles, Vertebrata. Primer labels: COI(+)b/COI(+)deg1/COI(-)bdeg; LCOI5973/LCOI5982/HCOI6570/HCOI6576; rTrp–1L/rCOI–1H; L7354/H7794; CoxIL2/COIr-ot2/COIf-ot2/COIr-ot1/COIf-ot1/CoxIH2; L-turtCOIc/H-COIint/L-COIint/L-turtCOI/H-turtCOI/H-turtCOIc/H-turtCOIb; L-330COI/H-610COI/H-715COI; M72/M73; RepCOI-F/RepCOI-R; VF2(t1)/FR1d(t1); FishF2(t1)/FishR2(t1).]

200 g NaCl. Stir solution while adding salt; continue adding salt until no more goes into solution and it begins to collect on the bottom of the mixing vessel.

2. 95% Ethanol (see Note 1).

2.3. DNA Extraction: Automated and Manual Extractions

1. Autogen Prep 965 DNA extraction: Autogen 965 robot, and kit buffers including M1, M2, R3, R4, R5, R6, R7, R8, R9, and proteinase-K for use with animal extractions; 96-well deep well plates (Costar #3960).
2. Automated extractions using Qiagen BioSprint96: BioSprint robot, and BioSprint 96 DNA Blood Kit (940057), Buffer ATL (Qiagen 19076), and proteinase-K.
3. For both automated extraction methods: AxyMat silicone lids (Axygen) for 96-well digestion blocks, plexiglass (or other firm solid material) rectangles cut to fit the tops of the 96-well blocks, and 0.2% Tween® 20: 200 μl Tween® 20 in 100 ml H2O.
4. Manual extraction lysis buffer: 100 mM EDTA, 25 mM Tris pH 7.5, and 1% SDS.
5. 100 mg/ml proteinase-K dissolved in water.
6. Phenol, equilibrated to pH 7.5 with Tris–HCl pH 8.0.
7. Chloroform:isoamyl alcohol (24:1 ratio).
8. TE solution: 10 mM Tris, 1 mM EDTA pH 8.0; equilibrated to pH 7.6 with Tris–HCl pH 7.0.
9. Incubator or incubator shaker for tissue digestion, capable of maintaining a temperature of >50°C.


L.A. Weigt et al.

2.4. Polymerase Chain Reaction: Amplification and Purification

1. 10 mM deoxynucleotide (dNTP) mix.
2. 100 μM oligonucleotide primers (IDT Technologies, USA).
3. Biolase Taq DNA Polymerase (BioLine).
4. 10× PCR Buffer for Bioline Taq.
5. 50 mM magnesium chloride.
6. Liquidator 96-channel benchtop pipette (Rainin).
7. ExoSAP-IT (USB 78201) for purification.
8. 96-Well (0.2 ml volume) plastic PCR plates (Genemate T-3060-1).
9. Silicone plate mat (lid) for PCR plates (Genemate T-3161-1).

2.5. Polymerase Chain Reaction: Visualization

1. Agarose.
2. 1× TBE buffer: 0.089 M Tris base, 0.089 M boric acid, 0.002 M Na-EDTA; prepare by mixing 108 g Tris, 55 g boric acid, and 7.4 g Na-EDTA in a beaker together with 400 ml water, mix until dissolved, and add deionized water to 10 L.
3. Sample loading dye: 0.083% bromophenol blue, 0.083% xylene cyanol, and 10% glycerol.
4. DNA stain: Ethidium bromide (10 mg/ml) or SYBR SAFE (Invitrogen).
5. Optional: DNA size standard (“ladder”; Hi Lo DNA marker, Minnesota Molecular, Inc.).
6. Electrophoresis rig and power supply.
7. Gel imaging system/camera for use over UV light box.
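The TBE recipe can be sanity-checked by converting the weighed masses into molar concentrations at the 10 L final volume. The sketch below is only an illustration: the molar masses are standard values, and the EDTA figure assumes the disodium salt dihydrate, which the recipe does not specify.

```python
# Molar masses in g/mol. The EDTA value assumes disodium EDTA dihydrate
# (an assumption; the recipe only says "Na-EDTA").
MOLAR_MASS = {
    "tris_base": 121.14,
    "boric_acid": 61.83,
    "na2_edta_dihydrate": 372.24,
}

# Masses from the recipe, in grams, made up to 10 L.
RECIPE_MASS_G = {
    "tris_base": 108.0,
    "boric_acid": 55.0,
    "na2_edta_dihydrate": 7.4,
}


def molarity(mass_g, molar_mass, volume_l):
    """Return the concentration in mol/L for a mass dissolved in volume_l."""
    return mass_g / molar_mass / volume_l


def tbe_concentrations(volume_l=10.0):
    """Return the molarity of each TBE component at the given final volume."""
    return {
        name: molarity(mass, MOLAR_MASS[name], volume_l)
        for name, mass in RECIPE_MASS_G.items()
    }


if __name__ == "__main__":
    for name, conc in tbe_concentrations().items():
        print(f"{name}: {conc:.4f} M")
```

Under these assumptions the masses work out to roughly 0.089 M Tris, 0.089 M boric acid, and 0.002 M EDTA in the 10 L working solution.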

2.6. Sanger Sequencing Components: BigDye Reactions

1. 5× Sequencing Buffer: 400 mM Tris–HCl pH 9.0, and 10 mM MgCl2.
2. Oligonucleotide primers: dissolved in water to 10 μM (Table 1 lists all primers used for fish).
3. BigDye® Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems).
4. 96-Well (0.2 ml volume) plastic PCR plates (Genemate T-3060-1).
5. Silicone plate mat (lid) for PCR plates (Genemate T-3161-1).

2.7. Sephadex Purification of Cycle-Sequencing Products

1. Sephadex® G50 (Sigma).
2. Hi-Di™ formamide (Applied Biosystems).
3. Multiscreen® HTS filter plates (Millipore MSHVN4550).
4. Multiscreen column loader (Millipore MACL09645).
5. Liquidator 96-channel manual pipette (Rainin).
6. 96-Well (0.2 ml volume) semi-skirted plastic PCR plates (Genemate T-3085-1).
7. Septaseal rubber mats (ABI #4315933).


DNA Barcoding Fishes


Table 1 Primer table for fish PCR (and M13 sequencing primers) (from refs. 10, 22, 23)

Barcode primer name    Barcode primer sequence 5′ → 3′
FISHCO1LBC             TCAACYAATCAYAAAGATATYGGCAC
FISHCO1HBC             ACTTCYGGGTGRCCRAARAATCA
FISHCO1LBCm13F         CACGACGTTGTAAAACGACTCAACYAATCAYAAAGATATYGGCAC
FISHCO1HBCm13R         GGATAACAATTTCACACAGGACTTCYGGGTGRCCRAARAATCA
16SAR                  CGCCTGTTTATCAAAAACAT
16SBR                  CCGGTCTGAACTCAGATCACGT
m13F                   CACGACGTTGTAAAACGAC
m13R                   GGATAACAATTTCACACAGG
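The primer table can be checked programmatically. The sketch below, which is an illustration rather than part of the published protocol, confirms that each M13-tailed primer is the M13 tail followed by the corresponding barcode primer, and counts how many distinct oligonucleotides each degenerate primer represents. The IUPAC code table is standard; the function names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Sequences copied from Table 1 (5' to 3').
PRIMERS = {
    "FISHCO1LBC": "TCAACYAATCAYAAAGATATYGGCAC",
    "FISHCO1HBC": "ACTTCYGGGTGRCCRAARAATCA",
    "FISHCO1LBCm13F": "CACGACGTTGTAAAACGACTCAACYAATCAYAAAGATATYGGCAC",
    "FISHCO1HBCm13R": "GGATAACAATTTCACACAGGACTTCYGGGTGRCCRAARAATCA",
    "m13F": "CACGACGTTGTAAAACGAC",
    "m13R": "GGATAACAATTTCACACAGG",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


def expand(primer):
    """List every non-degenerate sequence encoded by the primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


def tailed(tail_name, primer_name):
    """Concatenate an M13 tail and a barcode primer, both 5' to 3'."""
    return PRIMERS[tail_name] + PRIMERS[primer_name]


if __name__ == "__main__":
    print(tailed("m13F", "FISHCO1LBC") == PRIMERS["FISHCO1LBCm13F"])
    print(degeneracy(PRIMERS["FISHCO1LBC"]), degeneracy(PRIMERS["FISHCO1HBC"]))
```

A check like this is mainly useful for catching transcription errors before ordering oligonucleotides.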

2.8. Genetic Analyzer Components

1. ABI 3130XL genetic analyzer: Polymer POP-7; 36-cm capillary array run using the ABI Template Protocol “RapidSeq36_POP7” with a run time of 2,280 s.
2. ABI 3730XL genetic analyzer: Polymer POP-7; 50-cm capillary array run using the ABI Template Protocol “LongSeq50_POP7” with a run time of 4,000 s.

2.9. Data Processing and Quality Control

1. Sequencher vers. 4.10.1 (Gene Codes).
2. Geneious (BioMatters); use of the Geneious program and the BioCode plugin is discussed elsewhere in this volume.

3. Methods

3.1. Tissue Sampling

1. Photography processing: orient the fish with its left side facing the camera and its head to the left (see Note 2).
2. Tissues to sample (in order of decreasing desirability): muscle biopsy from right side, right eye, portion of right pectoral fin, other fin clip, gill tissue, swabs, and scales.
3. Muscle biopsy: take from the right side, caudal region, dorsal to the lateral line; avoid heavily parasitized areas (e.g., gills and guts) and areas of important morphological characters (e.g., fins and lateral line area). For larvae and small specimens, it may be necessary to destructively sample a portion of the specimen, and consultation with taxonomists is advised so as to avoid critical morphological regions (e.g., the suction disk on clingfish).
4. Clean all tools prior to touching the specimen using bleach solution, flame sterilization, etc. Scrape off scales from the area to be sampled and carefully dissect out a small portion of muscle. The amount of tissue to sample depends on the size of the specimen and storage vessel (do not exceed a tissue:buffer ratio of 1:4 if possible) (see Note 3).
5. From EtOH-preserved and -stored specimens: DNA leached from the specimen can be extracted from the alcohol in the storage container (8). (After distillation, the used ethanol can be recycled without risk of contamination (7).) However, increased yields will be obtained via more substantial yet destructive tissue biopsy (as above in Subheading 3.1, step 2).

3.2. Tissue Storage

1. BioBanking: One of the significant contributions of the DNA barcoding enterprise is a repository of genetic materials. These are tied to voucher specimens in public collections and the identity and integrity of the specimens have been validated genetically. These materials can then serve as a starting point for subsequent molecular investigations. Therefore, it is important to maximize the utility of all collected materials.
2. Frozen storage: Freezing tissues (−20°C or lower) is the recommended method of preservation to maximize potential future uses of the material. Vapor-phase liquid nitrogen is ideal. Frequently this is not feasible, particularly in the field, so alternatives are presented.
3. Salt/DMSO buffer storage: Transportation of ethanol and other flammables has become an issue, and salt/DMSO buffer is an option in those cases. Place small tissue chunks in the buffer, taking care not to overwhelm the buffer with too much tissue; a good ratio of tissue:buffer is 1:4.
4. Ethanol (95%) storage: 70% ethanol should be avoided (see Note 1).

3.3. DNA Extraction: Autogen Prep 965

1. Prepare fresh lysis solution for every run by dissolving appropriate aliquots of proteinase-K provided in the kit with each aliquot of Reagent M1. The standard concentration of proteinase-K in M1 lysis buffer for overnight digestion of animal tissue is 0.4 mg/ml.
2. For tissue lysis, place tissue into the appropriate well of a 96 deep-well plate (Costar #3960), and add 150 μl of Reagent M2 and 150 μl of Reagent M1 containing the predissolved proteinase-K at a concentration of 0.4–1.0 mg/ml. Cover the plate with a silicone mat (Axygen) and one or more plexiglass plates cut to fit the block to minimize evaporation of buffer and prevent contamination between wells. Tape the silicone mat and plexiglass plates firmly to the block.
3. Incubate the samples overnight at 56°C with shaking.
4. Spin the plates briefly to remove condensed droplets from the lid. Load lysis plates on the AutoGenprep 965, with an equal number of output plates for DNA and tips following manufacturer’s instructions. A maximum of four 96-well plates can be run simultaneously on the machine (see Note 4).
5. Load Reagents R3, R4, R5/R6/R7, R8, and R9 into the appropriate reservoirs following manufacturer’s instructions.
6. The standard resuspension volume is 0.05 ml of buffer R9. If a large quantity of DNA is expected, such as when extracting vertebrate tissues or large amounts of other tissue, we change this volume to 0.1 ml.
7. Run the protocol on the instrument.
8. Upon completion, using a 96-well pipettor (if available), portion the DNA extracts into the desired amounts for the working and archival stocks for BioBanking.

3.4. DNA Extraction: Qiagen Biosprint Magnetic Bead Protocol (96-Well Plate Protocol)

1. We follow the Qiagen protocol for Biosprint extractions. Following are our deviations from the published protocol and our observations on particular steps.
2. Before beginning, check buffer ATL for white precipitate and take steps to resuspend it (see Note 6).
3. The MagAttract particles settle out of solution very quickly. Before adding this mix to the master mix, vortex at high speed for 3 min. Use immediately. Vortex again if much time has elapsed (>2–3 min).
4. Prepare the master mix (per sample): Buffer AL = 100 μl; isopropanol = 100 μl; MagAttract Suspension G = 15 μl. Prepare 10% more master mix than is required for the total number of sample purifications to be performed.
5. Cut 5–25 mg of each tissue into small pieces and place in a 96-well S-Block. Add buffer ATL and proteinase-K.
6. Seal the plate following the same method as the Autogen lysis above (Subheading 3.3, step 2).
7. Place the sealed plate in an incubator/shaker and digest overnight at 56°C (see Note 7).
8. Following lysis with ATL, briefly centrifuge the S-Block containing the samples to remove drops from underneath the lid.
9. Vortex the master mix containing Buffer AL, isopropanol, and MagAttract Suspension G (see Note 6) for at least 1 min. Add 215 μl of this master mix to each sample in the S-Block.
10. Place blocks on the instrument and run the protocol.
11. Upon completion, using a 96-well pipettor (if available), portion the DNA extracts into the desired amounts for the working and archival stocks for the genetic repository (see Note 8).
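The master mix volumes in step 4 scale with the number of samples plus the 10% overage. A minimal sketch of that arithmetic follows; it assumes the listed volumes are microliters per sample (they sum to the 215 added per well in step 9), and the function name is hypothetical.

```python
# Per-sample volumes in microliters, as listed in step 4.
PER_SAMPLE_UL = {
    "Buffer AL": 100.0,
    "isopropanol": 100.0,
    "MagAttract Suspension G": 15.0,
}


def master_mix(n_samples, overage=0.10):
    """Return total volume (ul) of each component for n_samples plus overage."""
    factor = n_samples * (1.0 + overage)
    return {name: round(vol * factor, 1) for name, vol in PER_SAMPLE_UL.items()}


if __name__ == "__main__":
    for name, vol in master_mix(96).items():
        print(f"{name}: {vol} ul")
```

For a full 96-well plate this gives about 10.56 ml each of Buffer AL and isopropanol and about 1.58 ml of MagAttract suspension.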

3.5. Manual Extraction: Phenol:Chloroform Protocol, Following Ref. 20

1. It is typically easiest to carry out the extraction in 1.7–2-ml centrifuge tubes.
2. For lysis, prepare fresh stock of lysis buffer (from Subheading 2.3, item 4) and add proteinase-K to 1 mg/ml.
3. Combine 1 ml lysis buffer with ~1 cm² of tissue.
4. Incubate with shaking overnight at 56°C.
5. Add no more than 700 μl of the lysed sample to the extraction tube. If a large amount of tissue is being extracted, it may be best to dilute the extraction with more extraction buffer, divide the sample into more than one tube, and extract each separately (see Note 5).
6. Phenol extraction: Add an equal volume of phenol to the lysed sample, and vortex vigorously to mix the phases. Spin in a microcentrifuge at top speed for 1–2 min to separate the phases. Remove the aqueous phase (top layer) to a new tube, being careful to avoid the phase interface.
7. Repeat the phenol extraction two more times.
8. Chloroform:isoamyl extraction: As in Subheading 3.5, step 6, use an equal volume of chloroform:isoamyl alcohol (instead of phenol) to remove any trace phenol. Repeat once more.
9. Precipitate the DNA with an equal volume of isopropanol, and incubate at −20°C for 1 h or longer.
10. Spin the sample with alcohol in a microcentrifuge for 10 min at maximum speed to pellet the DNA.
11. Before resuspension, set the tube on the counter at room temperature (RT), covered with a tissue, for 10 min or until all residual ethanol has evaporated.
12. Resuspend the DNA pellet in a volume (usually, 50–200 μl) of TE (or a 1/10 dilution of TE) to achieve the desired concentration. Vortex to mix.

3.6. PCR Methods: Amplification

1. Thaw and prepare reagents in proper concentration for the PCR reaction. Wait for each reagent to thaw completely and then mix thoroughly. Dilute primers to a 10 μM working stock for the PCR reaction. If starting with lyophilized primers, spin down before opening the tube and resuspend to 100 μM in molecular-grade water to form the stock solution. Dilute this stock 1:10 to make the working stock.
2. Mix all reagents in the volumes listed in the PCR recipe (Table 2) to form the master mix. Keep your reaction plate (or tubes) and master mix on ice. Vortex the master mix vigorously or pipette up and down to mix well (vortexing will cause liquid to be trapped on the cap of the tube, so follow with a 15-s spin in a mini-centrifuge).
3. Aliquot 9 μl of master mix into each well of the 96-well plate (Genemate).
4. Aliquot 1 μl of each DNA template (undiluted) to each well. Change tips between samples.


Table 2 PCR reaction cocktail

PCR reagents             Each well (μl)    96-Well plate (μl)
ddH2O                    6.4               640
10 mM dNTPs              0.5               50
10× buffer               1                 100
50 mM MgCl2              0.4               40
10 μM primer F           0.3               30
10 μM primer R           0.3               30
Bioline Taq (5 U/μl)     0.1               10
Total                    9                 900
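The cocktail in Table 2 can be checked and rescaled with a few lines of arithmetic. The sketch below is illustrative only. It assumes the table volumes are microliters, that the primer working stocks are 10 μM, and that the final reaction volume is 10 μl (9 μl of master mix plus 1 μl of template, as in the amplification steps). The function names are hypothetical.

```python
# Per-well volumes in microliters, read from Table 2.
PCR_RECIPE_UL = {
    "ddH2O": 6.4,
    "10 mM dNTPs": 0.5,
    "10x buffer": 1.0,
    "50 mM MgCl2": 0.4,
    "10 uM primer F": 0.3,
    "10 uM primer R": 0.3,
    "Taq (5 U/ul)": 0.1,
}


def scale_recipe(recipe, n_reactions):
    """Multiply each per-well volume by the number of reactions."""
    return {name: round(vol * n_reactions, 2) for name, vol in recipe.items()}


def final_concentration(stock_conc, stock_volume_ul, reaction_volume_ul):
    """C1*V1 = C2*V2: concentration after dilution into the reaction."""
    return stock_conc * stock_volume_ul / reaction_volume_ul


if __name__ == "__main__":
    print("master mix per well (ul):", round(sum(PCR_RECIPE_UL.values()), 2))
    # The plate column of Table 2 is scaled for 100 reactions, not 96,
    # which leaves a small pipetting reserve.
    print(scale_recipe(PCR_RECIPE_UL, 100))
    print("primer (uM):", final_concentration(10.0, 0.3, 10.0))
    print("MgCl2 (mM):", final_concentration(50.0, 0.4, 10.0))
    print("dNTPs (mM):", final_concentration(10.0, 0.5, 10.0))
```

Under these assumptions each 10 μl reaction contains 0.3 μM of each primer, 2 mM added MgCl2, 0.5 mM dNTPs, and 0.5 U of Taq.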

5. Add 1 μl nuclease-free water to well H12 to function as a negative PCR control.
6. Place the silicone plate mat over the top of the 96-well plate and secure it with a roller. Centrifuge the plate in a plate centrifuge or plate spinner for 10–15 s at approximately 3,950 rcf (referred to below as “centrifuge briefly”).
7. Place the plate in a thermal cycler block and run PCR with the following cycling parameters: 95°C for 5 min, 35 cycles of 95°C for 30 s, 52°C for 30 s, and 72°C for 45 s, with a final extension at 72°C for 5 min and hold indefinitely at 10°C (see Notes 9–11).

3.7. PCR Methods: Visualization via Agarose Gel Electrophoresis

1. Cast a 1.5% agarose gel for each PCR reaction plate (e.g., add 0.75 g agarose to 50 ml 1× TBE buffer and boil in the microwave until the agarose is dissolved) (see Note 12).
2. Cool the solution, then add 1 μl ethidium bromide (10 mg/ml) per 50 ml of agarose solution, mix well, and pour immediately into the casting tray.
3. Let the gel set for approximately 30 min or until firm. Remove combs from the gel and place it into an electrophoresis rig filled with 1× TBE buffer.
4. Add 2 μl 2× loading dye to each well of an empty 96-well plate.
5. Add 2 μl of each PCR product to the loading dye plate. Use new tips for each transfer.
6. Mix PCR product and dye by pipetting up and down, and then load 4 μl of the mixture to each well of the gel (see Note 12).


Table 3 ExoSAP-IT purification of PCR product

Reagents      Each well (μl)    96-Well plate (μl)
ddH2O         1.5               150
ExoSAP-IT     0.5               50
Total         2                 200

7. Run the gel at 100 V for approximately 12 min or until the bromophenol blue and xylene cyanol dyes in the loading dye are clearly separated.
8. At the end of the run, remove the gel from the rig and place on a UV transilluminator. Use a gel-imaging system to capture a digital image of the gel.

3.8. PCR Purifications: EXOSAP-IT

1. We use a fourfold dilution of the ExoSAP-IT mix. In a 1.7-ml microcentrifuge tube, mix nuclease-free water and ExoSAP-IT in the volumes listed in Table 3. Keep the enzyme mix on ice or a cold block at all times (see Note 13).
2. Vortex the diluted mix vigorously or pipette up and down to mix well.
3. Centrifuge the plate briefly to bring down all condensation from the sides of the wells and lid to the bottom.
4. Aliquot 2 μl of ExoSAP-IT mix to each well of the 96-well plate.
5. Place the silicone plate mat back on the top of the 96-well plate, press it down with a roller, and centrifuge briefly.
6. Place the plate in a thermal cycler block and run with the following parameters: 37°C for 30 min and 80°C for 20 min, and hold at 10°C.

3.9. Sanger Cycle Sequencing Protocol

1. Thaw and prepare reagents in proper concentration for the cycle sequencing reaction. Wait for each reagent to thaw completely and then mix them thoroughly. Keep the BigDye on ice and in the dark, since it is both light and temperature sensitive.
2. Mix all four reagents in the volumes listed in Table 4. Create two master mixes, one containing the forward primer and one containing the reverse (see Note 14).
3. Vortex the mixes vigorously or pipette up and down to mix. Centrifuge briefly.
4. Aliquot 9 μl of master mix into each well of the 96-well plate, keeping both plate and master mix on ice.


Table 4 Cycle sequencing reaction recipe

Reagents               Each well (μl)    96-Well plate (μl)
ddH2O                  6.25              625
5× SEQ buffer          1.75              175
BigDye                 0.5               50
10 μM primer F or R    0.5               50
Total                  9                 900

5. Aliquot 1 μl of purified PCR product into each well of the 96-well plate. Change tips after each sample.
6. Place a silicone plate mat on the 96-well plate, press it down with a roller, and centrifuge briefly.
7. Place the plates in a thermal cycler and run the cycle sequencing program with the following parameters: 30 cycles of 95°C for 30 s, 50°C for 30 s, and 60°C for 4 min, and hold at 10°C.
8. After the program has finished, store the plates at 4°C in a dark refrigerator until purification (see Note 14).

3.10. Sephadex Purification of Sanger Sequence Reactions

1. Measure dry Sephadex G50 using the multiscreen black column loader (predrilled uniform holes which measure and deliver the correct amount of Sephadex) into a Multiscreen® HTS filter plate for each 96-well cycle sequencing reaction plate to be cleaned.
2. Add 300 μl molecular-grade water to the wells and allow to sit at room temperature for at least 2 h in order to completely hydrate the Sephadex matrix.
3. Place the filter plate on top of a 96-well PCR plate (the “catch” plate) and tape together with laboratory tape.
4. Centrifuge at 750 × g for 5 min to drain the excess water from the wells. Discard the water. The catch plate can be used again, but ONLY for this purpose, not PCR.
5. Add the entire volume of the sequencing reaction to the center of the Sephadex columns (not along the side walls), taking care not to touch the column surface or destroy the integrity of the column.
6. Attach a new 96-well PCR plate to the bottom of each Sephadex plate and secure with lab tape. Make sure that the orientation of the plates is the same: the A1 well of the Sephadex plate is over the A1 well of the catch plate.


7. Centrifuge at 750 × g for 5 min to elute the cleaned sequencing product into the catch plate.
8. Dry the cleaned sequencing products on a heat block at 90°C for 10–15 min or in a Sorvall Speedvac. Cover the plate with a Septa Seal mat and store at −20°C (see Note 15).
9. To prep for running on the genetic analyzer, add 10 μl Hi-Di™ Formamide to each well (under a fume hood).
10. Denature the DNA by heating the plate at 97°C for 3 min.
11. Cool the plate at 4°C for 3 min.
12. Load the plates into the Genetic Analyzer.

3.11. Genetic Analyzer Methods

1. ABI 3130XL: Run a 36-cm capillary array using the ABI Template Protocol “RapidSeq36_POP7” with a run time of 2,280 s.
2. ABI 3730XL: Run a 50-cm capillary array using the ABI Template Protocol “LongSeq50_POP7” with a run time of 4,000 s (see Note 16).

3.12. Data Processing and Quality Control Methods

1. Production of the final DNA barcode sequence from the raw sequencer output (the “traces”) involves several steps (forward and reverse traces for one set of 96 specimens are processed together).
2. The method we outline here uses Sequencher vers. 4.10.1 (Gene Codes, Corp). Alternatively, use of the Geneious program (BioMatters) and the BioCode plugin is discussed elsewhere in this volume.
3. Trace trimming: trimming is based on the phred quality scores of each base call (see Note 17). Trimming criteria, as implemented in Sequencher, are as follows: trim from the 5′ and 3′ ends until the first (or last) 20 bases contain fewer than 3 bases with a phred score 20 for unidirectional and bidirectional portions of the consensus).
11. Export consensus sequence for downstream analysis.
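The end-trimming rule in step 3 can be sketched as a sliding-window check on phred scores. This is not Sequencher's implementation. It is a minimal illustration that assumes the truncated criterion means "fewer than 3 bases with a phred score below 20 in the terminal 20 bases"; the function names and default parameters are hypothetical. The relation between a phred score Q and the base-calling error probability, P = 10^(−Q/10), is standard.

```python
def error_probability(phred):
    """Convert a phred quality score to a base-calling error probability."""
    return 10 ** (-phred / 10)


def trim_bounds(quals, window=20, max_low=3, threshold=20):
    """Return (start, end) such that quals[start:end] is the trimmed read.

    Bases are removed from each end until the terminal `window` bases
    contain fewer than `max_low` scores below `threshold` (assumed reading
    of the rule). Returns (0, 0) if no acceptable window exists.
    """
    def low_count(scores):
        return sum(q < threshold for q in scores)

    n = len(quals)
    start = 0
    # Slide the 5' window inward until it passes the quality rule.
    while start + window <= n and low_count(quals[start:start + window]) >= max_low:
        start += 1
    end = n
    # Slide the 3' window inward, never crossing the 5' boundary.
    while end - window >= start and low_count(quals[end - window:end]) >= max_low:
        end -= 1
    if end - start < window:
        return (0, 0)
    return (start, end)


if __name__ == "__main__":
    scores = [10] * 5 + [40] * 30 + [10] * 5
    print(trim_bounds(scores))
    print(error_probability(20))
```

Because the rule tolerates up to two low-quality bases per window, a few low-scoring bases can remain at each end of the trimmed read.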

4. Notes

1. Properly collected plant tissue is essential for maximizing PCR and sequencing success. Key to this process is that material from which DNA is extracted must be dried as quickly as possible to prevent the degradation of the DNA. Field collections


A.J. Fazekas et al.

of specimens must be immediately split into two components: (a) the voucher and (b) a portion of the voucher (typically, leaf tissue) which is placed in a container with silica gel or similar drying agent. It is important that the portion taken for DNA is put into silica gel as rapidly as possible after harvesting from the field. This should ideally be done immediately, but if impractical, it should be done no later than at the end of the collecting day. Delays to drying material in silica gel can result in samples with reduced DNA quality and lower PCR success.
2. It is important to keep the freshly collected tissue samples in separate containers. Pooling different samples into a single Ziploc bag, for example, increases the chances of cross contamination.
3. We describe three types of containers that we have used in various settings, each with relative advantages and disadvantages.
(a) Scintillation vials provide a separate enclosed environment for each sample. This can be useful in humid conditions, in which coin envelopes may absorb some moisture from the air, slowing the tissue drying process, or for tissue that has a high water content and dries more slowly inside a coin envelope than when in direct contact with the silica.
(b) Coin envelopes are probably the simplest medium for sampling plant tissue. It is easier to insert a sample into an envelope than into the narrow opening of a scintillation vial. Multiple coin envelopes can be stored in an airtight container with silica gel, requiring less space than scintillation vials. The envelopes also keep the silica gel separate from the tissue, facilitating tissue subsampling. Tea bags can also be used in place of envelopes; they are more porous, facilitating the drying process, but are also slightly more fragile.
(c) Small (~10 × 15 cm) plastic bags with silica can work well in the field, but are prone to punctures from thorns or prickles, and are somewhat permeable, which exhausts the silica over time. When the samples are dry, the plastic bags need to be handled carefully to prevent excessive breakage of the plant tissue.
4. In the case of specimens that are likely to take a long time to dry (such as samples with waxy leaves), tear the leaf sample into smaller fragments or chop with a sterile blade to increase the surface area available for contact with the silica gel.
5. The best samples for plant DNA extraction typically come from actively growing plant tissues; senescing, damaged, or infected tissues should be avoided. The usual choice of plant tissue is leaves, but shoot tips or flower buds or petals can also be used. For canopy tree species in which reaching leaf or flower material


DNA Barcoding Methods for Land Plants


is logistically challenging, an alternative approach is to use a leather punch to obtain samples of cambium tissue, which avoids the need for tree climbing (27).
6. Sampling herbarium material for DNA extraction can be successful, but success is often variable and unpredictable. The quality of the extraction is most likely a function of the age of the specimen, the species in question, and the speed with which samples have been dried, which is often unknowable. Priority should be given to samples not much more than 10 years old. However, the most critical criterion is that the samples should still be green in color. Brown coloration of the herbarium sample indicates that the tissue quickly oxidized after collection or was infected by mold, indicating that the DNA is most likely degraded and/or contaminated by fungal DNA.
7. Tissue samples from which DNA extractions are made should be prevented from rehydrating from the atmosphere. This can be achieved through a climate-controlled facility or in airtight containers (refresh the silica as necessary). Long-term experiments are still needed to provide empirical data on optimum storage procedures for tissue samples.
8. The sampling of silica-dried material into tubes for DNA extraction and the extraction process are probably the most important steps in the process of generating good-quality DNA barcode data. This is the stage at which contamination or sample mix-up occurs most easily. Thus, it is very important to follow the steps outlined in Subheading 3.3 to prevent this. A poor-quality extraction will result in inefficient or failed PCR reactions.
9. The appropriate amount of plant material to sample for DNA extraction is 10–15 mg dry tissue. In this case, more is not better; using more than this amount of tissue will result in a poorly ground sample, overwhelm the buffers used in the extraction process, and result in low-yield or poor-quality DNA.
This amount usually corresponds to ~0.5 cm2, but may be smaller depending on the leaf thickness. Plastic materials (such as sampling tubes) often have a static charge that will attract small particles of plant tissue. Fragments of plant material literally jump from one well to another, so care must be exercised when placing bits of leaves into the tubes. 10. Plant tissues that are linear in shape (e.g., grass leaves and stems, conifer needles) need to be broken into smaller pieces to achieve proper pulverization using the grinding beads. 11. When sampling plant tissue from herbarium samples in areas where an alcohol burner is prohibited, it is good practice to wipe the forceps after each sample with a kimwipe moistened with ethanol.

A.J. Fazekas et al.

12. It is important to keep freshly collected silica-dried material and older herbarium-sampled tissue in separate extraction plates, as they may require different extraction protocols.
13. A note on bryophytes: Extreme care is required when sampling bryophytes because mixed-species samples are commonly collected in the field. Tissue subsampling is best done at the same time as determinations are made.
14. A frequency higher than 28 Hz can destroy the tubes. We do not recommend homogenization for longer than a total of 1 min, except for samples with very tough tissue, in which case an additional run of 30 s can be applied.
15. After the plant tissue is ground to a fine powder, the tubes require careful handling. Centrifuging does not help significantly in removing powdered plant tissue from the lids or caps, as the static charge is strong enough to keep it adhered to the interior surface of the tubes’ walls and caps. Caps should be opened with extreme care to avoid cross contamination prior to addition of the lysis buffer.
16. In the non-kit-based protocols provided, the entire volume of the CTAB lysate is not used. Unused lysate can be stored at −20°C as a backup until the extraction is confirmed as successful by the results of the first PCR. The lysate can also be used as a source for additional extractions if more testing of the DNA is necessary.
17. The square-well blocks that are specified in the protocol have enough volume to collect all the wash buffers without needing to discard between washes. However, if a block with a smaller volume is used, it may be necessary to discard the wash buffer between steps 16 and 17 of Subheading 3.8.
18. Trehalose (which is also a potent PCR enhancer) acts as a cryoprotectant for Taq polymerase when PCR mixes are prepared in large-volume batches and frozen for future use.
19. Many available PCR protocols for matK include 4% DMSO. Experiments based on several hundred reactions have demonstrated that a 5% trehalose solution can replace DMSO without any significant difference in PCR success or sequence quality.
20. After DNA extraction, it is recommended to begin the first round of PCR for the rbcL DNA barcoding marker using the nearly universal primers rbcLa-F/rbcLa-R; a greater degree of PCR success and quality is obtained in bryophytes with the reverse primer rbcLajf634R. These primers generate a high rate of PCR success with DNA of good quality. Hence, this first PCR for rbcL acts as a test for DNA quality for a broad variety of taxa among angiosperms, gymnosperms, ferns, and mosses.

21. PCR cleanup is both expensive and time consuming, but can be avoided through use of low concentrations of primers and dNTPs in the PCR mix and subsequent dilution of the PCR product prior to the cycle sequencing reaction. This protocol provides a high success rate for PCR and sequencing for regions that are amplified by universal, highly conserved primers (plastid rbcL, trnH–psbA, and nuclear ribosomal ITS2). In contrast, the matK DNA barcoding region needs distinct conditions for successful PCR amplification. For matK, the concentration of the primers, dNTPs, and Taq polymerase cannot be significantly reduced. Based on experiments optimizing the PCR conditions for matK, we recommend a protocol with diluted DNA (0.3–0.5 ng/μl) and a smaller PCR reaction volume (7.5 μl). These conditions have yielded a higher rate of PCR success and increased sequence quality over the general PCR mix.
22. The volumes of the PCR and cycle sequencing reactions recommended here are very small. Thus, it is very important to follow the instructions in Subheadings 3.9 and 3.14 carefully. The foil or thermal-seal cover should be placed evenly and tightly over the PCR plate without wrinkles or holes to prevent evaporation during PCR cycling.
23. Centrifuging is required to collect the PCR components at the bottom of the well and eliminate any air bubbles that might have been trapped. It also aids in mixing the PCR components with the DNA sample, or the cycle sequencing mix with the PCR product.
24. Although rbcL is present in the vast majority of land plants, there are some groups, such as holoparasites, that no longer have a functioning copy of this gene. As a result, the most commonly used primers typically do not work in these groups.
25. The primers most widely used for PCR amplification of the plastid trnH–psbA intergenic spacer for DNA barcoding are those recommended by Kress et al. (7) or Kress and Erickson (10) (Table 4). They are, respectively, trnH2 (originally from ref. 28) and psbAF (originally from ref. 29) or trnH(GUG) and psbA (originally from ref. 30). In bryophytes, this region is often short ( [...]

[...] New Folder under the main toolbar. The easiest way to import your traces into Geneious is to download them directly from the LIMS database (if you have previously attached your trace files to a sequencing plate). If downloading from the LabBench, traces will have all associated metadata. If you import traces from disk, you will have to manually set parameters such as read direction (see Note 12) and choose “Annotate from FIMS/LIMS Data” to link in other metadata (see Note 13). To download your traces from the LIMS, select the sequencing plates you want to download in the Biocode plugin search, and

M. Parker et al.

Fig. 16. Workflow in the assembler module.

Laboratory Information Management Systems for DNA Barcoding

Fig. 17. An example of downloading two sequencing plates (one for each direction).

then select Biocode > Download Traces from LIMS (Fig. 17). Typically, you would choose one forward and one reverse sequencing plate for a set of sequences. You may select a single plate if both your forward and reverse reads are contained on it. Click “OK” and Geneious will ask you to choose a folder and then begin downloading your sequences. Once the download is complete, you will have all of the traces from the plates you entered; they will already have their read directions set and be annotated with the necessary data from the FIMS. If you know the names of your sequencing plates, you can download them directly without having to perform a search. Select the folder you want to download to in Geneious, then select Biocode > Download Traces from LIMS, and enter your plate names manually.

3.2.2. Batch Rename

If you want to change the names of your reads to reflect aspects of the FIMS data, select Edit > Batch Rename from the main toolbar to copy your choice of fields into the name column. This feature is also available for renaming assemblies.
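The effect of a batch rename can be pictured with a small sketch. This is not the Geneious implementation; the read records, the field names (`specimen_id`, `locus`, `direction`), and the underscore separator are all hypothetical and serve only to show how chosen metadata fields are copied into the name.

```python
def batch_rename(reads, fields, sep="_"):
    """Build a new name for each read from the chosen metadata fields."""
    renamed = []
    for read in reads:
        parts = [str(read["metadata"][f]) for f in fields if f in read["metadata"]]
        # Keep the existing name when none of the requested fields is present.
        renamed.append(sep.join(parts) if parts else read["name"])
    return renamed


# Hypothetical reads carrying metadata of the kind a FIMS might supply.
reads = [
    {"name": "A01.ab1", "metadata": {"specimen_id": "MBIO1234", "locus": "COI", "direction": "F"}},
    {"name": "A01_r.ab1", "metadata": {"specimen_id": "MBIO1234", "locus": "COI", "direction": "R"}},
]
print(batch_rename(reads, ["specimen_id", "locus", "direction"]))
```

Naming forward and reverse reads so that they differ only in a final, delimited part is also what makes it possible to pair them by name at the assembly step.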

3.2.3. Trimming

Geneious treats trimming as an annotation class so that information is not lost once a sequence is trimmed. The underlying raw data are maintained throughout downstream analyses for possible adjustment later in the pipeline. Assembly and other analyses automatically take the trims into account, and exclude these regions in all calculations. To trim sequences, highlight all of the sequences you are going to assemble, and from the main toolbar, select Sequence > Trim Ends. You can also add Trim Ends to the main toolbar by right clicking on the toolbar and turning on Trim Ends (Fig. 18). For most applications, the default of Error Probability Limit 0.05 is a good start. This option works by trimming the sequence to find the longest possible untrimmed region, which has an overall error probability [...]

[...] Preferences and go to the Sequencing tab. To set per-folder or per-document parameters, select the folder or documents you want to change and, again, from the main toolbar select Sequence > Set Binning Parameters. The most specific parameters are used in favor of less specific ones: per-document parameters are used over any per-folder or global parameters that are set.

To help in the detection of frameshifts, you can set the number of stop codons as an optional binning parameter. The number of stop codons is calculated for the specified genetic code, and is defined as the minimum count of stop codons in the consensus sequence for all frames (i.e., we check frames 1, 2, 3 in the forward and reverse direction, count the number of stop codons in each, and then take the frame with the minimum number of stops).
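The stop-codon count described above can be sketched in a few lines. This is an illustration of the definition, not the plugin's code: the function names are hypothetical, and the stop-codon sets shown are the standard genetic code (TAA, TAG, TGA) and, as an alternative, the invertebrate mitochondrial code (TAA, TAG), in which TGA encodes tryptophan.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
INVERTEBRATE_MITO_STOPS = frozenset({"TAA", "TAG"})  # TGA codes for Trp in this code

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def count_stops(seq, frame, stop_codons):
    """Count stop codons when reading seq from offset frame (0, 1, or 2)."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return sum(1 for codon in codons if codon in stop_codons)


def min_stop_codons(consensus, stop_codons=STANDARD_STOPS):
    """Minimum stop-codon count over the three forward and three reverse frames."""
    seq = consensus.upper()
    strands = (seq, reverse_complement(seq))
    return min(count_stops(s, f, stop_codons) for s in strands for f in range(3))


print(min_stop_codons("ATGGCCAAATTTGGG"))
```

A protein-coding barcode such as COI that has been read in the correct frame should give a minimum of zero; a value above zero in every frame suggests a frameshift or a non-target sequence, provided the stop-codon set matches the organism's genetic code.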

Fig. 19. The binning parameters dialog box.

As an example, when looking at assembly results, the bins could be used in the following way (if the parameters are set up as such).

● High = There is probably no need to look at these assemblies.
● Medium = These assemblies may need to be edited.
● Low = Fail: These assemblies are likely beyond rescue and should be marked as failed.
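The three bins amount to a cascade of threshold checks. The sketch below is a minimal illustration under stated assumptions: the High thresholds follow the COI example given in this section (0 ambiguities, 658 consensus length, 1.8 coverage), the Medium thresholds are invented for the example, and the class and function names are hypothetical rather than part of the plugin.

```python
from dataclasses import dataclass


@dataclass
class BinThresholds:
    max_ambiguities: int
    min_length: int
    min_coverage: float

    def accepts(self, ambiguities, length, coverage):
        return (ambiguities <= self.max_ambiguities
                and length >= self.min_length
                and coverage >= self.min_coverage)


HIGH = BinThresholds(max_ambiguities=0, min_length=658, min_coverage=1.8)
MEDIUM = BinThresholds(max_ambiguities=10, min_length=500, min_coverage=1.0)  # assumed values


def assign_bin(ambiguities, length, coverage):
    """Return the first (strictest) bin whose thresholds the assembly meets."""
    if HIGH.accepts(ambiguities, length, coverage):
        return "High"
    if MEDIUM.accepts(ambiguities, length, coverage):
        return "Medium"
    return "Low"


print(assign_bin(0, 658, 2.0), assign_bin(3, 640, 1.5), assign_bin(40, 300, 1.0))
```

Checking the strictest bin first means that relaxing the Medium thresholds only changes how many assemblies you choose to inspect; it never moves an assembly out of High.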

To set up the parameters in this way, you need to have strict parameters for the high bin, for example 0 ambiguities, 658 consensus length, and 1.8 coverage (for the COI barcode). The medium bin can be quite relaxed depending on how many assemblies you want to examine. Users can create binning profiles and share these with collaborators or use them for standard operating procedures to ensure repeatability.

3.2.5. Assembly

Select all of the reads you are going to assemble (and a reference sequence or list if you have one) and then click the “Assembly” button in the main toolbar. To assemble pairs of reads by name, check “Assemble by name” and choose the appropriate delimiters (Fig. 20). The recommended Sensitivity is “Highest Sensitivity/Slow.” It is also possible to choose “Custom Sensitivity” and set your own parameters (e.g., minimum overlap). If you have already trimmed your sequences, select “Use existing trim regions”; otherwise, specify trim options. “Save assembly report” and “Save results in a new subfolder” should both be selected. After clicking “OK,” a new subfolder called “Assemblies” will be generated and assemblies will be added to it as the operation runs. When the operation is finished, an Assembly Report and list of Consensus Sequences will also be added to the folder. Geneious generates a new subfolder each time an assembly is run. The assembly report (Fig. 21) provides a record of which reads were assembled successfully and which reads failed. For example, click the blue hyperlink next to the red “X” to select all reads which failed to assemble and use the “Mark as Failed in LIMS” tool to mark these reads for resequencing.

Fig. 20. The assembly dialog box.

3.2.6. Viewing And Editing Assemblies

As with the traces, assemblies are each assigned a bin based on various quality criteria. By default, Geneious uses a Highest Quality consensus, which rarely generates ambiguities because the highest quality base call is used automatically. However, ambiguities are generated in situations where the qualities of conflicting bases are similar.
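The idea of a highest-quality consensus can be illustrated for a single alignment column. The sketch below is not Geneious's algorithm: the rule for deciding that two qualities are "similar" (a difference of at most `similar_within` Phred units) is an assumption made for the example, and the function name is hypothetical. The two-base ambiguity codes are the standard IUPAC ones.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_base(calls, similar_within=5):
    """calls: list of (base, phred_quality) pairs for one alignment column."""
    best_quality = max(quality for _, quality in calls)
    # Keep every distinct base whose quality is close to the best one.
    candidates = {base for base, quality in calls
                  if best_quality - quality <= similar_within}
    if len(candidates) == 1:
        return candidates.pop()
    return IUPAC.get(frozenset(candidates), "N")


print(consensus_base([("A", 40), ("G", 12)]))  # clear winner
print(consensus_base([("A", 30), ("G", 28)]))  # similar qualities, ambiguity code
```

With two reads per specimen, a column is called unambiguously whenever one trace is clearly better, which is why only a few positions usually need manual inspection.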

Fig. 21. The assembly report (showing assembled and unassembled reads).

The procedure for checking disagreements in an assembly is as follows.

● Turn on Allow Editing in the Viewer toolbar.
● Select an assembly to display an overview. Disagreements are shown as small black marks on the sequences and the trimmed regions can be seen.
● Highlight disagreements or ambiguities in the Display tab of the control panel to the right of the viewer. Ctrl+D on Windows and Linux or Command+D on Mac OS jumps between highlighted bases. If this is the first assembly you have looked at, you should zoom in to a level you are comfortable with. Geneious remembers this zoom level for the next assembly (Fig. 22).
● If you agree with a call or an ambiguity in the consensus, then you can go to the next disagreement because the call has already been made.
● If you disagree, you can resolve the conflict by editing either of the traces or by editing the consensus (editing the consensus is a shortcut for changing all base calls at that position).
● Continue editing through the disagreements until you have looked at all of them. Save the assembly and repeat for the next assembly.

Fig. 22. A sample assembly view in Geneious.

If you decide that some assemblies are not good enough despite having assembled correctly, then you can mark these as failed at this point. Select the assemblies that have failed and go to “Mark as Failed in LIMS.” It is a good idea to move the failed assembly to a new subfolder (e.g., named “fail”).

3.2.7. Alignment of Consensus Sequences

An alignment of consensus sequences is a useful tool for checking and correcting assembly accuracy, especially near the ends of traces, where there might be poor coverage. To generate an alignment of consensus sequences:

● Select all of the assemblies you want to align and click the “Alignment” button in the toolbar.
● In the Alignment dialog box (Fig. 23), click “Consensus Align,” select “Generate alignment of consensus sequences only,” and choose an alignment algorithm (e.g., MUSCLE, MAFFT, or ClustalW).
● Click “OK” and an alignment is generated as a new document (Fig. 24).

The alignment retains all information from the original assemblies. Clicking the small blue arrow button to the left of each name brings you to the associated assembly (Fig. 24). Geneious currently does not propagate changes in the alignment back to the original assembly, but you can use the alignment for downstream steps so that alignment edits are not lost.

Fig. 23. Alignment dialog box.

Fig. 24. Nucleotide alignment consensus dialog box.

To view the alignment translation, follow these steps.

● In the options to the right of the alignment view, change the Colors option to “By Translation.”
● Turn off the Highlighting option.
● Open the Complement and Translation section and set up the appropriate translation options, such as Genetic Code and Frame. We recommend that “Translation” is set to “Relative to Consensus.” You can also set the amino acid Color scheme here (e.g., MacClade).
● You should also turn off Annotations so that editing history annotations do not interfere with the layout.
● Check the alignment for frameshifts and stop codons (binning should have identified these previously).
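Besides inspecting the translation by eye, one simple signal of a possible frameshift in an aligned protein-coding sequence is an internal gap whose length is not a multiple of three. The sketch below illustrates that check only; it is not a Geneious feature, the function name is hypothetical, and gaps are assumed to be written as "-".

```python
import re


def frameshift_gaps(aligned_seq):
    """Return (start, length) for internal gap runs whose length is not a multiple of 3."""
    # Leading and trailing gaps are alignment padding, not indels.
    offset = len(aligned_seq) - len(aligned_seq.lstrip("-"))
    core = aligned_seq.strip("-")
    return [(match.start() + offset, len(match.group()))
            for match in re.finditer(r"-+", core)
            if len(match.group()) % 3 != 0]


print(frameshift_gaps("--ATGGCC---AAATTT-GGG--"))
```

A three-base gap removes one codon and keeps the reading frame, whereas the single-base gap reported here would shift every downstream codon and deserves a look at the underlying traces.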

Clicking on Help in the toolbar while viewing an alignment displays documentation on editing and shortcut keys.

3.3. Verify Taxonomy and Loci

To help verify taxonomy annotated in the FIMS and identify contaminants (high-quality sequences, but wrong targets), Geneious can run a specialized batch BLAST against the NCBI public DNA sequence database. This can be run on any selection of contigs and alignments of contigs. If you have performed an alignment as above, then you should use the alignment to make sure that you are using the edited consensus sequence.

● Select an assembly, a list of assemblies, or an alignment.
● From the main toolbar, select Biocode > Verify Taxonomy (Fig. 25). This brings up the standard BLAST options. “Fully annotate hit summaries” must be turned on, but the rest of the options can be modified as necessary. Click “OK” to begin the search. The search can take a long time to run because of BLAST.

Fig. 25. Verify taxonomy dialog box.

Fig. 26. A sample verify taxonomy table; see “Note columns” for explanation of headers.



● When the process is complete, a “Verify Taxonomy Results” document will be produced (Fig. 26). This displays a table with a row for each query, comparing it with each of the top hits returned from BLAST. As with traces and assemblies, customizable binning options are available for efficient reporting on the results (see Note 15).
● Rows can be selected in the table by clicking/dragging and holding shift/ctrl/command while clicking. Click on “Go To Queries” to jump to the assemblies associated with the selected rows. Click “Show Other Hits” to see additional hits that were downloaded for the selected row. “Show Other Hits” is only enabled when one row is selected. Double clicking on a row also shows other hits.
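The comparison that the results table supports can be pictured as matching two lineages rank by rank. The sketch below is an illustration only: the record layout, the sample identifiers, and the choice of how many ranks must agree are hypothetical, and no BLAST search is performed.

```python
def taxonomy_agreement(expected, hit):
    """Number of leading ranks (highest first) on which two lineages agree."""
    depth = 0
    for expected_name, hit_name in zip(expected, hit):
        if expected_name.lower() != hit_name.lower():
            break
        depth += 1
    return depth


def flag_mismatches(queries, min_depth):
    """Names of queries whose top hit agrees with the FIMS taxonomy on fewer than min_depth ranks."""
    return [q["name"] for q in queries
            if taxonomy_agreement(q["expected"], q["top_hit"]) < min_depth]


# Hypothetical queries: taxonomy annotated in the FIMS versus the lineage of the top hit.
queries = [
    {"name": "MBIO001",
     "expected": ["Animalia", "Arthropoda", "Malacostraca"],
     "top_hit": ["Animalia", "Arthropoda", "Malacostraca"]},
    {"name": "MBIO002",
     "expected": ["Animalia", "Mollusca", "Gastropoda"],
     "top_hit": ["Fungi", "Ascomycota", "Sordariomycetes"]},
]
print(flag_mismatches(queries, min_depth=2))
```

A high-quality sequence that disagrees with the expected lineage at a high rank, as the second query does here, is the typical signature of a contaminant rather than a poor read.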

The Verify Taxonomy Results may reveal that some sequences do not match the expected taxonomy. If you decide that the sequencing was a failure (possibly due to contamination), you can go back to the assemblies and “Mark as Fail in LIMS” and list the reason in the notes section. Also, as mentioned above, it is always a good idea to move any failed assemblies to a new subfolder (e.g., named “fail” or “contaminants”).

3.4. Mark Sequences as Pass in LIMS

Once you have verified taxonomy, assured that all sequence quality parameters are acceptable, and trimmed the primers, select either the assemblies themselves from the Assembly folder or the aligned consensus sequences from the alignments and select “Mark As Pass in LIMS” under the Biocode button. This action writes the following data to the LIMS database:

● The extraction ID.
● The consensus sequence (with sequence quality values).
● The parameters used to trim and assemble the reads.
● The average coverage of the assembly.
● The number of disagreements in the assembly.
● A record of any edits made to the sequences in the assembly.
● The assembly bin.
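The items listed above can be thought of as one record per passed sequence. The sketch below only illustrates that grouping: the class name, field names, and types are hypothetical and do not describe the actual LIMS schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class PassedSequenceRecord:
    extraction_id: str
    consensus: str
    qualities: list              # one quality value per consensus base
    trim_assembly_params: dict   # parameters used to trim and assemble the reads
    mean_coverage: float
    disagreements: int
    edits: list = field(default_factory=list)
    assembly_bin: str = "High"

    def __post_init__(self):
        if len(self.qualities) != len(self.consensus):
            raise ValueError("one quality value is expected per consensus base")


record = PassedSequenceRecord(
    extraction_id="EX0001",
    consensus="ACGT",
    qualities=[40, 40, 38, 35],
    trim_assembly_params={"error_probability_limit": 0.05},
    mean_coverage=2.0,
    disagreements=0,
)
print(sorted(asdict(record)))
```

Storing the trimming and assembly parameters and the edit history next to the consensus is what allows a passed sequence to be traced back to the reads and decisions that produced it.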

Marking a sequence as Pass saves the consensus sequence of your assembly to the LIMS. This is the sequence that you submit to public sequence databases. You should make sure that the sequence is of sufficient quality and that you have completed all edits before you mark it as passed.

3.5. Searching the Database

Biocode searches return four types of documents as follows.

● Tissue sample documents: each of these represents a tissue sample in the field database. Tissue documents contain collection information, and optionally taxonomy and photographs.
● Plate documents: these represent a plate in the lab database, and contain a diagram of the wells, as well as the plate’s thermocycle and attached gel images if available.
● Workflow documents: these contain a linked set of reactions performed on an extraction.
● Sequence documents: sequences entered into the LIMS when traces/assemblies are marked as pass/fail.

You can perform either a basic search or an advanced search. Basic searches are performed by entering text into the search box, and return all documents that have a field with a similar value to the text entered (Fig. 27). You can restrict searches to particular types of documents by unchecking some of the checkboxes to the right of the search box. Advanced searches explicitly search against particular fields. They are performed by clicking the More Options button. Click the + and − buttons to add and remove fields from the search.
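The difference between the two search modes can be sketched as follows. This is an illustration rather than the plugin's search logic: documents are modelled as plain dictionaries, a "similar value" is assumed to mean a case-insensitive substring match, and the two conditions shown are hypothetical stand-ins for the options in the drop-down.

```python
def basic_search(documents, text):
    """Return documents in which any field contains the text (case-insensitive)."""
    needle = text.lower()
    return [doc for doc in documents
            if any(needle in str(value).lower() for value in doc.values())]


CONDITIONS = {
    "contains": lambda value, term: term.lower() in str(value).lower(),
    "is": lambda value, term: str(value).lower() == term.lower(),
}


def advanced_search(documents, criteria):
    """criteria: list of (field, condition, term) triples; every triple must match."""
    return [doc for doc in documents
            if all(field in doc and CONDITIONS[condition](doc[field], term)
                   for field, condition, term in criteria)]


# Hypothetical documents of two of the four types.
documents = [
    {"type": "tissue", "specimen_id": "MBIO1234", "genus": "Conus"},
    {"type": "plate", "name": "COI_plate_01"},
]
print(basic_search(documents, "coi"))
print(advanced_search(documents, [("genus", "is", "conus")]))
```

A basic search looks across every field and so can return documents of any type, whereas each advanced criterion names the field it applies to, which makes the result set narrower and more predictable.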

Fig. 27. Search box.

Fig. 28. The cherry picking window.

Choose the fields you want to search using the leftmost drop-down, and choose the search condition using the drop-down box to its right (see Note 16).

3.6. Cherry Picking

The Cherry Picking function (Fig. 28) is available in the Biocode drop-down menu and allows you to select reactions from one or more plates, based on the criteria that you specify (e.g., failed reactions for second attempts or extractions based on taxonomy for additional genes). You can use these selected reactions to create a new plate (or plates) or have them returned to you as a list. To perform Cherry Picking, select the plates containing the reactions you want to pick and click on Cherry Picking in the Biocode toolbar menu. Choose your destination, and then choose the criteria to select your reactions. You can add additional criteria using the orange “+” button on the right.
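Conceptually, cherry picking filters the reactions on the selected plates and lays the survivors out on one or more new plates. The sketch below is a simplified illustration: the plate and reaction structures, the `status` field, and the column-wise fill order on a 96-well plate are assumptions made for the example, not a description of the plugin's behavior.

```python
from string import ascii_uppercase


def well_names(rows=8, cols=12):
    """Well names in column-wise order: A1, B1, ..., H1, A2, ..."""
    return [f"{ascii_uppercase[r]}{c}" for c in range(1, cols + 1) for r in range(rows)]


def cherry_pick(plates, predicate, rows=8, cols=12):
    """Collect reactions that satisfy predicate and place them on new plates."""
    picked = [rx for plate in plates for rx in plate["reactions"] if predicate(rx)]
    wells = well_names(rows, cols)
    new_plates = []
    for start in range(0, len(picked), len(wells)):
        chunk = picked[start:start + len(wells)]
        new_plates.append({well: rx["extraction_id"] for well, rx in zip(wells, chunk)})
    return new_plates


# Hypothetical plates; pick failed reactions for a second attempt.
plates = [
    {"name": "P1", "reactions": [{"extraction_id": "E1", "status": "failed"},
                                 {"extraction_id": "E2", "status": "passed"}]},
    {"name": "P2", "reactions": [{"extraction_id": "E3", "status": "failed"}]},
]
print(cherry_pick(plates, lambda rx: rx["status"] == "failed"))
```

Passing the criterion as a function mirrors the way additional criteria can be stacked in the dialog: any test on a reaction's fields (status, taxonomy, locus) can be substituted without changing the layout step.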

3.7. Preparing Your Documents for Genbank Submission

Genbank has stringent requirements for submitted sequences. It is important that you correctly prepare your sequences before you begin the submission process. This section outlines the fields you need for your submission, and how to attach them to your Geneious documents. All fields that are a part of your Genbank submission need to be either entered in the submission dialog or annotated on your sequences. In order to receive the BARCODE keyword in Genbank, the following fields need to be annotated on your sequences:

● Specimen Voucher/ID.
● Sequence ID.
● Target locus.
● Collector.
● Collection date.
● Identified By.
● Organism.
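Before starting a submission, it can be useful to check that every sequence carries all of the fields listed above. The sketch below is a minimal illustration: annotations are modelled as a plain dictionary keyed by the field names as written in this section, and the function name and sample values are hypothetical.

```python
REQUIRED_BARCODE_FIELDS = (
    "Specimen Voucher/ID",
    "Sequence ID",
    "Target locus",
    "Collector",
    "Collection date",
    "Identified By",
    "Organism",
)


def missing_barcode_fields(annotations):
    """Return the required fields that are absent or blank on a sequence."""
    return [name for name in REQUIRED_BARCODE_FIELDS
            if not str(annotations.get(name, "")).strip()]


# Hypothetical annotations on one sequence.
annotations = {
    "Specimen Voucher/ID": "MBIO1234",
    "Sequence ID": "MBIO1234_COI",
    "Target locus": "COI",
    "Collector": "",
    "Organism": "Conus sp.",
}
print(missing_barcode_fields(annotations))
```

Running such a check over a whole folder before opening the submission dialog shows which specimens need their FIMS records or Document Notes completed first.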

For non-barcode submissions, requirements vary depending on what type of sequences you are submitting (e.g., AFLPs, SNPs, nDNA, etc.). We recommend that you check with the NCBI Trace Archive: http://eutils.ncbi.nih.gov/Traces/trace.fcgi?cmd=show&if=rfc.

Fig. 29. The document notes tab, from a group of selected assembly documents (showing primer information attached).

3.7.1. Attaching Data to Your Sequences

The easiest way to attach data to your sequences is to have included it in your FIMS database. When you download the traces from LIMS (or annotate the sequences with FIMS/LIMS data; see Note 13), the information will be automatically attached to your sequences. If you are using the submission tool without a FIMS or LIMS or you have extra information you want to attach, then you can use Document Notes. Document Notes appear as the rightmost viewer for any selected document(s), and enable you to store arbitrary information on your sequences (Fig. 29). When you click on the notes tab, you will see a list of the notes currently added to your documents, displayed as name/value pairs. To add a note, click the “Add Custom Note” button in the toolbar to see a list of predefined note types. The “Genbank Submission” note type contains the fields most commonly used by submitters. Any notes (and note fields) added to your documents can be attached to your Genbank submissions. If you do not see a note type that meets your needs, you can generate your own by clicking “Edit note types.” Click the “Generate Note Type” button in the note types dialog, and click the orange + buttons to add some note fields. We recommend choosing “Text” as the field type for Genbank fields. Once you have generated your note type, add it to your selected documents by selecting it from the “Add Custom Note” drop-down menu.

Any values you enter in the viewer are applied to all selected documents when you click save.

3.8. Genbank Submission

The Genbank Submission plugin allows you to submit your contigs to Genbank once you have completed all edits.

3.8.1. Preparing Your Submission

If you did not install the Genbank submission plugin when you set up the Biocode plugin, do so now (http://software.mooreabiocode.org/index.php?title=Download). Once the plugin is installed, select the contigs and/or sequences that you want to submit to Genbank, and click Tools > Submit to Genbank. Fill in the options and fields (Fig. 30) according to the following guidelines.

● The submission name is a free-form field and does not affect the results of your submission, so it can be filled in as desired.
● Click Edit Publisher Details to edit your author/publication information. This information is preserved between submissions, so in many cases it does not need to be changed.
● The next set of options matches fields annotated to your documents. You may choose a field from the drop-down to map a field on your documents to a Genbank submission field. All fields displayed in the main dialog are required. If you want to add optional fields, click the “Additional Source Fields” button. You can choose the fields you want from the drop-down menu, and click the + button to add more fields.
● If you have selected documents with traces and want to include them in your submission (required for the BARCODE keyword), click the “Include Traces” checkbox. The required fields are variable, so the options you see will change depending on what values are selected. You can use the additional fields button to add optional fields to your submission.
● Check the “Include Primers” checkbox to include primers. If you are submitting sequences annotated from the LIMS, then primers will have been annotated on your sequences as document notes. If not, you can annotate the primers yourself by clicking the notes tab when viewing the selected documents and choosing “Sequencing Primer” in the “Add Custom Note” drop-down.
● If you have selected assemblies, you can choose the options used for building the consensus sequence (the passed consensus is what is submitted to Genbank).
● If you have chosen BARCODE as your experimental strategy, then you are able to enter the target locus (gene) of your sequences. This will be included in your submission as gene annotations spanning the entire length of your sequences.

Fig. 30. The Genbank submission options. Some of these options may not be displayed, depending on your selected documents.

3.8.2. Validation and Submission

You may either generate a submission file or upload the submission directly to NCBI (you need a BankIt FTP account to do this; see Note 17). You can make your choice at the top of the submission options dialog. If you are updating an existing submission, you can choose the update option and enter the BankIt ID in the field provided. Otherwise, choose “Upload New Submission” and a new submission ID is generated for you.

Fig. 31. The Genbank validation dialog.

Your submission is validated using tbl2asn, and you will be shown any problems before the submission is commenced. The validation result window has two tabs. (1) The Validation errors/warnings tab shows you a list of errors that may prevent your submission from being accepted. (2) The Discrepancies tab shows you potential errors that you may have made, based on common errors made in Genbank submissions. It is recommended that you thoroughly review the information in both tabs before proceeding. If you have chosen to automatically upload your submission, further validation will take place on the server once the upload is complete. Geneious informs you whether your submission has been Accepted, Accepted with Warnings, or Rejected (Fig. 31). You should also receive an e-mail from Genbank detailing your submission. Once your submitted sequence(s) have been assigned accession numbers, you should receive a further e-mail from Genbank with the details.

3.9. Getting Help

The Biocode plugins are complex, and while learning how to use them can seem like a daunting task, help is available. Your first port of call should be the Geneious introduction video (http://www.biomatters.com/assets/demonstrations/biocode.html), which walks you through the plugins; an instruction manual is available at http://software.mooreabiocode.org. A user community and technical support are available from http://connect.barcodeoflife.org/group/lims. Here, you can engage with the wider community, get help from experienced users, and make suggestions about how to improve the plugins. If you have any questions or suggestions that you do not want to post to the community, you can e-mail at [email protected].

4. Notes

1. EXCEL FIMS
For single users, or users who cannot set up a field information management system (FIMS) server, the EXCEL FIMS is the easiest way to connect to a specimen/tissue database. Geneious will read data from an Excel workbook, and convert the rows into specimen/tissue records. It is assumed that all molecular processes start with tissue and this is the entry point into the LIMS.
Your Excel workbook must conform to the following:
● Your workbook should have only one sheet. If you have more than one sheet, only the first sheet will be read.
● Each column corresponds to a data field in your database. You can have as many columns as you like, but you must have at least a specimen ID column and a tissue sample ID column.
● The first row of the table should be the names of the columns. The other rows should contain the data.
Right click on the biocode icon in the Geneious service panel (on the left-hand side of the main window), and click login. Choose “EXCEL FIMS” in the uppermost drop-down select box, and enter the location of the Excel file. Choose the columns that contain the specimen and tissue IDs from the drop-downs (if you use only one ID for specimens and tissues, enter the same column in both drop-downs). Also enter the taxonomy fields in your Excel file, in order of highest to lowest, using the + and − buttons to add and remove fields.
You can also specify plate name and well columns in your sheet if you keep your tissues in plates. Just check the “The FIMS database contains plate information” checkbox, and enter your plate and well fields. You will then be able to make a direct copy of the tissue plate when making new extraction plates.
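The workbook requirements in Note 1 can be checked before connecting. The sketch below assumes that the first sheet has already been read into a list of rows (each a list of cell values) by whatever spreadsheet library you use; the function name and the column names are hypothetical, and the check for empty ID cells is an extra assumption rather than a stated requirement.

```python
def check_fims_rows(rows, specimen_col, tissue_col):
    """Return a list of problems found in a sheet given as rows of cell values."""
    if not rows:
        return ["the sheet is empty"]
    header, data = rows[0], rows[1:]
    problems = [f"missing column: {col}"
                for col in (specimen_col, tissue_col) if col not in header]
    if problems:
        return problems
    for col in (specimen_col, tissue_col):
        idx = header.index(col)
        for n, row in enumerate(data, start=2):  # row 1 holds the column names
            value = row[idx] if idx < len(row) else None
            if value is None or str(value).strip() == "":
                problems.append(f"row {n}: empty {col}")
    return problems


rows = [
    ["Specimen ID", "Tissue ID", "Genus"],
    ["S1", "T1", "Conus"],
    ["S2", "", "Conus"],
]
print(check_fims_rows(rows, "Specimen ID", "Tissue ID"))
```

If you use a single ID for specimens and tissues, pass the same column name for both arguments, just as you would enter the same column in both drop-downs.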

2. TAPIR FIMS
TAPIR, or TDWG Access Protocol for Information Retrieval, is a standard protocol for sharing specimen data. The TAPIR FIMS connection reads in tissue data from a TAPIR server. Reliably integrating museum collection management systems (CMS) or FIMS with a LIMS can be difficult. Often, data need to be exported and then reimported into each system. The TAPIR protocol is an attempt to standardize the way in which collection databases communicate, and to remove the difficulties associated with collaborative collection management.
Setting up a TAPIR provider with LIMS extension
These instructions assume the use of the TapirLink software (written in PHP), a collections management database with tissue records stored in a TapirLink-compatible relational database, and the free version of Geneious (Biomatters software) with the Biocode plug-in installed.
Step 1: Set up TAPIR.
There are several TAPIR software installation tools (http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirSoftware). This tutorial assumes that you are using the TapirLink software (http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirLink). TapirLink requires a Web server running PHP, with an appropriate module to connect to your FIMS database. We recommend Apache, although this is not required. Download the TapirLink installation archive, and follow the installation instructions included.
Step 2: Incorporate the LIMS extension into the TAPIR provider.
While setting up your TAPIR provider (or afterwards), be sure to include the LIMS extension as an additional schema (available at http://biocode.berkeley.edu/schema/lims_extension.xml). You will be asked to map your local database fields to the LIMS extension fields.
Step 3: Point the Geneious LIMS system to your TAPIR provider link.
Right click on the biocode icon in the service tree on the left-hand side of the main Geneious window, and click “login.” Choose “TAPIR” from the field database drop-down at the top of the connection dialog, enter the address of your TAPIR server. Choose an LIMS database (see Remote LIMS; Note 3 and Local LIMS; Note 4), and click OK. 3. Google Fusion Tables FIMS Google Fusion tables is ideal for groups that want to have a collaborative shared FIMS database, but do not want to set up their own server. You may enter data directly into Fusion


Laboratory Information Management Systems for DNA Barcoding


Tables, or upload Excel spreadsheets (see http://www.google.com/fusiontables/public/tour/index.html). As with the Excel FIMS, it is assumed that all molecular processes start with tissue and this is the entry point into the LIMS. Right click on the Biocode icon in the Geneious service panel (on the left-hand side of the main window), and click login. Choose Google Fusion Tables in the uppermost drop-down. When viewing a fusion table on the Web, the URL will contain the phrase dsrcid=XXXX (with XXXX being a number). This number is your table’s ID. Enter the ID in the space provided, and click Update. Choose the columns that contain the specimen and tissue IDs from the drop-downs (if you use only one ID for specimens and tissues, enter the same column in both drop-downs). Also enter the taxonomy fields, in order of highest to lowest. You may press the autodetect button to do this for you, or use the + and − buttons to add and remove fields. You can also specify plate name and well columns in your table if you keep your tissues in plates. Just check the “The FIMS database contains plate information” checkbox, and enter your plate and well fields. You will then be able to make a direct copy of the tissue plate when making new extraction plates.
4. MySQL FIMS
If you are already using a MySQL database for your FIMS but do not want to set up a TAPIR server (see Note 2), then you can connect directly to the MySQL database. Right click on the Biocode icon in the Geneious service panel (on the left-hand side of the main window), and click login. Choose MySQL Database in the uppermost drop-down. Enter your server URL, port, username, password, and database in the fields provided, and click Update. Choose the columns that contain the specimen and tissue IDs from the drop-downs (if you use only one ID for specimens and tissues, enter the same column in both drop-downs). Also enter the taxonomy fields, in order of highest to lowest.
You may press the autodetect button to do this for you, or use the + and − buttons to add and remove fields. You can also specify plate name and well columns in your table if you keep your tissues in plates. Just check the “The FIMS database contains plate information” checkbox, and enter your plate and well fields. You will then be able to make a direct copy of the tissue plate when making new extraction plates.
5. Remote MySQL LIMS
The remote LIMS is intended for labs and research groups where the lab data needs to be shared between and edited by a large number of users. To set up a remote LIMS server, you will need MySQL, available from http://www.mysql.com. Download and install the MySQL server software on a server


M. Parker et al.

which is accessible by everyone who needs to use the LIMS. The next step is to create a blank schema. You can store multiple LIMS databases on one server, but each database must have its own schema. Create a schema, and then run the following script to create a blank LIMS database: http://www.biomatters.com/assets/plugins/biocode/labbench_latest_mysql.sql. You will need to create at least one user account with read/write access to this schema. To connect to the LIMS database, choose “Remote Server” in the Biocode login screen within Geneious.
6. Local LIMS
This is intended for single or small-scale users, and is a database that exists within your local copy of Geneious. To create a new database, click on the “Add Database” button, enter a name for your new database, and click “Ok.” A new, empty database will be created for you. If you have already created a database, select it from the drop-down (other users of Geneious will not be able to connect to any LIMS databases that you create as a local database). To create a LIMS database that you can use to share data with other people, please see Remote LIMS (Note 5). The Local LIMS databases are stored within your local Geneious user directory, so they will be backed up if you choose to do a Geneious backup (choose the “Back Up” button from the main Geneious toolbar). It is suggested that you back up your Geneious data regularly to avoid losing data.
7. Workflow considerations
One extraction can belong to many workflows (as extractions are often used as stock for many reactions). A workflow can contain any number of failed reactions and any number of passed reactions. The passed or failed status of a workflow for a given reaction type is taken from the most recent reaction of that type.
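The status rule in Note 7 (the most recent reaction of a given type decides whether the workflow has passed or failed for that type) can be illustrated with a short sketch. This is not the Biocode plug-in's code; the `Reaction` class, its field names, and the reaction-type labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Reaction:
    """Hypothetical record of one reaction in a workflow."""
    reaction_type: str  # illustrative labels, e.g. "PCR" or "CycleSequencing"
    run_date: date
    passed: bool


def workflow_status(reactions: List[Reaction], reaction_type: str) -> Optional[bool]:
    """Return the pass/fail state of the most recent reaction of one type.

    Returns None when the workflow has no reaction of that type.
    """
    same_type = [r for r in reactions if r.reaction_type == reaction_type]
    if not same_type:
        return None
    latest = max(same_type, key=lambda r: r.run_date)
    return latest.passed


# A failed PCR followed by a later, successful repeat.
history = [
    Reaction("PCR", date(2010, 5, 1), False),
    Reaction("PCR", date(2010, 5, 8), True),
    Reaction("CycleSequencing", date(2010, 5, 10), False),
]
pcr_state = workflow_status(history, "PCR")
```

In this example the earlier failed PCR does not affect the outcome, because only the later reaction of the same type is consulted.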
While each workflow can only contain reactions of a single locus, each locus can have any number of workflows (useful if multiple people are working independently on the same locus for the same extraction). Workflows are created with reactions. Any reaction (apart from extractions) that has an empty workflow field when saved will have a new workflow created for it. That means that it is particularly important that you fill in the workflow field correctly for all reactions that you save. Fortunately, this is easily accomplished in the “Bulk editor” (see Subheading 3.1.1 and Fig. 7). Clicking autodetect workflows in the tools drop-down will automatically fill in the workflow field for any reactions that have an available workflow (i.e., one with a matching extraction and locus).


Fig. 32. A graphical illustration representing how four 96-well plates are converted into a single 384-well plate.
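The conversion illustrated in Fig. 32 can be sketched in code. The text states only that each 96-well plate corresponds to one quadrant of the 384-well plate; the interleaved layout used below (each source well expands into one cell of a 2 × 2 block) is a common laboratory convention and is an assumption here, not a description of how Geneious numbers its quadrants. The function name `to_384` is hypothetical.

```python
import string

ROWS_96, COLS_96 = 8, 12  # a 96-well plate has rows A-H and columns 1-12


def to_384(well: str, quadrant: int) -> str:
    """Map a 96-well position (e.g. "B7") to a 384-well position.

    Assumes an interleaved layout: quadrant 1 fills the top-left cell of
    each 2 x 2 block, 2 the top-right, 3 the bottom-left, 4 the bottom-right.
    """
    row = string.ascii_uppercase.index(well[0].upper())
    col = int(well[1:]) - 1
    if not (0 <= row < ROWS_96 and 0 <= col < COLS_96):
        raise ValueError(f"not a 96-well position: {well}")
    if quadrant not in (1, 2, 3, 4):
        raise ValueError("quadrant must be 1, 2, 3, or 4")
    row_offset, col_offset = divmod(quadrant - 1, 2)
    new_row = string.ascii_uppercase[2 * row + row_offset]
    new_col = 2 * col + col_offset + 1
    return f"{new_row}{new_col}"


# Well A1 of each of the four source plates lands in a different cell
# of the top-left 2 x 2 block of the 384-well plate.
corner = [to_384("A1", q) for q in (1, 2, 3, 4)]
```

Under this assumed layout, four full 96-well plates fill all 384 destination wells exactly once.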

If more than one matching workflow exists, the most recent one will be chosen. If no matching workflow exists, the workflow field will remain blank, and a new workflow will be created when you save the reaction. Reactions are entered into the Lab Bench Database by plates. Plates come in a number of sizes, 48, 96, or 384 wells, and also in a grouping of any number of reactions. Creating a plate opens that plate in the plate viewer. 8. Converting between 96- and 384-well plates It is possible to create a new 384-well plate from a group of 96-well plates, and to create a group of 96-well plates from a 384-well plate (Fig. 32). Each 96-well plate corresponds to one quadrant in the 384-well plate. To create a 384-well plate, select up to four 96-well plates in Geneious, and click “New Reaction.” Select the “Create plate from existing document” checkbox, and choose 384-well plate. A panel will appear at the bottom of the dialog which will allow you to choose to which quadrant each 96-well plate corresponds. 9. Creating custom cocktails Cocktails are a recipe for the ingredients that will go into a reaction (excluding the primer). You can choose from a list of existing cocktails, or create your own. To create your own cocktail, click “Edit Cocktails,” then click new in the dialog, and enter the volumes and concentrations in the fields provided (Fig. 33). There is space for you to store one extra ingredient (both concentration and volume). Any additional information about your cocktail can be stored in the notes field. For safety reasons, you cannot modify or delete cocktails once they are created. You can create a copy of an existing cocktail by selecting it in the view, and then clicking “Add.”


Fig. 33. “Edit Cocktails” window.

The new cocktail will have all the same volumes and concentrations as the one you selected.
10. Creating thermocycler profiles
To create custom thermocycler profiles, click “View/Add Thermocycles” in the New PCR plate toolbar. When the “Edit Thermocycles” window opens, click the “Add” button in the lower left-hand corner of the window (Fig. 34). A New Thermocycle window will open, in which you can customize temperatures and cycles using the dialog boxes and “Edit Cycles” buttons (Fig. 35).
11. Creating Primers
To create a new primer in Geneious, click “Sequence” in the main toolbar. In the Sequence drop-down menu, click “New Sequence.” The New Sequence window will open; at the bottom of the window, choose “Primer” from the Type drop-down menu. Then, enter your sequence and identifying information in the dialog boxes and click OK (Fig. 36). Primers set on reactions will be saved to the lab bench database so that they can be viewed by others without you needing to send them your primer library. If you have a large number of primers, you may want to organize your primers by type. You can do this by storing your primers in a folder structure in the Geneious service tree (for example, you could store all your primers for a particular locus in the same folder, or store your primers by taxonomy). This


Fig. 34. “Edit Thermocycles” window from a new PCR plate.

Fig. 35. A “New thermocycle” profile entry.


Fig. 36. “New Sequence” window for adding custom primers.

Fig. 37. Primer database accessed from the “Choose” button in the “Edit Wells” window.


folder structure will be displayed when you choose your primers when editing wells in your plate. To choose a primer (or primers) for your wells in the plate editor, select the wells you want to edit, and click “edit selected wells.” Select the primer you want to add to the reaction (Primer fields display a list of primers in your local database), and click the “Choose” button (Fig. 37). You can choose any primer from your database, and it will be applied to all the selected wells. Only primers you have set on wells are stored in the LIMS database. Primers that have not been set on wells exist only in your local copy of Geneious and cannot be seen by others accessing the LIMS.
12. Importing .ab1 files from disk, setting read directions, and batch renaming
To import traces from a disk, locate the .ab1 or .scf files on disk and then click and drag them from the file manager onto the new folder in Geneious. Alternatively, you can import from inside Geneious using File > Import > From File… in the menu. Once you have imported the raw trace files, it is currently necessary to tell Geneious which reads are in the forward or reverse direction. To set read directions, select all of either the forward or reverse reads from the ones you have imported and select Biocode > Set Read Direction in the toolbar. Choose either Forward or Reverse for the read direction and click OK. It is only necessary to mark either the forward or reverse reads; Geneious will work out the rest by process of elimination (this is so that the correct read is reversed during assembly and downstream steps are able to identify the direction of the reads). After performing this task, an extra column named “Is Forward Read” will be added to the reads, with a value of true or false. If your forward and reverse reads are in different folders, it is easiest to import all of the reads from one folder, then set the read direction for those, and then import the second folder.
If you want to change the names of your reads to reflect some aspect of the FIMS data, from the main toolbar select Edit > Batch Rename to copy your choice of fields into the name column. This feature is also available in renaming assemblies. You can also use Edit > Batch Rename… to add _F or _R to the names of your reads if the names do not have any indication of direction (not required). If you have imported both forward and reverse reads into Geneious before setting read direction, you can use Search or Filter in the top right corner of the Geneious window to locate a particular direction of read based on names.
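The "process of elimination" described in Note 12 can be pictured with a small sketch: once one direction has been marked, every unmarked read is taken to have the opposite direction. This is only an illustration of the idea, not the plug-in's implementation; the function name `assign_directions` and the trace file names are hypothetical.

```python
from typing import Dict, Iterable, List


def assign_directions(all_reads: List[str],
                      marked: Iterable[str],
                      marked_is_forward: bool = True) -> Dict[str, bool]:
    """Return {read name: is_forward} for every imported read.

    Reads in `marked` get the chosen direction; every other read is
    assigned the opposite direction by elimination.
    """
    marked_set = set(marked)
    unknown = marked_set - set(all_reads)
    if unknown:
        raise ValueError(f"marked reads were not imported: {sorted(unknown)}")
    return {name: (name in marked_set) == marked_is_forward for name in all_reads}


# Hypothetical trace names; only the forward reads are marked by the user.
reads = ["s1_F.ab1", "s1_R.ab1", "s2_F.ab1", "s2_R.ab1"]
is_forward = assign_directions(reads, ["s1_F.ab1", "s2_F.ab1"])
```

The resulting mapping plays the role of the “Is Forward Read” column: marked reads are true and all remaining reads are false (or the reverse, if the reverse reads were the ones marked).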


Fig. 38. “Annotate with FIMS/LIMS data” screen.

13. Annotating with FIMS/LIMS data
You can either enter a forward and reverse plate or use the annotated plate and well if you are updating sequences you have previously annotated. To aid downstream analysis and submission, it is extremely useful to annotate sequences with the associated data from the FIMS. This must be done pre-assembly (with the reads) because forward and reverse reads can come from different sequencing plates. Annotating is the first step in the assembly pipeline that utilizes the FIMS/LIMS database, so you will need to connect to the Biocode service before proceeding. To do this, right click on Biocode in the source panel on the left-hand side and select login. Select all of the reads which you imported and go to Biocode > Annotate with FIMS/LIMS Data in the toolbar. If you have plate data in your FIMS database and you do not wish to enter reaction information for your data in the LIMS, choose “Biocode > Annotate with FIMS data only…” (see below), and enter the name of your FIMS plate. You need to enter the forward and reverse sequencing plate names (from the LIMS) which correspond to your reads and identify which part of the sequence names identifies the well location. If both forward and reverse reads are on a single plate, then you can leave the reverse plate field blank or enter the same name twice. Click OK and the operation will add many new columns to the table for each of the reads (Fig. 38). These include things like Specimen ID, Taxonomy, and Collector. The values should be identical for each forward and reverse pair of reads. Often, there will be reads which do not have entries in the FIMS due to sequencing results coming through from wells


Fig. 39. Example of Mean Coverage.

which were essentially empty. This operation will tell you about any of these and the extra columns will be left blank.
Annotating with FIMS data only
Please note that you will not get primer information for your sequences using this method, so you may have to annotate those yourself if you want to use the sequences to generate a GenBank submission.
Tip: To get the empty well reads out of your way, you can easily select them all by sorting the table by one of the FIMS attributes (e.g., Tissue ID) and then selecting the ones with no value. You can then either delete them or create a new subfolder called “empties” and move them into there. If you want to change the names of your reads to reflect some aspect of the FIMS data, you can use Edit > Batch Rename… to copy your choice of fields into the name column.
14. Mean Coverage
Mean coverage is one of the binning criteria for assemblies and is also available as a column in the table. It is also the least intuitive value, so here is a description. Coverage is the number of sequences that cover a given position in an alignment/assembly. Mean coverage is, therefore, the mean of this value across all positions in the alignment/assembly (Fig. 39). In the alignment shown in Fig. 39, the first two positions have a coverage of 1, the next five positions have a coverage of 2, and the last three have a coverage of 1 again. Mean coverage is, therefore, (2 × 1 + 5 × 2 + 3 × 1)/10 = 1.5. The mean coverage will be between 1 and the number of sequences in the alignment/assembly. For a pairwise assembly, that means 2 is full coverage and 1 is no overlap between the two reads.
15. Taxonomic Verification Binning
Similar to the bin column that has been used for reads and assemblies, Bin columns in the Verify Taxonomy Results window summarize properties of the verification process by assigning each result a High, Medium, or Low value (in the form of a smiley).
Query: The name of the query assembly.
Query Taxon: The taxonomy of the query from the FIMS.
The verify operation fills in higher taxonomy by searching


NCBI taxonomy. If the taxon could not be found in NCBI, this will be noted and result will be marked as Low bin. Hit Taxon: The taxonomy of the top hit from BLAST. Levels in the taxonomies are marked as green or red depending on whether they match with the query. Keywords: A user-defined list of keywords which are expected in the hit definition from BLAST. These are highlighted red or green depending on whether they are found in the definition. Hit Definition: The definition of the top hit returned from BLAST with matching keywords highlighted. Hit Length: Length of the hit alignment from BLAST, highlighted according to binning parameters (red, orange, or green). Hit Identity: Identity of the hit alignment from BLAST, highlighted according to binning parameters (red, orange, or green). Assembly Bin: The bin that was assigned to the assembly according to the previously mentioned binning parameters. You can sort by any of the columns as usual and rearrange/ resize them. 16. Useful example searches Last Modified (LIMS) | Greater Than | 01 May 2010—all work done after the beginning of May. Plate Name (LIMS) | Contains | “Plate1”—all plates which have the phrase “Plate1” somewhere in their name. Locus | Contains | “COI”—all COI workflows and plates. 17. Most users should use the Geneious Bankit FTP account when submitting sequences. Larger research groups or sequencing centers may wish to create their own submission account, which can be done by contacting [email protected].

Fig. 40. GenBank Submission Account.
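The mean coverage calculation described in Note 14 can be reproduced with a few lines of code. The sketch below assumes that each read is represented by a 0-based, end-exclusive interval of alignment positions; this representation and the function name `mean_coverage` are illustrative choices, not part of Geneious.

```python
from typing import List, Tuple


def mean_coverage(reads: List[Tuple[int, int]], alignment_length: int) -> float:
    """Mean number of reads covering each alignment position.

    Each read is a (start, end) interval, 0-based and end-exclusive.
    """
    coverage = [0] * alignment_length
    for start, end in reads:
        for position in range(start, end):
            coverage[position] += 1
    return sum(coverage) / alignment_length


# The example of Note 14: ten positions, of which the first two and the
# last three are covered by one read and the middle five by both reads.
forward_read = (0, 7)   # covers positions 0-6
reverse_read = (2, 10)  # covers positions 2-9
example = mean_coverage([forward_read, reverse_read], 10)  # (2*1 + 5*2 + 3*1) / 10 = 1.5
```

Two reads that overlap along their whole length would give a value of 2, and two reads that do not overlap at all would give 1, matching the range stated in the note for a pairwise assembly.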

Chapter 14
DNA Extraction, Preservation, and Amplification
Thomas Knebelsberger and Isabella Stöger

Abstract
The effectiveness of DNA barcoding as a routine practice in biodiversity research is strongly dependent on the quality of the source material, DNA extraction method, and selection of adequate primers in combination with optimized polymerase chain reaction (PCR) conditions. For the isolation of nucleic acids, silica-gel membrane methods are to be favored because they are easy to handle, applicable for high sample throughput, relatively inexpensive, and provide high DNA quality, quantity, and purity, which are prerequisites for successful PCR amplification and long-term storage of nucleic acids in biorepositories, such as DNA banks. In this section, standard protocols and workflow schemes for sample preparation, DNA isolation, DNA storage, PCR amplification, PCR product quality control, and PCR product cleanup are proposed and described in detail. A PCR troubleshooting and primer design section may help to solve problems that hinder successful amplification of the desired barcoding gene region.

Key words: DNA barcoding, DNA extraction, DNA preservation, PCR amplification, Agarose gel electrophoresis, PCR cleanup

1. Introduction
The extraction of genomic DNA requires careful sample preparation, followed by tissue lysis and isolation of the nucleic acids. The lysis of the tissue samples is performed by applying enzymatic digestion, commonly with Proteinase K, which degrades proteins and rapidly inactivates nucleases that might otherwise degrade DNA during isolation and purification. After digestion, nucleic acids are separated from all other remaining cellular components. The condition of the biological source material plays a pivotal role for the quality, quantity, and purity of the extracted DNA. Therefore, appropriate tissue or sample storage after collecting the biological source material in the field is required. Besides the use of recently collected and preserved organisms, museum

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_14, © Springer Science+Business Media, LLC 2012


specimens represent a convenient source for species-wide samplings. But nucleic acids might be highly degraded due to either extensive exposure to killing agents, like acetate, ethyl alcohol, or cyanide (1–3), or sample storage under inappropriate conditions; fixatives, like formaldehyde or other aldehyde mixtures often used in museums to preserve biological material, degrade DNA, which makes the extraction of utilizable nucleic acids quite challenging (4–8). For routine DNA extraction, the methods based on silica-gel membrane technology have proven to yield DNA of high quantity and quality (9). In comparison to other extraction technologies, these methods are easy to handle, relatively inexpensive, and allow high sample throughput (96-well format). After enzymatic digestion of the samples, nucleic acids are adsorbed to a silica-gel membrane in the presence of highly concentrated chaotropic salts. Fragment lengths up to 20 kb can be recovered in high purity usable for downstream applications as well as for long-term storage. Other DNA extraction technologies, like salting out precipitation or anion exchange methods, yield DNA fragment lengths up to 150 kb but are very expensive and time consuming. Fast and easy DNA extractions can be performed by inhibitor binding by sorbent technology or by the use of chelating resin, but both methods deliver DNA of poor quality and further applications might be hindered due to compounds still present in the DNA solution. Even though for DNA barcoding standardized DNA extraction protocols can be used for a broad range of taxa, some groups are still left problematic. Especially for taxa containing high quantities of polysaccharides, mucopolysaccharides, polyphenols, resins, or other secondary metabolites, substances known for binding firmly to nucleic acids during DNA extraction procedure and/or interfering with subsequent reactions, specialized protocols were suggested (10–15). 
DNA isolation methods for small taxa, for example nematodes (16), tardigrades (17, 18), copepods (19), collembolans (20), or mites (21–23), as well as DNA isolation methods for fungi (24–26) or plants (12, 27–32), may help to recover DNA of sufficient quantity and quality. Besides the extraction of DNA, high-quality long-term storage of the DNA samples, in order to conserve the genetic resources, verify the already existing results, or conduct further analyses, is challenging. Although millions of DNA samples are currently being processed in biological research and DNA and tissue banks have been founded all around the world, the process of DNA degradation during storage has barely been investigated and only a few studies dealing with this subject are available (e.g., 33–38). Currently, there is no consensus on the optimal DNA storage conditions, but several commercial products are already available which allow storage of small amounts of dehydrated DNA at room temperature. These products are based on the natural principle of anhydrobiosis, which can be found in tardigrades, using a mixture of dissolvable


compounds, e.g., trehalose, that stabilize DNA for storage at room temperature (35). For the long-term storage of larger amounts of DNA, a combination of appropriate preserving agents and low storage temperatures (−20°C or lower) is needed to minimize the loss of DNA quality during storage.
In DNA barcoding, DNA extracts are used for amplification of a specific predefined gene region by the polymerase chain reaction (PCR), which utilizes short, user-defined DNA sequences called oligonucleotide primers. In the first step, the DNA double helix is denatured by heating into single-stranded template DNA, to which the primers can then bind. The thermostable enzyme DNA polymerase extends the primers by adding single deoxynucleotide triphosphates (dNTPs), producing new double-stranded DNA. This process is performed in a thermocycler and has to be repeated over multiple cycles to increase the number of the target fragments exponentially. The quality of the PCR products is commonly checked by agarose gel electrophoresis. Before sequencing, PCR products have to be purified to eliminate the remaining PCR ingredients. The use of suitable primers is essential for amplification success. Barcoding primers should correspond to rather conserved sites with low substitution rates so that they can be applied to a broad range of taxa. Such “universal” primers amplifying an approximately 650-bp-long fragment of the mitochondrial cytochrome c oxidase subunit I (COI) gene were first defined by Folmer et al. (39) and then suggested as barcoding primers for the whole animal kingdom (40). In the course of time, it turned out that these primers are not applicable for all animal taxa, and more and more group-specific ones were additionally suggested. In the case of sponges (www.spongebarcoding.org), plants (41), or fungi (42), new or additional barcoding regions were even defined to obtain resolution at the species level.
Although a tremendous variety of primers is now available, the design of new primers or the adjustment of existing primers remains necessary for successful and effective DNA barcoding.
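The exponential increase of target fragments during PCR mentioned above can be made concrete with the standard idealized model, in which each cycle multiplies the number of copies by (1 + efficiency). The sketch below is a simplification under that assumption: it ignores the plateau phase and reagent depletion, and the function name `pcr_copies` and the example values are illustrative only.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized number of target copies after a given number of PCR cycles.

    efficiency = 1.0 means perfect doubling in every cycle; real reactions
    are less efficient and level off in late cycles.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must not be negative")
    return initial_copies * (1.0 + efficiency) ** cycles


# One template molecule, 30 cycles, perfect doubling: 2**30 copies.
ideal = pcr_copies(1, 30)
# The same reaction at 90% efficiency per cycle yields far fewer copies.
realistic = pcr_copies(1, 30, efficiency=0.9)
```

The comparison shows why small losses in per-cycle efficiency, for example from poorly matching primers, have a large effect on the final yield.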

2. Materials
2.1. DNA Extraction

1. DNeasy Blood and Tissue Kit single columns and DNeasy 96 Blood and Tissue Kit (Qiagen): Buffers ATL, AL, AW1, AW2, AE, and Proteinase K are included in the kit.
2. NucleoSpin® Tissue Kit single columns and NucleoSpin® 96 Tissue Kit (Macherey-Nagel): Buffers T1, B1, B2, BW, B5, BE, PB, BQ1, and Proteinase K are included in the kit.
3. NucleoSpin® Plant II Kit single columns and NucleoSpin® 96 Plant II Kit (Macherey-Nagel): Buffers PL1, PC, PW1, PW2, PE, and RNase A are included in the kit (see Note 1).


4. CTAB buffer: 100 ml of 1 M Tris–HCl (pH 8.0), 280 ml of 5 M NaCl, 40 ml of 0.5 M ethylenediaminetetraacetic acid (EDTA), 20 g cetyltrimethylammonium bromide (CTAB).
5. TE buffer: 10 mM Tris–HCl, 1 mM EDTA, pH 8.0.
6. Chloroform–isoamyl alcohol: Chloroform:isoamyl alcohol, 24:1 (available from various suppliers).

2.2. DNA Preservation

1. 2 M trehalose stock solution: Dissolve 7.6 g trehalose (D-(+)-trehalose dihydrate, Sigma Aldrich) in 10 ml molecular-grade water.
2. Qiasafe tubes and plates (Qiagen).
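The 7.6 g quoted for the 2 M trehalose stock can be checked with a short calculation. The sketch assumes a molar mass of 378.33 g/mol for trehalose dihydrate and treats the 10 ml as the final volume of the solution; both are assumptions made for this check, and the function name `grams_needed` is illustrative.

```python
TREHALOSE_DIHYDRATE_G_PER_MOL = 378.33  # assumed molar mass of the dihydrate


def grams_needed(molarity: float, volume_ml: float, molar_mass: float) -> float:
    """Mass of solute (g) for a solution of given molarity and final volume."""
    return molarity * (volume_ml / 1000.0) * molar_mass


# 2 mol/l x 0.010 l x 378.33 g/mol, which rounds to the 7.6 g in the recipe.
trehalose_g = grams_needed(2.0, 10.0, TREHALOSE_DIHYDRATE_G_PER_MOL)
```

The same function can be used to scale the recipe, for example to prepare 50 ml of stock instead of 10 ml.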

2.3. DNA Amplification

1× TBE buffer: TBE is used as running buffer and for dissolving the agarose powder. To prepare the buffer, 10× TBE can be ordered (Rotiphorese 10× TBE buffer, Roth) and diluted to a 1× TBE buffer solution. Alternatively, prepare 1 l of 10× TBE buffer by mixing 108 g Tris, 55 g boric acid, and 7.4 g Na-EDTA in a beaker together with 500 ml demineralized water and heating for 20 min at 60°C (use a magnetic stir bar). Filter the buffer, transfer the solution to a bottle, and fill up to 1 l.

2.4. PCR Cleanup

1. NucleoSpin® Extract II Kit (Macherey-Nagel): Buffers NT3, NT, and NE are included in the kit.
2. Ethanol precipitation: Ethanol 100% and 70%, 3 M sodium acetate.
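For the TBE buffer of Subheading 2.3, the dilution from 10× stock to 1× working solution and the approximate molarities of the 10× recipe can be worked out as follows. The molar masses used (121.14 g/mol for Tris base and 61.83 g/mol for boric acid) are assumptions of this sketch; the EDTA component is left out because the text does not specify which salt is meant. Function names are illustrative.

```python
TRIS_BASE_G_PER_MOL = 121.14   # assumed molar mass of Tris base
BORIC_ACID_G_PER_MOL = 61.83   # assumed molar mass of boric acid


def stock_volume_ml(final_volume_ml: float, stock_x: float = 10.0,
                    final_x: float = 1.0) -> float:
    """Volume of concentrated stock needed (C1 * V1 = C2 * V2)."""
    return final_volume_ml * final_x / stock_x


def molar_concentration(grams: float, molar_mass: float, volume_l: float) -> float:
    """Molarity (mol/l) of a solute weighed out in grams."""
    return grams / molar_mass / volume_l


# 1 l of 1x running buffer needs 100 ml of 10x stock plus 900 ml of water.
stock_ml = stock_volume_ml(1000.0)
water_ml = 1000.0 - stock_ml

# Approximate molarities in 1 l of the 10x recipe (108 g Tris, 55 g boric acid).
tris_molar = molar_concentration(108.0, TRIS_BASE_G_PER_MOL, 1.0)
boric_molar = molar_concentration(55.0, BORIC_ACID_G_PER_MOL, 1.0)
```

Both values come out close to 0.89 mol/l, so the 1× working solution contains roughly 89 mM Tris and 89 mM boric acid.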

3. Methods
3.1. DNA Extraction: Source Material

Fresh material, if not immediately used for DNA extraction, should be directly frozen (at −20°C or lower) or fixed and preserved in 96% pure ethanol (large specimens can be subsampled) and stored at −20°C or lower. Preserving agents, like DESS (43) or RNAlater (RNAlater RNA Stabilization Reagent, Qiagen), have also been proven to prevent DNA degradation during sample storage (see Note 2). Vascular plants, algae, and fungi are better dried rapidly on silica gel and then stored in a dry, cool, and dark place. Old material from museum collections exhibits dramatic DNA degradation (Fig. 1). Whenever possible, use recent material for DNA extraction.

3.2. DNA Extraction: Sample Preparation

1. Decontaminate workbench from any DNA using agents, like DNA Exitus Plus (BioChem) (see Note 3).
2. Wear new gloves.
3. Decontaminate all other tools (forceps, scalpels, etc.) from any DNA by flame (Bunsen burner) between processing each sample.


Fig. 1. DNA extracts of Phalera bucephala (Lepidoptera) performed with Qiagen Blood and Tissue Kit on agarose gel. Dried specimens were taken from a museum collection and had been collected in 2007 (1), 1971 (2), and 1935 (3). Number (1) contains DNA fragments up to 20 kb, whereas numbers (2) and (3) show dramatic DNA degradation with fragments between 100 and 300 bp.

4. Use 1.5-ml microcentrifuge tubes (e.g., Eppendorf) for lysis (not included in commercially available kits); for plate extractions (96-well format), special lysis plates (deep-well plates) are provided (see Note 4).
5. Transfer a predefined amount of tissue (see DNA extraction protocols) into lysis tubes or deep-well plates (see Note 5).
6. Air dry EtOH-preserved tissue until the alcohol has evaporated completely by placing the tubes with open caps in a Thermomixer at 40°C. Dried, fresh, or frozen material can be used directly for DNA extraction (see Note 6).
7. Clean workbench again before the extraction procedure and wear new gloves.
8. Use filter tips and change pipette tips between each reagent. Do not touch the surface with the tip. Always keep the pipette in an upright position.

3.3. DNA Extraction: Protocol Overview

For DNA isolation from animal and fungal samples, DNeasy Blood and Tissue Kit (Qiagen) (Subheadings 3.4 and 3.5) or NucleoSpin® Tissue Kit (Macherey-Nagel) (Subheadings 3.6 and 3.7) is recommended. The latter might be preferred especially for DNA extractions of arthropod samples. For plant samples, NucleoSpin® Plant II Kit (Macherey-Nagel) (Subheadings 3.8 and 3.9) achieves optimal DNA yields. All these kits are available in convenient single preparation as well as in 96-well plate format for high-throughput extractions. Alternatively, in case of mollusc, fungal, and algal taxa containing high amounts of mucopolysaccharides, better results might be achieved using the proposed CTAB protocol (Subheading 3.10) (see Note 7).
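The protocols in this chapter specify centrifugation steps as relative centrifugal force (× g), whereas many benchtop centrifuges are set in revolutions per minute. The two are related by the standard formula RCF = 1.118 × 10^-5 × r × rpm^2, with the rotor radius r in centimeters. The sketch below applies this formula; the 8-cm radius is a hypothetical value and must be replaced by the radius of the rotor actually used.

```python
import math

RCF_CONSTANT = 1.118e-5  # for rotor radius in cm and speed in rpm


def rcf_from_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (x g) at a given rotor speed."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf: float, radius_cm: float) -> float:
    """Rotor speed (rpm) needed to reach a given x g value."""
    if rcf < 0 or radius_cm <= 0:
        raise ValueError("rcf must be >= 0 and radius_cm must be > 0")
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


# Hypothetical rotor with an 8-cm radius: speed needed for a 6,000 x g step.
EXAMPLE_RADIUS_CM = 8.0
speed = rpm_from_rcf(6000.0, EXAMPLE_RADIUS_CM)
```

Because the force grows with the square of the speed, doubling the rpm quadruples the × g value, so settings should be recalculated whenever a different rotor is used.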

3.4. DNA Extraction: Protocol 1

DNA isolation from animal and fungal tissue with DNeasy Blood and Tissue Kit (Qiagen) single columns (see Note 8).
1. Adjust water bath or Thermomixer to 56°C (see Note 9). Add ethanol (96–100%; not provided with the kit) to Buffers AW1 and AW2 as indicated on the bottles.
2. Place up to 25 mg of tissue in a 1.5-ml microcentrifuge tube (not provided).
3. Add 180 μl Buffer ATL and 20 μl Proteinase K and mix by inverting or vortexing. Briefly centrifuge samples at 3,000 × g. Incubate samples at 56°C for a few hours or overnight until tissue is lysed (see Note 10).
4. Vortex for 15 s. Briefly centrifuge samples at 3,000 × g. Add 200 μl Buffer AL to the lysate.
5. Immediately add 200 μl ethanol (96–100%), and mix by vortexing. Briefly centrifuge samples at 3,000 × g.
6. Transfer the mixture from step 5 into the DNeasy Mini spin column placed in a 2-ml collection tube (provided). Centrifuge at 6,000 × g for 1 min. Discard flow through and collection tube.
7. Place the DNeasy Mini spin column in a new 2-ml collection tube. Add 500 μl Buffer AW1 and centrifuge for 1 min at 6,000 × g. Discard flow through and collection tube.
8. Place the DNeasy Mini spin column in a new 2-ml collection tube. Add 500 μl Buffer AW2 and centrifuge for 3 min at 20,000 × g. Discard flow through and collection tube.


9. Place the DNeasy Mini spin column in a clean 1.5-ml microcentrifuge tube (not provided) and add 100 μl Buffer AE directly onto the DNeasy membrane (see Note 11). Incubate at room temperature for 1 min, and then centrifuge for 1 min at 6,000 × g to elute DNA.
10. Repeat step 9 with the same microcentrifuge tube. Alternatively, a new microcentrifuge tube can be used for the second elution step to prevent dilution of the first eluate.
11. DNA should be immediately stored at −20°C.

3.5. DNA Extraction: Protocol 2

DNA isolation from animal and fungal tissue with DNeasy 96 Blood and Tissue Kit (Qiagen) (see Note 8).
1. Adjust water bath or Thermomixer to 56°C (see Note 9). Add ethanol (96–100%; not provided with the kit) to Buffers AL, AW1, and AW2 as indicated. If multichannel pipettes are used (recommended), sterilized reservoirs are required.
2. Place up to 20 mg of tissue in each collection microtube (96-well format, provided).
3. Add 180 μl Buffer ATL and 20 μl Proteinase K to each sample. Seal the collection microtubes properly with the provided cap strips. Mix by inversion. Briefly centrifuge up to 1,500 × g. Incubate samples at 56°C for a few hours or overnight until tissue is lysed (see Note 10).
4. Ensure that the microtubes are still properly sealed and mix by inversion for 15 s. Briefly centrifuge racks at 1,500 × g. Carefully remove the caps. Add 410 μl Buffer AL–ethanol to each sample. Seal collection microtubes with new cap strips and mix by inversion for 15 s.
5. Briefly centrifuge racks up to 1,500 × g. Place two DNeasy 96 plates on top of S-Blocks (provided).
6. Remove the first cap strip from the collection microtubes and carefully transfer the lysate to the DNeasy 96 plates. Continue with the next eight samples and so on until all samples are transferred. Seal DNeasy 96 plates with AirPore Tape Sheets (provided). Centrifuge for 10 min at 6,000 × g.
7. Remove the tape, and check that all of the lysate has passed through the membrane in each well of the DNeasy 96 plates. If lysate remains in any of the wells, centrifuge for a further 10 min.
8. Remove the tape and carefully add 500 μl Buffer AW1 to each well. Seal each DNeasy 96 plate with a new AirPore Tape Sheet. Centrifuge for 5 min at 6,000 × g.
9. Remove the tape. Carefully add 500 μl Buffer AW2 to each well. Centrifuge for 15 min at 6,000 × g. Do not seal the plate with an AirPore Tape Sheet in this step to allow evaporation of residual ethanol.
10. Place each DNeasy 96 plate in the correct orientation on a rack of Elution Microtubes RS (provided).
11. To elute the DNA, add 150 μl Buffer AE to each sample (see Note 11), and seal the DNeasy 96 plate with a new AirPore Tape Sheet. Incubate for 1 min at room temperature (15–25°C). Centrifuge for 2 min at 6,000 × g.
12. Repeat step 11 with another 150 μl Buffer AE. Use appropriate cap strips (provided) to seal the Elution Microtubes RS for storage.
13. DNA should be immediately stored at −20°C.

3.6. DNA Extraction: Protocol 3

DNA isolation from animal and fungal tissue with NucleoSpin® Tissue Kit (Macherey-Nagel) single columns (see Note 8).
1. Prepare Buffer B3 by transferring Buffer B1 into Buffer B2 (see Note 12). Dissolve Proteinase K (lyophilized) by adding the volume of Proteinase Buffer (PB) that is indicated on the Proteinase K label (see Note 13). Add the appropriate volume (see label on Buffer B5 bottle) of ethanol (96–100%; not provided with the kit) to Buffer B5 before use. Adjust water bath or Thermomixer to 56°C (see Note 9). After tissue lysis, prepare a 70°C water bath and warm Buffer BE (elution buffer) to 70°C before use.
2. Place up to 25 mg of tissue in a 1.5-ml microcentrifuge tube (not provided).
3. Add 180 µl of Buffer T1 and 25 µl of Proteinase K to each sample and mix by inverting or vortexing. Briefly centrifuge samples at 3,000 × g. Incubate samples at 56°C for a few hours or overnight until tissue is lysed (see Note 10).
4. Vortex for 15 s. Briefly centrifuge samples at 3,000 × g. Add 200 µl of Buffer B3. Vortex and incubate at 70°C for 10 min.
5. Briefly centrifuge samples at 3,000 × g. Add 210 µl of 96–100% ethanol and vortex immediately.
6. Briefly centrifuge samples at 3,000 × g. Transfer the mixture from step 5 into the NucleoSpin column placed in a 2-ml collection tube (provided). Centrifuge at 11,000 × g for 1 min. Discard flow through.
7. Add 500 µl of Buffer BW to the spin column. Centrifuge at 11,000 × g for 1 min. Discard flow through.
8. Add 600 µl of Buffer B5 to the spin column. Centrifuge at 11,000 × g for 1 min. Discard the flow through. Centrifuge again at 11,000 × g for 1 min to remove the residual Buffer B5.
9. Place the NucleoSpin column in a clean 1.5-ml microcentrifuge tube (not provided) and pipette 100 µl of Buffer BE (warmed to 70°C) onto the NucleoSpin membrane (do not touch the membrane) (see Note 11). Incubate at room temperature for 1 min, and then centrifuge at 11,000 × g for 1 min to elute DNA.
10. Repeat step 9 with the same microcentrifuge tube. A new microcentrifuge tube can be used for the second elution step to prevent dilution of the first eluate.
11. DNA should be immediately stored at −20°C.

3.7. DNA Extraction: Protocol 4

DNA isolation from animal and fungal tissue with NucleoSpin® 96 Tissue Kit (Macherey-Nagel) (see Note 8).
1. Dissolve Proteinase K (lyophilized) by adding the volume of Proteinase Buffer (PB) that is indicated on the Proteinase K label and store at −20°C (see Note 13). Add the appropriate volume (see label on Buffer B5 bottle) of ethanol (96–100%; not provided with the kit) to Buffer B5 before use. Adjust water bath or Thermomixer to 56°C (see Note 9). After tissue lysis, preheat incubator to 70°C and warm Buffer BE (elution buffer) to 70°C before use.
2. Place up to 20 mg of tissue into each well of a Round-well Block (provided).
3. Add 180 µl Buffer T1 and 25 µl of Proteinase K to each sample. Seal wells properly with cap strips and mix by inverting for 10–15 s. Spin briefly for 15 s at 1,500 × g. Incubate samples at 56°C for a few hours or overnight until tissue is lysed (see Note 10).
4. Ensure that the microtubes are still properly sealed and centrifuge the Round-well Block for 15 s at 1,500 × g. Carefully remove cap strips. Add 200 µl of Buffer BQ1 and 200 µl of 96–100% ethanol to each sample. Close the wells with new cap strips and mix by inversion for 10–15 s. Centrifuge racks for 10 s at 1,500 × g.
5. Place each NucleoSpin plate onto an MN Square-well Block (provided).
6. Remove the first cap strip from the first eight wells and carefully transfer the lysate to the NucleoSpin plates. Continue with the next eight samples and so on until all samples are transferred. Seal each NucleoSpin plate with adhesive PE foil (provided). Centrifuge for 10 min at 5,600 × g.
7. Remove foil, and check that all of the lysate has passed through the membrane in each well of the NucleoSpin plates. If lysate remains in any of the wells, centrifuge for a further 10 min.
8. Carefully add 500 µl of Buffer BW to each well. Seal each plate with a new adhesive PE foil. Centrifuge for 2 min at 5,600 × g.
9. Remove adhesive PE foil. Carefully add 700 µl of Buffer B5 to each well. Seal the plate with a new adhesive PE foil. Centrifuge for 4 min at 5,600 × g.
10. Remove adhesive PE foil. Place NucleoSpin plates on an opened rack with tube strips and incubate for 10 min at 70°C in an incubator to evaporate residual ethanol.
11. To elute DNA, dispense 100 µl of prewarmed (70°C) Buffer BE (elution buffer) to each well directly onto the membrane of the NucleoSpin plates (see Note 11). Incubate at room temperature for 1 min. Centrifuge for 2 min at 5,600 × g.
12. Repeat step 11 with another 100 µl Buffer BE.
13. Remove the NucleoSpin plate and seal tube strips. DNA should be immediately stored at −20°C.

3.8. DNA Extraction: Protocol 5

DNA isolation from plant tissue with NucleoSpin® Plant II Kit (Macherey-Nagel) single columns (see Note 8).
1. Add the appropriate volume (see label on Buffer PW2 bottle) of ethanol (96–100%; not provided with the kit) to Buffer PW2 before use. Dissolve RNase A (lyophilized) by adding the volume of molecular water that is indicated on the RNase A label and store at −20°C (see Note 14). Adjust water bath or Thermomixer to 65°C (see Note 9). After tissue lysis, preheat incubator to 70°C and warm Buffer PE (elution buffer) to 70°C before use.
2. Homogenize 50 mg (up to 100 mg) wet-weight or 10 mg (up to 20 mg) dry-weight (lyophilized) plant material (see Note 15).
3. Transfer the resulting powder to a new 1.5-ml microcentrifuge tube (not provided) and add 400 µl Buffer PL1. Vortex the mixture thoroughly. Add 10 µl RNase A solution and mix sample thoroughly. Incubate the suspension for 10 min at 65°C.
4. Briefly centrifuge samples at 3,000 × g. Place a NucleoSpin® Filter column (violet ring) into a 2-ml collection tube and load the lysate onto the column. Centrifuge for 2 min at 11,000 × g, collect the clear flow through (see Note 16), and discard the NucleoSpin® Filter.
5. Add 450 µl Buffer PC and mix thoroughly by vortexing.
6. Briefly centrifuge samples at 3,000 × g. Place a NucleoSpin® Plant II Column (green ring) into a new 2-ml collection tube and load a maximum of 700 µl of the sample. Centrifuge for 1 min at 11,000 × g and discard flow through (see Note 17).
7. Add 400 µl Buffer PW1 to the NucleoSpin® Plant II Column. Centrifuge for 1 min at 11,000 × g and discard flow through.
8. Add 700 µl Buffer PW2 to the NucleoSpin® Plant II Column. Centrifuge for 1 min at 11,000 × g and discard flow through.
9. Add another 200 µl Buffer PW2 to the NucleoSpin® Plant II Column. Centrifuge for 2 min at 11,000 × g in order to remove wash buffer and dry the silica membrane completely.
10. Place the NucleoSpin® Plant II Column into a new 1.5-ml microcentrifuge tube (not provided). Pipette 50 µl Buffer PE (preheated to 70°C) onto the membrane. Incubate the NucleoSpin® Plant II Column for 5 min at 70°C. Centrifuge for 1 min at 11,000 × g to elute the DNA.
11. Repeat step 10 with another 50 µl Buffer PE (preheated to 70°C) and elute into the same tube.
12. DNA should be immediately stored at −20°C.

3.9. DNA Extraction: Protocol 6

DNA isolation from plant tissue with NucleoSpin® 96 Plant II Kit (Macherey-Nagel) (see Note 8).
1. Add the appropriate volume (see label on Buffer PW2 bottle) of ethanol (96–100%; not provided with the kit) to Buffer PW2 before use. Dissolve RNase A (lyophilized) by adding the volume of molecular water that is indicated on the RNase A label and store at −20°C (see Note 14). Adjust water bath or Thermomixer to 65°C (see Note 9). After tissue lysis, preheat incubator to 70°C and warm Buffer PE (elution buffer) to 70°C before use.
2. Homogenize 50 mg (up to 100 mg) wet-weight or 10 mg (up to 20 mg) dry-weight (lyophilized) plant material (see Note 15) in each tube of the tube strips.
3. Add 500 µl Buffer PL1 and 10 µl RNase A to each sample. Close tubes using cap strips (provided). Mix vigorously by shaking for 15–30 s. Centrifuge briefly for 30 s at 1,500 × g. Incubate samples at 65°C for 30 min.
4. Centrifuge the samples for 20 min at 5,600 × g. Remove cap strips.
5. Predispense 450 µl Binding Buffer PC to each well of an MN Square-well Block. Add 400 µl cleared lysate of each sample and mix by repeated pipetting up and down. Mix at least three times.
6. Place NucleoSpin® Plant II Binding Plate on an MN Square-well Block. Transfer samples from the previous step into the wells of the NucleoSpin® Plant II Binding Plate. Do not moisten the rims of the individual wells while dispensing the samples.
7. Place the NucleoSpin® Plant II Binding Plate stacked on an MN Square-well Block in the rotor buckets. Centrifuge at 5,600 × g for 5 min.
8. Add 400 µl PW1 to each well of the NucleoSpin® Plant II Binding Plate. Optional: Seal plate with a gas-permeable foil. Centrifuge again at 5,600 × g for 2 min. Place NucleoSpin® Plant II Binding Plate on a new MN Square-well Block.
9. Add 700 µl PW2 to each well of the NucleoSpin® Plant II Binding Plate. Optional: Seal plate with a gas-permeable foil. Centrifuge again at 5,600 × g for 2 min.
10. Add 700 µl PW2 to each well of the NucleoSpin® Plant II Binding Plate. Optional: Seal plate with a gas-permeable foil. Centrifuge again at 5,600 × g for 10 min for complete removal of residual Buffer PW2.
11. Place NucleoSpin® Plant II Binding Plate on the rack with tube strips. Dispense 100 µl Buffer PE (preheated to 70°C) directly onto the membrane of each well of the NucleoSpin® Plant II Binding Plate. Incubate at room temperature for 2 min. Centrifuge at 5,600 × g for 2 min.
12. Repeat step 11 with another 100 µl Buffer PE (preheated to 70°C) and elute into the same rack with tube strips.
13. DNA should be immediately stored at −20°C.

3.10. DNA Extraction: Protocol 7

DNA isolation from tissue containing high amounts of mucopolysaccharides with the CTAB method.
1. Adjust water bath or Thermomixer to 55°C (see Note 9). Mark and precool (at 4°C) 1.5-ml microcentrifuge tubes containing 25 µl 3 M ammonium acetate + 600 µl 70% ethanol per sample. Mark another two sets of tubes according to the number of samples. Precool 70% ethanol.
2. Place each tissue sample in one set of marked tubes. Grind the sample (5–20 mg tissue) if necessary, or let the residual ethanol evaporate.
3. Add 300 µl CTAB buffer and 0.6 µl β-mercaptoethanol per sample (see Note 18). Perform this step under a fume hood.
4. Add 10 µl Proteinase K (20 mg/ml, Qiagen) to each sample and mix carefully.
5. Incubate samples at 55°C for a few hours or overnight until tissue is lysed (see Note 10).
6. Add 300 µl chloroform-isoamyl-alcohol (24:1) and mix well by shaking tubes for 2 min. Perform this step under a fume hood. Proteins are precipitated now.
7. Centrifuge for 10 min at 11,000 × g. Pipette the supernatant into another set of clean tubes.
8. Add another 300 µl chloroform-isoamyl-alcohol (24:1) to the new set of tubes including the supernatant from step 7 and mix well by shaking tubes for 2 min. Perform this step under a fume hood. Remaining proteins are precipitated now.
9. Centrifuge for 10 min at 12,000 × g. Pipette the supernatant to the set of clean tubes containing cold 25 µl 3 M ammonium acetate + 600 µl 70% ethanol. The supernatant includes DNA which is precipitated in this step.
10. Centrifuge for 10 min at 12,000 × g. Pour or pipette off the liquid, being careful not to touch or lose the DNA pellet (see Note 19).
11. Add 250 µl cold 70% ethanol and mix to wash the DNA pellet.
12. Centrifuge for 10 min at 12,000 × g.
13. Pour or pipette off the liquid. Dry pellet in the incubator for 5–10 min (at 60°C).
14. Dissolve the pellet in 50 µl TE buffer.
15. DNA should be immediately stored at −20°C.

3.11. DNA Preservation: Protocol 1

As in the DNA extraction protocols, use only buffers to elute or dissolve DNA (see Note 20). For storage, DNA isolates should be divided into two (or more) aliquots. One aliquot serves as backup for long-term storage at −80°C. The other aliquot(s) can be kept as working solution at −20°C to be used for PCR amplification (see Note 21). The best method to preserve high DNA quality of the backup aliquot is the use of QIAsafe DNA Tubes (Qiagen), which contain a mixture of dissolvable compounds that stabilize DNA (see Note 22). For sample storage, proceed according to the following steps.
1. Pipette up to 50 µl of the DNA solution (not more than 30 µg DNA) on the colored DNA-protecting matrix of the QIAsafe DNA Tubes.
2. Dry samples with a vacuum concentrator (1 h at 55°C for 20 µl DNA solution) or under a laminar flow hood at room temperature (12 h for 20 µl of DNA solution).
3. Seal completely dried samples and preferably store QIAsafe DNA Tubes at −80°C (see Note 23).
4. To recover DNA, dissolve the pellet of dried protection matrix including DNA with an appropriate volume of molecular water. The DNA solution can immediately be used for PCR or other applications.

3.12. DNA Preservation: Protocol 2

QIAsafe DNA Tubes are relatively expensive. We therefore suggest an alternative method to store DNA using the preserving agent trehalose.
1. Transfer 90 µl DNA extract to sterile and temperature-stable storage tubes (e.g., Rotilabo®-microcentrifuge tubes, Roth).
2. Add 10 µl of trehalose stock solution (2 M) to obtain a final concentration of 200 mM in the DNA sample (see Note 24).
3. Dry samples with a vacuum concentrator at 55°C (this may take a few hours).
4. Store sealed samples at −80°C (see Note 25).
5. To recover DNA, dissolve the pellet with an appropriate volume of molecular water. The DNA solution can immediately be used for PCR or other applications.

3.13. DNA Amplification: PCR Ingredients

1. DNA Polymerase: Recombinant Taq DNA Polymerase (e.g., Qiagen) is commonly used for standard PCR. It is a thermostable enzyme of the thermophilic bacterium Thermus aquaticus and is, therefore, able to synthesize DNA at high temperatures (see Note 26). Usually, 0.025 U of Taq DNA Polymerase are used per µl of the PCR reaction. Hot Start Polymerases can be used to prevent the amplification of unspecific PCR products before the PCR program is started. They are inactive at lower temperatures and are activated during the first heating step of the PCR program.
2. PCR buffer: For optimal DNA Polymerase activity, PCR buffers are used containing Tris–HCl, KCl, and, optionally, MgCl2. Buffers are provided by the supplier together with Taq Polymerase. It is important to use Polymerase and PCR buffer from the same manufacturer.
3. Oligonucleotide primers: PCR primers are short, single-stranded DNA fragments (usually, 20–30 nucleotides). PCR requires one forward and one reverse primer to define the target fragment of the DNA. Primers are usually delivered desalted. Resuspend primers in molecular water to a stock concentration of 100 pmol/µl; prepare aliquots of working solutions with a concentration of 10 pmol/µl ready to use for PCR. There should be a surplus of primers in the reaction mix, but too much of them may lead to unspecific reactions. For standard PCR, 0.5 µl of each primer working solution (10 pmol/µl) is enough. For primer design, see Note 27.
4. Deoxynucleotide triphosphates (dNTPs): dNTPs (dATP, dTTP, dGTP, and dCTP) are the building blocks that the DNA Polymerase incorporates into the new strand synthesized along the template. They are available as single ingredients or as a dNTP mix (e.g., Fermentas). There should always be a slight surplus of dNTPs in the reaction mix. For PCR, a working solution of 2 mM dNTPs (which means 2 mM of each type of nucleotide!) is applicable. In case of the nucleotide premix (all nucleotides in a total concentration of 10 mM), the solution just has to be diluted to the desired concentration. In case of single nucleotides, add 20 µl (100 mM stocks) of each dNTP to 920 µl molecular water.
5. Additives: Additives, like MgCl2, trehalose, DMSO, Q-solution, etc., can enhance PCR efficiency (see Note 28). Use additives only if standard protocols do not work. Too high concentrations of MgCl2, for instance, increase the amount of unspecific products due to unspecific amplification by the Polymerase.
6. Molecular water: Use only ultrapure and nuclease-free water for PCR. Water is used to fill the mix of ingredients up to the desired volume, which is normally 10–25 µl.
7. Template DNA: This is the original genomic DNA material. Use 1–2 µl DNA solution (obtained from extraction) with a concentration between 20 and 100 ng/µl. Usually, PCR also works well with lower concentrations (below 2 ng/µl) (see Note 29).

3.14. DNA Amplification: Principle Steps

PCR amplification is carried out in a Thermocycler using a specific temperature profile. It involves initial denaturation of the template DNA, followed by a specific number of cycles, including denaturation, primer annealing, and elongation steps. The program is finished by an extended final elongation step (see Note 30).
1. Initial denaturation: Melting of double-stranded DNA into two single-stranded templates by disrupting the hydrogen bonds between complementary nucleotides. This step is usually performed at a temperature of 94°C for about 5 min. If the template DNA is GC rich, the interval should be extended up to 10 min. Heating the lid is recommended and normally an option of every Thermocycler.
2. Denaturation: Similar to the initial denaturation, this step leads to melting of the double-stranded DNA into single strands for primer annealing. Amplified DNA with high GC content needs increased denaturation time (3–4 min).
3. Annealing: In most cases, temperatures between 50 and 65°C allow successful annealing of primers to the single-stranded template DNA. Typically, the optimal annealing temperature (Ta) is 3–5°C below the melting temperature of the primers (Tm). Tm can be calculated by several computer programs (see primer design below). If nonspecific PCR products are produced in addition to the desired product, the temperature can be optimized by raising it stepwise in increments of 1–2°C.
4. Elongation: In this step, the DNA Polymerase synthesizes a new DNA strand complementary to the template strand by adding dNTPs. The optimal elongation temperature depends on the Polymerase itself, and the elongation time on the length of the desired fragment. In case of Taq DNA Polymerase, the highest synthesis rates are achieved at 70–75°C. For fragments up to 1,000 bp, optimal elongation time is between 1 and 2 min. For longer fragments, more elongation time is needed (vice versa for smaller fragments).
5. Number of cycles: Now, the steps 2–4 are repeated several times (cycles). The number of cycles depends on the amount of template DNA. If the initial DNA quantity is low, up to 40 cycles can be performed. For higher amounts of template, 30–35 cycles are sufficient.
6. Final elongation: After the last PCR cycle, a final elongation is performed to ensure that all remaining single DNA strands are fully extended. It is usually performed at 72°C for 5–10 min.
7. Cooling (optional): After the final elongation step, samples can remain in the Thermocycler if reactions are performed overnight. For cooling overnight, use a temperature of 15°C. This temperature neither damages PCR products nor strains the heating block too much. After amplification, the PCR products can be stored for a while in the fridge (4°C) until further processing.

3.15. DNA Amplification: PCR Performance

It is recommended to prepare a so-called master mix for all samples that should be processed. The master mix contains all required ingredients, except the template DNA (Subheadings 3.16 and 3.17, see also Note 31). Volumes of ingredients are calculated according to the number of processed samples including a positive and a negative control (see Note 32) plus about 5–10% more just to make sure that it is enough. (It is very annoying if there is no master mix left for the last few samples!) For PCR preparation, follow these steps:
1. Select samples for PCR. Calculate volumes of required ingredients according to the protocol.
2. Wear new gloves and use filter tips for pipetting steps to avoid contamination.
3. Thaw required ingredients and DNA samples. Vortex all ingredients, except DNA samples. Briefly centrifuge all ingredients and DNA samples before opening. Keep everything on ice.
4. Prepare master mix with all ingredients, except Taq DNA Polymerase. Start with water or buffer first. Use the PCR form as a checklist (make check marks). Protocols for master mix preparation are given below.
5. Transfer 1–2 µl of DNA template to 0.2-ml PCR reaction tubes (single tubes, 8-tube strips, or 96-well PCR plates, depending on the number of samples).
6. Take Taq DNA Polymerase out of the freezer and centrifuge briefly. Add Polymerase to master mix and pipette up and down carefully to mix (put Polymerase immediately back to the freezer!). Dispense master mix to PCR reaction tubes (see Note 33).
7. Cap PCR tubes properly (in case of plates, seal with foil) and mark for identification (use thermostable markers, e.g., Stabilo OH Pen).
8. Briefly centrifuge the PCR tubes, place them into the Thermocycler, and start the required program. After PCR, check products on agarose gel (see Subheading 3.20).

3.16. DNA Amplification: PCR Master Mix Protocol 1

Protocol for one sample using Taq DNA Polymerase (e.g., Qiagen) with a 25 µl PCR reaction volume; dispense 24 µl of the master mix to 1 µl of the template DNA (see Note 34):
1. Molecular-grade water: 15.875 µl.
2. 10× PCR buffer: 2.5 µl.
3. MgCl2: 2.0 µl.
4. dNTPs, 2 mM each: 2.5 µl.
5. Primer forward, 10 pmol/µl: 0.5 µl.
6. Primer reverse, 10 pmol/µl: 0.5 µl.
7. Taq Polymerase 5 U/µl: 0.125 µl.
8. DNA: 1.0 µl.
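The per-sample volumes above can be scaled to a whole batch as described in Subheading 3.15 (samples plus controls plus a 5–10% surplus). The following is a minimal Python sketch of that arithmetic, not part of the published protocol; the function name, the dictionary keys, and the default of two controls with a 10% surplus are illustrative assumptions.

```python
# Per-reaction volumes in microliters from Master Mix Protocol 1.
# Template DNA is excluded because it is pipetted into each tube separately.
PROTOCOL_1 = {
    "molecular-grade water": 15.875,
    "10x PCR buffer": 2.5,
    "MgCl2": 2.0,
    "dNTPs (2 mM each)": 2.5,
    "primer forward (10 pmol/ul)": 0.5,
    "primer reverse (10 pmol/ul)": 0.5,
    "Taq polymerase (5 U/ul)": 0.125,
}


def scale_master_mix(per_reaction, n_samples, n_controls=2, excess=0.10):
    """Return the total volume (ul) of each ingredient for a batch.

    n_controls and excess are assumed defaults (one positive and one
    negative control, 10% surplus); adjust them to local practice.
    """
    reactions = (n_samples + n_controls) * (1.0 + excess)
    return {name: round(vol * reactions, 3) for name, vol in per_reaction.items()}


# Volume dispensed into each tube on top of 1 ul template DNA.
per_tube = sum(PROTOCOL_1.values())

# Example: 22 samples + 2 controls, with 10% surplus (26.4 reaction equivalents).
mix = scale_master_mix(PROTOCOL_1, n_samples=22)
for name, volume in mix.items():
    print(f"{name}: {volume} ul")
```

Summing the per-reaction volumes gives the 24 µl that is dispensed onto each 1 µl of template, which is a quick consistency check before pipetting.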

3.17. DNA Amplification: PCR Master Mix Protocol 2

Protocol for one sample using Hot Start DNA Polymerase (e.g., Phire-Polymerase, New England BioLabs) with a 20 µl PCR reaction volume; dispense 19 µl of the master mix to 1 µl of the template DNA (DMSO is provided by the supplier of the Polymerase).
1. Molecular water: 11.2 µl.
2. 5× PCR buffer: 4.0 µl.
3. DMSO: 1.0 µl.
4. dNTPs, 10 mM each: 0.4 µl.
5. Primer forward, 10 pmol/µl: 0.5 µl.
6. Primer reverse, 10 pmol/µl: 0.5 µl.
7. Phire-Polymerase: 3.6 µl.
8. DNA: 1.0 µl.
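The stock concentrations and pipetted volumes in the two master mix protocols translate into final concentrations in the reaction through the usual dilution relation C_final = C_stock × V_added / V_total. The short Python sketch below works this out for the dNTPs, primers, and Taq Polymerase, using the nominal reaction volumes of 25 µl and 20 µl; the function name is an illustrative assumption.

```python
def final_concentration(stock_conc, added_volume_ul, reaction_volume_ul):
    """Dilution relation: C_final = C_stock * V_added / V_total.

    The result carries the same concentration unit as the stock.
    """
    return stock_conc * added_volume_ul / reaction_volume_ul


# Master Mix Protocol 1 (25 ul reaction volume)
dntp_p1 = final_concentration(2.0, 2.5, 25.0)      # mM of each dNTP
primer_p1 = final_concentration(10.0, 0.5, 25.0)   # pmol/ul, numerically equal to uM
taq_p1 = final_concentration(5.0, 0.125, 25.0)     # U/ul

# Master Mix Protocol 2 (20 ul reaction volume)
dntp_p2 = final_concentration(10.0, 0.4, 20.0)     # mM of each dNTP
primer_p2 = final_concentration(10.0, 0.5, 20.0)   # pmol/ul

print(dntp_p1, primer_p1, taq_p1, dntp_p2, primer_p2)
```

Under these assumptions both protocols end up with 0.2 mM of each dNTP, and the Taq value for Protocol 1 reproduces the 0.025 U per µl of reaction stated in Subheading 3.13.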

3.18. DNA Amplification: PCR Temperature Scheme Protocol 1

Standard temperature profile using Taq DNA Polymerase.
1. Initial step: 94°C, 2 min.
2. Denaturation: 94°C, 30 s.
3. Annealing: 50°C, 30 s.
4. Elongation: 72°C, 1 min.
5. Cycles: Repeat steps 2–4 35 times.
6. Final elongation: 72°C, 10 min.
7. Cooling: 15°C, as long as required (often called “forever” in programs of Thermocyclers).
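The 50°C annealing step is a starting point; Subheading 3.14 states that the optimal annealing temperature is typically 3–5°C below the primer melting temperature. The Python sketch below illustrates that rule. It estimates Tm with the simple Wallace rule of thumb (2°C per A/T, 4°C per G/C), which is an assumption made here for illustration and not the method used by the authors; dedicated primer design programs give better estimates. Both primer sequences are hypothetical, and taking the lower Tm of the pair is likewise an assumed convention.

```python
def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2 * (A + T) + 4 * (G + C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("primer contains characters other than A, C, G, T")
    return 2 * at + 4 * gc


def annealing_window(tm_forward, tm_reverse):
    """Annealing range 3-5 degrees C below the lower of the two primer Tm values."""
    tm = min(tm_forward, tm_reverse)
    return tm - 5, tm - 3


# Hypothetical 20-mer primers, used only to exercise the functions.
forward = "ATGCATGCATGCATGCATGC"
reverse = "GGATCCATTGACCTAGGTAA"

tm_f = wallace_tm(forward)
tm_r = wallace_tm(reverse)
low, high = annealing_window(tm_f, tm_r)
print(f"Tm forward {tm_f}, Tm reverse {tm_r}, try annealing at {low}-{high} degrees C")
```

If unspecific products appear at the chosen temperature, the stepwise increase of 1–2°C described in Subheading 3.14 still applies.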


3.19. DNA Amplification: PCR Temperature Scheme Protocol 2

Temperature profile using Hot Start Polymerase (see Note 35).
1. Initial step: 98°C, 30 s.
2. Denaturation: 98°C, 5 s.
3. Annealing: 65°C, 5 s.
4. Elongation: 72°C, 35 s.
5. Cycles: Repeat steps 2–4 35 times.
6. Final elongation: 72°C, 1 min.
7. Cooling: 15°C, as long as required (often called “forever” in programs of Thermocyclers).
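To plan instrument time, the programmed hold times of the two temperature schemes (Subheadings 3.18 and 3.19) can be added up. The Python sketch below does this; it counts hold times only and ignores ramping between temperatures and the final cooling step, so real run times will be longer. The function name is an illustrative assumption.

```python
def program_seconds(initial, denature, anneal, extend, cycles, final):
    """Sum of programmed hold times in seconds (ramping and cooling excluded)."""
    return initial + cycles * (denature + anneal + extend) + final


# Temperature Scheme Protocol 1 (Taq): 2 min, 35 x (30 s + 30 s + 1 min), 10 min
taq_seconds = program_seconds(120, 30, 30, 60, 35, 600)

# Temperature Scheme Protocol 2 (Hot Start): 30 s, 35 x (5 s + 5 s + 35 s), 1 min
hot_start_seconds = program_seconds(30, 5, 5, 35, 35, 60)

print(f"Taq profile: {taq_seconds / 60:.2f} min")
print(f"Hot Start profile: {hot_start_seconds / 60:.2f} min")
```

On hold times alone, the standard Taq profile takes 82 min and the Hot Start profile just under 28 min.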

3.20. PCR Product Quality Control by Agarose Gel Electrophoresis (see Note 36)

1. Prepare loading dye and Molecular Size Marker (100 bp DNA Ladder Plus, Fermentas) according to the manufacturer’s instructions. Loading dye is used in 1× concentration in this protocol.
2. Prepare the tray with appropriate combs.
3. For a standard 100-ml gel, weigh 1.0 g agarose powder and add 100 ml 1× TBE buffer. Boil the mixture in a microwave until the agarose powder is completely dissolved (see Note 37). Add 2 µl (or one drop) ethidium bromide or 10 µl GelRed and shake carefully. Immediately pour the mix into the prepared tray and wait until the agarose gel is solid, which takes about half an hour.
4. Place the gel in a suitable electrophoresis chamber filled with 1× TBE buffer. The gel should be completely submerged. Remove the combs. Mix 2 µl of each PCR product with 2 µl loading dye (prepared in a microtiter plate according to the number of samples) and pipette up and down a few times to mix. Load the PCR samples into the wells (see Note 38).
5. Apply voltage (90 V) and let samples run for about 30 min. Afterwards, the double-stranded PCR products can be viewed in ultraviolet light. Take a photo to select samples for the cleanup. Sharp bands indicate successful amplification of the desired DNA fragment (Fig. 2). When PCR fails, no bands are present. In case of suspicious PCR results, see PCR troubleshooting in Note 39.
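The recipe in step 3 (1.0 g agarose in 100 ml buffer) corresponds to a 1% (w/v) gel. For trays of other sizes, the agarose mass follows from mass (g) = percentage × volume (ml) / 100. A minimal Python sketch, in which the function name and the example tray volumes are illustrative assumptions:

```python
def agarose_grams(gel_volume_ml, percent_w_v=1.0):
    """Agarose mass in grams for a gel of the given volume and % (w/v)."""
    if gel_volume_ml <= 0 or percent_w_v <= 0:
        raise ValueError("volume and percentage must be positive")
    return gel_volume_ml * percent_w_v / 100.0


standard = agarose_grams(100)      # the 100-ml gel from step 3
small = agarose_grams(50)          # a hypothetical smaller tray
dense = agarose_grams(150, 1.5)    # a hypothetical larger, 1.5% gel
print(standard, small, dense)
```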

3.21. PCR Cleanup: ExoSAP-It

PCR purification eliminates the remaining PCR ingredients and single-stranded DNA fragments which may inhibit the subsequent sequencing reaction. Cleanup can be carried out in three different ways: enzymatic digestion (this section), ethanol precipitation (Subheading 3.22), or via commercial kits (Subheading 3.23) that are based on column methods (similar to the extraction methods of Qiagen and Macherey-Nagel). After the cleanup, the PCR products are ready to use for sequencing reactions.


Fig. 2. Amplified COI fragments on agarose gel. Lanes 1–4 show very intense and sharp PCR products. In lanes 5–8, DNA amplification failed, and only unconsumed primers are visible.

Enzymatic digestion using ExoSAP-It (GE Healthcare, formerly Amersham Biosciences) (see Note 40):
1. Mark the appropriate number of tubes. Do cleanup of only the samples that worked well in the PCR reaction. Transfer 5 µl of each of the selected PCR products to new PCR tubes (single tubes, 8-tube strips, or plates).
2. To clean up 5 µl of PCR product, 1.0 µl molecular water and 0.5 µl ExoSAP-It are needed. Produce a master mix of water and ExoSAP-It according to the number of samples plus about 5–10% more just to make sure that it is enough.
3. Mix 5 µl PCR product with 1.5 µl master mix, then place it in the Thermocycler, and start the program with the following temperature scheme:
37°C, 40 min.
80°C, 15 min.
15°C, as long as required.
In the first step, Exonuclease I degrades residual single-stranded DNA and primers. Shrimp Alkaline Phosphatase hydrolyzes remaining dNTPs. Those ingredients would otherwise interfere with the sequencing reaction. In the second step, ExoSAP-It itself is inactivated. After this procedure, the cleaned up PCR product can be directly used for sequencing reaction (see Note 41).

3.22. PCR Cleanup: NucleoSpin® Extract II Kit

Column-based method using NucleoSpin® Extract II Kit (Macherey-Nagel): 1. Dilute buffer NT3 with the correct amount of ethanol (96–100%). Mark the appropriate number of 1.5-ml microcentrifuge tubes (not provided).


2. For sample volumes of 90%, 5–10 years

>90%, up to 200 years

Species resolution

95–98%

91–95%

Technology

Sanger (ABI)

Sanger (ABI) NextGen sequencing (i.e., 454) Single pyrosequencing (PSQ)

Applicability

Barcode library construction Routine barcoding

Museum and preserved samples Processed material (i.e., food products, pharmaceuticals) Environmental barcoding

15

DNA Mini-barcodes

341

among similarly aged samples can be largely accounted for by the variation in preservation methods. Many specimens, particularly insects, are pinned for storage, allowing the soft tissue to desiccate and decompose (9). This dehydration combined with exposure to the environment can lead to a reduction of DNA quality and increased incidences of DNA fragmentation (6). Field-collected specimens and pathology tissue samples are commonly exposed to formaldehyde, which has severe consequences for the DNA (6). In a formalin-exposure study utilizing amphibian specimens, a clear negative correlation was found between exposure time and PCR success (6). Furthermore, a recent study by Baird et al. (10) examined the effect of formalin preservation on DNA barcoding. This work used tissue samples of four invertebrate species commonly used in freshwater biomonitoring programs as well as archival specimens of macroinvertebrates. The authors concluded that exposure to formalin followed by long-term storage can dramatically reduce the ability to obtain a full-length barcode; however, some mini-barcodes can still be recovered from these samples (10). In many cases, damaged DNA can be extracted from a sample but the DNA is broken into small fragments due to hydrolysis of the DNA backbone (6, 11). This fragmented DNA cannot be amplified using standard barcoding primers. Models derived from preserved specimen and tissue samples by capillary electrophoresis predict a rapid initial decline in average DNA fragment size in the first 5 years followed by a more gradual change in the following time period (6). The same problems that plague preserved samples affect fossils and ancient DNA. Our understanding of evolutionary processes is hindered by DNA fragmentation, cross-linking due to condensation (12), and pyrimidine oxidation, which prevents extension during the amplification process (11).
Furthermore, these ancient records often produce DNA extracts that are a combination of bacterial, fungal, and human contaminants (11), complicating the ability to achieve a usable standard barcode. Mini-barcodes (e.g., 100–300 bp) have been found effective for species-level identification in DNA-damaged samples and in situations where it is difficult to obtain a full-length barcode (Table 2). Additionally, components, such as average nucleotide composition, patterns of strand asymmetry, and a high frequency of hydrophobic amino acid encoding codons can be accurately predicted from a short barcode sequence (13). Furthermore, it has been shown that mini-barcodes may provide measures at both the intra-specific and intra-generic levels of sequence variability and divergence in some cases when compared to full-length barcodes (3). Full-length 650 bp COI barcodes can exhibit up to 98% species resolution, with smaller regions of 100 bp and 250 bp producing correspondingly lower rates of identification success (3, 4) (Fig. 1), but when employed in ecological or environmental contexts where the number of species per genus is often low, they can produce

Table 2 Different research projects that used DNA mini-barcodes for biodiversity analysis and species identification

| Taxonomic group | Gene | Sample type | Age of sample (years, unless specified) | Mini-barcode size (bp) | References |
| --- | --- | --- | --- | --- | --- |
| Insecta: Lepidoptera | COI | Oven-dried (museum) | 2–21 | 134 and 221 | (3) |
| Insecta: Hymenoptera | COI | Ethanol-preserved (museum) | 1–14 | 135 | (3) |
| Insecta: Trichoptera | COI | Formalin-preservation | >1–23 | 130 | (10) |
| Aves: 17 orders | COI | Arsenic/borax-preserved and air-dried (museum) | Average 80 | 190 and 310 | (8) |
| Reptilia | COI | Dried and formalin-fixed tissue (forensic samples) | N/A | 175–245 | (18) |
| Insecta: Diptera | COI | Museum | Average 15 | 227–469 | (20) |
| Plantae | P6 loop of trnH intron | Silica-dried (museum) | N/A | 13–158 | (15) |
| Plantae | P6 loop of trnH intron | Permafrost | 22 960 ± 120 (OxA-15348) and 15 810 ± 75 (OxA-14930) uncalibrated radiocarbon years | 13–158; average 43.2 | (15) |
| Actinopterygii: Salmoniformes | Ots213 | Decayed carcass | N/A | 170–375 | (17) |

342 M. Hajibabaei and C. McKenna

15

DNA Mini-barcodes

343

Fig. 1. Comparison of DNA barcode size versus proportion of species identified reveals the efficiency of mini-barcodes in resolving species (adapted from ref. 4). Sequence read lengths typically obtained from three commonly used next-generation sequencing technologies as well as Sanger sequencing are shown on the graph. It is clear that 454 pyrosequencing and Sanger are currently optimal technologies for mini-barcode and full-barcode recovery.

rates of identification that are very high (4). In silico studies have been used to corroborate the empirical tests of the rates of identification success for DNA barcodes, but also point to the need to carefully design experiments in environmental contexts where primer bias may affect the results (3, 14). This discovery has led to an increase in the use of mini-barcodes. The consistency with which mini-barcodes distinguish between species has been explored in plants (15), fish (3, 4, 14, 16, 17), reptiles (18), birds (4, 8, 19), arthropods (3, 4, 9, 20–23), fungi (4, 20), and mammals (4, 14). Additionally, multiple overlapping mini-barcodes have been used to reconstruct the full COI barcode (9, 19). Applications of mini-barcodes include food web analysis (21), distinction of cryptic species (16), biodiversity studies (3, 11, 15, 17, 22, 23), and effective law enforcement for the conservation of wildlife (18). Short DNA regions can also be utilized via new parallel high-throughput sequencing platforms (also known as next-generation sequencing), such as the pyrosequencing-based Roche 454 sequencer, allowing a comprehensive and inexpensive means for barcoding applications (15, 23) (Fig. 1). With these technologies, the need for traditional cloning is eliminated, as simultaneous amplification of several thousands to millions of 10–400 bp DNA molecules is achieved during the emulsion PCR process (7).


2. Materials

We recommend using molecular biology laboratory material, including gloves, disposable pipette tips, PCR-grade tubes/strips, or 96-well microtiter plates.

2.1. Silica-Based DNA Extraction

1. NucleoSpin 96 Tissue Kit (Macherey-Nagel).
2. Ethanol (99.9%).
3. Matrix ImpactII P1250 pipette (Matrix Technologies).
4. Centrifuge with deep-well plate rotor (25R, Beckman Coulter).
5. Incubator (Fisher Scientific).

2.2. PCR Amplification

1. 10× PCR Buffer for Platinum Taq DNA polymerase (Invitrogen).
2. 10 mM deoxynucleotide (dNTP) mix (New England Biolabs).
3. Oligonucleotide primers (Forward and Reverse), 100 μM stock, and 10 μM working solutions (Integrated DNA Technologies).
4. Platinum Taq DNA Polymerase (Invitrogen).
5. 50 mM Magnesium chloride (MgCl2) (Sigma-Aldrich).
6. Molecular biology grade distilled water.
7. Thermocycler (Mastercycler EP Gradient, Eppendorf).

2.3. PCR Amplification Check Using E-gel 96 Gel

1. Mother E-base (Invitrogen).
2. 2% E-gel 96 gels (Invitrogen).
3. Gel documentation system (AlphaImager 3400, Alpha Innotech Corporation).

2.4. PCR Amplification Check Using Handmade Agarose Gel

1. Molecular biology grade Agarose (Sigma-Aldrich).
2. 1× TBE buffer: 89 mM Tris base, 89 mM Boric acid, 2 mM Na-EDTA.
3. Ethidium Bromide (10 mg/ml).
4. DNA size standard ("ladder"; New England Biolabs).
5. Submarine electrophoresis apparatus and power supply (Thermo EC, Fisher Scientific).
6. Gel documentation system (AlphaImager 3400, Alpha Innotech Corporation).

2.5. Sanger Sequencing Reaction

1. BigDye® Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems).
2. 5× Sequencing Buffer: 400 mM Tris–HCl pH 9.0 and 10 mM MgCl2.


3. Sequencing Oligonucleotide Primers. For bi-directional sequencing, two sequencing reactions are performed, each requiring a single Forward or Reverse primer. See Table 3 for a list of primers.

Table 3 PCR primer sets commonly used for mini-barcode amplification

Name | Primer sequence (5′–3′) | Target | Amplicon size (bp) | References
Uni-MinibarF1 | TCCACTAATCACAARGATATTGGTAC | COI; Universal eukaryotes | 130 | (4)
Uni-MinibarR1 | GAAAATCATAATGAAGGCATGAGC | COI; Universal eukaryotes | 130 | (4)
Minibar-F1 | TGA TTY TTT GGH CAC CCR GAA GT | COI; Reptile, snake | 175 | (18)
Minibar-R1 | AAT ATR TGR TGG GCY CAD AC | COI; Reptile, snake | 175 | (18)
Minibar-F2 | GGT AGY GAT CAA ATC TTT AAY GT | COI; Reptile, snake | 245 | (18)
Minibar-R2 | GGG TAG ACD GTT CAV CCT GTT CC | COI; Reptile, snake | 245 | (18)
Primerpair-1 | AGCATTAGCTCTCCCTGA AGCCATACGGCGGTGAAT | 16S; plant | 166 | (15)

3. In Silico Approaches for Mini-barcode Analysis

3.1. Mini-barcode Primer Selection

Primers should be designed based on an alignment of sequences from a reference barcode library source, such as GenBank or BOLD (http://www.barcodinglife.com/views/idrequest.php). Keeping in mind physical and structural properties, such as G + C content, annealing temperature, and self-complementarity, oligos should target highly conserved regions flanking stretches of 100–300 bp in the barcode region. To facilitate high-throughput sequencing applications, M13 tails may be attached to the forward and reverse primers. Although the addition of these tails generally does not bias PCR results, it is best to verify this empirically before using tailed primers in real applications (4). Primer suitability can be confirmed using a number of freely available programs, such as Primer3 (24) and IDT OligoAnalyzer (25). Most applications of mini-barcodes allow for the selection of a target taxonomic group and subsequent primer design. For example, Hajibabaei et al. (3) developed primers for mini-barcoding specific Lepidoptera species. In this case, the availability of reference sequences from taxa closely related to the target species will increase the chances of developing robust primers. For protein-coding genes, such as COI, conservation of primer binding sites at the amino acid level will aid the selection of target primers. However, when dealing with unknown specimens, or in cases where identification speed is required, universal primers for mini-barcodes can be utilized. Such a universal mini-barcode system has been developed for COI and tested on a number of taxonomic groups (4). Additionally, universal primers are important for sequencing mini-barcodes from environmental samples containing different organisms (23). Table 3 summarizes different commonly used primers for mini-barcoding.

3.2. Bioinformatics Analysis to Estimate Mini-barcode Performance

Before using a mini-barcode in empirical tests, it is important to determine whether sequence information obtained from a given mini-barcode sequence can provide species-level resolution in a given taxonomic group. Available taxon-specific barcode data can be downloaded from barcode libraries, such as BOLD (http://www.barcodinglife.com/views/idrequest.php) or GenBank. Alternatively, sequences obtained in full-length barcode analysis in a given taxonomic group can be used as a template for primer design for mini-barcoding. The full-length region can then be divided into short subsets from the 5′ to the 3′ end (for example see Subheading 3). Software such as MEGA (26) provides a simple bioinformatics tool to partition data for this purpose. By comparing each putative mini-barcode segment (e.g., the first 100 bp of the 5′ region) to the full-length barcode through simple statistics, such as the number of variable and parsimony-informative sites and intra-specific and intra-generic divergences (3), one can determine the mini-barcode fragment with optimal information. Subsequent neighbor-joining (NJ) analysis can measure resolution and ultimately test the accuracy of the short DNA fragments for species identification in the target taxonomic group (3, 7).
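The comparison of a putative mini-barcode window with the full-length barcode can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MEGA workflow itself: the function names (site_stats, p_distance, resolved_species), the toy alignment, and the simple rule that a species counts as resolved when its smallest interspecific p-distance exceeds its largest intraspecific p-distance are all hypothetical choices made for this example.

```python
from collections import Counter

def site_stats(seqs, start, end):
    """Count variable and parsimony-informative sites in alignment columns [start, end)."""
    variable = informative = 0
    for col in range(start, end):
        # Ignore gaps and ambiguity codes when tallying character states.
        counts = Counter(s[col] for s in seqs if s[col] in "ACGT")
        if len(counts) > 1:
            variable += 1
            # Informative: at least two states, each present in two or more sequences.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative

def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return float("nan")
    return sum(x != y for x, y in pairs) / len(pairs)

def resolved_species(records, start, end):
    """Species whose minimum interspecific distance exceeds their maximum intraspecific distance."""
    window = [(sp, seq[start:end]) for sp, seq in records]
    resolved = set()
    for sp in {sp for sp, _ in window}:
        own = [s for p, s in window if p == sp]
        other = [s for p, s in window if p != sp]
        max_intra = max((p_distance(a, b) for i, a in enumerate(own) for b in own[i + 1:]),
                        default=0.0)
        min_inter = min((p_distance(a, b) for a in own for b in other),
                        default=float("inf"))
        if min_inter > max_intra:
            resolved.add(sp)
    return resolved

# Toy aligned "library" (hypothetical data, 10 columns standing in for a barcode).
records = [
    ("sp_A", "ACGTACGTAC"), ("sp_A", "ACGTACGTAT"),
    ("sp_B", "ACGAACGTGC"), ("sp_B", "ACGAACGTGC"),
    ("sp_C", "TCGTACCTAC"), ("sp_C", "TCGTACCTAC"),
]
seqs = [seq for _, seq in records]
full_stats = site_stats(seqs, 0, 10)    # full-length region
mini_stats = site_stats(seqs, 0, 5)     # putative mini-barcode: first 5 columns
```

Running resolved_species(records, 0, 5) and resolved_species(records, 4, 8) on the toy data shows how two windows of similar length can differ in resolving power, which is the property the NJ comparison is meant to expose on real libraries.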

3.3. Testing the Performance of a Putative COI Mini-barcode for Species Identification

1. Select a target sequence library from BOLD or GenBank or use a set of taxonomically identified sequences. Typically, 100–500 taxa are optimal for this analysis (see Note 1).
2. Align sequences using automated alignment tools (ClustalW, MUSCLE). These tools are available stand-alone or embedded in sequence analysis software, such as MEGA.
3. Inspect the alignment visually using a program such as MEGA. Look for obvious signs of misaligned sequences, such as indels (insertions/deletions). Use amino acid translation to guide the alignment verification.
4. Save the aligned sequences in MEGA format (.meg). If you use MEGA for alignment, this can be done from the alignment viewer. If you use other programs for alignment, save your alignment in FASTA format (.fasta, .fas) and import the file into MEGA.
5. Open the file for phylogenetic analysis in MEGA. Use the "Sequence Data Explorer" tool in MEGA to select mini-barcode fragments for analysis. This can be done in "Data" using "Select & Edit Genes/Domains".
6. Select your putative mini-barcode(s) from the full-length library by specifying the beginning and the ending nucleotides. Make sure these positions do not include the positions of your forward and reverse primers. You can select multiple mini-barcodes as long as the positions do not overlap. Overlapping mini-barcodes require a separate analysis for each overlapping mini-barcode.
7. Select your target mini-barcode by checking the box next to it in the same menu (Select & Edit Genes/Domains). Once done, check the alignment in "Sequence Data Explorer" to ensure only your target mini-barcode is highlighted.
8. Use the "Highlight" menu in "Sequence Data Explorer" to highlight/measure Variable and Parsimony Informative sites as simple statistics. These numbers can be compared with numbers obtained from a full-length barcode or other mini-barcodes.
9. Use the "Phylogeny" tools to assemble an NJ tree for your target mini-barcodes. Use this tree to inspect species-level resolution. Compare this NJ tree to the one obtained from the full-length barcode library. Evaluate cases where the mini-barcode does not provide species-level resolution. If these cases are among your putative target taxa for mini-barcoding, it is important to consider a longer or alternative fragment for mini-barcoding. Note that DNA degradation in museum samples often dramatically decreases the chances of obtaining fragments longer than 150–250 bp.
10. Once your optimal mini-barcode fragment(s) is selected, proceed to design forward and reverse PCR primers using the 20–30 bases flanking the mini-barcode fragment(s).

3.4. DNA Extraction

Many protocols are available for DNA extraction from a number of different tissue types (see Chapter III. B. on DNA extraction (III) for details). Several of these procedures have since been incorporated into relatively inexpensive and effective commercial kits, such as the silica-based NucleoSpin Tissue Kit (Macherey-Nagel, Düren, Germany) (see Note 2). Most of our mini-barcode tests have utilized this kit to obtain DNA from specimens (including museum samples). Additionally, a recent study by Shokralla et al. (27) has shown that modern PCR enzymes are capable of amplifying genetic information from preservative ethanol for noninvasive sampling or when no tissue specimen is available. This suggests that DNA extraction may be unnecessary in many protocols and that in many cases DNA can be obtained from valuable museum specimens noninvasively. Because mini-barcodes are much smaller than full-length barcodes, simple DNA extraction and release approaches that may not provide full-length intact barcode DNA are suitable for mini-barcoding.

3.5. PCR Amplification from Dried Museum Samples

Mini-barcodes can be amplified in a simple PCR reaction containing dNTPs, primers (Forward and Reverse), MgCl2, 10× PCR Buffer, Taq polymerase, and template DNA. A master mix (excluding template DNA) should be prepared for all samples that will be processed. The volume of each ingredient is calculated from the number of samples to be amplified, with an additional 5% to account for pipetting error. No special condition or additive is required for amplification of mini-barcodes, and different versions of Taq polymerase should be able to amplify mini-barcode sequences (see Note 3). The thermocycler program will vary depending on the annealing temperature and extension time. For the universal primer set (4), a touch-up PCR program was used: 95°C for 2 min, followed by five cycles of 95°C for 1 min, 46°C for 1 min, and 72°C for 30 s, followed by 35 cycles of 95°C for 1 min, 53°C for 1 min, and 72°C for 30 s, and a final extension at 72°C for 5 min (see Note 5). Negative control reactions with no DNA template as well as a positive control reaction should always be included (see Note 4). To determine PCR success, it is necessary to visualize PCR products on an agarose gel. For high-throughput analysis, a precast 2% E-gel 96 agarose gel (Invitrogen, Burlington, ON, Canada) can be used. Alternatively, casting one's own gel and utilizing a molecular size marker (e.g., 100 bp DNA Ladder, New England Biolabs) can give more comprehensive results. Positive amplifications (clean bands) can then be bidirectionally sequenced using standard BigDye chemistry (Applied Biosystems, Foster City, CA, USA) on sequencers such as a 3730xl DNA analyzer (Applied Biosystems, Foster City, CA, USA). The sequence reads are trimmed and edited, resulting in clean contigs for phylogenetic analysis, using tools such as CodonCode (CodonCode Corporation), BioEdit (28), and MEGA (26).

1. Select DNA samples for PCR amplification (see Note 2). Calculate the volume of each necessary reagent (see Table 4 for a detailed PCR recipe).
2. Prepare the master mix with the calculated volumes. It is often easy to use the calculation as a checklist to ensure all reagents have been added. Briefly vortex and centrifuge.
3. Briefly spin down the Taq polymerase before adding it to the master mix. Do not vortex; mix by gently pipetting up and down. Return the Taq polymerase to the freezer.
4. Dispense the master mix into PCR reaction tubes (0.2 ml single tubes, strips, or 96-well plates).


Table 4 A typical PCR recipe for amplification of mini-barcodes

Reagent | Initial concentration | Final concentration | Volume per reaction (μl)
H2O | | | 17.5
PCR buffer (Platinum) | 10× | 1× | 2.5
MgCl2 | 50 mM | 2.0 mM | 1
dNTPs | 10 mM | 0.2 mM | 0.5
Primer (Forward) | 10 μM | 0.2 μM | 0.5
Primer (Reverse) | 10 μM | 0.2 μM | 0.5
Platinum Taq polymerase | 5 U/μl | 2.5 U | 0.5
Template DNA | | | 2
Final volume | | | 25

5. Add 1–2 μl DNA template to each reaction tube. Ensure that the pipette tip is changed each time.
6. Cover PCR tubes securely (caps in the case of strips, or seal plates with foil) and label using a permanent marker.
7. Centrifuge the plate (or PCR tubes/strips in a plate holder) for about 1 min at 1,000 × g, ensuring that the centrifuge is well balanced.
8. Place tubes, strips, or plate firmly into the thermocycler. Double-check the covers to minimize evaporation before beginning the required program. Ensure that the thermocycler lid is secure and that the program begins.
9. Perform a PCR amplification check using gel electrophoresis (Protocols 3 and 4) to determine positive samples.
10. (Optional) Use a PCR purification method to clean samples prior to the sequencing reaction (e.g., QIAquick (Qiagen, Düsseldorf, Germany)).

3.6. PCR Amplification Check Using Pre-cast 2% E-gel 96 Agarose (Invitrogen, Burlington, ON, Canada)

1. Remove the gel from its package and remove the comb using your thumb and reasonable force. Use the packaging as a base for the gel while applying samples (see Note 6).
2. Load 14 μl of ddH2O into the wells, holding a 12-channel pipette at a slight angle.
3. Load 4 μl of PCR product into the wells using a 12-channel pipette.
4. Slide the gel into the electrode connections on the E-Base™. Ensure that the E-gel display screen says "EG" and change the time to the appropriate amount (~4 min). Press and release the pwr/prg button; the red light should turn green.
5. Remove the gel from the base and acquire an image using a UV transilluminator and digital camera if necessary.
6. Discard gloves, tips, and gel in hazardous waste.

3.7. PCR Amplification Check Using 1.5% Hand-Made Agarose Gel

1. Prepare the gel-casting tray with combs appropriate to the number of samples and secure the open edges using a gasket system or tape.
2. Weigh 1.5 g agarose powder and add it to 100 ml 1× TBE buffer in a glass beaker, preferably with a loosely fitted lid (this calculation may vary for different gel sizes; check your electrophoresis apparatus to verify).
3. Boil the mixture in a microwave (~2 min) until the powder has completely dissolved and the solution is uniform.
4. Allow to cool slightly before adding 3 μl ethidium bromide. Swirl the beaker gently to mix (see Note 6).
5. Allow further cooling; you should be able to touch the bottom of the beaker without burning your hand.
6. Steadily pour the agarose solution into the gel tray; do not move the tray while pouring, as this may create bubbles. Ensure that there are no bubbles around the combs; if there are, gently remove them using a clean pipette tip.
7. Place the gel in an electrophoresis chamber filled with 1× TBE buffer and add buffer until the gel is fully submerged. Gently remove the combs.
8. Load 3–5 μl ladder in the first lane of the agarose gel. Mix 4 μl PCR product with 2–3 μl loading dye by pipetting on parafilm or in another plate.
9. Apply PCR samples to the gel wells.
10. Connect the voltage (100 V) and allow samples to run for 20–30 min. Visualize the gel and acquire an image using the UV transilluminator of a Geldoc system.

3.8. Sanger Sequencing Reaction

1. Prepare BigDye terminator (Applied Biosystems, Foster City, CA) master mix at the appropriate dilution for the size of the product (e.g., 1/16 dilution) and according to sample number (see Table 5 for details).
2. Aliquot BigDye mix into PCR tubes, strips, or plate. Add 1.5–2 μl of PCR product as template. If using a plate, securely seal it using foil or strip caps to prevent evaporation. Perform a cycle-sequencing reaction for each primer direction (forward and reverse). An optimized thermocycler protocol can be found on the Canadian Center for DNA Barcoding (CCDB) Web site (http://www.ccdb.ca/) in Protocols: Sequencing.


Table 5 A typical Sanger sequencing recipe with BigDye (1/16 dilution)

Reagent | One reaction (μl)
Dye terminator mix 3.1 | 0.25
5× ABI sequencing buffer | 1.875
10% trehalose | 5
10 μM primer | 1
H2O | 0.875
Total volume | 9
PCR product | 1–2
Total volume | ~10

3. Perform cycle sequencing clean-up using a method such as AutoDTR™ 96 (EdgeBio, MD, USA). A detailed protocol can be found on the Canadian Center for DNA Barcoding (CCDB) Web site (http://www.ccdb.ca/) in Protocols: Sequencing.
4. After clean-up, submit reactions for sequence analysis using an automated DNA sequencer (e.g., Applied Biosystems 3730xl DNA Analyzer).
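The master-mix arithmetic used for the PCR (Table 4) and sequencing (Table 5) recipes can be sketched as follows. This is a minimal illustration, assuming the per-reaction volumes given in those tables and the 5% overage recommended for PCR; the dictionary keys and the function names master_mix and final_concentration are hypothetical and chosen only for this example.

```python
# Per-reaction volumes in microliters, template/PCR product excluded.
PCR_RECIPE_UL = {            # Table 4 (25 ul final volume with 2 ul template)
    "H2O": 17.5,
    "10x PCR buffer": 2.5,
    "50 mM MgCl2": 1.0,
    "10 mM dNTPs": 0.5,
    "10 uM forward primer": 0.5,
    "10 uM reverse primer": 0.5,
    "Platinum Taq (5 U/ul)": 0.5,
}
BIGDYE_RECIPE_UL = {         # Table 5 (1/16 dilution)
    "Dye terminator mix 3.1": 0.25,
    "5x sequencing buffer": 1.875,
    "10% trehalose": 5.0,
    "10 uM primer": 1.0,
    "H2O": 0.875,
}

def master_mix(recipe_ul, n_reactions, overage=0.05):
    """Scale per-reaction volumes to n reactions plus a fractional overage."""
    factor = n_reactions * (1 + overage)
    return {reagent: round(vol * factor, 2) for reagent, vol in recipe_ul.items()}

def final_concentration(stock, volume_ul, total_ul=25.0):
    """Final concentration after dilution, in the same unit as the stock."""
    return stock * volume_ul / total_ul

# Example: one 96-well plate of PCR reactions.
mix_96 = master_mix(PCR_RECIPE_UL, 96)
```

Checking final_concentration(50, 1.0) against the 2.0 mM MgCl2 listed in Table 4 is a quick way to confirm that a modified recipe still matches its intended final concentrations.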

4. Notes

1. Sequence selection is critical as it influences the analysis of the utility of a putative mini-barcode. For barcoding purposes, species-level discrimination is most important. Hence, sequences used for mini-barcode selection should include the maximum number of species. Congeneric species are good targets for this analysis. Additionally, when possible, multiple sequences from each species should be included so that conspecific variation is taken into consideration in the calculations.
2. Always perform sampling and DNA extraction procedures in a dedicated pre-PCR area. Clean the work surface with ethanol or a product such as ELIMINase Decontaminant. All tissue-handling instruments should be sterilized (preferably by flaming) between samples.
3. A high-fidelity polymerase, such as Platinum Taq (Invitrogen, Burlington, ON, Canada), is recommended as it requires less optimization and works better with small quantities of template DNA.
4. Always include at least one PCR reaction without template as a negative control to check for reagent DNA contamination. Consider using a positive control (a previously amplified DNA sample) to test the efficiency of the PCR reagents.
5. PCR protocols listed are for a thermocycler with rapid thermal ramping (e.g., Eppendorf MasterCycler EP). This allows for more efficient annealing and quicker completion of PCR amplification; optimization will be needed if a model with slower ramping is used.
6. Ethidium bromide is toxic. Always wear nitrile gloves when handling ethidium bromide. Discard gloves, tips, and used gels in an appropriate hazardous waste container after use. Consult the MSDS and a laboratory health and safety manual for safe handling and disposal before use.

Acknowledgements

This work was supported by grants from Genome Canada through the Ontario Genomics Institute, Environment Canada and NSERC to MH.

References

1. Hebert PDN, Cywinska A, Ball SL, deWaard JR (2003) Biological identifications through DNA barcodes. Proc R Soc Lond B Biol Sci 270:313–321
2. Hajibabaei M, Singer GAC, Hebert PDN, Hickey DA (2007) DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics. Trends Genet 23:167–172
3. Hajibabaei M, Smith MA, Janzen DH et al (2006) A minimalist barcode can identify a specimen whose DNA is degraded. Mol Ecol Notes 6:959–964
4. Meusnier I, Singer GAC, Landry JF et al (2008) A universal DNA mini-barcode for biodiversity analysis. BMC Genomics 9:214
5. Wandeler P, Hoeck PEA, Keller LF (2007) Back to the future: museum specimens in population genetics. Trends Ecol Evol 22:634–642
6. Zimmermann J, Hajibabaei M, Blackburn DC et al (2008) DNA damage in preserved specimens and tissue samples: a molecular assessment. Front Zool 5:18
7. Hajibabaei M, Singer GAC, Clare EL, Hebert PDN (2007) Design and applicability of DNA arrays and DNA barcodes in biodiversity monitoring. BMC Biol 5:24
8. Patel S, Waugh J, Millar CD, Lambert DM (2009) Conserved primers for DNA barcoding historical and modern samples from New Zealand and Antarctic birds. Mol Ecol Resour 10:431–438
9. Dean MD, Ballard JW (2001) Factors affecting mitochondrial DNA quality from museum preserved Drosophila simulans. Entomol Exp Appl 98:279–283
10. Baird DJ, Pascoe TJ, Zhou X, Hajibabaei M (2011) Building freshwater macroinvertebrate DNA barcode libraries from reference collection material: formalin preservation versus specimen age. J North Am Benthol Soc 30:125–130
11. Poinar HN, Schwarz C, Qi J, Shapiro B et al (2006) Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311:392–394
12. Evans T (2007) DNA damage. NEB Expressions 2(1):1–3
13. Min XJ, Hickey DA (2007) DNA barcodes provide a quick preview of mitochondrial genome composition. PLoS One 2:e325
14. Ficetola GF, Coissac E, Zundel S et al (2010) An in silico approach for the evaluation of DNA barcodes. BMC Genomics 11:434
15. Sonstebo JH, Gielly K, Brysting AK et al (2010) Using next-generation sequencing for molecular reconstruction of past arctic vegetation and climate. Mol Ecol Resour 10:1009–1018
16. Saitoh K, Uehara S, Tega T (2008) Genetic identification of fish eggs collected in Sendai Bay and off Johban, Japan. Ichthyol Res 56:200–203
17. Baumsteiger J, Kerby JL (2009) Effectiveness of salmon carcass tissue for use in DNA extraction and amplification in conservation genetic studies. N Am J Fish Manag 29:40–49
18. Dubey B, Meganathan PR, Haque I (2010) DNA mini-barcoding: an approach for forensic identification of some endangered snake species. Forensic Sci Int Genet 5:181–184
19. Lee PLM, Prys-Jones RP (2008) Extracting DNA from museum bird eggs, and whole genome amplification of archive DNA. Mol Ecol Resour 8:551–560
20. Houdt JKJ, Breman FC, Virgilio M, Meyer MD (2009) Recovering full DNA barcodes from natural history collections of Tephritid fruitflies (Tephritidae, Diptera) using mini-barcodes. Mol Ecol Resour 10:459–465
21. Rougerie R, Smith AM, Fernandez-Triana J et al (2010) Molecular analysis of parasitoid linkages (MAPL): gut contents of adult parasitoid wasps reveal larval hosts. Mol Ecol 20:179–186
22. Smith MA, Fisher BL (2009) Invasions, DNA barcodes and rapid biodiversity assessment using ants of Mauritius. Front Zool 6:31
23. Hajibabaei M, Shokralla S, Zhou X, Singer GAC, Baird DJ (2011) Environmental barcoding: a next-generation sequencing approach for biomonitoring applications using river benthos. PLoS One 6:e17497. doi:10.1371/journal.pone.0017497
24. Rozen S, Skaletsky H (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132:365–386
25. Owczarzy R, Tataurov AV, Wu Y et al (2008) IDT SciTools: a suite for analysis and design of nucleic acid oligomers. Nucleic Acids Res 36 (web server issue)
26. Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 24:1596–1599
27. Shokralla S, Singer GAC, Hajibabaei M (2010) Direct PCR amplification and sequencing of specimens' DNA from preservative ethanol. Biotechniques 48:233–234
28. Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser 41:95–98

Chapter 16

Ways to Mix Multiple PCR Amplicons into Single 454 Run for DNA Barcoding

Ryuji J. Machida and Nancy Knowlton

Abstract

Metagenetic analysis using second-generation sequencing offers a novel methodology for measuring the diversity of metazoan communities. Among commercially available second-generation sequencers, the 454 GS FLX Titanium (Roche Diagnostics) offers by far the longest read length and can produce one million sequences from a single run. Compared to the large number of sequences produced from a single run, however, the number of samples these machines can process is rather low. In this chapter, we describe the use of MID adapters to mix multiple PCR amplicons into a single 454 run. This strategy is rather easy to use, and up to 132 samples can be multiplexed into a single 454 run. If a large number of samples are to be mixed into a single 454 run, however, high cost might be the next bottleneck. In this context, we also discuss other ways of multiplexing, including the use of fusion primers and Parallel Tagged Sequencing, and weigh their advantages and disadvantages.

Key words: Metagenetics, Amplicon sequencing, Multiplexing, Second-generation sequencer

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_16, © Springer Science+Business Media, LLC 2012

1. Introduction

In the aquatic environment, metagenomics, metagenetics, and metatranscriptomics are increasingly used to compare, monitor, and assess the diversity of communities and their dynamics (e.g., 1–3). Among these strategies, metagenetic analysis (based on sequencing of PCR-amplified genes) is becoming a standard and feasible strategy in moderate-scale laboratories. In metagenetic analysis, after a target gene is amplified from DNA extracted from environmental samples, sequences of those amplicons are determined by second-generation sequencing technologies. Currently, the 454 GS FLX Titanium machine (Roche Diagnostics) offers by far the longest read length (average 400 bp) among commercially available second-generation sequencers and can produce one million sequences from a single run. However, because of structural differences from Sanger sequencing, the number of samples that these machines can process in a single run is rather low (maximum of 16 samples using a gasket to subdivide a picotiter plate). For many applications, multiple samples are needed for statistical comparisons, and subdivision of the plate results in a reduction of sequences obtained (to about a half in the case of 16 subdivisions). To increase sample throughput, several alternative strategies have been introduced, all of which use sample-specific nucleotide tags to distinguish the source bioinformatically after sequencing. In this chapter, we describe one of the multiplexing protocols, the MID adapter, in detail and compare its advantages and disadvantages with two additional multiplexing strategies: fusion primers and Parallel Tagged Sequencing (PTS) (4–6). Fusion primers contain sample-specific tags and 454 sequencing primers A and B at the 5′ portion of the oligonucleotide in addition to target-specific primers (Fig. 1-1). Therefore, after amplification by PCR, amplicons can be pooled without further manipulation. In contrast, with the MID adapter strategy, PCR is performed using ordinary primers. Then the MID adapters, which contain sample-specific tags and 454 sequencing primers A and B, are ligated to the PCR amplicons (Fig. 1-2). In PTS, hand-made, sample-specific adapters are ligated to the PCR amplicons. Then adapters containing 454 sequencing primers A and B are ligated to the PCR amplicons that already carry sample-specific adapters (Fig. 1-3). These methods differ in the time required for library preparation, multiplexing scalability, possibility of PCR amplification bias, and capacity for directional sequencing, all of which play an important role in designing an experiment. Note that prices and sequencing capacities described in this chapter are based on the information available as of December 2010 for machines in the USA.
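The bioinformatic sorting of pooled reads by their sample-specific tags can be sketched as follows. This is a minimal illustration only, assuming that the tag sits at the very start of each read, that all tags have the same length, and that only exact matches are accepted; the tag sequences, sample names, and the function name demultiplex are hypothetical and are not the actual MID sequences supplied with any kit.

```python
def demultiplex(reads, tag_to_sample):
    """Assign reads to samples by an exact match of the leading tag, then trim the tag."""
    tag_len = len(next(iter(tag_to_sample)))   # assumes equal-length tags
    bins = {sample: [] for sample in tag_to_sample.values()}
    bins["unassigned"] = []
    for read in reads:
        sample = tag_to_sample.get(read[:tag_len])
        if sample is None:
            bins["unassigned"].append(read)    # keep the untrimmed read for inspection
        else:
            bins[sample].append(read[tag_len:])
    return bins

# Hypothetical 8-nt tags and toy reads.
tags = {"AAGGTTCC": "sample_1", "CCTTGGAA": "sample_2"}
reads = [
    "AAGGTTCCACGTACGT",
    "CCTTGGAATTGACCAA",
    "AAGGTTCCGGGTTTAA",
    "ACGTACGTACGTACGT",   # no known tag
]
bins = demultiplex(reads, tags)
```

Real pipelines usually also tolerate sequencing errors in the tag and check the tag on both read ends where applicable; exact matching is used here only to keep the sketch short.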

2. Materials

2.1. PCR Products and Their Purification

1. PCR products (see Note 1).
2. MinElute Gel Extraction Kit (Qiagen), or Agencourt AMPure XP (Beckman Coulter Genomics).
3. TE buffer.

2.2. MID Adapters

1. GS FLX Titanium Rapid Library MID Adaptors Kit (454 Life Sciences). This kit includes 12 kinds of MID adapters. Each MID adapter is supplied in an amount sufficient for six reactions.

2.3. Library Construction

1. NEBNext Quick DNA Sample Prep Reagent Set2 (New England Biolabs).

Fig. 1. Multiplexed 454 library preparation workflow for fusion primers, MID adapters, and Parallel Tagged Sequencing (4–6). The capital letters A and T in the figure indicate the extended adenine and thymidine at the 5′ end of amplicons and adapters, respectively. The capital letter P together with a bar indicates the phosphorylated 3′ end. Fusion primers contain a sample-specific tag and 454 sequencing primer A or B at the 5′ portion of the oligonucleotide in addition to the target-specific primer. Therefore, amplification results obtained using fusion primers are not always the same as those obtained with target-specific primers. In contrast, the advantage of fusion primers is directional sequencing. In a library made with fusion primers, only one strand of the library carries 454 sequencing primer A, where 454 sequencing starts. In contrast, the 3′ ends of both strands carry 454 sequencing primer A in libraries made with MID adapters and PTS. This is why directional sequencing can be performed only with fusion primers. (After library preparation, all libraries are denatured to single strands and proceed to the next step of 454 sequencing, emulsion PCR.)


2.4. Library Purification

1. Agencourt AMPure XP (Beckman Coulter Genomics).

2.5. Library Quantification

1. TBS-380 Fluorometer (Turner Biosystem).
2. RL Standard, which is included in the GS FLX Titanium Rapid Library Preparation Kit (454 Life Sciences).
3. TE buffer.

2.6. Library Pooling

1. MinElute PCR Purification Kit (Qiagen).

3. Methods

3.1. PCR Amplicon Preparation

1. If an undesired fragment is coamplified by PCR, excise the target band from the gel and purify it using the MinElute Gel Extraction Kit (Qiagen). Otherwise, purify PCR products with Agencourt AMPure XP (Beckman Coulter Genomics). Elute in 17 μl of TE buffer. An amount of 500 ng of DNA or less in 17 μl of TE buffer is recommended for the following procedure. If the amount exceeds 500 ng in 17 μl, dilute the product with TE buffer and use only 500 ng for the following reactions (see Notes 2 and 3). Keep some of the PCR product for verification of MID adapter ligation efficiency (see Subheading 3.4).

3.2. Phosphorylation and dA-Tailing

1. Add 2.5 μl NEBuffer 2 (10×), 2.5 μl ATP, 1.0 μl dNTP Mix, 1.0 μl PNK, and 1.0 μl Taq DNA Polymerase (all included in NEBNext Quick DNA Sample Prep Reagent Set2) to a microcentrifuge tube. Mix by pipetting (see Note 4).
2. Add 8.0 μl of the mixture to the 17 μl purified PCR amplicon. Vortex briefly and spin down.
3. Incubate the sample in a thermal cycler with the following program: 25°C for 20 min, 72°C for 20 min, and hold at 4°C.
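When many amplicons are processed in parallel, the mixture in step 1 is usually prepared as a master mix. The sketch below scales the per-reaction volumes listed above; the 10% pipetting overage and the function name are assumptions for illustration, not values from the original protocol.

```python
# Per-reaction volumes (ul) taken from step 1 of Subheading 3.2.
MASTER_MIX_UL = {
    "NEBuffer 2 (10x)": 2.5,
    "ATP": 2.5,
    "dNTP Mix": 1.0,
    "PNK": 1.0,
    "Taq DNA Polymerase": 1.0,
}


def scale_master_mix(n_reactions, overage=0.1):
    """Scale every reagent for n_reactions plus a pipetting overage (assumed 10%)."""
    factor = n_reactions * (1.0 + overage)
    return {reagent: round(vol * factor, 2) for reagent, vol in MASTER_MIX_UL.items()}


# The per-reaction total should equal the 8.0 ul dispensed in step 2.
per_reaction_ul = sum(MASTER_MIX_UL.values())
print(f"Per-reaction mixture: {per_reaction_ul} ul")

# Hypothetical batch of 12 amplicons.
for reagent, vol in scale_master_mix(12).items():
    print(f"{reagent}: {vol} ul")
```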

3.3. MID Adapter Ligation

1. Add 1.0 μl of MID adapter to the reaction tube. A different kind of MID adapter should be used for each of the PCR amplicons. Vortex briefly and spin down. Add 1.0 μl of Quick T4 DNA Ligase (included in NEBNext Quick DNA Sample Prep Reagent Set2) to the reaction tube. Vortex briefly and spin down.
2. Incubate the reaction tube for 10 min at 25°C.

3.4. Library Purification

1. Purify the product with Agencourt AMPure XP (Beckman Coulter Genomics) following the manual provided by the manufacturer (see Note 5). Elute with 52 μl TE buffer. Use 50 μl of the eluate for the following sample preparation and 2 μl for verification of MID adapter ligation efficiency by running a gel together with untagged PCR products.

3.5. Library Quantification

1. Quantify the library using the TBS-380 Fluorometer (Turner BioSystems) following the 454 Rapid Library Preparation Method Manual (6).

3.6. Library Pooling

1. Pool the prepared libraries in molar ratios reflecting the proportion of sequence reads desired from each sample. In general, more than 500 ng of pooled library in no more than 100 μl is required for the subsequent emPCR. If the volume exceeds 100 μl, concentrate the library using the MinElute PCR Purification Kit (Qiagen).
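Pooling in molar ratios can be sketched as follows. This is a minimal illustration, assuming each library has been quantified as a mass concentration (ng/μl) and has a known mean fragment length; the conversion uses the common approximation of 660 g/mol per base pair of double-stranded DNA. The library names, concentrations, and the target of 2,000 fmol are hypothetical, and a quantification reported directly in molecules per μl would make the conversion step unnecessary.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of double-stranded DNA (approximation)


def fmol_per_ul(conc_ng_per_ul, length_bp):
    """Convert a mass concentration (ng/ul) to a molar concentration (fmol/ul)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(libraries, total_fmol):
    """libraries: {name: (ng_per_ul, length_bp, read_share)} -> {name: volume in ul}."""
    share_sum = sum(share for _, _, share in libraries.values())
    volumes = {}
    for name, (conc, length, share) in libraries.items():
        wanted_fmol = total_fmol * share / share_sum
        volumes[name] = wanted_fmol / fmol_per_ul(conc, length)
    return volumes


def pool_summary(libraries, volumes):
    """Return (total volume in ul, total DNA in ng) of the pooled library."""
    total_ul = sum(volumes.values())
    total_ng = sum(volumes[name] * libraries[name][0] for name in volumes)
    return total_ul, total_ng


def meets_empcr_input(total_ul, total_ng):
    """Check the guideline above: more than 500 ng in no more than 100 ul."""
    return total_ng > 500.0 and total_ul <= 100.0


# Two hypothetical libraries that should receive equal numbers of reads.
libraries = {"sample_A": (20.0, 400, 1.0), "sample_B": (10.0, 400, 1.0)}
volumes = pooling_volumes(libraries, total_fmol=2000.0)
total_ul, total_ng = pool_summary(libraries, volumes)
print(volumes, total_ul, total_ng, meets_empcr_input(total_ul, total_ng))
```

Because both hypothetical libraries have the same fragment length, the more dilute library simply contributes twice the volume; with different fragment lengths the volumes would also scale with length.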

4. Comparison of MID Adapters, Fusion Primers and Parallel Tagged Sequencing

4.1. Comparison of Time Required for Library Preparation

Figure 1 illustrates the workflow of 454 amplicon library preparation using fusion primers, MID adapters, and PTS (4–6). One major factor controlling the time needed to prepare 454 libraries is the number of samples to be multiplexed. Holding the number of samples constant, fusion primers have the shortest and PTS the longest time requirement for preparing multiplexed libraries for 454 sequencing. In general, preparation of MID adapter libraries requires an additional day, and preparation of PTS libraries an additional 2–7 days, compared to fusion primers (Table 1).

4.2. Scalability for Multiplexing of Large Number of Samples

In contrast to the time requirement for library preparation, PTS has large advantages in scalability over fusion primers and MID adapters (Table 1). With fusion primers, the sample-specific tag constitutes a part of the primer. Therefore, as many fusion primers must be prepared as there are samples to be mixed in a single 454 run. Generally, fusion primers contain 50 or more bases, and the cost is roughly $64.5 USD per primer (45 cents per base plus $42 USD for HPLC purification). Therefore, fusion primers are not always ideal from the perspective of cost when many samples need to be multiplexed into a single 454 run. MID adapters require less time to prepare libraries than PTS, but again the cost of multiplexing a large number of samples can be high. Twelve kinds of MID adapters are available from 454 Life Sciences, which together cost $1,500 USD. Alternatively, 120 kinds of MID adapters are available from Integrated DNA Technologies (http://www.idtdna.com), and each MID adapter costs $235 USD. Therefore, preparing many kinds of MID adapters might require a large initial investment. With the MID adapter strategy, the major cost of preparing the library is not only the MID adapter itself but also the library preparation kit. In this chapter, we describe the use of the NEBNext Quick DNA Sample Prep Reagent Set2 (New England Biolabs); the kit costs $400 USD for 10 reactions (or $1,600 USD for 50 reactions). This cost of library preparation might account for a large portion of the total if many samples are going to be multiplexed. In contrast to fusion primers and MID adapters, it is easy to set up 96-well plate reactions using PTS. The sample-specific tags used in PTS are oligonucleotides that can be purchased as ordinary primers, and most of the other required chemicals are relatively low cost. Together, this makes the scalability of PTS much higher than that of fusion primers or MID adapters.

Table 1 Comparison of multiplexing strategies for 454 library preparation

| Multiplexing strategy | Time required for library preparation | Scalability | Amplification bias | Directional sequencing |
|---|---|---|---|---|
| Fusion primers | Short | Low | Yes | Yes |
| MID adapters | Middle | Low | No | No |
| Parallel Tagged Sequencing | Long | High | No | No |

4.3. Possibility of Amplification Bias

MID adapters and PTS both use target-specific primers, which are ordinary primers that amplify the target sequence by PCR (Fig. 1). In contrast, fusion primers contain a sample-specific tag and 454 sequencing primer A or B in the 5′ portion of the oligonucleotide in addition to the target-specific primer (Fig. 1). Therefore, amplification results obtained with fusion primers are not always the same as those obtained with target-specific primers. Additionally, different tag sequences are used to amplify different samples in order to distinguish the source of the sequences. Therefore, those tag regions also have the potential to produce amplification bias between fusion primers.

4.4. Directional Versus Nondirectional Sequencing

With fusion primers, the 5′ end of the amplicon is sequenced directionally when 454 sequencing primer A is used for the sequencing reaction (Fig. 1). In contrast, either the 5′ end or the reverse complement of the 3′ end of amplicons is sequenced at random using 454 sequencing primer A with MID adapters and PTS. For short amplicons, reads cover most of the amplicon region; therefore, sequences obtained from either end in nondirectional sequencing can be compiled into a single data set because of the large overlap between sequences obtained from the two ends. However, for long amplicons, when only a partial sequence of each amplicon is determined, two sets of sequence data are obtained, one from the 5′ end and one from the 3′ end, which are not comparable with each other.


From this standpoint, directional sequencing is a clear advantage of using fusion primers, although this problem might diminish when the sequence length capacity is extended in 2011, as announced by 454 Life Sciences (http://www.454.com).

5. Notes

1. The length of PCR products, including primers and adapters, needs to be shorter than 500 bp, although the sequence length capacity might be extended in 2011 as announced by 454 Life Sciences (http://www.454.com).
2. The original Rapid Library Preparation Method Manual (6) specifies an elution volume of 16 μl instead of 17 μl. We have increased the volume to compensate for the changed reaction volume, which is reduced by omitting T4 DNA polymerase in the next step.
3. More than 500 ng of DNA may facilitate chimera formation during the subsequent ligation step (4).
4. The target is not fragmented DNA; therefore, we omit T4 DNA polymerase.
5. The target is not fragmented DNA; therefore, we omit small fragment removal, which is described in the original Rapid Library Preparation Method Manual (6).

Acknowledgments

We thank David Erickson and W. John Kress for inviting this submission.

References

1. Venter JC, Remington K, Heidelberg JF et al (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66–74
2. DeLong EF, Preston CM, Mincer T et al (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311:496–503
3. Sogin ML, Morrison HG, Huber JA et al (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA 103:12115–12120
4. Meyer M, Stenzel U, Hofreiter M (2008) Parallel tagged sequencing on the 454 platform. Nat Protoc 3:267–278
5. 454 Life Sciences (2009) Amplicon fusion primer design guidelines for GS FLX titanium series Lib-A chemistry. 454 Technical Bulletin TCB No.013-2009
6. 454 Life Sciences (2010) Rapid library preparation method manual: GS FLX Titanium Series

Part IV Applications of DNA Barcode Data

Chapter 17

The Practical Evaluation of DNA Barcode Efficacy*

John L. Spouge and Leonardo Mariño-Ramírez

Abstract

This chapter describes a workflow for measuring the efficacy of a barcode in identifying species. First, assemble individual sequence databases corresponding to each barcode marker. A controlled collection of taxonomic data is preferable to GenBank data, because GenBank data can be problematic, particularly when comparing barcodes based on more than one marker. To ensure proper controls when evaluating species identification, specimens not having a sequence in every marker database should be discarded. Second, select a computer algorithm for assigning species to barcode sequences. No algorithm has yet improved notably on assigning a specimen to the species of its nearest neighbor within a barcode database. Because global sequence alignments (e.g., with the Needleman–Wunsch algorithm, or some related algorithm) examine entire barcode sequences, they generally produce better species assignments than local sequence alignments (e.g., with BLAST). No neighboring method (e.g., global sequence similarity, global sequence distance, or evolutionary distance based on a global alignment) has yet shown a notable superiority in identifying species. Finally, “the probability of correct identification” (PCI) provides an appropriate measurement of barcode efficacy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. This chapter states explicitly how to calculate PCI, how to estimate its statistical sampling error, and how to use data on PCR failure to set limits on how much improvements in PCR technology can improve species identification.

Key words: Barcode efficacy in species identification, Probability of correct identification, DNA barcode

*For software relevant to this chapter, see http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/barcode/

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_17, © Springer Science+Business Media, LLC 2012

1. Introduction

Species are becoming extinct, making conservation of biodiversity a major challenge. The first step to preserving biodiversity is assessment, but there are not enough taxonomists to catalog species throughout the world. DNA barcodes therefore provide the basis of a promising alternative strategy because they require only collection of DNA and not the immediate taxonomic identification of specimens. Although barcodes have many other uses, e.g., identification of novel species, taxonomic classification, and phylogeny, their application to cataloguing biodiversity justifies restricting this chapter to the measurement of a barcode’s efficacy in identifying known species.

In its essence, a barcode is any standardized subset of DNA from a taxonomic specimen (1, 2). The subset may vary, depending on readily recognizable features of a specimen (e.g., is the specimen a vertebrate? a plant? an insect? etc.). If computers could identify the species of a specimen from its barcode, then the barcode would provide a database key for retrieving taxonomic information pertinent to the specimen. A computer catalog of species on Earth then becomes a technical possibility.

Early studies indicated that the sequence of cytochrome c oxidase 1 (CO1) gene could correctly identify many species (3), so selection of CO1 as a primary barcode followed naturally (4–10). Although the selection of a DNA barcode has been natural for some species, it has been problematic for others, particularly plants (11–14) and insects (15, 16). The lack of a clear consensus for a barcode in those species has stimulated interest in the objective, quantitative measurement of the efficacy of a barcode in identifying species. Consensus on an actual barcode for some species remains tentative, but nonetheless, a consensus on measuring barcode efficacy has emerged (14, 15, 17). This chapter summarizes the consensus and indicates how to construct studies to evaluate the relative merits of competing barcodes.

For practical methods, the reader is invited to view http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/barcode/, a Web site providing information on computer programs pertinent to barcodes. Web pages are supposed to be self-explanatory, so to avoid undue brevity, the second section in this chapter provides some rationale for the computer programs for evaluating barcodes. The third section provides a practical summary of the entire chapter.

2. The Measurement of the Efficacy of Species Identification

To fix our terminology, the term “marker” connotes any contiguous region of DNA (coding or non-coding), whereas the term “barcode” connotes the aggregate of the one or more markers in the “standardized subset of DNA” referred to in the Introduction. Presently, all barcode markers are marker genes like CO1, matK, etc.


In slowly evolving organisms like plants, however, intergenic spacers (DNA regions flanked by two genes) are still worthy of consideration as potential markers, because they usually diverge faster than genes, while their ends are still conserved, providing primers for PCR (17, 18). As described below, however, multiple sequence alignments (MSAs) of intergenic markers might complicate the workflow in a barcode database.

To have practical meaning, any measurement of the efficacy of species identification must mirror the performance of a database based on the prospective barcode. In practice, users query the database with a barcode retrieved from a specimen; the database returns the species identification as output, with the assignment “unknown” for any species apparently not yet in the database. Because this chapter restricts itself to discussing the identification of known species, it assumes that each query to the barcode database represents a specimen belonging to a species already in the database.

2.1. The Database

The first step in estimating the efficacy of several prospective barcodes is to assemble the corresponding databases. To ensure the proper controls, specimens not having sequences in every marker database should be eliminated from consideration (14), because if the databases do not contain exactly the same specimens, there might be unappreciated but influential biases. Consider, e.g., a hypothetical experiment that extracts from GenBank all sequences corresponding to two prospective markers, Marker A and Marker B. If Marker A has been the default marker of choice, whereas Marker B has been considered as the last hope for resolving species after Marker A has failed, the GenBank entries for Marker B might be biased toward a subset of particularly difficult specimens. Thus, on GenBank data, Marker B might have fewer correct species assignments than Marker A, even though Marker B is in fact better at resolving species than Marker A.

Moreover, relative to a barcode database, GenBank taxonomy is undependable, and undependable taxonomy improperly influences conclusions by occasionally penalizing correct species identification. In addition, GenBank entries do not usually identify individual taxonomic specimens. GenBank data are therefore particularly unsuited to studying barcodes based on more than one marker, because the sequences from different markers cannot be associated with a single specimen. Although studies based on GenBank data have obvious scientific interest, they do not have the same status as a controlled taxonomic study. In summary, the choice of database affects conclusions, so care must be taken that the database reflects the scientific aims of a study.

Figure 1 shows some pertinent results for trnH-psbA, a potential barcode marker in plants. By using pairwise alignment and various evolutionary distances in the procedures described below, the best overall probability of correct identification (PCI) in Fig. 1 is about 0.50, which is noticeably lower than the overall PCI of 0.69 from a controlled taxonomic study (14), suggesting that the GenBank entries for trnH-psbA might contain biases, relative to a controlled taxonomic study. The corresponding FASTA sequence file (see the Supplementary Materials) in fact contained genetic crosses (denoted by “x”) and tentative species assignments (denoted by “sp.”, “cf.”, “aff.”), which were obscure, until the Web tools mentioned above found them.

Fig. 1. Overall PCIs for trnH-psbA. Figure 1 graphs the overall PCI (on the X-axis) from assigning plant species with trnH-psbA sequences collected from GenBank. (The corresponding FASTA file can be obtained at http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/bib/116.html.) Assignment used a nearest neighbor algorithm and one of six separations (on the Y-axis). The six separations were: (1) Global Distance; (2) Global Similarity; and four evolutionary distances: (3) Jukes-Cantor (38); (4) Kimura (2-Parameter) (39); (5) Jin (using a gamma distribution with parameter 1) (40); and (6) Tamura (41). The pairwise sequence alignment used either the HOX70 scoring matrix

$$
\begin{array}{c|rrrr}
  & A & C & G & T \\ \hline
A & 91 & -114 & -31 & -123 \\
C & -114 & 100 & -125 & -31 \\
G & -31 & -125 & 100 & -114 \\
T & -123 & -31 & -114 & 91
\end{array}
$$

with a gap of length $k$ receiving a penalty $D(k) = 400 + 30k$, or the NCBI DNA scoring system (1 for a match, −3 for a mismatch, with a gap of length $k$ receiving a penalty $D(k) = 5 + 2k$). Perhaps surprisingly, the overall PCIs for the two scoring systems were visually indistinguishable. Global Distance is the global alignment score; Global Similarity is the actual global alignment score divided by the maximum possible global alignment score for sequences of the same length (42). The green part of the horizontal bars gives the unambiguously correct fraction of species assignments, where every specimen had as nearest neighbors only specimens from the same species; the yellow part, the ambiguously correct fraction, where every specimen had as nearest neighbors a mix of specimens from both the same and other species (with the red border indicating the average fraction of the ambiguously correct fraction matching specimens from different species); and the red part, the unambiguously incorrect fraction, where every specimen had only nearest neighbor specimens from other species.

2.2. Species Assignment Algorithm

Once an appropriate database has been selected, the computer must assign a species to each barcode query (or declare its failure to assign). The next step, therefore, is to select a computer algorithm for assigning each specimen and its barcode sequence to a species. No algorithm seems to improve noticeably on assigning to a specimen the species of its nearest neighbor within a barcode database (19, 20). Thus, many algorithms begin by estimating a “separation” between the barcode sequences in two specimens. (The term “separation” is preferable to “distance”, which connotes some specific mathematical properties not necessary to barcodes.) Separation can be based on: (1) sequence alignment similarities, (2) sequence alignment distances, (3) evolutionary distances (which usually require prior alignment of the barcode sequences), or (4) alignment-free distances. Studies have compared different measures of separation, but they are too limited to draw definitive conclusions about which separation provides the best species assignments. There are, however, some distinctly bad measures of separation. Like any assignment method, species assignment should use all available information. BLAST is a popular sequence comparison tool (21, 22), but as a measure of separation it can mislead, because it compares two sequences with local alignment, which matches and scores only the two most similar subsequences within two sequences (see Fig. 2, which diagrams some of the differences between local and global alignments). Global alignment, which matches the entire length of sequences, is better for measuring the separation of barcode marker sequences. In intergenic markers particularly, BLAST has the possible weakness of matching only small subsequences, because alignments within intergenic spacers often contain large gaps. Short subsequences can exhibit convergent evolution (homoplasy) (23), so on the one hand a BLAST local alignment might make distant species appear spuriously close. 
On the other hand, a global alignment might resolve the species by highlighting dissimilarities across the whole marker. In the context of barcodes, therefore, a global alignment (e.g., with some close relative of the Needleman–Wunsch Algorithm (24)) is generally preferable to a local alignment (e.g., with the Smith–Waterman Algorithm (25) or BLAST). Other types of alignments exist, but there is little reason to expect them to assign species notably better than global alignment.
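The contrast between global and local alignment can be made concrete with a small sketch. The code below is not the software referred to in this chapter; it is a minimal dynamic-programming illustration that scores a match as 1 and a mismatch as −3 (as in the NCBI DNA scoring system) but, as a simplifying assumption, charges a flat penalty of 5 per gap position instead of a length-dependent penalty. The three toy sequences are hypothetical: one conspecific sequence with a single substitution in the middle, and one sequence from another species that shares only a contiguous identical block with the query.

```python
def alignment_score(a, b, match=1, mismatch=-3, gap=-5, local=False):
    """Needleman-Wunsch (global) or Smith-Waterman (local) score with linear gaps."""
    # First row: leading gaps are charged globally but free locally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


def nearest_neighbor(query, database, local=False):
    """Return the database entry whose sequence scores highest against the query."""
    return max(database, key=lambda name: alignment_score(query, database[name], local=local))


# Hypothetical toy sequences (not real barcode data).
query = "GATTACAGCTGA"
database = {
    "same species": "GATTAGAGCTGA",   # one substitution in the middle
    "other species": "GATTACAGCCCC",  # identical block, then divergent tail
}

print("global:", nearest_neighbor(query, database))
print("local: ", nearest_neighbor(query, database, local=True))
```

In this toy case the local score rewards the uninterrupted identical block and ignores the divergent tail, so the sequence from the other species appears closest, whereas the global score counts the divergent tail against it and recovers the conspecific sequence.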


Fig. 2. Two types of alignment, global and local. (a) shows a global alignment of two sequences (black lines). Global alignment is an alignment along the complete length of the sequences, so it bridges a gap in the second sequence (white space), to include all pairs of similar subsequences (red rectangles). (b) shows a local alignment of the same two sequences. Local alignment aligns only the pair of most similar subsequences in the sequences, so it does not bridge the gap in the second sequence and does not include the smaller subsequence alignment (now shown in gray). Local alignment can be misleading when identifying species with barcodes because it does not incorporate all available sequence information.

MSAs might be more problematic for intergenic markers than for marker genes like CO1, because intergenic MSAs usually contain many gaps, disrupting the alignment columns representing evolutionary relationships. In practice, the Barcode of Life Database (http://www.boldsystems.org) stores sequences in a global MSA, by using the program HMMer (26) to align sequences before comparing the corresponding barcode marker genes. In fact, many publicly available tools (e.g., MUSCLE (27) or MAFFT (28)) could create barcode MSAs interchangeably with HMMer. The point of using MSAs in a large barcode database, however, is that MSA can be much faster than pairwise sequence alignment. (If there are $N$ barcodes in a database, pairwise alignment requires time proportional to $N^2$.) Although bioinformatics should adapt to the needs of biology and not vice versa, the selection of an intergenic marker as a barcode might exclude MSAs in the workflow of large barcode databases, causing awkward (but probably not insuperable) difficulties.

As separations, the relative merits of global alignment similarity, global alignment distances, or evolutionary distances based on a global alignment have not yet been clearly established, although the differences in species assignment are probably small. Alignment distances and similarities model insertions and deletions in sequences, which are not as well understood as nucleotide substitutions used in evolutionary distances. As a separation, p-distance (the proportion p of alignment pairs containing differing nucleotides) is particularly simple and well-known to taxonomists (20), but in fact no separation based on global alignment has shown any clear superiority in species assignment over the others.

Other species assignment algorithms should be mentioned (29, 30). Many probabilistic algorithms, in particular those producing phylogenetic trees (31, 32), are now a commonplace in taxonomy.


Unfortunately, most probabilistic computations are much slower than the nearest neighbor algorithms above. Because they do not noticeably improve identification, they have not found a place in automatic species identification. Alignment-free algorithms are simple and provide faster computation than alignment-based methods (20, 33), but presently, they have not been widely adopted in species identification.

2.3. Probability of Correct Identification

With an appropriate database and species assignment algorithm in hand, a scientist interested in barcode efficacy must measure the algorithm’s success in identifying species. Any reasonable measure of barcode efficacy should reflect the probability that a database based on the prospective barcode identifies a specimen’s species correctly. Consensus has therefore emerged on “the probability of correct identification” (PCI) as the appropriate measurement of barcode efficacy (14, 15, 17). The ambiguities in the definition of PCI accommodate legitimate scientific disagreement about success in species identification, so the concept of PCI actually embraces a broad class of measures. Consider a particular data set, and assume that PCI can be defined for each species within the data set. The overall PCI for the data set is the average of the species PCIs, taken over all species in the data set. If a few data subsets are particularly important (e.g., angiosperm, basal, and gymnosperm subsets within a plant data set), the PCI for the subsets can be reported separately. In principle, the PCI for each species could be weighted to reflect the species’ importance or the number of specimens representing it in the data set. In practice, however, scientists have not weighted averages when calculating overall PCI. Thus, to calculate the overall PCI of a data set, we now require only a species PCI, a probability to quantify success in identifying each fixed species. To calculate a species PCI, one can perform a leave-one-out procedure, sometimes called “the jackknife” in statistics (34). Remove each specimen in a species in turn from the database, and consider the separation of the removed specimen from the specimens of the same species remaining in the database. (The leave-one-out procedure cannot sensibly be applied if a species has only a single specimen in the database. 
Because a singleton species must therefore be omitted from the average in the overall PCI, it usually represents wasted experimental effort. It does, however, provide a “decoy,” which provides a realistic impediment to correct species assignment.)

Scientists legitimately disagree over the definition of “success” in species identification. Some scientists might consider “success” theoretically, as a monophyly, where every specimen in the species is closer to all specimens in the species than to any other specimen (14). On success, the species PCI is 1; on failure, it is 0. Other scientists might consider success more pragmatically, as a correct assignment of the species, where each specimen in the species has as its nearest neighbor(s) only specimens in the species (15). Again, if so, the species PCI is 1; if not, it is 0. The following additional conditions can contribute to success or failure, as desired: ties outside the species for a nearest neighbor, assignment of specimens from other species to the species in question, etc. Some authors have advanced less stringent criteria for success (e.g., for k > 1, the specimen’s k nearest neighbors must contain at least one other specimen from the same species) (33). The species PCI has also been calculated as the fraction of specimens within a species whose nearest neighbor gives the correct assignment (17). Any specific choice might be appropriate in different circumstances, depending on the scientific aim.

Some authors experimented with placing additional conditions on “success” as defined above, e.g., sequence difference (p-distance) thresholds, such as 2% or 3% (15). Detection of unknown species with sequence identity thresholds seems artificial, however (35). The notion of “species” could be redefined by DNA thresholds (1, 2, 36, 37), but such redefinitions generate many conflicts with traditional taxonomy (15).

2.4. PCR Failure

PCI should estimate the success in correctly identifying a known species. Under present technology, species identification with a DNA barcode requires that the following criteria be met:

1. At least part of the barcode sequence must be present in the specimen.
2. Laboratory procedures must physically extract it from the specimen.
3. PCR primers must amplify it.
4. It must be sequenced.
5. It must diverge sufficiently, to distinguish species.
6. It must not diverge excessively, so specimens from a single species remain similar and identifiable.

Thus, PCI must account for PCR failure, if it is to estimate identification success under present technology. Recall that the overall PCI is the average of the PCI for each individual species. The Appendix discusses PCR failure for a barcode based on several markers. For simplicity, this subsection considers only a barcode based on a single marker. We revise the species PCI to account for PCR failure, as follows. According to the procedures in the preceding subsection (which ignore PCR failure), let the species have PCI $p$; and let $s$ be the fraction of specimens from the species with a successful PCR. (Note that $s$ is estimated from all specimens, whereas $p$ is estimated solely from specimens with a successful PCR.) A reasonable procedure might average the “PCR-adjusted species PCI” $p' = ps$ over all species to produce a “PCR-adjusted overall PCI.” The PCR-adjusted overall PCI faithfully reflects the efficacy of species identification with present technology, whereas the overall PCI (which ignores specimens where PCR failed) reflects the efficacy of species identification with a perfect PCR technology.

Technology reduces PCR failure rates, so arguments have been advanced that PCR failure should be ignored (14). The PCI after any technological advance, however, is bounded below by the PCR-adjusted overall PCI (which reflects present PCR technology); similarly, it is bounded above by the overall PCI (which ignores specimens with failed PCR). The bounds demonstrate that technological advance by itself does not preclude a sober assessment of future prospects. Like any numerical result from a definite procedure with a sensible meaning, the PCR-adjusted overall PCI is useful, and its deliberate omission merely undermines rational discussion about the relative merits of potential barcodes.

2.5. Statistical Sampling Error

The overall PCI is the (unweighted) average of the species PCIs. Let us make a reasonable approximation that species PCIs are mutually independent across all species. Any database is a sample of all possible species, so the overall PCI from the database is an estimate of the “true” overall PCI $p$. As such, it has a sampling error, calculable with the binomial distribution. Let $n$ be the number of species contributing to the overall PCI. Under mild assumptions (given below), a binomial estimate $\hat{p}$ is normally distributed with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$. Thus, the confidence interval

$$
\left[\, \hat{p} - z\sqrt{\hat{p}(1-\hat{p})/n},\ \ \hat{p} + z\sqrt{\hat{p}(1-\hat{p})/n} \,\right]
$$

contains the true overall PCI $p$ with a confidence determined by $z$ in conjunction with the normal distribution. The larger $z$ is, the broader the interval becomes, and the greater the probability that the interval contains the true value of $p$. As approximate examples, $z = 2$ yields a 95% confidence interval; $z = 2.6$, 99%, etc. (As a useful rule of thumb, the normal approximation holds, if $n \ge 20$ and the confidence interval does not include 0.0 or 1.0.) Confidence intervals are worth calculating, because they are often surprisingly broad.

As an aside, the confidence intervals for the overall PCI are crucial to evaluating the relative merits of tentative barcodes, but they have little direct bearing on one’s confidence in the species assignment of a specific specimen, for the following reason. Most taxonomists probably prefer a barcode for which assignment errors are confined to a few species, rather than to have the same errors spread across many species. (If nothing else, alternative strategies might be available for assigning a small number of problematic species.) Overall PCI faithfully reflects taxonomists’ barcode preferences, but the evaluation of a specific species assignment poses a different problem, requiring a different solution.
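The calculations in Subheadings 2.3–2.5 can be tied together in a short sketch. The code below is an illustration under stated assumptions, not the software referred to in this chapter: it takes a precomputed, symmetric matrix of separations; it adopts the pragmatic definition of success (every specimen in a species has only conspecific nearest neighbors, so ties outside the species count as failure); and it skips singleton species. The function names, the toy matrix, the PCR success fractions, and the values passed to the confidence interval are all hypothetical.

```python
import math


def species_pcis(labels, sep):
    """Leave-one-out species PCIs (1.0 or 0.0) from a symmetric separation matrix."""
    members = {}
    for idx, species in enumerate(labels):
        members.setdefault(species, []).append(idx)
    pcis = {}
    for species, idxs in members.items():
        if len(idxs) < 2:
            continue  # leave-one-out is undefined for a singleton species
        success = True
        for i in idxs:
            others = [j for j in range(len(labels)) if j != i]
            nearest = min(sep[i][j] for j in others)
            ties = [j for j in others if sep[i][j] == nearest]
            if any(labels[j] != species for j in ties):
                success = False  # some nearest neighbor lies outside the species
        pcis[species] = 1.0 if success else 0.0
    return pcis


def overall_pci(pcis, pcr_success=None):
    """Unweighted mean of species PCIs; with pcr_success, the mean of p' = p * s."""
    values = [p * (pcr_success[sp] if pcr_success else 1.0) for sp, p in pcis.items()]
    return sum(values) / len(values)


def confidence_interval(p_hat, n, z=2.0):
    """Normal-approximation interval; z = 2 corresponds to roughly 95%."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Toy data: two specimens each of species A and B, plus a singleton "decoy" C.
labels = ["A", "A", "B", "B", "C"]
sep = [
    [0, 1, 5, 6, 4],
    [1, 0, 5, 6, 4],
    [5, 5, 0, 3, 2],
    [6, 6, 3, 0, 7],
    [4, 4, 2, 7, 0],
]
pcis = species_pcis(labels, sep)                      # the decoy C defeats species B
unadjusted = overall_pci(pcis)                        # upper bound (perfect PCR)
adjusted = overall_pci(pcis, {"A": 0.8, "B": 1.0})    # lower bound (present PCR)
print(pcis, unadjusted, adjusted)

# Interval for a hypothetical study: overall PCI 0.69 estimated from 50 species.
print(confidence_interval(0.69, 50))
```

The toy data set is far too small for the normal approximation (the rule of thumb above asks for at least 20 species), which is why the interval is computed for separate hypothetical values rather than for the toy result.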

3. The Summary of the Workflow

Selection of a DNA barcode has been problematic for some species, but there is now a general consensus on the measurement of barcode efficacy. The procedure for measuring barcode efficacy can be broken into several steps.

First, assemble databases corresponding to the prospective barcodes. The choice of database must be given careful consideration because it can noticeably influence a study’s conclusions. To ensure proper controls, specimens not having a sequence in every marker database should be eliminated from consideration. Because GenBank taxonomy might be undependable, and because most GenBank sequences do not specify a corresponding taxonomic specimen, studies based on GenBank data do not have the same status as a controlled taxonomic study, particularly for barcodes based on more than one marker.

Second, select a computer algorithm for assigning species to barcode sequences. No algorithm seems to improve noticeably on assigning to a specimen the species of its nearest neighbor within a barcode database. A global alignment (e.g., with the Needleman–Wunsch algorithm, or some similar algorithm) is recommended, to take advantage of all the information in a barcode sequence. By contrast, BLAST is a local alignment program, which might match only small subsequences within two sequences. Thus, the use of BLAST runs an unnecessary risk when evaluating any prospective barcode, particularly one with an intergenic marker. As long as alignments are in essence global, alignment similarities, alignment distances, and evolutionary distances like p-distance, Kimura 2-Parameter Distance, etc., seem to have approximately equal efficacies in identifying species.

Consensus has emerged on “the probability of correct identification” (PCI) as the appropriate measurement of barcode efficacy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. If a few data subsets are particularly important (e.g., angiosperm, basal, and gymnosperm subsets within a plant data set), the PCI for the subsets can be reported separately.

To calculate a species PCI, remove in turn each specimen in the species from the database, and consider its separation from the remaining specimens (under, e.g., p-distance). Various definitions of identification success within a species are possible: (1) every specimen in the species is closer to all other specimens in the species than to any other specimen; (2) each specimen in the species has another specimen in the species as its nearest neighbor; (3) more stringent versions of the two foregoing definitions, where ties outside the species for a nearest neighbor, or assignment of other species to the species in question, also connote failure; (4) less stringent criteria for success (e.g., for k > 1, the specimen’s nearest k neighbors must contain at least one other specimen from the same species); or (5) probabilistic measures of success, like the fraction of specimens within a species displaying one of the foregoing definitions of success. Scientific purpose makes different definitions of “successful assignment” appropriate to different circumstances.

To estimate success under present technology, PCI must account for PCR failure. Although the case of a barcode with several markers has been relegated to the Appendix, the case of a barcode with only one marker poses no difficulties. Simply estimate the rate of PCR failure within each species by using all specimens, not just the ones with completely successful PCRs. Multiplication of a species PCI by the PCR success rate within the species yields a “PCR-adjusted” species PCI, which can then be averaged over species to yield a PCR-adjusted overall PCI. The overall PCI after technological advance is bounded below by the PCR-adjusted overall PCI; similarly, it is bounded above by the overall PCI (which derives from PCR successes only). Thus, present technology bounds prospects for an overall PCI.

A database provides a statistical sample of all possible data. The overall PCI calculated from a database is therefore a statistical estimate of the true overall PCI, and as such, it carries a statistical error. The errors are sometimes surprisingly large, and the differences in barcode efficacy correspondingly small.

For software relevant to this chapter, see http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/barcode/.
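The leave-one-out procedure and the PCR adjustment summarized above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' software: it assumes sequences that are already globally aligned to equal length, uses p-distance, scores each species by the fraction of its specimens whose nearest neighbor is conspecific (definition 5 applied to definition 2), and treats a tie with another species as a failure (the stringent variant in definition 3). All function names, species labels, sequences, and PCR success rates are hypothetical.

```python
from collections import defaultdict


def p_distance(a, b):
    """Fraction of sites that differ between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def species_pcis(specimens):
    """Per-species fraction of specimens whose nearest neighbor is conspecific.

    specimens: list of (species, aligned_sequence) pairs with successful PCR.
    Each specimen is removed in turn and compared with all the others.
    Assumption: a tie with another species counts as a failure, and a
    specimen with no other specimen to compare against also fails.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for i, (species, seq) in enumerate(specimens):
        totals[species] += 1
        others = [(p_distance(seq, other_seq), other_species)
                  for j, (other_species, other_seq) in enumerate(specimens)
                  if j != i]
        if not others:
            continue
        best = min(distance for distance, _ in others)
        nearest = {sp for distance, sp in others if distance == best}
        if nearest == {species}:
            hits[species] += 1
    return {sp: hits[sp] / totals[sp] for sp in totals}


def overall_pci(per_species, pcr_success=None):
    """Unweighted mean of species PCIs; PCR-adjusted if success rates are given."""
    adjusted = [pci * (1.0 if pcr_success is None else pcr_success[sp])
                for sp, pci in per_species.items()]
    return sum(adjusted) / len(adjusted)


# Hypothetical toy data: sp_A and sp_C share one identical barcode.
specimens = [
    ("sp_A", "ACGTACGT"), ("sp_A", "ACGTACGA"),
    ("sp_B", "GTGTACGT"), ("sp_B", "GTGTACGT"),
    ("sp_C", "ACGTACGT"), ("sp_C", "CCTTCCTT"),
]
per_species = species_pcis(specimens)
upper = overall_pci(per_species)  # PCR failures ignored
lower = overall_pci(per_species, {"sp_A": 0.9, "sp_B": 0.8, "sp_C": 1.0})
print(per_species, round(lower, 3), round(upper, 3))
```

The two values returned by `overall_pci` correspond to the lower and upper bounds discussed in the text: the PCR-adjusted overall PCI and the overall PCI computed from PCR successes only.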

Acknowledgment

This research was supported in part by the Intramural Research Program of the NIH, NLM, NCBI.

Appendix

For a barcode with several markers, each of which can have a failed PCR, specimen identification ultimately relies on the markers with a successful PCR. To quantify the identification process, number the markers $\{1, 2, \ldots, m\}$, and consider any subset $M$ of $\{1, 2, \ldots, m\}$. For a particular specimen, let the probability that $M$ is the subset of markers with PCR success be denoted by $s_M$, and let the PCI for the barcode based on the marker subset $M$ be $p_M$. A species PCI $p$ can then be calculated from the values of $s_M$ and $p_M$ (although the calculation depends on the definition of species PCI: see Section 2.3 for various definitions).
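As an illustration of how the subset probabilities and subset PCIs might be combined, the sketch below computes the probability-weighted average over marker subsets that this Appendix describes. It makes one assumption that the text does not: PCR outcomes are treated as independent across markers, so that each subset probability is a product of per-marker success and failure rates. The function names, marker names, and numerical values are hypothetical.

```python
from itertools import combinations


def subset_probabilities(pcr_success):
    """Probability of each marker subset being exactly the set of PCR successes.

    pcr_success: dict mapping marker name -> probability of PCR success.
    ASSUMPTION (not from the text): PCR outcomes are independent across markers.
    """
    markers = sorted(pcr_success)
    s = {}
    for size in range(len(markers) + 1):
        for subset in combinations(markers, size):
            prob = 1.0
            for marker in markers:
                rate = pcr_success[marker]
                prob *= rate if marker in subset else 1.0 - rate
            s[frozenset(subset)] = prob
    return s


def pcr_adjusted_species_pci(p_by_subset, s_by_subset):
    """Sum of p_M * s_M over non-empty subsets M (the empty subset has PCI 0)."""
    return sum(p_by_subset[subset] * prob
               for subset, prob in s_by_subset.items() if subset)


# Hypothetical two-marker barcode with made-up success rates and subset PCIs.
s = subset_probabilities({"marker_1": 0.9, "marker_2": 0.8})
p = {
    frozenset({"marker_1"}): 0.6,
    frozenset({"marker_2"}): 0.7,
    frozenset({"marker_1", "marker_2"}): 0.8,
}
adjusted = pcr_adjusted_species_pci(p, s)
print(round(adjusted, 3))
```

With a single marker the same function reduces to the product of the species PCI and the PCR success rate, matching the single-marker adjustment in the main text.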

One very reasonable definition of the PCR-adjusted species PCI is the average $p = \sum_{M} p_M s_M$, where the sum runs over all marker subsets $M$. For the case of a barcode based on a single marker, $M$ is a subset of $\{1\}$, i.e., the empty set $\{\,\}$ or $\{1\}$. Because the empty set $\{\,\}$ corresponds to a complete absence of information about a specimen, the corresponding PCI is $p_{\{\,\}} = 0$, so $p = p_{\{\,\}} s_{\{\,\}} + p_{\{1\}} s_{\{1\}} = p_{\{1\}} s_{\{1\}}$, which agrees with the formula for the PCR-adjusted PCI in the main text, for a barcode based on a single marker.

References

1. Hebert PD, Cywinska A, Ball SL, Dewaard JR (2003) Biological identifications through DNA barcodes. Proc Biol Sci 270:313–321
2. Floyd R, Abebe E, Papert A, Blaxter M (2002) Molecular barcodes for soil nematode identification. Mol Ecol 11:839–850
3. Hebert PD, Ratnasingham S, Dewaard JR (2003) Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc Biol Sci 270:S96–S99
4. Hajibabaei M, Janzen DH, Burns JM et al (2006) DNA barcodes distinguish species of tropical Lepidoptera. Proc Natl Acad Sci U S A 103:968–971
5. Hogg ID, Hebert PDN (2004) Biological identification of springtails (Hexapoda: Collembola) from the Canadian Arctic, using mitochondrial DNA barcodes. Can J Zool 82:749–754
6. Lorenz JG, Jackson WE, Beck JC, Hanner R (2005) The problems and promise of DNA barcodes for species diagnosis of primate biomaterials. Philos Trans R Soc Lond B Biol Sci 360:1869–1877
7. Meyer CP, Paulay G (2005) DNA barcoding: error rates based on comprehensive sampling. PLoS Biol 3:e422
8. Saunders GW (2005) Applying DNA barcoding to red macroalgae: a preliminary appraisal holds promise for future applications. Philos Trans R Soc Lond B Biol Sci 360:1879–1888
9. Smith MA, Fisher BL, Hebert PDN (2005) DNA barcoding for effective biodiversity assessment of a hyperdiverse arthropod group: the ants of Madagascar. Philos Trans R Soc Lond B Biol Sci 360:1825–1834
10. Smith MA, Woodley NE, Janzen DH et al (2006) DNA barcodes reveal cryptic host-specificity within the presumed polyphagous members of a genus of parasitoid flies (Diptera: Tachinidae). Proc Natl Acad Sci U S A 103:3657–3662
11. Chase MW, Salamin N, Wilkinson M et al (2005) Land plants and DNA barcodes: short-term and long-term goals. Philos Trans R Soc Lond B Biol Sci 360:1889–1895
12. Cowan RS, Chase MW, Kress WJ, Savolainen V (2006) 300,000 species to identify: problems, progress, and prospects in DNA barcoding of land plants. Taxon 55:611–616
13. Kress WJ, Erickson DL (2008) DNA barcodes: genes, genomics, and bioinformatics. Proc Natl Acad Sci U S A 105:2761–2762
14. CBOL Plant Working Group (2009) A DNA barcode for land plants. Proc Natl Acad Sci U S A 106:12794–12797
15. Meier R, Shiyang K, Vaidya G, Ng PK (2006) DNA barcoding and taxonomy in Diptera: a tale of high intraspecific variability and low identification success. Syst Biol 55:715–728
16. Huang D, Meier R, Todd PA, Chou LM (2008) Slow mitochondrial COI sequence evolution at the base of the metazoan tree and its implications for DNA barcoding. J Mol Evol 66:167–174
17. Erickson DL, Spouge JL, Resch A et al (2008) DNA barcoding in land plants: developing standards to quantify and maximize success. Taxon 13:1304–1316
18. Kress WJ, Erickson DL (2007) A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS One 2:e508
19. Austerlitz F (2007) Comparing phylogenetic and statistical classification methods for DNA barcoding. Paper presented at the second international barcode of life conference, Taipei, Taiwan, 2007
20. Little DP, Stevenson DW (2007) A comparison of algorithms for the identification of specimens using DNA barcodes: examples from gymnosperms. Cladistics 23:1–27
21. Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
22. Altschul S (1999) Hot papers – bioinformatics – gapped BLAST and PSI-BLAST: a new generation of protein database search programs by S.F. Altschul, T.L. Madden, A.A. Schaffer, J.H. Zhang, Z. Zhang, W. Miller, D.J. Lipman – comments. Scientist 13:15
23. Wouters MA, Husain A (2001) Changes in zinc ligation promote remodeling of the active site in the zinc hydrolase superfamily. J Mol Biol 314:1191–1207
24. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
25. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
26. Eddy SR (1995) Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol 3:114–120
27. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
28. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
29. Matz MV, Nielsen R (2005) A likelihood ratio test for species membership based on DNA sequence data. Philos Trans R Soc Lond B Biol Sci 360:1969–1974
30. Nielsen R, Matz M (2006) Statistical approaches for DNA barcoding. Syst Biol 55:162–169
31. Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
32. Felsenstein J (1988) Phylogenies from molecular sequences – inference and reliability. Annu Rev Genet 22:521–565
33. Kuksa P, Pavlovic V (2009) Efficient alignment-free DNA barcode analytics. BMC Bioinformatics 10:S9
34. Efron B, Stein C (1981) The jackknife estimate of variance. Ann Stat 9:586–596
35. Ferguson JWH (2002) On the use of genetic divergence for identifying species. Biol J Linnean Soc 75:509–516
36. Blaxter M, Mann J, Chapman T et al (2005) Defining operational taxonomic units using DNA barcode data. Philos Trans R Soc Lond B Biol Sci 360:1935–1943
37. Lambert DM, Baker A, Huynen L et al (2005) Is a large-scale DNA-based inventory of ancient life possible? J Hered 96(3):279–284
38. Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism. Academic, New York, pp 21–123
39. Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120
40. Jin L, Nei M (1990) Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol Biol Evol 7:82–102
41. Tamura K (1994) Model selection in the estimation of the number of nucleotide substitutions. Mol Biol Evol 11:154–157
42. Waterman MS, Smith TF, Beyer WA (1976) Some biological sequence metrics. Adv Math 20:367–387

Chapter 18

Plant DNA Barcodes, Taxonomic Management, and Species Discovery in Tropical Forests

Christopher W. Dick and Campbell O. Webb

Abstract

DNA barcodes have great potential for species identification and taxonomic discovery in tropical forests. This use of DNA barcodes requires a reference DNA library of known taxa with which to match DNA from unidentified specimens. At an even more basic level, it presupposes that the species in the regional species pool have Latin binomials. This is not the case in species-rich tropical forests in which many species are new to science or members of poorly circumscribed species complexes. This chapter describes a workflow geared toward taxonomic discovery, which includes the discovery of new species, distribution records, and hybrid forms, and to management of taxonomic entities in forest inventory plots. It outlines the roles of laboratory technicians, field workers and herbarium-based taxonomists, and concludes with a discussion of potential multilocus nuclear DNA approaches for identifying species in recently evolved clades.

Key words: Tropical trees, Metadata, Vouchers, Taxonomy, Herbarium, DNA barcode, Discovery

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_18, © Springer Science+Business Media, LLC 2012

1. Introduction

Tropical forests contain over 90% of the world’s tree diversity (1). Single hectares of highly diverse Asian or South American forests contain more tree species than are found in the whole of eastern North America, or in the vast circumboreal forests. Yet our knowledge of these tropical plant species is poor, and the recorded species are only a fraction of the true species pool in most areas. For example, the Catalog of the Vascular Plants of Ecuador (2) lists only ca. 4,000 vascular plant species for the Ecuadorian Amazon (a low-intensity sampling of seven million hectares), whereas exhaustive sampling of a single hectare of forest in the Ecuadorian Amazon yielded over 900 vascular plant species (3). Based on simple area versus species richness relationships, it is thus likely that the real tree richness of the region is many times 4,000 species (4), and that
many of the additional species will be new to science; Prance (5) has estimated that 1 in 100 plant specimens collected from remote tropical forests, such as those in the Amazon basin and Papua New Guinea, come from as yet undescribed species. Both the high species diversity and the high probability of encountering undescribed or at least poorly delineated taxa create challenges for biologists working on the inventory of tropical forest plants. One of the major challenges of setting up an inventory plot in tropical forest is simply keeping track of the morphotype identity of each tree, which must then be followed by the lengthy process of matching the morphotypes to known species. DNA barcodes can assist at both stages, accelerating the matching of morphotypes across a plot (via matching DNA haplotypes), and then matching taxonomic entities in the plot to named taxa. Many ecologists involved with tropical forest inventory are excited by the prospect of using DNA barcodes to identify species (including new species), but they may not be familiar with (1) collection methods for molecular samples, (2) standard methods for making herbarium vouchers, or (3) the kinds of metadata that are needed to create DNA barcode reference libraries and describe new species. For a relatively low additional cost, an inventory program not primarily focused on plant taxonomy (e.g., inventory of carbon or the forest dynamics studies; ref. 6) can make high-quality DNA and physical vouchers which can be used to address urgent biodiversity questions.

1.1. DNA Barcoding Links to Systematics and Ecology

While traditional taxonomic work (e.g., collecting, matching, describing, publishing) will eventually increase global estimates of tropical tree diversity and yield new taxa, the specimens from botanical inventories often spend years in storage before they can be identified or described as new species (7). The “taxonomic bottleneck” is more pronounced now than ever due to the global decline in numbers of taxonomic specialists working on tropical groups. New species descriptions are also limited by inadequate representation of taxa in herbarium collections: herbarium-based specialists cannot describe the many new species that are not extremely distinctive because data pertaining to geographic range and morphological variation is frequently unavailable. DNA barcodes can assist in this process of “taxonomic discovery,” which can take the form of expanding species range information, delimiting species in closely related taxa, standardizing the nomenclature of species with multiple names (synonymy), or even recognizing species that are new to science. The effectiveness of DNA barcodes for taxonomic discovery depends on the preexistence of DNA records for many of the taxa under study, the so-called DNA reference library, but these libraries for tropical plant taxa are currently being built rapidly, primarily by ecologists working on forest inventories.

Tropical forest inventory plots provide several advantages for taxonomic discovery. Because the trees are tagged, the forest plots serve as living museums in which individual trees or their conspecific populations may be revisited to obtain additional data. Tropical forest plots delimit enough of the local flora to facilitate development of DNA barcode reference libraries for use in broader regional studies. Finally, the plot networks already have many of the human resources and institutional ties to universities and herbaria that are needed to sustain long-term research. An end goal of DNA-based forest inventory should be to standardize taxonomy across regional networks of forest plots, and thereby advance botanical knowledge of these relatively unexplored regions (6). For example, to date, very few of the sets of vouchers from any Center for Tropical Forest Sciences (CTFS) Forest Dynamics Plot have been compared to a set from another plot because of the cost and logistical difficulty of such a cross-plot matching; DNA barcode matching among plots, on the other hand, could be done nearly instantaneously if the data were available. The transfer of physical specimens and data between field workers, lab technicians, and systematists can be organized as a “workflow.” Field workers collect the specimens and metadata, passing the plant tissue to a molecular laboratory and the pressed specimens to a herbarium (Fig. 1). The DNA barcode data then can assist both the field ecologists with basic management of taxonomic entities (morphotyping and matching vouchers), and the herbarium workers with matching to named species, and with range extensions, new species discovery, etc. There should be established lines of communication between laboratory technicians, systematists, and the field workers (or their local supervisor). For example, the lab workers may need additional plant tissue (e.g., from the vascular cambium instead of leaves) if PCR repeatedly fails. 
The plot workers should receive training from systematists prior to the collections, if possible, because taxonomic specialists often have tips for collecting taxonomically diagnostic field information for their particular groups (e.g., ref. 8). The components of the total workflow that we focus on in this chapter are those geared toward the field ecologists, especially graduate students and postdocs, who work in forest inventory plots, such as the large (25–50 ha) tree inventory plots managed by CTFS (see Chapter 22) or the RAINFOR network of smaller (e.g., 1 ha) forest inventory plots scattered across the Amazon basin (9). We touch briefly on field methods, but make reference to additional sources. We do not emphasize molecular methods, which are described in detail by Fazekas et al. (Chapter 11) in this volume. We end with a discussion of multilocus nuclear DNA markers, which, in addition to the standard chloroplast DNA barcodes, will be necessary to rigorously test taxonomic hypotheses in forest inventory plots.

Fig. 1. Summary of the workflow that uses DNA barcodes for the purpose of taxonomic discovery in tropical forest inventory plots.

2. Field Materials

The materials for a DNA barcode project will depend on the magnitude of the project, for example, whether the goal is complete taxonomic inventory of a large forest plot, or collection and identification of focal groups by graduate students. The following materials would be especially useful for a graduate student getting involved in a taxonomic inventory project.

1. Collecting equipment: CWD uses a Jameson 2.43 m (8 ft) fiberglass pole clipper (Sherrill Inc.) with heavy-duty head for cutting branches up to 4.5 cm and four additional poles for a reach of 9.75 m (32 ft). Additional rope (approximately 40 ft) is needed with extensions. Additionally, local botanists need hand clippers and machetes with leather sheaths, wrist-mounted slingshots and replacement rubber tubing, and climbing equipment to access the forest canopy. The kinds of climbing equipment vary from canvas belts used to shimmy up tree trunks to more elaborate single-rope climbing techniques. A description of rope-climbing techniques for tree climbing is beyond the scope of this chapter. Safety issues are a prime
concern, and should be considered and implemented, even if field workers are willing to take risks, and an insurance policy should be explicitly established for village assistants before they climb. Ref. 10 outlines safety considerations. A rubber mallet and >1 in. gasket hole punch may be needed to obtain vascular cambium tissue for DNA extraction (11).

2. Specimen drying equipment: A portable plant dryer can be made using a space heater and canvas cloth (12); and propane-fueled dryers with plywood frames are available in many large research stations. Plant presses for a drying oven should include the wooden mounting boards, tightening straps, blotting paper or cardboard, newspapers, and corrugated aluminum sheets to spread heat into the plant bundles (available at Forestry Suppliers Inc.). Enough newspapers should be available for layering between individual specimens.

3. Camera equipment: An optimal setup for making photographic vouchers would be a 35-mm digital camera with zoom lens and macro capacity, a ring flash for close-up shots, a tripod and black or gray cloth to use as standardized background. However, excellent results can be obtained with a high-end compact camera with built-in flash and macro (we recommend the Panasonic Lumix LX or GF series, or Canon G series). If in the field in the wet tropics for more than a few weeks, the camera and lens should be stored in an airtight container containing a desiccant, such as silica gel, as the humidity is conducive to the growth of fungi that destroy lens coatings.

4. Miscellaneous items: A large amount of relatively inexpensive equipment is needed to make field collections. For a more extensive list of items, refer to “Field Techniques used by Missouri Botanical Garden” (13).
The items should include field notebooks that are small enough so as to contain relatively little information if lost (e.g., Rite in the Rain brand); garbage bags and Ziploc bags of varied sizes; alcohol and specimen jars to preserve small flowers; a hand-held Global Positioning System (GPS) receiver capable of fast reception under the forest canopy (e.g., Garmin CSx series); and fine silica gel with indicator beads to preserve leaf tissue for DNA extraction. Although fine silica gel is available from scientific supply companies, this high-grade variety is expensive and can be substituted with silica gel from florists.

3. Methods

3.1. Field Collections: What, When, and How

1. Replicate sampling: The collections for each morphospecies should include multiple individuals (n = 3–5) representing the full local habitat range (e.g., wetland and upland) and any
morphological variation that has been noted by workers in the plot, such as variants in bark color or texture or leaf shape or color. These variants may turn out to be cryptic species ((6, 14); Fig. 23). Because forest inventory plots may contain many species represented by single or just a few large trees, additional collections may be needed from outside of the plot in habitats in which the rare plot species may be more abundant.

2. Phenology: Because many tree populations do not reproduce annually, one would ideally make collections over the course of the year and for more than one year in order to obtain representative fruit and floral collections. The collections should be concentrated during seasons in which fruits or flowers are locally most likely.

3. Field observations: Some information about the tagged tree may be obtained from prior inventory data (e.g., DBH, preliminary species assignment, coordinates within plot, tag number, and habitat). The following additional information should be noted for inclusion in the herbarium label: GPS coordinates; date of collection; collector name and collection number; presence of trunk buttresses; and bark texture (13). If time permits, noting 5–10 leaf and bark characters for each species can be used to develop basic identification keys for a local flora (15), and can serve to organize photographic resources.

4. Photographic metadata: The field collection presents an opportunity to obtain photographs of fresh flowers and fruits, which contribute valuable information for future species identifications with or without the use of DNA barcodes. In the Gunung Palung flora project in Borneo (16), workers take 10–20 images of each fresh plant, including bark slash, whole twig, twig tip, twig surface, stipules, whole leaf (above and below), close-up of leaf base underside and petiole, inflorescence, flower (or fruit) at different angles, and longitudinal and transverse sections.
Slashing trunks to expose the inner bark is not recommended in forest dynamics plots as it may influence mortality. Each photograph should include a ruler for size scale, and a paper tag with the collector name and collection number (or plot tag number) to avoid confusion about the association of photos and specimens.

5. DNA sampling: From the fresh material collected for the herbarium voucher, select a single young leaf that is neither too tender (the DNA will degrade rapidly in a wilting young leaf) nor excessively damaged by herbivores or covered with epiphylls. Clean a leaf with a dry cloth and cut a 2 × 2 cm square using scissors. Very little plant tissue (e.g., 20 mg dry tissue) is needed for DNA extraction. A common mistake is to collect too much leaf tissue for DNA sampling using the silica-gel approach. If too much tissue (e.g., an entire leaf) is collected,
the leaf will dry slowly or incompletely, resulting in DNA degradation. Place the tissue sample immediately into a sealed Ziploc sandwich bag or 50-mL Falcon centrifuge tube containing 20 mL of dried silica gel and colored indicator beads. Alternatively, place the sample in a permeable fiber bag (e.g., tea-bag) in a larger box filled with fine silica gel; this prevents fragments of brittle or tender leaves from contaminating the silica gel. Having an excess of silica gel is important for maximum rate of drying. Wipe off scissors with alcohol after each use. Check leaf tissue after one day. If it is brittle and breaks when bent, the silica may be removed and reused if still dry (see color of indicator beads) or baked and reused. Care must be taken not to contaminate samples when reusing silica gel. The dried leaf may be stored indefinitely in a labeled coin envelope inside of an airtight plastic container kept dry with silica gel. There are as yet no standardized protocols for long-term storage of plant tissue for subsequent DNA work. Anecdotal evidence suggests that freezing best preserves DNA in silica-dried leaves. One alternative to silica gel is to flash freeze the leaves in liquid nitrogen (N). Liquid N is available in many developing countries because it is used to preserve semen for animal breeding. Flash freezing produces more genomic DNA than silica drying and can maintain RNA for transcriptome sequencing. The disadvantage is the difficulty of handling liquid nitrogen tanks in the field, and the expense of long-term storage of frozen material. Liquid N is typically not permitted on flights, so the samples will need to be transported in dry ice, or in a dry shipper. Other alternatives to using silica gel include placing samples in CTAB buffer solution (11), or using Whatman FTA cards (http://www.whatman.com/). An alternative to using leaf material is obtaining DNA from the vascular cambium (17).
Because there may be less need for a plant to invest in defensive secondary compounds in vascular cambium than in leaves, DNA extractions from vascular tissue may be more successful for PCR in some taxa. Pound the gasket-hole puncher into the trunk to wood level. Carefully separate the cambium tissue from the inner bark and place in silica gel. Wipe the gasket punch opening with alcohol or bleach to prevent contamination of the next sample.

6. Specimen preservation: The specimen vouchers should be dried on the day of collection when possible, in an arrangement that best demonstrates all of the salient taxonomic characters (e.g., leaf tips, base and underside; stipules, fruits, etc.) (18). Some difficult groups, such as palms, require more specialized arrangement techniques (13). The plant press must be kept tight to prevent wrinkling of material, and retightened through
the course of drying as the material shrinks. The dryer should provide even airflow and temperatures of 35–45°C (12). Rapid drying retains color of specimens but overly high temperatures can produce darkened and brittle specimens. When collecting in remote areas outside of the field stations, one can layer the fresh collections in newspaper and soak with 90% ethanol (or even methanol, used for lighting lamps, in a pinch). This method will keep the plant parts together until they can be dried, but it produces darkly colored specimens and degrades DNA. For alcohol-preserved specimens, fresh leaves should be separately dried with silica gel for use in DNA extraction.

7. Taxonomic sorting: For very large forest inventory plots (e.g., ≥25 ha), sorting all of the designated morphospecies into higher taxonomic ranks can take years, especially if the initial inventory utilized sterile vouchers (19). Key steps in the determination of trees are: (1) collecting “daily vouchers” (either fallen leaves or sterile twigs) for all morphotypes encountered each day, while doing “within-day” matching for trees examined (i.e., “tree 1234 = tree 1245”); (2) matching the daily vouchers to a growing field herbarium collection, assigning field morphotype codes, splitting types where uncertain, and “synonymizing” identical morphotypes with different morphotype codes; (3) determining which taxa can be identified reliably by field crews without further voucher collections (there are always a few common, well-known taxa that “anyone” can spot). This process is time-consuming and tends to slow down because an increasing number of morphotypes have to be checked. If the period of sampling is long enough that DNA work can be carried out at the same time, and if enough stems can be sampled for DNA, then sequence data can be used to speed up the matching process (Fig. 2).
If the sequence of a new tree can be queried against GenBank (or another available DNA reference library), or placed in a dynamic, community “guide phylogeny” (automatically rebuilt; see Chapter 19), to find a closely related taxon, then the number of vouchers in the field herbarium to which the new tree’s voucher must be compared can be reduced. If there is an exact match of the new tree’s sequence to a sequence of a previously collected tree, then the first voucher to compare with the new tree’s voucher is that of the latter. Because the discriminating power of DNA barcodes in some groups is low (17), we unfortunately cannot expect a direct match of DNA sequence to indicate an exact match of all morphotypes (Fig. 2).

3.2. DNA Barcode Reference Library

The difference between a DNA barcode reference library and a standard DNA sequence database entry (e.g., a standard GenBank

[Fig. 2 graphic: three panels pairing query barcode sequences with reference records. (i) Intra-plot (plot A): trees 5000, 6000, and 7000 compared against trees 1000–4000 and their plot morphotypes. (ii) Regional/among plots: plot A morphotypes 070, 005, 011, and 100 compared against morphotypes from plots B and C. (iii) Herbarium/global database: plot-wide morphotypes 270, 205, 311, 403, and 500 compared against GenBank/BoLD records.]

Fig. 2. Hypothetical examples of the use of DNA barcodes for taxonomic management and discovery. (i) Intra-plot matching. DNA from tree 5000 matches only DNA from tree 4000: it is likely that tree 5000 and tree 4000 are the same morphotype and the same species, but a physical comparison is recommended in case two closely related species have identical DNA barcodes. Time saved by using DNA barcodes: only one physical comparison is needed, versus many if no barcodes are available. DNA from tree 6000 matches a DNA sequence from three trees, which have already been found to belong to two distinct morphotypes (probably in the same genus): physical comparison is mandatory, to determine the morphotype of tree 6000. Time saved: only two morphotypes need to be compared with tree 6000. DNA from tree 7000 does not match DNA from any other tree: it is possible, but unlikely, that a physical comparison would find an identical morphotype and reveal a cryptic species. Time saved: physical comparison of tree 7000 is a low priority and could be skipped in some cases. (ii) Inter-plot matching. DNA from plot A morphotype 070 matches only DNA from plot B morphotype 060: it is likely that these morphotypes are the same, and are the same species, but a physical comparison is recommended, in case (a) two closely related but morphologically distinct species have identical DNA barcodes, or (b) there is geographical variation in morphology in one species. In the case of the latter, a taxonomic decision (one species or two) may require herbarium work (see below). Time saved: only one comparison is needed. Identical DNA from plot A distinct morphotypes 005 and 011 matches DNA from plot B morphotype 010 and plot C morphotype 003: thorough physical matching is needed among all four source morphotypes, to determine if there are two, three, or four plot-network-wide morphotypes. Time saved: only these four morphotypes need to be compared, rather than all members of a tentative genus.
DNA from plot A morphotype 100 does not match DNA from any other plot morphotype: probably a unique morphotype and species. Time saved: physical comparison of plot A morphotype 100 is a low priority. A final physical review of all morphotypes should be completed, among morphotypes clustered by similar DNA (or by tentative genera, if these have been assigned by field botanists), to determine if there are potentially cryptic species, revealed by different DNA, but having identical morphology. (iii) Herbarium and DNA database matching. DNA from plot-wide morphotype 270 BLASTs to an identical match with Shorea parvifolia: it is likely that morphotype 270 is indeed S. parvifolia, but with relatively few Shorea having ever been sequenced, other Shorea may have identical barcodes, hence all Shorea in the same section rank should be compared morphologically with vouchers of morphotype 270. If the match is indeed to S. parvifolia, then a taxonomic discovery may be made (range expansion, minor morphological variation, etc.). Identical DNA from plot-wide distinct morphotypes 205, 311, and 403 BLASTs to an identical match with Santiria tomentosa, S. indica, and a close match to S. sumatrana. Thorough physical matching in the herbarium (and in monographs) is needed for the three morphotypes, focused on the three possible Santiria species, but including all likely Santiria, if possible. DNA from plot-wide morphotype 500 does not BLAST to any sequence in any database: the morphotype may be a new species, but more likely it is a species that has been collected before but not sequenced. A herbarium and book search should follow, directed either by a taxonomist’s recognition of genus, or starting with taxa with similar DNA sequences.
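The intra-plot triage described in panel (i) of this caption can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: matching is by exact sequence identity (as in the hypothetical 8-bp examples), and the function name `triage_new_tree`, the tree labels, and the wording of the recommendations are ours, not part of any published tool.

```python
from collections import defaultdict


def triage_new_tree(query_seq, reference):
    """Return (recommendation, morphotypes) for a newly sequenced tree.

    `reference` maps tree IDs to (barcode sequence, morphotype code).
    Only exact sequence identity is tested, as in the Fig. 2 examples.
    """
    matches = defaultdict(list)  # morphotype -> trees sharing the query barcode
    for tree_id, (seq, morphotype) in reference.items():
        if seq == query_seq:
            matches[morphotype].append(tree_id)

    if not matches:
        return "no match: physical comparison is a low priority", []
    if len(matches) == 1:
        return "one morphotype matched: one confirmatory comparison", sorted(matches)
    return "several morphotypes matched: physical comparison is mandatory", sorted(matches)


# Hypothetical plot A reference data, taken from panel (i) of Fig. 2.
plot_a = {
    "tree1000": ("ACGTACGT", "005"),
    "tree2000": ("ACGTACGT", "005"),
    "tree3000": ("ACGTACGT", "011"),
    "tree4000": ("GTGTACGT", "070"),
}

new_trees = [("tree5000", "GTGTACGT"), ("tree6000", "ACGTACGT"), ("tree7000", "CCTTCCTT")]
for tree, seq in new_trees:
    advice, morphotypes = triage_new_tree(seq, plot_a)
    print(tree, morphotypes, advice)
```

In practice one would replace the identity test with a similarity search (e.g., BLAST), since real barcodes differ in length and quality; the decision logic would stay the same.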


entry) is that whereas the standard database publishes sequence information at face value, a DNA barcode entry bundles together two hypotheses that must be supported with metadata: (1) that the DNA sequence is accurate, and (2) that the species identification is accurate. The DNA sequence in a DNA barcode reference library must be accompanied by the raw data (chromatogram) so that other researchers can verify that differences in nucleotide sequence between species are robust and not merely sequencing artifacts. The metadata needed to address the taxonomic hypothesis are the herbarium voucher data and accompanying collection information, the most important of which are geographic location, photographs, and collection date. Several data platforms accommodate DNA barcode sequences and metadata. These include the Barcode of Life Data System (BOLD) (20) and the DNA barcode entry option of GenBank called “BarSTool”. Taxonomic metadata should also be registered in the Global Biodiversity Information Facility (GBIF; http://www.gbif.org), which serves as a repository for biodiversity information, including species ranges.

3.3. Role of the Herbarium

A significant added value of DNA barcode surveys is the associated specimens and genomic DNA that, if properly curated (21), can be used by future generations of biodiversity researchers. It is essential to provide the best quality voucher material (i.e., fertile material) for permanent herbarium curation (most herbaria will not accept sterile or poor-quality specimens). The herbarium provides the infrastructure for exchanging specimens with other institutions so that specialists can make taxonomic determinations and incorporate the specimen information into floras or species descriptions. Herbarium-based curators and systematists can recognize rare or novel taxa, and flag these for additional field collections or observations. Costs for herbarium curation need to be incorporated into research budgets, and collaboration agreements should be established prior to the initiation of a large-scale DNA barcode project. Herbarium staff members are often involved in acquiring research and collection permits, for example, which can be a time-consuming and laborious procedure that should be dealt with as early as possible.

3.4. Taxonomic Discovery

The discovery of new species, site records, variants or hybrids involves a comparison of morphological data (morphospecies designation) based on field observations and herbarium vouchers, and the DNA barcode haplotypes (Fig. 2). There are two deviations from the ideal one-to-one relationship between the DNA barcode and the locally defined morphospecies: (1) DNA barcodes are identical across multiple morphospecies, or (2) multiple DNA barcode haplotypes are found within a single putative morphospecies. Since each scenario can arise from different biological causes, these cases require further evaluation (Fig. 2).

18

Plant DNA Barcodes, Taxonomic Management, and Species Discovery…


Case 1: One DNA barcode for multiple morphospecies. When identical DNA barcodes are found in different morphospecies, it likely reflects a recent speciation history in which mutational differences among species have not yet accrued and sorted. Such is the case in species-rich tree genera, such as Inga and sections of the genus Ficus (22). The genetic discrimination of such taxa will require more variable DNA markers (23; and see Discussion). These cases underscore the need to maintain archived DNA for future genotyping. Shared cpDNA haplotypes may also be explained by hybridization. Hybridization can be detected in several ways, including: within a phylogenetic context as incongruence between nuclear and plastid phylogenies, by geographical associations of haplotypes shared across species (24), by levels of genetic admixture with nuclear loci (23), or by morphological intermediates between the putative species in the field. Taxonomic specialists often have a prior idea of the importance of hybridization in their taxa, based on their examination of morphological discontinuities among species.

Case 2: One morphospecies with multiple DNA barcodes. Variant DNA barcodes can be found within species across the geographic range, or even locally in some species (25). This can indicate the existence of morphologically cryptic or semicryptic species, which might have been lumped together as a single taxon by field workers (14). In this case, the field workers should revisit the individuals with divergent haplotypes, carefully examine adult individuals along with nearby seedlings and saplings, and collect samples from individuals representing the full range of morphological and ecological variation. If the DNA variation is consistently associated with certain morphological or ecological types, this can provide good evidence of multiple species.
These cryptic species can be flagged for further study focused on potential reproductive barriers, such as nonoverlapping phenology and habitat segregation (26). If the two cryptic species are not sister species, then they should also segregate in different nodes within a broadly sampled phylogeny (see Fig. 3).
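The two deviations discussed in this section can be flagged automatically once each sampled individual has a morphospecies code and a barcode haplotype. The following Python sketch shows one way to do so; the function name `flag_deviations`, the record layout, and the example data are invented for illustration, and haplotypes are assumed to have been defined beforehand (e.g., by collapsing identical sequences).

```python
from collections import defaultdict


def flag_deviations(records):
    """Split barcode/morphospecies conflicts into the two cases.

    `records` is an iterable of (individual, morphospecies, haplotype).
    Returns two dicts: haplotypes shared by several morphospecies (Case 1)
    and morphospecies carrying several haplotypes (Case 2).
    """
    species_by_haplotype = defaultdict(set)
    haplotypes_by_species = defaultdict(set)
    for _individual, morphospecies, haplotype in records:
        species_by_haplotype[haplotype].add(morphospecies)
        haplotypes_by_species[morphospecies].add(haplotype)

    case1 = {h: sorted(s) for h, s in species_by_haplotype.items() if len(s) > 1}
    case2 = {m: sorted(h) for m, h in haplotypes_by_species.items() if len(h) > 1}
    return case1, case2


# Invented records for illustration only.
records = [
    ("t1", "Inga sp. A", "H1"),
    ("t2", "Inga sp. B", "H1"),
    ("t3", "Trema sp. A", "H2"),
    ("t4", "Trema sp. A", "H3"),
    ("t5", "Ficus sp. A", "H4"),
]
case1, case2 = flag_deviations(records)
print("Case 1 (one barcode, several morphospecies):", case1)
print("Case 2 (one morphospecies, several barcodes):", case2)
```

Such a report only lists candidates for re-examination; deciding between recent speciation, hybridization, and cryptic species still requires the field and herbarium work described above.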

4. Discussion

Standard cpDNA barcodes will be useful for discriminating species across distant clades and within relatively old clades (e.g., with sister species divergences older than the Pleistocene). The Trema micrantha species complex (Fig. 3) provides an example in which DNA barcodes could be used to discriminate a cryptic species in a long-term forest inventory plot (28).


Fig. 3. Example of using DNA barcodes to diagnose cryptic species. On Barro Colorado Island (BCI), Panama, there was thought to be a single species of Trema, the common pioneer tree species Trema micrantha (27). Molecular studies in the 50 ha plot on BCI revealed highly divergent cpDNA and ITS haplotypes (Dick C, unpublished) among samples, which corresponded with two ecotypes that differ ecologically in light requirement and can be distinguished morphologically by the color of the endocarp (26). Yesson et al. (2004) showed that T. micrantha is a species complex, and that the two BCI morphotypes (T. micrantha 1 and 2) are not even sister species. Each morphotype is widespread, as indicated by sampling from Ecuador (EC), and forms clades with other species with high bootstrap support (*). This phylogeny was adapted from Fig. 2 in Yesson et al. (2004).

The more recently evolved species-rich groups (e.g., Inga and Ficus sections) may contain enough morphological variation for discrimination in the field, and yet be invariant using the standard plant DNA barcodes. When morphology is not useful for discriminating these species, alternative sets of DNA markers may be used. The nuclear Internal Transcribed Spacer provides more nucleotide substitution variation than most chloroplast DNA and may be amplified using universal primers. Closely related species with recent common ancestors are expected to share many alleles and haplotypes, but their reproductive isolation should be apparent in the form of distinct allele frequencies among syntopic (co-occurring in the same habitat) populations of the putative species. This requires a population genetics approach. The forest inventory plots provide an excellent system in which to detect reproductive barriers based on genetic differentiation using tools of population genetics because (1) populations of the target species are already mapped and available for analyses and (2) because the species occur in the same locale, the genetic differentiation analysis will not be confounded by differentiation due to isolation by distance processes.


Microsatellite DNA markers (also known as simple sequence repeats or SSRs) are the most commonly used DNA markers for such analyses because of their high mutation rate and allele richness within populations. Microsatellites are typically isolated from anonymous regions of the nuclear genome. However, because the flanking nucleotide sequences from which the primers are designed are also variable, microsatellite markers are often species-specific or transferable only to very closely related species. It is not feasible to develop novel microsatellite DNA markers for every potential cryptic species pair. When working within families, an alternative method is to develop microsatellite DNA markers from Expressed Sequence Tags (ESTs). ESTs are short DNA fragments of expressed genes obtained from messenger RNA (mRNA). Although the mRNAs code for proteins, they contain untranslated regions (UTRs) at the 3′ and 5′ ends with SSRs at a frequency of 1–2% (29). Because EST-SSR loci are adjacent to coding sequences, highly conserved PCR primers can be designed, which are transferable across species, genera, and even higher-level taxa (29). EST-SSRs have an additional advantage over anonymous nSSRs in that they generally do not produce null alleles (unamplified alleles) because of their highly conserved priming sites. EST-SSRs can be mined from online EST databases using Web-based bioinformatics search engines. There are currently more than 52 million ESTs in GenBank, including thousands from important and species-rich tropical tree families, such as Fabaceae, Rubiaceae, and Lauraceae. The multilocus dataset for multiple species can be analyzed using Bayesian clustering approaches that estimate the most likely number of genetic demes (K) in the sample (this can be done using the program STRUCTURE) (23, 30). If, for example, five morphospecies represent five distinct species, the analysis should infer K = 5 demes and assign all individuals to their morphospecies-defined deme.
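The EST-SSR mining step mentioned above can be approximated locally with a regular expression. The sketch below is a minimal illustration, not a substitute for dedicated SSR-mining software: it finds only perfect repeats, the motif length range (2–6 bp) and the minimum of five repeats are assumed thresholds, and the function name `find_ssrs` and the test sequence are invented.

```python
import re


def find_ssrs(seq, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect SSRs in `seq`.

    A motif of 2-6 bases must be repeated in tandem at least
    `min_repeats` times. Homopolymer runs (e.g., AAAA...) are skipped.
    """
    # Group 1 is the motif; \1{n,} requires n further tandem copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # mononucleotide run reported as a longer motif
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Invented EST fragment: an (AG)8 repeat flanked by non-repetitive sequence.
est = "GGCT" + "AG" * 8 + "TTCA"
print(find_ssrs(est))
```

Start positions are 0-based. Flanking sequence on either side of each hit would then be passed to a primer-design program.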
The existence of demes in forest plots is indicative of reproductive isolation (i.e., true species under a biological species concept) because there is not sufficient geographic distance within a plot to impede gene flow. The sample size often used for population genetic analyses is approximately 30 individuals per species (to obtain allele frequencies), using ca. 10 SSR loci (for multiple independent estimates of demes). Individuals should be sampled at spaced intervals (e.g., 50 m) throughout the plot to avoid sampling of close relatives. ESTs can also be a source of phylogenetically informative introns. Although introns are spliced from the mRNA, the EST can be compared to known genomes (e.g., Arabidopsis thaliana or Populus trichocarpa) to determine which ESTs span introns. From these, Exon Primed Intron Crossing (EPIC) markers can be developed (31). EPIC markers are expected to amplify nuclear introns broadly across higher-level taxa because of their highly conserved priming regions. Markers such as these will be useful for distinguishing among closely related species, and for developing phylogenies for establishing species relationships.
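The sampling guideline given above (about 30 individuals per species, spaced at intervals such as 50 m) can be applied to mapped plot data with a simple greedy selection. The Python sketch below assumes that tree coordinates are available in metres; the function name `spaced_sample`, the tag format, and the simulated 50 ha plot are illustrative assumptions.

```python
import math
import random


def spaced_sample(trees, n=30, min_dist=50.0, seed=0):
    """Greedily pick up to `n` trees that are pairwise >= `min_dist` m apart.

    `trees` maps a tree tag to its (x, y) plot coordinates in metres.
    Trees are visited in a reproducible random order set by `seed`.
    """
    rng = random.Random(seed)
    tags = sorted(trees)
    rng.shuffle(tags)
    chosen = []
    for tag in tags:
        x, y = trees[tag]
        far_enough = all(
            math.hypot(x - trees[c][0], y - trees[c][1]) >= min_dist
            for c in chosen
        )
        if far_enough:
            chosen.append(tag)
            if len(chosen) == n:
                break
    return chosen


# Simulated positions of 400 conspecific trees in a 1000 m x 500 m (50 ha) plot.
rng = random.Random(1)
plot = {f"tag{i:04d}": (rng.uniform(0, 1000), rng.uniform(0, 500)) for i in range(400)}
picked = spaced_sample(plot)
print(len(picked), "individuals selected")
```

A greedy pass does not guarantee the largest possible spaced sample, but for the sample sizes discussed here it is usually sufficient; rare species may yield fewer than `n` individuals.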


In summary, we see great potential for DNA analyses to assist in the management of taxonomic entities in species-rich forest inventory plots, and in the discovery of new species. We can imagine a time when a multilocus library of DNA sequences exists for all named species, with an estimate of sequence variation within each species, and when we can affordably sequence millions of base pairs for each individual in the plot using pyrosequencing. We could then match trees to local plot taxa, and to named species, without ever consulting physical vouchers (not that such a DNA-only approach would necessarily be desirable). Significantly original sequences would then almost certainly indicate species new to science. However, we are of course far from having these data available, and so DNA barcodes must be considered an additional valuable source of data in our taxonomic work, to be used in dialog with physical vouchers, rather than a goal in themselves.

Acknowledgments

Some of the methods were derived from research supported by the National Science Foundation (DEB awards 0640379 to CD, and 1020868 to CW) and the Center for Tropical Forest Science. We thank John Kress and David Erickson for the invitation and for useful ideas for the paper.

References

1. Fine PVA, Ree RH (2006) Evidence for a time-integrated species-area effect on the latitudinal gradient in tree diversity. Am Nat 168:796–804
2. Jørgensen PM, León-Yánez S (1999) Catalogue of the vascular plants of Ecuador. Missouri Botanical Garden, St. Louis, MO
3. Balslev H, Valencia R, Paz y Miño G, Christensen H, Nielsen I (1998) Species count of vascular plants in one hectare of humid lowland forest in Amazonian Ecuador. In: Dallmeier F, Comiskey JA (eds) Forest biodiversity in North, Central and South America, and the Caribbean: research and monitoring. Man and the biosphere series, vol 21. UNESCO, Paris, pp 585–594
4. Ruokolainen K, Tuomisto H, Kalliola R (2005) Landscape heterogeneity and species diversity in Amazonia. In: Bermingham E, Dick CW, Moritz C (eds) Tropical Rainforests, Past, Present and Future. University of Chicago Press, Chicago, pp 251–270
5. Prance GT, Beentje H, Dransfield J, Johns R (2000) The tropical flora remains undercollected. Ann Mo Bot Gard 87:67–71
6. Dick CW, Kress WJ (2009) Dissecting tropical plant diversity with forest plots and a molecular toolkit. Bioscience 59:745–755
7. Bebber DP, Carine MA, Wood JRI et al (2010) Herbaria are a major frontier for species discovery. Proc Natl Acad Sci U S A. doi:10.1073/pnas.1011841108
8. Mori SA, Prance GT (1987) A guide to collecting Lecythidaceae. Ann Mo Bot Gard 74:321–330
9. RAINFOR (Amazon Forest Inventory Network) http://www.geog.leeds.ac.uk/projects/rainfor/pages/project_eng.html. Last accessed on 28 Feb 2011
10. Laman TG (1995) Safety recommendations for climbing rain-forest trees with single rope technique. Biotropica 27:406–409
11. Colpaert N et al (2005) Sampling tissue for DNA analysis of trees: trunk cambium as an alternative to canopy leaves. Silvae Genetica 54:265–269
12. Blanco MA et al (2006) A simple and safe method for rapid drying of plant specimens using forced-air space heaters. Selbyana 27:83–87
13. Leisner R (Field Techniques used by the Missouri Botanical Garden) http://www.mobot.org/mobot/molib/fieldtechbook/welcome.shtml. Last accessed on 28 Feb 2012 (Missouri Botanical Garden, St. Louis, MO)
14. Janzen DH et al (2009) Integration of DNA barcoding into an ongoing inventory of complex tropical biodiversity. Mol Ecol Resour 9:1–26
15. Kress WJ (2004) Plant floras: how long will they last? A review of flowering plants of the Neotropics. Am J Bot 91:2124–2127
16. Webb CO, Slik JWF, Triono T (2010) Biodiversity inventory and informatics in Southeast Asia. Biodivers Conserv 19:955–972
17. Gonzalez M-A, Baraloto C, Engel J et al (2009) Identification of Amazonian trees with DNA barcodes. PLoS ONE 4:e7483
18. Herbarium University of Florida (Preparation of plant specimens for deposit as herbarium vouchers) http://www.flmnh.ufl.edu/herbarium/voucher.htm#Identification. Last accessed on 28 Feb 2012
19. Condit R (1998) Tropical forest census plots: methods and results from Barro Colorado Island, Panama and a comparison with other plots. Springer-Verlag, Berlin
20. Ratnasingham S, Hebert PDN (2007) BOLD: The Barcode of Life Data System (http://www.barcodinglife.org). Mol Ecol Notes 7:355–364
21. Savolainen V, Reeves G (2004) A plea for DNA banking. Science 304:1445
22. Kress WJ et al (2009) Plant DNA barcodes and a community phylogeny of a tropical forest dynamics plot in Panama. Proc Natl Acad Sci U S A 106:18621–18626
23. Duminil J, Caron H, Scotti I, Cazal SO, Petit RJ (2006) Blind population genetics survey of tropical rainforest trees. Mol Ecol 15:3505–3513
24. Saeki I, Dick CW, Barnes BV, Murakami N (2011) Comparative phylogeography of red maple (Acer rubrum L.) and silver maple (A. saccharinum L.): impacts of habitat specialization, hybridization and glacial history. J Biogeogr 38:992–1005
25. Dick CW, Heuertz M (2008) The complex biogeographic history of a widespread tropical tree species. Evolution 62:2760–2774
26. Silvera K, Skillman JB, Dalling JW (2003) Seed germination, seedling growth and habitat partitioning in two morphotypes of the tropical pioneer tree Trema micrantha in a seasonal forest in Panama. J Tropical Ecol 19:27–34
27. Croat TB (1978) Flora of Barro Colorado Island. Stanford University Press, Stanford, CA
28. Yesson C, Russell SJ, Parrish T et al (2004) Phylogenetic framework for Trema (Celtidaceae). Plant Syst Evol 248:85–109
29. Ellis JR, Burke JM (2007) EST-SSRs as a resource for population genetic analyses. Heredity 99:125–132
30. Pritchard JK, Stephens JC, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959
31. Li C, Riethoven JM, Ma L (2010) Exon-primed intron-crossing (EPIC) markers for non-teleost fishes. BMC Evol Biol 10:90

Chapter 19

Construction and Analysis of Phylogenetic Trees Using DNA Barcode Data

David L. Erickson and Amy C. Driskell

Abstract

The assembly of sequence data obtained from DNA barcodes into phylogenies or NJ trees has proven highly useful in estimating relatedness among species as well as providing a framework in which hypotheses regarding the evolution of traits or species distributions may be investigated. In this chapter, we outline the process by which DNA sequence data is assembled into a phylogenetically informative matrix, and then provide details on the methods to reconstruct NJ or phylogenetic trees that employ DNA barcode data, using only barcode data or combining barcodes with other data.

Key words: Nucleotide, Homology, Alignment, Parsimony, Likelihood, DNA barcode, Community phylogeny

1. Introduction

All molecular systematics rests on inferring relationships among species from patterns of substitution at homologous nucleotide (and/or amino acid) positions that vary among taxonomic groups. The result of these analyses is a phylogenetic hypothesis, commonly expressed as a “phylogenetic tree” (1). A phylogenetic tree describes the evolutionary relationships among species, and can also provide inference of the relative degree of divergence (in time or other units) that separates taxa. In a phylogenetic tree, the topology is the pattern of which taxa are grouped with each other, a clade is a grouping of two or more taxa, and the term distance is the length of the branches that connect taxa or clades, which can represent the time since those taxa or clades diverged from a common ancestor. The ability of DNA barcode data to wholly or in part contribute to the reconstruction of well-resolved molecular phylogenies offers tremendous value to the entire community

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_19, © Springer Science+Business Media, LLC 2012


of ecologists and evolutionary biologists who seek to elucidate and understand biological diversity, as well as benefiting those who employ phylogenetic data to address the ecological and evolutionary mechanisms that promote and maintain species diversity (2, 3). The construction of these trees can help researchers doing basic DNA barcode research by providing a way to assign unknown samples to a species or other taxonomic clade. For example, when sequences are submitted to BLAST for identification, the query sequence can be shown within a distance tree with other, similar sequences to help represent to what species the query sequence belongs. The sequence identification search engine at BOLD uses the same concept and many LIMS systems like WAISABI and Geneious use distance trees to help assign and verify the identity of sequences. Consequently, building aligned sequence matrices into which unknown or unverified sequences may be added promotes best practices and workflow processes in managing DNA barcode data. Likewise, the identification of novel genetic barcode sequences that may belong to new species may be initiated through incorporation into phylogenetic trees containing verified DNA barcode sequences. The delineation of novel species can be challenging and involve much more data than just DNA barcode sequences alone, but use of DNA barcode data to explore and quantify the magnitude of support for discrete genetic groupings that correspond to new species is greatly enhanced by use of molecular phylogenetic trees (see Chapter 18 and ref. 4). In addition to facilitating taxonomic investigations, molecular phylogenies using DNA barcode data are also being implemented by ecologists to explore the ecological mechanisms that affect community structure and function (see Chapter 20; also refs. 5–7). 
Lastly, phylogenies themselves represent a metric for the measurement of genetic diversity (8), which then allow the unambiguous quantification of genetic diversity for entire communities. This then promotes comparative analysis of genetic diversity in both time and space (9). Consequently, the ability to reconstruct phylogenetic trees has very many uses in ecology and evolution, and using DNA barcode data to correctly assemble community or taxonomic phylogenies provides a tremendous benefit to those who collect and use DNA barcode data. This chapter explicitly focuses on the methods we use to construct molecular phylogenies, starting from verified DNA barcode sequences in FASTA format, through to the generation of phylogenetic trees (Fig. 1). We emphasize that the methods used here are by no means exclusive to other methods; for example there are many methods of sequence alignment and those we have chosen,

[Fig. 1 graphic, summarized by workflow step]
Marker: rbcL | matK/CO1 | trnH/ITS
Edited sequence: FASTA | FASTA | FASTA
Sequence alignment: Sequencher/transAlign | transAlign/MAFFT | Muscle/MAFFT
Matrix construction (Nexus file): SequenceMatrix/SeqCat
Phylogenetic reconstruction: Maximum likelihood (Garli/RAxML); Parsimony (PAUP/TNT)

Fig. 1. A workflow of data from sequence to phylogeny is outlined. The programs we use at each step are given.

we do so based on our own experience. Many of the examples of different genes come from plants, but the processes outlined can be applied across organisms and genes. The workflow pattern is what will remain constant even as the exact sequence analysis programs implemented may evolve.
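To make the matrix-construction step of this workflow concrete, the following Python sketch concatenates per-gene alignments end to end and pads taxa that lack a gene. It is an illustration of the idea only, not the behavior of SequenceMatrix or SeqCat; the function name `concatenate`, the use of "?" for missing data, and the toy alignments are assumptions.

```python
def concatenate(alignments, missing="?"):
    """Join per-gene alignments end to end, padding absent taxa.

    `alignments` maps a gene name to {taxon: aligned sequence}; every
    sequence within one gene must have the same aligned length.
    Returns (matrix, partitions), where partitions holds 1-based
    (gene, start, end) coordinates in the concatenated matrix.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    partitions, pos = [], 0
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa without this gene receive a run of missing-data symbols.
            matrix[taxon] += aln.get(taxon, missing * width)
        partitions.append((gene, pos + 1, pos + width))
        pos += width
    return matrix, partitions


# Toy alignments; sp3 has no matK sequence.
alignments = {
    "rbcL": {"sp1": "ATGTCA", "sp2": "ATGTCG", "sp3": "ATGTCA"},
    "matK": {"sp1": "ATG---TTC", "sp2": "ATGAAATTC"},
}
matrix, partitions = concatenate(alignments)
for taxon, seq in matrix.items():
    print(taxon, seq)
print(partitions)
```

The partition coordinates are what one would record in a Nexus character-set block so that downstream programs can treat each marker separately.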


2. Materials

2.1. Data Input Files

1. Coding DNA: Sequences of COI (animals), rbcL, and matK (plants) are maintained as concatenated FASTA files. Each file is in .txt format and contains sequences from all species intended for use in the community phylogeny. Sequences are fully edited and comply with the DNA barcode standards established by CBOL (see Chapter 13 on LIMS). Alignments are done independently for each gene, thus each file should contain only sequences for that gene. Outgroup sequences must be included at this point. All sequences should be in the same orientation (typically 5′ → 3′), but length may vary.
2. Noncoding DNA: close relatives. Rapidly evolving noncoding sequences may vary in length in many clades due to the insertion or deletion of nucleotides, which can confound many alignment algorithms. However, if the phylogeny comprises a single plant Order or Family, noncoding sequences, such as the nuclear ribosomal internal transcribed spacer (ITS) or the chloroplast intergenic spacer trnH-psbA, may be aligned globally in the same manner as coding genes. Thus, input can be formatted as a single FASTA file with all sequences in the same orientation.
3. Noncoding DNA: nested alignments. With more divergent taxa, or when performing analyses for entire communities or very divergent clades, we subdivide alignment for rapidly evolving noncoding markers like trnH-psbA and ITS based on taxonomy (see Note 1). Input for alignment programs requires individual concatenated FASTA files for each Order or Family. Similar treatment may also be required for COI when combining very divergent taxa into a single community phylogeny (e.g., echinoderms, annelids, and platyhelminths; see Note 2). Algorithms that partition sequences based on genetic distance (i.e., Mega-phylogeny (10)) rather than taxonomy may also yield improved alignments.
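Before alignment, it can help to screen each per-gene FASTA file for simple problems. The Python sketch below is one hedged way to do this; the function names `read_fasta` and `check_records` are invented, the accepted character set (IUPAC nucleotide codes plus the gap symbol) is an assumption, and the default label limit of 32 characters anticipates the SequenceMatrix truncation mentioned in Subheading 2.2. It does not test sequence orientation.

```python
def read_fasta(text):
    """Parse FASTA text into a {label: sequence} dict (insertion-ordered)."""
    records, label = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                raise ValueError("header line without a label")
            if label in records:
                raise ValueError(f"duplicate label: {label}")
            records[label] = ""
        elif label is None:
            raise ValueError("sequence data before the first header")
        else:
            records[label] += line.upper()
    return records


IUPAC = set("ACGTRYSWKMBDHVN-")


def check_records(records, max_label=32):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for label, seq in records.items():
        if len(label) > max_label:
            problems.append(f"{label}: label longer than {max_label} characters")
        if not seq:
            problems.append(f"{label}: empty sequence")
        bad = set(seq) - IUPAC
        if bad:
            problems.append(f"{label}: unexpected characters {sorted(bad)}")
    return problems


# Invented two-record file; the second record contains a stray 'X'.
example = ">Inga_laurina\nATGTCACC\nACAA\n>Inga_nobilis\nATGTCXCC\n"
records = read_fasta(example)
print(check_records(records))
```

An empty list from `check_records` means that none of these particular checks failed; it does not replace editing against the chromatograms.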

2.2. Sequence Alignment Programs (see Note 3)

1. rbcL alignments: We use Sequencher 4.8 (GeneCodes Corp.) to perform alignments for rbcL across all taxa. Other programs, such as Geneious 1.07 (Biomatters Ltd.) may also be used, but we have more experience with Sequencher. Global alignments from Sequencher may contain many erroneously inserted gaps due to the divergence of sequences—Sequencher was not designed as a cross-taxon alignment tool and does not handle very divergent sequences in the manner one might expect. However, it is still a useful tool. Sequencher alignments must be edited to remove the inserted gaps—the command “collect all gaps right/left” under the “Sequence” menu greatly facilitates this.


2. matK alignments: We use transAlign (ref. 11; freely downloadable at http://www.molekularesystematik.uni-oldenburg.de/en/34011.html) for global alignment of matK data. The program transAlign translates the DNA sequence data of protein-coding genes into amino acid (aa) sequences, sends the aa sequences to a separate program for alignment, and lastly “back” translates the resulting aa alignment into a nucleotide sequence alignment, preserving the original DNA sequences. Thus, transAlign does not perform the alignment, but rather interacts with an alignment program and maintains the integrity of the original DNA sequences while using an aa translation for the actual inference of sequence homology. As noted, transAlign requires an interface with an alignment program also installed on your computer. We use ClustalX (12), although other alignment programs can be substituted (e.g., Muscle, MAFFT, T-Coffee).
3. COI alignments: COI alignments can be treated as matK above, given that both are rapidly evolving coding genes. Back-translation is frequently used for assembling aligned databases of COI, and thus the use of transAlign works well. MAFFT (ref. 13; http://mafft.cbrc.jp/alignment/software/) is also very effective.
4. trnH-psbA and ITS alignments: We use Muscle 3.6 (ref. 14; http://www.drive5.com/muscle) for alignments of trnH-psbA sequences (which would also apply to ITS and any other rapidly evolving noncoding sequence data). We use default parameters for the program when aligning.
5. Matrix construction: To combine separate aligned files into a matrix with sequences for multiple genes concatenated (end-to-end) (see Fig. 2), we currently use the Java-based program SequenceMatrix 12.7.0 (ref. 15; freeware http://taxondna.sourceforge.net/). We have also used both MacClade4 (16) and, more recently, a Perl script called SeqCat (http://www.molekularesystematik.uni-oldenburg.de/en/34011.html). Taxon labels exported from SequenceMatrix in nexus format will be truncated to 32 characters, so input labels need to be adjusted accordingly. MacClade4 operates only in Mac OS, while SequenceMatrix runs on any operating system where the Java binary is installed. The SeqCat program allows much longer taxon labels, but its output is an interleaved nexus file. The output file may need to be converted to a noninterleaved file (e.g., via PAUP* 4.0) before use in some downstream programs.

2.3. Phylogeny Reconstruction Programs

1. Parsimony: We use PAUP* 4.0 (17) for phylogenetic inference using the parsimony criterion. PAUP* runs under the Mac OS or UNIX. An alternative for parsimony searches is TNT (18), http://www.zmuc.dk/public/phylogeny/tnt/, which can be

D.L. Erickson and A.C. Driskell

[Figure 2: schematic pairing a "Sequence Matrix" panel (columns rbcL, matK, and trnH-psbA) with a "Tree Topology" panel; rows are species of Inga, Casearia, Solanum, and Bactris, plus Ginkgo biloba.]

Fig. 2. Outline of Supermatrix (or nested matrix) design. Coding genes can be aligned globally, across highly divergent clades, whereas the most rapidly evolving sequences are partitioned into smaller alignment blocks to improve the likelihood of correctly assessing homology among aligned nucleotides. In this model of a matrix for a community containing four divergent lineages, the intergenic spacer trnH-psbA is aligned within orders, with different orders nested into discrete partitions of the matrix. A nested design may be implemented when using any hypervariable sequence region.

run under Windows or UNIX. We have also implemented parsimony analyses via PAUP on the CIPRES Web server (http://www.phylo.org/sub_sections/portal/), which runs large jobs very quickly.

2. Maximum likelihood: We have used both Garli 0.951 and RAxML for maximum likelihood (ML) phylogenetic inference. Garli (ref. 19; freeware http://garli.googlecode.com) and RAxML (ref. 20; freeware http://icwww.epfl.ch/~stamatak/) may also be run on the CIPRES Web server for very large data sets (see Note 4).

3. Neighbor Joining trees: We use PAUP for calculating neighbor joining trees from DNA sequence alignments; alternative programs include clearcut 1.0.9 (21), which is also available via CIPRES (see Note 5).

3. Methods

3.1. Conducting Alignments

1. transAlign. We use transAlign in conjunction with ClustalX for global alignment of the coding genes. After the program is called, it will prompt for file name and location of the


concatenated FASTA file (from Subheading 2.1 above). The default location is the same directory as the executable. Subsequent information is requested in stepwise order, an important choice being the translation table to be used (for plants we use the bacterial/plastid table; for animals the appropriate mtDNA table can be selected). We typically choose to output an aligned FASTA file, as this format is most easily read by the programs we use for matrix construction. We have the program check all six possible reading frames to cope with any sequences accidentally submitted in reverse orientation. The program will also screen for and report possible insertion–deletion (indel) errors that create pseudogene-like stop codons.

2. Muscle: We use both command-line and Web-based versions of Muscle for alignment of noncoding DNA sequences. The online version is available at http://www.ebi.ac.uk/Tools/msa/muscle/. Again, FASTA-formatted sequences (from Subheading 2.1 above) are the source input. For nested alignments of noncoding sequences, we use a script (from above at Subheading 2.2, item 4) that automates batch submissions to a command-line version of Muscle; this avoids having to run Muscle separately for each set of nested sequences.

3. MAFFT: MAFFT has proven to be highly accurate in alignment of complex sequences. As with Muscle, we use both online and local command-line versions. Local command-line versions handle larger numbers of sequences than online implementations and can be run from within LIMS-type programs like Geneious. The FFT-NS-i option, which assumes there are blocks of concordance among sequences punctuated by gaps, has produced the best alignments for rapidly evolving genes like matK and CO1.

4. Manual verification of alignment: Each alignment produced by the alignment methods listed above is checked manually using SeAl (freeware http://tree.bio.ed.ac.uk/software/seal/) (see Note 3). Typically, the rbcL alignment is then exported as a nexus-formatted file (FASTA is also acceptable; see Note 6), and the matK and trnH-psbA files are exported as aligned FASTA files. Output files for each gene are then used to construct a combined supermatrix, or nested matrix, as below (see Note 7). Alignments of the coding genes can be screened for one- and two-base gaps, which typically reflect sequence error and arise when a single sequence forces a gap into the alignment. We cross-check all one- and two-base gaps found in the aligned sequences against the raw data from the individuals causing the gaps, and delete the base causing the gap unless it is strongly supported by the raw data. Similarly, the consensus sequence of all aligned coding genes can be exported and checked for reading


frame errors via translation. Any stop codon in the consensus sequence is evaluated by checking the individual sequences responsible for it.

3.2. Matrix Construction

1. Concatenation and supermatrix design. To combine multiple aligned DNA sequence files into a single matrix for use with phylogenetic analysis programs, we use the Java program SequenceMatrix. SequenceMatrix combines alignment files (in aligned FASTA or Nexus format) so that each gene is added sequentially, and sequence data from different genes with exactly the same taxon label are concatenated. Each specimen whose data are to be concatenated must therefore have a unique label, and that label must be identical in every input file; failure to label the same sample identically in different input files results in failure to concatenate the sequences for that taxon. Alignment files can be dragged and dropped directly into the SequenceMatrix window, and all files can be selected and added simultaneously. This is particularly useful when working with nested alignments; in large community phylogenies there may be dozens of Family- or Ordinal-specific alignment files (see Fig. 2). We do not have the program code external gaps as question marks (however, see Note 10). The resulting matrix is then exported as a nexus file with the option of “one single (potentially very long) line”.
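
The concatenation rule described above (sequences are joined only when their taxon labels match exactly, and a nested block simply lacks data for taxa outside it) can be sketched in a few lines of Python. This is an illustrative sketch only, not the SequenceMatrix implementation: the function names, the example labels, and the choice of "?" as the padding symbol for absent blocks are assumptions made for the example.

```python
def read_fasta(text):
    """Parse aligned FASTA text into a {label: sequence} dict (order kept)."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if label in records:
                raise ValueError(f"duplicate label within one file: {label}")
            records[label] = ""
        elif label is not None:
            records[label] += line
    return records


def concatenate(alignments, missing="?"):
    """Join alignment blocks end to end, matching rows by exact label.

    `alignments` is a list of {label: aligned sequence} dicts, one per gene
    or per nested alignment block. A taxon absent from a block is padded
    with `missing` (an assumed symbol) so all rows have the same length.
    """
    labels = []
    for aln in alignments:
        for label in aln:
            if label not in labels:
                labels.append(label)
    matrix = {label: "" for label in labels}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = lengths.pop()
        for label in labels:
            matrix[label] += aln.get(label, missing * width)
    return matrix


# Hypothetical toy data: one global rbcL block and two nested trnH-psbA blocks.
rbcl = read_fasta(">Inga_laurina\nACGT\n>Casearia_arborea\nACGA\n")
trnh_inga = read_fasta(">Inga_laurina\nTT-A\n")
trnh_casearia = read_fasta(">Casearia_arborea\nGGC\n")
matrix = concatenate([rbcl, trnh_inga, trnh_casearia])
for label, row in matrix.items():
    print(label, row)
```

Because matching is by exact string, a label such as "Inga laurina" in one file and "Inga_laurina" in another would produce two separate rows, which is the failure mode the text warns about.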

3.3. Phylogenetic Reconstruction

1. Parsimony. We have employed PAUP* on local servers as well as via the CIPRES portal. The CIPRES portal implements the parsimony ratchet (22), as well as searching for a single most parsimonious tree using PAUP*. When analyzing nested supermatrices, it is critical to correctly specify the fraction of bases for the algorithm to “deform” during the ratchet runs (use the “set parameters” command). The ratchet procedure uses a hill-climbing algorithm to search for a best tree. To avoid fixation on a locally optimal, but globally suboptimal, tree island, the algorithm “deforms” or alters the base composition of the data to compel exploration of alternate, potentially superior, tree space. In a highly nested supermatrix (e.g., Fig. 2) where a large fraction of the matrix consists of missing data (e.g., >95% missing data, Chapter 22), the percentage of bases deformed during each ratchet iteration must be relatively high (≥50%) to ensure that the data are sufficiently deformed to dislodge the search from local optima. Because so many of the characters in a nested matrix have a high proportion of missing data, a large percentage of the matrix must be manipulated by the program in order to change enough informative data. For a data matrix that is globally aligned for all sequences (e.g., when using only CO1, or only rbcL + matK) the default setting


of deforming 20% of bases should be adequate. We initiate five separate runs of PAUP* on CIPRES, and then combine the ratchet trees from each run into a single file for use in constructing a consensus tree. Using PAUP* on a local cluster has proven more useful, particularly when implementing a constraint tree (see Note 8). Constraint trees are readily defined, either through selection of a nexus-formatted tree under GUI versions of PAUP*, or through inclusion of a constraint command in PAUP* command blocks (see Note 9). In general, we do not include insertion–deletion characters as part of the data matrix (see Note 10).

2. Maximum likelihood on local cluster. We have used both RAxML and GARLI for ML-based phylogenetic reconstruction. Both are available via the CIPRES server, and both can readily use a newick-defined constraint tree (however, see Note 4). Both RAxML and GARLI are implemented on local clusters (see Note 11 for the PAUP* command block we use for both programs), with settings equivalent to the GTR + I + Gamma model, which is broadly employed for many data sets.

3.4. Assessment of Topology and Support

1. For both MP and ML methods, we recommend the use of standard bootstrapping procedures. This is particularly true when the data matrix consists entirely of barcode data, which is, by definition, a minimal quantity of data. For parsimony, the parsimony ratchet is a good alternative, and we combine trees from at least five separate MP runs into a consensus tree. For ML, we suggest initiating 100 separate runs, each begun with a random-addition starting tree. All trees are then assembled into a single nexus file with a trees block, and a 50% majority rule tree is constructed. Use of a constraint tree may render estimation of topology irrelevant, but a phylogeny produced without that constraint should be evaluated based on congruence with known phylogenies, and with respect to the fraction of taxonomic ranks that form monophyletic groups.
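
The 50% majority rule step can be illustrated with a small Python sketch. It is not the routine used by PAUP* or any other program named here: it assumes rooted trees written as nested tuples of tip labels, treats each clade as the set of tips it contains, and reports every clade found in more than half of the input trees together with its frequency. The function names and the toy trees are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return all clades of a rooted tree as frozensets of tip labels.

    The tree is a nested tuple such as (("A", "B"), ("C", "D")); this
    representation is an assumption made for the sketch.
    """
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


def majority_rule(trees, threshold=0.5):
    """Map each clade present in more than `threshold` of trees to its frequency."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    n_trees = len(trees)
    root = max(counts, key=len)  # the clade holding every tip is trivial
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if clade != root and count / n_trees > threshold
    }


# Hypothetical set of three trees on the same four tips.
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
]
consensus = majority_rule(trees)
for clade, freq in sorted(consensus.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(freq, 2))
```

In this toy case the clade {A, B} occurs in all three trees and {C, D} in two of three, so both are retained, while {A, B, C} (one of three) is dropped.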

4. Notes

1. The partitioning of sequences that cannot be aligned globally is important since the accuracy of the alignment is dependent on genetic distance of those sequences being aligned. This will be true for the most rapidly evolving sequences, including any of the noncoding sequences as well as for CO1 which may saturate when alignments are among all members of a community. In plants, we evaluate the scale of alignment by comparing alignments using rbcL plus trnH-psbA (or ITS) aligned


at ordinal level with topologies based on rbcL only. We note whether the phylogeny produced using both genes alters the topology from that produced with rbcL only. We assume trnH-psbA should not change the topology produced by rbcL, except in resolving polytomies within the rbcL-only tree. We partition the trnH-psbA alignment into successively lower taxonomic groups until the topology of the combined matrix is consistent with the rbcL-only data matrix. That is, when the trnH-psbA alignment becomes ambiguous because it was made at too high a taxonomic scale, it will cause erroneous topological rearrangements at higher scales that conflict with topologies observed when using rbcL only; when trnH-psbA is aligned correctly, it should not contradict the rbcL (or rbcL + matK) alignments and will instead just increase resolution in poorly resolved clades.

2. COI can be aligned between highly divergent species, often through back-translation. However, the phylogenetic information content may be limited, even if the alignment is legitimate, because the rapid nucleotide substitution rate leads to saturation of changes at a given position. Accordingly, when one seeks to align many highly divergent clades with COI, as for a community phylogeny, it may be prudent to subdivide the alignment in a nested format, such as that used with noncoding genes.

3. When manually checking sequence alignments, we make only certain types of modifications, and otherwise leave the computer-generated alignments as they are. Specifically, for rbcL alignments, there must be no gaps of any kind. SeAl provides a rapid tool for screening for gaps, which are usually the result of incorrect sequence editing. Further, the aligned sequence matrix for rbcL must contain no stop codons in the translation. For matK, there will be a substantial number of gaps, but nearly all gaps will be in multiples of three, corresponding to differences in the number of complete amino acids in the mature protein. Any one- or two-base gaps are likely the result of errors in sequence editing and must be confirmed in the source sequences. Further, as with rbcL, individual contigs as well as the aligned sequence matrix for all samples should contain no stop codons. We have, however, observed pseudogenes in matK in a few families (particularly Melastomataceae) in which correct sequences contain one- or two-base gaps, which then affect inference of stop codons in an aligned matrix. Thus, one cannot say definitively that no stop codons may be present in an aligned matK sequence matrix. For trnH-psbA, because it is noncoding, the use of stop codons to evaluate sequence alignment is not applicable. Typically, we trim trnH-psbA sequences with an


internal primer that is different from the primer used for PCR (Hamilton 1999 psbA sequence). For CO1, the presence of stop codons should alert one to the possibility of having retrieved a NUMT (a nuclear copy of mitochondrial DNA), rather than a true mitochondrial copy.

4. RAxML appears to interpret congruence of the ML-generated tree with the constraint tree differently than does GARLI. In RAxML a clade that is a polytomy in the ML tree is not regarded as conflicting with a clade that is resolved in the constraint tree; thus it is possible for the constraint tree to have better resolution in some clades than an ML tree produced by RAxML which employs that constraint. In contrast, GARLI appears to enforce the constraint tree topology more strictly, such that the ML tree produced by GARLI will always mirror the resolution of the constraint tree at a minimum, and often will resolve clades in the ML tree that are unresolved in the constraint tree.

5. Neighbor Joining is a discrete type of tree building algorithm in that it uses genetic distances obtained from an alignment matrix, but it produces the tree by ever finer subdivision of unresolved clades, as opposed to objectively evaluating relationships among species. The order in which species are listed in the alignment file may also affect the NJ tree topology because of the way ties are dealt with. Typically, ties are broken at random; thus a tie that includes >3 species can produce different topologies when the order in which those species are listed in the matrix differs.

6. Export of sequences out of Sequencher in a nexus format sometimes leads to inclusion of erroneous characters appended to the sequence reads that interfere with alignment using most alignment programs, including Muscle. Export of sequences as aligned FASTA solves this problem. We also paste sequences directly into the Web interface rather than upload text files, to avoid incompatibilities between Mac, PC, and UNIX machines.

7. We distinguish between a supermatrix and a nested matrix. A supermatrix may contain a large number of different genes for a set of specimens where the number of samples that have data for any one gene in the matrix may be low; the very large number of genes that are present unite the matrix (23). A nested matrix is a method to ensure that rapidly evolving genes are aligned correctly when they cannot be aligned across all species in the matrix (i.e., a global alignment cannot be performed). In a nested alignment, one or more genes may be present and aligned for all samples, whereas other genes may


be aligned only within families to ensure estimation of homology during alignment (see Fig. 2).

8. A constraint tree can be implemented when part of the topology is known, and the user wants to enforce that known topology on the phylogeny being constructed with the barcode data. For example, in plants we can use a master tree from the Angiosperm Phylogeny Group (APGIII), which specifies the relationships at the level of Order. When an APGIII constraint tree is enforced during the analysis, the phylogeny produced from barcode data must be concordant with APGIII at the Ordinal level. Since APGIII does not specify the relationships at lower taxonomic levels, the barcode data will resolve family and species level relationships. Because barcode data are necessarily minimal, use of a constraint tree allows the barcode data to resolve lower level phylogenetic relationships, while leveraging existing phylogenetic data sets to correctly constrain deeper topological relationships. See Kress and coworkers (6) for further discussion and examples.

9. Example PAUP block for MP using a constraint tree (the constraint definition, the outgroup, and the output file name are left blank here and must be supplied):

    Begin Paup;
    Defaults hsearch;
    constraints =;
    set autoclose=yes;
    set criterion=parsimony;
    set root=outgroup;
    set storebrlens=yes;
    set increase=auto;
    outgroup ;
    hsearch addseq=random nreps=100 swap=tbr hold=5 enforce=yes;
    savetrees file= format=altnex;
    end;

10. When one wishes to use gaps and indel variation in phylogenetic computation, it may be desirable to have SequenceMatrix code the external gaps as question marks; this will allow programs like PAUP* to treat internal gaps as a fifth character. Failure to code external gaps as missing data while using indel variation in phylogeny estimation may allow the completeness of a sequence to be interpreted as natural indel variation.


11. The following PAUP command block can be read and implemented by both Garli and RAxML (the constraint definition is left blank here and must be supplied):

    begin paup;
    set criterion=likelihood;
    constraints =;
    lset nst=6 basefreq=empirical;
    lset pinvar=estimate;
    lset rates=gamma ncat=4 shape=estimate;
    hsearch nreps=10 addseq=random swap=tbr enforce=yes;
    savetrees file=My_tree.tre brlens=yes replace=yes;
    end;
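
The manual checks for coding genes described in Note 3 (internal gaps should come in multiples of three, and the translation should contain no stop codons) can be pre-screened automatically. The following Python sketch is one hedged way to do so and is not part of SeAl or transAlign. The function names are hypothetical, and the stop codon set TAA/TAG/TGA is an assumption that suits the standard and bacterial/plastid tables; a different set must be passed for mitochondrial codes.

```python
import re

# Assumed stop codons (standard and bacterial/plastid genetic codes).
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def suspicious_gaps(aligned_seq):
    """List (start, length) of internal gap runs not divisible by three.

    Leading and trailing gaps are ignored because they reflect sequence
    completeness rather than indels.
    """
    offset = len(aligned_seq) - len(aligned_seq.lstrip("-"))
    core = aligned_seq.strip("-")
    return [
        (match.start() + offset, len(match.group()))
        for match in re.finditer(r"-+", core)
        if len(match.group()) % 3 != 0
    ]


def stop_codon_positions(aligned_seq, frame=0, stops=STOP_CODONS):
    """Return codon indices of stop codons in the ungapped sequence.

    `frame` (0, 1, or 2) is the offset of the first complete codon and
    must be supplied by the user; it is not inferred here.
    """
    seq = aligned_seq.replace("-", "").upper()[frame:]
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return [index for index, codon in enumerate(codons) if codon in stops]


# Hypothetical aligned rows: one single-base gap (suspect), one 3-base gap.
row = "--ATG-CCA---TTT--"
print(suspicious_gaps(row))
print(stop_codon_positions("ATGTAA---CCC"))
```

Rows flagged by either function are candidates for the cross-check against raw trace data described in Subheading 3.1; as Note 3 points out for matK, a flag is a prompt to inspect the sequence, not proof of an error.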

References

1. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. In: Hillis DM, Moritz C, Mable BK (eds) Molecular systematics. Sinauer Associates, Boston
2. Webb CO (2000) Exploring the phylogenetic structure of ecological communities: an example for rain forest trees. Am Nat 156:145–155
3. Harvey PH, Leigh Brown AJ, Maynard SJ, Nee S (2006) New uses for new phylogenies. Oxford University Press, Oxford
4. Smith MA, Rodriguez JJ, Whitfield JB et al (2008) Extreme diversity of tropical parasitoid wasps exposed by iterative integration of natural history, DNA barcoding, morphology, and collections. Proc Nat Acad Sci USA 105:12359–12364
5. Kress WJ, Erickson DL, Jones FA et al (2009) Plant DNA barcodes and a community phylogeny of a tropical forest dynamics plot in Panama. Proc Nat Acad Sci USA 106:18621–18626
6. Schreeg LA, Kress WJ, Erickson DL, Swenson NG (2010) Phylogenetic analysis of local-scale tree soil associations in a lowland moist tropical forest. PLoS One 5:e13685. doi:10.1371/journal.pone.0013685
7. Uriarte M, Swenson N, Chazdon R et al (2010) Trait similarity, shared ancestry, and the structure of neighborhood interactions in a subtropical forest: implications for community assembly. Ecol Lett 13:1503–1514
8. Forest F, Grenyer R, Rouget M et al (2007) Preserving the evolutionary potential of floras in biodiversity hotspots. Nature 445:757–760
9. Hardy OJ, Jost L (2008) Interpreting and estimating measures of community phylogenetic structuring. J Ecol 96:849–852. doi:10.1111/j.1365-2745.2008.01423.x
10. Smith S, Beaulieu JM, Donoghue MJ (2009) Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evol Biol 9:37. doi:10.1186/1471-2148-9-37
11. Bininda-Emonds ORP (2005) transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics 6:156. doi:10.1186/1471-2105-6-156
12. Larkin MA, Blackshields G, Brown NP et al (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948. doi:10.1093/bioinformatics/btm404
13. Katoh K, Misawa K, Kuma K-I, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066. doi:10.1093/nar/gkf436
14. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797. doi:10.1093/nar/gkh340
15. Meier R, Shiyang K, Vaidya G, Ng PKL (2006) DNA barcoding and taxonomy in diptera: a tale of high intraspecific variability and low identification success. Syst Biol 55:715–728. doi:10.1080/10635150600969864
16. Maddison DR, Maddison WP (2000) MacClade 4: analysis of phylogeny and character evolution, version 4.0. Sinauer Associates, Sunderland, MA
17. Swofford DL (2002) PAUP*. Phylogenetic analysis using parsimony (* and other methods) version 4. Sinauer Associates, Sunderland, MA
18. Goloboff PA, Farris JS, Nixon KC (2008) TNT, a free program for phylogenetic analysis. Cladistics 24:774–786
19. Zwickl DJ (2006) Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. Dissertation, The University of Texas at Austin
20. Stamatakis A, Ott M, Ludwig T (2005) RAxML-OMP: an efficient program for phylogenetic inference on SMPs. In: Proceedings of 8th international conference on parallel computing technologies (PaCT2005). Lect Notes Comput Sci 3506:288–302. Springer Verlag, Berlin
21. Evans J, Sheneman L, Foster JA (2006) Relaxed neighbor-joining: a fast distance-based phylogenetic tree construction method. J Mol Evol 62:785–792
22. Nixon KC (1999) The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics 15:407–414
23. Driskell AC, Ané C, Burleigh JG et al (2004) Prospects for building the tree of life from large sequence databases. Science 306:1172–1174

Chapter 20

Phylogenetic Analyses of Ecological Communities Using DNA Barcode Data

Nathan G. Swenson

Abstract

Ecologists and conservation biologists are increasingly focusing on quantifying the phylogenetic component of biodiversity in order to inform basic and applied research. A major obstacle of this approach in tropical ecosystems has been the difficulty of generating high-quality phylogenetic trees for the vast numbers of species in these systems. Phylogenetic trees inferred from DNA barcodes hold the potential to overcome this obstacle. Here, I present a methodological framework for analyzing the phylogenetic alpha and beta diversity of ecological communities using a phylogenetic tree. The analytical approach is presented using the freely available and widely used software platform “R”.

Key words: Biodiversity, Community ecology, Community phylogenetics, Phylogenetic beta diversity, Phylogenetic diversity, DNA barcode

1. Introduction

Ecologists interested in studying and conserving biodiversity are tasked with quantifying that diversity through space and time. Typically, this has been done using a measure of species diversity. Other dimensions of biodiversity such as phylogenetic diversity and functional diversity are less often quantified, but these forgotten dimensions may be equally or more important (1–3). Conservation biologists are interested in phylogenetic measures of biodiversity as a way to provide a more robust estimate of the overall evolutionary history currently preserved in protected lands or the potential amount of biodiversity that could be lost in threatened regions (1, 3–5). Basic community ecology research, on the other hand, has focused on the phylogenetic structure of communities in order to gain insights into their historical assembly (6–8). Despite their differing aims, both research programs generate estimates of the phylogenetic diversity within and between species assemblages across scales.

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_20, © Springer Science+Business Media, LLC 2012


N.G. Swenson

The phylogenetic diversity of communities has been of interest in community ecology for almost 100 years. Early studies analyzed the ratio of species to genera in communities as a way to understand whether biotic or abiotic interactions are important in community assembly (9, 10). Specifically, a low species to genus ratio indicates the coexistence of distantly related species, which is termed today phylogenetic overdispersion (6). A high species to genus ratio indicates coexistence of closely related species, which is termed today phylogenetic underdispersion or clustering (6). This species to genus ratio approach continued for decades, culminating with the famous community assembly rules and null model debates of the 1960s and 1970s (10).

The foundation of the species to genus ratio approach is the assumption that closely related species are more likely to have similar niches, often termed phylogenetic niche conservatism. If closely related species tend to share similar niches, then community assembly via abiotic filtering should result in phylogenetic clustering, whereas community assembly mediated by biotic interactions should result in phylogenetic overdispersion (6). Charles Darwin originally alluded to niche conservatism when he considered the implications of common descent. Specifically, species that share a recent common ancestor should, on average, tend to be more similar to one another than they are to more distant relatives. If this assumption is supported, then phylogenetic diversity will adequately estimate the functional diversity of an assemblage.

A problem with the species to genus ratio approach, beyond the assumption of niche conservatism, is that taxonomic ranks do not convey detailed information regarding the time since two species diverged. A solution to this problem is to use phylogenetic trees with branch lengths. The branch lengths can be used to provide a more refined measure of relatedness between taxa.
Until the early 2000s, however, generating phylogenetic trees representing communities (i.e., community phylogenies) was not considered possible. Pioneering work by Cam Webb and colleagues (6, 11), who developed software tools such as Phylomatic (11) for estimating phylogenetic trees for plant communities, largely removed this obstacle. This innovation sparked a large number of investigations into the phylogenetic relatedness of coexisting plant species, primarily in the tropics, where measurements of species function or niches in tens to hundreds of locally coexisting species are difficult to achieve at best (6–8). This work has primarily sought to quantify the phylogenetic diversity in a community and to ask whether that phylogenetic diversity is higher or lower than that randomly expected given the species diversity. These results are then often used to determine to what degree abiotic or biotic interactions govern community assembly (6–8). As noted above, the development of Phylomatic by Webb and Donoghue (11), which made the estimations of community


phylogenies for diverse botanical communities feasible, enabled a substantial and continually growing community phylogenetics research program. This shows how quickly the development of a new tool can spawn an entirely new literature within a matter of years. The generation of community phylogenies from DNA barcodes presents what may be considered the next substantial development in community phylogenetics (12, 13). There are two primary reasons for this prediction. First, the ability of Phylomatic to generate phylogenies for diverse communities was and is a powerful tool for ecologists, but the phylogenies generally lack much resolution within families and certainly within genera. Thus, fine-scale phylogenetic structuring in communities, particularly communities with many congeners, cannot be detected (12, 14). Second, the generation of a three-locus barcode and a barcode community phylogeny for the species in hyper-diverse communities is entirely feasible (12, 13). Thus, the stage is set for DNA barcode community phylogenies to provide the next revolutionary tool in the emerging field of community phylogenetics. In the next two sections, the commonly used metrics of phylogenetic alpha and beta diversity are presented.

1.1. Phylogenetic Alpha Diversity and Dispersion

Central to the community phylogenetics research program is the quantification of the phylogenetic diversity within a community, termed phylogenetic alpha diversity. One of the first phylogenetic alpha diversity metrics ever generated was “Faith’s Index” (1). Faith’s Index calculates the total branch lengths shared by the taxa in a community. A second commonly used metric is the mean pairwise phylogenetic distance (MPD) designed by Webb (6). This metric calculates the pairwise phylogenetic distance between all species in a community and then reports the mean value. This provides an overall picture of the “deep” or “basal” phylogenetic diversity in a community. It is calculated as follows:

\[ \mathrm{MPD} = \sum_{i=1}^{S} f_i \, d_{ij}, \]

where $S$ is the number of species in the community, $f_i$ is the relative abundance of species $i$, and $d_{ij}$ is the mean pairwise phylogenetic distance between species $i$ and all other species in the community. A third commonly used metric developed by Webb (6) is the mean nearest taxon distance (MNTD). This metric finds the nearest phylogenetic neighbor of each species in a community and then reports the mean of those distances. This provides a “shallow” or “terminal” measure of phylogenetic diversity in a community. It is calculated as follows:

\[ \mathrm{MNTD} = \sum_{i=1}^{S} f_i \, \min d_{ij}, \]

where $S$ is the number of species in the community, $f_i$ is the relative abundance of species $i$, and $\min d_{ij}$ is the nearest phylogenetic distance between species $i$ and all other species in the community.

The majority of methods used to quantify the phylogenetic alpha diversity in communities are highly dependent upon the species alpha diversity. That is, there is often a strong correlation between phylogenetic and species alpha diversity. Therefore, it is difficult to determine whether the observed level of phylogenetic diversity is different from what would be expected at random given the species diversity, and so we cannot determine the significance of the results with respect to mechanisms of community assembly. In order to determine the significance of the results, the researcher needs to utilize a null model. The concept of a null model is to hold constant all of the observed patterns except the one pattern in which you are interested. In this scenario, we are interested in the phylogenetic diversity of the community, so we need to construct a null model that keeps the species diversity, species relative abundances, and species occupancy rates across communities constant. Not keeping these constant may result in inflated type I or type II statistical errors.

A preferred null modeling method for determining if the observed phylogenetic diversity is higher or lower than expected is to randomize the names of the taxa across the tips of the phylogeny X times and calculate the phylogenetic alpha diversity with each random dataset. This provides a null distribution to which one can compare the observed phylogenetic alpha diversity. This null model does not randomize the community data. Therefore, all spatial patterns, abundance distributions, and species richness values are held constant. If the observed phylogenetic alpha diversity is higher than expected, then the community is considered phylogenetically overdispersed. If it is lower than expected, it is considered phylogenetically underdispersed or clustered.

1.2. Phylogenetic Beta Diversity and Dispersion

Plant community ecologists have long been interested in the compositional dissimilarity of communities, termed beta diversity. Generally, compositional dissimilarity analyses have been conducted using lists of species in the communities. While this method has produced many important results, ideally we would also like to know how evolutionarily dissimilar the species in the communities being compared are. Such information would allow for stronger inferences regarding the ecological and evolutionary mechanisms that promote the observed distributions of plant species. Comparing the phylogenetic dissimilarity between two communities, or the phylogenetic beta diversity, is one rather new way of enhancing traditional species-list-based analyses of compositional dissimilarity (15).

Here, I present two phylogenetic beta diversity metrics that are increasingly implemented in community ecology. The first metric is the pairwise phylogenetic dissimilarity (Dpw) between two communities, which is based on the pairwise phylogenetic distances between all species in one community and all species in the other community:

$$D_{pw} = \frac{\sum_{i=1}^{n_{k_1}} f_i \, d_{ik_2} + \sum_{j=1}^{n_{k_2}} f_j \, d_{jk_1}}{2},$$

where $n_{k_1}$ and $n_{k_2}$ are the numbers of species in communities k1 and k2, $d_{ik_2}$ is the MPD between species i in community k1 and all species in community k2, $d_{jk_1}$ is the MPD between species j in community k2 and all species in community k1, and $f_i$ and $f_j$ are the relative abundances of species i and species j.

The second metric is a nearest phylogenetic neighbor dissimilarity (Dnn) between two communities:

$$D_{nn} = \frac{\sum_{i=1}^{n_{k_1}} f_i \, \min d_{ik_2} + \sum_{j=1}^{n_{k_2}} f_j \, \min d_{jk_1}}{2},$$

where $\min d_{ik_2}$ is the phylogenetic distance between species i in community k1 and its nearest phylogenetic neighbor in community k2, $\min d_{jk_1}$ is the phylogenetic distance between species j in community k2 and its nearest phylogenetic neighbor in community k1, and $f_i$ and $f_j$ are the relative abundances of species i and species j.

Similar to phylogenetic alpha diversity metrics, phylogenetic beta diversity metrics may be correlated with the underlying species richness and species beta diversity. Thus, null model analyses can also be implemented to determine whether the observed phylogenetic beta diversity is higher or lower than that expected given the observed species beta diversity. The null model used for this approach is identical to the one used for phylogenetic alpha diversity.

In the following sections, the data required to quantify the phylogenetic alpha and beta diversity of communities are described. Their calculation in the statistical software “R” is then demonstrated.
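As a worked illustration of the two formulas above, the following base-R sketch computes Dpw and Dnn for two small communities. The distance matrix is hypothetical (branch lengths were invented for the example tree topology), the abundances are those of CommunityA and CommunityB in the example community file, and the helper name beta_metric is an assumption made for this example only. The sketch follows the equations as written; the Picante functions used in the Methods may treat weighting or shared species differently.

```r
# Hypothetical patristic distances for the tree
# ((A:1,B:1):2,((C:1,D:1):1,E:2):1); the branch lengths are invented.
sp <- paste0("species", LETTERS[1:5])
d <- matrix(6, 5, 5, dimnames = list(sp, sp))
d["speciesA", "speciesB"] <- d["speciesB", "speciesA"] <- 2
d["speciesC", "speciesD"] <- d["speciesD", "speciesC"] <- 2
d[c("speciesC", "speciesD"), "speciesE"] <- 4
d["speciesE", c("speciesC", "speciesD")] <- 4
diag(d) <- 0

# Abundances of the species present in each community
k1 <- c(speciesA = 12, speciesC = 4, speciesE = 1)
k2 <- c(speciesA = 8, speciesB = 9)

# Illustrative helper: nearest = FALSE gives Dpw, nearest = TRUE gives Dnn
beta_metric <- function(a, b, d, nearest = FALSE) {
  fa <- a / sum(a)                       # relative abundances f_i
  fb <- b / sum(b)                       # relative abundances f_j
  sub <- d[names(a), names(b), drop = FALSE]
  summarise <- if (nearest) min else mean
  to_b <- apply(sub, 1, summarise)       # each species of k1 versus all of k2
  to_a <- apply(sub, 2, summarise)       # each species of k2 versus all of k1
  (sum(fa * to_b) + sum(fb * to_a)) / 2
}

Dpw <- beta_metric(k1, k2, d)
Dnn <- beta_metric(k1, k2, d, nearest = TRUE)
```

Because speciesA occurs in both communities, its nearest-neighbor distance is zero, which is why Dnn is much smaller than Dpw in this toy case.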

2. Materials

2.1. DNA Barcode Community Phylogeny

1. Generate a DNA barcode community phylogeny following the methods described in Chapter 19.
2. Save the DNA barcode community phylogeny as a Newick file. An example Newick file lacking branch lengths is shown here: ((speciesA,speciesB),((speciesC,speciesD),speciesE));

2.2. Community Data

Organize community data into a three-column, tab-delimited text file (.txt) in which the first column is the name of the community, the second column is the abundance of a species in that community if it is present (i.e., no absences are represented), and the third column is the name of the taxon. If there are no abundance data, presence can be represented as a one in the abundance column (see Note 1). An example community data file for three communities is shown here:

CommunityA	12	speciesA
CommunityA	4	speciesC
CommunityA	1	speciesE
CommunityB	8	speciesA
CommunityB	9	speciesB
CommunityC	5	speciesB
CommunityC	2	speciesC
CommunityC	19	speciesD
CommunityC	14	speciesE
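Before moving to the Methods, it can help to check the file layout independently of any package. The base-R sketch below rebuilds the example file in memory, converts it to a community-by-species matrix, and compares the species names with the tip labels of the example Newick string. The object names are assumptions for this example, and the regular expression used to pull out tip labels only works for a Newick string without branch lengths or node labels, such as the one shown above.

```r
# The example file from the text, held in memory instead of on disk
file_lines <- c(
  "CommunityA\t12\tspeciesA", "CommunityA\t4\tspeciesC",
  "CommunityA\t1\tspeciesE",  "CommunityB\t8\tspeciesA",
  "CommunityB\t9\tspeciesB",  "CommunityC\t5\tspeciesB",
  "CommunityC\t2\tspeciesC",  "CommunityC\t19\tspeciesD",
  "CommunityC\t14\tspeciesE"
)
long <- read.table(text = file_lines, sep = "\t", header = FALSE,
                   col.names = c("community", "abundance", "taxon"),
                   stringsAsFactors = FALSE)

# Communities in rows, species in columns, absences filled with zero
comm <- tapply(long$abundance, list(long$community, long$taxon), sum)
comm[is.na(comm)] <- 0

# Tip labels of a Newick string that has no branch lengths or node labels
newick <- "((speciesA,speciesB),((speciesC,speciesD),speciesE));"
tips <- strsplit(gsub("[();]", "", newick), ",")[[1]]

# Checks suggested by Note 1: unmatched names and repeated entries
missing_from_tree <- setdiff(colnames(comm), tips)
repeated_rows <- any(duplicated(long[, c("community", "taxon")]))
```

An empty missing_from_tree and a FALSE repeated_rows indicate that every species in the community file has a matching tip and that no community lists a species twice.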

3. Methods

3.1. Reading Phylogenetic and Community Data into R Software

1. Open R.
2. Set the working directory to the folder containing your phylogeny and community data files (see Note 2): setwd("your.working.directory.path")
3. Load the R package Picante (16), which will be used for the community phylogenetics calculations (see Note 3): library(picante)
4. Read in your Newick phylogeny file: your.phylo.file = read.tree("your.newick.tree.txt")
5. Read in your community data file: your.community.file = readsample("your.community.data.txt")

3.2. Phylogenetic Alpha Diversity of Communities (see Notes 4 and 5)

1. Calculate Faith’s Index: faiths.index.output = pd(your.community.file, your.phylo.file)
2. Calculate the mean pairwise phylogenetic distance (MPD) for each community, not weighting the result for abundance: mpd.output = mpd(your.community.file, cophenetic(your.phylo.file), abundance.weighted = FALSE)
3. Calculate the mean nearest taxon distance (MNTD) for each community, not weighting the result for abundance: mntd.output = mntd(your.community.file, cophenetic(your.phylo.file), abundance.weighted = FALSE)

3.3. Phylogenetic Alpha Dispersion of Communities

1. Quantify whether the observed Faith’s Index value is higher or lower than expected by generating a null model. The null model randomizes the names of taxa along the tips of the phylogeny and recalculates the Faith’s Index. The null distribution is then used to calculate a standardized effect size (SES) and a P value (see Note 6):

2. Quantify whether the observed MPD value is higher or lower than expected by generating a null model (see Notes 7 and 8):

3. Quantify whether the observed MNTD value is higher or lower than expected by generating a null model:
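The commands for the three steps above are not reproduced in this text. Picante provides the functions ses.pd, ses.mpd, and ses.mntd, whose "taxa.labels" null model matches the description given here and whose output columns match those named in Notes 5 and 6, so a call such as ses.mpd(your.community.file, cophenetic(your.phylo.file), null.model = "taxa.labels", abundance.weighted = FALSE, runs = 999) is a plausible reconstruction rather than confirmed original code. What this null model does can be sketched in base R for the unweighted MPD of a single community. The distance matrix, the community, and the helper name mpd_unweighted are all invented for the illustration.

```r
set.seed(1)

# Hypothetical patristic distances for ((A:1,B:1):2,((C:1,D:1):1,E:2):1)
sp <- paste0("species", LETTERS[1:5])
d <- matrix(6, 5, 5, dimnames = list(sp, sp))
d["speciesA", "speciesB"] <- d["speciesB", "speciesA"] <- 2
d["speciesC", "speciesD"] <- d["speciesD", "speciesC"] <- 2
d[c("speciesC", "speciesD"), "speciesE"] <- 4
d["speciesE", c("speciesC", "speciesD")] <- 4
diag(d) <- 0

present <- c("speciesC", "speciesD", "speciesE")  # species in one community

# Unweighted MPD: mean of the pairwise distances among the species present
mpd_unweighted <- function(species, d) {
  sub <- d[species, species]
  mean(sub[lower.tri(sub)])
}

obs <- mpd_unweighted(present, d)

# Null model: shuffle the taxon names across the tips (here, across the
# row and column labels of the distance matrix) and recalculate MPD
runs <- 999
null <- replicate(runs, {
  shuffled <- d
  new_names <- sample(rownames(d))
  dimnames(shuffled) <- list(new_names, new_names)
  mpd_unweighted(present, shuffled)
})

ses <- (obs - mean(null)) / sd(null)               # standardized effect size
p_rank <- rank(c(obs, null))[1] / (runs + 1)       # rank of observed value
```

The community data never change during the randomization; only the positions of the names on the phylogeny do, which is why species richness and abundances are held constant. In this toy case the three species form the most closely related trio available, so the SES is negative and the rank is low.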

3.4. Phylogenetic Beta Diversity of Communities

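1. Quantify the MPD (Dpw) between communities: Dpw.output = comdist(your.community.file, cophenetic(your.phylo.file), abundance.weighted = FALSE)

2. Quantify the mean nearest phylogenetic neighbor distance (Dnn) between communities: Dnn.output = comdistnt(your.community.file, cophenetic(your.phylo.file), abundance.weighted = FALSE)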

3.5. Phylogenetic Beta Dispersion of Communities (see Notes 9 and 10)

1. Generate an empty three-dimensional array with the x and y dimensions equaling the number of communities and the z dimension equaling 999 randomizations plus one for the observed Dpw values: Dpw.nulls = array(NA, c(dim(as.matrix(Dpw.output)), 1000))
2. Assign the observed Dpw values to the first layer of the array: Dpw.nulls[,,1] = as.matrix(Dpw.output)
3. Run a null model that randomizes the names of the species on the phylogeny 999 times. During each iteration, place the random Dpw values for your communities into an empty layer of the array: for(i in 2:1000){ random.phylo = tipShuffle(your.phylo.file); Dpw.nulls[,,i] = as.matrix(comdist(your.community.file, cophenetic(random.phylo), abundance.weighted = FALSE)) }
4. Generate an empty matrix that will be populated with P values indicating whether your observed Dpw value is higher or lower than that expected given the null distribution generated in step 3: Dpw.pvalue = matrix(NA, nrow = dim(Dpw.nulls)[1], ncol = dim(Dpw.nulls)[2])
5. Calculate P values for the Dpw metric: for(i in 1:dim(Dpw.pvalue)[1]){ for(j in 1:dim(Dpw.pvalue)[2]){ Dpw.pvalue[i,j] = (rank(Dpw.nulls[i,j,])[1])/1000 } }
6. Generate an empty three-dimensional array with the x and y dimensions equaling the number of communities and the z dimension equaling 999 randomizations plus one for the observed Dnn values: Dnn.nulls = array(NA, c(dim(as.matrix(Dnn.output)), 1000))
7. Assign the observed Dnn values to the first layer of the array: Dnn.nulls[,,1] = as.matrix(Dnn.output)
8. Run a null model that randomizes the names of the species on the phylogeny 999 times. During each iteration, place the random Dnn values for your communities into an empty layer of the array: for(i in 2:1000){ random.phylo = tipShuffle(your.phylo.file); Dnn.nulls[,,i] = as.matrix(comdistnt(your.community.file, cophenetic(random.phylo), abundance.weighted = FALSE)) }
9. Generate an empty matrix that will be populated with P values indicating whether your observed Dnn value is higher or lower than that expected given the null distribution generated in step 8: Dnn.pvalue = matrix(NA, nrow = dim(Dnn.nulls)[1], ncol = dim(Dnn.nulls)[2])
10. Calculate P values for the Dnn metric and store them in the Dnn.pvalue matrix: for(i in 1:dim(Dnn.pvalue)[1]){ for(j in 1:dim(Dnn.pvalue)[2]){ Dnn.pvalue[i,j] = (rank(Dnn.nulls[i,j,])[1])/1000 } }
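The nested loops in steps 5 and 10 can also be written as a single call to apply. The sketch below demonstrates this on an invented array that stands in for Dpw.nulls or Dnn.nulls (random numbers rather than real phylogenetic distances), with two observed values forced to the extremes so that the resulting ranks are easy to read. The object names are assumptions for this example.

```r
set.seed(42)
n_comm <- 3
runs <- 999

# Invented stand-in for the null array: layer 1 holds the "observed"
# values and layers 2 to 1000 hold the "null" values
nulls <- array(runif(n_comm * n_comm * (runs + 1)),
               c(n_comm, n_comm, runs + 1))
nulls[1, 2, 1] <- 2    # observed value above every null draw
nulls[1, 3, 1] <- -1   # observed value below every null draw

# Rank of the observed value (first element) within each [i, j, ] vector,
# divided by the total number of layers, for all community pairs at once
pvalue <- apply(nulls, c(1, 2), function(v) rank(v)[1] / length(v))
```

Here pvalue[1, 2] equals 1 (the observed value ranks highest of 1,000) and pvalue[1, 3] equals 0.001 (it ranks lowest). With real data the diagonal of the matrix, which compares each community with itself, carries no information and can be ignored.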


4. Notes

1. Errors often occur due to file formatting issues. A common problem is the delimitation of the community data file. Make sure that it is tab delimited, that it has hard returns at the end of each line, and that it contains no column names. Also make sure that species names match those in the phylogeny and that no two communities or species have the same name.
2. An alternative, easy way to set the working directory in R is to use the “MISC” drop-down menu when using R on Macintosh operating systems and the “FILE” drop-down menu when using Windows.
3. Help interpreting or implementing any R command can be accessed by typing a question mark followed by the command name.
4. The phylogenetic alpha diversity algorithms can take a fair amount of time to run if the null model is implemented. The phylogenetic beta diversity algorithms, on the other hand, are memory intensive due to the pairwise community distance matrices being stored and analyzed. The calculation of the observed results may take seconds to tens of minutes depending on the size of the phylogeny and the number of communities analyzed. The null model will take 999 times longer.
5. The species richness of the communities is reported in the phylogenetic alpha diversity outputs under the “ntaxa” column.
6. The SESs in the phylogenetic alpha diversity output are in the columns with headers ending in “.z”, and the P values are in the columns with headers ending in “.p”. Remember that the P value reported represents a rank. Thus, when conducting a two-tailed test at a significance level of 0.05, a P value of 0.025 or lower, or of 0.975 or higher, is significant.
7. P values may be preferred over SES results. A SES greater than 1.96 or lower than −1.96 may be considered significant only if the null distribution is normal. This is often not the case with community phylogenetics null model distributions. The P value provides the rank of the observed value in the null distribution and is therefore more directly interpretable.
8. The above code can be modified in several instances by changing abundance.weighted = FALSE to abundance.weighted = TRUE, but this must be done for both the observed and null calculations. Abundance weighting can provide substantially different results depending on the evenness of the communities, and it can often be a useful additional piece of information.
9. Inferring the mechanisms governing community assembly from the above tests alone can be problematic for two reasons. First, species niches may or may not be phylogenetically conserved. Second, if some species traits are overdispersed in a community while others are underdispersed, this may give a random phylogenetic signal (17). Thus, analyses that marry the above phylogenetic analyses with analyses of phylogenetic signal in trait data and trait dispersion are particularly powerful and more likely to provide robust inferences.
10. Functional trait dendrograms that are often used in functional ecology take the same data structure as a phylogeny. They can therefore be used in the above code to provide measures of functional alpha and beta diversity. Thus, one can compare phylogenetic and functional alpha and beta diversities using exactly the same metrics and statistical tools.
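The contrast drawn in Notes 6 and 7 between the SES and the rank-based P value can be made concrete with a deliberately skewed null distribution. In the base-R sketch below the null values, the observed value, and the helper name classify are all invented for illustration; the point is only that the two criteria can disagree when the null distribution is far from normal.

```r
# Invented, strongly right-skewed null distribution and observed value
null <- c(rep(0, 950), rep(10, 49))
obs <- 5

# Standardized effect size and rank-based P value, as in the Methods
ses <- (obs - mean(null)) / sd(null)
p_rank <- rank(c(obs, null))[1] / (length(null) + 1)

# Two-tailed reading of a rank-based P value
classify <- function(p, alpha = 0.05) {
  if (p <= alpha / 2) {
    "lower than expected (clustered)"
  } else if (p >= 1 - alpha / 2) {
    "higher than expected (overdispersed)"
  } else {
    "not distinguishable from the null expectation"
  }
}

verdict <- classify(p_rank)
```

Here the SES is about 2.09, which would pass the 1.96 threshold, whereas the observed value ranks 951st of 1,000 (P = 0.951) and does not reach the 0.975 cut-off. The rank is the safer guide in such cases.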

Acknowledgments

I would like to thank John Kress and Dave Erickson for their collaboration and invitation to contribute to this volume. N.G.S. is supported by Michigan State University.

References

1. Faith DP (1992) Conservation evaluation and phylogenetic diversity. Biol Conserv 61:1–10
2. Webb CO, Ackerly DD, McPeek MA, Donoghue MJ (2002) Phylogenies and community ecology. Annu Rev Ecol Syst 33:475–505
3. McGill BJ, Enquist BJ, Weiher E, Westoby M (2006) Rebuilding community ecology from functional traits. Trends Ecol Evol 21:178–185
4. Faith DP (1994) Genetic diversity and taxonomic priorities for conservation. Biol Conserv 68:69–74
5. Faith DP (2002) Quantifying biodiversity: a phylogenetic perspective. Conserv Biol 16:248–252
6. Webb CO (2000) Exploring the phylogenetic structure of ecological communities: an example for rain forest trees. Am Nat 156:145–155
7. Swenson NG, Enquist BJ, Pither J, Thompson J, Zimmerman JK (2006) The problem and promise of scale dependency in community phylogenetics. Ecology 87:2418–2424
8. Swenson NG, Enquist BJ, Thompson J, Zimmerman JK (2007) The influence of spatial and size scales on phylogenetic relatedness in tropical forest communities. Ecology 88:1770–1780
9. Elton C (1946) Competition and the structure of ecological communities. J Anim Ecol 15:54–68
10. Jarvinen O (1982) Species-to-genus ratios in biogeography: a historical note. J Biogeogr 9:363–370
11. Webb CO, Donoghue MJ (2005) Phylomatic: tree assembly for applied phylogenetics. Mol Ecol Notes 5:181–183
12. Kress WJ, Erickson DL, Jones FA, Swenson NG, Perez R, Sanjur O, Bermingham E (2009) Plant DNA barcodes and a community phylogeny of a tropical forest dynamics plot in Panama. Proc Natl Acad Sci USA 106:18621–18626
13. Kress WJ, Erickson DL, Swenson NG, Thompson J, Uriarte M, Zimmerman JK (2010) Improvements in the application of DNA barcodes in building a community phylogeny for tropical trees in a Puerto Rican forest dynamics plot. PLoS One 5:e15409. doi:10.1371/journal.pone.0015409
14. Swenson NG (2009) Phylogenetic resolution and quantifying the phylogenetic diversity and dispersion of communities. PLoS One 4:e4390
15. Graham CH, Fine PVA (2008) Phylogenetic beta diversity: linking ecological and evolutionary processes across space in time. Ecol Lett 11:1265–1277
16. Kembel SW, Ackerly DD, Blomberg SP et al (2010) Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26:1463–1464
17. Swenson NG, Enquist BJ (2009) Opposing assembly mechanisms in a Neotropical dry forest: implications for phylogenetic and functional community ecology. Ecology 90:2161–2170

Part V Case Studies Using DNA Barcodes

Chapter 21

FISH-BOL, A Case Study for DNA Barcodes

Robert D. Ward

Abstract

The FISH-BOL campaign was initiated in 2005 and has so far barcoded, for the cytochrome c oxidase subunit I (COI) gene, about 8,000 of the 31,000 fish species currently recognised. This includes the great majority of the world’s most important commercial species. Results thus far show that about 98% and 93% of marine and freshwater species, respectively, are barcode distinguishable. One important issue that needs to be more fully addressed in FISH-BOL concerns the initial misidentification of a small number of barcode reference specimens. This is unsurprising considering the large number of fish species, some of which are morphologically very similar and others as yet unrecognised, but constant vigilance and ongoing attention by the FISH-BOL community is required to eliminate such errors. Once the reference library has been established, barcoding enables the identification of unknown fishes at any life history stage or from their fragmentary remains. The many uses of the FISH-BOL barcode library include detecting consumer fraud, aiding fisheries management, improving ecological analyses including food web syntheses, and assisting with taxonomic revisions.

Key words: COI, Cytochrome oxidase, Species identification, Fish, Elasmobranchii, Actinopterygii, DNA barcode

1. Introduction

In 2005, a campaign, FISH-BOL, was launched to DNA barcode all the fish species of the planet (1). These currently number about 31,000. Approximately half are marine species, with an estimated 5,000 further marine species awaiting description (2). In total, there are likely to be around 40,000 extant fish species. Some 8,000 (August 2010) of the currently recognised fish species have been barcoded as part of this campaign, with an average of about seven specimens per species. Since the inception of FISH-BOL, progress has been steady (Fig. 1). About 7,500 of the 30,000 known actinopterygians have been barcoded, and about 500 of the 1,000 or so elasmobranchs. Numbers of fish species barcoded according to the taxonomy browser of the Barcoding of Life Database (BOLD, www.barcodinglife.org) are appreciably higher, at over 10,000 actinopterygians and 1,000 elasmobranchs. The difference in FISH-BOL and BOLD tallies comes from two sources: (1) BOLD captures barcodes that are lodged in GenBank but not in BOLD itself and (2) BOLD tallies include as discrete species those specimens not yet scientifically named, such as Cynoglossus cf. arel, Cynoglossus sp. E or Cynoglossus sp. Individual barcoding studies of 50 or more species include marine Australian fishes (3), Australian sharks and rays (4), Canadian freshwater fishes (5), North American marine fishes (6), coral reef fish (7), Central American freshwater fishes (8), Indian marine fishes (9) and Antarctic fishes (10).

Fig. 1. Progress of FISH-BOL, showing numbers of species barcoded by date.

W. John Kress and David L. Erickson (eds.), DNA Barcodes: Methods and Protocols, Methods in Molecular Biology, vol. 858, DOI 10.1007/978-1-61779-591-6_21, © Springer Science+Business Media, LLC 2012


Most important commercial species have now been barcoded. For example, of the 60 principal fish species that constitute the bulk of capture production (FAO Capture Production 2008, see ftp.fao.org/fi/stat/summary/default.htm#capture), 56 have been barcoded (August 2010, mean sample size per species = 28.1) and the 57th is sampled and awaiting barcoding. However, the intent of FISH-BOL is to barcode at least five specimens of every fish species on the planet, and achieving that goal will clearly be difficult.

2. Materials

1. Any part of the fish may be used for extracting DNA. White muscle is probably used most frequently, but liver or fin tissues are also commonly used. If larval fish are to be retained and vouchered, DNA may be extracted from a single eyeball. Single fish eggs also yield suitable DNA.
2. Tissue storage in 95% alcohol is recommended. If this is not practical, for example during fieldwork and/or transport, DMSO may be used (see Note 1). Where possible, we also try to retain a tissue portion frozen at −80°C.
3. During fieldwork, and in laboratory processing and sample cataloguing, we try to maintain the following sequence: (a) collection, (b) preliminary identification and labelling, (c) tissue extraction for DNA barcoding (usually white muscle taken from under a scalpel-cut skin flap on the right side of the fish), (d) photography of the left side of the fish (see Note 2), (e) storage of the whole specimen for later museum vouchering and/or identification verification.

3. Methods

3.1. Fish DNA Barcoding Methodology

1. Fish barcoding is based on sequencing the standard 655 bp fragment of cytochrome c oxidase I.
2. A range of fish primers is provided in Table 1 (also see Notes 3 and 4). The Biodiversity Institute of Ontario (BIO) has standardised protocols for DNA extraction, PCR and sequencing (see Note 4).
3. It is recommended that for reference barcodes, both forward and reverse sequences are read and that the consensus sequence be posted on BOLD. For matching of unknown specimens against the reference library, sequencing of the unknown in a single direction is likely to be sufficient.


Table 1
PCR primers for fish DNA barcoding

Name | 5′–3′ sequence | Reference

Primers without M13 tails
FishF1 | TCAACCAACCACAAAGACATTGGCAC | (3)
FishF2 | TCGACTAATCATAAAGATATCGGCAC | (3)
FishR1 | TAGACTTCTGGGTGGCCAAAGAATCA | (3)
FishR2 | ACTTCAGGGTGACCGAAGAATCAGAA | (3)
Fish-BCH | ACTTCYGGGTGRCCRAARAATCA | (68)
Fish-BCL | TCAACYAATCAYAAAGATATYGGCAC | (68)
TelF1 | TCGACTAATCAYAAAGAYATYGGCAC | (10)
TelR1 | ACTTCTGGGTGNCCAAARAATCARAA | (10)

M13-tailed primers
Fish cocktail C_FishF1t1-C_FishR1t1 (ratio 1:1:1:1) | (69)
VF2_t1 | TGTAAAACGACGGCCAGTCAACCAACCACAAAGACATTGGCAC
FishF2_t1 | TGTAAAACGACGGCCAGTCGACTAATCATAAAGATATCGGCAC
FishR2_t1 | CAGGAAACAGCTATGACACTTCAGGGTGACCGAAGAATCAGAA
FR1d_t1 | CAGGAAACAGCTATGACACCTCAGGGTGTCCGAARAAYCARAA
Mammal cocktail C_VF1LFt1-C_VR1LRt1 (ratio 1:1:1:3:1:1:1:3) | (69)
LepF1_t1 | TGTAAAACGACGGCCAGTATTCAACCAATCATAAAGATATTGG
VF1_t1 | TGTAAAACGACGGCCAGTTCTCAACCAACCACAAAGACATTGG
VF1d_t1 | TGTAAAACGACGGCCAGTTCTCAACCAACCACAARGAYATYGG
VF1i_t1 | TGTAAAACGACGGCCAGTTCTCAACCAACCAIAAIGAIATIGG
LepR1_t1 | CAGGAAACAGCTATGACTAAACTTCTGGATGTCCAAAAAATCA
VR1d_t1 | CAGGAAACAGCTATGACTAGACTTCTGGGTGGCCRAARAAYCA
VR1_t1 | CAGGAAACAGCTATGACTAGACTTCTGGGTGGCCAAAGAATCA
VR1i_t1 | CAGGAAACAGCTATGACTAGACTTCTGGGTGICCIAAIAAICA

Sequencing primers for M13-tailed PCR products
M13F | TGTAAAACGACGGCCAGT | (70)
M13R | CAGGAAACAGCTATGAC | (70)
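Two features of Table 1 can be checked directly from the sequences. Letters such as Y, R and N are IUPAC ambiguity codes, so a degenerate primer stands for a small pool of oligonucleotides, and the M13-tailed primers begin with the M13F or M13R sequencing primer. The base-R sketch below, using a subset of the primers, counts the variants each primer represents and confirms the tails. The object and function names are assumptions for this example, and the inosine symbol (I) used in VF1i_t1 and VR1i_t1 is not an IUPAC ambiguity code and is deliberately left out.

```r
# A subset of the primers listed in Table 1
primers <- c(
  FishF2     = "TCGACTAATCATAAAGATATCGGCAC",
  FishR2     = "ACTTCAGGGTGACCGAAGAATCAGAA",
  `Fish-BCH` = "ACTTCYGGGTGRCCRAARAATCA",
  TelR1      = "ACTTCTGGGTGNCCAAARAATCARAA",
  FishF2_t1  = "TGTAAAACGACGGCCAGTCGACTAATCATAAAGATATCGGCAC",
  FishR2_t1  = "CAGGAAACAGCTATGACACTTCAGGGTGACCGAAGAATCAGAA"
)
m13 <- c(M13F = "TGTAAAACGACGGCCAGT", M13R = "CAGGAAACAGCTATGAC")

# Number of bases that each IUPAC nucleotide code can stand for
iupac_n <- c(A = 1, C = 1, G = 1, T = 1,
             R = 2, Y = 2, S = 2, W = 2, K = 2, M = 2,
             B = 3, D = 3, H = 3, V = 3, N = 4)

# Number of distinct sequences represented by a (possibly degenerate) primer
degeneracy <- function(seq) prod(iupac_n[strsplit(seq, "")[[1]]])

variants <- sapply(primers, degeneracy)

# Do the tailed primers start with the M13 sequencing primers?
forward_tailed <- startsWith(primers[["FishF2_t1"]], m13[["M13F"]])
reverse_tailed <- startsWith(primers[["FishR2_t1"]], m13[["M13R"]])
```

In this subset, Fish-BCH and TelR1 each represent 16 sequence variants, whereas the non-degenerate primers represent one.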

4. For PCR-recalcitrant samples, mini-barcodes of just 100–200 bases may suffice for identification purposes (11, 12).
5. Tissues from fish preserved in formalin have long been considered too refractory to sequence (see Note 5).
6. All participants in fish barcoding are strongly urged to use the Barcode of Life Database (BOLD; www.barcodinglife.org, see ref. 13) as the repository for their data (see Note 6).
7. The process of developing a reference library of fish barcodes has highlighted a number of issues that have impeded the completion of this library, and these must be considered. They include: a failure to PCR amplify and sequence some specimens (see Note 7), the possibility that mitochondrial DNA inserts into nuclear DNA (numts) are being sequenced rather than the true mtDNA COI gene (see Note 8), and the inability of COI to distinguish some species (see Note 9).
8. More details on fish barcoding methodologies are given elsewhere in this volume.

3.2. Gaps in Species Coverage

1. Some 8,000 (FISH-BOL) to 10,000 (BOLD plus GenBank) fish species have been barcoded, leaving some 20,000 or more species awaiting examination. The barcoded species are not distributed randomly, either taxonomically (Table 2) or regionally (Table 3). Approximately one-half of all elasmobranchs have been barcoded, but only one-quarter of the much more numerous actinopterygians. A detailed analysis of species coverage by family, as of mid-2010, has been published (14). At least one species has been barcoded from about 90% of all families. Some quite large families are well represented, such as the shark family Carcharhinidae with 47 of 52 species barcoded (90.4%). However, only 381 of 2,770 species of the largest family, Cyprinidae, have been barcoded (13.8%). Most of the unrepresented families have few species, the largest such being the rice fishes of the family Adrianichthyidae with 29 as yet unbarcoded species. Regional coverage is similarly varied, ranging from about 30% of all fish species in the Australian and North American regions to only 10% of North East Asia fishes.
2. The FISH-BOL goal of a reference barcode for every living fish species is also one of the goals of iBOL (www.iBOL.org), but will not be easily attained. A major impediment is the lack of sufficient dedicated funding for collecting trips and subsequent taxonomic identification and vouchering; the sequencing protocols themselves are relatively inexpensive to implement. One way to use limited funds in an efficient manner is to target a hyper-diverse region over a short period of time, a “barcoding blitz” (see Note 10).
3. Other cost-effective strategies for increasing coverage can also be devised. There are many museums around the world with barcode-friendly tissues (i.e., not formalin stored) from vouchered specimens that still await barcoding. Perhaps more effort needs to be made to attract specialists in particular taxa or geographic regions to the campaign. Greater use could be made of piggy-backing on existing research expeditions or on the catches of commercial and artisanal fishers. It is notable that the sampling effort provided by vessels and researchers engaged in the International Polar Year (2007–2009) resulted in the high coverage of 74% of fish species for the Arctic and 50% for the Antarctic (14), although fish diversity in these regions is limited compared with the tropics.
4. The majority (about 73%) of fish species barcoded thus far are marine (14), and in future much more attention needs to be placed on the large freshwater faunas of South America, Africa and Asia.
5. Finally, a plea that all researchers with collections of fish barcodes deposit these barcodes in BOLD as soon as possible.

Table 2
Breakdown of FISH-BOL progress by taxonomic class

Class | Species number | Barcoded number | % Progress
Actinopterygii | 29933 | 7266 | 24
Cephalaspidomorphi | 42 | 24 | 57
Elasmobranchii | 1114 | 549 | 49
Holocephali | 46 | 27 | 59
Myxini | 74 | 13 | 18
Sarcopterygii | 11 | 2 | 18

Table 3
Breakdown of FISH-BOL progress by regional Working Group (defined by FAO regions as given)

Working group | FAO regions | Species number | Barcoded number | % Progress
Africa | 1, 34, 47, 51 | 8980 | 1247 | 14
Australia | 6, 57, 71, 81 | 8623 | 2521 | 29
Europe | 5, 27, 37 | 2028 | 396 | 20
India | 4, 51, 57 | 11023 | 1997 | 18
Meso America | 2, 31, 77 | 7677 | 1750 | 23
North America | 2, 18, 21, 31, 67, 77 | 8112 | 2274 | 28
North East Asia | 4, 5, 18, 61 | 10414 | 924 | 9
Oceania/Antarctica | 8, 48, 58, 77, 81, 88 | 5702 | 1403 | 25
South America | 3, 31, 41, 87 | 8981 | 1043 | 12
South East Asia | 4, 57, 71 | 12140 | 2103 | 17

3.3. Identification of Reference Specimens

1. Perhaps the most significant issue that has arisen in the development of the fish barcode library concerns mislabelling or misidentification of some reference specimens. The former can arise from sample contamination or sample confusion, and most such cases are usually obvious after inspection of the sequence data and can be rectified. More important is the issue of specimen misidentification. There are many contributors to FISH-BOL and not all identifications are made by trained taxonomists. With more than 30,000 fish species, including many morphologically similar species complexes, such errors are not unexpected. This issue is compounded by the uncertain or incomplete taxonomies of many fish groups, and by a lack of knowledge of the true extent of the range of many species and likely degrees of endemism. Elimination of errors and reconciliation of Linnean names across reference specimens is imperative and needs further effort by the scientific community of FISH-BOL. This is a complex and demanding task, and one that is therefore expensive to implement fully, but it needs increased attention.
2. Putative barcode errors can now be flagged in BOLD, either by removing them to a problematic sample project or by being individually flagged but remaining in the original project. Flagged records are removed from BOLD’s identification engine. When verified, or corrected, they can be moved back into their original project or the flag removed.
3. Effort put into correct initial diagnosis of species saves confusion and time later. Where specimens can be reliably identified to a known species, that diagnosis should of course be made. Where there is uncertainty, this should be recognised. Specimens can be identified just to genus (e.g. Arius sp.) or perhaps use might be made of the prefix “cf” (e.g. Arius cf. venosus, meaning that the specimen appeared to be closest to Arius venosus but that the identification is provisional and it may in fact be another, perhaps unrecognised, species). Other similar notations can be used (see Note 11).
4. Retention of whole specimen vouchers, whenever possible, is highly desirable. Sometimes this might not be possible, for example where specimens are very large. In such situations, digital images should be retained, both of the whole fish and of any diagnostic characters. If the permanent vouchering of all barcoded specimens is not possible, it might sometimes be possible to retain temporary vouchers. These can be discarded when identifications and barcodes have been verified.
5. It is now recognised that the inclusion of a precision index to gauge levels of confidence in identifications is highly desirable. Since July 1993, specimens in the Australian National Fish Collection database at CSIRO have been identified to one of five levels of reliability according to the taxonomic expertise of the identifier (15). These were discussed at the inaugural FISH-BOL meeting in 2005 and published in the workshop proceedings; they are summarised here in Table 4 (see Note 12).
6. Taxonomy is an ongoing endeavour (2). New fish species are continually described, and existing species may have their generic or species placements changed. Keeping track of these taxonomic revisions in databases such as BOLD is not a trivial matter and requires constant attention.

Table 4
Reliability of identification: the system used by the CSIRO Australian National Fish Collection

Identification scale
Level 1: Highly reliable identification. Specimen identified by (a) an internationally recognised authority of the group, or (b) a specialist that is presently studying or has reviewed the group in the Australian region
Level 2: Identification made with a high degree of confidence at all levels. Specimen identified by a trained identifier who had prior knowledge of the group in the Australian region or used available literature to identify the specimen
Level 3: Identification made with high confidence to genus but less so to species. Specimen identified by (a) a trained identifier who was confident of its generic placement but did not substantiate its species identification using the literature, or (b) a trained identifier who used the literature but still could not make a positive identification to species, or (c) an untrained identifier who used most of the available literature to make the identification
Level 4: Identification made with limited confidence. Specimen identified by (a) a trained identifier who was confident of its family placement but unsure of generic or species identification (no literature used apart from illustrations), or (b) an untrained identifier who had/used limited literature to make the identification
Level 5: Identification superficial. Specimen identified by (a) a trained identifier who is uncertain of the family placement of the species (cataloguing identification only), (b) an untrained identifier using, at best, figures in a guide, or (c) where the status and expertise of the identifier is unknown

3.4. Uses of the Fish Reference DNA Barcode Library

1. Once a full barcode reference library is in place, identifying the great majority of unknown specimens (of any life history stage) or samples is straightforward. Most commercial fish species are now represented in the BOLD database and have distinguishable barcodes. Exactly how sequences from unknown specimens are best matched against reference sequences is still a matter of debate (see Note 13).
2. Barcoding can be applied to ensure food safety and to protect against consumer fraud (see Note 14).
3. Processed samples, including cooked, grilled and deep fried fillets, can be successfully barcoded (16, 17). Mini-barcodes have been proposed for species discrimination of canned products (18), where the combination of high temperatures and pressure degrades DNA.
4. The reference library can be used to check or provide identifications for fisheries management purposes. Finned, headed or gutted specimens can be identified to ensure that quota regulations are not being breached (see Note 15).
5. Biological sciences will benefit from having accurate identifications of all fish specimens, from eggs to adults (see Note 16).
6. Prey items of sharks have been barcoded and identified (19, 20).
7. Environmental barcoding offers the hope of identifying species in bulk samples taken from, for example, ichthyoplankton tows. These may be analysed using massively parallel sequencing platforms.
8. Barcoding can also be used to verify the identity of cell lines from fish species (21, 22).
9. Finally, barcoding is already making important contributions to the science of fish taxonomy (see Note 17). Whenever possible, it would be useful if describers of new species included a DNA barcode as part of that description (23, 24).

3.5. Structure of FISH-BOL

1. The fish barcode of life campaign (FISH-BOL) was initiated at a meeting at the University of Guelph in June 2005. This was organised shortly after initial results showed that species of fish could indeed be reliably discriminated by DNA barcoding using COI (3) and was attended by about 50 fisheries geneticists and taxonomists, together with sponsors and supporters of the DNA barcoding approach to specimen identification. The ultimate goal of FISH-BOL is to barcode all the fish species of the planet.
2. The meeting agreed that global coverage would be best facilitated by establishing ten regional working groups defined by FAO regions (Table 3). These groups would take responsibility for overseeing collections, identifications and barcoding of the fish faunas of their areas. The working groups are therefore based on geography rather than taxonomy. They are expected to raise the profile of fish barcoding in their region in various ways, including holding or participating in barcoding workshops and conferences.
3. The meeting agreed that BOLD should be used as the workbench for assembly of fish barcode sequences, and a linked Web site (www.fishbol.org) was established to further the aims of FISH-BOL.
4. FISH-BOL is administered through two co-chairs and a campaign coordinator (see Note 18). Each working group also has
a chair and usually a deputy chair. Members of FISH-BOL and their contact details are listed on the Web site by working group, although not all contributors to the campaign are listed.
5. FISH-BOL is an informal collaboration of fish taxonomists and geneticists. It has no dedicated funds, although its establishment and some initial meetings were assisted by funds from the Consortium for the Barcode of Life (CBOL).
6. The authoritative list of fish species is the Catalogue of Fishes (25). FishBase (www.fishbase.org) uses this list and maintains a database of species with distributions by FAO region, country and habitat type. FishBase worked with FISH-BOL to provide a list of fish species for each FAO region and thus each FISH-BOL working group. Barcoding progress of working groups can thereby be monitored (Table 3).
7. For each specimen, the standard data fields of the BOLD submission are completed wherever possible: identification (genus and species), identifier (with email and institution), sample number, voucher number, institution storing, sample donor (with email), collector, collection date, locality (with GPS coordinates), elevation/depth (in metres), sex and life stage. The FAO region of a barcoded fish is stated in the “Extra Info” field of BOLD, and soon identification reliability levels will also be trialled in this field (see Note 12). Additional information can be recorded in the “Notes” field.
8. An initial target of five specimens per species per FAO region was set. It was recognised from the start that this will often be insufficient to encompass all the COI barcodes of a species (see Note 19) and that varying degrees of genetic isolation between populations (especially likely for freshwater fishes (26)) might mean that barcodes collected in one population differ from barcodes collected in other populations. Wherever possible, therefore, replicate specimens should be collected from different populations. Larger sample sizes will often be desirable, especially if species are widespread or genetically heterogeneous.
9. COI divergences within a fish species are usually less than 2% (27). Intraspecies variability exceeding that level (and thereby falling into multiple bins in BOLD) might reflect the existence of undescribed cryptic species. Resolution of such instances will likely require the barcoding of additional specimens (preferably at least five per putative species) and detailed taxonomic examination.
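The sampling target in step 8 and the 2% guideline in step 9 lend themselves to a simple screening script. The following is a minimal sketch only, not part of the FISH-BOL or BOLD workflow: it assumes that the barcodes are already aligned to equal length, uses the uncorrected proportion of differing sites (p-distance) as the divergence measure (a corrected distance model may be preferable in practice), and works on hypothetical records of the form (species, FAO region, sequence). All function names, species names and sequences are illustrative.

```python
from collections import defaultdict
from itertools import combinations

TARGET_PER_REGION = 5        # initial target: five specimens per species per FAO region
DIVERGENCE_THRESHOLD = 0.02  # within-species COI divergence is usually below 2%


def p_distance(seq_a, seq_b):
    """Uncorrected proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    (Assumption: p-distance stands in for whatever model a real study uses.)
    """
    valid = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def undersampled(records):
    """Return {(species, fao_region): count} for groups below the target."""
    counts = defaultdict(int)
    for species, region, _seq in records:
        counts[(species, region)] += 1
    return {key: n for key, n in counts.items() if n < TARGET_PER_REGION}


def divergent_species(records):
    """Return {species: max pairwise p-distance} where the maximum exceeds 2%."""
    by_species = defaultdict(list)
    for species, _region, seq in records:
        by_species[species].append(seq)
    flagged = {}
    for species, seqs in by_species.items():
        distances = (p_distance(a, b) for a, b in combinations(seqs, 2))
        max_d = max(distances, default=0.0)  # single-specimen species give 0.0
        if max_d > DIVERGENCE_THRESHOLD:
            flagged[species] = max_d
    return flagged


# Toy data (hypothetical; real barcodes are far longer than 20 bases).
EXAMPLE_RECORDS = [
    ("Species A", "57", "ACGTACGTACGTACGTACGT"),
    ("Species A", "57", "ACGTACGTACGTACGTACGT"),
    ("Species A", "71", "ACGTACGTACGTACGTACGA"),
    ("Species B", "57", "TTGCATTGCATTGCATTGCA"),
    ("Species B", "57", "TTGCATTGCATTGCATTGCA"),
]

print("Below sampling target:", undersampled(EXAMPLE_RECORDS))
print("Exceeding 2% divergence:", divergent_species(EXAMPLE_RECORDS))
```

A species flagged by such a screen is only a candidate for further work; as noted in step 9, resolution still requires additional specimens and detailed taxonomic examination.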

4. Conclusions

The FISH-BOL campaign has barcoded about 8,000 named fish species in the last 5 years, and it will clearly take several more years to approach its goal of barcoding all the world’s fishes. Yet the many uses of a validated and comprehensive fish barcode reference library in the diverse fields of food commerce and safety, taxonomy, and biological and ecological sciences suggest unequivocally that the goal is one worth attaining. FISH-BOL warmly thanks all those in the community who are already participating, and calls for the involvement of further scientists, collection managers and taxonomists to enable its campaign goal to be reached.

5. Notes

1. Large pieces of tissue should be cut into small pieces (