Statistics and Probability in Forensic Anthropology [1 ed.]
 012815764X, 9780128157640

Table of contents :
Front Matter
Copyright
Dedication
Contributors
Acknowledgment
Introduction
What “statistical” questions can we expect from judges? An introductory note from a European adversarial s ...
Study design and sampling
Introduction
Study design
Sample size and power
Sampling
Measurement error/bias
Concluding remarks
References
Recommended reading
Physical and virtual sources of biological data in forensic anthropology: Considerations relative to practit ...
Introduction
Biological data and the concept of “population specificity”
Skeletal collections in physical and forensic anthropology
Documented versus nondocumented
Contemporary versus noncontemporary
Representativeness
Documented human skeletal collections (physical)
Documented human skeletal collections (virtual)
Conclusions
References
Recommended reading
Initial assessment: Measurement errors and interrater reliability
Introduction
Before you start
Assessment of measurement error
Assessment of the inter-rater reliability for observational data
References
General considerations about data and selection of statistical approaches
Introduction
Considerations about data
Qualitative variables (categorical, nominal, or ordinal variables)
Quantitative variables
Accuracy, precision, trueness, and reliability
Method selection and evaluation
Diving deeper into considerations on method selection and interpretation in forensic anthropology
Hypothesis testing and interpretation
The use of P-values
Errors and their meaning in statistics and methodological approaches
Conclusion
References
Recommended reading
Probability distributions, hypothesis testing, and analysis
Introduction
Bayesian versus frequentist analysis
Statistical testing and modeling
Probability distributions
Parametric and nonparametric tests
Measures of strength of association
Statistical models
Testing a test/method
Conclusion
Reference
Recommended reading
Data mining and decision trees
Introduction
Data mining
Decision trees
Example: Decision tree for cranial morphological sex assessment
Improving decision trees
References
Recommended reading
Frequentist approach to data analysis and interpretation in forensic anthropology
Introduction
Exploratory data analysis (EDA)
Hypothesis testing
Comparing two independent samples and the t-test
Deviation from normality
Estimating unknown parameters
Estimating continuous variables: Linear regression
Estimating categorical variables: Logistic regression
Model selection
Assessing performance: Model validation
Discussion
References
Use of Bayes theorem in data analysis and interpretation
Introduction
Principles
The question should be made explicit in propositions
The answer should be based on information and expertise
The scientist should not deviate from the laws of logic
Logic
Implications for forensic interpretation
Examples
Contextual information
Errors of reasoning
Phrasing propositions
Conclusion
References
Sex estimation using nonmetric variables: Application of R functions
Introduction
Binary logistic regression
Linear discriminant analysis
Sex assessment of an unknown individual using morphological traits: considerations
Sex estimation using R
Discussion and conclusion
References
Recommended reading
Sex estimation using continuous variables: Problems and principles of sex classification in the zone of unce ...
Introduction
Material
Sexual dimorphism and the accuracy of sex estimation
Discriminant function analysis in sex estimation
Accuracy overestimation and cross-validation
The population specificity of discriminant functions
Sex estimation in the zone of uncertainty
Conclusion
Acknowledgment
References
Age estimation of living persons: A coherent approach to inference and decision
Introduction
Uncertainty and inference in forensic age estimation
Bayesian perspective in forensic age estimation
Posterior probability distribution on the chronological age
Prior probability distribution on the chronological age
Likelihood function
Bayesian inference from an operational perspective
Normative approach to decision in age estimation
A hypothetical case example
Assessing the needs of the mandating authority
Evidence collection
Physical examination and age estimation interview
Skeletal and dental evidence
Evidence interpretation
Prior probability assignment
Likelihood assignment
Sensitivity analysis on prior probability
Decision theory in the example case
Discussion and conclusion
Acknowledgment
References
Extreme learning machine neural networks for adult skeletal age-at-death estimation
Introduction
Training neural networks for age-at-death estimation
The extreme learning machine algorithm
Efficient training and regularization
Obtaining valid prediction intervals with neural networks
Conformal prediction
Performance analysis
Funding
References
Statistical approaches to ancestry estimation: New and established methods for the quantification of cranial ...
Introduction
Linear discriminant analysis (LDA)
Geometric morphometrics (GM)
Ensemble modeling
Admixture approach
The quantification of nonmetric results
Concluding remarks
Acknowledgment
References
Stature estimation
Introduction
Completely preserved skeletal remains
Partially preserved skeletal remains of an individual whose population origin is known
Partially preserved skeletal remains of an individual whose population origin is unknown
Final remarks
References
Osteomics: Decision support systems for forensic anthropologists
Introduction
Biogeographic prediction
AncesTrees (http://osteomics.com/AncesTrees/)
rASUDAS (http://osteomics.com/rASUDAS/)
hefneR (http://osteomics.com/hefneR/)
Estimation of body parameters
MassReg (http://osteomics.com/MassReg/)
SPINNE (http://osteomics.com/SPINNE/)
RAXTE (http://osteomics.com/raxter/)
Age-at-death estimation
DXAGE (http://osteomics.com/DXAGE/)
SAMS (http://osteomics.com/SAMS/)
Sex diagnosis
SeuPF (http://osteomics.com/SeuPF/)
CADOES (http://osteomics.com/CADOES/)
Ammer-Coelho (http://osteomics.com/Ammer-Coelho/)
Final remarks
References
Recommended reading
Fordisc: Anthropological software for estimation of ancestry, sex, time period, and stature
Introduction
Ancestry and sex estimation with linear discriminant functions
Stature estimation with linear regression
Fordisc
What kind of data can you use in Fordisc?
Cranial measurements
Mandibular measurements
Postcranial measurements
Principal component analysis
What kind of population samples are in Fordisc?
Forensic data Bank (FDB)
Howells
Output
Standard results page
Leave-one-out cross validation
Posterior probability (PP)
Typicalities
Stature
Graphic output
Statistical group comparison
How can Fordisc results be presented in court?
References
Geometric morphometrics
Introduction
The concept of shape
Types of analyses within geometric morphometrics
Principal component analysis
Canonical variates analysis (CVA)
Multivariate regression
Partial least squares (PLS) analysis
Visualization
Symmetry and asymmetry
Symmetry
Asymmetry
Software
ThreeSkull
MorphoJ
3D-ID
R programs
References
Bayesian inference in personal identification
Introduction
Sex estimation
Age estimation
Age estimation for investigative purposes
Age estimation for evaluative purposes
Combining evidence
Acknowledgment
References
Visual identification of persons: Facial image comparison and morphological comparative analysis
Introduction
Terminology: Identification versus recognition
The morphological comparative analysis
Methodology in visual identification of persons
Controlling visual perception
Evidential value of morphological comparative analysis
Conclusion
References
Communicating evidence with a focus on the use of Bayes theorem
Introduction
Explaining Bayes theorem
Different modes of reporting
Reporting the propositions
Reporting the strength of the evidence
The use of a verbal scale
Dealing with uncertainty
The different meanings of LR=1
Conclusions
References
IBM SPSS statistics
Introduction
IBM SPSS statistics overview
SPSS data import
SPSS data editing
Missing values
Statistical analysis in SPSS
The outcome of statistical analysis in SPSS
Conclusion
Reference
Recommended reading
An introduction to the R language
Introduction
The R working environment
Types of data
Data input
Graphs
Libraries
Statistical analysis in R
References
Websites using R for anthropology
Recommended reading
Stata
Introduction
Data management
Analysis
Graphics and reporting
Help
References
Recommended reading
SAS
Introduction
Data step
Data presentation
Reference
Glossary
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
Z



Statistics and Probability in Forensic Anthropology

Edited by

Zuzana Obertová

Forensic Anthropologist, Visual Identification of Persons, Zürich Forensic Science Institute, Zürich, Switzerland; Centre for Forensic Anthropology, School of Social Sciences, The University of Western Australia, Australia

Alistair Stewart

Retired, School of Population Health, The University of Auckland, Auckland, New Zealand

Cristina Cattaneo

Professor, LABANOF, Sezione di Medicina Legale, Dipartimento di Scienze Biomediche per la Salute, Università degli Studi di Milano, Milano, Italy

Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom © 2020 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-815764-0 For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Stacy Masucci Acquisitions Editor: Elizabeth Brown Editorial Project Manager: John Leonard Production Project Manager: Selvaraj Raviraj Cover Designer: Matthew Limbert Typeset by SPi Global, India

ZO: For Sofia Antonia, Samuel Matene, Martin and my parents.
AS: My thanks to my wife and fellow biostatistician, Joanna, who has supported me in all my biostatistical endeavours plus all other aspects of my life.
CC: For all scientists and judges who wonder if the right answers can be found in statistics.

Contributors

Pascal Adalian
UMR 7268 ADES - Aix-Marseille University, CNRS, EFS, Marseille, France

Bridget Algee-Hewitt
Center for Comparative Studies in Race and Ethnicity, Stanford University, Stanford, CA, United States

Radoslav Beňuš
Department of Anthropology, Faculty of Natural Sciences, Comenius University in Bratislava, Bratislava, Slovakia

Charles E.H. Berger
Netherlands Forensic Institute, The Hague; Institute for Criminal Law and Criminology, Leiden University, Leiden, The Netherlands

Soren Blau
Victorian Institute of Forensic Medicine/Department of Forensic Medicine, Monash University, Melbourne, VIC, Australia

Hans H. de Boer
Netherlands Forensic Institute, The Hague; Department of Pathology, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, The Netherlands

Silvia Bozza
School of Criminal Justice, University of Lausanne, Lausanne, Switzerland; Department of Economics, University Ca’ Foscari of Venice, Venice, Italy

Jaroslav Brůžek
Department of Anthropology and Human Genetics, Faculty of Science, Charles University, Prague 2, Czech Republic

Zuzana Čaplová
Department of Biological and Health Sciences, University of Milan, Milan, Italy

Cristina Cattaneo
Professor, LABANOF, Sezione di Medicina Legale, Dipartimento di Scienze Biomediche per la Salute, Università degli Studi di Milano, Milano, Italy

Catarina Coelho
Laboratory of Forensic Anthropology, Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra, Portugal

Eugénia Cunha
Laboratory of Forensic Anthropology, Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra; National Institute of Legal Medicine and Forensic Sciences, Lisbon, Portugal

Francisco Curate
Laboratory of Forensic Anthropology, Centre for Functional Ecology; Research Centre for Anthropology and Health (CIAS), Department of Life Sciences, University of Coimbra, Coimbra, Portugal

João d’Oliveira Coelho
Laboratory of Forensic Anthropology, Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra, Portugal; Primate Models for Behavioural Evolution Lab, Institute of Cognitive & Evolutionary Anthropology, School of Anthropology, University of Oxford, Oxford, United Kingdom

Oguzhan Ekizoglu
Department of Forensic Medicine, Tepecik Training and Research Hospital, Izmir, Turkey; Centre Universitaire Romand de Medecine Legale, Lausanne-Genève (CURML), Lausanne, Switzerland

Daniel Franklin
Centre for Forensic Anthropology, The University of Western Australia, Crawley, WA, Australia

Patrik Galeta
Department of Anthropology, University of West Bohemia, Pilsen, Czech Republic

Julieta G. García-Donas
Edinburgh Unit for Forensic Anthropology, School of History Classics and Archaeology, University of Edinburgh, Edinburgh; Centre for Anatomy and Human Identification, School of Science and Engineering, University of Dundee, Dundee, Scotland, United Kingdom

Richard L. Jantz
The University of Tennessee, Knoxville, TN, United States

Michael N. Kalochristianakis
Medical School, University of Crete, Heraklion, Crete, Greece

Elena F. Kranioti
Forensic Medicine Unit, Department of Forensic Sciences, Faculty of Medicine, University of Crete, Heraklion, Crete, Greece; Edinburgh Unit for Forensic Anthropology, School of History Classics and Archaeology, University of Edinburgh, Edinburgh, United Kingdom

Laura Manthey
The University of Tennessee, Knoxville, TN, United States

David Navega
Laboratory of Forensic Anthropology, Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra, Portugal

Efthymia Nikita
Science and Technology in Archaeology and Culture Research Center, The Cyprus Institute, Nicosia, Cyprus

Panos Nikitas
Laboratory of Physical Chemistry, Department of Chemistry, Aristotle University of Thessaloniki, Thessaloniki, Greece

Zuzana Obertová
Forensic Anthropologist, Visual Identification of Persons, Zürich Forensic Science Institute, Zürich, Switzerland; Centre for Forensic Anthropology, School of Social Sciences, The University of Western Australia, Australia

Stephen D. Ousley
The University of Tennessee, Knoxville, TN; Mercyhurst University, Erie, PA, United States

Grit Schüler
Expert Office for Morphological Anthropology, Berlin, Germany

Minas Sifakis
The Alan Turing Institute, London, United Kingdom

Emanuele Sironi
School of Criminal Justice, University of Lausanne, Lausanne, Switzerland

Alistair Stewart
Retired, School of Population Health, The University of Auckland, Auckland, New Zealand

Petra Švábová
Department of Anthropology, Comenius University, Bratislava, Slovakia

Franco Taroni
School of Criminal Justice, University of Lausanne, Lausanne, Switzerland

Mayonne van Wijk
Netherlands Forensic Institute, The Hague, The Netherlands

Tomáš Zeman
Department of Military Science Theory, Faculty of Military Leadership, University of Defence in Brno, Brno, Czech Republic

Acknowledgment

Many thanks

To Elsevier, especially Elizabeth, for supporting the idea for this book.
To John, for his generosity, help, and kindness.
To all contributors, but especially to those of you who came to the rescue when we hit a rough patch.

On a personal note: When I (ZO) first thought about writing this book, I had a vision of how I would like it to be. Now, I think that it has become much better than I could have imagined because of you sharing your ideas, experience, and knowledge. Some of you I know as my mentors, some I am honored to call my friends, and some of you I am yet to meet in person, which I very much look forward to. I sincerely appreciate all your help and support. Last but not least, I am extremely grateful to my coeditors for their support and hard work.


Introduction

The world of statistics is enormous, and one can easily get lost in the numerous definitions, equations, and different approaches. There are many books on biomedical statistics/biostatistics, some specifically on statistics for forensic sciences (mostly covering topics not related to forensic anthropology), and a handful on statistics for anthropology (mostly related to research in cultural or physical anthropology). However, most of these books bring statistical theory and definitions to the forefront, often showing complex calculations and presenting only a few examples. While understanding the mathematics behind statistical analyses is no doubt beneficial, probably only a few students and practitioners have the time (and willingness) to immerse themselves in the mathematical formulae.

With this book, we intend to provide a practical guide for forensic scientists, mainly anthropologists and pathologists, on how to apply, interpret, and present statistical analyses in scientific publications and in forensic practice. The statistical concepts are mostly presented in the context of particular research questions in forensic anthropology. The complexity of the statistical approaches presented varies from basic descriptive statistics to advanced Bayesian frameworks, so it is our hope that each reader can learn something new.

The book is divided into seven chapters, each with one or more contributions. Chapter 1 includes four contributions, starting with an overview of the statistical questions forensic anthropologists may face in their cases. The other three contributions discuss the fundamental aspects of research underlying the forensic examination and evaluation of anthropological information, including study design and sampling; the importance of reference data, such as identified skeletal collections or virtual imaging collections; and assessment of sources of error.

Chapter 2 includes three contributions on method selection, covering the types of data, probability distributions, and statistical modeling that forensic anthropologists encounter in their work, as well as advanced techniques such as data mining and decision trees. The two contributions of Chapter 3 reflect on the frequentist and Bayesian approaches to anthropological data analysis and interpretation, respectively. Chapter 4 is the longest, with six contributions concerning the four components of the biological profile: sex, age, ancestry, and stature. The contributions discuss how to obtain and interpret population data to estimate the variables of the biological profile in forensic cases and how to present the (statistical) outcomes of anthropological analyses. In Chapter 5, advanced methodological and software solutions for variables of the biological profile are introduced in three contributions. Chapter 6 consists of two contributions on how to combine, evaluate, and report anthropological evidence in cases of personal identification by using logic and Bayesian inference, and one contribution on the visual identification of persons in images, illustrating the use of verbal scales in a field where the statistical and probabilistic framework is still in development. Chapter 7 consists of four short contributions describing common statistical software: SPSS, Stata, SAS, and R. Most of us cannot imagine how we would apply statistics without one of these software packages. The contributions discuss the possibilities (and the limitations) of each of these programs, including short step-by-step guidelines on how to deal with common statistical issues in forensic anthropology.

The contributors to this book are mostly statisticians and forensic anthropologists/forensic pathologists working together, but some are “2-in-1”s, forensic anthropologists/pathologists with an exceptional understanding of statistics. If you are reading this book, you are likely involved in some way in forensic anthropology or a related forensic discipline. We hope that after reading this book you will also become more involved or interested in statistics. Some of you may even be encouraged to become “2-in-1”s. In any case, we would like to hear from you: [email protected] if you have any comments, questions, or suggestions regarding the contents of this book.

CHAPTER 1.1

What “statistical” questions can we expect from judges? An introductory note from a European adversarial system

Cristina Cattaneo
Professor, LABANOF, Sezione di Medicina Legale, Dipartimento di Scienze Biomediche per la Salute, Università degli Studi di Milano, Milano, Italy

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00001-0 © 2020 Elsevier Inc. All rights reserved.

The application of statistics in forensic science is still far from perfect. Judges in European adversarial systems, however, are increasingly likely to shelter, in the reasoning of their sentences, behind the security of a statistical operation. This is quite understandable, and it has pushed forensic scientists toward an “obligation” to attach numbers to every answer, or at least many feel this pressure. For anthropologists and forensic pathologists, the question of using statistical “resources” usually comes up in relation to positive identification or to completing the biological profile with information such as sex, age, and ancestry; hence the questions posed at a hearing can be distorted and complex. This brief note is merely intended to share with the reader, as an introductory message, the legal situations and the questions of judges and prosecutors, insightful or not, that commonly arise and for which a “statistical” response is requested, even though one sometimes does not exist (or at least does not exist yet).

Every forensic scientist is well aware of the dispute, among experts in identification for example, between those who want to “quantify” every type of response regardless of the method and those who do not. Many odontologists declare that morphological uniqueness is “intuitive” and evident and cannot or should not be quantified, which may seem tautological and self-referential, but this attitude has been accepted for several decades. In fact, one wonders why the three traditional primary identifiers of Interpol and the DVI system include fingerprinting, which identifies through numbers of minutiae; genetics, whose intrinsic characteristic is the expression of identity through statistically significant allele frequencies; and odontology, which at the moment has no way of expressing its results with reliable algorithms that quantify the possibility of error. In many ways, these issues remain unsolved.

The current emergency related to the identification of dead migrants across the world, for example, is forcing us to face this issue, because the people who need to be identified are very difficult to identify with the so-called primary identifiers. For people from poor countries who die undocumented and who left home years ago, it is almost impossible to obtain antemortem clinical or dental data, and often even fingerprints. Frequently the relatives who come forward to provide biological material for a genetic comparison are brothers or half-brothers at best. In addition, the absence of allele frequencies for these populations often makes it difficult to perform “proper” statistics. So we find ourselves in a situation where identification with genetics is not possible and it may be necessary to rely on morphological or anthropological methods (for instance personal descriptors or the shape of the face) that may not provide a “known” error.

The same issues arise when we need to identify faces from video surveillance systems. In these cases, a face visible in a video or photo must be compared with that of a suspect. There are many ways of proceeding, through mere comparison or superimposition, but here too judges ask more and more frequently whether we can quantify the probability that the face belongs to another person and not to the suspect. And here too, statistics that do not exist seem crucial. How much is enough to identify? And how can one translate the risk of making a mistake into a statistical expression?
The more there is at stake in a forensic scenario, the more pressing these questions become. In 2014 a trial widely covered by the media took place in Milan concerning the disappearance of the girlfriend of a member of the mafia. After 3 years of hearings and depositions of witnesses reporting that she had been killed and dissolved in acid, it was discovered through the testimony of a “pentito” (a so-called “repentant” who collaborates with justice) that she had been strangled and her body burnt, chopped, and thrown into a manhole. When the manhole was opened, the forensic anthropologist and archaeologist did in fact find 1500 g of bone fragments with maximum dimensions of 2 cm. Several laboratories, law enforcement, and university experts attempted identification through DNA, but given the severe state of calcination it was not possible to extract a reliable genetic profile. Since we had found residues and fragments of dental implants among the burnt remains of the cranium, we asked the investigators whether any antemortem dental data were available; fortunately the woman had been treated a few years earlier. We compared the antemortem and postmortem data and arrived at an identification from this comparison.

A year later, we were called to the hearing for a cross-examination. We explained all that had been done for the recovery, the documentation of the “cremation,” and finally the identification. The judge dwelt on the identification issue, stressing that, given the “importance” of the case, it was crucial for her to know the uniqueness of those dental elements and their frequency within the population. In short, she was asking us exactly what chance there was of another person sharing the same dental setup. But there was no “quantifiable” answer. The case ended in a conviction, and the jury was convinced that the woman had been identified. However, the scientific problem remains.

Many other examples exist, such as cases where we need to provide statistics on the probability that a juvenile has reached adult age and is therefore imputable: how do we translate dental and skeletal growth into a satisfactory statistical answer? These are only some examples of the expectations judges or prosecutors may have with respect to anthropological and medicolegal cases, and they show how statistics seems to be inevitably more and more fundamental, or at least an important issue. This is why now is a crucial time to decide where we stand on the application of statistics and probability to several anthropological issues, and when and how it is necessary. And even if not all questions will be solved by the illustrious scientists of the following chapters, the method and the type of logic we need to deal with nowadays in science and in court will have been made evident.

CHAPTER 1.2

Study design and sampling

Zuzana Obertová (a,b) and Alistair Stewart (c)

(a) Forensic Anthropologist, Visual Identification of Persons, Zürich Forensic Science Institute, Zürich, Switzerland
(b) Centre for Forensic Anthropology, School of Social Sciences, The University of Western Australia, Australia
(c) Retired, School of Population Health, The University of Auckland, Auckland, New Zealand

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00007-1 © 2020 Elsevier Inc. All rights reserved.

Introduction

In textbooks on medical statistics and epidemiology, chapters on study design and sampling usually hold a prime position. This is natural: as a patient, for example, you want the study that tested a medication for its curative effect and the probability of side effects to have been designed and performed in a scientifically sound manner. As forensic anthropologists and pathologists, we may wish the same for our work, as may others, such as the families who lost a loved one, lawyers, and judges. When testing a drug or a forensic hypothesis, a proper study design is the first and, one might say, the most important step in the search for answers.

Forensic anthropologists usually design studies that are meant to provide population data on some characteristic or to clarify some aspect of a forensic case, for example, regarding trauma or pathology. The study design can include an experiment (e.g., to clarify what happens to a body part subjected to burning under certain conditions), an observational study based on a sample assembled through data collection (e.g., measuring the length of the femur in 300 individuals from an identified skeletal collection to estimate stature), or information extracted from existing sources (e.g., searching a database to establish the frequency of an impacted lower left canine in Greek females).

The first step in any study should be to pose the research question or hypothesis we would like to answer or test. The research hypothesis, also called the alternative hypothesis, usually includes some kind of comparison and therefore states that there is a difference (between populations) in the feature or measure in question. In contrast, the null hypothesis includes a statement of “no difference.” After posing the research and null hypotheses, a sample or samples need to be identified, which will be the basis of hypothesis testing and reflect the target population (the population of which the sample is representative). The population parameters will then be estimated from the sample characteristics.

A sample can consist of different types of data, which in forensic anthropology are mostly either qualitative, including categorical (or nominal) data (e.g., sex and ancestry) and ordinal data (e.g., size of the processus mastoideus not in terms of metrics but ordered from smallest to largest), or quantitative, including discrete (e.g., number of fractures) and continuous variables (e.g., stature or age). Quantitative variables are commonly categorized (e.g., continuous age categorized into age groups). Notably, categorization results in a loss of information on data variation, as all values in the same category are treated the same. Categorization should therefore be performed only after giving careful thought to the research question; to anthropologically, medically, or demographically relevant cutoffs; and to the data distribution, such as the presence of a multimodal distribution (a probability distribution with more than one peak), for example, in age-related growth curves.

As a consequence of the Daubert ruling, studies on the reliability and repeatability of findings (including testing intra- and interexaminer differences for a specific method) and studies on external validity (the extent to which the results [or methodology] can be generalized to other events/samples) or internal validity (the degree of confidence that the causal relationship studied is plausible and not influenced by other factors) have gained in importance in the forensic sciences. Notably, a reliable method is not always valid: a method may be reproducible without measuring what it should. A valid measurement is usually also reliable: if a method produces accurate results, it should also be reproducible.
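The difference between a reliable method and a valid one can be illustrated with a small simulation. The sketch below is not taken from the chapter: the true femur length, the 8 mm systematic error, the noise level, and the function names are all hypothetical values chosen for illustration. It simulates repeated measurements from a method that is highly repeatable but biased and from a method that is both repeatable and unbiased.

```python
import random
import statistics

TRUE_LENGTH_MM = 450.0  # hypothetical true femur length (assumption)


def simulate_measurements(true_value, bias, noise_sd, n, seed):
    """Return n simulated repeated measurements of a single dimension."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


def summarize(values, true_value):
    """Mean error reflects validity; standard deviation reflects repeatability."""
    return {
        "mean_error": statistics.mean(values) - true_value,
        "sd": statistics.stdev(values),
    }


# Method A: very repeatable, but systematically 8 mm too long (invented bias).
method_a = simulate_measurements(TRUE_LENGTH_MM, bias=8.0, noise_sd=0.5, n=200, seed=1)
# Method B: equally repeatable and without systematic error.
method_b = simulate_measurements(TRUE_LENGTH_MM, bias=0.0, noise_sd=0.5, n=200, seed=2)

summary_a = summarize(method_a, TRUE_LENGTH_MM)
summary_b = summarize(method_b, TRUE_LENGTH_MM)

print("Method A (reliable, not valid):", summary_a)
print("Method B (reliable and valid): ", summary_b)
```

In this setup both methods show a similarly small standard deviation, so repeatability alone cannot reveal the systematic error of Method A; only a comparison against a known reference value does.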
The validity of a study is largely dependent on the appropriateness of its design, including collection of sample data in an unbiased manner, and accurate and objective observations/measurements.

Study design

Study designs can generally be classified as observational or experimental. In observational studies researchers observe events/features as they occur (or are listed in a database), classifying their levels for an outcome of interest and one or more predictors. In experimental studies the researcher sets one or more predictors to a specific level and observes how the outcome of interest changes with a change in a given factor. Randomization of the groups is key for the experimental design. Therefore experimental studies allow stronger causal inference than observational studies, particularly since in the latter the researcher may not detect variables that are essential for data interpretation. Since each study design is associated with a given form of sampling, the study design also determines the type of analysis and the outcomes. The types of observational studies are summarized in Table 1.

Table 1 Types of observational studies.

Type of study: Prospective
Definition: Sampling of subjects is based on the presence (or absence) of a predictor identified at the onset of the study; these subjects are then followed for the occurrence or nonoccurrence of the outcome of interest. If the outcome is rare, a large sample size is needed for a study of sufficient power.
Simple example: Children who play sports are followed for a specific period of time, and then information is sought on whether a fracture occurred or not.

Type of study: Retrospective
Definition: Sampling is based on the outcome of interest (presence compared with absence) identified at the onset of the study; information on predictors is then obtained.
Simple example: Children with fracture (and without fracture) in an emergency department of a hospital, with information then sought on whether they play sports.

Type of study: Longitudinal
Definition: Sampling data on subjects over a specified time period. Such studies are often time-consuming, and subjects may be lost to follow-up.
Simple example: Body length/height of children measured every year between 0 and 18 years.

Type of study: Cross-sectional
Definition: Sampling data on subjects at a fixed time (randomly); subjects are then grouped based on outcomes, and predictors are assessed. Usually retrospective, often based on databases of relevant data.
Simple example: Body length/height of children aged 6–18 years (e.g., in a school).

Type of study: Cohort
Definition: Sampling data on subjects based on a predictor (i.e., cohort), who are followed to obtain the outcome. Usually prospective, followed over a specific period of time.
Simple example: As in prospective study.

Type of study: Case–control
Definition: Sampling all cases (subjects with the outcome) that meet fixed criteria, then controls (subjects without the outcome); cases and controls are compared with respect to one or more predictors. Usually retrospective. An alternative to a prospective study when cases are rare. Sampling is conditional on the outcome of cases.
Simple example: All children with fracture as cases, children without fracture as controls (there can be more than one control for a case).

To better evaluate the outcome, researchers often feel that they need to employ a complex study design to “cover all the bases.” However, complex study designs may be expensive, and difficult to translate into reality (e.g., difficulties in acquiring a sufficient sample size or specimens with specific characteristics). So, it is advisable to use the simplest (but not simpler than necessary) study design available. Researchers should define the study objectives, outcome(s), and predictors early, to avoid adding questions as the study progresses. Regardless of the complexity of the study, keeping detailed notes of how the researcher proceeded is essential. In addition, if a form for data collection is designed (which is mostly the case), it needs to be clarified early who will fill in the form, and if possible the form should be pretested with suitable persons (persons who will be representative of those who will collect data in the study or who will fill in a questionnaire) to avoid incorrect entries due to misunderstandings. The forms should therefore be self-explanatory, indicating details such as the degree of accuracy and the units of entries.

A more extensive form of pretest is a pilot study, which includes all the steps of the actual study with a small sample size. Notably, a pilot study is not a substitute for a full study. The results of a pilot study should not be used for conclusive hypothesis testing or interpretations of the results. The role of the pilot study is to help assess whether the selected sampling method actually results in a sample representative of the population of interest and whether the selected study design is appropriate for the full study. If possible the results of the pilot study should be compared with results of similar (published) studies on the topic to identify potential problems with the study design. However, differences from other studies may arise from the fact that, due to the small sample size in pilot studies, it is often difficult to detect a difference when it actually exists (i.e., lack of statistical power resulting in type II error). Type II error means failing to reject the null hypothesis when it is false or saying that there is no effect/difference when actually there is one.
Power of a study (calculated as 1 − the probability of type II error) reflects the chance of rejecting a null hypothesis when it is false, that is, saying there is an effect when there is one (e.g., 90% power means there is a 90% chance of saying there is an effect if the effect actually exists). Type I error means rejecting the null hypothesis when it is true, or saying that there is an effect/a difference when actually there is none. Notably, for a fixed sample size, as the probability of type I error decreases, the probability of type II error increases and vice versa. Increasing the sample size may balance the errors, but if this is not possible, one may need to decide which error is “less important.”
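The trade-off between the two error types at a fixed sample size can be illustrated with a small sketch. It assumes a two-sided one-sample z-test with a known standard deviation, which is a simplification chosen for illustration and not a method prescribed in this chapter; the function name and the input values (effect of 5 units, standard deviation of 20, n = 100) are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect, sd, n, alpha):
    """Approximate power of a two-sided one-sample z-test.

    effect: true difference from the null value (hypothetical input)
    sd:     standard deviation of the measurement, assumed known
    n:      sample size
    alpha:  probability of type I error chosen by the researcher
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # True effect expressed in units of the standard error of the mean.
    shift = abs(effect) / (sd / n ** 0.5)
    # Probability of landing in either rejection region when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Same sample size and effect, progressively stricter type I error.
for alpha in (0.10, 0.05, 0.01):
    power = z_test_power(effect=5.0, sd=20.0, n=100, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={power:.3f}  type II error={1 - power:.3f}")
```

With the sample size held constant, each reduction in the type I error probability lowers the power and therefore raises the type II error probability.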

Sample size and power

Before conducting a study the sample size needs to be determined so we can detect meaningful effects without wasting resources. For studies with categorical outcomes, we need to specify the level of significance (probability of making type I error) (e.g., 0.05), power (e.g., 90%, in which case the probability of making type II error is 10%), estimate (possibly by looking at previous publications or based on the experience of the researcher) the proportion of Group 1 having outcome (in %), proportion of Group 2 having outcome (%), and also Group 1/Group 2 sample size ratio (especially when other than 1:1). Alternatively, for studies with continuous outcomes, the mean and standard deviation of the outcome in Group 1 and Group 2 need to be entered.
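The inputs listed above can be turned into a sample size with a normal-approximation formula. The sketch below is one common textbook approximation (unpooled variances, no continuity correction), not necessarily the formula used by any particular software package, so results may differ slightly from those of dedicated tools. The function names and the example values are hypothetical; `ratio` is taken to mean the Group 1/Group 2 sample size ratio.

```python
from math import ceil
from statistics import NormalDist


def _z_sum_squared(alpha, power):
    """(z for two-sided significance + z for power), squared."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


def n_two_proportions(p1, p2, alpha=0.05, power=0.90, ratio=1.0):
    """Group sizes (n1, n2) for comparing two proportions.

    ratio is n1 / n2, i.e., the Group 1/Group 2 sample size ratio.
    """
    k = _z_sum_squared(alpha, power)
    n1 = k * (p1 * (1 - p1) + ratio * p2 * (1 - p2)) / (p1 - p2) ** 2
    return ceil(n1), ceil(n1 / ratio)


def n_two_means(mean1, mean2, sd1, sd2, alpha=0.05, power=0.90, ratio=1.0):
    """Group sizes (n1, n2) for comparing two means (continuous outcome)."""
    k = _z_sum_squared(alpha, power)
    n1 = k * (sd1 ** 2 + ratio * sd2 ** 2) / (mean1 - mean2) ** 2
    return ceil(n1), ceil(n1 / ratio)


# Hypothetical inputs: outcome present in 30% of Group 1 and 50% of Group 2.
print(n_two_proportions(0.30, 0.50))
# Hypothetical continuous outcome: means 450 and 420, standard deviation 25 in both groups.
print(n_two_means(450, 420, 25, 25))
```

Both functions round up, since a fraction of a specimen cannot be sampled; smaller expected differences or larger variances drive the required numbers up quickly.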

In forensic anthropology, we often have the situation where we have a “predefined” sample size (e.g., there are only a certain number of male and female skeletons in an identified collection), so instead of calculating the sample size, we may want to know the power of our study with the given sample size. In that case, we will again need to specify the level of significance (e.g., 0.05), estimate the proportion of Group 1 having outcome (in %) and proportion of Group 2 having outcome (%), and give the number of individuals/specimens in Group 1 and Group 2. When the type I error is fixed by the researcher, type II error or power of the study is not fixed and depends on sample size (the larger the sample, the greater the power), magnitude of the effect we would like to detect (a small effect is more difficult to detect, so a large sample size may be needed to detect a small but meaningful effect), and sample variance (when there is large variance, a larger sample size is needed). Other aspects, such as the choice of statistical test, are also important. Often some kind of trade-off is required to balance power, effect size (which should be meaningful within the research question), and an achievable sample size. Although some authors state specific numbers for minimum sample sizes needed for a given statistical analysis, a fixed number is usually misleading. For example, according to Long (1997), for a logistic regression, 100 is the recommended minimum sample size (with at least 10 observations per predictor). In addition, a larger sample size may be needed with a skewed outcome variable (e.g., with few 1’s and many 0’s) and with categorical predictors to avoid computational problems caused by empty cells or when multicollinearity is present. So already here, it is clear that there are many exceptions to the minimum number of 100 and the actual sample size needed for conclusive results depends on the particular study.
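When the group sizes are predefined, the same ingredients can be rearranged to give the power. The following is a minimal sketch using a normal approximation for the difference between two proportions; the function name, the collection sizes (60 and 40 specimens), and the assumed proportions are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    p1, p2: assumed proportions with the outcome in Group 1 and Group 2
    n1, n2: number of individuals/specimens available in each group
    """
    z = NormalDist()
    # Standard error of the difference between the two sample proportions.
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(p1 - p2) / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical identified collection with 60 and 40 specimens per group.
print(round(power_two_proportions(0.30, 0.50, 60, 40), 3))
# The same effect with ten times as many specimens.
print(round(power_two_proportions(0.30, 0.50, 600, 400), 3))
```

Comparing the two calls shows the dependence on sample size described above: the assumed effect is identical, but the larger collection gives a far higher chance of detecting it.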

Sampling

Frequentist analysis is based on the notion that population characteristics are unknown, but we can gather some information regarding the population based on a sample of data. We assume that this sample is a random selection from the population of interest (otherwise, we would not be able to make the inference from sample to population). Random sampling can be assumed (the emphasis here is on “assumed” since this is an approximation of random sampling), for example, if we select all individuals who consecutively attended the emergency department of a hospital with a rib fracture within the past 2 years. There are also study designs, such as sample surveys, where random sampling is explicitly performed by the researcher by drawing a sample from known population lists (e.g., school enrolment lists or lists of patients from general practitioner practices). Usually, we also assume that each individual in the population has an equal chance of being selected into the sample, that is, we have a simple random sample. However, we may also be interested in random sampling with an unequal chance of selection, for instance when we would like one group to be overrepresented in our sample. Sample surveys can be designed with such nonsimple random sampling, for instance, cluster or stratified sampling.
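The difference between a simple random sample and a design with unequal selection chances can be sketched as follows. The population list, its 800/200 split, and the function names are hypothetical; the stratified draw shown here is only one way of overrepresenting a group.

```python
import random


def simple_random_sample(population, n, seed=0):
    """Every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, n_per_stratum, seed=0):
    """Draw a fixed number of units at random within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum[label]))
    return sample


# Hypothetical population list: 800 males and 200 females.
population = [(i, "male") for i in range(800)] + [(i, "female") for i in range(800, 1000)]

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, lambda unit: unit[1], {"male": 50, "female": 50})

print(sum(1 for _, sex in srs if sex == "female"), "females in the simple random sample")
print(sum(1 for _, sex in strat if sex == "female"), "females in the stratified sample")
```

In the stratified draw each female has a 50/200 chance of selection and each male a 50/800 chance, so females are deliberately overrepresented relative to their share of the population.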

Sampling is important since it forms the basis for the type of analysis that can be done and what conclusions can be made based on the data. Sampling regimes can be divided into unconditional and conditional. Unconditional sampling for discrete data, for example, means that the sample is selected at random from the population and distributed to groups (e.g., males and females with and without shovel-shaped incisors). Row and column totals represent the (marginal) distribution of the two variables, and the proportions of sex and dental nonmetric trait can be estimated. In conditional sampling the selection is guided by a particular feature, so the researcher chooses, for example, 100 males and 100 females with or without shovel-shaped incisors. In this case the population proportion (or prevalence) of males and females cannot be estimated (since it has been fixed by the researcher). No statistical software can recognize from the numbers in the contingency table how the sampling was done, so it cannot choose the correct analysis by itself. The researcher is the one to decide. Cohort studies are often based on unconditional sampling, while case–control studies are sampled conditionally on cases.

Stratified sampling can help simplify the study design since it controls for certain predictors, which may affect the outcome; for example, males and females or age groups can be sampled as separate strata. The advantage of stratified sampling is that it leads to better precision of estimates, is relatively simple to perform, and can control for confounding. The disadvantage is that it cannot account for many confounders simultaneously because this would normally result in small numbers in each stratum. For example, if we stratify by sex (male/female) and age (15–19, 20–29, 30–39, 40–49, 50–59, 60–69, and 70+ years, i.e., seven groups), we would already have 14 strata. An alternative to stratification within the study design is using a certain type of statistical analysis, for example, multiple regression.

Cluster sampling strategy is based on clusters, which are groups derived from families, schools, or hospitals. When performing statistical analysis with clusters, confidence intervals (CIs) will be wider and P-values greater compared with simple random samples, since the analysis adjusts for the effectively smaller sample size in the clusters. However, cluster sampling can be equivalent to simple random sampling if intracluster correlation is minimal (the individuals within the cluster are as diverse as they would be in the population). In contrast, if the intracluster correlation is close to 1 (so individuals in the cluster are very similar), then the estimates will have wide CIs.

In forensic practice, it is often not possible to work with random samples, so convenience samples are used. The methodological robustness of such samples and their appropriateness of use in forensic cases are a subject of debate. However, as Evett and Weir (1998, 45) commented “… the scientist must also convince a court of the reasonableness of his or her inference within the circumstances as they are presented as evidence.”
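The effect of intracluster correlation on the “effectively smaller sample size” can be made concrete with the design effect, a standard survey-sampling approximation that is not given in this chapter and is introduced here as an assumption: for clusters of equal size m and intracluster correlation ICC, the design effect is 1 + (m − 1) × ICC, and the effective sample size is the total sample size divided by it. The function names and the example of 10 clusters of 20 individuals are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical study: 200 individuals sampled as 10 clusters of 20.
for icc in (0.0, 0.05, 0.5, 1.0):
    n_eff = effective_sample_size(200, 20, icc)
    print(f"ICC={icc:.2f}  effective sample size={n_eff:.1f}")
```

At an intracluster correlation of 0 the clustered sample behaves like a simple random sample of the same size, whereas at a correlation of 1 each cluster contributes no more information than a single individual.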

Measurement error/bias

In general, random sampling is important for reducing or eliminating bias within the study. Measurement error has a random and a systematic component. Random error cannot be attributed to a specific cause and is represented by unexplainable fluctuation in the data. Systematic error (or bias) has a direction and magnitude and is not the consequence of chance alone. To avoid at least the conscious bias (the unconscious may remain regardless), researchers should try to free their minds from interpreting data based on their desired outcomes or interpreting data in a way that they fit a certain theory. However, merely describing the data would not do. (Correct) interpretation is necessary, and in this the researchers’ experience and knowledge play a major role, helping them differentiate between association and causation or recognize relevant patterns in data.

Parameter estimates based on a simple random sample are known to be unbiased. As the sample size increases, the estimates get more precise. Unbiased estimates can be achieved when there are no confounding (mixing) effects, no measurement error (misclassification) or selection bias. In some instances, statistical modeling may help deal with some of these aspects (especially confounding) to still arrive at an unbiased estimate. In frequentist analysis the role of statistics in forensic anthropology (or any other discipline) is to find sample statistics that are appropriate estimators of the population parameters. However, the sample estimates usually differ from the population parameters. Therefore, by using point estimates only, we cannot describe the potential errors in the estimates. If we were to repeatedly sample data from a population, the estimate of standard error (standard deviation as a measure of variation within the sample divided by the square root of the number of observations) would reflect how the sample estimate would be expected to vary purely by chance. Confidence intervals express the precision of an estimate and simultaneously assess the degree of sampling variability (or sampling error), which is associated with an estimate if all possible samples of a given size would be drawn from the population.
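The standard error of a mean and the repeated-sampling interpretation of a confidence interval can be sketched with a small simulation. The sketch assumes normally distributed data and uses a normal quantile for the interval (for small samples a t quantile would be more appropriate); the population values (mean 170, standard deviation 7), the sample size of 50, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return the sample mean, its standard error, and a normal-based CI."""
    n = len(values)
    se = stdev(values) / n ** 0.5  # standard deviation / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    m = mean(values)
    return m, se, (m - z * se, m + z * se)


def coverage(true_mean, true_sd, n, repeats, seed=1):
    """Share of repeated samples whose CI contains the population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        _, _, (low, high) = mean_ci(sample)
        hits += low <= true_mean <= high
    return hits / repeats


# Hypothetical population: mean 170, standard deviation 7; samples of 50.
print(coverage(true_mean=170.0, true_sd=7.0, n=50, repeats=2000))
```

The printed proportion should lie close to the nominal 95%, which is the sense in which a 95% CI “works”: it is a statement about the procedure over many samples, not about any single interval.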
If all these samples were used to create CIs, then the value of the population parameter would become known. Commonly, 95% CIs are reported, approximately equal to the estimate ± 2 standard errors (more precisely, 1.96), which means that 95% of such CIs would include the population parameter and the remaining 5% would not. Notably, we assume that the variation reflected in the observed CIs is attributable to chance only, not to systematic bias (which would basically invalidate the results).

By definition, confounding variables are associated with both the outcome and the main predictor but are not intermediate in the causal chain between the outcome and the main predictor. In sex estimation studies the age distribution of the sample is often of great importance. Age can be a confounding factor, since some features that are considered sex-specific are also related to age, for example, stronger muscle attachment in older females compared with younger males. So if our sample consisted of younger males and older females, we would not be able to determine what is the effect of sex and what is the effect of age in the development of muscle attachment features. In observational studies (which are the more common ones in forensic anthropology), we have no way of knowing whether an observed effect is confounded (as opposed to an experiment where we set the rules). What we can do is to figure out, based on experience or previous studies, which variables can be confounding and control for them by including them in our analysis, for example, in regression models. The effect on the outcome and our conclusions when confounding is present can be either positive or negative.

As well as by using certain types of statistical modeling (e.g., multiple regression analysis), confounding can be controlled by different approaches within the study design. Randomization is one, but this may not always be possible as it may need too big a sample size. Restriction is another, for example where we only sample young adults for our sex estimation study, but that would limit the external validity and thus limit the use of our method to a specific age group. Matching is a third. In matching, we decide on confounding variables in the beginning of our study and pair each individual/specimen from Group 1 with one from Group 2 with the same values of the confounding variable(s), so in our sex estimation example a male of a given age is paired with a female of the same age (it may not be feasible to match exactly by point age, so we may define ranges, e.g., ±2 years). We can do the same with ethnicity or other variables, in our example on features of muscle development, for example, occupation or activity level (if such information is available, of course). In reality, it may be difficult to find an exact pair, so frequency matching can be used. This means that the groups will be “balanced” with respect to the matching confounding variables, so the proportion of males of a given age (range) will be similar to the proportion of females of the same age. Notably, matching has a similar effect as randomization. One disadvantage of matching is that we cannot measure the association between the matching variable and the outcome.
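Pair matching on age within a range of ±2 years can be sketched as below. This is a simple greedy procedure chosen for illustration; it is not claimed to be the procedure the authors have in mind, and it does not guarantee the largest possible number of pairs. The record layout (identifier, age), the function name, and the example individuals are hypothetical.

```python
def match_pairs(group1, group2, caliper=2):
    """Greedy 1:1 matching on age within +/- caliper years.

    Each record is a tuple (identifier, age). Individuals in group1
    without a suitable partner in group2 remain unmatched.
    """
    available = sorted(group2, key=lambda record: record[1])
    pairs = []
    for ident, age in sorted(group1, key=lambda record: record[1]):
        candidates = [r for r in available if abs(r[1] - age) <= caliper]
        if not candidates:
            continue  # no partner within the allowed age range
        best = min(candidates, key=lambda r: abs(r[1] - age))
        available.remove(best)  # each partner is used only once
        pairs.append(((ident, age), best))
    return pairs


# Hypothetical specimens: identifier and age at death in years.
males = [("M1", 23), ("M2", 35), ("M3", 61)]
females = [("F1", 24), ("F2", 36), ("F3", 44), ("F4", 80)]

for male, female in match_pairs(males, females):
    print(male, "matched with", female)
```

In this example the oldest male has no female within two years of his age and is left out, which illustrates why exact or narrow-range pairing can be hard to achieve in practice and why frequency matching is sometimes preferred.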

Concluding remarks

In summary, when designing a study, the researcher should first formulate the research hypothesis, which usually talks about a difference (e.g., there is a difference in maximum femur length between males and females) and then the null hypothesis, which will be the opposite (stating no difference). In the next step a sufficient sample size should be established, along with the probabilities of type I and type II error. These decisions need to be made in association with the selection of the appropriate test statistic (outcome) and predictors. In this regard a sampling strategy and the rules of data inclusion and exclusion should also be considered (and noted, among other things, in the methods section of a publication). Only after all these considerations have been made should the researcher start collecting data and proceed with the study. Notably, giving thought to study design and following the aforementioned steps can save time, money, and effort.

Many of the studies that forensic anthropologists and pathologists undertake are meant to be published. In general the study design should be described clearly, accurately, and completely in the methods section of a scientific paper, so that the readers have the opportunity to repeat, assess, or compare with other studies. Key information in the methods section should include setting and location of the study (country, institute, skeletal collection, etc.), a short description of the participants or specimens (among others, living or deceased, anatomical region or skeletal element, males or females or both, age distribution, and ethnic origin), ethical issues (particularly informed consent or other types of consents and adherence to ethical guidelines), study design (including planned sample size), outcomes (definition of primary and any further outcomes, if applicable), predictors, and all statistical methods. There are a number of guidelines that provide recommendations regarding reporting of the study design and methodology for different types of studies in scientific publications, such as STROBE for observational studies in epidemiology (which can also be used for observational studies in other disciplines; Vandenbroucke et al., 2007; von Elm et al., 2007), STARD (for diagnostic accuracy studies; Bossuyt et al., 2015), TRIPOD (for reporting of multivariable prediction models; Collins et al., 2015; Moons et al., 2015), or PRISMA (for systematic reviews; Moher et al., 2009; Shamseer et al., 2015).

References

Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L., Lijmer, J.G., Moher, D., Rennie, D., de Vet, H.C.W., Kressel, H.Y., Rifai, N., Golub, R.M., Altman, D.G., Hooft, L., Korevaar, D.A., Cohen, J.F., the STARD Group, 2015. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527. https://doi.org/10.1136/bmj.h5527.
Collins, G.S., Reitsma, J.B., Altman, D.G., Moons, K.G., 2015. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. J. Clin. Epidemiol. 68, 134–143.
Evett, I.W., Weir, B.S., 1998. Interpreting DNA Evidence. Sinauer Associates, Sunderland, MA.
Long, J.S., 1997. Regression with Categorical and Limited Dependent Variables. Sage, Thousand Oaks, CA.
Moher, D., Liberati, A., Tetzlaff, J., Altman, D.G., the PRISMA Group, 2009. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. J. Clin. Epidemiol. 62, 1006–1012.
Moons, K.G., Altman, D.G., Reitsma, J.B., Ioannidis, J.P., Macaskill, P., Steyerberg, E.W., Vickers, A.J., Ransohoff, D.F., Collins, G.S., 2015. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration. Ann. Intern. Med. 162, W1–W73.
Shamseer, L., Moher, D., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., Shekelle, P., Stewart, L.A., the PRISMA-P Group, 2015. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation. BMJ 349, g7647. https://doi.org/10.1136/bmj.g7647.
Vandenbroucke, J.P., von Elm, E., Altman, D.G., Gotzsche, P.C., Mulrow, C.D., Pocock, S.J., Poole, C., Schlesselman, J.J., Egger, M., 2007. Strengthening the reporting of observational studies in epidemiology (STROBE): explanation and elaboration. Epidemiology 18, 805–835.
von Elm, E., Altman, D.G., Egger, M., Pocock, S.J., Gotzsche, P.C., Vandenbroucke, J.P., 2007. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: Guidelines for reporting observational studies. Lancet 370, 1453–1457.

Recommended reading

Machin, D., Campbell, M.J., Walters, S.J., 2007. Medical Statistics, fourth ed. John Wiley & Sons, Ltd., Chichester.
Madrigal, L., 2012. Statistics for Anthropology, second ed. Cambridge University Press, Cambridge.
Milner Jr., D.A., Meserve, E.E.K., Soong, T.R., Mata, D.A., 2017. Statistics for Pathologists. Springer, New York.
van Belle, G., Fisher, L.D., Heagerty, P.J., Lumley, T., 2004. Biostatistics. A Methodology for the Health Sciences, second ed. John Wiley & Sons, Inc., Hoboken, NJ.

CHAPTER 1.3

Physical and virtual sources of biological data in forensic anthropology: Considerations relative to practitioner and/or judicial requirements

Daniel Franklin (a) and Soren Blau (b)

(a) Centre for Forensic Anthropology, The University of Western Australia, Crawley, WA, Australia
(b) Victorian Institute of Forensic Medicine/Department of Forensic Medicine, Monash University, Melbourne, VIC, Australia

Introduction

Forensic anthropology can be defined as an applied subfield of physical anthropology, the latter of which involves the analysis of morphological variation in the human skeleton. The expertise of the forensic anthropologist is based on the ability to quantify skeletal morphologies relative to a “normal state” as defined according to biological criteria, for example, sex-specific functioning and age-related metamorphoses, either during development or in senescence (Cattaneo, 2007; Christensen et al., 2014; Blau and Ubelaker, 2016; Franklin and Flavel, 2019). An anthropological assessment of skeletal remains referred for forensic investigation typically involves the following requisite analyses: determining human versus nonhuman (animal) origin; providing estimations of sex, age, ancestry, and stature (the “biological profile”; see Chapter 4); estimating the time since death (the postmortem interval); and, at the request of the forensic pathologist, contributing to the determination of a possible cause and manner of death through interpretation of perimortem skeletal trauma and/or pathology (Christensen et al., 2014, 2015; Franklin and Marks, 2013). Consequently, it is readily apparent that the basic foundations of forensic anthropological knowledge and practice are largely based on biological interpretations using extant biological data with associated statistical modeling.

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00008-3 © 2020 Elsevier Inc. All rights reserved.

Forensic anthropological methods typically utilized in the construction of a biological profile can be based on qualitative morphoscopic (visual) observations relative to diagrammatic and/or written descriptions and/or on quantitative morphometric data (measurements) collected directly from the specimen of interest. The data derived from either of those broad approaches (morphological or metric) can, thereafter, be used to assign the required biological attribute (e.g., skeletal sex) based on outputs derived from predictive multivariate statistical models (e.g., discriminant function or logistic regression), either calculated manually or via specialized computer software such as FORDISC (Jantz and Ousley, 2005), CRANID (Wright, 1992), or ADBOU (Boldsen et al., 2002). While there has been some debate about whether the results of a metric assessment have more rigor than those based on a morphoscopic approach because they are less subjective (e.g., Winburn, 2018), there are two primary ways that both methods can inherently fail, either practically or judicially: (i) when they are nonrepresentative of the current modern population, for example, temporally and/or genetically removed (noncontemporary), and/or (ii) when they are misapplied by the practitioner, for example, through incorrect measurements or assessment, or when applied to an individual “foreign” to the original reference sample, either temporally, genetically, or geographically. Given that human populations are dynamic and not static, the first issue will likely occur eventually in any case with the passage of time. The second issue can be simply a matter of training, but more often than not, there are no specific predictive models to suit the population within the jurisdiction of interest.
In other words, methods representing predictive models do not (and frankly cannot) exist for all global populations, and some assumptions regarding approximations, perhaps even concessions, must be made in the application of data between divergent reference populations. For the practicing forensic specialist tasked with an anthropological assessment, the initial choice of method is often inherent to the remains being analyzed (e.g., the age of the individual, whether juvenile or adult, and the degree of preservation and/or completeness) (Blau, 2018), in conjunction with other considerations that may include ease of application, reliability, and the stated accuracy. Such considerations, however, while undoubtedly important, should go deeper to further critically evaluate the source (skeletal collection) of the biological data from which any method is derived. While a published method may statistically demonstrate that the data analyzed to formulate the predictive models are both reliable and accurate, the latter is inherently optimized to the sample studied, that is, it is “population specific” (Cattaneo et al., 2018). Thus, the sample used to derive the model itself is of practical and increasingly judicial interest (Kimmerle and Jantz, 2008).

It is evident in the forensic literature that there is an increasing understanding and demonstrated awareness of population variation in the expression of various skeletal attributes from which biological inferences are drawn. Thus, there are numerous studies that assess the reliability of established techniques on specific populations (e.g., skeletal age at death: Southeast Asian, Gocha et al., 2015; Chilean, Herrera and Retamal, 2017; Greek, Moraitis et al., 2014; French, Savall et al., 2016) or develop specific techniques for a specific population (e.g., estimation of sex: Australian, DeSilva et al., 2014; Brazilian, de Oliveira Gamba et al., 2016; Egyptian, El Morsi et al., 2017; Dutch, Colman et al., 2018). Importantly, there is an increase in the dissemination of population-specific models relative to the jurisdiction of their intended application (Ubelaker and DeGaglia, 2017; Franklin and Flavel, 2019).

The ability to acquire biological data that are contemporary and representative of modern populations is largely associated with a fundamental shift from the traditional reliance on the physical study of human remains to one that involves the acquisition of data from virtual three-dimensional volumetric modalities, such as digital x-rays, multidetector computed tomography (MDCT), and magnetic resonance imaging (MRI). While the continuing importance of documented collections of human skeletal remains (particularly contemporary collections) cannot be discounted, there now exists the opportunity to develop skeletal reference collections in a digital format that is readily updatable and easily accessible.

The aim of this chapter is to provide a succinct introduction to the physical and virtual sources of biological data utilized in forensic anthropological research. The current use of extant repositories of human skeletal remains, both physical and virtual, is discussed, and the inherent complexities and limitations of each that ultimately determine whether the derived data are judicially admissible and/or optimized for implementation into routine forensic practice are considered. In addition, some of the underlying considerations of various sources of biological data that form the foundations of professional practice are examined.

Biological data and the concept of “population specificity” Before considering reference skeletal collections per se, it is prudent to first briefly explore the concept of “population specificity.” The complexity in both defining and understanding this concept is interrelated to how a population is actually defined. In previous considerations by Franklin and Flavel (2019), it was noted that Harrison et al. (1977) refer to “population” as a conveniently neutral term that “…refers to an aggregate of people and does not specify or imply how the group in question is distinguished from others” (p. 196). The significance of this definition is amenable to then defining populations in “…geographical, political, linguistic, genetical, or any other terms that may be appropriate to the work in hand” (Harrison et al., 1977, p. 186). Relative to the forensic specialist, a population could be considered an “aggregate” of people representing morphological variations based on genetic similarities derived from shared selective and/or environmental influence(s) specific to a defined geographic location (Franklin and Flavel, 2019). In reality, however, this definition is an oversimplification because people are often transient and populations are dynamic. A reference skeletal collection and any predictive models derived from the same comprise a biased subset of the contemporary “aggregate,” and it thus has to be assumed that they do not (and cannot) truly represent all potential variances (e.g., due to individual variation, influences

CHAPTER 1.3 Statistics in forensic anthropology

from foreign nationals, and other likely unquantifiable factors). Furthermore, as Komar and Grivas note, “[a] frequently overlooked assumption relates to bias in how a collection was assembled. Depending on the sampling method not every individual in a population has an equal chance of becoming part of a collection” (2008: 225). This idea has been discussed previously in the literature, but among the most obvious biases are those that relate to social stigma, socioeconomic and educational background, age, sex (e.g., Chamberlain, 2000; Usher, 2002; Murphy, 2008; Hunt and Albanese, 2005; Wilson et al., 2007; Ubelaker, 2014), and “race.” Although beyond the scope of the present chapter, it is important to remain cognizant of the ongoing debate in the forensic anthropological discipline relative to historical ethnocentric biases and the concept of “race” that are embedded in “neocolonial” approaches to human variation (a small subset of a broad literature includes: Sauer, 1992; Brace, 1995; Smay and Armelagos, 2000; Albanese and Saunders, 2006; Ousley et al., 2009; Albanese and Cardoso, 2019; Stephan and Ross, 2018, 2019). Consequently, it is necessary to contemplate how population specificity is relevant, if at all, to the forensic anthropologist and/or practitioner. Returning to the definition of forensic methods (see the preceding text), that relevance stems from the fact that such methods are “optimized” for the reference sample. Accordingly, when statistical models are applied (unintentionally or otherwise) to individuals “foreign” to the reference sample, there is likely to be (although not always; see Albanese et al., 2016) an associated reduction in the accuracy of the biological attribute(s) being estimated. For example, studies that have assessed aging techniques are often confronted with the problem of age mimicry because all samples have an inherent age distribution (e.g., Boldsen et al., 2002). 
Such an effect is likely to be magnified with increasing “biological distance” from the reference source, although in reality any relationship is unlikely to be linear or easily predictable (Franklin and Flavel, 2019). In relation to the judiciary, the subsequent effect of how demographic biological data are measured and presented is more straightforward. For example, Kimmerle and Jantz (2008) suggest that both legal and scientific challenges to anthropological evidence based on demographic data used to estimate basic biological attributes (e.g., age, sex, and stature) are likely to be debated, especially in relation to the reliability and accuracy of applying methods across populations. Accordingly, it is important to contemplate whether the discipline of forensic anthropology can provide a practical solution to these challenges. Does there need to be a general consensus, or even a legal decision, as to how a “forensic population” is defined? In considering forensic practice in Australia, it is certainly not unreasonable to posit that models based on a North American population are less representative than the same biological data collected from a subset of the multicultural Australian population. However, little is known about intrapopulation effects. For example, what is the magnitude of biological variation within a country such as Australia, which has had indigenous occupation for over 65,000 years and in which 28.2% of the resident population (25 million) were born overseas (Phillips and Simon-Davies, 2016)? Are specific models required that consider “local forensic populations,” for example, Western relative to Eastern Australia?
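
The loss of accuracy that can follow when a model is applied outside its reference sample can be illustrated with a minimal numerical sketch. It assumes a single measurement that is normally distributed within each sex, a sectioning point placed midway between the reference-sample means, and an equal sex ratio. The means, standard deviation, and shifts are hypothetical values chosen for illustration and are not drawn from any collection or study cited here. Shifting both sexes by the same amount is itself a simplification, since real differences between populations are unlikely to be this regular.

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution evaluated at x."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


# Hypothetical reference-sample parameters (mm); not taken from any real collection.
REF_MALE_MEAN = 46.0
REF_FEMALE_MEAN = 40.0
SD = 2.5

# Sectioning point "optimized" for the reference sample: midpoint of the two means.
CUTOFF = (REF_MALE_MEAN + REF_FEMALE_MEAN) / 2.0


def expected_accuracy(shift_mm):
    """Expected proportion correctly classified when both sex means in the
    target population differ from the reference sample by shift_mm.
    Returns (males correct, females correct, overall), assuming equal sex ratio."""
    male_correct = 1.0 - normal_cdf(CUTOFF, REF_MALE_MEAN + shift_mm, SD)
    female_correct = normal_cdf(CUTOFF, REF_FEMALE_MEAN + shift_mm, SD)
    return male_correct, female_correct, (male_correct + female_correct) / 2.0


for shift in (0.0, -1.0, -2.0, -3.0):
    m, f, overall = expected_accuracy(shift)
    print(f"shift {shift:+.1f} mm: males {m:.3f}, females {f:.3f}, overall {overall:.3f}")
```

Under these assumptions, overall accuracy declines as the target population departs from the reference sample, and the errors become unevenly distributed between the sexes.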

Taking into account that populations are dynamic, further research may facilitate a deeper understanding of how to define populations relative to forensic profiling, while considering exactly how to meet evidentiary requirements of the judiciary in which those data are tendered. Such research requires access to biological data acquired from appropriate repositories of human skeletal collections, physical and/or virtual, with associated demographic information.

Skeletal collections in physical and forensic anthropology

While it has been argued that the skeleton is one of the least intimidating ways to engage with the dead (Davis, 2015), it is often forgotten that the skeletal remains that archeologists and forensic anthropologists excavate and recover were once corpses and these in turn were once human beings (Duday, 2009). In 1989, the World Archeological Congress (WAC) held its first intercongress on “Archaeological Ethics and the Treatment of the Dead,” which resulted in a code of ethics for archeologists and others who deal with skeletal remains, the so-called Vermillion Accord on Archeological Ethics and the Treatment of the Dead (Day, 1990). This code of ethics is based on the premise that human rights continue after death and that the dead impact on the living (Bradley, 2016). Consequently, there has been much discussion about the use of reference skeletal collections for teaching and research purposes (e.g., Blau, 2016; Passalacqua and Pilloud, 2018; Rosenswig, 1997; Tobias, 1991), most recently including not only some vigorous debate concerning ethical guidelines for the establishment and use of human skeletal collections (e.g., BABAO, 2019; Stephan and Ross, 2018, 2019; Albanese and Cardoso, 2019), but also consideration of the ethics of sharing online digital three-dimensional models of archeological human remains (Ulguim, 2018). Before moving further toward an understanding of the conceptual basis of skeletal collections and the biological data they represent, it is important to acknowledge that the manner in which collections are described, and in fact collected and managed, is continually evolving. Changes in relation to technological developments, end-user requirements, critical appraisals of current approaches, and an increasing awareness of acceptable practice relative to general operating procedures all contribute to how collections are used. 
The following sections thus explore the criteria used to determine the suitability of biological data derived from human skeletal collections as directly relevant to the contemporary forensic anthropologist, both practically and judicially.

Documented versus nondocumented

To be relevant to research and training in forensic anthropology, reference skeletal collections should ideally comprise documented (“identified”) individuals for whom pertinent biological data are known, including age at death, sex, ancestry, and (albeit less frequently) stature. Information about the individual’s occupation and health record is also particularly useful relative to incidences and skeletal morphological expressions of disease, illness, and trauma (Cardoso and Henderson, 2013; Mann, 2013). Notwithstanding ongoing debate about self-reported biological attributes (e.g., Usher, 2002; Gannett, 2014; Maijanen and Jeong, 2018), this biological information in totality is imperative for generating data on specific populations, which as noted by Cattaneo et al., “…permits comparisons across populations and a higher scientific precision for the identification process in the forensic context” (2018:219.e5). Documentation is thus imperative for methodological development and testing in forensic anthropology, which then flow into proper interpretations, practically and judicially (Ubelaker, 2014). Skeletal collections that have unknown or missing demographic information for the individuals they comprise are less likely to have direct applicability for the formulation of forensic methods (see the preceding text). For example, many museum collections originating prior to the 20th century represent skeletal remains collected during funded expeditions, or consolidations of private collections, and thus lack adequate provenience, with most material labeled according to broadly defined racial categories or geographical regions (Lambert and Walker, 2019). However, the value of nondocumented skeletal collections should not be discounted. As outlined by Lambert and Walker, “…most of the skeletal material in museums derives from the work of professional archeologists and is associated with at least some contextual information that allows the individual to be placed into a meaningful historical, geographical, environmental and cultural context” (2019:13). It is this contextual information that is of research value, especially in bioarcheology, for the reconstruction of the cultural ecology of past populations. 
Also, the value of nondocumented collections for the basic study of human skeletal anatomy and the testing or development of anthropological methods should not be overlooked, albeit on the basis of assumed (estimated) biological attributes (e.g., age and sex) and due awareness of the inherent limitations of such assumptions.

Contemporary versus noncontemporary

The contemporaneity of a skeletal collection is defined or measured according to the extent to which it accurately represents the present living population of the relevant jurisdiction. As premised earlier, populations are not static, and secular variations in the expression of biological attributes are well documented, for example, in cranial (Wescott and Jantz, 2005; Jellinghaus et al., 2018) and postcranial morphology (Jantz et al., 2016; Guyomarc’h et al., 2016). Accordingly, data derived from a temporally removed sample of a population (e.g., start of the 20th century) may not accurately represent the same population in the modern context (Guyomarc’h et al., 2016). The incompatibility of such data is a consequence of the complex interaction of multiple factors (both known and unknown) driving morphological variations in the relative expressions of biological attributes that are used in forensic profiling. Broadly speaking, the genetic composition of a population will vary according to the increasing propensity for cross-continental migration and associated admixture.

Representativeness

It can only ever be assumed that any subset of a population, such as the individuals comprising a reference skeletal collection, truly represents that broader population in totality. Sampling bias is a known issue: it is a specious assumption that all individuals from/within one population, however defined, have the same likelihood of becoming part of (physically), or being represented in (virtually), a skeletal collection. There are certainly “collection biases” that exist in documented repositories of human skeletal remains that are derived from bequest programs, such as social stigmas resulting in the under-representation of one sex relative to the other (e.g., Terry Collection) or socioeconomic factors relative to planned donations and those that enter the collection after their death at the wish of family members or medicolegal authority (Hunt and Albanese, 2005; Wilson et al., 2007; Komar and Grivas, 2008). The extent to which a documented skeletal collection represents a population may, at face value, be more of an issue with physical skeletal collections than with virtual collections, which are based on postmortem scanning and/or hospital picture archiving and communication system (PACS) databases (see the succeeding text). However, because the development of a postmortem computed tomography (PMCT) database requires technology, it is more than likely that a PMCT database “population” will be biased, as not all countries have access to CT scanning, whether in a hospital or forensic medical institutional setting. Further, it is important to remember that there is bias in the demographics of individuals who end up a “victim” in a forensic context. For example, between 1976 and 2015 in the United States, almost three quarters of homicides involved a male killing another male (Fox and Fridel, 2017).
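
The effect of unequal inclusion probabilities can be made concrete with a small worked sketch. It assumes a population divided into two strata of equal size that differ both in the mean of some skeletal measurement and in the chance of entering a collection. All shares, probabilities, and means below are hypothetical and serve only to show the direction of the bias.

```python
# Each stratum: (share of the living population, probability of entering the
# collection, mean of the measurement of interest). All values are hypothetical.
STRATA = [
    (0.5, 0.0020, 170.0),  # stratum more likely to be collected
    (0.5, 0.0005, 174.0),  # stratum less likely to be collected
]


def population_mean(strata):
    """Mean of the measurement in the population the collection is meant to represent."""
    return sum(share * mean for share, _, mean in strata)


def expected_collection_mean(strata):
    """Expected mean in the collection when strata enter with unequal probability."""
    weights = [share * p_in for share, p_in, _ in strata]
    weighted_sum = sum(w * mean for w, (_, _, mean) in zip(weights, strata))
    return weighted_sum / sum(weights)


pop = population_mean(STRATA)
coll = expected_collection_mean(STRATA)
print(f"population mean: {pop:.1f}")
print(f"expected collection mean: {coll:.1f}")
print(f"bias: {coll - pop:+.1f}")
```

With these illustrative values, four fifths of the collection is expected to come from the first stratum, so the collection mean (170.8) understates the population mean (172.0) even though both strata are equally common among the living.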

Documented human skeletal collections (physical)

Numerous documented human skeletal collections exist that are both teaching and research resources. These include collections of complete skeletons with documented demographic data (Table 1), specific individual skeletal elements (e.g., skulls: Italy, Guidotti et al., 1986; Hong Kong, King, 1997a; the Netherlands, Perizonius, 1984; and Denmark, Sejrsen et al., 2005; and femora: Belgium, Defrise-Gussenhoven and Orban-Segebarth, 1984; and Australia, Thomas and Clement, 2012), and osteological specimens representing many major bone diseases and trauma (Ruhli et al., 2003; Buikstra, 2019). Repositories in the United States such as the Hamann–Todd Collection housed at the Cleveland Museum of Natural History and the Terry Collection at the

Table 1 Global collections of complete skeletons with documented demographic data.

Country

Collection name

Number of documented complete skeletons

Collection date

Location

Reference

School of Medical Sciences of the National University of La Plata Faculty of Medicine, University of Buenos Aires School of Biomedical Sciences, University of Queensland

Salceda et al. (2012)

Argentina

Lambre Collection

>400

1936–2001

Argentina

Chacarita Collection

159

1987–2000

Australia

School of Biomedical Sciences Skeletal Collection Institute of Forensic Medicine Collection Schoten collection

c. 20

2014–present

Large numbers

20th century

Institute of Forensic Medicine, University of Vienna

Szilvassy and Kritscher (1990)

51

1837–1931

Orban et al. (2011), Orban and Vandoorne (2006)

Brazil

Museo do Departamento de Anatomia

492

Modern

Brazil

The 21st Century Collection of the Center for Studies in Forensic Anthropology

239

21st century

l’Institut royal des Sciences naturelles de Belgique Instituto de Ciencias Biomedicas da Universidade de Sao Paul Faculty of Dentistry of Pernambuco, University of Pernambuco

Austria

Belgium

Bosio et al. (2012)

Stephan et al. (2017)

de Francisco et al. (1990)

Cunha et al. (2018)

Brazil

Brazil

Brazil

Canada Canada Chile

China

Identified Skeletal Collection of Sergipanos Osteological and Tomographic Collection of FOP/UNICAMP

223

2009–15

University Tiradentes, Aracaju

Cunha et al. (2018)

320

20th century

Cunha et al. (2018)

Osteological Collection of the Institute of Teaching and Research in Forensic Sciences (IEPCF), Grant Collection

143

20th century

The Professor Eduardo Daruge Laboratory of Forensic and Physical Anthropology at Piracicaba Dental School, University of Campinas (FOP/UNICAMP), Piracicaba, São Paulo Institute of Teaching and Research in Forensic Sciences (IEPCF), Guarulhos, São Paulo

202

1928–50

St. Thomas Anglican Church Modern Collection of Santiago

80 1282

1950–70

Hong Kong Collection

100

Modern

Cunha et al. (2018)

Anatomy Department, University of Toronto Belleville, Ontario

Albanese (2018)

University of Chile in the Department of Anthropology of the Faculty of Social Sciences Prince Philip Dental Hospital, University of Hong Kong Medical School

Urzua et al. (2008)

Saunders et al. (1997)

King (1997b)

Country China

Colombia

Colombia

Finland

France

Germany Germany Greece

Collection name Institute of Forensic Sciences Human Skeletal Reference Collection of Modern Colombian Population Universidad de Antioquia’s human skeletal reference collection Natural History Museum skeletal material Brest Bone Collection Institute of Anatomy Tübingen Collection Athens Collection

Number of documented complete skeletons

Collection date

>205

Location

Reference

Modern

Ministry of Public Security, Beijing

Liu et al. (1988)

600

2004 and 2008

National Institute of Legal Medicine and Forensic Sciences, Bogota,

Sanabria-Medina et al. (2016)

>100

Modern

Monsalve Vargas and Isaza Pelaez (2014)

108

20th century

The Laboratory of Human Osteology and Forensic Anthropology at Universidad de Antioquia, Medellin Natural History Museum, University of Helsinki

450

1996–97

Baccino et al. (1999)

101

Modern

>108

1964–94

University Hospital (Centre Hospitalier Universitaire) Montpellier University of Technology, Aachen Tübingen

214

1996–2003

Department of Animal and Human Physiology, University of Athens

Eliopoulos et al. (2007)

Niinimäki (2011)

Prescher and Bohndorf (1993) Graw et al. (1999)

Hungary

University of Szeged Collection Hungarian Natural History Museum collection Banaras Hindu University

? Fetal collection

?

>10

?

>244

Modern

The CAL Milano Cemetery Skeletal Collection Sassari Collection

2127

1910–2001

606

1908–47

Italy

Certosa Collection

425

1898 and 1944

Italy

Frasetto Collection

433

First half of the 20th century

Italy

Institute of Legal Medicine

>80

Modern

Italy

Istituto di Anatomia

742

Modern

Hungary

India

Italy

Italy

Department of Forensic Medicine, Faculty of Medicine, Szeged Natural History Museum, Department of Anthropology, Budapest

Castellana and Kosa (2001)

Department of Anatomy, Institute of Medical Sciences, Banaras Hindu University, Varanasi LABANOF, Milan

Singh and Singh (1972), Singh et al. (1974)

Museum of Anthropology, University of Bologna Museum of Anthropology of Alma Mater Studiorum University of Bologna Museum of Anthropology, University of Bologna Institute of Legal Medicine, University of Bari Istituto di Anatomia, Siena

Belcastro et al. (2008)

Hershkovitz et al. (1999)

Cattaneo et al. (2018)

Belcastro et al. (2017)

Facchini et al. (2006)

Introna et al. (1998)

Brasili-Gualandi and Gualdi-Russo (1989)

Country

Collection name

Number of documented complete skeletons

Collection date

Italy

University of Torino

1064

Modern

Japan

Jikei Collection

c. 100

First half of the 20th century

Japan

Modern Japanese Osteological Collection Mérida, Yucatan

c. 300

Late 19th early 20th century

84

Modern

Mexico

Mexico

San Nicolas Tolentino Collection

102

19th and early 20th century

Mexico

Universidad Nacional Autónoma de Mexico Collection Utrecht Collection

172

Initiated 1994

124

1870–1960

Netherlands

Location

Reference

Department of Human Anatomy, University of Torino Jikei University School of Medicine’s Department of Anatomy The University Museum, University of Tokyo

Giraudi et al. (1984)

School of Anthropological Sciences, the Autonomous University of Yucatan Department of Physical Anthropology of the Escuela Nacional de Antropología e Historia, Mexico City Faculty of Medicine, Physical Anthropology Laboratory, National Autonomous University of Mexico ?

Iscan et al. (1994)

Case and Heilman (2005), Suwa (1981)

Chi-Keb et al. (2013)

Gonzalez et al. (2006)

Molgado et al. (2007)

Maat and Mastwijk (1995)

Philippines

Portugal

Portugal

Portugal

Romania

Manila North Cemetery collection Coimbra Identified Skeletal Collection The Lisbon Collection (also known as Luís Lopes Collection) The 21st Century Identified Skeletal Collection, Francis J. Ranier Collection

75

Died in the 21st century

Archeological Studies Program, University of the Philippines, Diliman University of Coimbra, Museum of Anthropology

Go et al. (2017)

505

1826–1922 to 1904–38

1800

1880–1975

The Bocage Museum (National Museum of Natural History), University of Lisbon

Cardoso (2006)

230

1995–2008

Department of Life Sciences, University of Coimbra

Ferreira et al. (2014), Rocha (1995)

6800

1900s

Anthropology Institute of the Romanian Academy, Bucharest Department of Anatomy at the Tygerberg Medical Campus, University of Stellenbosch, Cape Town Department of Anatomy at the University of Cape Town

Ion (2011)

Department of Anatomy, University of Pretoria

L’Abbe et al. (2005)

South Africa

Kirsten Collection

1161

1920 and 1950

South Africa

Cape Town Documented Skeletal Collection Pretoria Bone Collection

200

1980–96

290

From 1942 to the present

South Africa

Cunha and Wasterlain (2007)

Alblas et al. (2018)

Ubelaker (2014)

Country

Collection name

Number of documented complete skeletons

Collection date

Location

Reference Dayal et al. (2009)

South Africa

The Raymond Dart Collection

2605

From 1921 to the present

South Africa

UCT Human Skeletal Collection Cape Town

352

Modern

South Africa

M.R. Drennan Museum and Departmental Specimen Collection Universidad de Complutense Collection UAB Collection of Identified Human Skeletons Granada Osteological Collection of Identified Infants and Young Children

c. 250

Modern

University of Witwatersrand, School of Anatomical Sciences Division of Clinical Anatomy and Biological Anthropology in the Department of Human Biology, University of Cape Town University of Cape Town

200

1880–1976

Madrid

Trancho et al. (1997)

35

1977–91

Universitat Autònoma de Barcelona (UAB)

Rissech and Steadman (2011)

230 juveniles

Mid-20th century

Laboratory of Anthropology of the University of Granada

Alemán et al. (2012)

Spain

Spain

Spain

http://www.anatomybioanth.uct.ac.za/uct-human-skeletalcollection (accessed 24/04/19)

http://www.cca.uct.ac.za/cca/arc/visual-university/drennananatomy (accessed 23/04/19)

Switzerland

Switzerland

Thailand

Thailand

UK

UK

UK UK UK USA

USA

Spitalfriedhof St. Johann Known Age Collection I. Gemmerich Collection

220

1845–65

Bürgerspital Hospital, Basel

Griffin et al. (2009)

151

104

Department of Anthropology and Ecology, University of Geneva Faculty of Medicine, Chiang Mai University

Gemmerich (1999)

Chiang Mai University Skeletal Collection Khon Kaen University Collection The Spitalfields Coffin Plate Collection Scheuer Collection

Modern cemeteries of the Vaud Canton 1993–96

Late 19th century to 1988 1729–1859

Techataweewan et al. (2017)

St. Bride’s Church St Bride’s Church St Thomas Anglican Church George Huntington Collection Hamann–Todd Osteological Collection

244

Archeological and anatomical 1666–1850

56

19th century

90

19th century

Department of Anatomy, Faculty of Medicine, Khon Kaen University British Museum of Natural History Collections: London Centre for Anatomy and Human Identification, University of Dundee. St. Bride’s Church, Fleet Street, London British Museum (Natural History), London London

>3600

1893–1921

3300

1912–38

745

389

150

National Museum of Natural History, Smithsonian Institution Cleveland Museum of Natural History

King et al. (1998), King (1997b)

Molleson and Cox (1993)

Centre for Anatomy and Human Identification, Anatomical Collections (n.d.) Gapert et al. (2009) Scheuer and Maclaughlin-Black (1994) Saunders et al. (1992) Hunt and Spatola (2008)

Cobb (1959), Kern (2006)

Country USA USA

USA

USA

USA

USA

Collection name Johns Hopkins Fetal Collection Edgar H. Maxwell Museum’s Documented Skeletal Collection National Museum of Health and Medicine Skeletal Collections Robert J. Terry Anatomical Skeletal Collection: W. Montague Cobb Human Skeletal Collection William M. Bass Donated Skeletal Collection

Number of documented complete skeletons

Collection date

112

1920s

>300

Last 40 years

c. 5000

Location

Reference

Cleveland Museum of Natural History Maxwell Museum of Anthropology, part of the University of New Mexico, Albuquerque

Erikson (1981)

Civil War; Indian Wars

National Museum of Health and Medicine, Washington

Barbian et al. (2000), National Museum of Health and Medicine Skeletal Collections (n.d.)

1728

1898–1941

Hunt and Albanese, 2005

700

1932–69

National Museum of Natural History, Smithsonian Institution, Washington Howard University, Washington

>1700

From 1981 to the present

Forensic Anthropology Center, The University of Tennessee

Shirley et al. (2011), Steadman (n. d.)

Komar and Grivas (2008), Edgar (n.d.)

Rankin-Hill and Blakey (1994)

USA

Morphology Collection

236

Modern and archeological

American Museum of Natural History, New York

USA

Hamilton County Forensic Center Donated Collection

67

2005

Hamilton County Forensic Center, Chattanooga

USA

Trotter Collection

>133 foetal

?

Washington State University

https://www.amnh.org/ourresearch/anthropology/collections/collections-history/biological-anthropology (accessed 23/04/19) https://www.google.com/maps/d/viewer?amp%3Bz&%3Bll=26.056892449769215%2C0&mid=162_ElRDZuDCJfM10jCkPpRSFPSw&ll=35.08017709999999%2C-85.26476680000002&z=8 (accessed 24/04/19) Holcomb and Konigsberg (1995)

Smithsonian Museum (Iscan, 1992; Hunt and Albanese, 2005) were the basis upon which many of the standard anthropological techniques were developed. Increasingly, however, there are research skeletal collections being developed around the world (see Table 1). A continually updated summary of information relative to global osteological collections is also available (e.g., Anon, 2019a; FASE, ND) in the form of an interactive map summarizing individual collections grouped according to three categories: (i) modern identified, individuals born after 1920 with known age at death and sex; (ii) nonmodern identified, as aforementioned but born before 1920; and (iii) identified of uncertain temporal status, individuals with known variables but no data relative to dates of birth. A contemporary and historic skeletal collection database is also available online that is searchable according to region, time period, and/or specific skeletal collection; a description of the collection is provided with a link to associated published research (e.g., Anon, 2019b).
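
The three-way grouping used by the interactive map can be written as a simple decision rule. The sketch below is only an illustration: the function and argument names are hypothetical, and because the text does not say how an individual born in 1920 itself is treated, that case is assigned to the nonmodern group here as an assumption. Records lacking age at death or sex fall outside the three identified categories and return None.

```python
from typing import Optional

# Threshold year stated in the text for separating modern from nonmodern.
MODERN_CUTOFF_YEAR = 1920


def temporal_category(
    birth_year: Optional[int], age_known: bool, sex_known: bool
) -> Optional[str]:
    """Assign an individual to one of the three map categories.

    Returns None when age at death or sex is unknown, since all three
    categories describe identified individuals.
    """
    if not (age_known and sex_known):
        return None
    if birth_year is None:
        return "identified of uncertain temporal status"
    if birth_year > MODERN_CUTOFF_YEAR:
        return "modern identified"
    # Assumption: a birth year of exactly 1920 is grouped with earlier births.
    return "nonmodern identified"


# Hypothetical records: (birth year, age at death known, sex known)
for record in [(1954, True, True), (1887, True, True), (None, True, True), (1960, False, True)]:
    print(record, "->", temporal_category(*record))
```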

Documented human skeletal collections (virtual)

While there has been much debate about the ethics associated with the collection of human skeletal remains for research and teaching (Walsh-Haney and Lieberman, 2005; Blau, 2016; Passalacqua and Pilloud, 2018), there has also been discussion about the applicability of using collections where the temporal period, socioeconomic status, and age-at-death distribution of the individuals may not necessarily be relevant to the research being undertaken (Henderson and Cardoso, 2018). Consequently, there has been increased attention paid to the development of virtual skeletal collections created by three-dimensional laser scanning of skeletal remains (Beebe, 2014; see also Hassett, 2018), or based on medical modalities, including computed tomography (CT) or magnetic resonance imaging (MRI) (Anon, 2018). Anthropological research exploiting such digital technologies embodies what is now known as “virtual anthropology,” defined as an approach involving the study of three-dimensional representations of anatomical data (Weber and Bookstein, 2011; Johnstone-Belford et al., 2018; Lottering et al., 2016). In general, there are two different types of virtual skeletal “collections” specifically based on medical imaging. These “collections” are defined according to whether the individuals they comprise were scanned primarily for antemortem (e.g., clinical or research) or postmortem (forensic) investigations. Repositories of clinical medical images, while not acquired for the specific purpose of facilitating forensic research, offer an important mechanism for acquiring contemporary biological data specific to a geographical region. Although clinical medical scans represent a permanent source of information that can be used as a proxy for physical skeletal collections, there are limitations to the clinical datasets. 
For example, the reason the scan is obtained typically means that only one anatomical region of the body is imaged, compared with the full body in postmortem cases. Further, because clinical scans are undertaken on the living, they must accord with ethical guidelines relative to the deleterious effect of ionizing radiation (Prasad et al., 2004). In addition, in most clinical records, ancestry is not defined or recorded, as it is not deemed medically relevant. Even if those data existed, there are still issues surrounding how individuals identify (e.g., social perceptions versus physical biological indicators of ancestry). The benefits, however, of contemporary population-specific data that are otherwise not possible to acquire mitigate to some extent those limitations, especially when the alternative is applying data from geographically and/or temporally removed populations (see the preceding text and Franklin and Flavel, 2019). There are also issues relating to how (if at all) consent for clinical data can be obtained for research. The specific mechanisms vary according to different national and institutional legislations and as such are simply too broad and variable to consider here. However, in the context of tertiary research involving deidentified (anonymized) CT images from living individuals, such approvals typically involve (in the first instance) review and approval of a submitted research proposal by a University Human Research Ethics Committee (HREC). If approved, that same proposal is then subject to further review and formal approval by an ethics committee of any participating institution from which clinical scans are being requested. In that instance, personal consent from each patient whose scan(s) are accessed is deemed unnecessary because the data received are completely deidentified and confidentiality is thus ensured. Many of these limitations are less inherent in postmortem computed tomography (PMCT) scans that are performed as part of routine autopsy procedures. While radiation exposure is not an issue when dealing with the deceased, and demographic data are generally more readily available through associated case management systems, details such as ancestry are often still limited. 
The driver for the development of repositories of postmortem scans is largely the increasing implementation of modern cross-sectional imaging technologies into traditional autopsy approaches, known as “Virtopsy” (e.g., Grabherr et al., 2007; Bolliger and Thali, 2015). The benefits of PMCT in forensic medicine are well documented (e.g., Bolliger and Thali, 2015; Brough et al., 2015; Christensen et al., 2018) and include a noninvasive supplement, or even a partial replacement (e.g., in cases of religious objection; Byard, 2011), to traditional approaches using three-dimensional optical and volumetric radiographic data based on individual documentation of the body surface and internal structures (Thali et al., 2005). Consequently, databases of postmortem images are starting to be developed, which, following the appropriate ethics approval, can be accessed for teaching and research purposes (Bassed, 2018; Naimo et al., 2015; Hall et al., 2019). For example, at the Victorian Institute of Forensic Medicine in Australia, more than 75,000 full body PMCT scans have been acquired as part of routine autopsy procedures since 2005 (Blau et al., 2018). At the University of New Mexico, whole body PMCT scans obtained between 2010 and 2017 are now being made available for research (Anon, 2018). Databases such as these are increasingly used in forensic medical institutes relative to investigations in various case contexts, most routinely including (but not limited to) violent, unnatural, unexpected, accidental, or injury-related deaths. Postmortem imaging is also particularly relevant in investigations of health-related and custodial deaths referred for coronial enquiry, including instances where the person’s identity and/or cause of death is unknown. The broad spectrum of deaths requiring investigation suggests that the “forensic population” is typically broadly representative of the wider population, thus further reaffirming the importance of such collections in teaching and research, albeit cognizant of the inherent advantages and disadvantages of such repositories.

Conclusions

Biological data specific to the human skeleton are the foundation of forensic anthropological research that directly informs professional practice in a number of contexts. Repositories of real human skeletal remains, which have traditionally been the primary sources of anthropological information, are now increasingly being supplemented by data acquired from virtual modalities. The drivers for the latter are many and varied but primarily relate to the importance of demonstrating accurate, judicially admissible findings based on population-specific data. As discussed, the appropriateness of such data (e.g., methods comprising predictive models) derived from the analysis of a specific skeletal collection(s), whether physical or virtual, varies with the context of application. It is thus the responsibility of the forensic specialist tasked with performing an anthropological assessment to ensure that their findings are based on methods encompassing data that are most appropriate to their specific case context and that also meet requisite judicial requirements. That cannot be achieved without deep and critical reflection on the foundations of the skeletal collection(s) used to produce the methods applied in routine practice.

References

Albanese, J., 2018. The Grant Human Skeletal Collection and other contributions of J.C.B. Grant to anatomy, osteology, and forensic anthropology. In: Henderson, C.Y., Cardoso, F.A. (Eds.), Identified Skeletal Collections: the Testing Ground of Anthropology? Archaeopress, Oxford, pp. 35–58.
Albanese, J., Cardoso, H.F.V., 2019. Commentary on: Stephan CN, Ross AH. Letter to the editor – A code of practice for the establishment and use of authentic human skeleton collections in forensic anthropology. J. Forensic Sci. 63 (5), 1604–1607. https://doi.org/10.1111/1556-4029.14078.
Albanese, J., Saunders, S.R., 2006. Is it possible to escape racial typology in forensic identification? In: Schmitt, A., Cunha, E., Pinheiro, J. (Eds.), Forensic Anthropology and Medicine. Humana Press Incorporated, Totowa, NJ, pp. 281–316.
Albanese, J., Tuck, A., Gomes, J., Cardoso, H.F., 2016. An alternative approach for estimating stature from long bones that is not population- or group-specific. Forensic Sci. Int. 259, 59–68.
Alblas, A., Greyling, L.M., Geldenhuys, E.-M., 2018. Composition of the Kirsten skeletal collection at Stellenbosch University. S. Afr. J. Sci. 114, 1–6.

Alemán, I., Irurita, J., Valencia, A.R., Martínez, A., López-Lázaro, S., Viciano, J., Botella, M.C., 2012. Brief communication: The Granada osteological collection of identified infants and young children. Am. J. Phys. Anthropol. 149 (4), 606–610.
Anon, 2018. UNM database of deceased people a national first [Internet]. The University of New Mexico. Available from: https://carc.unm.edu/research/unm-database-of-deceasedpeople-a-national-first.html.
Anon, 2019a. Identified collections. https://www.google.com/maps/d/viewer?amp%3Bz&amp%3Bll=26.056892449769215%2C0&mid=162_ElRDZuDCJfM10jCkPpRSFPSw&ll=50.06890860000001%2C14.424538799999937&z=8 (Accessed 24 April 2019).
Anon, 2019b. Skeletal collections database. http://www.highfantastical.com/skelcoll/results.php?region=all&time=modern (Accessed 24 April 2019).
BABAO, 2019. Code of Practice. (The British Association of Biological Anthropology and Osteoarchaeology).
Baccino, E., Ubelaker, D.H., Hayek, L.C., Zerilli, A., 1999. Evaluation of seven methods of estimating age at death from mature human skeletal remains. J. Forensic Sci. 44, 931–936.
Barbian, L.T., Sledzik, P.S., Nelson, A.M., 2000. Case studies in pathology from the national museum of health and medicine, armed forces institute of pathology. Ann. Diagn. Pathol. 4, 170–173.
Bassed, R., 2018. Optimising image and PMCT databases for research at the Victorian Institute of Forensic Medicine. Available from: https://monash.figshare.com/articles/Optimising_image_and_PMCT_databases_for_research_at_the_Victorian_Institute_of_Forensic_Medicine/6060239 (Accessed 14 May 2019).
Beebe, K.L., 2014. The Feasibility of Creating a 3D Digital Skeletal Collection for Research Purposes and Museum Use. Master of Arts Thesis. Louisiana State University.
Belcastro, M.G., Rastelli, E., Mariotti, V., 2008. Variation of the degree of sacral vertebral body fusion in adulthood in two European modern skeletal collections. Am. J. Phys. Anthropol. 135, 149–160.
Belcastro, M.G., Bonfiglioli, B., Pedrosi, M.E., Zuppello, M., Tanganelli, V., Mariotti, V., 2017. The history and composition of the identified human skeletal collection of the Certosa Cemetery (Bologna, Italy, 19th–20th Century). Int. J. Osteoarchaeol. 27 (5), 912–925.
Blau, S., 2016. More than just bare bones: Ethical considerations for forensic anthropologists. In: Blau, S., Ubelaker, D.H. (Eds.), Handbook of Forensic Anthropology and Archaeology. Routledge, London, pp. 593–606.
Blau, S., 2018. It’s all about the context: Reflections on the changing role of forensic anthropology in medico-legal death investigations. Aust. J. Forensic Sci. 50, 628–638.
Blau, S., Ubelaker, D.H. (Eds.), 2016. Handbook of Forensic Anthropology and Archaeology. Routledge, London.
Blau, S., Ranson, D., O’Donnell, C., 2018. An Atlas of Skeletal Trauma in Medico-Legal Contexts. Elsevier, London.
Boldsen, J., Milner, G., Konigsberg, L., Wood, J., 2002. Transition analysis: A new method for estimating age from skeletons. In: Hoppa, R., Vaupel, J. (Eds.), Paleodemography: Age Distributions from Skeletal Samples. Cambridge, New York, pp. 73–106.
Bolliger, S.A., Thali, M.J., 2015. Imaging and virtual autopsy: Looking back and forward. Philos. Trans. R. Soc. B 370, 1–7.
Bosio, L.A., García Guraieb, S., Luna, L.H., Aranda, C., 2012. Chacarita project: Conformation and analysis of a modern and documented human osteological collection from Buenos Aires City – Theoretical, methodological and ethical aspects. HOMO-J. Comp. Human Biol. 63 (9), 481–492.

Brace, C.L., 1995. Region does not mean “race” – Reality versus convention in forensic anthropology. J. Forensic Sci. 40, 171–175.
Bradley, R., 2016. A Matter of Life and Death. Jessica Kingsley Publishers, London.
Brasili-Gualandi, P., Gualdi-Russo, E., 1989. Discontinuous traits of the skull: variations on sex, age, laterality. Anthropol. Anz. 47, 239–250.
Brough, A.L., Morgan, B., Rutty, G.N., 2015. Postmortem computed tomography (PMCT) and disaster victim identification. Radiol. Med. 120, 866–873.
Buikstra, J.E., 2019. Ortner’s Identification of Pathological Conditions in Human Skeletal Remains, third ed. Academic Press, London.
Byard, R.W., 2011. Indigenous communities and the forensic autopsy. Forensic Sci. Med. Pathol. 7, 139–140.
Cardoso, H.F.V., 2006. Brief communication: the collection of identified human skeletons housed at the Bocage Museum (National Museum of Natural History), Lisbon, Portugal. Am. J. Phys. Anthropol. 129 (2), 173–176.
Cardoso, F.A., Henderson, C., 2013. The categorisation of occupation in identified skeletal collections: A source of bias? Int. J. Osteoarchaeol. 23, 186–196.
Case, D.T., Heilman, J., 2005. Pedal symphalangism in modern American and Japanese skeletons. HOMO-J. Comp. Human Biol. 55, 251–262.
Castellana, C., Kosa, F., 2001. Estimation of fetal age from dimensions of atlas and axis ossification centers. Forensic Sci. Int. 117, 31–43.
Cattaneo, C., 2007. Forensic anthropology: Developments of a classical discipline in the new millennium. Forensic Sci. Int. 165, 185–193.
Cattaneo, C., Mazzarelli, D., Cappella, A., Castoldi, E., Mattia, M., Poppa, P., De Angelis, D., Vitello, A., Biehler-Gomez, L., 2018. A modern documented Italian identified skeletal collection of 2127 skeletons: The CAL Milano cemetery skeletal collection. Forensic Sci. Int. 287, 219.e1–e5.
Centre for Anatomy and Human Identification, Anatomical Collections, n.d. Available at: http://www.lifesci.dundee.ac.uk/cahid/anatomical-collections (Accessed 17 August 2011).
Chamberlain, A., 2000. Problems and prospects in palaeodemography. In: Cox, M., Mays, S. (Eds.), Human Osteology: In Archaeology and Forensic Science. Greenwich Medical Media, London, pp. 101–115.
Chi-Keb, J.R., Albertos-González, V.M., Ortega-Muñoz, A., Tiesler, V.G., 2013. A new reference collection of documented human skeletons from Merida, Yucatan, Mexico. HOMO-J. Comp. Human Biol. 64 (5), 366–376.
Christensen, A.M., Passalacqua, N.V., Bartelink, E.J., 2014. Forensic Anthropology: Current Methods and Practice. Elsevier, Oxford.
Christensen, A.M., Passalacqua, N.V., Schmunk, G.A., Fudenberg, J., Hartnett, K., Mitchell, R.A., et al., 2015. The value and availability of forensic anthropological consultation in medicolegal death investigations. Forensic Sci. Med. Pathol. 11, 438–444.
Christensen, A.M., Smith, M.A., Gleiber, D.S., Cunningham, D.L., Wescott, D.J., 2018. The use of X-ray computed tomography technologies in forensic anthropology. Forensic Anthropol. 1, 124–140.
Cobb, W.M., 1959. Thomas Wingate Todd. J. Natl. Med. Assoc. 51, 233–246.
Colman, K.L., Janssen, M.C.L., Stull, K.E., van Rijn, R.R., Oostra, R.J., de Boer, H.H., van der Merwe, A.E., 2018. Dutch population specific sex estimation formulae using the proximal femur. Forensic Sci. Int. 286, 268.e1–268.e8.
Cunha, E., Wasterlain, A., 2007. The Coimbra identified osteological collection. In: Grupe, C., Peters, J. (Eds.), Skeletal Series and their Socio Economic Context. Verlag Marie Leidorf, Germany, pp. 23–33.

Cunha, E., Lopez-Capp, T.T., Inojosa, R., Marques, S.R., Moraes, L.O.C., Liberti, E., et al., 2018. The Brazilian identified human osteological collections. Forensic Sci. Int. 289, 449.e1–e6.
Davis, S., 2015. Meet the living people who collect dead human remains. https://www.vice.com/en_us/article/wd7jd5/meet-the-living-people-who-collect-human-remains-713 (Accessed 02 April 2019).
Day, M., 1990. Archaeological ethics and the treatment of the dead. Anthropol. Today 6, 15–16.
Dayal, M.R., Kegley, A.D.T., Strkalj, G., Bidmos, M.A., Kuykendall, K.L., 2009. The history and composition of the Raymond A. Dart collection of human skeletons at the University of Witwatersrand, Johannesburg, South Africa. Am. J. Phys. Anthropol. 140 (2), 324–335.
de Francisco, M., Lemos, J.L.R., Liberri, E., Adamo, J., Jacomo, A.L., Matson, E., 1990. Contribution to the study of anatomical hypoglossal canal variations. Revista de odontologia da Universidade de São Paulo 4, 38–42.
de Oliveira Gamba, T., Corrêa Alves, M., Haiter-Neto, F., 2016. 3.1 Mandibular sexual dimorphism analysis in CBCT scans of a Brazilian population. J. Forensic Radiol. Imag. 2, 104.
Defrise-Gussenhoven, E., Orban-Segebarth, R., 1984. Generalized distance between different thigh-bones and a reference population. In: van Vark, G.N., Howells, W.W. (Eds.), Multivariate Statistical Methods in Physical Anthropology. D. Reidel Publishing Company, Dordrecht, pp. 89–99.
DeSilva, R., Flavel, A., Franklin, D., 2014. Estimation of sex from the metric assessment of digital hand radiographs in a Western Australian population. Forensic Sci. Int. 244, 314.e311–314.e317.
Duday, H., 2009. The Archaeology of the Dead: Lectures in Archaeothanatology. Oxbow Books, Oxford.
Edgar, H. Maxwell Museum’s Documented Skeletal Collection. https://maxwellmuseum.unm.edu/collections/osteology (Accessed 24 April 2019).
El Morsi, D.A., Gaballah, G., Mahmoud, W., Tawfik, A.I., 2017. Sex determination in Egyptian population from scapula by computed tomography. J. Forensic Res. 8, 1–4.
Eliopoulos, C., Lagia, A., Manolis, S., 2007. A modern, documented human skeletal collection from Greece. HOMO-J. Comp. Human Biol. 58, 221–228.
Erikson, G.E., 1981. Adolph Hans Schultz, 1891–1976. Am. J. Phys. Anthropol. 56, 365–371.
Facchini, F., Mariotti, V., Bonfiglioli, B., Belcastro, M.G., 2006. Les collections osteologiques et osteoarcheologiques du Musee d’Anthropologie de l’universite de Bologne (Italie). In: Ardagna, Y., Bizot, B., Boëtsch, G., Delestre, X. (Eds.), Les collections osteologiques humaines: gestion, valorisation et perspectives. Actes de la table ronde de Carry-le-Rouet (Bouches-du-Rhône, France) 25–26 avril 2003. Bulletins Archeologique de Provence, Supplement 4: 67–70.
FASE, n.d. Osteological Collections. http://forensicanthropology.eu/osteological-collections/.
Ferreira, M.T., Vicente, R., Navega, D., Goncalves, D., Curate, F., Cunha, E., 2014. A new forensic collection housed at the University of Coimbra, Portugal: The 21st century identified skeletal collection. Forensic Sci. Int. 245, 202.e1–e5.
Fox, J.A., Fridel, E.E., 2017. Gender differences in patterns and trends in US homicide, 1976–2015. Violence Gend. 4, 37–43.
Franklin, D., Flavel, A., 2019. Population specificity in the estimation of skeletal age and sex: Case studies using a Western Australian population. Aust. J. Forensic Sci. https://doi.org/10.1080/00450618.2019.1569722.

Franklin, D., Marks, M.K., 2013. Species: Human versus nonhuman. In: Siegel, J.A., Saukko, P.J. (Eds.), Encyclopedia of Forensic Sciences. second ed. In: vol. 1. Academic Press, Waltham, pp. 28–33.
Gannett, L., 2014. Biogeographical ancestry and race. Stud. History Philos. Sci. C: Stud. History Philos. Biol Biomed. Sci. 47, 173–184.
Gapert, R., Black, S., Last, J., 2009. Sex determination from the occipital condyle: discriminant function analysis in an eighteenth and nineteenth century British sample. Am. J. Phys. Anthropol. 138, 384–394.
Gemmerich, I., 1999. Creation of an anthropological collection of reference and use of discrete traits in the case of known genealogy. Dissertation, University of Geneva, Geneva.
Giraudi, R., Fissore, F., Giacobini, G., 1984. The collection of human skulls and postcranial skeletons at the Department of Human Anatomy of the University of Torino (Italy). Am. J. Phys. Anthropol. 65, 105–107.
Go, M.C., Lee, A.B., Santos, J.A.D., Vesagas, N.M.C., Crozier, R., 2017. A newly assembled human skeletal reference collection of modern and identified Filipinos. Forensic Sci. Int. 271, 128.e1–128.e5.
Gocha, T.P., Ingvoldstad, M.E., Kolatorowicz, A., Cosgriff-Hernandez, M.T., Sciulli, P.W., 2015. Testing the applicability of six macroscopic skeletal aging techniques on a modern southeast Asian sample. Forensic Sci. Int. 249, 318.e1–318.e7.
Gonzalez, A.T., Lara Barajas, I.D., Olvera Palma, R.R., Garcia Rodriguez, S., Silva Magana, M., 2006. Catalogo San Nicolás Tolentino: una coleccion osteologica contemporanea mexicana. Instituto Nacional de Antropología e Historia, Mexico City, Mexico.
Grabherr, S., Stephan, B.A., Buck, U., Näther, S., Christe, A., Oesterhelweg, L., Ross, S., Dirnhofer, R., Thali, M.J., 2007. Virtopsy–radiology in forensic medicine. Imag. Decis. MRI 11, 2–9.
Graw, M., Czarnetzki, A., Haffner, H.T., 1999. The form of the supraorbital margin as a criterion in identification of sex from the skull: investigations based on modern human skulls. Am. J. Phys. Anthropol. 108, 91–96.
Griffin, R.C., Chamberlain, A.T., Hotz, G., Penkman, K.E., Collins, M.J., 2009. Age estimation of archaeological remains using amino acid racemization in dental enamel: a comparison of morphological, biochemical, and known ages-at-death. Am. J. Phys. Anthropol. 140, 224–252.
Guidotti, A., Bastianini, A., De Stefano, G.F., Hauser, G., 1986. Variations of supraorbital bony structures in Sienese skulls. Acta Anat. 127, 1–6.
Guyomarc’h, P., Velemínská, J., Sedlak, P., Dobisíková, M., Švenkrtová, I., Brůžek, J., 2016. Impact of secular trends on sex assessment evaluated through femoral dimensions of the Czech population. Forensic Sci. Int. 262, 284.e1–284.e6.
Hall, F., Forbes, S., Rowbotham, S., Blau, S., 2019. Using PMCT of individuals of known age to test the Suchey-Brooks method of ageing in Victoria. Aust. J. Forensic Sci. https://doi.org/10.1111/1556-4029.14086.
Harrison, G.A., Weiner, J.S., Tanner, J.M., Barnicot, N.A., 1977. Human Biology: An Introduction to Human Evolution, Variation, Growth and Ecology. Oxford University Press, Oxford.
Hassett, B.R., 2018. Which bone to pick: Creation, curation, and dissemination of online 3D digital bioarchaeological data. Archaeologies 14, 231–249.
Henderson, C.Y., Cardoso, F.A. (Eds.), 2018. Identified Skeletal Collections: The Testing Ground of Anthropology? Archaeopress, Oxford.
Herrera, M.J., Retamal, R., 2017. Reliability of age estimation from iliac auricular surface in a subactual Chilean sample. Forensic Sci. Int. 275, 317.e1–317.e4.

Hershkovitz, I., Greenwald, C., Rothschild, B.M., Latimer, B., Dutour, O., Jellema, L.M., Wish-Baratz, S., Pap, I., Leonetti, G., 1999. The elusive diploic veins: anthropological and anatomical perspective. Am. J. Phys. Anthropol. 108, 345–358.
Holcomb, S.M., Konigsberg, L.W., 1995. Statistical study of sexual dimorphism in the human fetal sciatic notch. Am. J. Phys. Anthropol. 97, 113–125.
Hunt, D.R., Albanese, J., 2005. The history and demographic composition of the Robert J. Terry anatomical collection. Am. J. Phys. Anthropol. 127, 406–417.
Hunt, D.R., Spatola, B., 2008. History and demographic profile of the George S. Huntington collection at the Smithsonian Institution. Am. J. Phys. Anthropol. 46, 121–122.
Introna, F., Di Vella, G., Campobasso, C.P., 1998. Sex determination by discriminant analysis of patella measurements. Forensic Sci. Int. 95, 39–45.
Ion, A., 2011. A brief overview of “Francis J. Rainier” human osteological collection. Annuaire Roumain d’Anthropologie 48, 24–32.
Iscan, M.Y., 1992. A comparison of the Hamann-Todd and Terry collections. Anthropologie 30, 35–40.
Iscan, M.Y., Yoshino, M., Kato, S., 1994. Sex determination from the tibia: standards for contemporary Japan. J. Forensic Sci. 39, 785–792.
Jantz, R.L., Ousley, S.D., 2005. FORDISC 3: Computerized forensic discriminant functions. Version, 3.
Jantz, R.L., Jantz, L.M., Devlin, J.L., 2016. Secular changes in the postcranial skeleton of American whites. Hum. Biol. 88, 65–75.
Jellinghaus, K., Hoeland, K., Hachmann, C., Prescher, A., Bohnert, M., Jantz, R., 2018. Cranial secular change from the nineteenth to the twentieth century in modern German individuals compared to modern euro-American individuals. Int. J. Leg. Med. 132, 1477–1484.
Johnstone-Belford, E., Flavel, A., Franklin, D., 2018. Morphoscopic observations in clinical pelvic MDCT scans: Assessing the accuracy of the Phenice traits for sex estimation in a Western Australian population. J. Forensic Radiol. Imag. 12, 5–10.
Kern, K.F.T., 2006. Wingate Todd: pioneer of modern American physical anthropology. Kirtlandia 55, 1–42.
Kimmerle, E.H., Jantz, R.L., 2008. Variation as evidence: Introduction to a symposium on international human identification. J. Forensic Sci. 53, 521–523.
King, C.A., 1997a. Osteometric Assessment of 20th Century Skeletons from Thailand and Hong Kong. Universal-Publishers.
King, C.A., 1997b. Osteometric assessment of 20th century skeletons from Thailand and Hong Kong. MA thesis, Florida Atlantic University, Boca Raton, FL.
King, C.A., Iscan, M.Y., Loth, S.R., 1998. Metric and comparative analysis of sexual dimorphism in the Thai femur. J. Forensic Sci. 43, 954–958.
Komar, D.A., Grivas, C., 2008. Manufactured populations: What do contemporary reference skeletal collections represent? A comparative study using the Maxwell Museum documented collection. Am. J. Phys. Anthropol. 137, 224–233.
L’Abbe, E.N., Loots, M., Meiring, J.H., 2005. The Pretoria bone collection: a modern South African skeletal sample. HOMO-J. Comp. Human Biol. 56 (2), 197–205.
Lambert, P.M., Walker, P.L., 2019. Bioarchaeological ethics: Perspectives on the use and value of human remains in scientific research. In: Katzenberg, M.A., Grauer, A.L. (Eds.), Biological Anthropology of the Human Skeleton. John Wiley & Sons, London, pp. 3–42.
Liu, W., Chen, S., Xu, Z., 1988. Estimation of age from the pubic symphysis of Chinese males by means of multiple analysis. Acta Anthropol. Sin. 7, 147–153.

Lottering, N., MacGregor, D.M., Alston, C.L., Watson, D., Gregory, L.S., 2016. Introducing computed tomography standards for age estimation of modern Australian subadults using postnatal ossification timings of select cranial and cervical sites. J. Forensic Sci. 61 (S1), S39–S52.
Maat, G., Mastwijk, R., 1995. Fusion status of the jugular growth plate: an aid for age and death determination. Int. J. Osteoarchaeol. 5 (2), 163–167.
Maijanen, H., Jeong, Y., 2018. Discrepancies between reported and cadaveric body size measurements associated with a modern donated skeletal collection. Homo 69, 86–97.
Mann, R.W., 2013. Our bones: The need for diverse human skeletal collections. Anthropol. Forum. 1. e103.
Molgado, S.B., Tellez-grio, J.R., Saint-Leu, P.P.H., 2007. La antropología física y la medicina en la UNAM. Revista de la Facultad de Medicina, UNAM 50 (1), 17–20.
Molleson, T., Cox, M., 1993. The Spitalfields Project. The Anthropology. The Middling Sort. vol. 2. Council for British Archaeology, York: UK.
Monsalve Vargas, T., Isaza Pelaez, J., 2014. Estudio biosocial de una muestra de restos óseos provenientes de la colección osteológica de referencia de la Universidad de Antioquia. Bolet Antropol. 29 (47), 28–55.
Moraitis, K., Zorba, E., Eliopoulos, C., Fox, S.C., 2014. A test of the revised auricular surface aging method on a modern European population. J. Forensic Sci. 59, 188–194.
Murphy, E., 2008. Deviant Burial in the Archaeological Record. Oxbow Books, Oxford.
Naimo, P., O’Donnell, C., Bassed, R., Briggs, C., 2015. The use of computed tomography in determining development, anomalies, and trauma of the hyoid bone. Forensic Sci. Med. Pathol. 11 (2), 177–185.
National Museum of Health and Medicine Skeletal Collections, n.d. Available at: http://nmhm.washingtondc.museum/collections/guide/ganatomical/ga_skeletal/ga_skeletal.html (Accessed 17 August 2011).
Niinimäki, S., 2011. What do muscle marker ruggedness scores actually tell us? Int. J. Osteoarchaeol. 21 (3), 292–299.
Orban, R., Vandoorne, K., 2006. Les squelettes humains de Koksijde (Coxyde) et Schoten: deux collections remarquables conservees à l’Institut royal des Sciences naturelles de Belgique. In: Ardagna, Y., Bizot, B., Boëtsch, G., Delestre, X. (Eds.), Les collections osteologiques humaines: gestion, valorisation et perspectives. Actes de la table ronde de Carry-le-Rouet (Bouches-du-Rhône, France) 25–26 avril 2003. Bulletins Archeologique de Provence, Supplement 4: 79–84.
Orban, R., Eldridge, J., Polet, C., 2011. Potentialites et historique de la collection de squelettes identifies de Schoten (Belgique, 1837–1931). Anthropologica et Praehistorica 122, 147–190.
Ousley, S., Jantz, R., Freid, D., 2009. Understanding race and human variation: Why forensic anthropologists are good at identifying race. Am. J. Phys. Anthropol. 139, 68–76.
Passalacqua, N.V., Pilloud, M.A., 2018. Ethics and Professionalism in Forensic Anthropology. Academic Press, London.
Perizonius, W.R.K., 1984. Closing and non-closing sutures in 256 crania of known age and sex from Amsterdam (AD 1883–1909). J. Hum. Evol. 13, 201–216.
Phillips, J., Simon-Davies, J., 2016. Migration to Australia: A Quick Guide to the Statistics, Research Paper Series, 2015–16. Parliamentary Library, Canberra.
Prasad, K.N., Cole, W.C., Haase, G.M., 2004. Radiation protection in humans: Extending the concept of as low as reasonably achievable (ALARA) from dose to biological damage. Br. J. Radiol. 77, 97–99.

Prescher, A., Bohndorf, K., 1993. Anatomical and radiological observations concerning ossification of the sacrotuberous ligament: is there a relation to spinal diffuse idiopathic skeletal hyperostosis (DISH)? Skeletal Radiol. 22, 581–585.
Rankin-Hill, L.M., Blakey, M.L., 1994. W. Montague Cobb (1904–1990): physical anthropologist, anatomist, and activist. Am. Anthropol. 96, 74–96.
Rissech, C., Steadman, D.W., 2011. The demographic, socio-economic and temporal contextualisation of the Universitat Autonoma de Barcelona collection of identified human skeletons (UAB Collection). Int. J. Osteoarchaeol. 21 (3), 313–322.
Rocha, M.A., 1995. Les collections osteologiques humaines identifiees du Musee Anthropologique de l’Universite de Coimbra. Antropol. Port. 13, 7–38.
Rosenswig, R.M., 1997. Ethics in Canadian archaeology: an international comparative analysis. Can. J. Archaeol. 21, 99–114.
Ruhli, F.J., Hotz, G., Boni, T., 2003. Brief communication: The Galler collection: a little-known historic Swiss bone pathology reference series. Am. J. Phys. Anthropol. 121, 15–18.
Salceda, S.A., Desantolo, B., Garcia Mancuso, R., Plischuk, M., Inda, A.M., 2012. The ‘Prof. Dr. Rómulo Lambre’ Collection: An Argentinian sample of modern skeletons. HOMO-J. Comp. Human Biol. 63 (4), 275–281.
Sanabria-Medina, C., Gonzalez-Colmenares, G., Restrepo, H.O., Rodriguez, J.M.G., 2016. A contemporary Colombian skeletal reference collection: A resource for the development of population specific standards. Forensic Sci. Int. 266, 577.e1–e4.
Sauer, N.J., 1992. Forensic anthropology and the concept of race: if races don’t exist, why are forensic anthropologists so good at identifying them? Soc. Sci. Med. 34, 107–111.
Saunders, S.R., FitzGerald, C., Rogers, T., Dudar, C., McKillop, H., 1992. A test of several methods of skeletal age estimation using a documented archaeological sample. Can. Soc. Forensic Sci. J. 25, 97–118.
Saunders, S.R., Devito, C., Katzenberg, M.A., 1997. Dental caries in nineteenth century upper Canada. Am. J. Phys. Anthropol. 104, 71–87.
Savall, F., Rerolle, C., Herin, F., Dedouit, F., Rouge, D., Telmon, N., Saint-Martin, P., 2016. Reliability of the Suchey-Brooks method for a French contemporary population. Forensic Sci. Int. 266, 586.e1–586.e5.
Scheuer, L., Maclaughlin-Black, S., 1994. Age estimation from the pars basilaris of the fetal and juvenile occipital bone. Int. J. Osteoarchaeol. 4, 377–380.
Sejrsen, B., Lynnerup, N., Hejmadi, M., 2005. An historical skull collection and its use in forensic odontology and anthropology. J. Forensic Odontostomatol. 23, 40–44.
Shirley, N.R., Wilson, R.J., Jantz, L.M., 2011. Cadaver use at the University of Tennessee’s anthropological research facility. Clin. Anat. 24, 372–380.
Singh, S., Singh, S.P., 1972. Identification of sex from the humerus. Indian J. Med. Res. 60, 1061–1066.
Singh, S., Singh, G., Singh, S.P., 1974. Identification of sex from the ulna. Indian J. Med. Res. 62, 731–735.
Smay, D., Armelagos, G., 2000. Galileo wept: A critical assessment of the use of race in forensic anthropology. Transform. Anthropol. 9, 19–29.
Steadman, D. WM Bass Donated Skeletal Collection. https://fac.utk.edu/wm-bass-donatedskeletal-collection/ (Accessed 24 April 2019).
Stephan, C.N., Ross, A.H., 2018. Letter to the editor – A code of practice for the establishment and use of authentic human skeleton collections in forensic anthropology. J. Forensic Sci. 63, 1604–1607.

Stephan, C.N., Ross, A.H., 2019. Authors’ response. J. Forensic Sci. https://doi.org/10.1111/1556-4029.14079.
Stephan, C.N., Caple, J.M., Veprek, A., Sievwright, E., Kippers, V., Moss, S., et al., 2017. Complexities and remedies of unknown-provenance osteology. In: Štrkalj, G., Pather, N. (Eds.), Commemorations and Memorials: Exploring the Human Face of Anatomy. World Scientific Publishing Co, Singapore, pp. 65–95.
Suwa, G., 1981. A morphological analysis of Japanese crania by means of the vestibular coordinate system. J. Anthropol. Soc. Tokyo 89, 329–350.
Szilvassy, J., Kritscher, H., 1990. Estimation of chronological age in man based on the spongy structure of long bones. Anthropol. Anz. 48, 289–298.
Techataweewan, N., Tuamsuk, P., Toomsan, Y., Woraputtaporn, W., Prachaney, P., Tayles, N., 2017. A large modern Southeast Asian human skeletal collection from Thailand. Forensic Sci. Int. 278, 406.e1–406.e6.
Thali, M.J., Braun, M., Buck, U., Aghayev, E., Jackowski, C., Vock, P., Sonnenschein, M., Dirnhofer, R., 2005. VIRTOPSY – Scientific documentation, reconstruction and animation in forensic: individual and real 3D data based geo-metric approach including optical body/object surface and radiological CT/MRI scanning. J. Forensic Sci. 50, 1–15.
Thomas, C.D.L., Clement, J.G., 2012. The Melbourne femur collection: How a forensic and anthropological collection came to have broader applications. In: Crowder, C., Stout, S. (Eds.), Bone Histology: An Anthropological Perspective. CRC Press, Boca Raton, pp. 361–375.
Tobias, P.V., 1991. On the scientific, medical, dental and educational value of collections of human skeletons. Int. J. Anthropol. 6, 277–280.
Trancho, G.J., Robledo, B., Lopez-Bueis, I., Sanchez, J.A., 1997. Sexual determination of the femur using discriminant functions. Analysis of a Spanish population of known sex and age. J. Forensic Sci. 42, 181–185.
Ubelaker, D.H., 2014. Osteology reference collections. In: Smith, C. (Ed.), Encyclopedia of Global Archaeology. Springer, pp. 5632–5641.
Ubelaker, D.H., DeGaglia, C.M., 2017. Population variation in skeletal sexual dimorphism. Forensic Sci. Int. 278, 407.e1–407.e7.
Ulguim, P., 2018. Models and metadata: The ethics of sharing bioarchaeological 3D models online. Archaeol.: J. World Archaeol. Congr. 14 (2), 189–228.
Urzua, C.L., Balboa, M.R., Yermani, R.R., Lafonataine, E.A., 2008. Arqueología del depósito: manejo integral de las colecciones bioantropológicas en el Departamento de Antropología de la Universidad de Chile. Conserva 12, 69–96.
Usher, B.M., 2002. Reference samples: The first step in linking biology and age in the human skeleton. In: Hoppa, R.D., Vaupel, J.W. (Eds.), Paleodemography: Age Distributions from Skeletal Samples. In: vol. 31. Cambridge University Press, Cambridge, pp. 29–47.
Walsh-Haney, H., Lieberman, L.S., 2005. Ethical concerns in forensic anthropology. In: Turner, T.R. (Ed.), Biological Anthropology and Ethics: From Repatriation to Genetic Identity. State University of New York, Albany, pp. 121–132.
Weber, G.W., Bookstein, F.L., 2011. Virtual Anthropology: A Guide to a New Interdisciplinary Field. Springer Verlag, Vienna.
Wescott, D.J., Jantz, R.L., 2005. Assessing craniofacial secular change in American blacks and whites using geometric morphometry. In: Slice, D.E. (Ed.), Modern Morphometrics in Physical Anthropology. Springer, Boston, MA, pp. 231–245.

Wilson, R.J., Algee-Hewitt, B., Jantz, L.M. Demographic trends within the Forensic Anthropology Center’s body donation program. Presented at the 2007 Annual Meeting of the American Academy of Physical Anthropologists, p. 252.
Winburn, A.P., 2018. Subjective with a capital S? Issues of objectivity in forensic anthropology. In: Boyd, C.C., Boyd, D.C. (Eds.), Forensic Anthropology: Theoretical Framework and Scientific Basis. Wiley, Chichester, pp. 21–37.
Wright, R.V.S., 1992. Correlation between cranial form and geography in Homo sapiens: CRANID – A computer program for forensic and other applications. Archaeol. Ocean. 27, 128–134.

Recommended reading

Franklin, D., Flavel, A., 2015. CT evaluation of timing for ossification of the medial clavicular epiphysis in a contemporary Western Australian population. Int. J. Leg. Med. 129, 583–594.

CHAPTER 1.4

Initial assessment: Measurement errors and interrater reliability

Tomáš Zeman (a) and Radoslav Beňuš (b)

(a) Department of Military Science Theory, Faculty of Military Leadership, University of Defence in Brno, Brno, Czech Republic
(b) Department of Anthropology, Faculty of Natural Sciences, Comenius University in Bratislava, Bratislava, Slovakia

Introduction

Once you obtain your data, the next step should always be to assess their reliability. Imagine that you want to estimate the sex of a person from their skeletal remains. You measure, for example, several diameters of the pelvis, enter these figures into a logistic regression equation, and estimate with 98% probability that this person was female. However, what if your measurements were imprecise or inaccurate? What if the real dimensions differ somewhat from your values? What would happen if you entered the real dimensions into the logistic regression equation? Would the probability still be 98%? Would you even still come to the same conclusion that the person was female? To answer these questions, measurement error, or interrater and intrarater reliability, needs to be assessed.
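
The questions above can be explored numerically. The following minimal sketch, written in Python, assumes a purely hypothetical logistic regression equation: the intercept, the coefficients, and the measurement names are invented for illustration and do not come from any published sex-estimation method. It shifts each measurement by a chosen error and reports the range of probabilities that results.

```python
import math
from itertools import product

# Hypothetical model, for illustration only: these values are NOT taken
# from any published sex-estimation method.
INTERCEPT = -35.1
COEFFICIENTS = {"pelvic_diameter_1_mm": 0.2, "pelvic_diameter_2_mm": 0.3}


def probability_female(measurements):
    """Logistic regression: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value
                        for name, value in measurements.items())
    return 1.0 / (1.0 + math.exp(-z))


def probability_range(measurements, error_mm):
    """Lowest and highest probability when each measurement may be off by
    -error_mm, 0, or +error_mm (all combinations are tried)."""
    names = list(measurements)
    probabilities = []
    for shifts in product((-error_mm, 0.0, error_mm), repeat=len(names)):
        shifted = {name: measurements[name] + shift
                   for name, shift in zip(names, shifts)}
        probabilities.append(probability_female(shifted))
    return min(probabilities), max(probabilities)


# Hypothetical recorded measurements (in millimetres).
recorded = {"pelvic_diameter_1_mm": 120.0, "pelvic_diameter_2_mm": 50.0}
print(f"recorded values: p(female) = {probability_female(recorded):.3f}")
for error in (2.0, 10.0):
    low, high = probability_range(recorded, error)
    print(f"error of +/-{error} mm: p(female) between {low:.3f} and {high:.3f}")
```

With these invented numbers, the recorded values give a probability of about 0.98. An error of 2 mm in each measurement keeps the estimate well above 0.5, whereas an error of 10 mm could push it below 0.5 and reverse the conclusion. How large an error is realistic is exactly what the assessment of measurement error is meant to establish.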

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00009-5 © 2020 Elsevier Inc. All rights reserved.

Before you start

Before calculating the measurement error, or interrater or intrarater reliability, it is useful to clarify what type of data you are dealing with. The best-known classification of measurement scales, by Stevens (1946), divides the scales into four categories: nominal, ordinal, interval, and ratio. Stevens (1946) defined these categories, and the criteria for assigning variables to them, according to the mathematical operations that can be performed on the variables.

For variables in the nominal scale, we can only tell whether one element (which can be basically anything, e.g., a word, number, or letter) does or does not equal another element. We cannot determine the order of these elements. For example, we cannot say that an incisor is more than a canine; it is just a different type of tooth. But we can find out (based on morphology or metrics) whether the tooth under study is or is not an incisor or a canine.

For variables in the ordinal scale, we can also order the elements, but we cannot determine the distance between them. This means that we can say whether one element equals another and also whether it is more or less than the other. For example, when we look at two molars (without measuring them), we may be able to say which of them shows more advanced abrasion. But we would not be able to say whether the difference in abrasion stage between one pair of molars is the same as the difference between another pair of molars, because in this example we have no objective metric for the abrasion stage.

For variables in the interval scale, we can determine whether two differences are of the same magnitude. Unlike in ratio variables, the zero point is not an absolute zero but is set by convention. A good example is age at death. Take two persons aged 60 and 30 years. We can say that the first person is 30 years older than the other. But from the biological point of view, we can hardly say that the first person is twice as old as the other, because counting age from birth is nothing more than a convention. Age could just as well be counted from the time of conception, and if we calculated age in this way, we would certainly get a different ratio.

For variables in the ratio scale, we can determine whether two differences or two ratios are the same. For example, if we measure the relative surface of the exposed dentine as the metric of abrasion rate, we can tell whether the difference in abrasion rate between one pair of molars is the same as the difference between another pair, and also whether one molar shows twice the abrasion of another.

The classification of Stevens (1946), sometimes slightly modified, is often used not only for the description of data but also for the selection of an appropriate statistical test.
Nevertheless, the concept of Stevens was later subjected to severe criticism (for review, see Velleman and Wilkinson, 1993). One of the criticisms was that the classification of variables into Stevens’ categories is not unambiguous. There are many cases in which a single variable can be assigned to two or more scales, depending on the context in which the variable is used. Imagine, for example, a mass grave of victims of war crimes. Those who buried them may have had a system, adding a label with an identification number to each body. By comparing these numbers with documentation prepared by the perpetrators, a forensic team could identify the bodies in the mass grave. In this case the identification numbers would be viewed as a variable in nominal scale. One day, however, the team figures out that the perpetrators assigned the numbers consecutively, in which case the identification numbers can be treated as a variable in ordinal scale (the victim with a lower number was killed earlier than the one with a higher number). This additional information can give insight into the order of killings, and statistical tests can be performed to answer questions such as whether men were killed first. Moreover, these identification numbers may also have contained information (perhaps encrypted) about the date on which each victim was killed, so the numbers may be treated as a variable in interval scale. After analyzing these numbers, the examiner would then be able to determine whether one or more victims were killed, say, 10 days earlier than others. As a result the same identification number can be viewed as a variable in nominal, ordinal, or interval scale, depending on the context. Fortunately, such a dilemma is not very common in forensic anthropology.

Assessment of measurement error

Several techniques for the assessment of measurement error have been proposed for variables in interval or ratio scale (for review, see Ulijaszek and Kerr, 1999; Ulijaszek and Lourie, 1994). Of these, the most commonly used approach is to calculate the technical error of measurement (TEM) and the coefficient of reliability (R), which were probably first introduced to anthropology by Mueller and Martorell (1988). All you need to do is measure a selected dimension, for example, the maximum femoral length (F1), twice (or more times) in a sample of skeletal remains. Afterward, TEM can be easily calculated using the following equation:

$$\mathrm{TEM} = \sqrt{\frac{\sum_{i=1}^{N}\left(\sum_{j=1}^{K} X_{i,j}^{2} - \frac{\left(\sum_{j=1}^{K} X_{i,j}\right)^{2}}{K}\right)}{N(K-1)}}, \tag{1}$$

where $X_{i,j}$ is the jth measurement of the ith specimen, K is the number of repeated measurements (i.e., how many times each specimen was measured), and N is the sample size (i.e., the number of specimens that were measured). In the case of two repeated measurements, Eq. (1) simplifies to

$$\mathrm{TEM} = \sqrt{\frac{\sum_{i=1}^{N} D_{i}^{2}}{2N}}, \tag{2}$$

where $D_i$ is the difference between the first and the second measurement of the ith specimen (it does not matter whether the first measurement is subtracted from the second or vice versa). TEM is expressed in the same unit as the original measurement, for example, in millimeters for F1. When both measurements are performed by one person, the result is an intrarater TEM; when each measurement is performed by a different person, the result is an interrater TEM. Based on the TEM value, the coefficient of reliability (R) can be calculated using the following equation:

$$R = 1 - \frac{\mathrm{TEM}^{2}}{\mathrm{SD}^{2}}, \tag{3}$$

where SD is the standard deviation of the measurement in question (such as F1 in our example) in the analyzed sample. The R value tells us what portion of the variability in the selected measurement is not caused by measurement error. For example, R equal to 0.8 means that 80% of the variability in the selected measurement remains when we exclude the influence of measurement error. For two repeated measurements, the mean difference ($\bar{D}$) between them can be calculated using the following equation:

$$\bar{D} = \frac{\sum_{i=1}^{N} D_{i}}{N}. \tag{4}$$
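Eqs. (1) to (4) can be checked numerically with a few lines of code. The following is a minimal sketch, not a reference implementation: the function names and the femoral lengths are hypothetical, and the SD is taken over all recorded values pooled together, which is an assumption, since the text does not specify which measurement series the SD should be based on.

```python
import math
import statistics

def tem(measurements):
    """Technical error of measurement, Eq. (1).

    measurements: list of N specimens, each a list of K repeated values.
    """
    n = len(measurements)
    k = len(measurements[0])
    total = 0.0
    for repeats in measurements:
        sum_of_squares = sum(x * x for x in repeats)
        square_of_sum = sum(repeats) ** 2
        total += sum_of_squares - square_of_sum / k
    return math.sqrt(total / (n * (k - 1)))

def tem_two(first, second):
    """Simplified TEM for two repeated measurements, Eq. (2)."""
    diffs = [a - b for a, b in zip(first, second)]
    return math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))

def reliability(tem_value, sd):
    """Coefficient of reliability R, Eq. (3)."""
    return 1 - tem_value ** 2 / sd ** 2

def mean_difference(first, second):
    """Mean difference between two measurement series, Eq. (4)."""
    return sum(a - b for a, b in zip(first, second)) / len(first)

# Hypothetical maximum femoral lengths (mm); each specimen measured twice.
first = [452.0, 431.0, 470.0, 445.0, 460.0]
second = [453.0, 430.0, 471.0, 444.0, 462.0]

tem_general = tem([list(pair) for pair in zip(first, second)])
tem_simple = tem_two(first, second)
sd = statistics.stdev(first + second)  # assumption: SD pooled over all values
r = reliability(tem_simple, sd)
d_bar = mean_difference(first, second)

print(f"TEM by Eq. (1): {tem_general:.3f} mm, by Eq. (2): {tem_simple:.3f} mm")
print(f"R = {r:.4f}, mean difference = {d_bar:.2f} mm")
```

For K = 2 the two TEM functions should return the same value, which is a convenient check that Eq. (2) is indeed a special case of Eq. (1).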

When three or more repeated measurements were performed, the analysis can be undertaken analogously in pairs (the first measurement against the second, the first against the third, the second against the third, etc.). In each comparison, it is recommended to test (e.g., with the paired t-test) whether the mean difference $\bar{D}$ differs significantly from zero, since such a difference is undesirable, especially when assessing the intraobserver error. Its existence would mean that the examiner's measurement technique changed significantly during the measurement, indicating insufficient experience of the examiner. If $\bar{D}$ does not differ significantly from zero, TEM gives us valuable information about the precision of measurement (i.e., the random error of measurement). When we assess the intraobserver error, TEM is based on a sufficient number of measurements, the mean difference is zero, and the measurement errors are normally distributed (which can be tested, for example, with the Shapiro-Wilk test), the 95% confidence interval for the selected measurement can be approximately calculated as

$$\left(X - 1.96 \cdot \mathrm{TEM},\; X + 1.96 \cdot \mathrm{TEM}\right), \tag{5}$$

where X is the given measurement. If you want to narrow this interval, you can simply measure the dimension several times and use the mean of these repeated measurements. The 95% confidence interval can then be calculated using

$$\left(\frac{\sum_{j=1}^{K} X_{j}}{K} - 1.96 \cdot \frac{\mathrm{TEM}}{\sqrt{K}},\; \frac{\sum_{j=1}^{K} X_{j}}{K} + 1.96 \cdot \frac{\mathrm{TEM}}{\sqrt{K}}\right), \tag{6}$$

where $X_j$ is the jth measurement of the dimension and K is the number of repeated measurements (this K does not have to be the same as the K used when calculating the TEM according to Eq. (1)). The more times you measure the dimension, the more precise the estimate of the mean value will be. But all of this is only valid if your measurement is not systematically biased (i.e., the mean value of the measurement errors does not differ from zero).

Imagine again the situation discussed in the introduction of this chapter. We have measured some pelvic dimension twice, calculated the TEM for this dimension, and would like to estimate sex based on this measurement. As described in the preceding text, the best choice is to use the mean value of the repeated measurements and to enter this mean value into the chosen logistic regression equation. But we can do even more. We can calculate the 95% confidence interval according to Eq. (6). There is a high probability that the true value for the analyzed specimen lies within this interval. Then, we can evaluate the logistic regression equation again, substituting the lower and upper limits of the 95% confidence interval into it. Using this procedure, we obtain two additional probability values (e.g., 0.88 and 0.98) expressing the probability that the pelvic bone comes from a female. The true probability most likely lies between these two limits. This gives us an idea of the impact of our measurement error on the reliability of the sex estimation. Nevertheless, this procedure only works if the error of measurement is random, not if the measurement is systematically biased (e.g., if we measure on average 3 mm less than the author of the logistic regression equation).
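The procedure just described (Eq. (5) or (6), followed by three evaluations of the logistic regression equation) can be sketched as follows. Everything numeric here is hypothetical: the TEM, the two pelvic measurements, and in particular the intercept and slope, which are invented for illustration and are not taken from any published sex estimation equation.

```python
import math

def ci_single(x, tem, z=1.96):
    """Approximate 95% interval for a single measurement, Eq. (5)."""
    return (x - z * tem, x + z * tem)

def ci_mean(values, tem, z=1.96):
    """Approximate 95% interval for the mean of K repeats, Eq. (6).

    Returns (lower limit, mean, upper limit).
    """
    k = len(values)
    mean = sum(values) / k
    half_width = z * tem / math.sqrt(k)
    return (mean - half_width, mean, mean + half_width)

def prob_female(x, intercept, slope):
    """One-predictor logistic model; the coefficients are hypothetical."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# Hypothetical inputs: TEM of 1.2 mm, one pelvic dimension measured twice (mm).
TEM = 1.2
repeats = [118.0, 119.0]
# Hypothetical coefficients, not from any published equation.
INTERCEPT, SLOPE = -46.0, 0.4

lower, mean, upper = ci_mean(repeats, TEM)
for label, value in (("lower limit", lower), ("mean", mean), ("upper limit", upper)):
    p = prob_female(value, INTERCEPT, SLOPE)
    print(f"{label}: {value:.2f} mm -> P(female) = {p:.3f}")
```

The spread between the probabilities at the lower and upper limits indicates how sensitive the estimate is to random measurement error. As in the text, this says nothing about systematic bias.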

Assessment of the inter-rater reliability for observational data

The situation is substantially different when dealing with observational data, that is, data that you obtain without the use of any standardized tool such as a craniometer or caliper. These data are mostly based on the subjective categorization of a trait by a rater, who usually bases the decision only on a verbal or pictorial description of predefined categories. It should be noted, however, that some variables lie at the boundary between observational and measurement data. For example, when you assess color using a standardized color swatch, it is questionable whether the data you obtain are still observational. It is also hard to decide whether such a variable is in nominal, ordinal, or interval scale. In general, observational data are in nominal or ordinal scale. The procedures described in the previous section are inappropriate for nominal and ordinal data. Therefore a different strategy must be used to assess interrater and intrarater reliability in observational data.

Imagine a situation in which two raters must decide whether a pelvis is male or female (based on descriptions of features typical for male and female pelvic bones). We let both raters assess the same sample of pelvic bones. The easiest way to assess the agreement between their ratings is simply to compute the proportion of pelvic bones on whose classification both raters agreed. Nevertheless, the percent agreement between raters has repeatedly been shown (e.g., Cohen, 1960; Hallgren, 2012; Kundel and Polansky, 2003) to be an inadequate measure of interrater and intrarater reliability. This is because even if both raters classified the samples randomly (e.g., by flipping a coin), there would be some agreement between them caused only by chance (called the expected agreement rate). What is worse, this expected agreement rate depends on the sample distribution.

Imagine that the two raters from the previous example assess pelvic bones from two mass graves. In the first grave, civilians were buried, with men and women approximately equally represented. In the second grave, however, soldiers were buried. Thus, in this grave, there were only a few women, since most of the soldiers were men. As can be seen in Table 1, the proportion of observed agreement between the two raters (po) is 0.8 (80%) in the first example (civilian grave) and 0.85 (85%) in the second example (soldiers’ grave). If we did not have the background information about the graves, we could conclude that the agreement between raters is higher in the second example and that the raters therefore performed better there. In the first example, however, the expected agreement by chance (pe) is 0.5 (50%), that is, 0.3 lower than po. In the second example, on the other hand, pe is 0.71 (71%), that is, only 0.14 lower than po. The high pe in the second example arises because the sample consists mostly of male soldiers. Moreover, considering their occupation, they were probably of more robust constitution than men randomly selected from the general population. Regardless of the causes, this result means that the improvement in interrater agreement over the expected agreement rate is higher in the first example.

Table 1 Examples of rating the sample (n = 100) by two raters.

First example (po = 0.8, pe = 0.5, κ = 0.6)

| Rater A | Observed, Rater B: M | Observed, Rater B: F | Observed total | Expected, Rater B: M | Expected, Rater B: F | Expected total |
|---|---|---|---|---|---|---|
| M | 40 | 10 | 50 | 25 | 25 | 50 |
| F | 10 | 40 | 50 | 25 | 25 | 50 |
| Total | 50 | 50 | 100 | 50 | 50 | 100 |

Second example (po = 0.85, pe = 0.71, κ = 0.483)

| Rater A | Observed, Rater B: M | Observed, Rater B: F | Observed total | Expected, Rater B: M | Expected, Rater B: F | Expected total |
|---|---|---|---|---|---|---|
| M | 75 | 10 | 85 | 68 | 17 | 85 |
| F | 5 | 10 | 15 | 12 | 3 | 15 |
| Total | 80 | 20 | 100 | 80 | 20 | 100 |

Note: M, male morphology; F, female morphology; po, observed agreement between raters; pe, expected agreement between raters; κ, Cohen’s Kappa.

For this reason, Cohen’s Kappa (κ; Cohen, 1960) or another variant of Kappa (see Hallgren, 2012 for review) is recommended for data in nominal scale instead of po. Cohen’s Kappa may be calculated as

$$\kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}, \tag{7}$$

when substituting po and pe as decimal numbers (not as percentages).

However, even Cohen’s Kappa has its weaknesses, as has been repeatedly demonstrated (e.g., Byrt et al., 1993; Di Eugenio and Glass, 2004). The first problem is called the prevalence effect. The prevalence effect causes a significant reduction in the value of Cohen’s Kappa when the raters classify cases into one category with much higher frequency than into another. In Table 2, two examples with the same value of po can be seen. In the third example the agreements are distributed symmetrically, leading to a fair Cohen’s Kappa (κ = 0.4). In the fourth example, however, all agreements occurred in the male category. This resulted in a low value of Cohen’s Kappa (κ = -0.176), even though the agreement rate was the same as in the third example.

Table 2 Examples of rating the sample (n = 100) by two raters illustrating the prevalence problem of Cohen’s Kappa.

Third example (po = 0.7, pe = 0.5, κ = 0.4)

| Rater A | Observed, Rater B: M | Observed, Rater B: F | Observed total | Expected, Rater B: M | Expected, Rater B: F | Expected total |
|---|---|---|---|---|---|---|
| M | 35 | 15 | 50 | 25 | 25 | 50 |
| F | 15 | 35 | 50 | 25 | 25 | 50 |
| Total | 50 | 50 | 100 | 50 | 50 | 100 |

Fourth example (po = 0.7, pe = 0.745, κ = -0.176)

| Rater A | Observed, Rater B: M | Observed, Rater B: F | Observed total | Expected, Rater B: M | Expected, Rater B: F | Expected total |
|---|---|---|---|---|---|---|
| M | 70 | 15 | 85 | 72.25 | 12.75 | 85 |
| F | 15 | 0 | 15 | 12.75 | 2.25 | 15 |
| Total | 85 | 15 | 100 | 85 | 15 | 100 |

Note: M, male morphology; F, female morphology; po, observed agreement between raters; pe, expected agreement between raters; κ, Cohen’s Kappa.

The second problem is called the bias effect, which causes a significant increase in the value of Cohen’s Kappa when the frequency of cases in the categories differs substantially between raters. That means, for example, that one rater tends to consider more pelvic bones as female than the other rater. In Table 3, two examples with the same value of po can be seen. In the fifth example the agreements are distributed symmetrically, leading to a low value of Cohen’s Kappa (κ = -0.4). In the sixth example, however, all disagreements occur in a single cell: every bone on which the raters disagree was considered female by rater A and male by rater B, while the agreement rate remains unchanged. This modification resulted in a higher value of Cohen’s Kappa (κ = 0.06) than in the fifth example.

Table 3 Examples of rating the sample (n = 100) by two raters illustrating the bias problem of Cohen’s Kappa.

Fifth example (po = 0.3, pe = 0.5, κ = -0.4)

| Rater A | Observed, Rater B: M | Observed, Rater B: F | Observed total | Expected, Rater B: M | Expected, Rater B: F | Expected total |
|---|---|---|---|---|---|---|
| M | 15 | 35 | 50 | 25 | 25 | 50 |
| F | 35 | 15 | 50 | 25 | 25 | 50 |
| Total | 50 | 50 | 100 | 50 | 50 | 100 |

Sixth example (po = 0.3, pe = 0.255, κ = 0.06)

| Rater A | Observed, Rater B: M | Observed, Rater B: F | Observed total | Expected, Rater B: M | Expected, Rater B: F | Expected total |
|---|---|---|---|---|---|---|
| M | 15 | 0 | 15 | 12.75 | 2.25 | 15 |
| F | 70 | 15 | 85 | 72.25 | 12.75 | 85 |
| Total | 85 | 15 | 100 | 85 | 15 | 100 |

Note: M, male morphology; F, female morphology; po, observed agreement between raters; pe, expected agreement between raters; κ, Cohen’s Kappa.

In both of these cases (i.e., when there is a prevalence or bias effect), the use of Cohen’s Kappa should be avoided; alternative variants of Kappa suitable for these cases are available (for review, see Hallgren, 2012).

Cohen’s Kappa can be used even for data in ordinal scale. In such a case the use of weighted Cohen’s Kappa (Cohen, 1968) is recommended. Imagine a situation in which two raters must assess whether some trait on the pelvis (e.g., preauricular surface) has a female, male, or intermediate form. Such a variable is in ordinal scale, with the intermediate form being more masculine than the female form but less masculine than the male form. In Table 4, you can see an example of a possible weight allocation. When calculating weighted Cohen’s Kappa, you first calculate the observed and expected proportions and then multiply each proportion by the appropriate weight (given in the corresponding cell of the weights table). The weighted proportions are then summed to give the weighted po and pe, which are substituted into Eq. (7). In the seventh example, we used a weight of 1 in the case of agreement between raters, 0.5 when one rater considered the trait to be male (or female) and the other selected the intermediate category, and 0 when one rater selected male and the other female. Using the weight 0.5, we treat the situation as a half agreement between the raters. In this way a disagreement between adjacent categories may be given a higher weight than a disagreement between more distant categories.
The use of weights means that if the differences in rating are due only to a shift of one category, the value of Cohen’s Kappa will be higher than when the ratings differ by more than one category.

Finally, it should be said that with Cohen’s Kappa we can evaluate interrater and intrarater reliability for variables in nominal or ordinal scale, but not the probability that a given specimen is rated correctly. Nor can we establish a confidence interval for the rating of a specimen, as we can for variables in interval or ratio scale. On the other hand, such a confidence interval would probably be of no use, because the probability of systematic bias during the classification of observational data is very high owing to the considerable subjectivity of such a classification. There are also other ways of establishing inter- (and intra-)rater agreement or reliability, including the intraclass correlation coefficient (ICC), Kendall’s coefficient of concordance (W), and Bland-Altman plots. A number of publications can be consulted for more detail on these approaches, including Hallgren (2012), Gisev et al. (2013), and Kottner et al. (2011).
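The values of po, pe, and κ reported for the six examples in Tables 1 to 3 can be reproduced directly from the observed counts using Eq. (7). The sketch below is illustrative only; the function name `cohen_kappa` and the dictionary layout are assumptions made for this example.

```python
def cohen_kappa(table):
    """Cohen's Kappa (Eq. 7) from a square table of counts.

    table[i][j] = number of cases rated i by rater A and j by rater B.
    Returns (po, pe, kappa).
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    # Observed agreement: share of cases on the main diagonal.
    po = sum(table[i][i] for i in range(size)) / n
    # Expected agreement: product of the marginal proportions per category.
    pe = sum(row_totals[i] * col_totals[i] for i in range(size)) / n ** 2
    return po, pe, (po - pe) / (1 - pe)

# Observed counts from Tables 1-3 (rows: rater A, columns: rater B; order M, F).
examples = {
    "first": [[40, 10], [10, 40]],
    "second": [[75, 10], [5, 10]],
    "third": [[35, 15], [15, 35]],
    "fourth": [[70, 15], [15, 0]],
    "fifth": [[15, 35], [35, 15]],
    "sixth": [[15, 0], [70, 15]],
}

for name, table in examples.items():
    po, pe, kappa = cohen_kappa(table)
    print(f"{name:>6}: po = {po:.2f}, pe = {pe:.3f}, kappa = {kappa:.3f}")
```

Comparing the third with the fourth example, and the fifth with the sixth, shows the prevalence and bias effects: po stays fixed while pe, and therefore κ, changes with the marginal totals.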

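Table 4 Example of rating the sample (n = 100) by two raters illustrating the calculation of weighted Cohen’s Kappa.

Seventh example (po = 0.82, pe = 0.5712, κ = 0.58)

Weights

| Rater A | Rater B: M | Rater B: I | Rater B: F |
|---|---|---|---|
| M | 1 | 0.5 | 0 |
| I | 0.5 | 1 | 0.5 |
| F | 0 | 0.5 | 1 |

Observed frequencies

| Rater A | Rater B: M | Rater B: I | Rater B: F | Total |
|---|---|---|---|---|
| M | 30 | 10 | 0 | 40 |
| I | 4 | 20 | 8 | 32 |
| F | 2 | 10 | 16 | 28 |
| Total | 36 | 40 | 24 | 100 |

Observed proportions

| Rater A | Rater B: M | Rater B: I | Rater B: F | Total |
|---|---|---|---|---|
| M | 0.3 | 0.1 | 0 | 0.4 |
| I | 0.04 | 0.2 | 0.08 | 0.32 |
| F | 0.02 | 0.1 | 0.16 | 0.28 |
| Total | 0.36 | 0.40 | 0.24 | 1 |

Expected proportions

| Rater A | Rater B: M | Rater B: I | Rater B: F | Total |
|---|---|---|---|---|
| M | 0.144 | 0.16 | 0.096 | 0.4 |
| I | 0.1152 | 0.128 | 0.0768 | 0.32 |
| F | 0.1008 | 0.112 | 0.0672 | 0.28 |
| Total | 0.36 | 0.40 | 0.24 | 1 |

Observed weighted proportions

| Rater A | Rater B: M | Rater B: I | Rater B: F | Total |
|---|---|---|---|---|
| M | 0.3 | 0.05 | 0 | 0.35 |
| I | 0.02 | 0.2 | 0.04 | 0.26 |
| F | 0.00 | 0.05 | 0.16 | 0.21 |
| Total | 0.32 | 0.3 | 0.20 | 0.82 |

Expected weighted proportions

| Rater A | Rater B: M | Rater B: I | Rater B: F | Total |
|---|---|---|---|---|
| M | 0.144 | 0.08 | 0 | 0.224 |
| I | 0.0576 | 0.128 | 0.0384 | 0.224 |
| F | 0 | 0.056 | 0.0672 | 0.1232 |
| Total | 0.2016 | 0.264 | 0.1056 | 0.5712 |

Note: M, male morphology; I, intermediate morphology; F, female morphology; po, observed agreement between raters; pe, expected agreement between raters; κ, Cohen’s Kappa.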
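The seventh example can be recomputed from its observed frequencies and weights alone. The following is a minimal sketch with a hypothetical function name; it multiplies each observed and expected proportion by its weight, sums the results, and substitutes the two sums into Eq. (7).

```python
def weighted_kappa(table, weights):
    """Weighted Cohen's Kappa from counts and agreement weights.

    table[i][j]   = cases rated i by rater A and j by rater B.
    weights[i][j] = agreement weight (1 = full agreement, 0 = none).
    Returns (weighted po, weighted pe, kappa).
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(table[i][j] for i in range(size)) / n for j in range(size)]
    cells = [(i, j) for i in range(size) for j in range(size)]
    # Sum of weighted observed proportions.
    po = sum(weights[i][j] * table[i][j] / n for i, j in cells)
    # Sum of weighted expected proportions (products of the marginals).
    pe = sum(weights[i][j] * row_p[i] * col_p[j] for i, j in cells)
    return po, pe, (po - pe) / (1 - pe)

# Seventh example (Table 4); rows: rater A, columns: rater B; order M, I, F.
observed = [[30, 10, 0], [4, 20, 8], [2, 10, 16]]
weights = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]

po_w, pe_w, kappa_w = weighted_kappa(observed, weights)
print(f"po = {po_w:.2f}, pe = {pe_w:.4f}, weighted kappa = {kappa_w:.2f}")
```

With weights of 1 on the diagonal and 0 elsewhere, the same function reduces to the unweighted Kappa of Eq. (7).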


References

Byrt, T., Bishop, J., Carlin, J.B., 1993. Bias, prevalence and kappa. J. Clin. Epidemiol. 46, 423–429.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.
Cohen, J., 1968. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213–220.
Di Eugenio, B., Glass, M., 2004. The kappa statistic: a second look. Comput. Linguist. 30, 95–101.
Gisev, N., Bell, J.S., Chen, T.F., 2013. Interrater agreement and interrater reliability: key concepts and applications. Res. Social Adm. Pharm. 9, 330–338.
Hallgren, K.A., 2012. Computing inter-rater reliability for observational data: an overview and tutorial. Tutor. Quant. Methods Psychol. 8, 23–34.
Kottner, J., Audige, L., Brorson, S., Donner, A., Gajewski, B.J., Hróbjartsson, A., Roberts, C., Shoukri, M., Streiner, D.L., 2011. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. J. Clin. Epidemiol. 64, 96–106.
Kundel, H.L., Polansky, M., 2003. Measurement of observer agreement. Radiology 228, 303–308.
Mueller, W.H., Martorell, R., 1988. Reliability and accuracy of measurement. In: Lohman, T.G., Roche, A.F., Martorell, R. (Eds.), Anthropometric Standardization Reference Manual. Human Kinetics Books, Champaign, pp. 83–86.
Stevens, S.S., 1946. On the theory of scales of measurement. Science 103, 677–680.
Ulijaszek, S.J., Kerr, D.A., 1999. Anthropometric measurement error and the assessment of nutritional status. Br. J. Nutr. 82, 165–177.
Ulijaszek, S.J., Lourie, J.A., 1994. Intra- and inter-observer error in anthropometric measurement. In: Ulijaszek, S.J., Mascie-Taylor, C.G.N. (Eds.), Anthropometry: The Individual and the Population. Cambridge University Press, pp. 30–55.
Velleman, P.F., Wilkinson, L., 1993. Nominal, ordinal, interval, and ratio typologies are misleading. Am. Stat. 47, 65–72.

CHAPTER

General considerations about data and selection of statistical approaches

2.1

Pascal Adalian

UMR 7268 ADES - Aix-Marseille University, CNRS, EFS, Marseille, France

“The numbers are where the scientific discussion should start, not end” (S.N. Goodman)

Introduction

The general public largely accepts the idea that forensic science is synonymous with “scientificity.” In a way, that is accurate, but to put it another way, forensic science is about “deciding to tell the truth.” Forensic specialists and, among them, forensic anthropologists strive to continuously improve their skills and aim to produce research that meets scientific standards, thus ensuring quality and consistency in their professional practice. Even so, comprehensive best practice guidelines for forensic anthropology are mostly lacking at both the national and the international level. By “comprehensive,” we mean guidelines starting from the earliest stages of setting up research (sampling, study design, methodological validation of each step of the protocol, consideration of all kinds of “errors,” and statistical uncertainties) and continuing up the chain to applied expertise (mainly related to context and application constraints, including quality control). The complexity of defining each of these steps and of specifying the best methods for forensic application is apparent to anybody who deals with forensic anthropology cases in all their diversity.

In fact, there is a bigger problem with what forensic anthropologists do. Forensic specialists do not tell the truth. They don’t have to, and they just can’t. They have to provide a scientific reading and understanding of the biological facts, respecting scientific criteria in the analysis and presentation of their findings, so that the judiciary in charge of deciding what “the truth” is can do so with full knowledge and understanding of these findings (see Fournet, 2017). As clearly stated by Christensen and colleagues, “the role of science within the judicial system is nothing novel; however, the focus has shifted to include the evaluation of methods and techniques rather than simply the expert’s interpretation of the results” (Christensen et al., 2014). In 2009, funded by the National Institute of Justice, a committee of the United States’ National Academy of Sciences published the National Research Council report Strengthening Forensic Science in the United States: A Path Forward. This report highlights that “research is needed to establish the validity of forensic methods, develop and establish quantifiable measures of the reliability and accuracy of forensic analyses, and develop quantifiable measures of uncertainty in the conclusions of forensic analyses” (Committee on Identifying the Needs of the Forensic Sciences Community, National Research Council, 2009). Following these recommendations, in this chapter we intend to remind the reader of some important concepts that need to be considered when setting up, selecting, and applying a method in forensic anthropology and, consequently, when interpreting its results.

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00002-2 © 2020 Elsevier Inc. All rights reserved.

Considerations about data

Any basic statistics reference will attest that a study sample on which data are collected must be representative of the population of interest and obtained through random selection. It should moreover be of adequate size to fulfill its purpose, for instance, allowing some inference about the general population. But this is sometimes easier said than done, and this has long been considered particularly true in forensic anthropology. In a general research context, if the project involves easily accessible data sets, a power analysis can determine the number of individuals that must be included in a sample. However, in many cases in forensic anthropology, researchers make do with what they have. In such instances, common sense has long been the most important ingredient of a correct sampling procedure, and extreme prudence is required for inference or decision-making (Madrigal, 2012). Fortunately, owing to the increasing number and sharing of data from identified skeletal collections (Aleman et al., 2012; Cardoso, 2006; Ferreira et al., 2014; Go et al., 2017; Henderson and Cardoso, 2018), as well as the better accessibility of medical imaging and reconstruction techniques for data acquisition (Stull et al., 2014), we increasingly have the opportunity to use reliable and accurate data in forensic anthropology research.

Qualitative variables (categorical, nominal, or ordinal variables)

Qualitative variables are observations of interest that classify subjects according to the type or quality of their attributes. They are not quantifiable, so no numerical significance should be attached to them, even if each category of a given variable can be coded (recorded) as a number. For nominal variables coded in this way, the order of the numbers has no significance, and it is very important to remember that these numbers carry no value information and that therefore no calculations such as means can be made. When defining the categories of a qualitative variable, one should take care that the categories are mutually exclusive (each observation is placed in one and only one category) and that all observations can be categorized, that is, that the coding system is exhaustive (Madrigal, 2012). When the categories describe anatomical traits, for instance, and a trait can be “absent,” it is important to differentiate this absence from the impossibility of observing the trait (e.g., if the bone or anatomical region concerned is missing). When the categories can be ordered from a lower to a higher rank, we talk about ordinal variables. However, the distance or interval between observations is not fixed or set. For example, in nonmetric sex assessment from the skull, the trait scoring goes from minimal expression (score = 1) to maximal expression (score = 5), without any fixed step or increase in value between two stages (Langley et al., 2018a).
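The coding requirements described above (mutually exclusive, exhaustive categories, with “absent” kept apart from “not observable”) can be illustrated with a short sketch. The trait, the category names, and the numeric codes are all hypothetical and serve only to show one possible way of recording such data.

```python
from collections import Counter
from enum import Enum

class TraitState(Enum):
    """Mutually exclusive, exhaustive codes for a hypothetical nonmetric trait."""
    ABSENT = 0         # region preserved, trait not expressed
    PRESENT = 1        # region preserved, trait expressed
    UNOBSERVABLE = 9   # region missing or damaged: not the same as ABSENT

# Hypothetical scores for six individuals.
scores = [
    TraitState.PRESENT, TraitState.ABSENT, TraitState.UNOBSERVABLE,
    TraitState.PRESENT, TraitState.ABSENT, TraitState.ABSENT,
]

counts = Counter(scores)
# Only individuals in whom the trait could be scored enter the denominator.
observable = [s for s in scores if s is not TraitState.UNOBSERVABLE]
frequency = counts[TraitState.PRESENT] / len(observable)

# The numeric codes are labels only: report counts and proportions,
# never a mean of the codes.
for state in TraitState:
    print(f"{state.name}: {counts[state]}")
print(f"trait frequency among observable cases = {frequency:.2f}")
```

Keeping the unobservable cases as their own category prevents them from being silently counted as absences when trait frequencies are computed.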

Quantitative variables

Quantitative variables are measures or counts that define an observation of interest, so they are expressed numerically, regardless of the unit of measurement. When these numeric values are discrete (i.e., with no intermediate values between them) and the distance between any two values is fixed (as opposed to ranked variables), we talk about discontinuous numeric variables. In contrast, when measurements allow a theoretically infinite number of values between two points, we talk about continuous numeric variables.

Data collection, analysis, and interpretation play a key role in the scientific process in forensic anthropology, and it is important to remind ourselves that errors may arise at each step of this process. Regarding qualitative variables, the main difficulty may be achieving consistency between observations made by two different observers (reproducibility) or made at different times by the same observer (repeatability). Indeed, since the categories in which the variables are distributed are “unmeasurable qualities,” this type of observation can be affected by some form of subjectivity, and the allocation of a category may depend on the observer’s experience or ability to distinguish between two close categories.

With quantitative variables, it is important to remember that the measurement of a continuous variable is by definition an approximation of the true value, which may be unknowable (White and Folkens, 1991). This is why it is important to quantify the error of osteometric data in forensic anthropology (Langley et al., 2018b). There is an unavoidable technological error in each measurement, which can be defined as the difference between the value measured by a given instrument and the actual (true) value. This measurement “uncertainty” remains despite proper calibration of the instrument, so any measurement yields only an estimate of the true value and has an associated uncertainty. The observer’s experience also plays a role and can influence the size of that uncertainty. However, measurement uncertainty does not imply that an error has occurred or that a mistake has been made. Nevertheless, the uncertainty should be correctly described and included in the results of the analysis. This is an extremely important point to clarify in our reports or in court (Christensen and Crowder, 2009; Christensen et al., 2014). The uncertainty is a range and represents the dispersion we expect around our measurement(s) (Bell, 2017).


CHAPTER 2.1 General considerations about data

Accuracy, precision, trueness, and reliability Measurement uncertainty is important since it is closely related to accuracy, precision, trueness, and reliability, concepts for which different authors favor different terms. Full definitions are given in international standards and guidelines of metrology (the science of measurement; Joint Committee for Guides in Metrology of the Bureau International des Poids et Mesures, 2012; International Organization for Standardization—ISO, 1994). Nevertheless, the use of these notions in scientific processes or methodological approaches can be confusing. In addition, some languages do not offer distinct words for each of these notions, which increases the risk of misunderstanding (Menditto et al., 2007). According to ISO (1994) and the Joint Committee for Guides in Metrology of the Bureau International des Poids et Mesures (2012), accuracy is a qualitative performance characteristic, expressing the closeness of agreement between a measurement result and the true value of the variable of interest; precision is the closeness of agreement between independent test results obtained under stipulated conditions and can be quantified as the standard deviation of repeated measurements on the same sample using the same method; and trueness is the closeness of agreement between the average value obtained from a large series of test results and an accepted reference value. This means that measurement accuracy is the combination of trueness and precision (Menditto et al., 2007). We can also define measurement reliability as associated with the variation in repeated measures (Stull et al., 2014). Notably, some authors propose using the term variability rather than precision to refer to the dispersion of the measurements, to avoid any confusion between precision and accuracy (Bell, 2017).
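The definitions of trueness and precision given above can be illustrated with a minimal Python sketch. It assumes that trueness is summarized as the bias (mean of repeated measurements minus an accepted reference value) and precision as the sample standard deviation of those measurements. The function name, the measurement values, and the reference length of 450.0 mm are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

def trueness_and_precision(measurements, reference_value):
    """Return (bias, sd) for repeated measurements of the same object.

    bias: mean of the measurements minus the accepted reference value
          (a small absolute bias indicates high trueness).
    sd:   sample standard deviation of the measurements
          (a small SD indicates high precision).
    """
    bias = mean(measurements) - reference_value
    sd = stdev(measurements)
    return bias, sd

# Hypothetical repeated measurements (mm) of one bone whose accepted
# reference length is assumed to be 450.0 mm.
REFERENCE_MM = 450.0
precise_but_biased = [452.0, 452.1, 451.9, 452.0, 452.0]    # like panel C of Fig. 1
unbiased_but_variable = [448.0, 452.0, 450.5, 449.5, 450.0]  # like panel B of Fig. 1

bias_c, sd_c = trueness_and_precision(precise_but_biased, REFERENCE_MM)
bias_b, sd_b = trueness_and_precision(unbiased_but_variable, REFERENCE_MM)

print(f"Series C: bias = {bias_c:.2f} mm, SD = {sd_c:.2f} mm")
print(f"Series B: bias = {bias_b:.2f} mm, SD = {sd_b:.2f} mm")
```

In this sketch neither series would count as accurate in the sense used above: the first is precise but not true, and the second is true on average but not precise.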
Let’s now use the same concepts to assess the results of a methodological process (e.g., age-at-death estimation) rather than of a single measurement. A quantitative estimate of the accuracy of a result is essential to define the degree of confidence that can be placed in the result and the reliability of the decisions based on such a result. We could consider the reliability of a result as Boolean: the decision that is made is either correct or not. Following this, the estimated reliability of the result is the percentage of correct answers, and we can set a threshold, let’s say a minimum of 95%, to assess whether the methodological process can be considered reliable. Reliability arises here from the entire methodological process and not just from the instrument used to obtain a number (Bell, 2017). Considering things this way clearly shows that the result’s reliability differs from its accuracy. A classical analogy to illustrate these notions is the image of shots on a target. In Fig. 1, A shows low precision and low trueness; B shows low precision but high trueness; C shows high precision but low trueness; and D shows high precision and high trueness, which amounts to good accuracy. These definitions pertain to both single measurements/features and final results of a method. If applied to results only, with the reliability threshold set at the third circle in Fig. 1 (dashed line), only 1 out of 20 shots in D falls outside the threshold, which corresponds to a reliability of 95%.
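The Boolean view of reliability described above can be sketched in a few lines of Python. The sketch assumes that an age estimate counts as "correct" when the known age falls inside the estimated interval. That criterion, the function names, and the data are hypothetical choices made for illustration, not a published procedure.

```python
def estimate_is_correct(known_age, lower, upper):
    """True when the known age lies inside the estimated age interval."""
    return lower <= known_age <= upper

def reliability(cases):
    """Proportion of cases (known_age, lower, upper) with a correct estimate."""
    if not cases:
        raise ValueError("at least one case is required")
    correct = sum(estimate_is_correct(*case) for case in cases)
    return correct / len(cases)

def is_reliable(cases, threshold=0.95):
    """Compare the observed reliability with a chosen threshold (95% by default)."""
    return reliability(cases) >= threshold

# Hypothetical test sample: 20 individuals of known age, 19 of whom fall
# inside the estimated interval (the analogue of 19 of 20 shots in Fig. 1D).
cases = [(10.0, 8.0, 12.0)] * 19 + [(15.0, 8.0, 12.0)]

print(f"reliability = {reliability(cases):.0%}")
print("meets the 95% threshold:", is_reliable(cases))
```

Note that this proportion says nothing about how wide the estimated intervals are, which is one reason the text distinguishes reliability from accuracy.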

FIG. 1 (A) Low precision and low trueness; (B) low precision and high trueness; (C) high precision and low trueness; and (D) high precision and high trueness. So (D) shows good accuracy, and if we set the reliability threshold within the third circle from the center (dashed line), we can consider D as reliable since only 1 out of 20 shots is outside the threshold (reliability = 95%).

In addition to these notions, we propose the notion of validity, which qualifies whether the entire scientific process has met both accuracy and reliability criteria and whether all the steps of “good science” have been considered.

Method selection and evaluation Study designs are important to consider, to appreciate their strengths or weaknesses. We assume that each expert does their best to respect all the scientific checkpoints and methodological rules, but different constraints, such as data availability, sampling method, handling of missing values, measurement accuracy and reliability, ethical considerations, the variety of statistical approaches (frequentist or Bayesian), or result presentation formats, make for a wide diversity of methods used in publications on forensic anthropology. A downside to this diversity is that many methodological, sampling, and statistical discrepancies between publications may lead to difficulties in method definition, application, and comparison of results among different studies. For instance, recent works attempting to evaluate methods currently available for subadult age estimation in forensic anthropology (Corron et al., 2018, 2019) presented analyses based on 10 descriptors covering five sampling and five statistical parameters that can be considered valid or invalid according to published methodological recommendations in forensic anthropology (Cunha et al., 2009; Konigsberg et al., 2008; Ritz-Timme et al., 2000; Rosing et al., 2007; Schmeling et al., 2007). These 10 parameters have been endorsed by the forensic anthropology community as being crucial for method selection. Therefore, knowing whether methods fulfill these parameters can help users evaluate the methods they are applying. These descriptors present no hierarchy, meaning that one valid descriptor is not “more important,” nor does it carry more “statistically significant weight,” than any other valid descriptor. The only hierarchy in this system is the binary distinction between validity and invalidity, which is the same for all descriptors. These descriptors are summarized in Table 1. Our analysis of 269 publications (from 1960 to the present day; SAMS is updated twice a year) on subadult age estimation in forensic and biological anthropology showed that the validity criteria for all 10 descriptors have not been frequently met (Fig. 2). Of course, there is some subjectivity in the definition of what is “valid” for a descriptor. For instance, we decided for this study that a “valid” sample size should include 200 or more individuals. This decision can be considered arbitrary, so this particular descriptor is arguably the most subjective of the five sampling descriptors. Sample size is often subject to availability constraints depending on various factors, such as the number of variables obtained for each individual, the age range covered by the sample, the sex distribution in each age group, or whether the method is sex-specific. However, these constraints or sample descriptors relate to the sample’s representativeness of the reference population. In addition, the power of the study depends on the underlying sample size, and both would vary depending on the desired output, so the required sample size would need to be calculated on a case-by-case basis (Schmeling et al., 2007).

Table 1 Summary of the descriptors used to evaluate the fulfillment of published recommendations for subadult age estimation in the publication of Corron et al. (2018).
Parameter types | Descriptors | Requirement to be considered as “valid”
Sampling | Age | Known
Sampling | Sex | Known
Sampling | Uniformity of age distribution | Even age distribution
Sampling | Uniformity of sex distribution | Even sex distribution
Sampling | Sample size | Over 200 individuals
Statistical | Reliability of estimated age | Equal to or higher than 95%
Statistical | Accuracy of estimated age | Known
Statistical | Standard error of estimation | Known (or specified size of prediction interval)
Statistical | Repeatability and reproducibility | Both tested and higher than 95%
Statistical | Validation of the method | Validated by cross validation and/or an independent test sample and/or an independent study

FIG. 2 Frequencies of valid and invalid descriptors for the five sampling parameters (age, age distribution, sex, sex distribution, and sample size) and the five statistical parameters (observer errors/repeatability and reproducibility, reliability, accuracy, standard error of estimation/SEE, and validation) in the corpus of publications according to recommendations. SEE, standard error of estimation; Repeat., repeatability of age indicators; Reprod., reproducibility of age indicators.

Based on the literature review and published recommendations, a free online decisional tool called subadult aging method selection (SAMS) was developed to help forensic anthropologists select and evaluate subadult age estimation methods based on the 10 standardized methodological descriptors presented in the preceding text (available at osteomics.com/SAMS). This tool provides scores for each of the publications included in a centralized database to inform the user about (a) how closely the method matches their selection criteria (the relevance score) and (b) how well the method complies with published recommendations (the validity score). To mimic the proceedings in a real case, the user can first select “anatomical region,” “bone,” and “indicator type” to filter the database (Fig. 3). For a complete overview of the query parameters, see Corron et al. (2019).
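As an illustration of how the 10 descriptors of Table 1 can be treated as unweighted valid/invalid checks, the following Python sketch evaluates one hypothetical publication. It is not the SAMS algorithm, whose scoring is described in Corron et al. (2019). The class, field names, and example values are assumptions made for this sketch, and the sample size criterion is coded as "200 or more" following the text above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MethodReport:
    """Hypothetical summary of what a publication reports about its method."""
    age_known: bool
    sex_known: bool
    even_age_distribution: bool
    even_sex_distribution: bool
    sample_size: int
    reliability: Optional[float]      # proportion of correct estimates, None if not reported
    accuracy_known: bool
    see_known: bool                   # standard error of estimation or prediction interval given
    repeatability: Optional[float]    # None if not tested
    reproducibility: Optional[float]  # None if not tested
    validated: bool

def descriptor_validity(report):
    """Map each of the 10 descriptors to True (valid) or False (invalid)."""
    observer_agreement_ok = (
        report.repeatability is not None
        and report.reproducibility is not None
        and min(report.repeatability, report.reproducibility) > 0.95
    )
    return {
        "age": report.age_known,
        "sex": report.sex_known,
        "uniformity of age distribution": report.even_age_distribution,
        "uniformity of sex distribution": report.even_sex_distribution,
        "sample size": report.sample_size >= 200,  # assumption: "200 or more"
        "reliability of estimated age": report.reliability is not None
        and report.reliability >= 0.95,
        "accuracy of estimated age": report.accuracy_known,
        "standard error of estimation": report.see_known,
        "repeatability and reproducibility": observer_agreement_ok,
        "validation of the method": report.validated,
    }

# Hypothetical publication: small sample, uneven ages, reproducibility untested.
example = MethodReport(True, True, False, True, 150, 0.96, True, True, 0.97, None, True)
validity = descriptor_validity(example)
n_valid = sum(validity.values())  # unweighted count, since descriptors have no hierarchy

print(f"{n_valid} of {len(validity)} descriptors are valid")
for name, ok in validity.items():
    print(f"  {name}: {'valid' if ok else 'invalid'}")
```

The plain count mirrors the statement that no descriptor carries more weight than another; any weighting scheme would be a further assumption.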


FIG. 3 Example of a SAMS query. This query uses several features of the three search parameters to filter the database and part of the descriptive parameters that will appear in the results.

FIG. 4 Results of the query obtained by filtering the database using the SAMS algorithm. The features of the three search parameters are presented at the top. Users can evaluate the methods using relevance (R), validity (V), and score.

Fig. 4 shows the SAMS result interface. Based on an algorithm that calculates “relevance” and “validity” values for each publication in the database, SAMS generates a composite “score,” which orders the publications starting from those that comply most with the user’s demands (relevance) and methodological recommendations (validity).

Diving deeper into considerations on method selection and interpretation in forensic anthropology Tools like SAMS not only help forensic anthropologists navigate the large pool of published methods and familiarize themselves with both older and brand-new ones but may also provide a quantitative basis for justifying in court why they have favored one method over another. However, the conclusions drawn from the methods selected with SAMS remain entirely the expert’s responsibility. To cite Aitken and Taroni (2004): “the scientist’s role is to testify as to the worth of the evidence, the role of statistics is to provide the scientist with a quantitative measure of this worth.”

Hypothesis testing and interpretation It may seem trivial to remind readers that scientific hypotheses explain observed facts in testable ways but that a hypothesis can never be proven true. It can, however, be rejected. This rejection is made knowing that we could be committing an error (a Type I error) and knowing the probability of making that error. Indeed, this is how scientific knowledge progresses: researchers accept the most likely explanation of facts while being open about the possibility that such an explanation may be proven false in the future. Frederic Santos (a statistician working at the “UMR PACEA” anthropology department in Bordeaux, France) expresses this notion clearly in his teaching materials (http://www.pacea.u-bordeaux.fr/IMG/pdf/poly_cours.pdf; in French): a statistical test must formally choose between two hypotheses, H0 (the null hypothesis) and H1, that are the negation of each other and can be summarized as a yes/no alternative. The test is based on one (or more) sample(s) of data that are assumed to follow a given distribution (often a normal distribution). The H0 hypothesis plays a particular role: the purpose of the test is to gather enough evidence within the data to demonstrate that it is false. If that is the case, H0 is rejected, and its negation H1 is accepted. Otherwise, for “lack of evidence,” H0 is not rejected, and it is stated that the data are not incompatible with this hypothesis, which does not mean that it is true! According to Daudin et al. (1999, p. 66): “H0 benefits from the presumption of innocence and will only be convicted if there is enough evidence. Otherwise, the judge (the test) will release the suspect... which is not in itself proof of innocence!”

The use of P-values A P-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the mean difference between two sample groups) would be equal to or more extreme than its observed value (Wasserstein and Lazar, 2016). In 2014 Nature published an article entitled Scientific Method: Statistical Errors (Nuzzo, 2014, p. 151), which says, “A P-value measures whether an observed result can be attributed to chance. But it cannot answer a researcher’s real question: what are the odds that a hypothesis is correct? Those odds depend on how strong the result was and, most importantly, on how plausible the hypothesis is in the first place.” This led the American Statistical Association to write a statement paper about P-values (Wasserstein and Lazar, 2016), whose main points are the following:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A P-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a P-value does not provide a good measure of evidence regarding a model or hypothesis.
Wasserstein and colleagues go even further and propose to “move to a world beyond P-values,” and to totally avoid using the term “statistically significant” (Wasserstein et al., 2019). This is also strongly supported by Amrhein et al. (2019). These publications remind us of the importance of scientific reasoning and the need to restore reflection and common sense as the highest values in science, and they confirm that numbers alone mean little, whether they are statistical results or probabilities. The validity of scientific conclusions depends on more than the statistical methods themselves. Of course, appropriately chosen techniques, properly conducted analyses, and correct interpretation of statistical results also play a key role in ensuring that conclusions are sound and that the uncertainty surrounding them is represented properly (Wasserstein and Lazar, 2016).
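The definition of a P-value given above can be made concrete with a permutation test, sketched here in Python. The "specified statistical model" in this sketch is the assumption that group labels are exchangeable (no difference between groups), and the statistical summary is the absolute difference in group means. The function names, the measurement values, and the choice of 5000 random permutations are hypothetical. The "+1" in the numerator and denominator is a common correction that keeps the Monte Carlo estimate from being exactly zero.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided P-value for the difference in means.

    Model (H0): group labels are exchangeable. The P-value is the share of
    random relabelings whose absolute mean difference is equal to or more
    extreme than the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point rounding
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical measurements (mm) from two groups of six individuals each.
group_a = [41.2, 42.5, 43.1, 44.0, 42.8, 43.6]
group_b = [45.9, 46.3, 47.1, 45.2, 46.8, 47.5]

p_value = permutation_p_value(group_a, group_b)
print(f"estimated P-value = {p_value:.4f}")
```

In line with points 2 and 5 of the statement, a small value here indicates that the data are hard to reconcile with the exchangeability model. It is neither the probability that the null hypothesis is true nor a measure of the size of the difference.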
We need to take care to bring scientific judgment about the plausibility of a hypothesis and about study limitations into our analyses (Nuzzo, 2014). To dispel the misconceptions concerning P-values, one possibility is to move toward methods that emphasize estimation over testing, such as methods using the Bayesian approach. This kind of approach may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is true (Wasserstein and Lazar, 2016). Bayes’ rule describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity, but the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions and to calculate how probabilities change as new evidence arises (Nuzzo, 2014). This approach uses previous knowledge, in the form of prior estimates, to calculate posterior estimates, which allow us to estimate the probability that a hypothesis is true. An excellent example of how we all engage in Bayesian reasoning is provided by Steadman et al. (2006), who note that the jury in a criminal case engages in Bayesian reasoning as it listens to witnesses and sees evidence, continuously updating its estimates of the posterior probability of the events discussed in the courtroom.
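The idea of continuously updating a posterior probability as evidence arrives can be sketched with Bayes' rule for a single yes/no hypothesis. In the Python sketch below, the prior of 0.5 and the three pairs of likelihoods are invented numbers used only to show the mechanics, and the function name is hypothetical. Each posterior becomes the prior for the next piece of evidence, which assumes that the pieces of evidence are conditionally independent given the hypothesis.

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a binary hypothesis H: return P(H | evidence)."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("the evidence is impossible under both hypotheses")
    return numerator / denominator

# Hypothetical likelihoods: (P(evidence | H true), P(evidence | H false)).
evidence = [
    (0.8, 0.4),  # supports H (likelihood ratio 2)
    (0.9, 0.3),  # supports H more strongly (likelihood ratio 3)
    (0.2, 0.6),  # speaks against H (likelihood ratio 1/3)
]

belief = 0.5  # assumed prior probability of H before any evidence is heard
history = [belief]
for p_if_true, p_if_false in evidence:
    belief = update(belief, p_if_true, p_if_false)  # posterior becomes the new prior
    history.append(belief)

print(" -> ".join(f"{b:.3f}" for b in history))
```

With these numbers the belief rises after the first two items and falls after the third. The final value does not depend on the order in which the evidence is presented, because the likelihood ratios simply multiply.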

Errors and their meaning in statistics and methodological approaches As expert witnesses in criminal proceedings, forensic anthropologists must be aware of the discipline’s methodological variation and limitations (Grivas and Komar, 2008). Beyond considering data and method selection, they also have to develop strategies to minimize the risk of error through quality assurance (i.e., proper training, method validation, accreditation, and certification). Quality assurance in forensic anthropology can be established through validation studies of analytical methods to determine method reliability (precision and accuracy) and through the development of professional standards in the form of best practice protocols (Christensen and Crowder, 2009). One important limitation of our discipline’s methodological approach is the management and expression of “errors,” which can be confused with uncertainty by nonscientific professionals in charge of legal decisions (Christensen et al., 2014).

Conclusion The NAS report (Committee on Identifying the Needs of the Forensic Sciences Community—National Research Council, 2009, p. 7) notes that “The level of scientific development and evaluation varies substantially among the forensic science disciplines” and (p. 182) that “Wide variability exists across forensic science disciplines with regard to techniques, methodologies, reliability, error rates, reporting, underlying research, general acceptability, and the educational background of its practitioners.” The report then calls for research to address the issues of accuracy, reliability, and validity in the forensic science disciplines. In particular, research needs to include the following:
(a) Conducting studies establishing the scientific basis for demonstrating the validity of forensic methods.
(b) Developing and establishing quantifiable measures of reliability and accuracy of forensic analyses. The corresponding studies should reflect the actual practice as closely as possible using realistic case scenarios and should develop estimates of performance measures averaged across a representative sample of forensic scientists and laboratories.
(c) Developing quantifiable measures of uncertainty in the conclusions of forensic analyses.
(d) Developing automated techniques capable of enhancing forensic technologies.
(e) Conducting studies on human observer bias and the sources of human error and contextual bias in forensic examinations.


Moreover, the NAS report stresses that research in forensic sciences should be peer reviewed and published in respected scientific journals. On this note, many recent publications remind us to keep common sense in our scientific approach and in the way we interpret the results of our studies and our cases. Addressing the uncertainty associated with data collection, selecting a method with a well-argued approach, and thinking about the limitations of numbers and P-values associated with hypothesis testing are undoubtedly steps toward quality assurance and the establishment of best practice protocols in our discipline. As stated by Kirk and Kingston (1964, p. 435), scientists and jurists have to “abandon the idea of absolute certainty” to approach the identification process in a fully objective manner. “If it can be accepted that nothing is absolutely certain, then it becomes logical to determine which degree of confidence may be assigned to a particular belief.” This will help us think about how to design fit-for-purpose studies and to state accuracy, reliability, and validity in a manner that is not only statistically significant but also forensically meaningful.

References Aitken, C.G.G., Taroni, F., 2004. Statistics and the Evaluation of Evidence for Forensic Scientists. Wiley. Aleman, I., Irurita, J., Valencia, A.R., Martinez, A., Lopez-Lazaro, S., Viciano, J., Botella, M.C., 2012. Brief communication: The Granada osteological collection of identified infants and young children. Am. J. Phys. Anthropol. 149, 606–610. Amrhein, V., Greenland, S., McShane, B., 2019. Scientists rise up against statistical significance. Nature 567, 305–307. Bell, S., 2017. Measurement Uncertainty in Forensic Science: A Practical Guide. CRC Press, Taylor & Francis Group, Boca Raton. Cardoso, H.F., 2006. Brief communication: The collection of identified human skeletons housed at the Bocage museum (National Museum of Natural History), Lisbon, Portugal. Am. J. Phys. Anthropol. 129, 173–176. Christensen, A.M., Crowder, C.M., 2009. Evidentiary standards for forensic anthropology. J. Forensic Sci. 54, 1211–1216. Christensen, A.M., Crowder, C.M., Ousley, S.D., Houck, M.M., 2014. Error and its meaning in forensic science. J. Forensic Sci. 59, 123–126. Committee on Identifying the Needs of the Forensic Sciences Community—National Research Council, 2009. Strengthening Forensic Science in the United States: A Path Forward. The National Academies Press, Washington, DC. Corron, L., Marchal, F., Condemi, S., Adalian, P., 2018. A critical review of sub-adult age estimation in biological anthropology: Do methods comply with published recommendations? Forensic Science International 288, 328.e1–328.e9. Corron, L., Adalian, P., Condemi, S., Marchal, F., Navega, D., 2019. Sub-adult aging method selection (SAMS): A decisional tool for selecting and evaluating sub-adult age estimation methods based on standardized methodological parameters. Forensic Sci Int. 109897. Cunha, E., Baccino, E., Martrille, L., Ramsthaler, F., Prieto, J., Schuliar, Y., Lynnerup, N., Cattaneo, C., 2009. The problem of aging human remains and living individuals: a review. Forensic Sci. Int. 
193, 1–13.

Daudin, J.-J., Robin, S., Vuillet, C., 1999. Statistique inferentielle: idees, demarches, exemples. Societe française de statistique, Presses universitaires de Rennes, [S.l.], Rennes. Ferreira, M.T., Vicente, R., Navega, D., Goncalves, D., Curate, F., Cunha, E., 2014. A new forensic collection housed at the University of Coimbra, Portugal: The 21st century identified skeletal collection. Forensic Sci. Int. 245, 202.e1–202.e5. Fournet, C., 2017. Forensic truth? Scientific evidence in international criminal justice (http://humanityjournal.org/blog/forensic-truth/). Go, M.C., Lee, A.B., Santos, J.A.D., Vesagas, N.M.C., Crozier, R., 2017. A newly assembled human skeletal reference collection of modern and identified Filipinos. Forensic Sci. Int. 271, 128.e1–128.e5. Grivas, C.R., Komar, D.A., 2008. Kumho, Daubert, and the nature of scientific inquiry: Implications for forensic anthropology. J. Forensic Sci. 53, 771–776. Henderson, C.Y., Cardoso, F.A., 2018. Identified Skeletal Collections: The Testing Ground of Anthropology? Oxford, Archaeopress Publishing Ltd, Summertown. International Organization for Standardization—ISO, 1994. Accuracy (Trueness and Precision) of Measurement Methods and Results – Part 1: General Principles and Definitions. International Organization for Standardization. Joint Committee for Guides in Metrology of the Bureau International des Poids et Mesures, 2012. International vocabulary of metrology—Basic and general concepts and associated terms (VIM). Kirk, P.L., Kingston, C.R., 1964. Evidence evaluation and problems in general criminalistics. J. Forensic Sci. 9, 434–444. Konigsberg, L.W., Herrmann, N.P., Wescott, D.J., Kimmerle, E.H., 2008. Estimation and evidence in forensic anthropology: age-at-death. J. Forensic Sci. 53, 541–557. Langley, N.R., Dudzik, B., Cloutier, A., 2018a. A decision tree for nonmetric sex assessment from the skull. J. Forensic Sci. 63, 31–37.
Langley, N.R., Meadows Jantz, L., McNulty, S., Maijanen, H., Ousley, S.D., Jantz, R.L., 2018b. Error quantification of osteometric data in forensic anthropology. Forensic Sci. Int. 287, 183–189. Madrigal, L., 2012. Statistics Anthropology. In: Biological Anthropology and Primatology, second ed. Cambridge University Press, The Edinburgh Building, Cambridge. Menditto, A., Patriarca, M., Magnusson, B., 2007. Understanding the meaning of accuracy, trueness and precision. Accred. Qual. Assur. 12, 45–47. Nuzzo, R., 2014. Scientific method: statistical errors. Nature 506, 150–152. Ritz-Timme, S., Cattaneo, C., Collins, M.J., Waite, E.R., Schutz, H.W., Kaatsch, H.J., Borrman, H.I., 2000. Age estimation: the state of the art in relation to the specific demands of forensic practise. Int. J. Legal Med. 113, 129–136. Rosing, F.W., Graw, M., Marre, B., Ritz-Timme, S., Rothschild, M.A., Rotzscher, K., Schmeling, A., Schroder, I., Geserick, G., 2007. Recommendations for the forensic diagnosis of sex and age from skeletons. Homo 58, 75–89. Schmeling, A., Geserick, G., Reisinger, W., Olze, A., 2007. Age estimation. Forensic Sci. Int. 165, 178–181. Steadman, D.W., Adams, B.J., Konigsberg, L.W., 2006. Statistical basis for positive identification in forensic anthropology. Am. J. Phys. Anthropol. 131, 15–26. Stull, K.E., Tise, M.L., Ali, Z., Fowler, D.R., 2014. Accuracy and reliability of measurements obtained from computed tomography 3D volume rendered images. Forensic Sci. Int. 238, 133–140. Wasserstein, R.L., Lazar, N.A., 2016. The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70, 129–131.


Wasserstein, R.L., Schirm, A.L., Lazar, N.A., 2019. Moving to a world beyond “p < 0.05”. Am. Stat. 73, 1–19. White, T.D., Folkens, P.A., 1991. Human osteology. Academic Press, San Diego.

Recommended reading Aitken, C.G.G., Taroni, F., 2004. Statistics and the Evaluation of Evidence for Forensic Scientists. Wiley. Bell, S., 2017. Measurement Uncertainty in Forensic Science: A Practical Guide. CRC Press, Taylor & Francis Group, Boca Raton. Madrigal, L., 2012. Statistics anthropology. In: Biological anthropology and primatology, second ed. Cambridge University Press, The Edinburgh Building, Cambridge. Wasserstein, R.L., Schirm, A.L., Lazar, N.A., 2019. Moving to a world beyond p < 0.05. American Statistician 73, 1–19.

CHAPTER 2.2

Probability distributions, hypothesis testing, and analysis

Zuzana Obertová a,b and Alistair Stewart c
a Forensic Anthropologist, Visual Identification of Persons, Zürich Forensic Science Institute, Zürich, Switzerland
b Centre for Forensic Anthropology, School of Social Sciences, The University of Western Australia, Australia
c Retired, School of Population Health, The University of Auckland, Auckland, New Zealand

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00011-3 © 2020 Elsevier Inc. All rights reserved.

Introduction Probability and statistics are overlapping but conceptually distinct notions. Probability can be seen as a special aspect of logical reasoning, and it is useful for understanding and interpreting forensic (and statistical) evidence. Statistics are based on the collection and analysis of empirical data. Probabilistic reasoning is deductive, since it argues from general assumptions to particular outcomes. Statistical reasoning is inductive, since it argues from empirical occurrences (observed events) and summarizes them into generalizations about the population. Unless we can include the entire population of interest in our study, there will always be some uncertainty in the results. The most useful measure of uncertainty is probability, which measures the uncertainty on a scale from 0 or 0% (the event/feature is impossible and is certain not to occur) to 1 or 100% (the event/feature is certain to occur). But there are also other ways to express uncertainty, such as confidence intervals. There are some basic rules regarding the use of probability of occurrence of events/features. If two events (designated as A and B) are mutually exclusive (i.e., cannot co-occur), the probability of event A or B is calculated as $P(A \text{ or } B) = P(A) + P(B)$. If the events can co-occur, the probability of event A or B is $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$. For the co-occurrence of (conditionally) independent events, the probabilities of each event can be multiplied: $P(A \text{ and } B) = P(A) \times P(B)$. Also, especially in forensic sciences, we are often interested in the probability of occurrence of event A, given that B has already occurred (A conditional on B), which is calculated using the Bayes formula $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.
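The basic probability rules listed in the Introduction can be checked by direct enumeration. The following Python sketch uses two fair dice, a hypothetical example chosen only because all 36 outcomes can be counted exactly. The event definitions and function names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (die1, die2) of rolling two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on (die1, die2)."""
    favorable = sum(1 for outcome in OUTCOMES if event(outcome))
    return Fraction(favorable, len(OUTCOMES))

def first_is_one(o):
    return o[0] == 1

def first_is_two(o):
    return o[0] == 2

def first_is_six(o):        # event A
    return o[0] == 6

def second_is_six(o):       # event B
    return o[1] == 6

def sum_at_least_ten(o):    # event S
    return o[0] + o[1] >= 10

# Mutually exclusive events: P(A or B) = P(A) + P(B)
p_exclusive = prob(lambda o: first_is_one(o) or first_is_two(o))
print(p_exclusive, "=", prob(first_is_one) + prob(first_is_two))  # 1/3 = 1/3

# Events that can co-occur: P(A or B) = P(A) + P(B) - P(A and B)
p_a_and_b = prob(lambda o: first_is_six(o) and second_is_six(o))
p_a_or_b = prob(lambda o: first_is_six(o) or second_is_six(o))
print(p_a_or_b, "=", prob(first_is_six) + prob(second_is_six) - p_a_and_b)  # 11/36 = 11/36

# Independent events: P(A and B) = P(A) x P(B)
print(p_a_and_b, "=", prob(first_is_six) * prob(second_is_six))  # 1/36 = 1/36

# Bayes formula: P(A | S) = P(S | A) P(A) / P(S)
p_a_and_s = prob(lambda o: first_is_six(o) and sum_at_least_ten(o))
p_s_given_a = p_a_and_s / prob(first_is_six)
p_a_given_s_bayes = p_s_given_a * prob(first_is_six) / prob(sum_at_least_ten)
p_a_given_s_direct = p_a_and_s / prob(sum_at_least_ten)
print(p_a_given_s_bayes, "=", p_a_given_s_direct)  # 1/2 = 1/2
```

Exact fractions are used instead of floating-point numbers so that both sides of each rule can be compared without rounding.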


CHAPTER 2.2 Probability distributions, hypothesis testing

Bayesian versus frequentist analysis Frequentist analysis has long been the standard approach used by forensic anthropologists and pathologists. The approach is based on the relative frequency definition of probability: the probability that a particular event occurs is defined as the relative frequency of the number of occurrences of the event compared with the total number of occurrences of all possible events within repeated sets of observations conducted under identical conditions. Bayesian analysis and the Bayesian way of thinking are now increasingly used, and for good reason. Bayesian analysis answers questions based on the distribution of population parameters given the observed data sample. By using the Bayes theorem, Bayesian modeling describes the posterior probability distribution of one or more parameters of interest, which is calculated from the prior (before observing the data) distribution of the parameter(s) and the likelihood function of the observed data given the parameter(s). In contrast, frequentist analysis answers questions based on the distribution of data obtained from repeated hypothetical samples, which are assumed to be reflective of the population parameters. This is not to say that frequentist statistics have lost their utility, but researchers now need (or get) to choose the approach that is better suited to their specific research question. If the question is what the probability is that the parameter in question belongs to some prespecified interval, the Bayesian framework will be the right choice, because such a probability cannot be estimated using frequentist statistics. While frequentist analysis is data driven and strongly depends on whether the statistical assumptions are met, Bayesian analysis combines the observed data with existing information/knowledge about the model parameters, which can make estimates more robust. The differences between frequentist and Bayesian analysis are summarized in Table 1.
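The question "what is the probability that the parameter belongs to some prespecified interval?" can be illustrated with a small Bayesian calculation. The Python sketch below estimates a population proportion from binomial data, assuming a uniform prior and using a simple grid approximation of the posterior rather than MCMC. The data (a trait seen in 12 of 40 individuals), the interval from 0.2 to 0.4, the prior, and the function name are all hypothetical.

```python
def posterior_interval_probability(successes, trials, lower, upper, grid_size=10_000):
    """Approximate P(lower <= theta <= upper | data) for a binomial proportion.

    Assumptions of this sketch: uniform prior on theta, binomial likelihood,
    posterior evaluated at the midpoints of an evenly spaced grid on (0, 1).
    """
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Unnormalized posterior = prior (constant) x likelihood theta^k (1 - theta)^(n - k)
    weights = [t ** successes * (1.0 - t) ** (trials - successes) for t in thetas]
    total = sum(weights)
    inside = sum(w for t, w in zip(thetas, weights) if lower <= t <= upper)
    return inside / total

# Hypothetical data: a skeletal trait observed in 12 of 40 individuals.
p_inside = posterior_interval_probability(12, 40, 0.2, 0.4)
print(f"P(0.2 <= theta <= 0.4 | data) is approximately {p_inside:.2f}")

# Without any data the posterior equals the uniform prior, so the same
# interval receives a probability close to its width, 0.2.
p_prior_only = posterior_interval_probability(0, 0, 0.2, 0.4)
print(f"P(0.2 <= theta <= 0.4) under the prior alone is approximately {p_prior_only:.2f}")
```

A different prior would change the result, which is the subjectivity that the Bayesian framework makes explicit. With more data the influence of the prior diminishes.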
Frequentist inference is the process of making inductive generalizations from a sample to the population. It is based on sampling distributions of sample estimators of population parameters (e.g., means, standard errors, or confidence intervals). The actual sampling distributions are rarely known but are usually approximated by a large-sample normal distribution. In hypothesis testing, an assumption about a specific population parameter is evaluated based on the test statistic (the sample equivalent of the population parameter in question) computed from a sample, which determines whether the null hypothesis is to be rejected. Since both the null and the alternative hypotheses are related to the population parameter, the conclusions of hypothesis testing are interpreted in relation to the population, although the evidence came from a sample. A 95% confidence interval is constructed so that, across repeated samples, about 95% of such intervals would contain the true population value. What we cannot say is that the probability of the true population value lying within the upper and lower limits of one particular confidence interval is 95%. To obtain an interval that can be interpreted as a probabilistic range for the parameter of interest, the Bayesian approach needs to be applied. Bayesian inference is based on the posterior distribution of population parameters and provides summaries of this distribution including posterior estimates (such as means), their Markov chain Monte Carlo (MCMC) standard errors, and credible

Bayesian versus frequentist analysis

Table 1 Frequentist versus Bayesian analysis. Model parameters Data sample Parameter value Confidence/ credible interval

Sample size

Hypothesis testing

Inference

Summary statement

Frequentist analysis

Bayesian analysis

Unknown but fixed (plus constant across repeated samples) Random, repeatable

Random Fixed

Approximated by estimators

Parameter distribution provided

Confidence interval (CI): the probability that the CI contains the true population value is either zero or one (but we do not know which); in 95% CI, 95 out of 100 confidence intervals will include the true population parameter Sample size relevant for choice of tests, small sample size may be limitation (assumption of normal distribution may not hold) Uses a prespecified significance level, which determines whether to accept or reject the null hypothesis based on the observed data assuming that the null hypothesis is true; the decision is based on the P-value computed from the observed data; the P-value is NOT the probability of the null hypothesis; the testing answers the question “How likely are the observed data given that the null hypothesis is true?” (The null hypothesis is considered true unless proven otherwise.) Based on the sampling distribution of the data or of the data characteristics; relies on a variety of methods, each specific for a given statistical problem

Credible interval: the probability that the population parameter lies in, for example, 95% credibility interval, is 95%

Does not allow probability statements about unknown population parameters

Flexible in accommodating different sample sizes

The probability of any hypothesis of interest is computed, answering the question “How likely is the null (or any) hypothesis given the observed data?” (Two or more hypotheses are tested and can be accepted or rejected.)

Based on the estimation of posterior distribution of parameters given the observed data and the prior distribution of parameters; relies on a single rule (the Bayes theorem) ¼ universal Allows probability statements about unknown population parameters

intervals. Even though actual posterior distributions are known only in some cases, general posterior distributions can be estimated through MCMC sampling without any large-sample approximation.
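The kind of probability statement that only the Bayesian framework allows can be illustrated with a minimal sketch. The Python code below approximates the posterior distribution of a proportion on a grid and sums the posterior probability over a prespecified interval. The data (2 affected individuals among 100), the uniform prior, and the function names are assumptions made for illustration only; for realistic models the posterior would usually be estimated with dedicated software (e.g., by MCMC) rather than on a grid.

```python
def posterior_grid(successes, n, prior=lambda theta: 1.0, points=10001):
    """Approximate the posterior of a proportion theta on an evenly spaced grid.

    The posterior is proportional to prior(theta) * likelihood(data | theta).
    """
    grid = [i / (points - 1) for i in range(points)]
    unnormalized = [
        prior(t) * t ** successes * (1.0 - t) ** (n - successes) for t in grid
    ]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


def interval_probability(grid, weights, lower, upper):
    """Posterior probability that theta lies in [lower, upper]."""
    return sum(w for t, w in zip(grid, weights) if lower <= t <= upper)


# Hypothetical data: 2 of 100 individuals show the trait; uniform prior on theta.
grid, weights = posterior_grid(successes=2, n=100)
posterior_mean = sum(t * w for t, w in zip(grid, weights))
p_below_5_percent = interval_probability(grid, weights, 0.0, 0.05)

print(f"Posterior mean of theta: {posterior_mean:.4f}")
print(f"P(theta <= 0.05 | data): {p_below_5_percent:.3f}")
```

A different prior (e.g., one reflecting earlier studies) can be passed through the `prior` argument, which is how existing knowledge enters the calculation.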


CHAPTER 2.2 Probability distributions, hypothesis testing

Bayesian analysis is a powerful approach for statistical modeling, interpretation of results, and prediction of data. It is more comprehensive and flexible than frequentist analysis, but, on the other hand, the methods for simulating Bayesian models are often computationally more demanding. Notably, statistical inference largely depends on the appropriateness of the study design and all related procedures, including data collection strategies. Moreover, correct inference can be made only if an appropriate (hypothesis) test is selected. Bayesian inference is now considered to be the most appropriate approach to data interpretation and decision-making in forensic sciences, but the frequentist approach is still commonly used in population studies in forensic anthropology. Although this approach has been used for several decades, there still seem to be gaps in the understanding of its statistical background. Therefore the subsequent sections briefly summarize the basic rules of frequentist statistical testing and modeling.

Statistical testing and modeling

As described in Chapter 1.1, before we can proceed to statistical testing and modeling, we need to pose the research and the null hypothesis, determine the sample size and the type I and type II error rates, and also decide which test is appropriate for our data type, data (probability) distribution, and study design. Following data collection and before any statistical testing and modeling are performed, exploratory data analysis (EDA) can give us an idea about some of the characteristics of the collected data, which helps in selecting the appropriate method for analysis. EDA is based on the visualization of data and thus helps detect the underlying data distribution, potential outliers (including measurement errors), or simple trends. Apart from depicting the data distribution pattern, EDA can also indicate whether a data transformation (e.g., square root or logarithmic) is needed to normalize the data distribution. More information on how to use EDA can be found in Chapter 3.1. The selection of the statistical method largely depends on the data distribution (usually normal versus other) and, of course, on the research hypothesis, given that the statistical method needs to be able to test what the researcher wants to test.

Probability distributions

If we collect data on some characteristic for a sample of individuals, a probability distribution is used to describe the possible values of this characteristic and how likely each value is to occur. Often, we assume that certain characteristics, for example, human stature or weight, conform to the so-called normal distribution. The normal distribution is described by a bell-shaped or Gaussian curve, with the area under the curve equal to 1. It can be fully described by two parameters: the mean (the center of the distribution, with all other values symmetrically distributed around it) and the spread, for example, the standard deviation (SD). The normal distribution is common in statistics because, according to the central limit theorem, the sampling distribution of the mean tends toward a normal distribution as the sample size increases, whatever the distribution of the underlying data (provided its variance is finite). There are a number of ways to test for normality, including graphical methods, such as Q-Q plots, and numerical tests, such as the Shapiro-Wilk test or tests for skewness/kurtosis. A mathematical property of the normal distribution is that 95% of the values lie between mean − 1.96 × SD and mean + 1.96 × SD (or, roughly speaking, within two standard deviations of the mean), and 99.7% of the values lie within three standard deviations (Fig. 1).

For continuous variables, if we have the estimated mean and the SD of the population, we can calculate the probability associated with any value by using the Z tables (or statistical software), with $Z = (\text{value} - \text{mean})/\text{SD}$. However, we need to be aware that this calculation will answer the question “What is the probability that (for example) a randomly selected person has a characteristic (such as stature) greater than the value of interest?” and not “What is the probability that (for example) a randomly selected person has a characteristic (such as stature) of that particular value of interest?” This is because for continuous variables, there is theoretically an infinite number of possible values, so the probability of any particular value is zero. However, we can still calculate the probability density at the value of interest by using the probability density function (pdf), and by integrating the pdf over a specific interval, we can calculate the probability that the characteristic lies within that interval. In addition, by using the cumulative distribution function, the probability that the characteristic is no larger than a given value can be established.
For discrete variables, it is possible to assign a probability to each outcome (or value of interest) by using the probability mass function. The cumulative distribution function for discrete variables gives the probability that the value of interest is less than or equal to a given value.
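The calculations described for a continuous, normally distributed characteristic can be sketched with the `NormalDist` class from the Python standard library. The mean of 170 cm and SD of 7 cm are hypothetical values chosen only for illustration, not estimates from any reference population.

```python
from statistics import NormalDist

# Hypothetical population: stature with mean 170 cm and SD 7 cm (assumed values).
stature = NormalDist(mu=170.0, sigma=7.0)
value = 180.0

# Z score of the value of interest: Z = (value - mean) / SD.
z = (value - stature.mean) / stature.stdev

# P(stature > value): the upper tail beyond the value of interest.
p_greater = 1.0 - stature.cdf(value)

# P(160 <= stature <= 180): the pdf integrated over an interval,
# obtained as a difference of two cumulative probabilities.
p_between = stature.cdf(180.0) - stature.cdf(160.0)

# Probability density at the value (a density, not a probability).
density_at_value = stature.pdf(value)

# Share of the distribution within 1.96 SD and within 3 SD of the mean.
within_1_96_sd = stature.cdf(170.0 + 1.96 * 7.0) - stature.cdf(170.0 - 1.96 * 7.0)
within_3_sd = stature.cdf(170.0 + 3 * 7.0) - stature.cdf(170.0 - 3 * 7.0)

print(f"Z = {z:.3f}, P(X > {value}) = {p_greater:.4f}")
print(f"P(160 <= X <= 180) = {p_between:.4f}")
print(f"Within 1.96 SD: {within_1_96_sd:.4f}; within 3 SD: {within_3_sd:.4f}")
```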

FIG. 1 Bell-shaped curve representing normal distribution. (Credit: Zuzana Obertová.)


Apart from the normal distribution, other commonly encountered probability distributions in forensic anthropology are the binomial distribution, the Poisson distribution, the chi-square distribution, and the F-distribution. The binomial distribution describes binary data in a finite sample (e.g., the probability of two people having a button osteoma on the frontal bone in a sample of 100 residents of a retirement home). The Poisson distribution describes discrete quantitative data (counts), for which the population size is large and the probability of an individual event occurring in a fixed time interval is small (e.g., adverse events, such as fractures or diseases in a population). A mathematical property of the Poisson distribution is that the mean equals the variance, so as the mean increases, the variance does as well. The chi-square distribution is used in tests on nominal variables, and its shape is defined by the number of degrees of freedom (for a 2 × 2 table, degrees of freedom [df] = 1, while for an r × c table, df = (r − 1) × (c − 1)). The F-distribution is used in ANOVA and describes the ratio of two chi-square-distributed random variables, each divided by its degrees of freedom.
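As a brief sketch of the binomial and Poisson distributions, the code below evaluates their probability mass functions directly from the defining formulas. The prevalence of 3% for the button osteoma example and the Poisson mean of 3 events are hypothetical values, assumed here only to make the calculation concrete.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k 'successes' in n independent trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return exp(-lam) * lam ** k / factorial(k)


# Hypothetical prevalence of the trait in the population (assumed value).
prevalence = 0.03
p_exactly_two = binomial_pmf(2, 100, prevalence)
p_at_most_two = sum(binomial_pmf(k, 100, prevalence) for k in range(3))

# Poisson: mean and variance computed from the pmf coincide (both equal lam).
lam = 3.0
support = range(60)  # the remaining tail beyond 59 events is negligible here
poisson_mean = sum(k * poisson_pmf(k, lam) for k in support)
poisson_variance = sum((k - poisson_mean) ** 2 * poisson_pmf(k, lam) for k in support)

print(f"P(exactly 2 of 100) = {p_exactly_two:.4f}")
print(f"P(at most 2 of 100) = {p_at_most_two:.4f}")
print(f"Poisson mean = {poisson_mean:.4f}, variance = {poisson_variance:.4f}")
```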

Parametric and nonparametric tests

When choosing a test for our hypothesis, we need to know what type of (outcome) data we have and whether these data are independent. Matched or highly correlated data of two or more data sets collected on the same individual are not independent and therefore have to be tested differently from independent data. (Remember that the aim of matching samples is to reduce variability among the subjects/items to increase the statistical power of the study.) We also need to know whether the data are normally distributed or not. The main difference between parametric and nonparametric tests is that parametric tests require that the data are continuous and normally distributed, while nonparametric tests can be used for a variety of probability distributions and all types of variables (nominal, ordinal, or discrete variables, or continuous variables that are not normally distributed). Parametric tests utilize the mean and variance for comparisons, while nonparametric tests typically assign ranks to the data and compare these ranks (and thus relate to the median), without accounting for the actual size of the differences between sample values. Since nonparametric tests are more robust (they are based on weaker assumptions) than parametric tests and are also valid for normally distributed data, the question may arise why use parametric tests at all. The answer is that parametric tests have more statistical power than nonparametric tests when their assumptions are met, so with parametric tests we are more likely to detect significant differences when they actually exist. The parametric and nonparametric tests are summarized in Table 2 by variable type and the number and type of samples.

Table 2 Parametric (marked *) and nonparametric tests by (outcome) variable type and number and type of samples.

| Variable type | Two independent samples | Three and more independent samples | Two paired/matched samples |
|---|---|---|---|
| Qualitative nominal | Chi-square test; Fisher’s exact test | Chi-square test | McNemar’s test |
| Qualitative ordinal | Wilcoxon (rank sum)-Mann-Whitney U test | Kruskal-Wallis test | Wilcoxon signed-rank test |
| Quantitative discrete | Wilcoxon (rank sum)-Mann-Whitney U test | Kruskal-Wallis test | Wilcoxon signed-rank test |
| Quantitative continuous, not normally distributed | Wilcoxon (rank sum)-Mann-Whitney U test; Kolmogorov-Smirnov test | Kruskal-Wallis test | Wilcoxon signed-rank test |
| Quantitative continuous, normally distributed | t-Test* | Analysis of variance (ANOVA)* | Paired t-test* |

In forensic anthropology, we often work with nominal variables or with ordinal and discrete variables reduced to the nominal scale by creating contingency (2 × 2) tables. Therefore one of the most commonly used tests is the chi-square test. This test can answer the question whether there is an association between two nominal variables by generating expected frequencies and comparing the observed frequencies to them, or it can be used as the so-called “goodness of fit” test, which determines how closely the data from the samples match the expected results as a function of the chi-square distribution. In relation to the chi-square test, Fisher’s exact test should be mentioned, which can be used as an alternative to the chi-square test for samples of small size, in particular when the expected values in any cells are below 5 or 10. The test is called exact because it returns an exact probability value: there are only a limited number of possible 2 × 2 tables with the same marginal totals, so the probabilities of all tables that are at least as extreme (i.e., against the null hypothesis) as the observed one can be enumerated and summed to give the exact P-value.

For continuous variables, the most commonly used test is Student’s t-test, which assumes that the data are normally distributed, not matched, and have equal variances. The last assumption is particularly important when the two samples differ in size. If the variances differ, an alternative test, such as Welch’s t-test, can be used.

Most of the tests can be calculated as either two- or one-sided versions. However, the McNemar test (for two matched samples of nominal data), for instance, only includes two-sided hypothesis testing. Although one-sided tests are less common in general, they have their place in forensic anthropology when testing research hypotheses such as “the maximum femur length is greater in males than in females” (the null hypothesis would be “the maximum femur length is not greater in males than in females”).
This is a valid hypothesis, since we have strong prior knowledge that males are usually taller and have longer legs than females, and therefore the direction of the effect can be established within the hypothesis. With the one-sided hypothesis test, we gain more power to detect an effect in the specified direction, at the cost of losing the ability to detect a difference in the opposite direction. The two-sided version of the research hypothesis can be phrased as “the maximum femur length is different between males and females” (with a “no difference” null hypothesis).
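The enumeration behind Fisher’s exact test, and the difference between a two-sided and a one-sided P-value, can be sketched as follows. The 2 × 2 counts are hypothetical, and the two-sided rule used here (summing the probabilities of all tables that are no more probable than the observed one) is one common convention; statistical packages may offer others.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Exact P-values for the 2 x 2 table [[a, b], [c, d]] with fixed margins.

    Returns (two_sided, one_sided_greater), where 'greater' refers to a
    top-left count at least as large as the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    probabilities = {x: table_probability(x) for x in range(lowest, highest + 1)}
    observed = probabilities[a]

    # Small tolerance so that tables tied with the observed one are included.
    two_sided = sum(p for p in probabilities.values() if p <= observed * (1 + 1e-9))
    one_sided_greater = sum(p for x, p in probabilities.items() if x >= a)
    return two_sided, one_sided_greater


# Hypothetical counts: trait present/absent (columns) in two groups (rows).
two_sided, one_sided = fisher_exact(3, 1, 1, 3)
print(f"Two-sided P = {two_sided:.4f}, one-sided P = {one_sided:.4f}")
```

With these counts the one-sided P-value is half the two-sided one, which illustrates the gain in power when the direction of the effect is specified in advance.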

Measures of strength of association

Notably, the results of hypothesis testing do not convey anything about the strength of the association between variables (outcomes and predictors). The magnitude of the association can be assessed by the so-called measures of strength of association, which include the correlation coefficient, the odds ratio, and the risk ratio. The measures of strength of association depend on the type of variable and the study design used. For example, for a case-control study, the odds ratio for the groups with the presence and the absence of the outcome can be calculated.

Correlation coefficients (r) are calculated to assess how two quantitative variables change in relation to each other. If one variable increases as the other increases, the correlation coefficient is positive (+1 indicates a perfect positive correlation). If there is no relationship between the two variables, the correlation coefficient equals zero. If one variable decreases as the other increases, the correlation coefficient is negative (−1 indicates a perfect negative correlation). If the two variables are continuous, the Pearson (product-moment) correlation coefficient can be calculated to assess their linear relationship. If they are binary (coded as 0 and 1), the correlation coefficient is equal to the so-called phi coefficient. For ranked variables, Spearman’s or Kendall’s rank correlation coefficients can be used, which also capture monotonic nonlinear relationships. We can test whether the correlation coefficient is statistically significantly different from zero, which is the value of “no correlation.” However, it is important to know that as the sample size increases, even small correlation coefficients become statistically significant. In addition, we need to be aware that correlation does not equal causation, since a correlation between two variables may occur because they are both associated with a third unknown variable. While correlation coefficients reflect the strength and direction of the relationship between two variables, they do not quantify this relationship, that is, how much one variable changes on average with a unit change in the other. This quantification can be done by linear regression.
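The distinction between the Pearson and the rank-based coefficients, and between correlation and the regression slope, can be illustrated with a short sketch. The data are artificial (y is the cube of x, a monotonic but nonlinear relationship), and the helper names are assumptions made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def least_squares_slope(x, y):
    """Average change in y per unit change in x (simple linear regression)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Artificial data with a monotonic, nonlinear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
slope = least_squares_slope(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, slope = {slope:.1f}")
```

Here the rank correlation is perfect because y always increases with x, while Pearson’s r is below 1 because the relationship is not linear.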

Statistical models

Statistical models are mathematical descriptions of an assumed probability distribution. For example, a regression model is a mathematical formula that describes the probability distribution of an outcome variable in terms of a set of predictor variables. If the model fits the data reasonably well, it can be used to obtain values, probabilities, and uncertainties associated with the outcome for a given set of predictors (i.e., predict the outcome) and also to elicit the effects of each of the predictors on the outcome. Formulae are used to estimate a population parameter (e.g., the mean) from a characteristic of the sample data, which is called model fitting. Notably, assuming the normal distribution, the population mean can be estimated from both the sample mean and the sample median (mean = median in the normal distribution). However, considering that we want to achieve the best possible (i.e., the most efficient) estimate by adhering to the maximum likelihood principle, the sample mean has a smaller standard error than the sample median and is therefore the so-called maximum likelihood estimator of the population mean (in the normal distribution). The maximum likelihood estimate is the parameter value that maximizes the likelihood, that is, the probability of the observed data configuration given the parameter. Maximum likelihood is the basis of many regression models, such as logistic or Poisson regression. In ordinary linear regression, the least squares estimators of the regression parameters are equivalent to the maximum likelihood estimators (assuming normally distributed residuals).

There are several types of regression models, of which we briefly mention simple linear (or ordinary least squares) regression, logistic regression, and Poisson regression. A number of regression models can be put into the generalized linear model (GLM) framework. The GLM is a generalization of simple linear regression that allows for outcome variables with probability distributions other than the normal distribution. The GLM relates the linear model to the outcome variable through the link function, a function of the mean of the outcome variable that varies linearly with the predictors (rather than assuming that the outcome variable as such varies linearly). In other words, the link function describes the relationship between the linear predictor and the mean of the outcome for the given distribution. The GLM also allows for the magnitude of the variance of each sample value to be a function of its predicted value. The GLM consists of three elements: the linear predictor (the information in the predictors is reduced to a linear score), the family of probability distributions of the outcome variable given the predictor(s), and the link function. For example, in simple linear regression the family is normal and the link function is the identity; in logistic regression the family is binomial and the link function is the logit; and in Poisson regression the family is Poisson and the link function is the log.

Simple linear regression models the expected value of a continuous outcome variable as a linear function of the observed predictor. When the outcome variable is normally distributed, a constant change in a predictor leads to a constant change in the outcome variable (but not necessarily vice versa, as would be the case in correlation). The model fits a straight line through the data, which is expressed as

$y = \alpha + \beta x + \varepsilon$,

where y is the outcome variable; x is the predictor; α is the intercept, which equals the predicted value of y when x = 0; β is the slope of the line (i.e., the increase in y for a unit increase in x); and ε is the residual, which is the difference between the outcome value predicted by the model and the observed value, or the distance of each data point from the fitted regression line.

Logistic regression is used when the outcome variable is binary (coded as 0 and 1). When there is a binary outcome and one binary predictor, logistic regression gives essentially the same result as a chi-square test. A logistic formula with binary predictors (also coded as 0 and 1) leads easily to the probability of the outcome for different combinations of predictors. The probability of the outcome equals $e^{Q}/(1 + e^{Q})$, where Q is the regression equation ($Q = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$) and e is the exponential constant (= 2.7182818…). For example, if we want to know the probability of calcaneal spurs with the predictors sex (male coded as 1 and female coded as 0) and age (grouped into individuals aged 20–50 years [coded 0] and 50 years and older [coded 1]), the probability of calcaneal spurs is $e^{\alpha + \beta_1 + \beta_2}/(1 + e^{\alpha + \beta_1 + \beta_2})$ in 50+ years old males, $e^{\alpha + \beta_2}/(1 + e^{\alpha + \beta_2})$ in 50+ years old females, $e^{\alpha + \beta_1}/(1 + e^{\alpha + \beta_1})$ in 20–50 years old males, and $e^{\alpha}/(1 + e^{\alpha})$ in 20–50 years old females.

Poisson regression is used when the outcome variable is a count. While the Poisson model assumes that the mean equals the variance, the negative binomial model does not have this restriction, which is why it is often used instead to model count data.
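The calcaneal spur example can be made concrete with a few lines of code. The coefficient values below are hypothetical, chosen only to show how the four probabilities follow from the logistic formula; they are not estimates from any data set.

```python
from math import exp


def inverse_logit(q):
    """Probability of the outcome for a given value Q of the regression equation."""
    return exp(q) / (1.0 + exp(q))


# Hypothetical coefficients (assumed for illustration only).
alpha = -2.0      # intercept: 20-50 years old females
beta_sex = 0.5    # effect of sex (male = 1, female = 0)
beta_age = 1.2    # effect of age group (50+ = 1, 20-50 = 0)


def spur_probability(male, aged_50_plus):
    """Probability of calcaneal spurs for one combination of the predictors."""
    q = alpha + beta_sex * male + beta_age * aged_50_plus
    return inverse_logit(q)


probabilities = {
    (male, aged_50_plus): spur_probability(male, aged_50_plus)
    for male in (0, 1)
    for aged_50_plus in (0, 1)
}

for (male, aged_50_plus), p in probabilities.items():
    sex = "male" if male else "female"
    age = "50+" if aged_50_plus else "20-50"
    print(f"{sex:6s} {age:5s}: {p:.3f}")
```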


Generalized linear mixed models (GLMMs) are an extension of GLMs that includes random effects in the linear predictor. The resulting “subject-specific” parameter estimates are useful when we want to estimate the effect of changing one or more components of the predictors for a given individual.

Multiple regression analysis (with one outcome variable and two or more predictors assessed simultaneously) is useful when we need to adjust for confounding. The predictors are combined in an additive manner. One of the assumptions of multiple regression analysis is the absence of multicollinearity. Multicollinearity means that one predictor can be linearly predicted from another predictor, that is, these two predictors are nearly perfectly correlated (so the two basically contain the same information about the outcome variable). This may result in coefficient estimates changing erratically in response to minimal changes in the data, thus resulting in big changes to the model. The standard errors of the coefficient estimates of the collinear variables tend to be large, which may lead to a failure to reject a false null hypothesis of no effect of the predictors (type II error). Multicollinearity may be indicated by large changes in the estimated regression coefficients when a predictor variable is added or deleted, or by detecting that the coefficient of a certain predictor is significant in univariate regression but nonsignificant in a multivariable regression. There are some postestimation tests, such as the variance inflation factor (VIF) or the condition number test, which may help to formally detect multicollinearity. The latter test will even show which variables are the collinear ones.
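For a model with only two predictors, the variance inflation factor reduces to 1/(1 − r²), where r is the correlation between the two predictors. The sketch below uses this special case; the measurements (two long bone lengths in mm) are invented for illustration, and with more than two predictors the VIF would instead be obtained from a regression of each predictor on all the others.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two predictors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor in a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical, strongly correlated predictors (invented values, in mm).
femur_length = [430, 445, 452, 460, 471, 480]
tibia_length = [352, 365, 370, 379, 386, 395]

# Hypothetical uncorrelated predictors for comparison.
predictor_a = [1, 2, 3, 4]
predictor_b = [1, -1, -1, 1]

vif_collinear = vif_two_predictors(femur_length, tibia_length)
vif_independent = vif_two_predictors(predictor_a, predictor_b)
print(f"VIF (collinear predictors): {vif_collinear:.1f}")
print(f"VIF (uncorrelated predictors): {vif_independent:.1f}")
```

A VIF of 1 means that the variance of the coefficient estimate is not inflated at all; the larger the value, the more the standard error is inflated by the correlation between predictors.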
There are different solutions for dealing with multicollinearity, depending on what we want to achieve with the models: one of the multicollinear variables can be dropped from the model, the model can be left as it is (but the multicollinear variables are mentioned for future applications), or another type of regression analysis (such as partial least squares) may be used. Normally, multicollinearity is not complete, so it does not reduce the reliability of the model as a whole (at least within the sample data set), although it does affect calculations regarding individual predictors. Since in forensic anthropology existing regression equations are often applied to data other than the sample data set, using imprecise coefficient estimates would result in imprecise predictions in such cases.

Another type of modeling is the so-called survival analysis, which can be used for time-related and age-related outcomes, such as the occurrence of developmental stages or fracture dating. The survival function in survival analysis is the probability that a subject survives (or an event lasts) longer than a given time. The survival function is usually estimated by the Kaplan-Meier curve. In comparison, the hazard function is the event (e.g., death) rate at a given time, conditional on survival up to that time; in other words, it describes the instantaneous risk that a subject who has survived (or an event that has lasted) until a particular time will not survive (last) beyond it. In addition to the Kaplan-Meier method, which does not allow for adjusting for confounding variables, survival can be estimated by the fully parametric Weibull or log-normal model (which allows for increasing/decreasing hazards) or by the Cox proportional hazards regression model, which allows for adjusting for confounding variables and assumes that the hazard equals a baseline hazard multiplied by a factor that depends on the predictors.
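The Kaplan-Meier estimate of the survival function can be sketched in a few lines: at each time with at least one observed event, the current survival probability is multiplied by the proportion of subjects at risk who did not experience the event. The times and censoring indicators below are invented, and the function name is an assumption of this example.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times[i] is the follow-up time of subject i; events[i] is 1 if the event
    was observed at that time and 0 if the observation was censored.
    Returns a list of (time, survival probability) at each event time.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
    return curve


# Invented follow-up times (arbitrary units) and event indicators.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never trigger a step in the curve.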


Testing a test/method

In forensic anthropology and forensic sciences in general, the question often arises how well the method/test that we use for our estimations performs, or what the probability is that the test actually gives correct estimates. For binary variables, a simple 2 × 2 table and special terminology can already bring many answers (Table 3), provided that we have data that we can use to assess the performance of the method. Let’s say we have an osteological collection of known sex and we are asked how well one of the features of the Phenice method of sex estimation (Phenice, 1969), the ventral arc (a bony ridge on the inferior aspect of the anterior surface of the os pubis), performs in distinguishing males and females. When the sample is representative of the overall population in terms of the proportions of males and females, the results of the 2 × 2 table should be valid. Sensitivity is the ability of the test to identify actual positives as such (preferably false negatives should be few), and specificity is the ability of the test to identify actual negatives as such (preferably false positives should be few). The ventral arc is considered to be an indicator of an os pubis belonging to a female. So if the ventral arc were a perfect predictor, it would be 100% sensitive (i.e., it would be present in every female os pubis and thus correctly identify all female os pubis as female) and 100% specific (i.e., it would be absent in every male os pubis, so that no male os pubis would be identified as female). However, in reality, the ventral arc may be absent in some females and present in some males, so the method will be neither 100% sensitive nor 100% specific. Positive predictive value is the probability that a person who has a ventral arc is actually female, while negative predictive value is the probability that a person who does not have a ventral arc is actually male.
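The quantities defined above, together with the related measures listed in Table 3, follow directly from the four cells of the 2 × 2 table. The sketch below uses invented counts for a collection of 100 females and 100 males, with “female” as the positive occurrence; neither the counts nor the function name come from the Phenice study.

```python
def method_performance(tp, fp, fn, tn):
    """Performance measures from the four cells of a 2 x 2 table."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)           # true positive rate
    false_positive_rate = fp / (fp + tn)
    return {
        "prevalence": (tp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
        "accuracy": (tp + tn) / total,
        "positive_likelihood_ratio": sensitivity / false_positive_rate,
    }


# Invented counts: ventral arc present in 90 of 100 females and 8 of 100 males.
performance = method_performance(tp=90, fp=8, fn=10, tn=92)
for name, value in performance.items():
    print(f"{name}: {value:.3f}")
```

Note that the predictive values depend on the proportions of females and males in the sample, which is why the sample needs to be representative of the population to which the results are applied.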
Overall accuracy is the probability that a randomly selected subject is correctly identified through the presence/absence of the feature of interest (the ventral arc). Since we only assess the presence or absence of the ventral arc, the results from the cross-tabulation will be the answer regarding the test performance.

Table 3 Terminology of findings when tabulating actual versus predicted occurrences in a method/test.

| | Occurrence positive (OP) | Occurrence negative (ON) | Derived measures |
|---|---|---|---|
| Predicted occurrence positive (POP) | True positive (TP) | False positive (FP), type I error | Positive predictive value (PPV) = precision = Σ TP/Σ POP |
| Predicted occurrence negative (PON) | False negative (FN), type II error | True negative (TN) | Negative predictive value (NPV) = Σ TN/Σ PON |
| Derived measures | True positive rate (TPR) = sensitivity = power = Σ TP/Σ OP | False positive rate (FPR) = Σ FP/Σ ON | Positive likelihood ratio = TPR/FPR |

Prevalence = Σ OP/Σ Total; Accuracy = (Σ TP + Σ TN)/Σ Total.

However, let’s say we decide to measure the length or thickness of the ventral arc to better differentiate between males and females. We would then want to find a cutoff value with preferably both few false negatives (high sensitivity) and few false positives (high specificity). The receiver operating characteristic (ROC) curve, which graphically represents sensitivity versus 1 − specificity, may be of assistance not only in finding such a cutoff but also in addressing the spectrum of values if we want to find the most acceptable trade-off between sensitivity and specificity. For more advanced assessment, we can then use discriminant analysis with the cross-validation approach to test whether we can extend the results of our study to other samples.

It may be helpful to know that higher specificity means lower type I error (for a fixed sample size), that is, fewer false positives. In contrast, higher sensitivity means lower type II error, that is, fewer false negatives. Often, researchers need to decide what is more acceptable for them: fewer false positives or fewer false negatives. For example, in medical studies, it is often more acceptable to have (more) false positives than false negatives, because in the latter case persons with disease may not be detected and in the worst case may die as a consequence of that undetected disease. In forensics, having (more) false positives may be worse. For example, in cases of unknown identity, if the remains are falsely identified, the chance of correct identification may be lost, and, what is worse, another person may be misidentified as a consequence of the initial incorrect identification.
In general, if someone is falsely identified as being the perpetrator of a crime (false positive), this has far-reaching negative consequences, while if a perpetrator is not identified as such (false negative), this may be more readily accepted, in line with “in dubio pro reo.”
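The search for a cutoff on a measured feature can be sketched by computing sensitivity and 1 − specificity at every candidate cutoff, which yields the points of the ROC curve. The measurements and sex labels below are invented, cases are assumed to be classified as positive when the value is at or above the cutoff, and Youden’s J (sensitivity + specificity − 1) is used as one possible criterion for the trade-off; it is not prescribed by the text, and a different criterion may be preferable when false positives and false negatives are not equally acceptable.

```python
def roc_points(values, is_positive):
    """Return (cutoff, sensitivity, 1 - specificity) for every candidate cutoff.

    A case is classified as positive when its value is >= cutoff.
    """
    positives = sum(is_positive)
    negatives = len(is_positive) - positives
    points = []
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, pos in zip(values, is_positive) if pos and v >= cutoff)
        fp = sum(1 for v, pos in zip(values, is_positive) if not pos and v >= cutoff)
        points.append((cutoff, tp / positives, fp / negatives))
    return points


def best_cutoff_by_youden(points):
    """Cutoff maximizing Youden's J = sensitivity - (1 - specificity)."""
    return max(points, key=lambda point: point[1] - point[2])


# Invented measurements (mm) and labels (1 = positive occurrence, 0 = negative).
values = [2.1, 2.4, 2.8, 3.0, 3.3, 3.9, 3.6, 4.0, 4.2, 4.5, 4.8]
is_positive = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

points = roc_points(values, is_positive)
cutoff, sensitivity, false_positive_rate = best_cutoff_by_youden(points)
print(f"Cutoff {cutoff}: sensitivity {sensitivity:.2f}, "
      f"specificity {1 - false_positive_rate:.2f}")
```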

Conclusion

Forensic anthropologists and pathologists, as scientists and forensic experts, have the responsibility both to perform their research to the highest standard and to interpret and present their results for judicial purposes. This means that they need to understand the statistical background of their studies to be able to explain it within the context of legal proceedings. The importance of appropriate study design was shown in Chapter 1.1, while this chapter highlights the underlying workings of statistical testing and the meaning of probability (distributions). In summary, it is important to use statistical tests that are appropriate for the collected or available data, and it is even more important that the conclusions are appropriate for the data on which they are based. Moreover, we need to understand our data to use the multitude of functions in statistical software packages appropriately.


Reference

Phenice, T.W., 1969. A newly developed visual method of sexing in the os pubis. Am. J. Phys. Anthropol. 30, 297–301.

Recommended reading

Aitken, C.G.G., Taroni, F., 2004. Statistics and the Evaluation of Evidence for Forensic Scientists, second ed. John Wiley & Sons, Ltd., Chichester.
Committee on Identifying Needs of the Forensic Sciences Community, National Research Council, 2009. Strengthening forensic science in the United States: a path forward. In: National Academy of Sciences. The National Academies Press, Washington, D.C.
Cox, V., 2017. Translating Statistics to Make Decisions. A Guide for the Non-Statistician. Apress/Springer, New York.
European Network of Forensic Science Institutes (ENFSI), 2010. ENFSI Guideline for Evaluative Reporting in Forensic Science. ENFSI, Brussels.
Lucy, D., 2005. Introduction to Statistics for Forensic Scientists. John Wiley & Sons, Ltd., Chichester.
Nikita, E., 2017. Statistical methods in human osteology. In: Osteoarchaeology. A Guide to the Macroscopic Study of Human Skeletal Remains. Elsevier, London, pp. 355–442.

CHAPTER 2.3

Data mining and decision trees

Efthymia Nikita (a) and Panos Nikitas (b)

(a) Science and Technology in Archaeology and Culture Research Center, The Cyprus Institute, Nicosia, Cyprus
(b) Laboratory of Physical Chemistry, Department of Chemistry, Aristotle University of Thessaloniki, Thessaloniki, Greece

Introduction

Forensic anthropology aims at creating a biological profile for deceased individuals, which includes ancestry, sex, age-at-death, and stature estimation. Different statistical methods allow not only the estimation of these parameters but also the calculation of the probability associated with each estimate. For example, sex estimation may be based on binary logistic regression and/or linear discriminant analysis (LDA) (Giles and Elliot, 1963; Walker, 2008) and ancestry estimation on LDA (Giles and Elliot, 1962; Liebenberg et al., 2015); Bayesian approaches have been used for age-at-death estimation (Konigsberg and Frankenberg, 1992), whereas stature is estimated using regression techniques (Ruff et al., 2012). The last few years have witnessed an increasing shift from traditional statistics to machine learning. In this respect, Hefner et al. (2014) have demonstrated the utility of machine learning for estimating ancestry from cranial metric and morphoscopic data, while Navega et al. (2015) developed the software AncesTrees for ancestry estimation using the random forest algorithm. In addition, neural networks, decision trees, and other machine learning methods have been used for sex estimation (Darmawan et al., 2015; Du Jardin et al., 2009; Langley et al., 2017; Nikita and Nikitas, 2020).

Statistics and Probability in Forensic Anthropology. https://doi.org/10.1016/B978-0-12-815764-0.00012-5 © 2020 Elsevier Inc. All rights reserved.

Data mining

Data mining is an interdisciplinary field at the intersection of artificial intelligence, machine learning, statistics, and database systems (Chakrabarti et al., 2006), and it can be defined in different ways. Many researchers treat data mining as a synonym for the popular term “knowledge discovery in databases (KDD),” which is the process of digging through a database to discover hidden patterns and predict future trends. Strictly speaking, however, KDD consists of the following steps (Han and Kamber, 2001): data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Data mining is therefore only one step within KDD, although it is the most important one, in which intelligent techniques are applied to extract data patterns. A more conservative definition considers data mining as the use of machine learning algorithms to find patterns and correlations within large data sets in order to predict outcomes.

Data mining tasks are generally divided into two broad categories: descriptive and predictive (Rokach and Maimon, 2015; Tan et al., 2006). Descriptive data mining tasks are often exploratory. Their objective is to identify correlations, associations, clusters, patterns, and anomalies that summarize the underlying relationships in the database. The objective of predictive tasks is to predict the value of a particular dependent variable based on the values of explanatory or independent variables. Predictive tasks may be accomplished by regression or classification. Regression builds a model that relates the dependent variable (response variable) to the independent variables (predictors), while classification assigns objects to one of several predefined categories (Dunham, 2002). Common algorithms used for classification include decision trees, random forests, k-nearest neighbor, support vector machines, naive Bayesian classification, neural networks, and Bayesian networks.

As mentioned above, forensic anthropology is the systematic study of human skeletal remains and the estimation of specific features of an individual, such as ancestry, sex, age at death, and stature. Therefore, in forensic anthropology, the most widely adopted data mining tasks are regression and classification. Regression is used for stature estimation, whereas ancestry and sex may be treated using classification techniques. Of these techniques, the current chapter examines decision trees and their extension to random forests. Note that since data mining combines different techniques from various disciplines, such as statistics and machine learning, the conventional Bayesian age-at-death estimation technique may be considered a special branch of predictive data mining.

Decision trees

A decision tree is a tree-shaped structure that represents sets of decisions. It can be adopted in both regression and classification problems. An example of a typical classification tree is given in Fig. 1 and concerns sex assessment from three cranial morphological traits: glabella, mastoid process, and zygomatic extension (Langley et al., 2017). According to this tree, a researcher should first assess whether the glabella score reaches the cutoff value of 4. If the glabella score is greater than or equal to 4, then the zygomatic extension is examined. If the zygomatic extension score is greater than or equal to 3, then the skull has a high probability of being male; otherwise the skull is likely to belong to a female. On the other hand, if the glabella has a score of less than 4, the researcher also needs to assess the zygomatic extension. If the zygomatic extension score is less than 3, then the skull is likely to be female; otherwise the researcher should assess the mastoid process. A mastoid score of less than 4 indicates that the skull is likely female, whereas if the mastoid score is 4, the skull is likely male. According to Langley et al. (2017), this decision tree yielded 93.5% accuracy on the training sample, 94% on the cross-validated sample, and 96% on a holdout validation sample.

FIG. 1 Example of a decision tree. Adapted from Fig. 2 in Langley, N.R., Dudzik, B., Cloutier, A., 2017. A decision tree for nonmetric sex assessment from the skull. J. Forensic Sci. 63, 31–37.

There are two types of decision trees: classification trees and regression trees. Classification trees are used when the response variable is categorical; they separate the dataset into the classes of the response variable. In contrast, regression trees are used when the response variable is continuous. In both types of decision trees, the explanatory variables may be categorical, continuous, or both.

Every decision tree has one root node, which represents the entire population or sample. The root node is divided into two or more subnodes. A subnode that splits into further subnodes is called a decision node, whereas a subnode that does not split is called a terminal or leaf node (Fig. 2). A node divided into subnodes is called a parent node, and each of its subnodes is called a child of the parent node. All nodes with one incoming edge and two or more outgoing edges are called internal nodes. A subsection of an entire tree is called a branch.

The basic processes in creating a decision tree are splitting and pruning. Splitting divides a node into two or more subnodes, while pruning removes subnodes from a decision tree. The construction of an optimal decision tree depends upon the specific algorithm adopted, and this issue is discussed in detail in the example below. Here, we present some general rules concerning the splitting and pruning processes.
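The rules read off Fig. 1 earlier in this section, together with the node terminology just introduced, can be illustrated with a minimal sketch. It is written in Python purely for illustration (the chapter's own analyses are carried out in R), and the names Leaf, Decision, FIG1_TREE, classify, and count_leaves are hypothetical helpers, not part of any published software. The sketch also assumes that the branch described as "mastoid score is 4" is the complement of "mastoid score less than 4", that is, a score of 4 or more.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal (leaf) node: holds the predicted class."""
    label: str


@dataclass
class Decision:
    """Root or decision node: compares one trait score with a cutoff."""
    trait: str                      # trait scored by the researcher
    cutoff: int                     # scores >= cutoff follow the `high` child
    low: Union["Decision", Leaf]    # child followed when score < cutoff
    high: Union["Decision", Leaf]   # child followed when score >= cutoff


# The tree as described in the text for Fig. 1 (after Langley et al., 2017).
# Assumption: the mastoid split is treated as "< 4" versus ">= 4".
FIG1_TREE = Decision(
    trait="glabella", cutoff=4,
    high=Decision("zygomatic", 3, low=Leaf("female"), high=Leaf("male")),
    low=Decision(
        "zygomatic", 3,
        low=Leaf("female"),
        high=Decision("mastoid", 4, low=Leaf("female"), high=Leaf("male")),
    ),
)


def classify(node, scores):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, Decision):
        node = node.high if scores[node.trait] >= node.cutoff else node.low
    return node.label


def count_leaves(node):
    """Number of terminal nodes in the (sub)tree rooted at `node`."""
    if isinstance(node, Leaf):
        return 1
    return count_leaves(node.low) + count_leaves(node.high)


# Example: glabella 3, zygomatic extension 3, mastoid 4
print(classify(FIG1_TREE, {"glabella": 3, "zygomatic": 3, "mastoid": 4}))
```

Here the root node tests the glabella, the two zygomatic nodes and the mastoid node are decision nodes, and each Leaf is a terminal node. The traversal only consults the traits that lie on the path actually followed, which mirrors how a researcher would use the printed tree.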
FIG. 2 Elements of a decision tree.

The splitting measures that allow a tree to grow are related to the impurity of the nodes. Node impurity is 0 when all patterns in the node are of the same category, that is, the node is completely homogeneous, whereas impurity becomes maximal when all the classes at the node are equally likely. For classification trees, impurity is defined in terms of the proportion of class $i$ in a node, $c_i$, and it is related to the following two indices:

(1) The information or entropy index:

$$\mathrm{Entropy} = -\sum_{i=1}^{n} c_i \ln c_i$$

(2) The Gini index:

$$\mathrm{Gini} = 1 - \sum_{i=1}^{n} c_i^2$$

where the summation is over the $n$ classes and, for the entropy index, a class that is absent from the node ($c_i = 0$) contributes zero. Both of these indices take the value of zero for completely pure (homogeneous) nodes and increase as homogeneity decreases, that is, as impurity increases. When building a classification tree, either the Gini or the entropy index is typically used to evaluate the quality of a specific split.

The major advantage of using decision trees is that they are easy to explain. They can handle both continuous and categorical variables, they are nonparametric and nonlinear, and they implicitly assess the importance of the explanatory variables. On the other hand, decision trees generally do not have the same level of predictive accuracy as other classical approaches and machine learning algorithms. They are prone to overfitting and classification errors if there are many classes and a relatively small number of training data. In addition, decision trees suffer from high variance, which means that a small change in the data can cause a large change in the structure of the decision tree. To increase the predictive performance of decision trees, many decision trees may be aggregated using the random forests method (see the section “Improving the decision trees” below).
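The two impurity indices defined above can be computed directly from the class labels in a node. The following is a minimal sketch in Python (used here only for illustration; the chapter's own analyses use R). The function names are hypothetical. The way a split is scored, by weighting the impurity of each child node by its share of the cases, is an assumption borrowed from common CART-style practice, since the text does not spell out this step.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion c_i of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Entropy index: -sum(c_i * ln c_i) over the classes in the node."""
    return sum(-c * math.log(c) for c in class_proportions(labels))


def gini(labels):
    """Gini index: 1 - sum(c_i ** 2) over the classes in the node."""
    return 1 - sum(c * c for c in class_proportions(labels))


def split_impurity(left, right, index=gini):
    """Impurity after a split: child impurities weighted by node size.

    Assumption: size-weighted averaging, as in CART-style algorithms.
    """
    n = len(left) + len(right)
    return len(left) / n * index(left) + len(right) / n * index(right)


# Toy labels, made up for illustration only (M = male, F = female).
parent = ["M"] * 4 + ["F"] * 4
left, right = ["M", "M", "M", "F"], ["M", "F", "F", "F"]

print(gini(parent), entropy(parent))   # impurity before the split
print(split_impurity(left, right))     # impurity after the split
```

For a two-class node with equal shares, the Gini index is 0.5 and the entropy index is ln 2 (about 0.693), the maximum for two classes; both are 0 for a pure node. A candidate split is better the more it lowers the weighted impurity relative to the parent node.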

Example: Decision tree for cranial morphological sex assessment

The first attempt to use classification trees for sex assessment using cranial morphological traits was carried out by Langley et al. (2017) using the glabella, mastoid process, and zygomatic extension. In a recent paper, the authors (Nikita and Nikitas, 2020) examined the performance of several classical and machine learning classification methods based on cranial and pelvic sexually dimorphic traits recorded in a modern documented collection, the Athens Collection, and the same collection is used in the current example. The collection consists of two parts: skeletons that form the original core of the collection, coded as WLH, and skeletons that were added to the collection later, coded as ABH. In the present example, we used 59 skeletons with WLH coding (39 males and 20 females) and 132 skeletons with ABH coding (67 males and 65 females). The ABH sample was used as the training sample and the WLH sample as the target sample. There are eight predictors in these two datasets: the pelvic traits ventral arc (VA), subpubic concavity (SPC), and medial aspect of the ischiopubic ramus (MAR), and the cranial traits mental eminence (ME), supraorbital margin (ORB), supraorbital ridge/glabella (GL), nuchal crest (NC), and mastoid process (MA).

All analyses were carried out in R. For a brief introduction to R, see Chapter 7.2. To input the various datasets (training/target samples) in R, we first created a tab-delimited .txt file for each sample (Fig. 3). Note that the first row of each .txt file contains the column titles, and that the sex variable is named sex and coded as 0 for males and 1 for females. We input these data in R using the command:

mydata <- read.table(choose.files(caption = 'Select training/target file'), header = TRUE, sep = "\t")
To build a decision tree and then make predictions, we used the rpart() function of the R package of the same name; the packages rpart, rpart.plot, rattle, and RColorBrewer need to be installed. To build a decision tree on the training data, we enter or copy and paste in the R console the following commands: train