Biostatistics Using JMP: A Practical Guide


The correct bibliographic citation for this manual is as follows: Bihl, Trevor. 2017. Biostatistics Using JMP®: A Practical Guide. Cary, NC: SAS Institute Inc. Biostatistics Using JMP®: A Practical Guide Copyright © 2017, SAS Institute Inc., Cary, NC, USA ISBN 978-1-62960-383-4 (Hard copy) ISBN 978-1-63526-241-4 (EPUB) ISBN 978-1-63526-242-1 (MOBI) ISBN 978-1-63526-243-8 (PDF) All Rights Reserved. Produced in the United States of America. For a hard copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc. For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication. The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated. U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement. SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414 September 2017 SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.

Dedication To the memory of Gregory Boivin, DVM, MBA, who provided encouragement and much needed data for this endeavor


Contents
Dedication ............................................................................................................................... iii
Acknowledgments ................................................................................................................. xi
About This Book ................................................................................................................... xiii
About the Author ................................................................................................................. xvii

Chapter 1: Introduction ......................................................................... 1
1.1 Background and Overview .............................................................................................. 1
1.2 Getting Started with JMP ................................................................................................ 2
1.3 General Outline ................................................................................................................ 4
1.4 How to Use This Book ..................................................................................................... 5
1.5 Reference .......................................................................................................................... 5

Chapter 2: Data Wrangling: Data Collection ......................................... 7 2.1 Introduction ...................................................................................................................... 7 2.2 Collecting Data from Files............................................................................................... 8 2.2.1 JMP Native Files ..................................................................................................... 8 2.2.2 SAS Format Files .................................................................................................... 9 2.2.3 Excel Spreadsheets.............................................................................................. 10 2.2.4 Text and CSV Format ........................................................................................... 11 2.3 Extracting Data from Internet Locations ..................................................................... 14 2.3.1 Opening as Data ................................................................................................... 14 2.3.2 Opening as a Webpage ........................................................................................ 15 2.4 Data Modeling Types ..................................................................................................... 17 2.4.1 Incorporating Expression and Contextual Data ................................................ 18 2.5 References...................................................................................................................... 19

Chapter 3: Data Wrangling: Data Cleaning ......................................... 21 3.1 Introduction .................................................................................................................... 21 3.2 Tables .............................................................................................................................. 21 3.2.1 Stacking Columns ................................................................................................ 24 3.2.2 Basic Table Organization..................................................................................... 26 3.2.3 Column Properties ............................................................................................... 31 3.3 The Sorted Array ............................................................................................................ 32


3.4 Restructuring Data......................................................................................................... 34 3.4.1 Combining Columns ............................................................................................. 35 3.4.2 Separating Out a Column (Text to Columns) ..................................................... 36 3.4.3 Creating Indicator Columns ................................................................................ 36 3.4.4 Grouping Inside Columns .................................................................................... 38 3.5 References ...................................................................................................................... 41

Chapter 4: Initial Data Analysis with Descriptive Statistics ................. 45 4.1 Introduction .................................................................................................................... 45 4.2 Histograms and Distributions ....................................................................................... 45 4.2.1 Histograms ............................................................................................................ 46 4.2.2 Box Plots ............................................................................................................... 55 4.2.3 Stem-and-Leaf Plots ............................................................................................ 57 4.2.4 Pareto Charts ........................................................................................................ 58 4.3 Descriptive Statistics ..................................................................................................... 64 4.3.1 Sample Mean and Standard Deviation ............................................................... 66 4.3.2 Additional Statistical Measures .......................................................................... 67 4.4 References ...................................................................................................................... 69

Chapter 5: Data Visualization Tools ..................................................... 71 5.1 Introduction .................................................................................................................... 71 5.2 Scatter Plots ................................................................................................................... 72 5.2.1 Coloring Points ..................................................................................................... 75 5.2.2 Copying Better-Looking Figures......................................................................... 77 5.2.3 Multiple Scatter Plots ........................................................................................... 79 5.3 Charts .............................................................................................................................. 81 5.4 Multidimensional Plots .................................................................................................. 84 5.4.1 Parallel Plots ......................................................................................................... 84 5.4.2 Cell Plots ............................................................................................................... 87 5.5 Multivariate and Correlations Tool ............................................................................... 89 5.5.1 Correlation Table .................................................................................................. 91 5.5.2 Correlation Heat Maps ......................................................................................... 92 5.5.3 Simple Statistics ................................................................................................... 93 5.5.4 Additional Multivariate Measures ....................................................................... 93 5.6 Graph Builder and Custom Figures .............................................................................. 94 5.6.1 Graph Builder Custom Colors ............................................................................. 96 5.6.2 Incorporating Contextual Data............................................................................ 98 5.7 References ...................................................................................................................... 99


Chapter 6: Rates, Proportions, and Epidemiology ............................. 101 6.1 Introduction .................................................................................................................. 101 6.2 Rates ............................................................................................................................. 101 6.2.1 Crude Rates ........................................................................................................ 101 6.2.2 Adjusted Rates ................................................................................................... 105 6.3 Geographic Visualizations .......................................................................................... 108 6.3.1 National Visualizations ....................................................................................... 108 6.3.2 County and Lower Level Visualizations ........................................................... 116 6.4 References.................................................................................................................... 120

Chapter 7: Statistical Tests and Confidence Intervals ....................... 123 7.1 Introduction .................................................................................................................. 123 7.1.1 General Hypothesis Test Background ............................................................. 124 7.1.2 Selecting the Appropriate Method ................................................................... 125 7.2 Testing for Normality ................................................................................................... 126 7.2.1 Histogram Analysis ............................................................................................ 126 7.2.2 Normal Quantile/Probability Plot ...................................................................... 128 7.2.3 Goodness-of-Fit Tests ....................................................................................... 131 7.2.4 Goodness-of-Fit for Other Distributions .......................................................... 132 7.3 General Hypothesis Tests ........................................................................................... 133 7.3.1 Z-Test Hypothesis Test of Mean....................................................................... 133 7.3.2 T-Test Hypothesis Test of Mean ....................................................................... 135 7.3.3 Nonparametric Test of Mean (Wilcoxon Signed Rank) .................................. 136 7.3.4 Standard Deviation Hypothesis Test ................................................................ 140 7.3.5 Tests of Proportions........................................................................................... 141 7.4 Confidence Intervals .................................................................................................... 144 7.4.1 Mean Confidence Intervals................................................................................ 144 7.4.2 Mean Confidence Intervals with Different Thresholds ................................... 144 7.4.3 Confidence Intervals for Proportions ............................................................... 145 7.5 Chi-Squared Analysis of Frequency and Contingency Tables ................................ 146 7.6 Two Sample Tests........................................................................................................ 150 7.6.1 Comparing Two Group Means .......................................................................... 150 7.6.2 Paired Comparison, Matched Pairs.................................................................. 154 7.7 References.................................................................................................................... 156

Chapter 8: Analysis of Variance (ANOVA) and Design of Experiments (DoE) ................................................................................................. 159 8.1 Introduction .................................................................................................................. 159


8.2 One-Way ANOVA .......................................................................................................... 161 8.2.1 One-Way ANOVA with Fit Y by X....................................................................... 161 8.2.2 Means Comparison, LSD Matrix, and Connecting Letters ............................ 165 8.2.3 Fit Y by X Changing Significance Levels .......................................................... 168 8.2.4 Multiple Comparisons, Multiple One-Way ANOVAs........................................ 169 8.2.5 One-Way ANOVA via Fit Model ......................................................................... 171 8.2.6 One-Way ANOVA for Unequal Group Sizes (Unbalanced) ............................. 176 8.3 Blocking ........................................................................................................................ 179 8.3.1 One-Way ANOVA with Blocking via Fit Y by X................................................. 179 8.3.2 One-Way ANOVA with Blocking via Fit Model ................................................. 182 8.3.3 Note on Blocking ................................................................................................ 183 8.4 Multiple Factors ........................................................................................................... 183 8.4.1 Experimental Design Considerations ............................................................... 184 8.4.2 Multiple ANOVA .................................................................................................. 188 8.4.3 Feature Selection and Parsimonious Models .................................................. 191 8.5 Multivariate ANOVA (MANOVA) and Repeated Measures ....................................... 196 8.5.1 Repeated Measures MANOVA Background .................................................... 196 8.5.2 MANOVA in Fit Model ......................................................................................... 197 8.6 References .................................................................................................................... 201

Chapter 9: Regression and Curve Fitting ........................................... 205 9.1 Introduction .................................................................................................................. 205 9.2 Simple Linear Regression ........................................................................................... 206 9.2.1 Fit Y by X for Bivariate Fits (One X and One Y) ................................................ 206 9.2.2 Special Fitting Tools ........................................................................................... 208 9.3 Multiple Regression ..................................................................................................... 211 9.3.1 Fit Model .............................................................................................................. 211 9.3.2 Stepwise Feature Selection ............................................................................... 214 9.3.3 Analysis of Covariance (ANCOVA) .................................................................... 222 9.4 Nonlinear Curve Fitting and a Nonlinear Platform Example .................................... 226 9.5 References .................................................................................................................... 232

Chapter 10: Diagnostic Methods for Regression, Curve Fitting, and ANOVA ............................................................................................... 233 10.1 Introduction ................................................................................................................ 233 10.2 Computing Residuals with Fit Y by X and Fit Model .............................................. 234 10.2.1 Fit Y by X............................................................................................................ 234 10.2.2 Fit Model ............................................................................................................ 234 10.3 Checking for Normality ............................................................................................. 235


10.4 Checking for Nonconstant Error Variance (Heteroscedasticity) .......................... 236 10.5 Checking for Outliers................................................................................................. 238 10.6 Checking for Nonindependence ............................................................................... 242 10.7 Multiple Factor Diagnostics ...................................................................................... 243 10.8 Nonlinear Fit Residuals ............................................................................................. 245 10.9 Developing Appropriate Models ............................................................................... 246 10.10 References................................................................................................................ 247

Chapter 11: Categorical Data Analysis .............................................. 249 11.1 Introduction ................................................................................................................ 249 11.2 Clustering.................................................................................................................... 250 11.2.1 Hierarchical Clustering .................................................................................... 250 11.2.2 K-means Clustering ......................................................................................... 260 11.3 Classification .............................................................................................................. 263 11.3.1 JMP Data Preliminaries for Classification ..................................................... 265 11.3.2 Example Data Sets ........................................................................................... 267 11.4 Classification by Logistic Regression ..................................................................... 268 11.4.1 Logistic Regression in Fit Y by X .................................................................... 268 11.4.2 Logistic Regression in Fit Model .................................................................... 270 11.5 Classification by Discriminant Analysis................................................................... 273 11.5.1 Discriminant Analysis Loadings ...................................................................... 275 11.5.2 Stepwise Discriminant Analysis ...................................................................... 276 11.6 Classification with Tabulated Data .......................................................................... 277 11.7 Classifier Performance Verification ......................................................................... 280 11.8 References.................................................................................................................. 284

Chapter 12: Advanced Modeling Methods ......................................... 287 12.1 Introduction ................................................................................................................ 287 12.2 Principal Components and Factor Analysis ............................................................ 288 12.2.1 Principal Components in JMP......................................................................... 288 12.2.2 Dimensionality Assessment ............................................................................ 291 12.2.3 Factor Analysis in JMP .................................................................................... 293 12.3 Partial Least Squares ................................................................................................ 296 12.4 Decision Trees............................................................................................................ 302 12.4.1 Classification Decision Trees in JMP ............................................................. 303 12.4.2 Predictive Decision Trees in JMP ................................................................... 308 12.5 Artificial Neural Networks ......................................................................................... 310 12.5.1 Neural Network Architecture .......................................................................... 311 12.5.2 Classification Neural Networks in JMP .......................................................... 312 12.5.3 Predictive Neural Networks in JMP ................................................................ 315


12.6 Control Charts ............................................................................................................ 317 12.7 References .................................................................................................................. 321

Chapter 13: Survival Analysis ............................................................ 323 13.1 Introduction ................................................................................................................ 323 13.2 Life Distributions ........................................................................................................ 323 13.3 Kaplan-Meier Curves ................................................................................................. 327 13.3.1 Simple Survival Analysis .................................................................................. 327 13.3.2 Multiple Groups ................................................................................................ 330 13.3.3 Censoring .......................................................................................................... 331 13.3.4 Proportional Hazards ....................................................................................... 335 13.4 References .................................................................................................................. 336

Chapter 14: Collaboration and Additional Functionality .................... 339 14.1 Introduction ................................................................................................................ 339 14.2 Saving Scripts and SAS Coding................................................................................ 339 14.2.1 Saving Scripts to Data Table ........................................................................... 340 14.2.2 SAS Coding Functionality ................................................................................ 341 14.3 Collaboration .............................................................................................................. 342 14.3.1 Journals ............................................................................................................. 342 14.3.2 Web Reports ..................................................................................................... 344 14.4 Add-Ins ........................................................................................................................ 347 14.4.1 Finding Add-Ins................................................................................................. 347 14.4.2 Developing Add-Ins .......................................................................................... 348 14.4.3 Example Add-In: Forest Plot / Meta-analysis ............................................... 348 14.4.4 Add-In Version Control .................................................................................... 351 14.5 References .................................................................................................................. 352

Index ................................................................................................. 331

Acknowledgments I would like to thank my wife, Ji, and daughter, Talia, for their support and understanding while I worked on this book. Additionally, my wife was an excellent editor and subject matter expert on various biomedical topics. The education I received from working with Kenneth Bauer pushed me into statistics and thus this book. Motivation by Camilla Mauzy, David Smallenberger, and Michael Gibb also needs mentioning, since it helped move this project from ad hoc student reference materials to a completed book. This motivation was furthered along by Bill Worley, of SAS Institute Inc., who was instrumental in getting me to submit a book proposal as well as being a sounding board for ideas and a source of knowledge on the finer points of JMP. Support and motivation by Stacey Hamilton, my editor at SAS, were also extremely helpful and appreciated. Finding new and relevant data is always particularly challenging, and considerable thanks goes to those who were willing and able to share their data. Gregory Boivin was very helpful in this regard and provided a very useful dataset on mouse tendon strength, which is used throughout the book; in addition to thanking him, considerable thanks also goes to Hamish Simpson and Michelle Ghert, the editors of Bone and Joint Research, who gave me permission to reuse Greg’s mouse tendon strength dataset. Similarly, Angie Brown, of the West Virginia Medical Journal, was very helpful in giving me permission to reuse data presented in their journal. A similar debt of gratitude goes to Teresa Hawkes, Otilia Banji, and Kranthi Kumar, who shared their own datasets with me. Finally, thanks goes to my reviewers, including Teresa Hawkes, Amanda King, and Richard Zink, and my editor, Stacey Hamilton, who greatly helped improve the quality over the initial drafts.


About This Book Rationale for This Book This book focuses on the basics of statistical data analysis of biomedical/biological data using JMP. After both teaching and consulting in biostatistics, I saw a gap that existed between biostatistics books, which tend to be theoretical, and statistical software. To address this gap, I use statistical methods to analyze various biostatistics problems.

Importance of Statistical Analysis Analytics, data mining, data science, and statistics are essentially synonyms, and describe finding meaning in data by developing mathematical models to find and describe relationships in the data. While many biostatistical applications are simple in nature, for example, a t-test to evaluate the mean differences in response due to a treatment, a wide variety of methods exists.

Biostatistics Focus Biostatistics is the application of statistical methods to biological, or medical, data. While some methods see more frequent use in biostatistics, for example, survival analysis, these methods are not limited in use to just biostatistical problems. Essentially, all data is a matrix at the end of the day, and thus methods seen in biostatistical analysis can be applied to other domains.

The Power of JMP for Analytics Familiarity with statistical methods enables one to analyze data via methods familiar in textbooks. However, many textbook examples are simple in nature, but real-world data rarely is. Thus, applying methods in a textbook can be frustrating if you have to wrestle both with the data and software. This book was written with JMP due to the many advantages JMP has over other statistical software. JMP provides a GUI (graphical user interface) in which one can analyze data without coding algorithms. Additionally, the SAS underpinnings to JMP provide a wide and stable platform that can be trusted in its analysis. In total, JMP provides a tool that is easy to use and comes with a wide variety of built-in methods, the results of which can be trusted (something you can’t say about all statistical software). And, for those who wish to code boutique algorithms, JMP also supports this as well.


Who Should Read This Book This book is written for a variety of different persona groups. Although biostatistics is the focus, and is in the title, this book has broader appeal.

Biological/Medical Researchers and Laboratory Managers Researchers in the sciences, for example, biology and medicine, spend a large majority of their time performing experiments and a small fraction of their time analyzing data. Remembering how to use software that is only accessed a few times a year can be challenging. Thus, this book is aimed particularly at this group and provides a practical guide to analyzing collected biological/medical data.

Statisticians and Data Scientists This group might be interested in a broad look at how to use JMP to solve various problems and analyze data in JMP. While theory is light in this book, this group could easily learn the steps and nuances of JMP. Additionally, they would see practical data analysis and experimental data analysis using various JMP capabilities.

Students in Biostatistics or Statistics Classes Many biostatistical courses use excellent textbooks that cover the theory and examples for a wide variety of problems. However, these textbooks rarely discuss how to solve the problems, leaving students with the need to either code equations or learn various statistical software programs on the fly. This book is written from a general standpoint and can thus be combined with any biostatistical textbook. Additionally, since the statistical methods themselves can be used in many domains, this book can be combined with multiple statistics courses and textbooks.

Biostatistics Methods and JMP Functionality Covered in This Book

This Book Covers the Following Biostatistics Methods
● Data Cleaning – Data Wrangling – Descriptive Statistics – Data Visualization
● Rates – Proportion – Geographical Visualization – Epidemiology
● Confidence Intervals – Hypothesis Tests
● Linear Regression – Curve Fitting – General Linear Models
● Analysis of Variance (ANOVA) – Analysis of Covariance (ANCOVA) – Remedial Measures for Regression and ANOVA
● Cluster Analysis – Hierarchical Clustering – K-means
● Classification Analysis – Logistic Regression – Discriminant Analysis
● Survival Analysis – Meta Analysis – Control Charts – Neural Networks – Decision Trees


Structure of This Book Chapter 1 introduces this book and mirrors some content in this section. Additionally, Chapter 1 introduces how to start using JMP. Chapters 2 and 3 introduce data-wrangling issues, such as data collection and cleaning. These chapters are very helpful when analyzing real-world data using JMP. The basics of descriptive statistics and data visualization are presented in Chapters 4 and 5. After Chapter 5, the focus of this book is on developing statistical models to describe data. Chapters 6 through 13 present various approaches, and your data and goals will drive which chapter you should read. Chapter 6 discusses epidemiology and geographical data analysis. Chapter 7 discusses hypothesis tests and confidence intervals. Chapters 8 to 10 present models such as analysis of variance, regression, curve fitting, and model validation. Chapter 11 discusses classification and clustering methods. Chapter 12 presents advanced modeling methods. Chapter 13 discusses survival analysis. Finally, Chapter 14 presents collaboration methods, incorporating custom JMP tools and meta-analysis, as an example.

Additional Resources For downloads of sample data presented in this book, please visit my author page at: https://support.sas.com/bihl This site also includes downloadable color versions of selected figures that appear in this book. Since this book is printed in black and white, you might find that some color figures are easier to interpret and understand. Please visit this site regularly, as I will provide updates on the content.

We Want to Hear from You

SAS Press books are written by SAS Users for SAS Users. We welcome your participation in their development and your feedback on SAS Press books that you are using. Please visit sas.com/books to do the following:
● Sign up to review a book
● Recommend a topic
● Request information on how to become a SAS Press author
● Provide feedback on a book

Do you have questions about a SAS Press book that you are reading? Contact the author through [email protected] or https://support.sas.com/author_feedback. SAS has many resources to help you find answers and expand your knowledge. If you need additional help, see our list of resources at sas.com/books.


About the Author Trevor Bihl is both a research scientist/engineer and an educator who teaches biostatistics, engineering statistics, and programming courses. He has been a SAS and JMP user since 2009 and provides various biostatistics and data mining consulting services. His background includes multivariate statistics, signal processing, data mining, and analytics. His educational background includes a BS and MS from Ohio University and a PhD from the Air Force Institute of Technology. He is the author of multiple journal and conference papers, book chapters, and technical reports.

Learn more about this author by visiting his author page at http://support.sas.com/bihl. There you can download free book excerpts, access example code and data, read the latest reviews, get updates, and more.


Chapter 1: Introduction 1.1 Background and Overview ................................................................... 1 1.2 Getting Started with JMP..................................................................... 2 1.3 General Outline.................................................................................... 4 1.4 How to Use This Book .......................................................................... 5 1.5 Reference ............................................................................................ 5

1.1 Background and Overview This book evolved from personal experiences in both teaching and consulting in biostatistics. Although many biostatistics textbooks show computer outputs and results, they rarely show how to generate the results. Biostatistics instruction is also commonly theoretical and based on solving simple problems by hand. However, real-world data is usually more complicated than the simple examples, and I found that frequent collaborators–PhD-educated researchers who performed and managed experiments–needed more understanding of software to analyze their data. This is difficult because such researchers often spend a majority of their time performing experiments and use statistical software sparingly. They also often do not have access to a dedicated biostatistician in their office or are competing for the time of their office’s single biostatistician. Although such researchers might know the mechanics of a statistical method, they might not know how to generate meaningful results using software. Therefore, a practical, how-to guide to biostatistics was needed.

There are many software applications available for statistics and biostatistics, so why JMP? As an educator, I found JMP to be an advantage in teaching. I could spend more time on theory and interpretation because JMP does not require scripts and syntax. As a collaborator and consultant, I found my colleagues would readily gravitate toward JMP and its results because of the graphical user interface (GUI) format and its ease of use. And finally, unless you want to code the algorithms yourself, as a researcher, you will find JMP to be more user-friendly, correct, and developed when compared to many other competing packages. Incidentally, if you want to code, SAS programming abilities do exist in JMP. Thus, you can fully use JMP for analysis ranging from simple to complex and customized.

This book presents and solves problems germane to biostatistics with easy-to-reproduce examples. The book is also a general biostatistics reference that leverages the topics found in leading biostatistics books. This chapter introduces JMP, presents a general outline of the book contents, and provides a brief guide to using this book.


1.2 Getting Started with JMP When you first run JMP, you will be greeted with a Tip of the Day (Figure 1.1). There are 62 tips of the day, and they show up whenever you start JMP. These tips can be useful to new JMP users in gaining familiarity with the software. However, if you don’t want to see these tips further, you can do the following:

1. Clear Show tips at start-up.
2. Click Close.

Figure 1.1 Initial Tip of the Day

After you close the Tip of the Day, you are greeted with the primary JMP interface seen in Figure 1.2. Here, you can load data, create a new data table, or look for recently used files. If this is the first time you have opened JMP, there will be no recent files to consider. Thus, you must load or create a new data table; a scripted equivalent of both actions is sketched after the steps below.

To load a file:
● Click File ► Open.
or
● Click on the third icon on the taskbar.

To create a blank data table:
● Click File ► New.
or
● Click on the first icon on the taskbar.
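If you prefer to script these actions, the same two operations can be written in JSL (the JMP Scripting Language). This is a minimal sketch with a hypothetical file path and the default column name; it is an illustration, not a required step.

    // Open an existing data table from disk (hypothetical path)
    dt = Open( "C:/Data/MouseTendon.jmp" );

    // Create a blank data table with a single numeric, continuous column
    dt2 = New Table( "Untitled",
        New Column( "Column 1", Numeric, Continuous )
    );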

Alternatively, if you want to load a built-in JMP example data file, you can do so. A variety of files are available. To load example data files:

1. Click Help ► Sample Data.
2. Select a data file under the method of interest.

Also, you can select individual or multiple data tables in the Window List and then close all of these files. This is advantageous if you inadvertently opened many files, such as from a mistakenly set up Internet Open. To close many open data tables:

1. Select the windows of interest.
2. Right-click and select Close.
3. You will then be prompted to save these files.

Figure 1.2 JMP Primary Interface

If you create a new data table, you will be presented with Figure 1.3. Here, you see a spreadsheet-like table, with a Column 1 ready for you to begin entering data. Also, once you have loaded and analyzed data, you can save the results to the JMP data table and instantly reload them at a later date, as will be discussed in Section 14.2.

Figure 1.3 New Data Table

1.3 General Outline With this basic usability knowledge from Section 1.2, you are now ready to consider biostatistical data analysis. Biostatistics covers a wide variety of topics ranging from simple hypothesis tests to complex nonlinear algorithms. This book aims to cover the range of methods with varying levels of detail. To do so, this book is organized sequentially as outlined in Table 1.1. Table 1.1 General Outline of Biostatistics Using JMP: A Practical Guide

Chapter   Method
1         Introduction
2         Data Wrangling: Data Collection
3         Data Wrangling: Data Cleaning
4         Initial Data Analysis with Descriptive Statistics
5         Data Visualization Tools
6         Rates, Proportions and Epidemiology
7         Statistical Tests and Confidence Intervals
8         Analysis of Variance (ANOVA) and Design of Experiments (DoE)
9         Regression and Curve Fitting
10        Diagnostic Methods for Regression, Curve Fitting and ANOVA
11        Categorical Data Analysis
12        Advanced Modeling Methods
13        Survival Analysis
14        Collaboration and Additional Functionality


1.4 How to Use This Book In Chapters 2 and 3, this book moves to data-wrangling issues, such as data collection and cleaning. Since upward of 80% of your time can be spent in making messy data usable (Lohr, 2014), learning the tools JMP has for assembling and cleaning data is key and is covered in Chapters 2 and 3. In Chapters 4 and 5, you can learn the basics of descriptive statistics and data visualizations in JMP. Following this, the primary focus is on modeling, which involves creating a mathematical representation (a model) of data or a system in order to make inferences about it. After this, you have a few different paths available:

Chapter 6 discusses epidemiological and geographical interpretations. Chapter 4 also discusses developing custom equations. Chapter 7 discusses the various hypothesis test and confidence interval methods. Chapters 8 to 10 discuss linear models such as analysis of variance (ANOVA), regression, and model validation. Because of the interrelation of the underlying methods of regression (Chapter 9) and ANOVA (Chapter 8), diagnostic and remedial measures for these methods are discussed in Chapter 10. Chapter 11 discusses classification methods, such as logistic regression, and clustering methods, such as k-means. Chapters 7 to 10 largely deal with a continuous dependent variable (e.g., a continuous Y to be predicted); methods to analyze a discrete dependent variable are presented in Chapter 11. Chapter 12 presents advanced modeling methods (e.g., factor analysis, neural networks, and control charts). Chapter 13 introduces the vast array of survival analysis methods in JMP. Chapter 14 presents methods that facilitate collaboration, in addition to sources of additional functionality.

If you are using a previously created data set, then it is advantageous to start with the data-wrangling methods and then look at the various analytical tools this book discusses. However, if you are starting a new experiment and will be collecting data, then you should start by looking at Section 8.4.1, which discusses experimental design considerations and how to develop and select factor levels for an experiment.

1.5 Reference Lohr, S. (2014, Aug. 18). For big-data scientists, ‘janitor work’ is key hurdle to insights. New York Times, p. B4.


Chapter 2: Data Wrangling: Data Collection 2.1 Introduction ......................................................................................... 7 2.2 Collecting Data from Files ................................................................... 8 2.2.1 JMP Native Files ............................................................................................ 8 2.2.2 SAS Format Files ........................................................................................... 9 2.2.3 Excel Spreadsheets .................................................................................... 10 2.2.4 Text and CSV Format .................................................................................. 11 2.3 Extracting Data from Internet Locations ............................................ 14 2.3.1 Opening as Data .......................................................................................... 14 2.3.2 Opening as a Webpage ............................................................................... 15 2.4 Data Modeling Types ......................................................................... 17 2.4.1 Incorporating Expression and Contextual Data ....................................... 18 2.5 References ........................................................................................ 19

2.1 Introduction Many examples in textbooks and manuals such as this book are very orderly and clean. However, real-world data is rarely orderly and clean (authors actually spend a lot of time searching for and fabricating good example data) and up to 80% of your time can be lost in “wrangling” to make messy data usable (Lohr, 2016). Medical and biological data is also becoming increasingly “big” in nature (Bihl, Young II, and Weckman, 2016), and thus wrangling the data can be of interest (Marx, 2013). To analyze messy real-world data, data wrangling comes into play. Data wrangling, conceptualized in Figure 2.1, involves taking raw data, extracting and cleaning it, and developing data features for analysis. The boundaries are not always clear when data wrangling ends and when statistical analysis begins. Thus, some overlap exists between data wrangling and statistical analysis when you begin to select/extract/analyze data features (e.g., the columns of a JMP data table). Figure 2.1 Data-Wrangling Overview, adapted from (Boehmke, 2016)

Data wrangling is not frequently discussed in conjunction with analysis because books, including this one, focus on developing knowledge in using the presented methods. However, it is important to mention data-wrangling issues since the real world is not full of clean, textbook-style data. Thus, this book considers two data-wrangling tasks, with Chapter 2 focusing on data collection in JMP and Chapter 3 focusing on data cleaning using JMP tools. Discussion of data analysis methods then makes up the majority of this book, Chapters 4 to 13. This chapter will present data collection methods built into JMP. JMP can load both JMP and SAS format files, in addition to loading Excel and CSV files. Moreover, data can be imported from text files and Internet webpages using the sophisticated data importation tools in JMP.

2.2 Collecting Data from Files JMP supports opening a wide variety of data file formats, as detailed in Table 2.1, with annotation on whether JMP can load or save in that format. By supporting a wide variety of data types, JMP facilitates work on older data sets, with users who do not use JMP, and with a wide audience. To aid in usability, this book will consider examples from some of these data types (*.txt, *.sas7bdat, *.csv).

Table 2.1 Data Files Supported by JMP

Type                                      Extension                     Load   Save
Comma-separated                           *.csv                         X      X
Data files containing text                *.dat                         X      X
ESRI shapefiles                           *.shp                         X
Flow Cytometry versions 2.0 and 3.0       *.fcs                         X
HTML                                      *.htm, *.html                 X
JSON files                                *.json                        X
MATLAB                                    *.m, *.M                      X
Microsoft Excel                           *.xls, *.xlsx                 X      X
Minitab Portable Worksheet                *.mtp                         X
Plain text                                *.txt                         X      X
R                                         *.r                           X
SAS transport                             *.xpt, *.stx                  X      X (*.xpt)
SAS versions 7 through 9                  *.sas7bdat, *.sas7bxat        X      X (*.sas7bdat)
SPSS files                                *.sav                         X
Tab-separated                             *.tsv                         X      X
Teradata database                         *.trd                         X
Triple-S                                  *.sss, *.xml                  X
xBase data files                          *.dbf                         X

2.2.1 JMP Native Files JMP uses a proprietary data format native to JMP software. As with any file format, this format describes how to handle the pieces of the file. However, not all software applications have the keys to unlock all data types. Thus, JMP knows how to assemble a data table from a *.jmp file, but other software applications likely do not. However, various advantages exist to the JMP native format. If you perform an analysis in JMP, you can save this analysis in the JMP file and reload it on a different computer exactly as you left it. Although you could save to a *.csv (or other formats, as shown in Table 2.1), the ability to reload an analysis would be lost by saving outside the *.jmp format.
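As a small illustration (not one of the book's examples), the same trade-off shows up when saving from a script. This is a minimal JSL sketch with hypothetical paths, assuming the data table's Save message writes the format implied by the file extension.

    dt = Open( "C:/Data/StudyResults.jmp" );         // hypothetical path
    dt << Save( "C:/Data/StudyResults_copy.jmp" );   // keeps saved scripts and analyses with the table
    dt << Save( "C:/Data/StudyResults_copy.csv" );   // plain CSV copy; JMP-only features are not preserved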


2.2.1.1 JMP Sample Data A wide variety of sample data sets are available in JMP to illustrate the application of specific methods and to provide data for analysis and demonstration. In JMP 13, 543 sample data files are included and available for use. Their applications range from simple to complex, and this enables you to become familiar with both JMP data and methods through exploring them. To access the folder containing all sample data available in your copy of JMP:

1. Select Help ► Sample Data Library.
2. In the Explorer window that opens to the data directory, double-click on the file of interest to load it.

To access the sample data listed by method or domain of interest:

1. Select Help ► Sample Data.
2. Select the method or domain of interest (e.g., click Analysis of Variance). Note that you can further see the type of method these files best correspond with.
3. Select an example file of interest (e.g., Blood Pressure).
4. A data table for this file is then opened.
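The sample data library can also be opened from a script using the $SAMPLE_DATA path variable that ships with JMP; the Blood Pressure table named here is the example file mentioned above.

    // Open a built-in JMP sample data table
    dt = Open( "$SAMPLE_DATA/Blood Pressure.jmp" );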

2.2.2 SAS Format Files SAS files are natively supported by JMP. However, a wide variety of SAS file extensions are in use (e.g., *.sas, *.ss2, *.sc2, etc.). Thus, some care and control might be of interest when loading a file from SAS into JMP. For an example, you can load a SAS data file to JMP:

1. Select File ► Open.
2. Find the SAS file of interest and click once on it.
3. After you select a SAS file, JMP enables you to specify how to open the file. (See Figure 2.2.)
4. Click Open when you are satisfied with your selection.
5. You will then be greeted with a JMP data table.

As hinted at in Figure 2.2, JMP knows what to do with a SAS file, and thus JMP is asking you whether you want your JMP data table to have column names from the SAS variable names or the SAS variable labels. After making this selection and opening the file, you will be greeted with a JMP data table.

Figure 2.2 Opening a SAS File
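In script form, the import itself reduces to a single Open() call on the SAS data set; the path below is hypothetical.

    // Import a SAS data set (*.sas7bdat) directly into a JMP data table
    dt = Open( "C:/Data/study_results.sas7bdat" );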

2.2.3 Excel Spreadsheets Microsoft Excel is a spreadsheet software with worldwide use, and thus JMP readily handles the importation of data from Excel spreadsheets. Since Excel can contain data in multiple sheets of the same file, some care and user interaction might be necessary to import data from Excel. An example will be considered using a generic data file ExcelExample.xlsx, which contains nothing more than two columns of user-generated numbers. For an example, you can load an Excel data file to JMP:

1. Select File ► Open.
2. Find the Excel file of interest and click once on it.
3. After you select an Excel file, JMP enables you to specify how to open the file. (See Figure 2.3.)
   a. The key consideration at this point is whether Row 1 values are column/variable names or are numbers.
4. Click Open when you are satisfied with your selection.
5. You will then be greeted with the Excel Import Wizard. (See Figure 2.4.)
   a. At the top left, you can view the Excel data.
   b. At the top right, select the Excel sheet of interest. (This example has only one sheet with numbers.)
   c. Specify needed considerations using the options in the middle (e.g., where the data is located).

6. Click Import when you are satisfied that the view in the top left is what you want the data table to look like.

Figure 2.3 Opening an Excel File

Figure 2.4 Excel Import Wizard Dialog Box
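The same Excel import can be captured as a script. The sketch below assumes the example workbook and a sheet named Sheet1; the Worksheets option is my assumption for selecting a single sheet, so check the Source script that JMP saves to the new data table after an interactive import for the exact settings it used.

    // Import one sheet from an Excel workbook (file and sheet names assumed)
    dt = Open( "C:/Data/ExcelExample.xlsx",
        Worksheets( "Sheet1" )
    );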

2.2.4 Text and CSV Format JMP can load text files as well. However, some care must be taken in this process since text files can be highly unstructured. To accommodate this, JMP incorporates a variety of options when opening a text file. For this example, a small file, TextExample.txt, was created with two tab-separated columns of data. Loading a *.csv file will be similar.

To load a text file:

1. Select File ► Open.
2. Find the text file of interest and click once on it.
3. After a text file is selected, JMP enables you to specify how to open the file. (See Figure 2.5.)

A variety of options exist, as shown in Figure 2.5. Selecting Data, using Text Import preferences lets JMP use whatever preferences have been selected. Selecting Data, using best guess lets JMP examine the data and use what it believes to be the best approach to open the data. Selecting Plain text into Script window will not load the file as a data table, but as a script text file. Here you can select Data with Preview, which will allow you more control of the text loading and show additional functions that can assist in loading text files.

Figure 2.5 Opening a Text File

After you select Data with Preview and click OK, you are greeted with the Import dialog box, as shown in Figure 2.6. Here, you see the text file that you have loaded. For this text file, you see a blue line separating the Day and Meas columns, indicating which parts of the text file JMP will assign to data table columns. The text file under consideration has columns separated by tabs. JMP was able to detect this and its best guess is that the file is tab delimited. However, you could select a wide variety of delimited or fixed-width fields. In addition, you could have selected a subset of this file or a compatibility standard, as shown in the expanded sections at the bottom of Figure 2.6. To continue loading:

1. Ensure that the lines for data columns at the top of the Import dialog box are as desired.
2. Click Next.
3. Change the names of the columns, if needed.
4. Click Import.

We are now greeted with the resultant data table as shown in Figure 2.7.

Figure 2.6 The Text File Import Dialog Box with All Fields Displayed

Figure 2.7 Data Table from Text Imported File
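Text and CSV imports can likewise be scripted. The simplest form lets JMP apply its best-guess settings, and the Source script attached to the resulting data table records the delimiter and column choices it made; the paths below are hypothetical.

    // Best-guess import of delimited text files
    dt_txt = Open( "C:/Data/TextExample.txt" );
    dt_csv = Open( "C:/Data/PatientList.csv" );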


2.3 Extracting Data from Internet Locations Data is not always conveniently in a spreadsheet or file. Frequently, data can be found on webpages, and thus you need to bring the data to the software. Various options exist to accomplish this task, including copying and pasting the data into a spreadsheet. However, this can be inefficient. To improve the process, you can directly load a webpage and extract its data using JMP. To collect data from the Internet in JMP, begin by selecting File ► Internet Open. The dialog box shown in Figure 2.8 will appear. Type or paste the URL into the field. For this example, consider the following URL: https://en.wikipedia.org/wiki/Reported_Road_Casualties_Great_Britain This is a Wikipedia article that has data from road casualties in Great Britain from 1926 to 2015. Figure 2.8 Internet Open

After you type the URL into the field, you must specify how it will be opened via the Open As drop-down menu. JMP supports three options, listed in Table 2.2. For this example, examine opening as data or text.

Table 2.2 Internet Open Data Options
◦ Data: JMP searches for tables on the URL and then presents options.
◦ Text: The URL is opened as a text file.
◦ Webpage: The URL opens in a built-in JMP browser; specific tables can then be selected.

2.3.1 Opening as Data

If you select the default option, JMP will open the webpage as data. A dialog box like that in Figure 2.9 will appear. This dialog box shows the various items JMP sees as tables:

1. Select the desired table or tables.
2. Click OK.

This example shows only the first available table selected. The resultant data table will appear, as shown in Figure 2.10.

Figure 2.9 Importing a Table When Opening as Data

Figure 2.10 Imported Data Table

2.3.2 Opening as a Webpage

After JMP opens the URL as a webpage, you can use the page (Figure 2.11) as you would any browser. If you investigate this webpage in the JMP browser, you will find that there are two tables of possible interest on it.

Figure 2.11 Wikipedia Page for Reported Road Casualties in the JMP Browser

After you become familiar with the webpage, perform the following steps to create a data table:

1. Select File ► Import Table as Data Table. A dialog box like that in Figure 2.9 will appear.
2. Select the desired table or tables.
3. Click OK.

If you select both tables, JMP will instantiate two data tables: the table seen in Figure 2.10, and the table seen in Figure 2.12, which has additional information. However, you should not begin to investigate the data in Figure 2.12 yet. First, you should review the URL; if you examine the webpage as shown in Figure 2.11, you will find that the data in Figure 2.12 is for the year 2008 only. Second, the Ref. and Note columns of both Figure 2.10 and Figure 2.12 could be either very troublesome or very useful, depending on what you plan to do next.

Figure 2.12 Types of Casualties
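The table-scraping idea behind the Data option can also be illustrated outside JMP. The sketch below uses Python's pandas, which similarly scans a page for HTML tables; it is only an illustration and assumes an HTML parser such as lxml or html5lib is installed.

```python
import pandas as pd

# read_html scans the page for HTML tables and returns a list of DataFrames,
# similar in spirit to JMP presenting the tables it finds at the URL.
url = "https://en.wikipedia.org/wiki/Reported_Road_Casualties_Great_Britain"
tables = pd.read_html(url)

print(len(tables))       # how many tables were found on the page
print(tables[0].head())  # the first table, comparable to Figure 2.10
```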

2.4 Data Modeling Types

JMP data tables can contain columns with continuous values, categorical (nominal or ordinal) values, and additional types (such as pictures). How these columns are modeled within JMP relates to the data type and modeling type. The available data modeling types and their associated symbols are presented in Figure 2.13, with Figure 2.13a showing modeling types for numeric data and Figure 2.13b showing additional modeling types. Basic data types include numeric, character (for text), row state (as for a column that shows the states of individual rows), and expression (such as pictures).

Figure 2.13 Data Modeling Types in JMP 13

The numeric data type should be used for continuous random variables, ordinal and nominal should be used for discrete values (both numeric and text), and none should be used for columns that fit none of these categories. Characters are frequently seen in categorical groups (e.g., a nominal column with gender) and with unstructured text (e.g., raw observational data from a lab technician). Expressions can be useful for contextual data (e.g., you want to include a picture from a stained slide to link the graphical data to the numeric results that you will analyze). Although the focus of this book will be on numeric data and nominal character data, an example will be seen in Section 2.4.1 for the built-in Big Class Families data set. Additional discussion of expression data will also be presented in Section 5.6.2 for the same data set. Discussion of editing and formatting column properties and data or modeling types can be found in Section 3.2.3.

2.4.1 Incorporating Expression and Contextual Data

If you add a picture for each observation, you can increase the value of the data table by providing contextual information, such as a picture of each subject or a picture of the slide associated with a given sample. An example is presented in Figure 2.14 for the built-in Big Class Families data set, which includes pictures of the hypothetical students in the hypothetical middle school class under analysis.

Figure 2.14 Big Class Families Data Table

However, how to create such a data table might not appear straightforward at first. Two general approaches can be used to bring in such contextual information. The first is to open a webpage that includes figures associated with the observations, using the Internet Open feature (see Section 2.3.2). Alternatively, you might want to add pictures from a folder into the data table. To do this, you would create a data table and then add a column with the contextual data in the form of a picture. To add such contextual information:

1. Specify the column of interest as the Expression modeling type:
   a. Right-click on the column.
   b. Select Column Info.
   c. Select Expression from the Data Type drop-down menu.
   d. Click OK.
2. Drag and drop the desired figure to each individual cell.

Further editing and formatting of column properties and data or modeling type can be found in Section 3.2.3. In addition, incorporating such contextual expression data will be presented in Section 5.6.2 for the built-in Big Class Families data set.


2.5 References

Bihl, T. J., Young II, W. A., and Weckman, G. R. (2016). Defining, understanding, and addressing big data. International Journal of Business Analytics (IJBAN), 3(2), 1-32.
Boehmke, B. (2016). Data Wrangling with R. Springer.
Lohr, S. (2016). For big-data scientists, 'janitor work' is key hurdle to insights. New York Times. Retrieved from http://nyti.ms/1Aqif2X.
Marx, V. (2013). Biology: The big challenges of big data. Nature, 498(7453), 255-260.


Chapter 3: Data Wrangling: Data Cleaning

3.1 Introduction
3.2 Tables
3.2.1 Stacking Columns
3.2.2 Basic Table Organization
3.2.3 Column Properties
3.3 The Sorted Array
3.4 Restructuring Data
3.4.1 Combining Columns
3.4.2 Separating Out a Column (Text to Columns)
3.4.3 Creating Indicator Columns
3.4.4 Grouping Inside Columns
3.5 References

3.1 Introduction

After data has been collected and imported into JMP, as described in Chapter 2, you would obviously like to analyze this data. However, raw data can have various issues that require cleaning. For example, a variable (a column in a JMP data table) might contain both haemoglobin and Haemoglobin. Although a person can tell that these are the same, computers pay attention to case and would treat them as separate groups. Thus, data cleaning (step 2 of the data-wrangling framework in Figure 2.1) must often be considered. This chapter presents various data-cleaning methods to sort data, combine vectors, and restructure data. JMP's capabilities go beyond those presented, and thus this chapter is intended as a springboard for helping you become familiar with these tools.

3.2 Tables

All data that is processed in JMP is handled via tables. Tables can contain columns with continuous and categorical (nominal or ordinal) values. The best description of tables in JMP comes from an example. For the first example in this chapter, consider two variables, subject number and age, for 366 subjects in the dermatological data set (Demiroz, Govenir, and Ilter, 1998), as available from the UCI Machine Learning Repository (Lichman, 2013). In this chapter, a subset of the data will be considered with two columns of data: subject number and subject age. As originally organized, the ages are sorted by the order number (1 to 366) in which they appeared in the original data set; sorting by age instead organizes the data more logically. To load this data set, begin by opening DermatologyAge.jmp.

Figure 3.1 shows the results of the data being loaded. Notice that there are four columns (data features) with the following names: SUBJ, AGE, SUBJ 1, and AGE 1. As initially considered, the SUBJ and AGE columns were split into two columns, each with 183 observations. For analysis, these columns need to be appropriately combined. When considering this data table, note that SUBJ is the subject number of each observation and ranges sequentially from 1 to 366. As initially considered, SUBJ's indices correspond with the row numbers in the table. Although this example was contrived, such examples can be seen in real-world data sets such as MNIST (LeCun, Cortes, and Burges, 1998) and in many data sets presented in another source (Hand, Daly, Lunn, McConway, and Ostrowski, 1994).

Figure 3.1 Example Data from Dermatological Data Set

On the left of the table are three small windows. The first, DermatologyAge, describes what files and scripts are associated with this JMP file. It currently lists only the data file itself, which is described as a locked file. The second window, seen in Figure 3.2, describes the types of data columns, presented with symbols. It currently indicates (4/1), which refers to there being four data columns and one column currently selected. If you were to select a column or single observation, this would change to indicate the number of associated columns selected. Inside this window are the column names with graphical annotation of the type of data feature. The SUBJ and SUBJ 1 columns are currently defined as nominal (with a symbol of red bars), and the AGE and AGE 1 columns are currently defined as continuous (with the symbol being a blue right triangle). If you right-click on either symbol, options to change the modeling type to nominal or ordinal appear. However, the SUBJ and SUBJ 1 columns cannot be changed to continuous because their initial data type is non-numeric. The next two sections cover concatenating columns and editing the modeling type.

Figure 3.2 Modeling Types of Data Columns in DermatologyAge Data

The final window contains properties for the rows of the data table. The total number of rows is the first quantity, 183. The subsequent quantities refer to the number of currently selected rows, labeled rows, and any hidden or excluded rows from analysis. Excluding, hiding, and labeling rows permits user-centric abilities such as considering the removal of selected points (e.g., outliers), sequestering points for training/testing, and labeling points for annotation. These abilities further allow you to use JMP for analysis without deleting rows from the data table.


3.2.1 Stacking Columns

To combine columns by stacking one on top of another, it is possible to copy and paste the desired cells. However, this would involve storing the copied cells on your computer's clipboard, which might be memory-intensive for a large data table. One alternative is to use the built-in Stack tool in the Tables menu:

1. Select Tables ► Stack. The dialog box shown in Figure 3.3 will appear.

Figure 3.3 Stack Controls in JMP

2. Click on SUBJ and SUBJ 1.
3. Select the Stack Columns button.
4. Clear Stack By Row (this would interleave the data).
5. Type DermatologyAge into Output table name.
6. Type SUBJ into the New Column Names field of Stacked Data Column.
7. Click OK.

The table should look like the data table in Figure 3.4. Notice that the title is DermatologyAge 2. This is because the original data table is already named DermatologyAge, and thus JMP will not use the name twice for opened files. It is also necessary to repeat the process exactly for AGE and AGE 1. Moreover, it is critical that you follow the steps identically for subsequent columns. For example, if you failed to deselect Stack By Row in the second iteration, you would inadvertently make the data table out of order.

Figure 3.4 Dermatological Data After Stacking SUBJ Columns

After following these directions, notice that the resultant data table, Figure 3.5, includes a single column for SUBJ and a single column for AGE. However, there are two minor issues. First, there are two additional columns, Label and Label 2, which identify which original SUBJ column each SUBJ value came from, and similarly for AGE. Second, the SUBJ column still has a nominal modeling type. The following sections show you how to delete columns and how to use the Column Properties platform to fix data property issues.

Figure 3.5 Dermatological Data After Stacking AGE Columns
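To make the stacking operation concrete outside JMP, the same reshaping can be sketched in Python with pandas. This is only an illustration of what Stack does to the data, assuming the table has been exported to a hypothetical DermatologyAge.csv with the four columns from Figure 3.1.

```python
import pandas as pd

# Hypothetical export of the JMP table with columns SUBJ, AGE, SUBJ 1, AGE 1.
df = pd.read_csv("DermatologyAge.csv")

# Stack the second half of the subjects underneath the first half,
# mirroring Tables > Stack applied to SUBJ/SUBJ 1 and then AGE/AGE 1.
top = df[["SUBJ", "AGE"]]
bottom = df[["SUBJ 1", "AGE 1"]].rename(columns={"SUBJ 1": "SUBJ", "AGE 1": "AGE"})
stacked = pd.concat([top, bottom], ignore_index=True)

print(stacked.shape)  # expect (366, 2): 183 + 183 observations
```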

3.2.2 Basic Table Organization

Adding and reordering columns are two common tasks for spreadsheets. Although it is straightforward to do these in JMP, it is important to preserve the data structure, since JMP is statistical software and not merely a spreadsheet. In addition to highlighting, cutting, and copying cells and columns, you can also initialize columns to random numbers and reorder them as needed.

3.2.2.1 Reordering Columns

To reorder columns, drag and drop them within the columns panel shown in Figure 3.2. Alternatively, use the built-in tools:

1. Click on a given column name (e.g., AGE).
2. Select Cols ► Reorder Columns ► Move Selected Columns.
3. Click on To first.
4. Click on AGE.
5. Click OK.

Now AGE appears as the first column in the data table, as shown in Figure 3.6.

Figure 3.6 Dermatological Data After Reordering Columns

For the rest of this chapter, the data was reordered with SUBJ as the first column and AGE as the second column.

3.2.2.2 Deleting Columns

If you find that you have unwanted blank columns, or if you simply want to remove a column, delete it as follows:

1. Right-click on a given column (e.g., Label).
2. Select Delete Column.

Figure 3.7 shows the result: the Label column has been removed.

Figure 3.7 Dermatological Data After Deleting Label Column

Alternatively, use the Cols menu:

1. Click on a given column or columns (e.g., Label 2).
2. Select Cols ► Delete Columns.

The Label 2 column no longer appears in the data table.

3.2.2.3 Adding Columns

New columns can be added via a few different approaches in JMP. First, you can double-click on an empty column and it will be populated with blank values (●) and a generic column name (e.g., Column 12). You can edit the column name and properties as described above.

You can also add columns with more control by using the Cols drop-down menu:

1. Select Cols ► New Column.
2. A column properties box will appear, as shown in Figure 3.8, and you can specify the characteristics as above.
3. Change settings as needed (e.g., change the value in Number of columns to add if you want multiple columns).
4. Click OK.

An individual column (or multiple columns) will now appear, created with control over the location, types of values, and names of the new columns.

Figure 3.8 New Column Commands

To add random values, possibly for sorting data randomly, in the New Column platform:

1. Select Initialize Data ► Random.
2. Select the desired random number type (e.g., Random Integer was selected here to facilitate random sorting).
3. Change the random number settings as needed (e.g., 1 for minimum and 366 for maximum was used here since the data has 366 observations, as shown in Figure 3.9).
4. Make any other changes to the new column properties.
5. Click OK.

Figure 3.9 New Column Commands with Random Data Options

The resultant column, shown in Figure 3.10, enables you to sort the data randomly if you want. For such a process, follow the methodology in Section 3.3, "The Sorted Array."

Figure 3.10 Dermatological Data with Added Random Column

3.2.3 Column Properties

When loading data from spreadsheets, formatting issues are common. Issues such as numbers carrying text properties when a table is read in, or placeholder values such as NAN or MISSING, can make analysis difficult, since one text string in a column will define the entire column as non-numeric. To view or change column properties, consider SUBJ first.

1. Right-click on SUBJ.
2. Select Column Info.

Figure 3.11 shows the dialog box.

Figure 3.11 Column Details of SUBJ Column

Here you can change the name of the column (you can also do so by double-clicking on a column name), change the data type (choices include character, numeric, and row state), and change the modeling type (continuous, nominal, and ordinal). When you are handling numeric data that is misidentified as character data, change the data type to numeric and then change the modeling type to the appropriate setting (e.g., continuous for both columns of the DermatologyAge data table). To change SUBJ to be continuous:

1. Click on the drop-down menu for Data Type.
2. Select Numeric.
3. Click on the drop-down menu for Modeling Type.
4. Select Continuous.

To make additional column properties available for viewing or changing:

1. Click on the Column Properties drop-down menu.
2. Select the desired field.

When you are finished with Column Info, click OK. After following these directions, notice that the modeling type of SUBJ has changed to continuous, and both data columns are now continuous.

3.3 The Sorted Array

The ordered array, or sorted array, refers to a data structure that has been organized in some manner. Common organization methods include sorting a specific variable from smallest to largest, or sorting alphabetically (for categorical data). Sorting data is a primary task to help you become familiar with data. Although sorting is very helpful in personally understanding the data, you do not need to sort the data to compute statistical values or to produce a histogram. For an example of sorting, consider the DermatologyAge data table, as above:

1. Open DermatologyAge.jmp.
2. Select Tables ► Sort.

You are presented with the sorting dialog box shown in Figure 3.12.

Figure 3.12 The Sort Platform

3. Select AGE for By. This tells JMP which column of data to sort the table by.
4. Select Replace Table. This will replace the original table with an age-sorted table.

The sorting dialog box should appear as shown in Figure 3.13.

Figure 3.13 The Sort Platform, Set Up to Sort by AGE

5. Click OK.

After you click OK, look at the data table, as shown in Figure 3.14. You can instantly tell that it has been reorganized because the SUBJ column is no longer organized sequentially starting at 1. In addition, all of the observations with missing AGE values have been sorted to appear first (rows 1-8); thus, the locations of the missing values are easily found. If you consider the non-missing data, the first data value appears in row 9, which is SUBJ number 120, who has an age of 0. When JMP performs further analysis on the data, it ignores missing values and thus uses an implicit deletion imputation (similar to Microsoft Excel).

Figure 3.14 AGE Sorted Dermatological Data
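The effect of this sort, including where the missing ages end up, can also be sketched outside JMP. The Python snippet below is only an illustration and assumes a hypothetical CSV export of the stacked table.

```python
import pandas as pd

df = pd.read_csv("DermatologyAge.csv")  # hypothetical export with SUBJ and AGE columns

# Sort by AGE, placing missing values first, which mirrors the result in Figure 3.14.
sorted_df = df.sort_values(by="AGE", na_position="first")
print(sorted_df.head(10))

# Like JMP, most summary functions simply ignore the missing values.
print(df["AGE"].mean())  # computed over the non-missing ages only
```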

3.4 Restructuring Data

In addition to the basic column operations, JMP enables you to heavily restructure a data table. For this example, consider the Consumer Preferences.jmp file included in JMP as a sample data set. This data set considers many fields. Of most interest are the biomedically relevant columns related to dental care: the flossing and brushing columns (which record when or whether a subject flosses and brushes). These columns are located near the middle of the data table.


3.4.1 Combining Columns

In some cases, numbers and variables are spread across columns after being loaded from the original file. To alleviate this, combine columns by selecting them and telling JMP how you want to combine them. Here is an example of combining columns:

1. Select the columns Floss After Waking Up, Floss After Meal, Floss Before Sleep, and Floss Another Time.
2. Select Cols ► Utilities ► Combine Columns.
3. Type in the new column name in the Combine Columns dialog box, as shown in Figure 3.15.
4. Determine the Delimiter you want.
5. The default of a comma will result in data that looks like 1,0,0,0 as shown in the Floss 2 column of Figure 3.16.
6. Having no delimiter will result in data that looks like 1000, as shown in the floss0 column of Figure 3.16.
7. Click OK.

The resultant column's appearance greatly depends on the delimiter used. For an example, Figure 3.16 shows the result using a comma (consistent with a CSV file) or no delimiter. Although you might have a preference for one type (no delimiter might be helpful for some tasks), be careful because it would be difficult to separate a non-delimited column.

Figure 3.15 The Combine Columns Dialog Box

Figure 3.16 Combined Floss Columns

3.4.2 Separating Out a Column (Text to Columns)

The opposite action might be necessary. For example, you might load in a data table and see that a variable has many comma-separated fields. To break such a column out into separate columns:

1. Select the columns of interest (e.g., Floss 2 from Section 3.4.1).
2. Select Cols ► Utilities ► Text to Columns.
3. Specify the delimiter.
4. Click OK.

In this example, four columns will now be extracted. These are currently text columns; if the underlying data is numeric, this could easily be changed by changing the column properties.
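A hedged sketch of both operations (Sections 3.4.1 and 3.4.2) in Python with pandas follows. It only illustrates what Combine Columns and Text to Columns do to the data, and it assumes a hypothetical CSV export of the Consumer Preferences table.

```python
import pandas as pd

df = pd.read_csv("ConsumerPreferences.csv")  # hypothetical export of the JMP sample data
floss_cols = ["Floss After Waking Up", "Floss After Meal",
              "Floss Before Sleep", "Floss Another Time"]

# Combine Columns: join the four 0/1 columns into one comma-delimited text column ("1,0,0,0").
df["Floss 2"] = df[floss_cols].astype(str).agg(",".join, axis=1)

# Text to Columns: split the combined column back into four separate text columns.
split = df["Floss 2"].str.split(",", expand=True)
print(split.head())
```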

3.4.3 Creating Indicator Columns

For methods where a numeric value is easier to interpret than a categorical or text value, you might want to create an indicator column (a column of 1s and 0s). Here is an example of creating an indicator column:

1. Select the columns of interest (e.g., I come from a large family).
2. Select Cols ► Utilities ► Make Indicator Columns.
3. Select the indicator column features needed, as shown in Figure 3.17. Append Column Name is very useful and will include the original column name to help bookkeeping later.
4. Click OK.

Figure 3.17 The Indicator Column Dialog Box

How the resultant columns look depends on the options selected in Figure 3.17. For an example, Figure 3.18 shows the resultant two indicator columns (one where a 1 indicates the original column said Agree, the other where a 1 indicates the original column said Disagree). These can be used when one needs a numeric indicator rather than a categorical indicator. After you select Append Column Name in Figure 3.18, the data table appears as seen in Figure 3.19, with indicator columns for both agreement and disagreement (binary, 0 or 1 values) with the I come from a large family variable.

Figure 3.18 The Make Indicator Dialog Box

Figure 3.19 Data Table with Indicator Columns
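The same one-hot idea can be illustrated outside JMP. The snippet below is only a sketch of indicator (dummy) coding in Python with pandas, again assuming a hypothetical CSV export of the sample data.

```python
import pandas as pd

df = pd.read_csv("ConsumerPreferences.csv")  # hypothetical export

# get_dummies creates one 0/1 column per category; the prefix plays the role of
# Append Column Name by keeping the original column name for bookkeeping.
indicators = pd.get_dummies(df["I come from a large family"],
                            prefix="I come from a large family")
print(indicators.head())
```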


3.4.4 Grouping Inside Columns

In some cases, case (upper/lower) issues, similar responses, and other differences result in there being many similar values in a data column. For example, you can examine the last column, Reasons Not to Floss, of the Consumer Preferences.jmp data. If you examined the data table, you would see that this column has the Unstructured Text modeling type and that there are 398 unique entries in a data set that has 448 rows. To condense this data column into a more parsimonious set of outputs, examine the data with the Recode platform:

1. Select the columns of interest.
2. Select Cols ► Recode.
3. The Recode dialog box will appear, as shown in Figure 3.20.

At the top of Figure 3.20, notice that there are counts of each unique response in both the Original (Old Values) column and the New Values column. Initially, both columns have the same count of unique values, 398 for this data.

Figure 3.20 The Recode Dialog Box

To analyze the data:

1. Click the red triangle next to Reasons Not to Floss.
2. Select Group Similar Values.
3. The dialog box shown in Figure 3.21 will appear.
4. Depending on the nature of the data, you might want to change some settings. For this example, you can safely leave all options checked.
5. The difference ratio is the percentage difference across which JMP is allowed to group. The default is 0.25, which means that JMP will group responses that differ by no more than 25%.
6. Click OK.

Figure 3.21 The Grouping Options Dialog Box

At this point, notice that you now have 345 unique responses, with very similar responses grouped together and highlighted in gray, as shown in Figure 3.22. If you were to change the difference ratio to 50% or less, you would have 293 unique responses. Continuing this process, you could find a good balance of difference ratio and grouping options. Thus, you would want to find a reasonable difference ratio, one that benefits the final analysis (having few enough responses to understand) while not losing resolution in the data (by keeping a sufficiently distinct set of responses). To complete the example:

1. Click Done.
2. Select the type of output. In general, you would want one of the following:
   ◦ New Column to place a new column next to the original column
   ◦ In Place to overwrite the original column
3. This column will now appear in the data table.

Figure 3.22 Initial Recoded Results
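The difference-ratio idea can be illustrated outside JMP with a small, hedged Python sketch using difflib. JMP's exact grouping algorithm is not reproduced here; the responses and the greedy grouping below are purely hypothetical and only show how a similarity threshold drives grouping.

```python
from difflib import SequenceMatcher

# Hypothetical survey responses with case and spelling variation.
responses = ["I forget", "i forget.", "Takes too long", "takes to long", "No time"]

def same_group(a, b, difference_ratio=0.25):
    # SequenceMatcher returns a similarity ratio in [0, 1]; two responses are
    # grouped when they differ by no more than the difference ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 1 - difference_ratio

groups = []  # each group is a list of responses treated as one recoded value
for response in responses:
    for group in groups:
        if same_group(response, group[0]):
            group.append(response)
            break
    else:
        groups.append([response])

print(groups)
```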

3.5 References

Demiroz, G., Govenir, H. A., and Ilter, N. (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13(3), 147-165.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., and Ostrowski, E. (1994). A Handbook of Small Data Sets. London: Chapman & Hall.
LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.
Lichman, M. (2013). Dermatology Data Set. Retrieved February 5, 2016, from UCI Machine Learning Repository: http://archive.ics.uci.edu/ml.


Chapter 4: Initial Data Analysis with Descriptive Statistics

4.1 Introduction
4.2 Histograms and Distributions
4.2.1 Histograms
4.2.2 Box Plots
4.2.3 Stem-and-Leaf Plots
4.2.4 Pareto Charts
4.3 Descriptive Statistics
4.3.1 Sample Mean and Standard Deviation
4.3.2 Additional Statistical Measures
4.4 References

4.1 Introduction

When you begin to analyze a data set, descriptive statistics and visualizations are among the first methods to use. Both provide a brief overview of a data set, and both lend themselves to interpretation by an audience or reader. To fully understand your data, you should always start by examining it with these methods. Descriptive statistics involve computing basic quantitative information about data and are what should first come to mind when referring to statistics. Basic descriptive statistics include measures such as means and variances. However, there are many finer details in both their operation and their JMP 13 implementation that need exploration and understanding before considering more advanced statistical computations. This chapter discusses how to use descriptive statistics in JMP to analyze data. Histograms and Pareto charts are introduced to show you how to examine the distribution of data. Descriptive statistics in JMP are then presented to teach you how to provide quantifiable information about distributions. The JMP Formula Editor is also introduced to aid in determining the number of histogram bins to use.

4.2 Histograms and Distributions

It is not always intuitive to present or interpret an ordered array; however, a histogram (a bar chart of a frequency distribution) can instantly provide a subjective assessment of a data set. Visually presenting data is important both to understand data and to effectively describe it to others. A histogram reflects the distribution of data by counting the number of observations within range bins.


4.2.1 Histograms

Relative frequencies of data can be easily presented using a histogram. This example uses the same DermatologyAge data as the previous examples to generate a histogram.

1. Open DermatologyAge.jmp.
2. Select Analyze ► Distribution.
3. Select AGE for Y, Columns. This indicates to JMP which column of data to analyze.
4. Click OK when the dialog box appears as shown in Figure 4.1.

Figure 4.1 Distribution Platform Dialog Box

Figure 4.2 shows the JMP default distribution report. This output includes a histogram, details of the quantiles, and basic summary statistics. You can observe that the histogram has a bin size of 5 years; in other words, there is a bin that contains all age values 30 ≤ AGE < 35, and so on.

Figure 4.2 Distribution for Dermatological Data Set

Below the histogram is the quantile information for this data set. Here the maximum value, minimum value, and median value are presented. In addition, the quantiles provide a reflection of how the data is distributed. Further discussion of quantiles appears below. Below the quantiles are initial summary statistics for the data set. Summary statistics will be covered in depth at the end of the chapter. Of immediate interest among the summary statistics is the number of samples considered, N. While you saw above that the data set has 366 rows, you also saw that there are missing values. Since missing values cannot be used to compute summary statistics, the underlying N considered is 358, the number of samples with non-missing values. Two further examples of examining categorical data in the Distribution platform are presented in Figure 4.3. Figure 4.3a is from the built-in JMP data Grocery Purchases.jmp. Here, the Product column was selected for the Y, Columns variable in the Distribution platform dialog box. (See Figure 4.1.) Figure 4.3b uses the file WV Pareto Data.jmp, which can also be typed in manually from Figure 4.18 and is discussed in Section 4.2.4. Here, the Diagnosis column was selected for Y, Columns and the Number with injury column was selected for Freq, since each diagnosis is associated with a frequency of occurrence. In both examples presented in Figure 4.3, the frequency tables do not correspond to quantiles, but rather to how often each category appears in the data.

Figure 4.3a Distribution for Categorical Data: Grocery Purchases

Figure 4.3b Distribution for Categorical Data: WV Pareto Data


4.2.1.1 Histogram Bin Size – Manual

The display of information can be changed manually by changing the size of each bin. For example, rather than using bins of size 5 years, it is possible to change them to 10 years.

1. Click on the red triangle next to AGE and select Histogram Options ► Set Bin Width.
2. Type 10 in the New Bin Width field as shown in Figure 4.4.

Figure 4.4 Changing Histogram Bin Width

3. Click OK.
4. The result is a histogram with coarser resolution, as shown in Figure 4.5.

Figure 4.5 Histogram with Bin Width Equal to 10

4.2.1.2 Histogram Visualization Settings

The histogram could also use additional work to become more descriptive. First, you should always label axes. This histogram has only one axis labeled, and there is no indication of the size of the bars. Second, there is an additional box plot (a JMP default); this might or might not be preferable. Finally, the histogram is oriented horizontally, another matter of preference. To change any of these, there are several choices. To properly label the frequency axis:

● Click on the red triangle next to AGE and select Histogram Options ► Count Axis.

See the result in Figure 4.6.

Figure 4.6 Histogram with Count Axis

To remove the box plot:

● Click on the red triangle next to AGE and clear Outlier Box Plot.

The histogram now appears as shown in Figure 4.7.

Figure 4.7 Histogram without Box Plot

To rotate the histogram:

● Click on the red triangle next to AGE and select Histogram Options ► Vertical.

The histogram now appears, as shown in Figure 4.8.

Figure 4.8 Vertical Histogram

Additional histogram options also exist. One simple and useful option is to print the count in each bin. To add the count per bin:

● Click on the red triangle next to AGE and select Histogram Options ► Show Counts.

Counts now appear over each bar, as shown in Figure 4.9.

Figure 4.9 Vertical Histogram with Counts per Bin

4.2.1.3 Sturges' Heuristic for Histogram Bin Size (Using the Formula Editor)

Another alternative for determining bin size is Sturges' Heuristic (Sturges, 1926), which estimates the number of histogram bins you should use, given the number of samples in your data. Sturges' Heuristic is defined as follows:

k = 1 + 3.322 ∙ log10(N)

In the equation, k is the number of bins and N is the number of samples under consideration (Sturges, 1926) (Daniel & Cross, 2013, p. 23). To implement Sturges’ Heuristics in JMP, first create a new data table and then use a column formula to compute the number of appropriate bins:

1. Create a new data table by selecting File ► New ► Data Table.
2. Type 358 into the first cell of the first column.
3. Double-click on the column name and rename it N.

The data table should look like Figure 4.10.

Figure 4.10 Sturges' Heuristics Setup

To complete the process of computing Sturges' Heuristic for the number of bins:

1. Double-click on the second column to instantiate a new column.
2. Double-click on the column name and rename it k.
3. Right-click on the second column and select Formula in the menu.
4. Type out the first part of the formula: 1 + 3.322.
5. Click on 3.322.
6. Click on the multiplication button (X).
7. To find the correct function, you have several choices:
   a. Select Transcendental.
      i. Scroll down to Log10 and click on it. It should now appear multiplied by 3.322.
      ii. Click on the box inside Log10.
   b. Or type log into the search box.
      i. Select the appropriate logarithm function, Log10.
8. Click on the N inside the "Table Columns" list.
9. Click OK when the equation looks identical to Figure 4.11.

Figure 4.11 Sturges' Heuristics Equation

As presented in Figure 4.12, notice that the k value is very close to 10. When using this value, you want to round up to 10. In practice, for age data such as this, Sturges' Heuristic might be of limited utility, since you can logically interpret age values in units of 5 and 10. For example, if Sturges' Heuristic suggested a value of 9, you should use a more logical bin size of 10, since it is easier to present and interpret age results that range in 20 ≤ AGE < 30 and 30 ≤ AGE < 40 than it is to discuss ranges like 20 ≤ AGE < 29 and 29 ≤ AGE < 38. However, Sturges' Heuristic can be more useful and powerful when you have more abstract data.

Figure 4.12 Sturges' Heuristics Computation
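As a quick check outside JMP, the same arithmetic can be done in a few lines of Python; this is only a verification of the Formula Editor result, not part of the JMP workflow.

```python
import math

n = 358  # number of non-missing AGE values
k = 1 + 3.322 * math.log10(n)

print(round(k, 2))   # approximately 9.48
print(math.ceil(k))  # round up to 10 bins
```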


4.2.2 Box Plots

Adjacent to histograms, by default, are box plot representations of the data distributions. As in Section 4.2.1, consider DermatologyAge.jmp.

1. Open DermatologyAge.jmp.
2. Select Analyze ► Distribution.
3. Select AGE for Y, Columns.
4. Click OK.

You are immediately presented with a histogram, as discussed in Section 4.2.1, and an outlier box plot of the data, seen to the right of the histogram in Figure 4.13. The box plot displays a representation of the distribution in which the box spans the first to third quartiles, a line inside the box represents the median, and lines extending beyond the box represent the extent of the data, consistent with Benjamini (1988). The first thing you might notice in Figure 4.13 is a box that contains a diamond and a horizontal line. The diamond represents the location of the upper and lower 95% confidence interval about the mean. The horizontal line represents the median of the data. The top and bottom of the box itself represent the locations of the first and third quartiles of the data. The length of the box is thus the interquartile range. Extending from the ends of the box are whiskers, which extend to 1.5 times the length of the interquartile range from the first and third quartiles. Outlying points far outside the whiskers are represented as circles. The red bracket (gray scale in this book) to the side of the box plot identifies the shortest half of the data, i.e., the densest 50% of the observations, consistent with Leroy and Rousseeuw (1987).

Figure 4.13 Histogram with Box Plot

However, the box plot is an outlier box plot, and it relates information about the distribution of the data and not the quantiles. To add a quantile box plot:

1. Click on the red triangle next to AGE.
2. Select Quantile Box Plot.

You now see two box plots in Figure 4.14: the first box plot is now the quantile box plot (between the outlier box plot and the histogram). In this newly visible quantile box plot, additional quantile details are annotated, with the horizontal lines corresponding to quantiles from the frequency table.

Figure 4.14 Histogram with Quantile and Basic Box Plots

Below the histogram and box plots is a table of quantiles, Figure 4.15, from which you can examine the data. Using this table, it is possible to find the following quartiles: quartile 1 (Q1), quartile 2 (Q2), and quartile 3 (Q3). Q1 is denoted by the word "quartile" at the 25th percentile, and, for this data table, Q1 = 25. Q2 is the data median, 35 for this data table, and Q3 = 50 is denoted by the second word "quartile" at the 75th percentile.

Figure 4.15 Quantile Details

You can manually compute the interquartile range as 50 - 25 = 25. However, you can make JMP fill in the interquartile range as follows:

1. Click on the red triangle next to Summary Statistics.
2. Select Customize Summary Statistics.
3. Check Interquartile Range.
4. Click OK.

The Summary Statistics table is now updated, as shown in Figure 4.16.

Figure 4.16 Summary Statistics
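The quartile, interquartile range, and whisker arithmetic described above can be sketched in Python as a cross-check. Note that this is only an illustration with hypothetical ages, and that numpy's default percentile interpolation will not necessarily match JMP's quantile method exactly.

```python
import numpy as np

ages = np.array([19, 25, 27, 33, 34, 36, 41, 45, 50, 62])  # hypothetical ages

q1, q2, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Outlier box plot whiskers extend 1.5 x IQR beyond the quartiles;
# points outside these fences would be drawn as individual circles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence)
```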

4.2.3 Stem-and-Leaf Plots

The stem-and-leaf plot is another approach to visualizing the distribution of data and can be used for the same purpose as a histogram (Schriger & Cooper, 2001). The approach uses the same frequency bins as histograms, but retains the underlying quantifiable information from the data table (Bucevska, 2011; Tukey, 1972). Thus, you could read a stem-and-leaf plot and reproduce the underlying data table (Tukey, 1972). Stem-and-leaf plots consider certain digits as stems (e.g., numbers in the tens place), with leaves representing the next digit. For example, a set of numbers 29, 22, and 25 would be combined with 2 as the stem and 9, 2, and 5 as leaves to yield 2|925 (Bucevska, 2011; Tukey, 1972). When you apply this method to a larger data set, the resultant graph approximates a histogram. To add a stem-and-leaf plot to the DermatologyAge data:

1. Return to the Distribution platform from the previous example.
2. Click on the red triangle next to AGE and select Stem and Leaf.

A stem-and-leaf plot now appears, as shown in Figure 4.17.

Figure 4.17 Stem-and-Leaf Plot for Example Dermatological Data Set

In Figure 4.17, it is possible to read the values as follows:

1. From the first stem, there is a value of 70, with the leaf being 5. The underlying value is 75.
2. The second stem is also 70, and it is possible to infer that the bin size is in units of 5, consistent with the histogram.

The count column represents how many values are accounted for in each stem. Although the histogram is, subjectively, a visually more presentable method, being familiar with the stem-and-leaf plot can provide additional familiarity with how your data set varies (Cooper & Shore, 2008). In addition, the stem-and-leaf plot can show more resolution in the data and enables you to reconstruct the data with some fidelity, depending on the scale presented.
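A minimal Python sketch of the stem-and-leaf construction follows, using hypothetical ages and tens-digit stems; JMP chooses its own stem units (units of 5 in Figure 4.17), so this only illustrates the idea.

```python
from collections import defaultdict

ages = [22, 25, 29, 31, 34, 35, 35, 40, 47, 52]  # hypothetical ages

# Use the tens digit as the stem and the ones digit as the leaf.
stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10].append(age % 10)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```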

4.2.4 Pareto Charts

Pareto charts are problem-solving tools that show causes and quantities in an ordered manner (Burr, 1990). In a Pareto chart, the data is organized by group and quantity from largest to smallest. Further, a Pareto chart usually includes a cumulative count to present the overall contribution various causes have to the total number of occurrences. Pareto charts have various uses in medical data analysis, e.g., Clark, Cushing, and Bredenberg (1998), and present opportunities to find the most important effect or to find what needs the most emphasis. A Pareto chart facilitates analysis via the Pareto principle, whereby roughly 80% of the problems typically come from 20% of the causes, and the remaining 20% of the problems come from the other 80% of the causes (Burr, 1990). While this is a heuristic and not a rule, you can see the Pareto principle in many situations. The result presents a graphical means to determine what the largest causes are and thus what problems you should focus on for possibly the greatest return (Burr, 1990). The following data presents one illustration of the Pareto principle. Data from Whiteman et al. (2012) is considered, which presents the injury diagnoses and counts for 5,469 reported injuries in older (age 64+) West Virginians for 2010 from the West Virginia State Trauma Registry.

Analyzing this data can directly influence plans for reducing injury in nursing homes and planning for care. To analyze this data in JMP:

1. Create a new JMP data table.
2. Type in the data to make it appear as shown in Figure 4.18, or load WV Pareto Data.jmp.

Figure 4.18 West Virginia State Trauma Registry Data from Whiteman et al. (2012)

To create a Pareto plot:

● Select Analyze ► Quality and Process ► Pareto Plot.

Now you must decide which data feature goes into which field. Causes are factors related to an event. These are nominal attributes.

1. Select Diagnosis for Y, Cause.
2. Select Number with Injury for Freq.
3. Click OK as soon as your dialog box looks like Figure 4.19.

Figure 4.19 Pareto Chart Setup

The resultant Pareto plot, Figure 4.20, visually presents which causes are the most prominent. While you can see that the majority of injury causes result in a relatively small number of injuries, for a decision maker the chart can still be busy.

Figure 4.20 Pareto Chart for West Virginia State Trauma Registry Data from Whiteman et al. (2012)

You can group causes together to simplify the presentation. First, you need to determine at what point the notional 80% threshold is reached for the Pareto principle. To do so, it is necessary to add tick marks to the cumulative percentage curve and percentage numbers along the curve:

1. Click on the red triangle next to Pareto Plot.
2. Select Show Cum Percent Points to show the tick marks along the curve.
3. Select Label Cum Percent Points to add cumulative percentages. Clicking on the boxes above each cause will increase the size of the X tick mark on the cumulative percentage curve, while highlighting that row.

4. Find the cause that yields approximately 80% of the injuries, as shown in Figure 4.21.

Figure 4.21 Pareto Chart with Cumulative Percentages

You see that 80% is reached between Concussion and Lower Leg STI. To improve understanding of the data you can group Lower Leg STI and all other less numerous causes together:

1. Click on the red triangle next to Pareto Plot.
2. Select Causes ► Combine Causes.
3. Find Lower Leg STI in the table and click on it.


4. Hold the Shift key and select all causes below it, as shown in Figure 4.22.
5. Click OK.

Figure 4.22 Combining Columns for Pareto Chart

The result, shown in Figure 4.23, generally follows the Pareto principle with 78.7% of the injuries resulting from 32% of the causes (8 out of 25 causes), and with the remaining 21.3% of injuries resulting from 68% of the causes (17). Using this information, you now have eight primary causes to focus on as causing the majority of the injuries. Working to determine the underlying reasons for these causes and methods to mitigate them would be one use for this result.

Figure 4.23 Pareto Chart for West Virginia State Trauma Registry Data with Combined Columns
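The cumulative-percentage logic behind Figures 4.21 and 4.23 can be sketched in Python. The counts below are hypothetical placeholders, not the actual registry values, and the sketch only shows how to locate the approximate 80% cutoff.

```python
import pandas as pd

# Hypothetical injury counts standing in for the registry data (not the real values).
counts = pd.Series({"Cause A": 1800, "Cause B": 950, "Cause C": 600,
                    "Cause D": 450, "Cause E": 300, "All other causes": 1369})

counts = counts.sort_values(ascending=False)
cum_pct = counts.cumsum() / counts.sum() * 100

print(cum_pct.round(1))
# Causes needed to reach roughly 80% of all injuries (the Pareto cutoff).
print(cum_pct[cum_pct <= 80].index.tolist())
```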

4.3 Descriptive Statistics

JMP can compute many simple and advanced statistical values. This chapter primarily uses the Distribution platform and familiarizes you with its built-in statistics. Details of adding these values to tables can be found throughout this chapter. In summary, Table 4.1 lists the basic statistical computations available, along with a basic interpretation of their meanings. All of these quantities are easily available in JMP in the Distribution platform.

Table 4.1 Built-in JMP Descriptive Statistics (Lehman et al., 2005; SAS Institute, 2012, 2016)

◦ Mean: The mean or average of a set of numbers
◦ Std Dev: The standard deviation
◦ Std Err Mean: The standard error of the mean
◦ Upper and Lower Mean Confidence Limits: The upper and lower 95% confidence limits about the mean
◦ N: Number of observations
◦ CV: Coefficient of variation
◦ N Missing: Number of missing samples
◦ N Zero: Number of observations equal to zero
◦ N Unique: Number of observations that have a unique value
◦ Sum Weight: The sum of a column assigned to the role of Weight
◦ Sum: The sum of the elements in the column
◦ Variance: The sample variance, and the square of the sample standard deviation
◦ Skewness: The measure describing the sidedness of the data about the mean
◦ Kurtosis: The measure of how peaked the data is
◦ Range: The difference between the minimum and maximum value
◦ Median Absolute Deviation: The median of the absolute deviations from the median
◦ Uncorrected SS: The uncorrected sum of squares for the data
◦ Corrected SS: The mean corrected sum of squares, or sum of squares, for the data
◦ Autocorrelation: First autocorrelation, which tests whether the residuals are correlated across the rows; this test helps detect nonrandomness in the data
◦ Median: The middle value of the data; the 50th percentile of the data
◦ Mode: The value that occurs most often in the data; if there are multiple modes, the smallest mode appears
◦ Trimmed Mean: The mean calculated after removing the smallest p% and the largest p% of the data (see the note below)
◦ Geometric Mean: The nth root of the product of the data
◦ Interquartile Range: The difference between the third and first quartiles

Quantifiable statistical measures provide additional analysis ability. Measures of central tendency describe where data is centered, and measures of dispersion describe how data varies. This section and its examples discuss how to consider both types of measures in JMP, beginning with how to compute the mean, in addition to other quantities.

Note: The value of p is entered in the Enter trimmed mean percent text box at the bottom of the window. The Trimmed Mean option is not available if you have specified a Weight variable.


4.3.1 Sample Mean and Standard Deviation

Here, for a small example, consider the first 10 observations from the original DermatologyAge.jmp data set. To create a new data table:

1. Select File ► New ► Data Table.
   a. As an alternative, press Ctrl + n.
   b. As another alternative, click the new Data Table icon.
2. Type the 10 observations into the new data table.
3. Double-click on the Column 1 name.
4. Type in Sample Ages as the new name for the column.
5. Save this table by selecting File ► Save As.
6. Select your folder of choice.
7. Type the name DermatologyAgeSubset for later use of this data.

The final data set should look like Figure 4.24.

Figure 4.24 Small Example Age Subset

To compute the mean of this data:

1. Select Analyze ► Distribution.
2. Select Sample Ages for Y, Columns.

The sample mean for the data is given in the summary statistics table, as shown in Figure 4.25.

Figure 4.25 Summary Statistics for Example Age Subset
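For comparison, the mean, sample standard deviation, and standard error reported by JMP can be reproduced with Python's statistics module; the ten ages below are hypothetical placeholders, not the actual subset values.

```python
import statistics

sample_ages = [55, 8, 26, 40, 45, 41, 18, 57, 22, 30]  # hypothetical 10-observation subset

mean = statistics.mean(sample_ages)
std_dev = statistics.stdev(sample_ages)        # sample standard deviation (n - 1 denominator)
std_err = std_dev / len(sample_ages) ** 0.5    # standard error of the mean

print(mean, round(std_dev, 2), round(std_err, 2))
```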

4.3.2 Additional Statistical Measures

Many additional statistical measures are built into JMP, but they are turned off by default to avoid cluttering the summary statistics. To turn on various statistical measures:

1. Return to the data set from DermatologyAgeSubset.jmp.
2. Select Analyze ► Distribution.
3. Select Sample Ages for Y, Columns.
4. Click on the red triangle next to Summary Statistics.
5. Select Customize Summary Statistics.

A new table appears, as shown in Figure 4.26. Here you can select the quantities you want to display.

1. Check Median or any other quantity of interest.
2. Click OK.

Now Median appears in the summary statistics table, as shown in Figure 4.27.

Figure 4.26 Customization of Summary Statistics

Figure 4.27 Customized Summary Statistics Table

The same procedure can be followed for any other measure listed in Table 4.1.


4.4 References

Benjamini, Y. (1988). Opening the box of a boxplot. The American Statistician, 42(4), 257-262.
Bucevska, V. (2011). Stem-and-leaf plot. International Encyclopedia of Statistical Science, 1513-1514.
Burr, J. T. (1990). The tools of quality: Pareto charts. Quality Progress, 23, 59-61.
Clark, D. E., Cushing, B. M., & Bredenberg, C. E. (1998). Monitoring hospital trauma mortality using statistical process control methods. Journal of the American College of Surgeons, 186(6), 630-635.
Cooper, L., & Shore, F. (2008). Students' misconceptions in interpreting center and variability of data represented via histograms and stem-and-leaf plots. Journal of Statistics Education, 16(2), 1-13.
Daniel, W. W., & Cross, C. L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences (10th ed.). John Wiley & Sons.
Demiroz, G., Govenir, H. A., & Ilter, N. (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13(3), 147-165.
Lehman, A., O'Rourke, N., Hatcher, L., & Stepanski, E. (2005). JMP for Basic Univariate and Multivariate Statistics. Cary, NC: SAS Institute.
Leroy, A. M., & Rousseeuw, P. J. (1987). Robust Regression and Outlier Detection. New York: Wiley.
Lichman, M. (2013). Dermatology Data Set. Retrieved February 5, 2016, from UCI Machine Learning Repository: http://archive.ics.uci.edu/ml.
SAS Institute. (2012). JMP 10 Basic Analysis and Graphing. Cary, NC: SAS Institute.
SAS Institute. (2016). Basic Analysis. Cary, NC: SAS Institute.
Schriger, D. L., & Cooper, R. J. (2001). Achieving graphical excellence: Suggestions and methods for creating high-quality visual displays of experimental data. Annals of Emergency Medicine, 37(1), 75-87.
Sturges, H. A. (1926). The choice of a class interval. Journal of the American Statistical Association, 21(153), 65-66.
Tukey, J. W. (1972). Some graphic and semi-graphic displays. In T. Bancroft (Ed.), Statistical Papers in Honor of George W. Snedecor (pp. 293-316). Ames, Iowa.
Whiteman, C., Davidov, D., Tadros, A., D'Angelo, J., & Blum, F. (2012). Falls and dilemmas in injury prevention in older West Virginians. West Virginia Medical Journal, 108(3), 14-20.


Chapter 5: Data Visualization Tools

5.1 Introduction
5.2 Scatter Plots
5.2.1 Coloring Points
5.2.2 Copying Better-Looking Figures
5.2.3 Multiple Scatter Plots
5.3 Charts
5.4 Multidimensional Plots
5.4.1 Parallel Plots
5.4.2 Cell Plots
5.5 Multivariate and Correlations Tool
5.5.1 Correlation Table
5.5.2 Correlation Heat Maps
5.5.3 Simple Statistics
5.5.4 Additional Multivariate Measures
5.6 Graph Builder and Custom Figures
5.6.1 Graph Builder Custom Colors
5.6.2 Incorporating Contextual Data
5.7 References

5.1 Introduction

Creating effective visualizations of data is important both for understanding data yourself and for effectively presenting it to others. Effective visualizations enable you to find patterns that might not be evident if you examined the raw data tables only or relied solely on numerical summaries (Jacoby, 1998). Visualizations use humans' innate abilities (hard to replicate in computers) to find patterns (Tegarden, 1999). They can show characteristics, shapes, and features that are not evident from relying on statistical tests and summaries. Data visualizations can also help you avoid faulty conclusions. This is best seen in the well-known Anscombe data set that is available in the JMP sample data library as Anscombe.jmp. The Anscombe data set provides four sets of numbers (each an X versus a Y) that all provide an identical analysis of variance table (Chapter 8) and linear regression fit (Chapter 9) (Anscombe, 1973). However, these sets of numbers appear very different when plotted using scatter plots (covered in Section 5.2) because of the presence of outliers and curves. See Figure 5.1 as an illustration; details are provided in each subplot. Ignoring the graphical and subjective interpretation, and relying only on statistical summaries and results, most people would fail to see the differences in these relationships.

Figure 5.1 Anscombe Data Example, from (Anscombe, 1973)

This chapter shows you how to use JMP to create scatter plots (an X versus a Y), scatter plot matrices (multiple X versus Y plots), charts, bar charts, parallel plots, and cell plots. Also, this chapter discusses the Graph Builder platform, which enables you to create many custom figures with various styles and formats. Beyond the methods covered in this chapter, JMP has many additional data visualization methods: bubble plots, contour plots, 3-D scatter plots, ternary plots, surface plots, and treemaps, to name a few (SAS Institute, 2012) (SAS Institute, 2016). The dialog boxes for these methods are largely similar, so you can readily learn a new data visualization platform after mastering the ones covered here. Throughout this chapter, the IronGlutathione.jmp example data file is used to illustrate most of the data visualization methods. This data is from a study by Mauzy et al. (2012) and examines the relationship between iron and α and π glutathione-s-transferase in humans.

5.2 Scatter Plots

Scatter plots enable the researcher to plot two columns, one on the x-axis and the other on the y-axis, with each observation represented by a point. In JMP, there are various tools (including Fit Y by X) to examine how two or more variables might interact. To prepare a scatter plot, the next example uses example data as shown in Figure 5.2.


1. Open IronGlutathione.jmp.

Figure 5.2 Iron Glutathione Data Table

2. Select Graph ► Scatterplot Matrix.
3. Select Age and Alpha GST (ng/L) and click on Y, Columns.
4. Click OK when the dialog box looks like Figure 5.3.

Figure 5.3 Scatterplot Matrix Dialog Box

5. Figure 5.4 is the resultant scatter plot.

Figure 5.4 Scatter Plot of Alpha GST versus Age for the Iron Glutathione Data

Here, in Figure 5.4, you can see that the data lies mostly between 0 and 30,000 on the y-axis and between 15 and 55 on the x-axis. Overall, the two variables appear weakly correlated because the data varies mostly in the x dimension, and thus a fitted line would be mostly horizontal. However, the possible outlier above 80,000 on the y-axis could be skewing the presentation. You could reasonably remove this outlier (hide/exclude it in the data table) and replot the results to see how the scatter plot changes. Before removing it permanently, however, you would want to look at Chapter 10 to learn basic methods of quantifying how extreme an outlier might be.

5.2.1 Coloring Points

To add more information to your figures, you can use categorical information from a different data feature to add colors to individual points. To add colors based on Gender in the example data:

● Select Rows ► Color or Mark by Column.

This will bring up the dialog box seen in Figure 5.5, with default colors ranging from blue to red.

Figure 5.5 Color by Column Dialog Box

1. Click on Gender in the list of variable names.
2. Any changes in color can be made via the Colors drop-down menu.
3. Any desired marker type for each point can be selected via the Markers drop-down menu. The type used here is White to Black, to facilitate gray-scale printing in this book.
4. Click OK.

The result is seen in Figure 5.6. Also, the data table itself now has a colored dot on each row, and any figure that was previously created now has colored points. To remove the row properties associated with this coloration:

● Select Rows ► Clear Row States.

Figure 5.6 Scatterplot Matrix with Points Colored by Gender

5.2.2 Copying Better-Looking Figures

To get higher quality figures (i.e., without the plotting tool background):

1. Right-click inside the figure.
2. Select Edit ► Copy Graph.
3. Paste where desired.

Doing this for the figure above yields Figure 5.7:

Figure 5.7 Final Figure of Alpha GST versus Age

Some figures have multiple parts. To copy the entire set of figures:

1. Right-click on the gray triangle next to the title bar (options shown in Figure 5.8).
2. Select Edit ► Copy Picture.
3. Paste where desired.

Figure 5.8 General JMP Figure Options from Gray Triangle


5.2.3 Multiple Scatter Plots

For multivariate data, you can plot more than one pair of columns at a time. Returning to the Scatterplot Matrix tool, it is possible to highlight more columns for plotting.

1. Select Graph ► Scatterplot Matrix.
2. Select the variables of interest and click on Y, Columns.
3. Use the defaults and click OK when the dialog box appears as in Figure 5.9. Figure 5.10 displays the resultant scatter plot matrix.

Figure 5.9 Scatterplot Matrix Dialog Box, Setup for Multiple Variables

Additional options exist for the visualization dialog box shown in Figure 5.9. By default, JMP will plot a lower triangular scatter plot matrix. However, you can also plot an upper triangular or square scatter plot matrix (a square version can be seen in Figure 5.24). Lower and upper triangular scatter plot matrices save space and are transposed versions of each other (Jacoby, 1998). However, square scatter plot matrices have some benefits: they can make interpretation easier and enable you to examine relationships between variables by rows or by columns (Jacoby, 1998).

Figure 5.10 Scatter Plot Matrix of Iron Glutathione Data

By examining the scatter plot matrix in Figure 5.10, you can see various data features. First, in many plots (particularly those with Alpha GST (ng/L)), you can see an apparent outlier in the upper right corner of these plots. If you position the pointer over this point, JMP will tell you which row this corresponds to, row 88 in this case. If you click on this point, JMP will highlight the row in the data table, while simultaneously making the remaining points gray. Clicking in the white space between data points deselects this row, and all the points will return to black.


5.3 Charts

Charts, including bar charts, are very common tools for relaying information. Charts summarize different aspects of the data to facilitate comparisons. Although bar charts will be considered below, pie charts and other charting styles are available in JMP. To create a basic chart:

1. Open IronGlutathione.jmp.
2. Select Graph ► Chart and the dialog box seen in Figure 5.11 will be presented.

Figure 5.11 Chart Dialog Box

3. Click on Alpha GST (ng/L) in the Select Columns box.
4. In the Cast Selected Columns into Roles box, select Statistics ► Data.
5. Click OK and the result seen in Figure 5.12 will be presented.

Figure 5.12 Alpha GST Bar Chart

On the x-axis of Figure 5.12 are the observation numbers, and on the y-axis are the values. For example, if you click on the large bar for number 88, you will see that the data table has line 88 highlighted. By itself, this figure is not that useful, but return to the Chart tool:

1. Click on the red triangle by Chart.
2. Select Redo ► Relaunch Analysis, and the Chart tool has returned.
3. Select Alpha GST (ng/L) in the Select Columns box.
4. In the Cast Selected Columns into Roles box, select Statistics ► Data.
5. Select Gender in the Select Columns box.
6. In the Cast Selected Columns into Roles box, select Categories, X, Levels.
7. Click OK, and the visualization shown in Figure 5.13 appears.

Figure 5.13 Chart of Alpha GST by Gender

Now you see that Alpha GST is plotted on the y-axis and there are two sets of points associated with female and male subjects on the x-axis. If you want to display this data as a bar chart, return to the Chart tool and do the following:

1. Click on the red triangle by Chart.
2. Select Redo ► Relaunch Analysis, and the Chart tool has returned.
3. Select Alpha GST (ng/L) in the Select Columns box.
4. In the Cast Selected Columns into Roles box, select Statistics ► Mean.
5. Select Gender in the Select Columns box.
6. In the Cast Selected Columns into Roles box, select Categories, X, Levels.
7. Click OK and the bar chart shown in Figure 5.14 appears.

Figure 5.14 Bar Chart of Alpha GST by Gender

In Figure 5.14, the top of each bar is the mean of the original data. If you want error bars, you have two options for adding them. To add them directly to this figure:

1. Click on the red triangle by Chart.
2. Select Add Error Bars to Mean (Figure 5.15).
3. Click on the desired error bar type. For this example, Confidence Interval was used.
4. Edit the setting, if needed. Here the default of 0.95 was used to create a 95% confidence interval.
5. Click OK.

Figure 5.15 Dialog Box for Adding Error Bars for Chart

The result, Figure 5.16, shows the bar chart with an error bar on each bar. Of note is that the error bar for F overlaps the mean of M and vice versa. Thus, based on this visual comparison, there is no evidence of a meaningful difference between the two gender groups with respect to Alpha GST (ng/L).

Figure 5.16 Final Alpha GST versus Gender Bar Chart with Error Bars

Note that it would have been possible to add them before generating this figure:

1. Select Add Error Bars to Mean in the Options box in the Chart tool.
2. Select the desired error bar and setting.

Additional data plotting tools, such as pie charts, are available in the Chart tool. These can be found in drop-down menus in the Options box within the Chart tool.
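As context for the Confidence Interval error bars used above, the sketch below computes a group mean and a t-based 95% confidence interval for two groups. The values are hypothetical placeholders, not the Alpha GST measurements from IronGlutathione.jmp.

```python
import numpy as np
from scipy import stats

# Hypothetical Alpha GST (ng/L) values for two groups; placeholders only.
groups = {
    "F": np.array([8200.0, 9100.0, 7600.0, 10400.0, 8800.0]),
    "M": np.array([9500.0, 8700.0, 11200.0, 9900.0, 10100.0]),
}

for label, x in groups.items():
    n = x.size
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)                  # standard error of the mean
    half_width = stats.t.ppf(0.975, df=n - 1) * se   # 95% CI half-width
    print(f"{label}: mean = {mean:.0f}, "
          f"95% CI = ({mean - half_width:.0f}, {mean + half_width:.0f})")
```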

5.4 Multidimensional Plots

Various multidimensional plotting methods exist and are available in JMP. Such methods take multidimensional data sets and manipulate them so that they can be displayed in a single two-dimensional figure. Although some quantitative resolution is frequently lost through such methods, descriptive and visual information is gained. Two plots, parallel plots and cell plots (heat maps), will be presented because of their abilities to condense information. Using other multidimensional plotting methods in JMP is similar.

5.4.1 Parallel Plots

Parallel plots are a common approach to multidimensional visualization. Here, each observation is represented as a line that passes through various parallel axes, with each parallel axis being a different variable (Edsall, 2003) (Inselberg, 1985). Because the variables have different ranges, each parallel axis is scaled so that every axis is the same length, with the ends of each axis corresponding to that variable's extreme (minimum and maximum) values (Edsall, 2003). When you are creating a parallel plot in JMP, all the variables that you want to plot go into the Y, Response box, with any desired grouping (e.g., plotting the results for men and women separately) going into the X, Grouping box. To create a parallel plot in JMP:

1. Open IronGlutathione.jmp.
2. Select Graph ► Parallel Plot to see the Parallel Plot dialog box shown in Figure 5.17.

Figure 5.17 Parallel Plot Dialog Box

3. Select the desired columns to plot in Y, Response (all columns were selected for this example).
4. Click OK.

The resultant parallel plot appears in Figure 5.18. Gender was included as a response and appears as the third variable in the parallel plot. Here, you see the categorical nature of Gender, with only low or high values possible.

Figure 5.18 Parallel Plot of Iron Glutathione Data
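As noted above, each parallel axis is rescaled so that all axes share a common length, with the axis ends at each variable's minimum and maximum. The following is a minimal sketch of that rescaling; the column values are hypothetical, not the example data.

```python
import numpy as np

# Hypothetical columns (not the IronGlutathione.jmp values); each variable
# has a very different range, which is why per-axis rescaling is needed.
data = {
    "Age": np.array([23.0, 41.0, 35.0, 52.0, 29.0]),
    "Ferritin (ng/dL)": np.array([15.0, 180.0, 95.0, 240.0, 60.0]),
}

# Rescale every variable to [0, 1] so each parallel axis has the same length;
# the axis endpoints are the variable's minimum and maximum.
scaled = {name: (x - x.min()) / (x.max() - x.min()) for name, x in data.items()}

# Each observation becomes one line through its scaled values, one vertex per axis.
for i in range(5):
    print([round(float(scaled[name][i]), 2) for name in scaled])
```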


5.4.1.1 Grouped Parallel Plots

If desired, multiple (grouped) parallel plots can be generated based on a categorical feature. Here is an example:

1. Select Graph ► Parallel Plot.
2. Select the desired features to plot in Y, Response.
3. Select an appropriate categorical feature (e.g., Gender) for X, Grouping.
4. Click OK.

The resultant parallel plot, Figure 5.19, with Gender as the grouping, presents the data in two figures. Gender was kept in the response to illustrate the functionality. With Gender being the third variable in both parallel plots, you can see the categorical nature of this variable with a low value in the left plot for all observations and a high value in the right plot for all observations. Separating the data by gender makes the two graphs individually less cluttered than the single parallel plot, and these figures also permit some comparison between groups. For example, %ISAF is seen as having overall lower values with less variance on the left when compared to the right figure. When interpreting the results in Figure 5.19, notice that the figure on the left, female subjects, appears to have less variability for some variables (e.g., sTfR and Ferritin) when compared to the figure on the right, male subjects. This variability is seen in the more scattered lines across much of the figure on the right.

Figure 5.19 Parallel Plot of Iron Glutathione Data, Grouped by Gender

Further data analysis can be performed using parallel plots by reversing the direction of features:

1. Click the red triangle near Parallel Plot.
2. Select Show Reversing Checkboxes.
3. Click on the boxes under Reversed Scale for the desired features to reverse.


5.4.2 Cell Plots

Shaded matrices, such as heat maps and cell plots in JMP, have a long history in data visualization (Loua, 1873) (Gu, et al., 2012). In these methods, values are represented as colors, which enables efficient examination of matrix values (Gu, et al., 2012). However, since a cell plot presents each cell in a matrix with a color, cell plots can be of limited utility for a large data set. With 90 observations, the IronGlutathione.jmp data set is on the verge of being hard to read in a single cell plot. Therefore, the next example will visualize this data set by creating two side-by-side cell plots grouped on gender. To create a cell plot in JMP:

1. Select Graph ► Cell Plot to see the dialog box presented in Figure 5.20.

Figure 5.20 Cell Plot Dialog Box

2. Select the desired features to plot in Y, Response. For this example, all columns were selected.
3. If desired, place an optional categorical feature, such as Gender, in X, Grouping.
4. Click OK.

However, a legend is necessary to interpret this output further because the plot alone gives no indication of the value ranges. To make the cell plot more meaningful, add a legend:

1. Click on the red triangle next to Cell Plot.
2. Select Legend.

The resultant cell plot, Figure 5.21, presents a heat map of the data. For this data table and settings, notice the two heat maps. The cell plot on the left is annotated at the top with “F”, and the one on the right with “M”, indicating the two groups. Below this annotation are the names of the data features, which form the columns of both cell plots. The cell plot contains colors for each cell, indicating relative magnitudes. Missing values are represented as a white cell with an X through them. The legend, at the right, shows that the colors start at blue, for low values, and end at red, for large values. Each data variable has a different scale; thus, red for Age corresponds to 55, and red for Alpha GST (ng/L) corresponds to 90,000.

Figure 5.21 Iron Glutathione Cell Plot with Legend


5.5 Multivariate and Correlations Tool

The JMP Multivariate and Correlations tool enables both visualization and basic analysis of numerical data. Along with correlation matrices and other analysis methods, the multivariate tool provides an automatic scatter plot matrix. To use the multivariate tool:

1. Select Analyze ► Multivariate Methods ► Multivariate to see the dialog box presented in Figure 5.22.

2. Select all columns for Y, Columns.
3. Click OK.

Figure 5.22 Multivariate Dialog Box

You will receive a warning, shown in Figure 5.23, that only continuous variables can be examined with this tool.

Figure 5.23 JMP Alert on Data Type for Multivariate Analysis

4. Click OK.

Note that Gender was automatically removed from the Y, Columns list.

5. Click OK.

The result of these actions is a window, Figure 5.24, with a variety of outputs, which will be explained one by one. The first output is a correlation matrix, and below it is a scatter plot matrix. The scatter plot matrix is similar to the one shown in Figure 5.10. The correlation matrix provides additional information about how the data variables are related.

Figure 5.24 Multivariate Tool Presented for Iron Glutathione Data

5.5.1 Correlation Table

The first view in Figure 5.24 is that of the data correlation matrix; this is enlarged in Figure 5.25. The correlation table uses Pearson’s correlation to compute the strength of a linear association between two variables. Correlation values range from -1 to +1, with -1 meaning that two variables move perfectly in opposite directions, +1 meaning that two variables move perfectly in the same direction, and 0 meaning that there is no linear correlation (e.g., plotting two random vectors against each other). In the JMP correlations table, the colors used range from blue (for perfectly positively correlated) to gray (for lightly positively correlated) to red (for negatively correlated).

Figure 5.25 Correlation Matrix
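For reference, each cell of the correlation table in Figure 5.25 is a Pearson correlation coefficient. The sketch below shows the computation on two hypothetical columns; the names and values are placeholders, not the IronGlutathione.jmp data.

```python
import numpy as np

# Hypothetical paired measurements (placeholders, not the example data).
x = np.array([23.0, 41.0, 35.0, 52.0, 29.0, 47.0])        # e.g., an age-like column
y = np.array([310.0, 520.0, 400.0, 610.0, 330.0, 580.0])  # e.g., an iron-like column

# Pearson correlation: covariance divided by the product of standard deviations.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

print(round(r, 3))
print(round(np.corrcoef(x, y)[0, 1], 3))  # same value from NumPy's built-in routine
```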

5.5.2 Correlation Heat Maps

For more visual analysis of the correlation matrix, you can present a heat map of the correlation table, organized based on similarity of columns. For this purpose, clustering based on correlations is available:

1. Click on the red triangle next to Multivariate.
2. Select Colormap ► Cluster the Correlations.

The resultant visualization, Figure 5.26, presents a representation of the correlation matrix colored by magnitude. Note that the color scheme in Figure 5.26 is reversed when compared to the coloring used for the correlation matrix text in Figure 5.25.

Figure 5.26 Correlation Matrix Heat Map


5.5.3 Simple Statistics

Additional comparison measures, such as nonparametric statistics and overall summary statistics, are available in the multivariate tool. To present these:

1. Click on the red triangle next to Multivariate.
2. Select the method of interest.

To view summary statistics:

● Select Simple Statistics ► Univariate Simple Statistics.

The result is the table shown in Figure 5.27, which appears at the bottom of the Multivariate window. This table provides summary statistics: the number of observations (N), degrees of freedom (DF), mean, standard deviation, sum of values, minimum, and maximum. You can also see that the data has 90 observations, except for Ferritin (ng/dL), which has one apparent missing value, as shown in the cell plot example in Figure 5.21.

Figure 5.27 Univariate Statistics as in the Multivariate Tool
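These summary quantities are straightforward to reproduce outside JMP. The sketch below uses a hypothetical column with one missing value to show how N drops below the number of rows when a value is missing; the numbers are placeholders, not the Ferritin measurements.

```python
import numpy as np

# Hypothetical Ferritin-like column with one missing value (NaN);
# placeholders, not the actual IronGlutathione.jmp measurements.
ferritin = np.array([52.0, 180.0, np.nan, 95.0, 240.0, 60.0])

valid = ferritin[~np.isnan(ferritin)]   # drop the missing value before summarizing

print("N    :", valid.size)             # one fewer than the column length
print("Mean :", round(valid.mean(), 2))
print("Std  :", round(valid.std(ddof=1), 2))
print("Sum  :", round(valid.sum(), 2))
print("Min  :", valid.min())
print("Max  :", valid.max())
```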

5.5.4 Additional Multivariate Measures

A variety of additional measures are available in the Multivariate platform: nonparametric correlations (useful for non-normal data), pairwise correlations (which provide additional information and confidence intervals for a Pearson correlation table), and outlier analysis (which has statistical tests for determining whether a point is an outlier). To access these:

1. Click the red triangle near Multivariate.
2. Select the desired additional measure.


5.6 Graph Builder and Custom Figures

JMP has a variety of additional figure plotting options. One of the most powerful is Graph Builder, which can create many styles and customizations. Through Graph Builder you can easily create, modify, and adapt data visualizations. Also, you can change plotting styles and change data associations by dragging and dropping columns of data. To use Graph Builder:

● Click Graph ► Graph Builder.

The following dialog box, Figure 5.28, then appears.

Figure 5.28 Graph Builder Platform

When you select a variable and drag it over the plotting window, all of the options for its destination are highlighted, as seen in Figure 5.29.

Figure 5.29 Graph Builder Platform with Possible Drop Zones Highlighted

Here is an example of using Graph Builder for basic plotting:

1. Click on %ISAF (Iron/TIBC) in the Variable list.
2. Drag this to the X drop zone.
3. Click on Iron (Mg/dL) in the Variable list.
4. Drag this to the Y drop zone.

By default, Graph Builder includes a smoothed line, Figure 5.30. To remove this:

1. Go to the toolbar at the top of Graph Builder.
2. Click on the second highlighted map option.
3. Note that this deselects the smoothed line, as shown in Figure 5.31.

Figure 5.30 Default Visualization in Graph Builder for Two Variables

Figure 5.31 Graph Builder with Deselected Smoothed Line

5.6.1 Graph Builder Custom Colors

To color a graph by a categorical variable, such as Gender:

1. Select the variable in the Variable list.
2. Drag the selected variable to the Color drop zone on the far right of Graph Builder.

Following these steps for the example data produces Figure 5.32.

Figure 5.32 Graph Builder with Optional Coloring by Points

The title and axis labels are pulled directly from the variable names. To change these labels:

1. Click on the label text.
2. Rename as desired.

To add labels where they are missing:

1. Right-click on the x- and y-axes.
2. Select Add Axis Label.
3. Add an appropriate label.

To finish creating the chart:

● Click Done.

The final figure should appear as Figure 5.33. This figure is now ready for copying, printing, and displaying. However, to make a change:

1. Click on the red triangle next to Graph Builder.
2. Select Show Control Panel.

You have now returned to Graph Builder.

Figure 5.33 Graph Builder Figure Ready for Use in Presentations

5.6.2 Incorporating Contextual Data

If you have additional information in a data set, such as pictures of subjects, you can incorporate this data into the analysis and visualization. The next example considers the Big Class Families data set that comes with JMP. This data set, first presented in Figure 2.14 and discussed in Section 2.4 of Chapter 2, has demographic and lifestyle characteristics of hypothetical students in a middle school. In addition, a picture of each student is stored in the first column of the data table. To incorporate the pictures as contextual data, plot height by weight in Graph Builder (see Sections 5.6 and 5.6.1) and then do the following:

1. Drag the Picture column into the drop zone in the middle of the Graph Builder window (where the data is plotted).

2. Now, hold the pointer over a point and you will see the picture of the subject along with that subject's characteristics, as shown in Figure 5.34.
3. Click Done when you have completed editing the table as desired.

Figure 5.34 Big Class Families Data Table

5.7 References

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.
Edsall, R. M. (2003). The parallel coordinate plot in action: Design and use for geographic visualization. Computational Statistics & Data Analysis, 43(4), 605-619.
Gu, J., Pitz, M., Breitner, S., Birmili, W., von Klot, S., Schneider, A., Soentgen, J., Reller, A., Peters, A., and Cyrys, J. (2012). Selection of key ambient particulate variables for epidemiological studies—Applying cluster and heatmap analyses as tools for data reduction. Science of the Total Environment, 435, 541-550.
Inselberg, A. (1985). The plane with parallel coordinates. The Visual Computer, 1(2), 69-91.
Jacoby, W. G. (1998). Statistical Graphics for Visualizing Multivariate Data. Sage.
Loua, T. (1873). Atlas statistique de la population de Paris. Paris: J. Dejey.
Mauzy, C. A., Johnson, N. H., Jacobsen, J. J., Quade, A. G., Betz, J. N., Frey, J. S., Hanes, A., and Kaziska, D. (2012). Correlation between Iron and α and π Glutathione-s-Transferase Levels in Humans. 711th Human Performance Wing, Bioeffects Division. Wright-Patterson AFB, OH: Air Force Research Laboratory.
SAS Institute. (2012). JMP 10 Basic Analysis and Graphing. Cary, NC: SAS Institute.
SAS Institute. (2016). Basic Analysis. Cary, NC: SAS Institute.
Tegarden, D. P. (1999). Business information visualization. Communications of the AIS, 1(1).


Chapter 6: Rates, Proportions, and Epidemiology

6.1 Introduction
6.2 Rates
  6.2.1 Crude Rates
  6.2.2 Adjusted Rates
6.3 Geographic Visualizations
  6.3.1 National Visualizations
  6.3.2 County and Lower Level Visualizations
6.4 References

6.1 Introduction

Epidemiology involves determining both how and why health issues arise and where they are distributed, e.g., geographically (van Belle, Fisher, Heagerty, & Lumley, 2004) (Daniel & Cross, 2013). Central to this task are rates, contingency tables, and proportions. Rates are used to standardize values to a common scale (e.g., incidences per 100,000 people) and are examined across age or geographic ranges or both. Not only are generating and presenting tables important, but creating geographical figures is also important in presenting information, because it enables you to introduce context to the analysis. Contingency tables can be used to examine the association of two variables. In this chapter, both computing rate values in tables and using built-in JMP tools for geographical figures will be discussed. Contingency tables will be presented in Section 7.5 of Chapter 7.

6.2 Rates

Rates enable you to measure the occurrence of a health outcome (disease, deaths, etc.) across population characteristics. For example, if you have raw counts of disease incidences for two cities, one large and one small, it is impossible to determine whether there is a notable disease problem until the disease counts for each city are scaled by their respective populations. In order to illustrate how to compute rates and ratios in JMP, a few general examples are provided using the formula editor and a basic rate expression. You can adapt this process for specific rate equations.

6.2.1 Crude Rates

One example of a crude rate is an incidence rate, the ratio of the number of observed cases of a disease to the total population of interest, with a multiplier to provide a reference point (e.g., per 1,000). This can be presented as an incidence rate:

incidence rate = (number of cases / total population at risk) × C

In the equation, C is the rate multiplier (e.g., “number of cases per C people”), number of cases is the number of cases observed, and total population at risk is the total population of interest (van Belle, Fisher, Heagerty, & Lumley, 2004) (Daniel & Cross, 2013). If you were to consider a death rate per 100,000 people, a reasonable equation would be as follows:

death rate per 100,000 = (number of deaths / total population at risk) × 100,000

Crude rates for a health incidence can also be computed over a time interval. If a time interval is considered as well (e.g., aggregated data from multiple years), then you can compute the crude rate as follows:

crude rate = (number of cases / (total population at risk × number of years in the time period)) × C

An example is presented, with data in Table 6.1. Conceptually, this example is similar to an example presented in van Belle et al. (2004). In Table 6.1, cancer data for metropolitan areas are presented using data from the United States Cancer Statistics, 1999-2012 Incidence Results from the Centers for Disease Control and Prevention (CDC, 2015) and population estimates from the US Census Bureau (Population Division, 2014).

Table 6.1 Selected Metropolitan Cancer Data for 2009-2012

Metropolitan Area                   | Cancer Incidence (CDC, 2015) | Total Metropolitan Area Population (at-risk population) (Population Division, 2014)
Albuquerque, NM                     | 15,557                       | 900,464
Cincinnati, OH-KY-IN                | 42,667                       | 2,129,309
Los Angeles-Long Beach-Anaheim, CA  | 206,851                      | 13,037,045

To compute the crude cancer rate per 100,000 people per year in JMP, it is necessary to create a new data table and then use the Formula Editor. For this data, the C will be 100,000 and the time period will be four years for the 2009 to 2012 range considered. To compute the Incidence per 100,000 per year rate:

1. Create a new data table.
2. Type the table above into JMP.
3. Double-click on the fourth column to instantiate a new column.
4. Double-click on the column name and rename it Incidence per year.
   a. This column will present the average number of cancer incidences in each metropolitan area by year.
5. Double-click on the fifth column to instantiate a new column.


6. Double-click on the column name and rename it Incidence per 100,000 per year.
   a. This column will present the incidences of cancer in each metropolitan area per year, scaled by population.

The data table should now look like Figure 6.1.

Figure 6.1 Preliminary MetroCancer Data Table

To add the computations for the columns:

1. Right-click on the fourth column, Incidence per Year, and select Formula in the menu.
2. Access the Formula Editor, Figure 6.2, to create the formula Cancer Incidence / 4:
   a. Select Cancer Incidence in the list of columns.
   b. Select the divide symbol.
   c. Type 4, for the total number of years of interest.
3. Click OK, when the equation looks identical to Figure 6.2.

Figure 6.2 Formula for Cancer Incidence per Year

Right-click on the fifth column, Incidence per 100,000 per year, and select Formula in the menu.


4. Access the Formula Editor, Figure 6.3, to create the formula Cancer Incidence / (Population × 4) × 100,000:
   a. Select Cancer Incidence in the list of columns.
   b. Select the divide symbol.
   c. Select Population in the list of columns.
   d. Select the multiplication symbol.
   e. Type 4, for the total number of years of interest.
   f. Then, click on the box surrounding the whole equation, so that a blue box highlights it.
   g. Select the multiplication symbol.
   h. Type in 100,000.
5. Click OK, when the equation looks identical to Figure 6.3.

Figure 6.3 Formula for Cancer Incidence per 100,000 per Year

6. Click OK, and your data table should look like Figure 6.4.

Figure 6.4 Final MetroCancer Data Table

Interpreting the results, you can see that, for the four-year time period, the Los Angeles-Long Beach-Anaheim area at first stands out with the highest incidences per year; its incidence per year is in fact about 5 times larger than that of Cincinnati and about 13 times larger than that of Albuquerque. However, when population is taken into account, Cincinnati has the highest overall cancer rate of the considered metropolitan areas, at roughly 500 per 100,000 per year. Both Albuquerque and the Los Angeles-Long Beach-Anaheim area have much lower rates per 100,000 per year.
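The same crude-rate arithmetic can be checked outside the Formula Editor. The following minimal sketch applies the rate formula from Section 6.2.1 to the Table 6.1 values.

```python
# Crude cancer rate per 100,000 people per year, using the Table 6.1 values:
# rate = cases / (population * years) * C
C = 100_000
years = 4  # 2009-2012

table = {
    "Albuquerque, NM": (15_557, 900_464),
    "Cincinnati, OH-KY-IN": (42_667, 2_129_309),
    "Los Angeles-Long Beach-Anaheim, CA": (206_851, 13_037_045),
}

for area, (cases, population) in table.items():
    per_year = cases / years
    rate = cases / (population * years) * C
    # Cincinnati comes out highest, at roughly 500 per 100,000 per year.
    print(f"{area}: {per_year:,.0f} cases/year, {rate:.0f} per 100,000 per year")
```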

6.2.2 Adjusted Rates

The rates presented above were “crude” rates in that they do not take into account the age groups of the observations or other possible indicators. To extend this type of analysis, it is necessary to consider adjusted rates. The next example examines births by the age of the mother. Daniel and Cross (2013) presented such an analysis for the US state of Georgia. This example uses data for the US state of Ohio for 2010, shown in Figure 6.5. Demographic data was collected from both the 2010 Census of Ohio (State of Ohio, 2010) and the number of live births per the age of the mother from the Ohio Department of Health (2010). You could collect similar and additional data by examining data repositories at the CDC, the Ohio Department of Health Bureau of Vital Statistics, and the departments of health of other US states.

Figure 6.5 Ohio Births by Age of Mother

6.2.2.1 Cleaning the Data Table

Because the data for live births and age of mother does not contain any values below 10 years of age, it is necessary to delete the rows associated with “Under 5” and “5 to 9”. To do so:

1. Select the first and second rows.
2. Right-click on them and select Delete Rows.

In addition, one row that is associated with unknown values for the age of mother cannot be used for the analysis. Find row 10 and delete this row as well. The final data table should look like Figure 6.6.

Figure 6.6 Refined Data Table for Ohio Births by Age of Mother

6.2.2.2 Computing Adjusted Rates

To add an age-specific fertility rate, consistent with the reference (Daniel & Cross, 2013), it is necessary to create a new column and add the following formula to compute this rate:

age-specific fertility rate = (number of live births to women in an age group / number of women in the population for that age group) × 1,000

To create this formula in JMP:

1. Right-click in the empty column box next to Women in Population.
2. Select New Columns.
3. Type Age-Specific Fertility Rate for the Column Name.
4. Click on Column Properties.
5. Select Formula in the menu.
6. Create the formula above and as shown in Figure 6.7.
7. Click OK, when the equation looks identical to Figure 6.7.

Figure 6.7 Inserting a Formula for an Age-specific Fertility Rate


1. Click OK in the New Column tool.

The final data table should now look like Figure 6.8. From here, you could plot or visualize the data further, or interpret and compare the age-specific rates against national, regional, or international rates.

Figure 6.8 Final Data Table for Ohio Births by Age of Mother
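The new column applies the same formula to every row. The sketch below illustrates the calculation with hypothetical birth and population counts; the actual Ohio 2010 figures appear only in the book's data table and are not reproduced here.

```python
# Age-specific fertility rate = births to women in an age group
#                               / women in that age group * 1,000.
# The counts below are hypothetical placeholders, not the Ohio 2010 figures.
age_groups = {
    "15 to 19": {"births": 5_200, "women": 390_000},
    "20 to 24": {"births": 18_400, "women": 380_000},
    "25 to 29": {"births": 21_300, "women": 360_000},
}

for group, counts in age_groups.items():
    rate = counts["births"] / counts["women"] * 1_000
    print(f"{group}: {rate:.1f} births per 1,000 women")
```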

6.3 Geographic Visualizations

Geographic Information Systems (GIS) are software tools that present spatial data (e.g., maps) and link it with databases. Using GIS in health care builds on concepts developed over centuries of manually plotting diseases on maps to infer causes; see Snow (1855) and these selected references (Davenhall & Kinabrew, 2011) (Clarke, McLafferty, & Tempalski, 1996) (Ruankaew, 2005) (Jerrett, et al., 2003). Uses of GIS in health care include mapping genetic variations, food availability, behavioral risks, rates of diseases or incidences, and other applications (Davenhall & Kinabrew, 2011). This section uses the JMP built-in geographical plotting abilities to consider health care problems. Examples of national, US state, and Ohio county-level rates of opioid overdoses will be considered. It is also possible to extend this to international rates. Data for these examples was collected from the CDC and the Ohio Department of Health Bureau of Vital Statistics. Additional data on a wide variety of health-related topics, with potential for GIS exploration, can be found at these sources; beyond these sources, a researcher could gather more data from the departments of health of other US states.

6.3.1 National Visualizations

It is rather straightforward to create national visualizations in JMP; you must merely keep in mind the syntax JMP is expecting. One example is presented in the following data set, US_OpioidOD_dat.jmp. Here, US data for opioid overdose deaths was collated from publications by the CDC for 2013 and 2014 (Rudd, Aleshire, Zibbell, & Gladden, 2016). To examine this data set:

● Open US_OpioidOD_dat.jmp and you should see a data table like Figure 6.9.

Figure 6.9 US Opioid Overdose Data for 2013-2014

The data shown in Figure 6.9 contains a column with a US state abbreviation, followed by multiple numeric columns containing rates and numbers of overdoses. In JMP, the two-letter abbreviation is one format that works effectively. However, the format of the State column is critical. If you do not use the correct name or abbreviation for a state or country, JMP will not plot it correctly. To create a GIS plot of this data:

1. Select Graph ► Graph Builder.
2. Select State and drag it to the Map Shape drop zone; Graph Builder should now show something that looks like Figure 6.10.

Figure 6.10 Blank Geographical Map of the USA

3. In the Variable list, select 2014Rate.
4. Drag this variable to Color, on the right-hand side of Graph Builder.

Figure 6.11 Map of US States Colored by 2014 Opioid Overdose Rate


5. Click Done when the figure looks like Figure 6.11.

The resultant figure should look like Figure 6.12.

Figure 6.12 Map of US States Colored by 2014 Opioid Overdose Rate without Axis Labels

6.3.1.1 Editing Titles and Adding Axis Labels

Although the resultant Figure 6.12 might look good at first glance, it might still be non-intuitive to your audience. To make the figure more understandable, it is a good idea to edit the figure to add axis labels and to improve the title. To edit the title, return to Graph Builder:

1. Click on the red triangle next to Graph Builder.
2. Select Show Control Panel. Now the original control panel view has returned.
3. Click on the title, State colored by 2014Rate, and change it to a more meaningful title (e.g., States colored by 2014 Opioid Overdose Rate).

To add axis labels:

1. Right-click on the y-axis.
2. Select Add Axis Label. The axis label dialog box, Figure 6.13, should appear.
3. Type in Latitude.
4. Click OK.
5. Repeat the steps for the x-axis, but label the axis Longitude.

Figure 6.13 Axis Label Dialog Box

6.3.1.2 Changing Legend Details and Color Themes

Beyond axis and title issues, the legend in Figure 6.12 lacks context about the rate scale, and both the legend and the title include the raw variable name 2014Rate, which needs appropriate spacing. In addition, the color shading might not be appropriate for all tasks. As mentioned by Allison (2016), diverging shading (from blue to red) can make a graphic hard to evaluate. Furthermore, since this book is printed in gray scale, the reds and blues might not be ideal for the reader. To edit the legend and legend title:

1. Double-click on the Legend to bring up the Legend Settings tool.
2. A Legend Settings dialog box will appear, as shown in Figure 6.14.
3. Change the title to a more meaningful description (e.g., 2014 rate per 100k people) as shown in Figure 6.14.

Figure 6.14 Legend Settings Dialog Box with Default Color Theme

To change the color shading format:

1. Click on Color Theme. A variety of options appear, as shown in Figure 6.15.
2. Select Sequential ► White to Black, as shown in Figure 6.16 with the default colors in Figure 6.14.
3. Click OK.
4. Click Done.

Figure 6.15 Available Color Themes in JMP

Figure 6.16 Legend Settings with Changed Color Theme

5. Change any color settings that you want. For the rest of this example, a gray-scale color theme was selected by clicking on Color Theme and selecting one of interest.

6. Click OK.
7. Click Done in the Graph Builder tool.

The resultant figure should look like Figure 6.17.

Figure 6.17 Final Map of US States Colored by 2014 Opioid Overdose Rate

The resultant Figure 6.17 now has axis labels, an appropriate title, and an appropriate legend. From this figure you can see that West Virginia, Ohio, Kentucky, New Mexico, Utah, and New Hampshire have the highest opioid overdose rates. From this simple example and visualization, you could infer that region is a contributing factor, with West Virginia, Ohio, and Kentucky being neighboring states that have similarly high opioid overdose rates. Such similarities could be due to interrelated demographics or cultural similarities. It is possible to branch off from this example and design an epidemiological experiment to investigate key variables contributing to the opioid overdose rate. For example, is there a reason that West Virginia, a state with a low population density, has a much higher rate than North Dakota, a state with a similarly low population density? Population density and economics could be other factors.

6.3.1.3 Changing Displayed Data, and a GIS Example of the Importance of Rates for Analysis

If you were to repeat the process above, but select the raw count data in the 2014Number column instead of the rate, the result would look like Figure 6.18. Alternatively, you could select Show Control Panel, as discussed in Section 6.3.1, and do the following to change the displayed data:

1. Select the column 2014Number.
2. Drag this column to the Color drop zone.
3. Change the legend so that the title is 2014 Number of Opioid Overdoses.
4. Change the title to States Colored by 2014 Opioid Overdoses.
5. Click Done.

The result will again look like Figure 6.18. This figure shows that California has the most opioid overdoses. However, California has roughly 20 times the population of West Virginia. So if you were to consider only raw numbers and not rates, events happening in West Virginia would be ignored while a possibly less significant problem in California might be over-analyzed.

Figure 6.18 Map of US States Colored by 2014 Opioid Overdose Counts

6.3.2 County and Lower Level Visualizations

Although national-level information is useful, decision makers are frequently interested in finer resolution. Using JMP, finer-resolution geographical epidemiological analysis is possible; however, it involves understanding what JMP is expecting, so it is time to learn how to extend the mapping features of JMP accordingly. Continuing the analysis above, notice that Ohio, West Virginia, and Kentucky had relatively high overdose rates. This example digs deeper by examining drug overdose deaths by Ohio county between 2009 and 2014 (Bureau of Vital Statistics, 2015). The data considered is largely unedited prior to analysis, and thus the example will walk through the various steps needed to produce a graphical understanding of Ohio drug overdose deaths by county.

● Open OhioDrugOverdoseDeaths2009-2014.jmp

The data table should look like Figure 6.19.

Figure 6.19 Example Data Table of Ohio Drug Overdoses by County

With the data loaded, you can see that none of the columns contain information about the state of interest or the meaning of the county names. To rectify this issue, and thus to create an appropriate map, it is necessary to add both a state designator and a county title to the county names. The steps below create a new column that contains “ County, OH” for all observations. This new column will then be concatenated with the county name column. Step-by-step, do the following:

1. Select Cols ► New Columns.
2. Type State into the Column Name field.
3. For the Data Type field, select Character.
4. For the Initialize Data field, select Constant.
5. Type in County, OH, ensuring that there is a blank space before County.
6. Click OK when the result looks like Figure 6.20.

Figure 6.20 New Column Dialog Box

To add this information to the county names, it is necessary to concatenate the strings via the Combine Columns dialog box:

1. Select the County column and the State column.
2. Select Cols ► Utilities ► Combine Columns to see a dialog box like Figure 6.21.
3. Type in a new name, such as Map Name.
4. For the delimiter, delete the comma and add a space. Otherwise, there will be no space between the county name and the word “County”.
5. Click OK.

Figure 6.21 Combine Column Dialog Box
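Conceptually, the Combine Columns step is a simple string concatenation of the county name and the constant state suffix. A minimal sketch of the equivalent operation follows; the county names shown are an illustrative subset, not the full table.

```python
counties = ["Adams", "Franklin", "Hamilton"]  # illustrative subset; the table has all 88
suffix = "County, OH"                          # content of the constant State column

# Combine Columns with a space delimiter is equivalent to joining the two
# text columns with " ", producing names in the form JMP's map shapes expect.
map_names = [f"{county} {suffix}" for county in counties]
print(map_names)  # ['Adams County, OH', 'Franklin County, OH', 'Hamilton County, OH']
```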

After performing these steps, you are finally ready to plot the data.

1. Return to Graph Builder by selecting Graph ► Graph Builder.
2. Drag Map Name to Map Shape.

Notice that JMP shows the state of Ohio with county lines drawn, as presented in Figure 6.22; thus the data editing steps were a success.

Figure 6.22 Blank Map of Ohio Counties

To color each county by drug overdose rates, select and display the appropriate information:

1. Drag Age Adjusted Rate to Color.
2. Click on the white space outside the state of Ohio to deselect any observations.
3. Click on the title Map Name and rename it Ohio Prescription Drug Overdoses 2009-2014.
4. Right-click on the x- and y-axes. For both, select Add Axis Label, and respectively add Longitude and Latitude.
5. Click Done.

The resultant figure should look like Figure 6.23. For a final product, you would likely now want to follow the process in Section 6.3.1.2 to change the color theme and adjust the legend title.

Figure 6.23 Ohio Prescription Drug Overdose Rates, 2009-2014

The end result, Figure 6.23, tells an interesting story: some regions of Ohio see high prescription drug overdose deaths, with the largest issues in southern Ohio. Geographical associations of drug addiction are known (Thomas, Richardson, & Cheung, 2008), and with the knowledge from this analysis, you could aid in determining cause, allocating resources, considering regions to examine at finer resolution, and introducing additional data sources.

6.4 References

Allison, R. (2016, Mar. 30). Drug Overdose Deaths Are on the Rise in the US. Retrieved from SAS Learning Post: http://blogs.sas.com/content/sastraining/2016/03/30/drug-overdosedeaths-are-on-the-rise-in-the-us/.
Bureau of Vital Statistics. (2015). Ohio Drug Overdose Data: General Findings. Ohio Department of Health.
CDC. (2015). United States Cancer Statistics: 1999-2012 Incidence, WONDER Online Database. Centers for Disease Control and Prevention.
Clarke, K., McLafferty, S., & Tempalski, B. (1996). On epidemiology and geographic information systems: A review and discussion of future directions. Emerging Infectious Diseases, 2(2).
Daniel, W. W., & Cross, C. L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences (10th ed.). John Wiley & Sons.
Davenhall, W. F., & Kinabrew, C. (2011). GIS in health and human services. In Springer Handbook of Geographic Information (pp. 557-578). Springer Berlin Heidelberg.
Jerrett, M., Burnett, R., Goldberg, M., Sears, M., Krewski, D., Catalan, R., . . . Finkelstein, N. (2003). Spatial analysis for environmental health research: Concepts, methods, and examples. Journal of Toxicology and Environmental Health Part A, 66(16-19), 1783-1810.
Ohio Department of Health. (2010). Birth-Data and Statistics. Columbus: Ohio Department of Health.
Population Division. (2014). Annual Estimates of the Resident Population: April 1, 2010 to July 1, 2013. U.S. Census Bureau.
Ruankaew, N. (2005). GIS and epidemiology. Journal of the Medical Association of Thailand, 88(11), 1735-1738.
Rudd, R. A., Aleshire, N., Zibbell, J. E., & Gladden, R. M. (2016). Increases in drug and opioid overdose deaths—United States, 2000-2014. Centers for Disease Control and Prevention, Morbidity and Mortality Weekly Report (MMWR), 64(50&51), 1378-1382.
Snow, J. (1855). On the Mode of Communication of Cholera. London: John Churchill.
State of Ohio. (2010). Census 2010 Complete SF-1 Content Profile. Columbus: Ohio Development Services Agency.
Thomas, Y. F., Richardson, D., & Cheung, I. (2008). Geography and Drug Addiction. Berlin: Springer.
van Belle, G., Fisher, L. D., Heagerty, P. J., & Lumley, T. (2004). Rates and proportions. In Biostatistics: A Methodology for the Health Sciences (pp. 640-653). John Wiley & Sons Inc.

Chapter 7: Statistical Tests and Confidence Intervals

7.1 Introduction
  7.1.1 General Hypothesis Test Background
  7.1.2 Selecting the Appropriate Method
7.2 Testing for Normality
  7.2.1 Histogram Analysis
  7.2.2 Normal Quantile/Probability Plot
  7.2.3 Goodness-of-Fit Tests
  7.2.4 Goodness-of-Fit for Other Distributions
7.3 General Hypothesis Tests
  7.3.1 Z-Test Hypothesis Test of Mean
  7.3.2 T-Test Hypothesis Test of Mean
  7.3.3 Nonparametric Test of Mean (Wilcoxon Signed Rank)
  7.3.4 Standard Deviation Hypothesis Test
  7.3.5 Tests of Proportions
7.4 Confidence Intervals
  7.4.1 Mean Confidence Intervals
  7.4.2 Mean Confidence Intervals with Different Thresholds
  7.4.3 Confidence Intervals for Proportions
7.5 Chi-Squared Analysis of Frequency and Contingency Tables
7.6 Two Sample Tests
  7.6.1 Comparing Two Group Means
  7.6.2 Paired Comparison, Matched Pairs
7.7 References

7.1 Introduction

Statistical tests involve using sampled data to test a hypothesis about a population. It is largely impossible to sample an entire population. Thus, you must infer characteristics about a population, given sampled observations. For example, it is impossible to measure the petal lengths and widths of all Iris Setosa flowers ever grown, but you can collect measurements on a large number of samples (Fisher, 1936) and infer properties about the entire population. The process of inferring population characteristics from samples is referred to as statistical inference.

A few methods will be considered. Hypothesis testing methods are the procedures used for statistical inference; these involve using sample data to statistically make inferences about a population. Confidence intervals construct an interval within which the population mean is expected to be. Procedurally, confidence intervals and hypothesis tests use the same computations and are closely related (Cumming & Finch, 2005) (Cumming, 2009) (Cumming, 2007). Thus, both hypothesis tests and confidence intervals are discussed in this chapter with application to mean, variance, proportion, and comparison approaches.

7.1.1 General Hypothesis Test Background

Hypothesis tests are generally presented as follows:

H0: μ = μ0
HA: μ ≠ μ0

In these equations, H0 refers to the null hypothesis, HA refers to the alternative hypothesis, μ is the population parameter, and μ0 is the testing value (e.g., a scalar real number). The equality must always be in the null hypothesis, and an inequality (≠, <, or >) must appear in the alternative in order to make “failing to reject” the null the stronger decision. In terminology, you do not “accept” the null, but “fail to reject” it; this is because the null is assumed to be true until sufficient evidence indicates that it should be rejected (Daniel & Cross, 2013). With JMP, you do not select which type of inequality to assume because JMP will test and provide results for all three possible options.

Four outcomes are possible in a hypothesis test, as conceptualized in Figure 7.1. In summary, the four outcomes are failing to reject the null hypothesis when it is really true, rejecting the null when it is really false, rejecting the null when it is really true, and failing to reject the null when it is really false. The third and fourth possible outcomes are not ideal and are respectively known as Type I and Type II errors. Table 7.1 provides examples of the types of errors that are possible in hypothesis tests.

Figure 7.1 Confusion Matrix of Hypothesis Tests Results and Possible Errors (Fawcett, 2006)

Table 7.1 Examples of Type I and Type II Errors, Examples Adapted from (Reames & Kemeny, 2011)

Type I Errors                                                  | Type II Errors
Convict an innocent person                                     | Set a guilty person free
Tell a patient that they have cancer when they do not          | Miss a patient’s cancer
Deny boarding to a safe passenger                              | Allow a dangerous passenger to board
Make a process change that was not needed                      | Fail to make a needed process change
Fail to show that an effective drug is different from placebo  | Approve an ineffective drug

It is largely impossible ever to know the actual “truth” in these outcomes, but it is possible to use probabilistic models to aid in this decision. However, it is important to select the appropriate method since the probabilities of these decisions are the result of making various assumptions.
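To make the H0/HA formulation above concrete, the sketch below carries out a two-sided, one-sample z-test on hypothetical data. The sample values, the test value, and the assumption of a known population standard deviation are all illustrative choices, not an example from the book.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and test value; sigma is assumed known purely for illustration.
x = np.array([124.0, 118.0, 131.0, 127.0, 122.0, 129.0, 125.0, 120.0])
mu0 = 120.0      # H0: mu = 120, HA: mu != 120
sigma = 5.0      # assumed (known) population standard deviation

z = (x.mean() - mu0) / (sigma / np.sqrt(x.size))
p_two_sided = 2 * stats.norm.sf(abs(z))   # area in both tails of the standard normal

print(f"z = {z:.2f}, two-sided p-value = {p_two_sided:.4f}")
# Reject H0 at the 0.05 level only if the p-value is below 0.05.
```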

7.1.2 Selecting the Appropriate Method

There are two main types of hypothesis test and confidence interval methods: parametric and nonparametric. Parametric methods have an underlying distributional assumption, but nonparametric methods are distribution free (which does not mean they are assumption free). However, parametric methods generally have tighter confidence intervals than nonparametric methods; and within parametric methods (e.g., for means), the z-score provides a tighter confidence interval than a t-score. Care must be taken because using an inappropriate method can result in incorrect and unethical results. In determining which method to use, here are some general guidelines to follow.

The first criterion is based on knowledge of the population's normality. If it is possible to assume that the data came from a normally distributed population, then it is possible to use a parametric method (e.g., z-score/test or t-score/test) that will provide a tighter confidence interval than a nonparametric method (e.g., Wilcoxon signed rank). However, a nonparametric method will be more conservative, and more ethical, if there are doubts about the distribution, that is, if there are extreme values or very small sample sizes (Aczel, 1993) (Mason, Lind, & Marchal, 1999).

Second, sample size is a key component of which test to use. If you have non-normally distributed data, you might reasonably use a nonparametric method. However, the Central Limit Theorem can be applied for large samples, and you can thus use a z-score/test. Generally, it is necessary to have 30 samples or more to appropriately invoke the Central Limit Theorem in such situations (Hayslett, 1968). The Central Limit Theorem can be described as follows (Hayslett, 1968): when considering a random sample of n observations with a sample mean x̄, taken from a population with a mean of μ and variance of σ2, the distribution of

(x̄ − μ) / (σ/√n)

approaches a normal distribution, with a mean of 0 and a variance of 1, as n increases (a brief simulation illustrating this convergence appears at the end of this section).

The last criterion relates to knowing the population's variance when deciding whether to use a z-score/test (for known population variance) or a t-score/test (when the population variance is unknown). The advantage to using a z-score/test is that it provides a tighter confidence interval/decision region; however, incorrectly using it can introduce questionable results. Since it is generally possible to know about the population only through the sample measurements, it is often impossible to assume that the sample's variance computation is consistent with the population unless the Central Limit Theorem is applied. However, in situations where there is an assumption that can be used about the population variance (a known standard, publication, etc.), it might be possible to correctly use the z-score/test for such data.

Certain situations do imply the use of nonparametric methods over parametric methods. For example, you might need to use a nonparametric test if a median is of interest (such as in a skewed distribution), if you have a very small sample size, if the data is ordinal, or if there are influential outliers. One application area of concern is Likert data, where there are discrete survey outcomes (such as none, mild, moderate, severe, very severe); although both parametric and nonparametric methods can be valid, it can be beneficial to compare both methods and look for disagreement in the results (de Winter & Dodou, 2010).

In general, try to use a parametric test if possible since these have more statistical power than nonparametric methods. Although parametric methods are built around a normal distribution, they are generally robust to departures from normality (Young & Veldman, 1977); however, non-normally distributed large sample size data sets can still have distribution issues (Bradley, 1980). Nonparametric tests might not be valid if the dispersion of the groups differs greatly. Thus, strive to understand the data, explore different methods, and aim to provide a meaningful interpretation rather than follow a flow chart approach to selecting a method.
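The convergence described by the Central Limit Theorem is easy to see by simulation. The short Python sketch below (an illustrative example outside of JMP, with made-up simulation settings) draws repeated samples from a deliberately skewed population, standardizes each sample mean, and checks that the standardized means behave approximately like a standard normal distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# A deliberately skewed "population": an exponential distribution
pop_mean, pop_sd = 1.0, 1.0   # mean and SD of Exponential(scale=1)

n = 30          # sample size per experiment (the usual rule-of-thumb minimum)
reps = 10_000   # number of repeated samples

samples = rng.exponential(scale=1.0, size=(reps, n))
sample_means = samples.mean(axis=1)

# Standardize each sample mean: (x_bar - mu) / (sigma / sqrt(n))
z = (sample_means - pop_mean) / (pop_sd / np.sqrt(n))

# If the CLT applies, z should have mean ~0, SD ~1, and look roughly normal
print("mean of z:", round(z.mean(), 3))
print("SD of z:  ", round(z.std(ddof=1), 3))
```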

7.2 Testing for Normality Many statistical methods have an underlying assumption of normality (e.g., residuals in regression), and thus it is a very common and necessary task to evaluate data for normality. This topic is discussed further in Chapter 10. The starting point here is examining how to address this; however, the results do not prove that the data is from a normal distribution, merely that there is evidence that the data could be from a normal distribution. This is a concern in parametric tests, which have an assumption that the examined data is from a normal distribution, and these tests might not be valid if the data is non-normal. Before examining parametric and nonparametric tests, the next topic explains how to adequately check for normality.

7.2.1 Histogram Analysis

You can visually check for normality via a histogram. However, this can be questionable, especially for small data sets, and thus histograms cannot readily verify normality. Consider the following systolic blood pressure data, collected by the author: BP = {125, 110, 108, 130, 120, 111, 105, 122, 124, 115}. Using this data, this example examines whether it is possibly normally distributed or not:

1. Open a blank JMP table.
2. Type in the values from BP.
3. Change the column name to Systolic.
4. Select Analyze ► Distribution.
5. Select Systolic for Y, Columns.
6. Click OK.

You now see a histogram of the data, Figure 7.2, as discussed in Chapter 4. But it is not apparent whether the histogram represents data that was likely collected from a normal distribution. At first glance, Figure 7.2 could almost be uniformly distributed since there are six bins with most bins having two observations. Thus, a histogram by itself is not sufficient to answer this question.

Figure 7.2 Histogram of Example Blood Pressure Data

To add a normal distribution curve to aid in interpretation:

● Click on the red triangle next to Systolic and select Continuous Fit ► Normal.

Figure 7.3 Histogram of Example Blood Pressure Data with a Continuous Normal Fit

This added a continuous normal distribution on top of the histogram, overlaid on the data in Figure 7.3, with its properties shown in Figure 7.4. From both the histogram and the continuous fit, it is not apparent whether this data is normally distributed. Thus, analyzing the histogram alone is likely not sufficient for this problem.

Figure 7.4 Fitted Normal Distribution Details

7.2.2 Normal Quantile/Probability Plot

To improve the quality of this type of analysis, it is possible to use normal quantile plots, also known as normal probability plots, for this purpose. A brief example and steps for the normal quantile plot, using BP, will illustrate its operation. To compute a normal quantile plot by hand (a scripted sketch of the same calculation follows these steps):

1. Sort the data in ascending order: BP = {105, 108, 110, 111, 115, 120, 122, 124, 125, 130}.

2. Compute cumulative distribution values via cj = (j − 0.5)/N, where j is the index number and N is the total number of samples. Here is an example:

j:   1     2     3     4     5     6     7     8     9     10
BP:  105   108   110   111   115   120   122   124   125   130
cj:  0.05  0.15  0.25  0.35  0.45  0.55  0.65  0.75  0.85  0.95

3. Find z-score values for the cumulative distribution values by using a standard z-table. For example, for c1 = 0.05, z1 = −1.64:

cj:  0.05   0.15   0.25   0.35   0.45   0.55  0.65  0.75  0.85  0.95
zj:  −1.64  −1.04  −0.67  −0.39  −0.13  0.13  0.39  0.67  1.04  1.64

4. Plot the observations (BP) against the cumulative distribution values (both z and c).
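As a cross-check of the hand calculation, the following Python sketch (an illustrative example outside of JMP; the variable names are mine) computes the same cumulative fractions and converts them to z-scores with the standard normal quantile function.

```python
import numpy as np
from scipy.stats import norm

bp = np.array([125, 110, 108, 130, 120, 111, 105, 122, 124, 115])

x = np.sort(bp)                      # step 1: sort ascending
n = len(x)
j = np.arange(1, n + 1)
c = (j - 0.5) / n                    # step 2: cumulative fractions
z = norm.ppf(c)                      # step 3: z-scores from the normal quantile function

for xi, ci, zi in zip(x, c, z):
    print(f"BP={xi:3d}  c={ci:.2f}  z={zi:+.2f}")
# Plotting x against z (step 4) should give a roughly straight line if the data are normal.
```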

JMP can compute the normal quantile plot via:

1. Continue with the data table created in Section 7.2.1.
2. Select Analyze ► Distribution.
3. Select Systolic for Y, Columns.
4. Click OK.
5. Click on the red triangle next to Systolic and select Normal Quantile Plot.

You are now presented with Figure 7.5. With the final normal quantile plot, you are now ready to interpret the figure. In general, if the paired numbers form a straight line, you can reasonably assume normality. As shown in this figure, the points closely follow the center line, and do not go outside the confidence intervals or show a distinct shape. Thus, it is reasonable to assume that this data is from a normal distribution.

Figure 7.5 Normal Quantile Plot of BP with Histogram of Example Blood Pressure Data with a Continuous Normal Fit

When analyzing a normal quantile plot, it is necessary to watch for telltale behaviors and shapes. Figure 7.5 presents a normal quantile plot that shows no apparent normality issues, as discussed above. However, various issues can arise that manifest in various shapes. Figure 7.6a shows the influence of an outlier (e.g., the value of 130 was changed to 150). If presented with Figure 7.6a, you could click on the apparent outlier and investigate which row was causing this issue. Figure 7.6b presents the effects of a skewed distribution; here, two values (20% of the BP data) were moved higher in magnitude. The result is that the distribution is skewing toward these higher values while the majority of the distribution still forms a straight line. Finally, Figure 7.6c presents the effects of a tailed distribution; here, the top two values were moved higher in magnitude and the bottom two values were moved lower in magnitude. The result is that the ends of the distribution "twist" the overall normal quantile plot. All three issues would need to be addressed by removing outliers or performing a data transformation (Kutner, Nachtsheim, Neter, & Li, 2005).

Figure 7.6 Examples of Issues in Normal Quantile Plots

7.2.3 Goodness-of-Fit Tests

To more effectively test whether the sample data can be assumed to come from a normal distribution, it is possible to perform a goodness-of-fit test whereby the sample data is compared to a normal distribution with similar centering and range to the sample data. The hypothesis test of interest is thus:

HO: the data come from a normally distributed population
HA: the data do not come from a normally distributed population

To perform this test:

1. Continue with the data table created in Section 7.2.1.
2. Select Analyze ► Distribution.
3. Select Systolic for Y, Columns.
4. Click OK.
5. Click on the red triangle next to Systolic and select Continuous Fit ► Normal.
6. Click on the red triangle next to Fitted Normal and select Goodness of Fit.

You are now presented with Figure 7.7. The top half of Figure 7.7 is identical to Figure 7.4; however, the goodness-of-fit test now appears below the normal distribution parameters. Here you are presented with a few key details: first, the test used is a Shapiro-Wilk W test, consistent with research (Conover, 1980), and then the W test statistic value and the p-value are presented. Following these details, interpretation cues are presented by JMP. These remind you what the null is (that the data comes from a population that is normally distributed) and that small p-values point toward rejecting the null.

Figure 7.7 Fitted Normal and Goodness-of-Fit Test

By examining the results in Figure 7.7, notice that the goodness-of-fit test failed to reject the null; thus, we can conclude that the data has no serious departures from normality. While it is impossible to say that this data is from a normal distribution, it is reasonable to assume that a t-test would be robust and give useful results. This is consistent with the interpretation found in Section 7.2.2, but differs from what you might have inferred by examining only the histograms, as discussed in Section 7.2.1. In reporting the results, it would be unnecessary to show the entire table; instead, it would be possible to write a simple sentence: "Data was assumed to be from a normal distribution after performing a Shapiro-Wilk W test (W = 0.95, p = 0.69)."
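For readers who want to reproduce this check outside of JMP, the hedged Python sketch below runs a Shapiro-Wilk test on the same blood pressure values with SciPy; the statistic and p-value should be in the same neighborhood as the values quoted above, although different software can round or implement the test slightly differently.

```python
from scipy.stats import shapiro

bp = [125, 110, 108, 130, 120, 111, 105, 122, 124, 115]

w_stat, p_value = shapiro(bp)
print(f"Shapiro-Wilk W = {w_stat:.2f}, p = {p_value:.2f}")

# A large p-value (well above 0.05) means we fail to reject the null
# that the data come from a normally distributed population.
```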

7.2.4 Goodness-of-Fit for Other Distributions There are many distributions other than normal that can be assumed in a statistical model. To compare samples versus a reference distribution, it is necessary to follow the general procedure in 7.2.3 to test a given data set. However, rather than select a normal distribution, you would need to select from a variety of common statistical distributions. Briefly:

1. Open the data set.
2. Begin with an underlying distribution assumption to test. This is left blank as __ below.
3. Select Analyze ► Distribution.
4. Select the Variable of Interest for Y, Columns.
5. Click OK.
6. For continuous distributions:
   a. Click on the red triangle next to the Variable of Interest and select Continuous Fit > __.
   b. Where __ is a placeholder for the distribution of interest.
7. For discrete distributions:
   a. Click on the red triangle next to the Variable of Interest and select Discrete Fit > __.
   b. Where __ is a placeholder for the distribution of interest.
8. Click on the red triangle next to __ and select Goodness of Fit.

From here, you would interpret the results consistently with Section 7.2.3, with the following in mind:

HO: the data come from a population with the assumed distribution
HA: the data do not come from a population with the assumed distribution





7.3 General Hypothesis Tests

Beyond the specific example in Section 7.2, there are a variety of tests and methods for performing a hypothesis test. Many tests revolve around the mean of a given data set since the mean indicates where a data distribution lies (its central tendency). Beyond testing a mean, you can test other parameters, such as the variance, which relates to the spread of the distribution. This topic will consider both mean and variance tests. The majority of the tests differ based on whether you know the population variance and can use a z-distribution/z-test or must use a sampled-data test (e.g., the t-distribution).

7.3.1 Z-Test Hypothesis Test of Mean

A z-test can be safely used when you have a large sample size and can thus treat the sample variance and standard deviation as a known, fixed quantity. For a large sample size such as that seen in the DermatologyAge data set, this is a reasonable assumption since the sample size is N = 358. For an age data set, a reasonable hypothesis test would be to compare against an assumption about a given population. Assume that the interest is in using data from this dermatological study to infer whether the population has an average age of 35 years. To test the hypothesis that the dermatology data has a similar average, the hypothesis test is formalized as follows:

HO: μ = 35
HA: μ ≠ 35

There are three general options to test a mean in JMP: t-test, z-test, or the Wilcoxon nonparametric test. Given that there are 358 samples, the Central Limit Theorem applies and, thus, it is possible to perform a z-test using the sample standard deviation found in the summary statistics. However, all three tests start similarly. To test this hypothesis using JMP:

1. Open the DermatologyAge.jmp file edited and corrected in Section 3.2.2.
2. Select Analyze ► Distribution.
3. Select AGE for Y, Columns. This indicates to JMP which column of data you want to analyze.
4. Click OK.
5. Click on the red triangle next to AGE and select Test Mean.
6. The test mean dialog box, as shown in Figure 7.8, appears.

Figure 7.8 Test Mean Dialog Box

To test this hypothesis:

● Type 35 into the Specify Hypothesized Mean field.

Since it is possible to apply the Central Limit Theorem, the sample standard deviation can be used:

1. Type 15.324557, found in a summary statistics table of the data, into the Enter True Standard Deviation to do z-test rather than t test field.
2. Click OK.

You are now presented with the test of the mean as shown in Figure 7.9.

Figure 7.9 Z-Test Result

The first part of Figure 7.9 shows summary statistics and values for the hypothesis test: the hypothesized mean, the actual sample mean, the degrees of freedom in the data, the computed standard deviation, and the standard deviation provided for the z-test. The results in the second half of Figure 7.9 are p-values for the three types of distributional assumptions: two-sided, lower, and upper. The first result in the z-test table in Figure 7.9 is interpreted as the following equation:

Prob > |z| = 0.1095

This is the p-value for HO: μ = 35 and HA: μ ≠ 35. Since this p-value is above 10%, you would fail to reject the null at both the 5% and 10% levels of significance and conclude that the hypothesized mean and the estimated mean are statistically equivalent. The next relationships are thus:

Prob > z = 0.0548 and its complement, Prob < z = 1 − 0.0548 = 0.9452

These are interpreted as the p-values for HO: μ = 35 and HA: μ > 35 and for HO: μ = 35 and HA: μ < 35, respectively. The p-values seen in these results are consistent with the sample mean being 36.2961 and the test mean being 35, with Prob > z being just slightly over 5% and on the verge of rejecting the null in favor of the alternative (HA: μ > 35). The reason for this is that the sample mean is larger than the hypothesized population mean. The result for Prob < z is similarly logical, since it implies that the observed sample mean would be unlikely if the null were true.
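The same z-test can be reproduced from the summary statistics alone. The Python sketch below (an illustrative check outside of JMP, using the sample mean, standard deviation, and sample size reported above) recovers essentially the same p-values.

```python
import math
from scipy.stats import norm

x_bar = 36.2961      # sample mean of AGE (reported in the summary statistics)
s     = 15.324557    # sample standard deviation, used as the "known" sigma
n     = 358          # sample size
mu_0  = 35           # hypothesized mean

z = (x_bar - mu_0) / (s / math.sqrt(n))

p_two_sided = 2 * norm.sf(abs(z))   # Prob > |z|
p_greater   = norm.sf(z)            # Prob > z  (HA: mu > 35)
p_less      = norm.cdf(z)           # Prob < z  (HA: mu < 35)

print(f"z = {z:.3f}")
print(f"Prob > |z| = {p_two_sided:.4f}")
print(f"Prob > z   = {p_greater:.4f}")
print(f"Prob < z   = {p_less:.4f}")
```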

7.3.2 T-Test Hypothesis Test of Mean

When data assumed to come from a normal distribution has a small sample size, the t-distribution is likely more relevant and more accurate to use (Daniel & Cross, 2013). The t-distribution resembles the normal distribution but is shorter at the peak and heavier in the tails, reflecting the added uncertainty of working with sample data; as you have more and more degrees of freedom, the t-distribution approaches the shape of the z-distribution (Daniel & Cross, 2013). In operation and interpretation, the t-test is very similar to the z-test.

Consider the blood pressure data from Section 7.2.1. Suppose you want to test a hypothesis that the mean blood pressure of the distribution is 110. Section 7.2 determined that it is reasonable to assume that this data came from a normal distribution. However, since there is a small sample size, N = 10 samples, and nothing is known about the underlying variance, the z-test cannot be used for this data. To test this hypothesis using JMP:

1. Consider the data from Section 7.2.1.
2. Select Analyze ► Distribution.
3. Select Systolic for Y, Columns.
4. Click OK.
5. Click on the red triangle next to Systolic and select Test Mean.
6. The test mean dialog box, as shown in Figure 7.8 in Section 7.3.1, appears.
7. Type 110 into the Specify Hypothesized Mean field.

You are now presented with the t-test result shown in Figure 7.10. Here the interpretation is similar to that discussed in Section 7.3.1, namely:

Prob > |t| = 0.0267

This is the p-value for HO: μ = 110 and HA: μ ≠ 110, indicating that the hypothesized mean is significantly different from the sampled mean and that you should reject the null. The situation is similar with:

Prob > t = 0.0133 and Prob < t = 0.9867

These are interpreted as the p-values for HO: μ = 110 and HA: μ > 110 and for HO: μ = 110 and HA: μ < 110, respectively. The p-values seen in these results are consistent with the sample mean of 117 being significantly larger than the hypothesized mean in this sample data. Thus, the p-value for Prob > t is much less than 5%. Therefore, it is likely, given this sample data, that the population mean is above the hypothesized mean of 110. Similarly, the p-value for Prob < t is almost 1, indicating that the observed sample mean would be unlikely if the null were true.

Figure 7.10 T-Test Result
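The equivalent calculation outside of JMP is a one-sample t-test; the SciPy sketch below (illustrative, using the same ten blood pressure values) should reproduce the two-sided p-value quoted above, with the one-sided values obtained by halving or complementing it.

```python
from scipy.stats import ttest_1samp

bp = [125, 110, 108, 130, 120, 111, 105, 122, 124, 115]
mu_0 = 110

t_stat, p_two_sided = ttest_1samp(bp, popmean=mu_0)

# One-sided p-values derived from the two-sided result (valid here because t_stat > 0)
p_greater = p_two_sided / 2          # HA: mu > 110
p_less    = 1 - p_two_sided / 2      # HA: mu < 110

print(f"t = {t_stat:.3f}")
print(f"Prob > |t| = {p_two_sided:.4f}")
print(f"Prob > t   = {p_greater:.4f}")
print(f"Prob < t   = {p_less:.4f}")
```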

7.3.3 Nonparametric Test of Mean (Wilcoxon Signed Rank)

Although the z-test and t-test methods are very useful in many circumstances, these tests fail to be useful when the data and underlying population are non-normal and the sample is not sufficiently large to invoke the Central Limit Theorem. For these situations, a nonparametric method is of interest. Nonparametric methods are "distribution free" in that they do not rely on an underlying reference distribution for computing p-values.

For an example of a non-normality problem, the next example will consider the data in Table 7.2. Here, hypothesized values for the length of stay in a medium-sized, midwestern hospital are considered. Naturally, you would assume that many patients are released the same day, with successively fewer patients staying longer and longer. Suppose you wanted to test the hypothesis that the average stay is three days, with the alternative that it is more than three days. Using the common understanding that the distribution will not be normal, it is not possible to easily test this hypothesis using the methods covered up to this point.

Table 7.2 Example Data for Length of Stays

Length of Stay (days)    N
0                        557
1                        175
2                        202
3                        145
4                        130
5                        133
6                        112
7                        117
8                        102
9                        95
10                       70

To plot a distribution of the data in Table 7.2, you can use the distribution tool discussed in Chapter 4. Begin by telling JMP that the output is length of stay, but that frequencies are associated with each observation. Then set up the Distribution dialog box as shown in Figure 7.11. When the distribution is plotted with a normal quantile plot, you get the results in Figure 7.12 where you can see that the data is heavily biased toward 0 and is not normally distributed.

Figure 7.11 Distribution Dialog Box for Frequency Data

Figure 7.12 Normal Probability Plot of Length of Stay Data

To test the hypothesis that the average length of stay in this hospital is three days or more, this example will largely follow the same steps as those in Section 7.3.2:

1. Consider the data set in Table 7.2.
2. Select Analyze ► Distribution.
3. Select Length of Stay (days) for Y, Columns.
4. Select N for Freq.
5. Click OK.
6. Click on the red triangle next to Length of Stay (days) and select Test Mean.
7. The test mean dialog box, as shown in Figure 7.8 in Section 7.3.1, appears.
8. Type 3 for Specify Hypothesized Mean.
9. Click the box for Wilcoxon Signed Rank.
10. Click OK.

You are now greeted with the output shown in Figure 7.13. Here are the summary statistics and a summary of the test in the upper half of the result table. Below this are the test statistics and p-values for both the t-test and the Signed-Rank test. The p-values are interpreted as discussed in Sections 7.3.1 and 7.3.2; however, disagreement exists between the t-test and Signed-Rank results. In Figure 7.13 you have a similar interpretation as discussed in Sections 7.3.1 and 7.3.2, namely: Prob > |t| is the p-value for HO: μ = 3 and HA: μ ≠ 3, Prob > t is the p-value associated with HO: μ = 3 and HA: μ > 3 (the primary test under consideration), and Prob < t is the p-value associated with HO: μ = 3 and HA: μ < 3.

If you were to use the t-test (whose assumptions have been violated), then you would reject the null for the first and second tests and conclude that it is likely that the average is greater than three. However, when the correct p-values from the signed rank test are used, you would make a more nuanced decision. First, for the primary test (HO: μ = 3 and HA: μ > 3), you would reject the null at the 5% level since the p-value is 3%. However, if you were to consider the two-sided test (HO: μ = 3 and HA: μ ≠ 3), you would fail to reject the null at the 5% level since the p-value is 6%. Thus, whether the result is statistically significant depends on what you are testing. An even more nuanced conclusion would describe how the result is statistically significant but possibly not practically significant, because the hypothesized value and the sample value are very close. This outcome is due to the two-sided test considering that the value could be both greater than and less than the test value; the p-value is based on both sides, and thus only half of the p-value comes from the greater-than side of the distribution. If the incorrect t-test were used, you would not make this conclusion and would not have as much insight into the data.

Figure 7.13 Wilcoxon Signed Rank (Nonparametric) and T-Test (Parametric) Results
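A roughly comparable calculation outside of JMP can be done with SciPy's Wilcoxon signed-rank test on the differences from the hypothesized value, as sketched below. The frequency table is expanded into individual observations first; note that the handling of zero differences and ties varies between implementations, so the p-values may not match JMP's exactly.

```python
import numpy as np
from scipy.stats import wilcoxon

days   = np.arange(11)                                    # 0 through 10 days
counts = np.array([557, 175, 202, 145, 130, 133, 112, 117, 102, 95, 70])

stays = np.repeat(days, counts)        # expand the frequency table into raw observations
diffs = stays - 3                      # differences from the hypothesized value of 3 days

# Two-sided and one-sided (greater) signed-rank tests
stat_two, p_two = wilcoxon(diffs)
stat_gt,  p_gt  = wilcoxon(diffs, alternative="greater")

print(f"two-sided p = {p_two:.3f}")
print(f"one-sided (greater) p = {p_gt:.3f}")
```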

7.3.4 Standard Deviation Hypothesis Test Testing how a population varies is another common statistical test. Rather than testing where the population might be centered (a mean), suppose you are interested in how the population’s distribution varies (Daniel & Cross, 2013). For an example, consider the DermatologyAge.jmp data set with the hypothesis that the standard deviation is 15 years. To test this hypothesis using JMP:

1. Open DermatologyAge.jmp.
2. Select Analyze ► Distribution.
3. Select AGE for Y, Columns. This indicates to JMP which column of data you want to analyze.
4. Click OK.
5. Click on the red triangle next to AGE and select Test Std Dev.
6. You are now presented with the dialog box shown in Figure 7.14.
7. Type 15 into the field Specify Hypothesized Standard Deviation.
8. Click OK.

Figure 7.14 Default Standard Deviation Test Dialog Box

You are now presented with the results shown in Figure 7.15 for the general hypothesis test HO: σ = 15 and HA: σ ≠ 15. Here, the result dialog is similar to those discussed in Sections 7.3.1 to 7.3.3, but this example uses a Chi-Squared parametric test. The first half of the result in Figure 7.15 shows the hypothesized value, the computed estimate from the samples, and the degrees of freedom. Below this are the Chi-Squared test statistic, computed from the data and the test value, and the resultant p-values. Thus, in Figure 7.15, Min PValue is the p-value for HO: σ = 15 and HA: σ ≠ 15, Prob < ChiSq is the p-value for HO: σ = 15 and HA: σ < 15, and Prob > ChiSq is the p-value for HO: σ = 15 and HA: σ > 15.

Figure 7.15 Chi-Squared Test for Standard Deviation
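The underlying calculation is the chi-squared statistic (n − 1)s2/σ02 compared against a chi-squared distribution with n − 1 degrees of freedom. The Python sketch below (illustrative, using the sample standard deviation and sample size reported earlier for the AGE column) shows how the one-sided and two-sided p-values can be obtained outside of JMP.

```python
from scipy.stats import chi2

n       = 358          # sample size of the AGE column
s       = 15.324557    # sample standard deviation
sigma_0 = 15.0         # hypothesized standard deviation

chi2_stat = (n - 1) * s**2 / sigma_0**2
df = n - 1

p_greater = chi2.sf(chi2_stat, df)           # Prob > ChiSq (HA: sigma > 15)
p_less    = chi2.cdf(chi2_stat, df)          # Prob < ChiSq (HA: sigma < 15)
p_min     = 2 * min(p_greater, p_less)       # a two-sided "Min PValue"-style result

print(f"chi-square = {chi2_stat:.1f} on {df} df")
print(f"Prob > ChiSq = {p_greater:.4f}")
print(f"Prob < ChiSq = {p_less:.4f}")
print(f"Min PValue   = {p_min:.4f}")
```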

7.3.5 Tests of Proportions

A proportional test is used when considering proportional quantities (e.g., percentages) and testing whether a proportion is within expected values or a specification. Proportional tests are related to the z-test and, in JMP, involve a few different steps compared to testing a mean. For an example problem, this example considers the data in Table 7.3, which regards the physician and resident rosters at the Southern Ohio Medical Center in Portsmouth, OH. This data originated from the 2015 Ohio Department of Health Annual Report for the Southern Ohio Medical Center (Ohio Department of Health, 2016) and has an N of 103. Of interest is testing whether the observed values are statistically similar to a hypothesized value. This example will test the hypothesis that both groups are of equal probability (0.5).

Table 7.3 Southern Ohio Medical Center Data

Title                  Number
Contract Physician     83
Salaried Physician     20

To test this hypothesis using JMP:

1. Type in the values as they appear in Table 7.3.
2. Select Analyze ► Distribution.
3. Select Title for Y, Columns.
4. Select Number for Freq.
5. Click OK.

You are greeted with Figure 7.16. Although the process appears similar to that in Section 7.3.3, the Title variable is nominal and, thus, these are categorical groups. To test that the population proportion of both groups is 0.5, it is necessary to perform a proportional hypothesis test.

Figure 7.16 Bar Chart for Proportional Data

To perform the hypothesis test:

1. Click on the red triangle next to Title and select Test Probabilities.
2. A dialog box, Figure 7.17, will appear where you must fill in the hypothesized probability for each group and select the alternative hypothesis.

Figure 7.17 Proportional Hypothesis Test Dialog Box

A typical hypothesis would be that both groups are equally likely, thus a hypothesized probability of 0.5 for each. The alternative hypothesis selection options are similar to those you saw previously: not equal (two-sided test) or greater than/less than.

1. For this example, fill in 0.5 for both Hypoth Prob fields.
2. Select Probabilities not equal to hypothesized value.
3. Click Done.

You are now greeted with the results shown in Figure 7.18. Inherently, the hypothesis test considered HO: p1 = p2 = 0.50 against the alternative that the probabilities differ from 0.50. Here, p-values were computed using two methods: a likelihood ratio and Pearson. Notice from the p-values that the null would be rejected in both cases. This result is logical since the observed proportions were very far from the hypothesized value.

Figure 7.18 Proportional Hypothesis Test Results
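For comparison outside of JMP, a Pearson chi-squared test of the same hypothesis can be run in a few lines of Python; an exact binomial test is also shown as a cross-check. This is an illustrative sketch, not JMP's exact likelihood-ratio computation.

```python
from scipy.stats import chisquare, binomtest

observed = [83, 20]                 # Contract vs. Salaried physicians (N = 103)
expected = [103 * 0.5, 103 * 0.5]   # equal-probability hypothesis

chi2_stat, p_pearson = chisquare(observed, f_exp=expected)
print(f"Pearson chi-square = {chi2_stat:.1f}, p = {p_pearson:.2e}")

# Exact binomial test of p = 0.5 for the Contract Physician group
exact = binomtest(83, n=103, p=0.5, alternative="two-sided")
print(f"Exact binomial p = {exact.pvalue:.2e}")
```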

In addition, if you want to test a different hypothesis, you can run another test:

1. Click on the red triangle next to Title and select Test Probabilities.
2. Fill in the appropriate probability for both Hypoth Prob fields.
3. Select the desired sided-test alternative hypothesis.
4. Click Done.


7.4 Confidence Intervals

Confidence intervals provide a region within which it is reasonable to expect a population parameter to lie. Confidence intervals can be described by the following very general expression:

estimator ± (reliability coefficient) × (standard error of the estimator)

In the expression, the estimator is the sample quantity of interest (e.g., a mean), the reliability coefficient is the test statistic value associated with the reference distribution and the size of the data, and the standard error of the estimator is the standard error of the data (Daniel & Cross, 2013).

7.4.1 Mean Confidence Intervals

Mean confidence intervals place an interval around the mean and use either the z-distribution or the t-distribution for the reliability coefficient, with expectations similar to a mean hypothesis test. To add a mean confidence interval, the next example will largely follow the directions shown in Sections 7.3.1 to 7.3.3, which were used to perform a hypothesis test. For an example of adding a 95% confidence interval, consider the DermatologyAge data set:

1. Open DermatologyAge.jmp.
2. Select Analyze ► Distribution.
3. Select AGE for Y, Columns. This indicates to JMP which column of data you want to analyze.
4. Click OK.
5. Click on the red triangle next to AGE and select Confidence Interval > 0.95.

The confidence interval report, as shown in Figure 7.19, appears. Here, you are given the upper and lower confidence limits on both the mean and the standard deviation. To use the confidence interval as a hypothesis test, look for the hypothesized value within the confidence interval. If it is within the confidence interval, then it is reasonable to fail to reject the null at the 95% confidence level. If it is outside the interval, then it is reasonable to reject the null in favor of the alternative at the 95% confidence level. When a hypothesized value falls on the edge of an interval, the p-value is approaching the reference alpha value.

Figure 7.19 Confidence Interval on Mean for DermatologyAge Data Set
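The 95% interval on the mean can be reproduced from the summary statistics with a t-based calculation, as in the hedged Python sketch below (illustrative; the inputs are the values reported earlier for the AGE column).

```python
import math
from scipy.stats import t

x_bar = 36.2961      # sample mean of AGE
s     = 15.324557    # sample standard deviation
n     = 358

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)      # reliability coefficient
half_width = t_crit * s / math.sqrt(n)       # reliability coefficient x standard error

print(f"95% CI for the mean: ({x_bar - half_width:.2f}, {x_bar + half_width:.2f})")
```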

7.4.2 Mean Confidence Intervals with Different Thresholds

If you wanted to create a confidence interval with a different threshold, do the following:

● Click on the red triangle next to AGE and select Confidence Interval.

The possible options are shown in Figure 7.20. By default, you could select 90%, 95%, or 99%. To select a different significance threshold:

● Click on the red triangle next to AGE and select Confidence Interval > Other.

That would give you the dialog box shown in Figure 7.21. Here you can specify a custom significance level that will change the size of any resultant confidence interval. Also, if you wanted a z-distribution-based confidence interval, you could select Use known Sigma in the dialog box shown in Figure 7.21.

Figure 7.20 Settings to Change Confidence Interval

Figure 7.21 Confidence Interval Customization Dialog Box

7.4.3 Confidence Intervals for Proportions Proportional confidence intervals are computed in a similar way to z-score confidence intervals and the proportional hypothesis test discussed in Section 7.3.5. The next example reconsiders the data shown in Table 7.3.

1. Type in the values from Table 7.3.
2. Select Analyze ► Distribution.
3. Select Title for Y, Columns.
4. Select Number for Freq.
5. Click OK.
6. Click on the red triangle next to Title and select Confidence Interval > 0.95.

This creates a 95% confidence interval on the proportions, as shown in Figure 7.22. Figure 7.22 Proportional Confidence Intervals

From Figure 7.22, you can tell a few things about the data: you can see the anticipated range for each proportion, and you can perform the hypothesis test directly from the confidence intervals. If the hypothesis is that each group is equally likely (a probability of 0.50, since there are two groups, or H0: p1 = p2), then you are looking for any group whose confidence interval contains 0.50. Since none of the confidence intervals contain this value, it is reasonable to reject the null in favor of the implicit alternative that the group probabilities are not 0.50.
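An approximate version of these intervals can be computed outside of JMP with a normal-approximation (Wald) formula, as sketched below; JMP may use a different interval construction, so treat this as an illustrative check rather than an exact reproduction.

```python
import math
from scipy.stats import norm

counts = {"Contract Physician": 83, "Salaried Physician": 20}
n = sum(counts.values())
z = norm.ppf(0.975)                     # 95% two-sided critical value

for title, k in counts.items():
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    lo, hi = p_hat - z * se, p_hat + z * se
    print(f"{title}: p_hat = {p_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```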

7.5 Chi-Squared Analysis of Frequency and Contingency Tables

Chi-squared tests enable you to consider goodness of fit, whereby a sample data set's distribution is compared against a hypothesized distribution. The chi-squared test extends from the z-score and is computed as follows:

χ2 = Σ (O − E)2 / E

In the equation, O are the observed values for a cell and E are the expected values for that cell based on the sample size (Daniel & Cross, 2013). The computed value of χ2 is then compared against a reference value found based on the degrees of freedom in the data. An example is presented in Table 7.4, where hypothetical data was collected on the physician selected by patients. You might hypothesize that there is no bias in preference (a uniform distribution); however, you might observe a preference for one physician over others, as shown in Table 7.4. A chi-squared test could be used with the general hypothesis that H0: d1 = d2 = d3; you would reject H0 if the test statistic value is equal to or greater than the reference (critical) value (Daniel & Cross, 2013). A short computational sketch of this test follows Table 7.4.

Table 7.4 Example Goodness-of-Fit Data for Expected Uniform Distribution

Physician Selected    Number of Patients Seen (Observed)    Expected
1                     15                                    30
2                     45                                    30
3                     30                                    30
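As promised above, here is a minimal Python sketch of the chi-squared goodness-of-fit computation for the Table 7.4 counts, assuming a uniform expectation of 30 patients per physician; this is an illustrative check, not the JMP workflow.

```python
from scipy.stats import chisquare

observed = [15, 45, 30]
expected = [30, 30, 30]            # uniform (no-preference) hypothesis

chi2_stat, p_value = chisquare(observed, f_exp=expected)

# chi2 = (15-30)^2/30 + (45-30)^2/30 + (30-30)^2/30 = 15.0 on 2 degrees of freedom
print(f"chi-square = {chi2_stat:.1f}, p = {p_value:.4f}")
```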

One variant of goodness-of-fit is contingency tables that enable you to examine independence of responses (Daniel & Cross, 2013). As an example, Table 7.5 presents a hypothetical contingency table of two factors: whether a subject has diabetes (the columns) and whether a subject was exposed to some material (rows). Of interest is whether the diabetes diagnosis is independent of the exposure. Table 7.5 Example Contingency Table

                Diabetes
Exposure        Yes    No
Yes             30     12
No              10     50

The most effective way to perform a contingency table analysis on the data in Table 7.5 is to type the data into a JMP data table that looks like Figure 7.23. In Figure 7.23, you have three columns, Diabetes, Exposure, and N. The Diabetes and Exposure columns are the pairs of Yes and No seen in Table 7.5, with the respective number of subjects for each pair appearing in the N column. After the table is typed into JMP, ensure that the Diabetes and Exposure variables are the nominal model type to enable a contingency analysis. In JMP, to perform a contingency table analysis, the variables must be nominal or ordinal. Figure 7.23 Example Contingency Table Data in JMP

To analyze the data seen in Figure 7.23:

1. Select Analyze ► Fit Y by X.
2. Select Diabetes for Y, Response. This is the possible outcome of the exposure.
3. Select Exposure for X, Factor.
   a. This is the input/action that might have led to the outcome.
4. Select N for Freq.
   a. This is the total number of observations per pair.
   b. If the data were instead represented as 102 rows, one for each subject in this data set, then this field would be left empty.
5. Click OK when the dialog box looks like Figure 7.24.

The results will then appear, as shown in Figure 7.25. In Figure 7.25, notice that you first see a color-coded representation of the frequencies in the mosaic plot. More exact is the data seen in the contingency table itself (immediately below the mosaic plot). Here, the columns represent levels of one factor (Exposure) and the rows represent levels of the other factor (Diabetes). The top corner of the contingency table provides the key to understanding each cell: the first value is the count in that cell, the second is the percentage of the total, and then the percentages by column and row follow. Rows and columns for the total count and total percent are then provided.

Figure 7.24 Fit Y by X for Contingency Table Analysis

Below the contingency table, in Figure 7.25, are the statistical results. First, a brief summary of the data is provided. Following this, results from two statistical tests, Likelihood Ratio and Pearson, are presented with p-values for the general test of whether these two distributions are statistically similar. Below these values, there are values for one- and two-sided tests. If you have

140 Biostatistics Using JMP: A Practical Guide JMP Pro, then exact versions of three tests are also available: Fisher’s Exact test, Cochran Armitage Trend test, and the Agreement test. Figure 7.25 Contingency Table Result for Data in Table 7.5

Chapter 7: Statistical Tests and Confidence Intervals 141

7.6 Two Sample Tests Briefly, this section will consider extensions of the hypothesis tests on means. However, because of Cochran’s theorem (Kutner, Nachtsheim, Neter, & Li, 2005), a two-sample test is equivalent to performing an analysis of variance (ANOVA). See Chapter 8 for further discussion. When considering two groups (e.g., a placebo and a treatment group), you have a few options for analysis. If there are matched pairs (e.g., before-treatment and after-treatment measurements), you can use Matched Pairs analysis to compute the strength of the relationships (between groups and between measurements).

7.6.1 Comparing Two Group Means When you have two groups to compare, you can organize the data simply with one column for the subject identifier, the second column for the measurement, and the third column for the group membership (e.g., Table 7.6). Two simple analysis options exist: pooled t-test and t-test. Technical differences exist in how these methods compute the sample variance and degrees of freedom. Practical differences exist in the type of results you get; test statistic values and p-values will differ. Table 7.6 Example Hypothetical Data on Reaction Time to a Task by Gender

Subject    Reaction Time (ms)    Gender        Subject    Reaction Time (ms)    Gender
1          10.5                  M             7          10.2                  F
2          11.1                  M             8          10.3                  F
3          9.8                   M             9          9.9                   F
4          10.6                  M             10         10.9                  F
5          10.2                  M             11         10.6                  F
6          10.1                  M             12         10.2                  F

When deciding which method to use, it is reasonable to do the following:

1. Use the pooled t-test if the sample sizes are equal and the standard deviations are similar.
2. Use the pooled t-test if the sample sizes are different, but the standard deviations are very similar.
3. Use the t-test (unpooled) if the sample sizes are different and the standard deviations are very different.
4. Use matched pairs if there are repeated measures between groups.


7.6.1.1 Assuming Equal Variances

If you can assume that the two groups have the same population variances, then you can use a pooled t-test. A good example is shown in Table 7.6, where reaction time for a common task was measured along with gender and controlled for age and experience. Thus, you would expect the male (M) and female (F) groups to come from populations with similar variances, and you could reasonably assume equal variances.

1. Create a new data table and type in the data from Table 7.6.
   a. Reaction Time (ms), the response, must be continuous, and Gender, the factor, must be discrete (nominal or ordinal). Since there is no hierarchy on gender, nominal is a logical choice.
2. Select Analyze ► Fit Y by X.
3. Select Reaction Time (ms) for Y, Response.
4. Select Gender for X, Factor.
5. Click OK.
6. Click the red triangle next to Oneway Analysis of Reaction Time (ms) By Gender.
7. Select Means/Anova/Pooled t.

The results in Figure 7.26 show the points plotted for each group with associated 95% confidence intervals. Of particular interest is whether the confidence intervals overlap; substantial overlap indicates that there is no statistically measurable difference between the groups. Here, it is apparent that the confidence intervals do overlap, and thus no difference is measurable. Below the plot are summary characteristics about the data and an assessment of the quality of fit, in a linear regression/ANOVA sense. See Chapters 8 through 10 for discussion of ANOVA, regression, and curve fitting.

The t-test table and ANOVA table show equivalent results, consistent with Cochran's theorem (Kutner, Nachtsheim, Neter, & Li, 2005). This is best seen in that the t-ratio (t-test table) and F-ratio (ANOVA table) are related as t2 ≈ F, and in that the ANOVA table's p-value and the p-value for Prob > |t| are the same. The t-test table in Figure 7.26 presents values for the mean difference between groups, the confidence interval on this mean difference, and the associated t-ratio and p-values. Although presented similarly to the tables seen in Sections 7.3.1 and 7.3.2, here Prob > |t| is the p-value for HO: μ1 = μ2 and HA: μ1 ≠ μ2, where μ1 is the mean of column 1 and μ2 is the mean of column 2. Also, Prob > t is the p-value associated with HO: μ1 = μ2 and HA: μ1 > μ2, and Prob < t presents the p-value associated with HO: μ1 = μ2 and HA: μ1 < μ2. Since the confidence interval contains 0, and the plot shows that there is significant overlap between groups, the p-values for the three hypothesis tests are all very large and point to failing to reject the null. Thus, it is reasonable to conclude that there is no statistically measurable difference between males and females for the study data presented in Table 7.6.

Figure 7.26 Two Sample T-Test by ANOVA
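The pooled two-sample t-test and the equivalence t2 ≈ F can be checked outside of JMP with the Table 7.6 data, as in the illustrative Python sketch below.

```python
from scipy.stats import ttest_ind, f_oneway

male   = [10.5, 11.1, 9.8, 10.6, 10.2, 10.1]
female = [10.2, 10.3, 9.9, 10.9, 10.6, 10.2]

t_stat, p_t = ttest_ind(male, female, equal_var=True)   # pooled t-test
f_stat, p_f = f_oneway(male, female)                    # one-way ANOVA on the same two groups

print(f"t = {t_stat:.3f}, two-sided p = {p_t:.3f}")
print(f"F = {f_stat:.3f}, p = {p_f:.3f}")
print(f"t^2 = {t_stat**2:.3f} (should match F)")
```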


7.6.1.2 Assuming Unequal Variance

When it is impossible to assume equal population variances, the general (unpooled) t-test must be used, which does not pool the variance. In using this method, an adjusted degrees of freedom is computed that will likely not be a whole number (e.g., in the following example, the pooled t-test would have 10 degrees of freedom, but the general t-test has 5.465 degrees of freedom). One example is presented in Table 7.7 for brain-derived neurotrophic factor (BDNF) serum levels for two groups: those with the wild-type BDNF allele (Wild Type) and those who do not have the wild-type BDNF allele (Mutant). The hypothetical values seen in Table 7.7 are consistent with the group values reported in Hawkes et al. (2017).

Table 7.7 Example Hypothetical Data on BDNF Serum Level versus BDNF Allele

Subject    BDNF Serum    BDNF Allele        Subject    BDNF Serum    BDNF Allele
1          15575.6       Mutant             7          27501.8       Wild Type
2          17541.1       Mutant             8          33070.9       Wild Type
3          22711.5       Mutant             9          45112.2       Wild Type
4          19288.3       Mutant             10         6712.3        Wild Type
5          21671.2       Mutant             11         29719.2       Wild Type
6          21016.1       Mutant             12         31734.1       Wild Type

To analyze the data seen in Table 7.7:

1. Create a new data table and type in the data from Table 7.7.
   a. BDNF Serum, the response, must be continuous, and BDNF Allele, the factor, must be discrete (nominal or ordinal). Since there is no hierarchy on allele, nominal is a logical choice.
2. Select Analyze ► Fit Y by X.
3. Select BDNF Serum for Y, Response.
4. Select BDNF Allele for X, Factor.
5. Click OK.
6. Click the red triangle next to Oneway Analysis of BDNF Serum By BDNF Allele.
7. Select t Test.

The results are then shown in Figure 7.27. Here, you have points plotted for each group and a table of results for the t-test. Like the results in Section 7.6.1.1, the t-test table in Figure 7.27 presents values for the mean difference between groups, the confidence interval on this mean difference, and the associated t-ratio and p-values. Again, as discussed in Section 7.6.1.1: Prob > |t| is the p-value for HO: μ1 = μ2 and HA: μ1 ≠ μ2, where μ1 is the mean of column 1 and μ2 is the mean of column 2. Also, Prob > t is the p-value associated with HO: μ1 = μ2 and HA: μ1 > μ2, and Prob < t presents the p-value associated with HO: μ1 = μ2 and HA: μ1 < μ2.

Since the confidence interval contains 0, most of the p-values are large and point to failing to reject the null. However, the p-value for Prob > t is near 5% (the baseline confidence level used was 95%); thus, the results show that there is possibly a measurable difference between the two groups, with μ1 > μ2.

Figure 7.27 Paired T-Test with Unequal Variances
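The unpooled (unequal-variance) comparison can be reproduced outside of JMP with Welch's t-test on the Table 7.7 values, as in the illustrative sketch below; the fractional degrees of freedom mentioned above come from this type of adjustment.

```python
from scipy.stats import ttest_ind

mutant    = [15575.6, 17541.1, 22711.5, 19288.3, 21671.2, 21016.1]
wild_type = [27501.8, 33070.9, 45112.2, 6712.3, 29719.2, 31734.1]

# equal_var=False requests Welch's t-test (no pooled variance)
t_stat, p_two_sided = ttest_ind(wild_type, mutant, equal_var=False)

print(f"t = {t_stat:.3f}")
print(f"Prob > |t| = {p_two_sided:.4f}")
# One-sided p for Wild Type > Mutant (valid here because t_stat > 0)
print(f"one-sided p = {p_two_sided / 2:.4f}")
```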

7.6.2 Paired Comparison, Matched Pairs An alternative to computing the differences is to perform a matched pairs analysis in JMP (Lehman, O'Rourke, Hatcher, & Stepanski, 2013). This will provide a paired t-test for two columns of data. To perform a matched pairs analysis:

1. Open BMI_Example.jmp.
2. Select Analyze ► Specialized Modeling ► Matched Pairs.
3. Select Title for Y, Columns.
4. Select Number for Freq.
5. Click OK when the dialog box looks like Figure 7.28.

Figure 7.28 Matched Pairs Dialog Box

The result shown in Figure 7.29 plots each observation with the difference between the values of the two columns on the y-axis and the mean value of each observation on the x-axis. Dotted lines indicate the 95% confidence interval around the mean difference. If this confidence interval includes 0, then there is no statistical difference between the two group means. In Figure 7.29 the confidence interval does not include 0; thus, you can conclude that there is a statistically significant difference between the two sets.

Additional statistical values are presented in the table shown at the bottom of Figure 7.29. The table presents the mean value for both sets, the mean difference, the number of samples, and the confidence interval limit values. In addition, the correlation between the variables is also presented. The right side of the table presents the t-ratio for this experiment and p-values for the general hypothesis tests. In Figure 7.29 you have a similar interpretation as discussed in Sections 7.3.1 and 7.3.2, but with Prob > |t| being the p-value for HO: μ1 = μ2 and HA: μ1 ≠ μ2, where μ1 is the mean of column 1 and μ2 is the mean of column 2. Also, Prob > t is the p-value associated with HO: μ1 = μ2 and HA: μ1 > μ2, and Prob < t is the p-value associated with HO: μ1 = μ2 and HA: μ1 < μ2.

Thus, it is possible to interpret the results in the plot shown in Figure 7.29 consistently with the table results. The null for Prob > |t| is rejected because there is obviously a difference between the groups; this is seen in the confidence interval not including 0. It is necessary to fail to reject the null for Prob > t because μ1 is greater than μ2 and thus the plotted difference between means is negative, as seen in the plot. Finally, the null is rejected in favor of the alternative for Prob < t because μ1 is greater than μ2, as seen in the prior hypothesis tests.

Figure 7.29 Matched Pairs Result
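A paired t-test outside of JMP follows the same logic on two columns of matched measurements. The sketch below uses small made-up before/after values purely for illustration, since the contents of BMI_Example.jmp are not listed here; the column names and numbers are hypothetical.

```python
from scipy.stats import ttest_rel

# Hypothetical matched measurements (NOT the BMI_Example.jmp data)
before = [27.1, 30.4, 25.8, 29.9, 31.2, 26.5]
after  = [26.0, 29.1, 25.1, 28.4, 30.0, 25.9]

t_stat, p_two_sided = ttest_rel(before, after)

print(f"paired t = {t_stat:.3f}")
print(f"Prob > |t| = {p_two_sided:.4f}")
# The sign of t follows the order of the arguments (before - after here).
```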

7.7 References

Aczel, A. (1993). Complete Business Statistics (2nd ed.). Homewood, IL: Irwin.
Bradley, J. V. (1980). Nonrobustness in Z, t, and F tests at large sample sizes. Bulletin of the Psychonomic Society, 16(5), 333-336.
Conover, W. J. (1980). Practical Nonparametric Statistics. John Wiley & Sons.
Cumming, G. (2007). Inference by eye: Pictures of confidence intervals and thinking about levels of confidence. Teaching Statistics, 29(3), 89-93.
Cumming, G. (2009). Inference by eye: Reading the overlap of independent confidence intervals. Statistics in Medicine, 28(2), 205-220.
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170-180.
Daniel, W. W., & Cross, C. L. (2013). Biostatistics: A Foundation for Analysis in the Health Sciences (10th ed.). John Wiley & Sons.
de Winter, J., & Dodou, D. (2010). Five-point Likert items: t test versus Mann-Whitney-Wilcoxon. Practical Assessment, Research and Evaluation, 15(11).
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Hawkes, T., Eveland, E., Bihl, T. J., Frey, J. S., & Mauzy, C. A. (2017). Influence of BDNF genotype and exercise type on post-training, post-acute exercise bout BDNF serum levels and VO2Max. AFRL Technical Report.
Hayslett, H. T. (1968). Statistics Made Simple. New York: Doubleday.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models. New York: McGraw Hill/Irwin.
Lehman, A., O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2013). JMP for Basic Univariate and Multivariate Statistics. Cary, NC: SAS Institute.
Mason, R., Lind, D., & Marchal, W. (1999). Statistical Techniques in Business and Economics (10th ed.). Boston: Irwin McGraw Hill.
Ohio Department of Health. (2016). Southern Ohio Medical Center, Annual Hospital Registration and Planning Report. Columbus: Ohio Department of Health.
Reames, M., & Kemeny, G. (2011, Aug. 31). A Layperson's Guide to Hypothesis Testing. Retrieved Aug. 1, 2016, from ProcessGPS: http://www.processgps.com/2011/08/31/a-layperson%E2%80%99s-guide-to-hypothesis-testing/
Young, R. K., & Veldman, D. J. (1977). Introductory Statistics for the Behavioral Sciences. Holt McDougal.

Chapter 8: Analysis of Variance (ANOVA) and Design of Experiments (DoE)

8.1 Introduction ... 149
8.2 One-Way ANOVA ... 151
    8.2.1 One-Way ANOVA with Fit Y by X ... 151
    8.2.2 Means Comparison, LSD Matrix, and Connecting Letters ... 155
    8.2.3 Fit Y by X Changing Significance Levels ... 157
    8.2.4 Multiple Comparisons, Multiple One-Way ANOVAs ... 158
    8.2.5 One-Way ANOVA via Fit Model ... 160
    8.2.6 One-Way ANOVA for Unequal Group Sizes (Unbalanced) ... 165
8.3 Blocking ... 167
    8.3.1 One-Way ANOVA with Blocking via Fit Y by X ... 167
    8.3.2 One-Way ANOVA with Blocking via Fit Model ... 170
    8.3.3 Note on Blocking ... 171
8.4 Multiple Factors ... 171
    8.4.1 Experimental Design Considerations ... 172
    8.4.2 Multiple ANOVA ... 176
    8.4.3 Feature Selection and Parsimonious Models ... 179
8.5 Multivariate ANOVA (MANOVA) and Repeated Measures ... 183
    8.5.1 Repeated Measures MANOVA Background ... 183
    8.5.2 MANOVA in Fit Model ... 184
8.6 References ... 188

8.1 Introduction

Analysis of variance (ANOVA) extends concepts seen in hypothesis testing to test the means of multiple groups, in addition to the interaction of multiple factors (Montgomery D. C., 2008). The ability of ANOVA to examine the significance of interactions between factors makes it possible to find effects that are not obvious individually (e.g., a drug that has a powerful effect only if the dosage and age combinations are correctly matched).

ANOVA considers response data, Y, from an experimental design that had a factor X with a different treatment levels (or magnitudes), where a denotes the number of levels. For example, you could have one dosage factor with 3 treatment levels: a 50mg drug dose, a 100mg drug dose, and placebo. If you had five subjects per group, a total of 15 observations are present. In addition, you could have a similar study with the same dosage factor, but also a gender factor, to determine whether the response is due to the treatment (dosage), gender, or the interaction between these factors (dosage X gender).

Consistent with many conventions (e.g., Montgomery D. C., 2008), a common format for a simple ANOVA matrix is presented in Table 8.1. In this table, you can see that there are N = a x n observations (e.g., N = 3 x 5 = 15 for the above example), with a total of a treatment levels and n observations per treatment. To manipulate the matrix for use in JMP, it would be necessary to consider the methods in Chapter 3. ANOVA is not limited to square designs either, as discussed in Section 8.2.2, and ANOVA can consider multiple treatments and their interactions, as in Sections 8.3 to 8.5.

Table 8.1 Simple ANOVA Matrix of Observations versus Treatments for a Single Factor Design

Treatment    Observations
1            y11   y12   ...   y1n
2            y21   y22   ...   y2n
...          ...   ...   ...   ...
a            ya1   ya2   ...   yan

ANOVA operates by creating a linear statistical model between a dependent variable, Y, and an independent variable or variables, X, such as this:

yij = μi + εij

In the equation, μi is the ith treatment mean, and εij is the random error inherent in the data (due to measurement variation, natural variation, and missing/unknown variables) (Montgomery D. C., 2008). The random error, εij, is assumed to be normally distributed, N(0, σ2), and thus you can assume that the errors are due to random variations. The treatment mean inherently includes both the overall mean of the response and the treatment effect (Montgomery D. C., 2008). The underlying test being performed in ANOVA is as follows:

HO: μ1 = μ2 = ⋯ = μa
HA: at least one treatment mean differs from the others

Thus, the hypothesis test in ANOVA reports only whether at least one group mean is different from the others. To understand which group mean is different, it is helpful to use statistical tests for pairwise comparisons of group means using a t-test; see Chapter 7. The ANOVA mathematical model is also directly related to linear regression (Chapter 9) and uses similar mathematical operations to create linear models of data. ANOVA is not limited to the simple case presented in Table 8.1; it is possible to consider different sizes per group and/or multiple treatments and their interactions. Limits do exist in meaningful group sizes (Stevens, 2002). This chapter first presents the simple one-way ANOVA method and then builds up to multiple factors, interactions, and special cases. Repeated measures data and its analysis are also discussed.


8.2 One-Way ANOVA

One-way ANOVA refers to the simple case where one treatment variable (one X) is considered with one response variable (Y). One-way ANOVA extends the concepts seen in t-tests (Chapter 7) to consider multiple groups; t-tests were limited to comparing two treatment groups. If you were to apply one-way ANOVA to the simple two-group problem seen in Chapter 7 for the t-test, the results would be equivalent, per Cochran's theorem (Kutner, Nachtsheim, Neter, & Li, 2005).

For an illustrative example, this topic will consider the data set RatHesperidin.jmp, which was collected by Banji (2016) and Kumar (2015) and is similar to their published study (Banji, Banji, Dasaroju, & Kumar, 2013). This data set was from a study to test the effects of Hesperidin on diabetes (induced by a single dose of streptozotocin, 90 mg/kg body weight, dissolved in a citrate buffer, 0.1 mol/L, pH 4.2, and administered intraperitoneally) in male Wistar rats. This data set has six columns of data: Group, a nominal variable with four types (Control, Diabetic, Dose 1, and Dose 2), and then five continuous output variables: GSH, the glutathione level; SOD, the superoxide dismutase level; TP, total protein levels; Creatinine; and Insulin. For treatment, Dose 1 corresponds to 40mg/kg of Hesperidin and Dose 2 to 80mg/kg of Hesperidin.

8.2.1 One-Way ANOVA with Fit Y by X

Fit Y by X, as briefly discussed in Chapter 7, performs one of four modeling methods, and the platform automatically determines which method to use based on the modeling types of the inputs and outputs. Figure 8.1 shows the four options available in Fit Y by X. Chapter 8 considers analysis when you have a discrete (nominal or ordinal) variable for the X field and a continuous variable for the Y field. Thus, Fit Y by X will automatically use one-way ANOVA by default.

Figure 8.1 Fit Y by X Quadrant

For simple one-way ANOVA problems, the Fit Y by X platform can be very expedient, and thus it is a good idea to consider that method first. To perform ANOVA in JMP using the Fit Y by X tool, considering the RatHesperidin data:

1. Open RatHesperidin.jmp.
2. Select Analyze ► Fit Y by X. The Fit Y by X dialog box now appears, as shown in Figure 8.2. This initial example considers only the Group and GSH columns.

Figure 8.2 Fit Y by X

3. Select GSH for Y, Response.
4. Select Group for X, Factor.
5. The word "Oneway" now appears over the method quadrant, indicating a one-way ANOVA model will be used.

6. Click OK.

The preliminary results shown in Figure 8.3 are now displayed in a plot of X against Y. Here, notice that the GSH data is binned into the four groups (Control, Diabetic, Dose 1, and Dose 2) on the x-axis, and the y-axis is continuous between 10 and 50 (the range of the data). First, observe that GSH values are highest for the Control group and lowest for the Diabetic group. The two treatment groups, Dose 1 and Dose 2, show an improvement over the Diabetic group. To avoid making possibly meaningless inferences based on the raw numbers on the y-axis, of interest is the statistical significance of these groups. This will be accomplished through ANOVA.

Figure 8.3 Preliminary Results from Fit Y by X

To further analyze this data with ANOVA:

1. Click on the red triangle next to Oneway Analysis of GSH by Group.
2. Select Means/Anova.

The ANOVA results shown in Figure 8.4 are then displayed. First, notice that diamonds now surround the four groups; the upper and lower points of these diamonds display the confidence interval on the group means, and the center line of each diamond is the group mean. The horizontal lines at the top and bottom of the diamonds are overlap marks that can be used to interpret statistical significance between groups.

Below the plot, between X and Y as seen in Figure 8.3, is the Summary of Fit table. This table presents R2, the adjusted R2, the root mean squared error, the overall data mean, and the number of observations. For ANOVA, R2 is the proportion of variance in Y that is explained by the X variable; it is computed as the ratio of the model sum of squares over the total sum of squares, or R2 = SSModel/SSTotal (Kutner, Nachtsheim, Neter, & Li, 2005). R2 values range from 0 (the model is no better than the mean) to 1 if the model perfectly accounts for all variation with no error. Adj Rsquare is the adjusted R2, where the R2 value is adjusted to account for the number of features in the model, since R2 can be artificially inflated when there are too many variables in a model.

Figure 8.4 ANOVA Output for GSH by Group

Below the Summary of Fit table is the Analysis of Variance table. For this example problem, in Figure 8.4 you can find both sums of squares in the Analysis of Variance table, with the model sum of squares (the sum of squares due to formulation) being 1575.30 and the total sum of squares being 1846.02. The ratio is thus 0.85; this ANOVA model therefore explains 85% of the variance in the data, so most of the variation is accounted for by the model. The final table, Means for Oneway Anova, presents the mean of each group along with the upper and lower 95% confidence limits on each group mean. Here, it is possible to dig deeper than the ANOVA table and compare individual groups for overlapping confidence intervals. Whereas the ANOVA table reports only whether at least one group differs from the others, here you can see which groups are statistically similar or different. For this example data, Dose 1 and Dose 2 are statistically similar since their confidence intervals overlap greatly, although neither interval contains the other group’s mean. Control and Diabetic are statistically different from each other, and both of these groups are statistically different from both Dose 1 and Dose 2. Of note, the standard error (Std Error) is the same for each group because there is an equal number of observations per group (Montgomery, 1991).
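The group confidence intervals in the Means for Oneway Anova table are based on the pooled error from the ANOVA, which is why the standard errors are equal when the group sizes are equal. The sketch below illustrates that computation under that assumption (pooled mean squared error and a t quantile on the error degrees of freedom), again using the hypothetical CSV export rather than the JMP table itself.

```python
# Sketch: group means with 95% confidence intervals built from the pooled
# ANOVA mean squared error, mirroring the Means for Oneway Anova table.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from scipy import stats

df = pd.read_csv("RatHesperidin.csv")          # hypothetical export of the JMP table
fit = smf.ols("GSH ~ C(Group)", data=df).fit()

mse = anova_lm(fit).loc["Residual", "mean_sq"] # pooled error variance
df_error = fit.df_resid                        # N minus the number of groups

for name, grp in df.groupby("Group"):
    mean = grp["GSH"].mean()
    se = (mse / len(grp)) ** 0.5               # identical across equal-sized groups
    half_width = stats.t.ppf(0.975, df_error) * se
    print(f"{name}: mean={mean:.2f}, "
          f"95% CI=({mean - half_width:.2f}, {mean + half_width:.2f})")
```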

8.2.2 Means Comparison, LSD Matrix, and Connecting Letters
To further analyze this data, it is possible to perform statistical tests on the group means. This topic will consider t-tests. However, additional tests are available in JMP (e.g., Tukey, Hsu, and Dunnett’s). To perform these tests:

1. Click on the red triangle next to Oneway Analysis of GSH By Group.
2. Select Compare Means ► Each Pair, Student’s t.
You are presented with the results shown in Figure 8.5. The comparison circles plot is the first addition to the report (Figure 8.5) and is interpreted graphically as shown in Figure 8.6. Essentially, this approach puts all of the confidence intervals on one axis, with each group represented by a circle. Circles that overlap a lot, more than 90° between them, do not have statistically significant group mean differences; circles with some overlap, approximately 90° between them, are possibly statistically significantly different; and circles with little to no overlap, less than 90° between them, have statistically significant group mean differences.

JMP colors the p-values (Prob > F) differently, depending on their magnitude. The p-values again relate to the hypothesis test that a particular feature explains the response. Notice also that p-values above 5% are in black (these are notionally nonsignificant features), p-values between 1% and 5% are orange (indicating that these are significant), and p-values below 1% are in red (indicating that these are very significant). In interpreting Figure 8.29 and deciding whether removing non-significant features is warranted, you have two issues: the third-order interaction between Exercised/Sedentary, Percent, and Diet is highly significant, and little practical difference exists between a p-value of 4% and one of 6% (Halsey, Curran-Everett, Vowler, & Drummond, 2015). Thus, Exercised/Sedentary and Exercised/Sedentary by Percent can be viewed as statistically significant. In addition, only Diet by itself is not statistically significant; however, it is necessary to retain Diet in the model because it is involved in higher-order interactions. Because the significant higher-order interaction must be included, the final model is the one described in Figures 8.28 and 8.29.
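The hierarchy argument above (keep Diet because it appears in a significant higher-order interaction) can be illustrated outside JMP as well. The sketch below fits a three-factor model with all interactions using statsmodels; the file name, response name, and column spellings are assumed stand-ins for the BoivinMouse table, and the sums-of-squares type used for the effect tests may differ from JMP’s, so the p-values should be read only qualitatively.

```python
# Sketch: a full-factorial three-factor model, analogous to the
# Exercised/Sedentary by Percent by Diet discussion. Column and file names
# are hypothetical stand-ins (e.g., the slash in "Exercised/Sedentary" is
# assumed to have been renamed to an underscore on export).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("BoivinMouse.csv")
model = smf.ols(
    "Response ~ C(Exercised_Sedentary) * C(Percent) * C(Diet)", data=df
).fit()

# Effect tests: even if C(Diet) alone has a large p-value, it stays in the
# model because the three-way interaction term is significant.
print(anova_lm(model, typ=2))
```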

8.4.3 Feature Selection and Parsimonious Models
The BoivinMouse data was one data set where all variables and interactions were either significant or necessary to retain because they were involved in significant higher-order interactions. Frequently, however, more variables are collected than are necessary to explain the data. Thus, feature selection is important both to avoid retaining features that do not add to the model (which reduces the adjusted R2) and to aid in explaining and interpreting the model. This example will consider the data in Barley_Moisture.jmp, a fabricated data set created using the general relationships seen in several studies (Brookes, Lovett, & MacWilliam, 1976; Molina-Cano, et al., 1995). In Barley_Moisture, six variables are considered: Cultivar (0 or 1 for two hypothetical cultivars), Site (1 or 2 for two hypothetical sites), Year (A or B for the two years being considered), Replicate (which replicate a given observation belongs to, with two replications for each design point), Steeping Time (0 to 40 hours in increments of 10 hours), and Moisture Content (the response, barley moisture content in percent). To analyze this data, create an ANOVA model as in Section 8.4.2, repeated below; however, there is one additional step: selecting Keep dialog open.

1. Open Barley_Moisture.jmp.
2. Select Analyze ► Fit Model.
3. Select Moisture Content.
4. Select Y.
5. Highlight all other columns (e.g., hold down the CTRL key while clicking on them).
6. Select Add in the Construct Model Effects area.
   a. This example will not consider interactions since there are five factors and thus many possible interactions.
7. Click Keep dialog open.
   a. This will keep the Fit Model dialog box open so that it is possible to quickly return to it and make changes for a new analysis.
8. Click Run.
The resultant ANOVA table is the combination of the Analysis of Variance table and the Effects Tests shown in Figure 8.30, as discussed in Section 8.4.2. Here, notice that the model is significant and, from the R2, that it explains most of the variance in the data; however, there are three insignificant factors: Cultivar, Site, and Replicate.
Figure 8.30 ANOVA Table and Effects Test for Barley Moisture Data
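A main-effects-only model like the one built in the steps above can be reproduced outside JMP as a cross-check. The sketch below is one way to do it with statsmodels; the CSV file is a hypothetical export of the JMP table, and the exact column spellings are assumptions based on the description in the text.

```python
# Sketch: main-effects ANOVA model for the barley moisture data (no
# interactions, matching step 6a above), using a hypothetical CSV export.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("Barley_Moisture.csv")
model = smf.ols(
    "Q('Moisture Content') ~ C(Cultivar) + C(Site) + C(Year)"
    " + C(Replicate) + Q('Steeping Time')",
    data=df,
).fit()

print("R-squared:", model.rsquared)
print(anova_lm(model, typ=2))   # large p-values flag candidates for removal
```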

To avoid adding noise, it is necessary to remove one variable at a time and then redevelop the model. Logically, two approaches are viable: remove the variable with the largest p-value (the least significant variable), or use experimental knowledge to be somewhat selective. In addition, since p-values can all be listed as