Statistical Modelling in GLIM4, 2nd edition. ISBN 9780191545702, 9780198524137

This new edition of the successful multi-disciplinary text Statistical Modelling in GLIM takes into account new developments ...


English. 30 pages. 2005.


Contents

1 Introducing GLIM4
  1.1 Statistical packages and statistical modelling
  1.2 Getting started in GLIM4
  1.3 Reading data into GLIM
  1.4 Assignment and data generation
  1.5 Displaying data
  1.6 Data structures and the workspace
  1.7 Transformations and data modification
  1.8 Functions and suffixing
      1.8.1 Structure functions
      1.8.2 Mathematical functions
      1.8.3 Logical functions and operators
      1.8.4 Statistical functions
      1.8.5 Random numbers
      1.8.6 Suffixes in calculate expressions
      1.8.7 Recoding variates and factors into new factors
      1.8.8 Extracting subsets of data
  1.9 Graphical facilities
  1.10 Macros and text handling
  1.11 Sorting and tabulation

2 Statistical modelling and inference
  2.1 Statistical models
  2.2 Types of variables
  2.3 Population models
  2.4 Random sampling
  2.5 The likelihood function
  2.6 Inference for single parameter models
      2.6.1 Comparing two simple hypotheses
      2.6.2 Information about a single parameter
      2.6.3 Comparing a simple null hypothesis and a composite alternative
  2.7 Inference with nuisance parameters
      2.7.1 Profile likelihoods
      2.7.2 Marginal likelihood for the variance
      2.7.3 Likelihood normalizing transformations
      2.7.4 Alternative test procedures
      2.7.5 Binomial model
      2.7.6 Hypergeometric sampling from finite populations
  2.8 The effect of the sample design on inference
  2.9 The exponential family
      2.9.1 Mean and variance
      2.9.2 Generalized linear models
      2.9.3 Maximum likelihood fitting of the GLM
      2.9.4 Model comparisons through maximized likelihoods
  2.10 Likelihood inference without models
      2.10.1 Likelihoods for percentiles
      2.10.2 Empirical likelihood

3 Regression and analysis of variance
  3.1 An example
  3.2 Strategies for model simplification
  3.3 Stratified, weighted and clustered samples
  3.4 Model criticism
      3.4.1 Mis-specification of the probability distribution
      3.4.2 Mis-specification of the link function
      3.4.3 The occurrence of aberrant and influential observations
      3.4.4 Mis-specification of the systematic part of the model
  3.5 The Box–Cox transformation family
  3.6 Modelling and background information
  3.7 Link functions and transformations
  3.8 Regression models for prediction
  3.9 Model choice and mean square prediction error
  3.10 Model selection through cross-validation
  3.11 Reduction of complex regression models
  3.12 Sensitivity of the Box–Cox transformation
  3.13 The use of regression models for calibration
  3.14 Measurement error in the explanatory variables
  3.15 Factorial designs
  3.16 Unbalanced cross-classifications
      3.16.1 The Bennett hostility data
      3.16.2 ANOVA of the cross-classification
      3.16.3 Regression analysis of the cross-classification
      3.16.4 Statistical package treatments of cross-classifications
  3.17 Missing data
  3.18 Approximate methods for missing data
  3.19 Modelling of variance heterogeneity
      3.19.1 Poison example
      3.19.2 Tree example

4 Binary response data
  4.1 Binary responses
  4.2 Transformations and link functions
      4.2.1 Profile likelihoods for functions of parameters
  4.3 Model criticism
      4.3.1 Mis-specification of the probability distribution
      4.3.2 Mis-specification of the link function
      4.3.3 The occurrence of aberrant and influential observations
  4.4 Binary data with continuous covariates
  4.5 Contingency table construction from binary data
  4.6 The prediction of binary outcomes
  4.7 Profile and conditional likelihoods in 2 × 2 tables
  4.8 Three-dimensional contingency tables with a binary response
      4.8.1 Prenatal care and infant mortality
      4.8.2 Coronary heart disease
  4.9 Multidimensional contingency tables with a binary response

5 Multinomial and Poisson response data
  5.1 The Poisson distribution
  5.2 Cross-classified counts
  5.3 Multicategory responses
  5.4 Multinomial logit model
  5.5 The Poisson-multinomial relation
  5.6 Fitting the multinomial logit model
  5.7 Ordered response categories
      5.7.1 Common slopes for the regressions
      5.7.2 Linear trend over response categories
      5.7.3 Proportional slopes
      5.7.4 The continuation ratio model
      5.7.5 Other models
  5.8 An example
      5.8.1 Multinomial logit model
      5.8.2 Continuation-ratio model
  5.9 Structured multinomial responses
      5.9.1 Independent outcomes
      5.9.2 Correlated outcomes

6 Survival data
  6.1 Introduction
  6.2 The exponential distribution
  6.3 Fitting the exponential distribution
  6.4 Model criticism
  6.5 Comparison with the normal family
  6.6 Censoring
  6.7 Likelihood function for censored observations
  6.8 Probability plotting with censored data: the Kaplan–Meier estimator
  6.9 The gamma distribution
      6.9.1 Maximum likelihood with uncensored data
      6.9.2 Maximum likelihood with censored data
      6.9.3 Double modelling
  6.10 The Weibull distribution
  6.11 Maximum likelihood fitting of the Weibull distribution
  6.12 The extreme value distribution
  6.13 The reversed extreme value distribution
  6.14 Survivor function plotting for the Weibull and extreme value distributions
  6.15 The Cox proportional hazards model and the piecewise exponential distribution
  6.16 Maximum likelihood fitting of the piecewise exponential distribution
  6.17 Examples
  6.18 The logistic and log–logistic distributions
  6.19 The normal and lognormal distributions
  6.20 Evaluating the proportional hazard assumption
  6.21 Competing risks
  6.22 Time-dependent explanatory variables
  6.23 Discrete time models

7 Finite mixture models
  7.1 Introduction
  7.2 Example—girl birthweights
  7.3 Finite mixtures of distributions
  7.4 Maximum likelihood in finite mixtures
  7.5 Standard errors
  7.6 Testing for the number of components
      7.6.1 Example
  7.7 Likelihood "spikes"
  7.8 Galaxy data
  7.9 Kernel density estimates

8 Random effect models
  8.1 Overdispersion
      8.1.1 Testing for overdispersion
  8.2 Conjugate random effects
      8.2.1 Normal kernel: the t-distribution
      8.2.2 Poisson kernel: the negative binomial distribution
      8.2.3 Binomial kernel: beta-binomial distribution
      8.2.4 Gamma kernel
      8.2.5 Difficulties with the conjugate approach
  8.3 Normal random effects
      8.3.1 Predicting from the normal random effect model
  8.4 Gaussian quadrature examples
      8.4.1 Overdispersion model fitting
      8.4.2 Poisson—the fabric fault data
      8.4.3 Binomial—the toxoplasmosis data
  8.5 Other specified random effect distributions
  8.6 Arbitrary random effects
  8.7 Examples
      8.7.1 The fabric fault data
      8.7.2 The toxoplasmosis data
      8.7.3 Leukaemia remission data
      8.7.4 The Brownlee stack-loss data
  8.8 Random coefficient regression models
      8.8.1 Example—the fabric fault data
  8.9 Algorithms for mixture fitting
  8.10 Modelling the mixing probabilities
  8.11 Mixtures of mixtures

9 Variance component models
  9.1 Models with shared random effects
  9.2 The normal/normal model
  9.3 Exponential family two-level models
  9.4 Other approaches
  9.5 NPML estimation of the masses and mass-points
  9.6 Random coefficient models
  9.7 Variance component model fitting
      9.7.1 Children's height development
      9.7.2 School effects on pupil performance
      9.7.3 Multi-centre trial of beta-blockers
      9.7.4 Longitudinal study of obesity
  9.8 Autoregressive random effect models
  9.9 Latent variable models
      9.9.1 The normal factor model
  9.10 IRT models
      9.10.1 The Rasch model
      9.10.2 The two-parameter model
      9.10.3 The three-parameter logit (3PL) model
      9.10.4 Example—The Law School Aptitude Test (LSAT)
      9.10.5 Quadratic regressions
  9.11 Spatial dependence
  9.12 Multivariate correlated responses
  9.13 Discreteness of the NPML estimate

References
Dataset Index
Macros Index
Subject Index

1 Introducing GLIM4

1.1 Statistical packages and statistical modelling

Statistical modelling is an iterative process. We have a research question of interest, the experimental or observational context, the sampling scheme and the data. Based on these data and the other information, we develop an initial model, fit it to the data and examine the model output. We then plot the residuals, check the model assumptions and examine the parameter estimates. Based on this model output, we may modify the first model, possibly through some transformation or the addition or subtraction of model terms. Fitting the second model may in turn lead to further models for the data, with the form of the current model being based on the information provided by the previous models. This interactivity between the data and the data modeller is a crucial feature of the process of statistical modelling.

While automatic model selection procedures provided in many software packages can provide guidance in large problems, we take a different route in this book. We focus on generalized linear models and extensions of these models, which can be fitted using an iteratively reweighted least squares (IRLS) algorithm. The computational extensions which we cover here (into such areas as survival analysis and random effects models) are developed by building on the standard IRLS algorithm, incorporating it into more complex procedures or macros.

There are numerous interactive and programmable statistical packages on the market which can be used to fit generalized linear models. In no particular order, these include R, S-PLUS, GENSTAT, GLIM, SAS, STATA and XLISPSTAT, and, with appropriate procedure development, all could be used to fit the models introduced in this book. All can be used interactively and allow the user to develop models in the way outlined above. In addition, there are other packages such as SPSS and MINITAB which can fit specific types of generalized linear model, although not in a unified framework.

In this book we have chosen to use GLIM, which is a compact program with a single weighted least squares algorithm for generalized linear models. We have written this book around GLIM in preference to the other packages mentioned for two reasons. The first is historical; the first edition of this book used GLIM, and two of the authors of this book helped to develop the current version of GLIM. Second, while other packages such as R now provide most of the facilities of GLIM, the command set of GLIM is compact and can be learnt relatively quickly. We therefore begin this book by providing an introduction to the main facilities of GLIM4 (Francis et al., 1993), the most recent release of GLIM.

GLIM is a command driven package. Commands for data manipulation, data transformation, data display and model fitting may be entered, with a few restrictions, in any order. In addition, sequences of frequently used commands (known in GLIM as macros) may be named and saved for later use, providing a basic facility for users to define their own procedures.

We have assumed that GLIM is being used interactively. GLIM can also be run in batch mode, that is, by submitting a file of previously assembled commands. Batch use of the package may be appropriate for time-consuming tasks such as the fitting of complex models to large sets of data, but GLIM is primarily designed for interactive use.

1.2 Getting started in GLIM4

On loading GLIM, an introductory message appears:

GLIM 4, update 9 for Microsoft Windows Intel on date at time
(copyright) 1992 Royal Statistical Society, London
?

The question mark which follows the message is the GLIM prompt, which tells you that GLIM is waiting to receive instructions or commands from you. We use the convention of representing input to and output from GLIM in typewriter typefaces, to distinguish them from ordinary text.

Instructions are issued to GLIM by using directives. Directives all begin with a dollar ($). (If the dollar symbol is not available on your keyboard, try another currency symbol such as £.) The dollar sign provides the means by which GLIM recognizes directive names. Notice that commands are terminated by a $—the terminating $ can be the start of the next directive. All GLIM directive names can be abbreviated—always to the first four characters, and sometimes further. For example, $print %pi $ will print the value of π; $print can be abbreviated to $pri and also to $pr. Unlike many other packages, directives do not need to start on a new-line; there can be many directives on one line. If we omit the terminating dollar, then GLIM responds with a prompt, but preceding the prompt symbol is the abbreviated directive name.

$PRI ?

GLIM is indicating that the end of the $print directive has not been reached. Nothing has been printed, as the list of items following the $print directive has not been terminated by a dollar. In response to this prompt, we can either enter a dollar (to indicate the end of the list of items to be printed) or enter further items.
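For example, a pending $print is completed by entering the terminating dollar at the continuation prompt. A sketch of the exchange (the exact prompt text may vary by installation):

$pri %pi
$PRI ? $
3.142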


An on-line help system exists in GLIM through the $manual $ directive. So, for example,

$manual print $

provides help on the $print directive. The $manual $ directive on its own provides a general help screen giving various help options.

1.3 Reading data into GLIM

We start by introducing a simple set of data. Table 1.1 presents a small set of data, given in Erickson and Nosanchuk (1979), extracted from a larger set collected by Atkinson and Polivy (1976) to examine the effects of unprovoked verbal attack. Ten male and nine female subjects were asked to fill out a questionnaire which mixed innocuous questions with questions attempting to assess the subject's self-reported hostility. A hostility score for each individual was calculated from these responses. After completing the questionnaire, the subjects were then left waiting for a long time, and were subjected to insults and verbal abuse by the experimenter when the questionnaire was eventually collected. All subjects were told that they had filled out the questionnaire incorrectly, and were instructed to fill it out again. A second hostility score was then calculated from these later responses.

For each of the 19 individuals involved in this psychological experiment we have three items of information, or variables: (1) the hostility score of the individual before being insulted; (2) the hostility score of the individual after being insulted; (3) the sex of the individual.

One of the variables (the sex of the individual) is a categorical variable with only two values: male and female. The remaining two variables are continuous—that is, variables which are measured on an underlying continuous scale to a certain measurement accuracy (Section 2.2 contains a further discussion of types of variables).

Table 1.1. The insult data set

   Hostility          Sex          Hostility          Sex
   Before   After                  Before   After
     51       58     Female          86       82     Male
     54       65     Female          28       37     Male
     61       86     Female          45       51     Male
     54       77     Female          59       56     Male
     49       74     Female          49       53     Male
     54       59     Female          56       90     Male
     46       46     Female          69       80     Male
     47       50     Female          51       71     Male
     43       37     Female          74       88     Male
                                     42       43     Male

The data are arranged in a data matrix, with the rows of the matrix representing individuals, units or cases and the columns representing the variables recorded for the individuals. There are no missing data; we have complete information on all the variables for each individual.

We define and read this set of data into GLIM by setting up three variates to hold the data columns. Any variate, once defined, has a fixed length (that is, it can hold a fixed number of data values). A variate also has a name or identifier, and this name may be chosen by the user. The data values stored in a variate must be numerical; GLIM cannot read non-numeric characters as data. The two continuous variables will cause no problem as their values are numeric, but the categorical variable will need to be coded, associating a different number with each category. There are many ways of coding categorical values. For the categorical variable, we could code males as 0 and females as 1, or females as −1 and males as 1: any two different numbers will do. Here, we code females as 1 and males as 2; the reason for using 1 and 2 will become clear later in this section.

We need to issue directives to GLIM to define the length and names of the three vectors that we wish to define. Once this is done, we need to instruct GLIM to read the data values into these vectors. The directives to do this are as follows:

$slength 19 $data hbefore hafter sex $read
51 58 1   54 65 1   61 86 1   54 77 1
49 74 1   54 59 1   46 46 1   47 50 1
43 37 1   86 82 2   28 37 2   45 51 2
59 56 2   49 53 2   56 90 2   69 80 2
51 71 2   74 88 2   42 43 2   $

The directive $slength defines the standard or default length of all variates—this does not, however, prohibit other vectors being defined with lengths other than the standard length. The $data directive has two functions—it defines the names of the three variates, and tells GLIM that data values following the $read directive should be assigned to these variates. The three variates are called hbefore, hafter and sex. Variate names (and names for all user structures in GLIM) start with a letter and can have up to eight characters; truncation to eight characters occurs if a name is longer. In addition, names are not case-sensitive—HBEFORE is the same as hbefore.

The reading of the data values is initiated by the $read directive. In our example, the $read directive name will be followed by 57 numbers; the data values are read in parallel, one case at a time. For each case, the data values for the vectors in the data list are read in order. Thus, the first data value of hbefore is followed by the first data value of hafter, the first value of sex, the second value of hbefore, and so on. We order the data with the female cases first and the male cases second. Spaces, tabs, commas or new-lines separate the data values, and there is no need to start a new-line for a new case. Note that we do not enter a dollar immediately after the $read directive, as it takes as items the data values of the variates being read in. The terminating dollar follows the last data value.

For the fitting of statistical models and certain other applications, GLIM makes a distinction between two kinds of vectors: variates and factors. All vectors are initially defined as variates; vectors representing categorical variables may be explicitly redefined as factors by using the $factor directive. For a categorical variable with k categories to be represented as a factor, the categories must be coded using the integers 1, 2, ..., k. We declare sex to be a factor:

$factor sex 2 $

The $factor directive defines a factor sex with two values (1 and 2), referred to in GLIM as levels. Here, we have coded female subjects as level 1 and male subjects as level 2.

We have so far assumed that the data have been typed in interactively, and this method is obviously suitable only for small data sets. In most applications, the data would be entered into a file and stored on disk, allowing the dataset to be edited. The simplest way of inputting data from a file is to use the $input directive. This simply reads a sequence of GLIM directives stored in a file into GLIM. So a file insult might contain the above sequence of directives for reading the insult data, with the addition of a $return directive as the last directive in the file. This file would be read into GLIM by

$input 'insult' $

The input file may contain data, or more generally can contain any sequence of GLIM directives terminated with a $return directive. This is useful for developing code and running it interactively. To view the GLIM code on the screen as it is being read in, we precede the $input command by $echo on $. This will echo commands and data from the file to the screen.
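As a sketch, the file insult might then contain exactly the directives given above, followed by $return:

$slength 19 $data hbefore hafter sex $read
51 58 1   54 65 1   61 86 1   54 77 1
49 74 1   54 59 1   46 46 1   47 50 1
43 37 1   86 82 2   28 37 2   45 51 2
59 56 2   49 53 2   56 90 2   69 80 2
51 71 2   74 88 2   42 43 2   $
$factor sex 2 $
$return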

1.4 Assignment and data generation

An alternative to the $read directive is the $assign directive, which is used to assign data values to a variate. $assign takes the length of the variate from the number of data values entered, and this is often a more flexible way of inputting data. If the variate already exists, then its contents are overwritten and its length reset.

$assign hbefore = 51,54,61,54,49,54,46,47,43,86,28,45,
                  59,49,56,69,51,74,42 $
$assign hafter  = 58,65,86,77,74,59,46,50,37,82,37,51,
                  56,53,90,80,71,88,43 $
$assign sex = 1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2 $
$factor sex 2 $


Here, as before, we have declared the vector identifier sex to be a factor after the values have been assigned. sex is declared by default to be a variate by the $assign directive, then redeclared as a factor by the $factor directive.

The $assign directive is in fact a general command to assign values and the contents of variates to a variate. On the left of the equals sign (=) is the destination variate; on the right-hand side are elements separated by commas or spaces; the elements can be lists of source variates, numbers and number sequences. The destination variate can be the same as one of the source variates, allowing the length of variates to be changed. For example,

$assign x = 1,4 $

will define a variate called x of length 2;

$assign x = x,3,6,x $

will increase the length of x to 6; x will now contain the values 1, 4, 3, 6, 1, 4.

Number sequences are available, and take the form value1,value2,...,value3. The increment is taken to be the difference between value1 and value2. So, for example,

$assign z = 1,4,...,23 $

will assign the values 1, 4, 7, 10, 13, 16, 19, 22, 23 to z, whereas

$assign w = 1,...,6,x $

will assign the values 1, 2, 3, 4, 5, 6, 1, 4, 3, 6, 1, 4 to w. Negative increments are permitted.

Factors which have a regular structure can be generated automatically using the $gfactor or ‘generate factor’ directive. So, for example, we might have a table of counts as follows:

        B        1                  2
        C    1    2    3        1    2    3
   A
   1        10    0    0        2    1    0
   2         4    6    2        3    7    1
   3         0    3    8        2    4    8
   4         3    4   11        1    7   12


We read the values of the table into a variate called count, and then use the $gfactor directive to generate the cross-classifying factors.

$assign count = 10,0,0,2,1,0, 4,6,2,3,7,1,
                0,3,8,2,4,8, 3,4,11,1,7,12 $
$gfactor 24 a 4 b 2 c 3 $
$tprint count a,b,c $

       B        1                          2
       C        1      2      3            1      2      3
  A
  1        10.000  0.000  0.000        2.000  1.000  0.000
  2         4.000  6.000  2.000        3.000  7.000  1.000
  3         0.000  3.000  8.000        2.000  4.000  8.000
  4         3.000  4.000 11.000        1.000  7.000 12.000

The variate count has length 24, so in the $gfactor directive we first specify a length of 24 for the three factors a, b and c which are to be created. The factor varying least rapidly, a, is specified first, followed by the desired number of levels of a. This will assign the first six values of a to be 1, the next six to be 2, the following six to be 3, and the final six to be 4. The factor varying next least rapidly is b, which has two levels. The factor b will vary within a, and so the first three values of b will be 1, the next three 2, and then the values of b repeat. Finally, the factor varying most rapidly is c with three levels—this will take the values 1, 2, 3, 1, 2, 3, ..., etc. The $tprint or tabular print directive provides a way of examining the data in tabular form—the factors a, b and c can be rearranged into any order.
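The same factors can also be built by calculation. A minimal sketch, assuming your version of GLIM provides the generate-levels function %gl(k,n), which cycles through the levels 1 to k in blocks of n:

$slen 24 $calc a = %gl(4,6) : b = %gl(2,3) : c = %gl(3,1) $
$factor a 4 b 2 c 3 $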

1.5 Displaying data

We now return to the $print directive, which was introduced in Section 1.2. This directive can also be used to display the contents of variates and factors, and indeed of any structure in GLIM. $print hbefore $ 51.00 86.00 42.00

54.00 28.00

61.00 45.00

54.00 59.00

49.00 49.00

54.00 56.00

46.00 69.00

47.00 51.00

43.00 74.00

The data values are printed across the screen, to an accuracy of four significant figures and with nine data values on each line. The accuracy of four significant figures is the default; this can be altered by introducing the * symbol followed by a positive number in the range 1–9. For example, *6 will give six significant figures. In addition, *-1 specifies that no decimal places are to be printed. This setting holds for the duration of the $print directive—global accuracy settings can be specified using the $accuracy directive.

More than one variate (or factor) may be specified in a $print directive, but care must be taken, as the values are printed serially and are concatenated; that is, all the values of the first variate are printed first, followed immediately by the values of the next vector in the list, and so on. A new line is not started for each new variate. It is thus often hard to tell from the output where one vector ends and another begins. We can circumvent this by using separate $print statements, as the directive always starts a new-line.

$print hbefore $print hafter $print sex $

This is rather tedious to type; we can improve the above code by using a colon (:) to repeat the previous directive.

$print hbefore : hafter : sex $

To display the values of vectors in parallel, the $look directive can be used.

$look hbefore hafter sex $

The values of the three vectors are printed in columns, with each row corresponding to one unit. The unit number is displayed as an additional column on the left. Subsets of the vector values may be displayed in parallel by specifying the lower and upper unit numbers of the range of units immediately following the directive name. For example,

$look 5 8 hbefore hafter sex $

displays the values of the three variables for cases 5 to 8 only, and

$look 15 hbefore hafter sex $

displays the values of the three variables for case 15.

1.6 Data structures and the workspace

So far, we have seen that GLIM has facilities for dealing with vectors, which are initially declared as variates, and which can be redeclared as factors. We now describe the other data structures available in GLIM—scalars, lists and macros. We will also highlight the difference between user structures and system structures, the latter having pre-assigned names starting with the % symbol.

Scalars are used to store single numeric values, and are declared using the $number directive (not the $scale directive, which has a special meaning in model fitting). Thus

$number nseasons = 4 $


declares the scalar nseasons and assigns the value 4 to it. There are also a number of system-defined structures available in GLIM. First, there are 26 ordinary scalars with names %a, %b up to %z. These are available for use as scalar storage without the need to declare them. Second, there are 74 system scalars, whose names also begin with a % symbol and which are pre-assigned. We have already seen one of these—the system scalar %pi, which contains the value of the mathematical constant π. These system scalars will be introduced and described where necessary—a full list can be obtained through the command $manual scalars $.

Similarly, vectors in GLIM are either user- or system-defined. System vectors are mainly defined after a model fit—for example, there are system vectors %fv (which contains the fitted values from the model) and %rs (which contains the residuals). These are discussed in detail in later chapters—a full list can be obtained through the command $manual vectors $.

Lists in GLIM are used to name ordered lists of names. So

$list mylist = hbefore,hafter,nseasons $

will define a list mylist with three elements—the first two being variates and the last a scalar. Lists are used primarily in GLIM macros.

Finally, a macro is an additional type of GLIM structure which is used for storing text or for storing sequences of GLIM directives. We mention macros only briefly in this section as they are not normally used to store data values—Section 1.10 contains a more detailed discussion.

The current contents of the workspace can be discovered through the directive $environment d $, which provides information on all user-defined structures and system vectors existing in the workspace. GLIM has upper limits on both the number of user identifiers which may be defined and on the amount of workspace which may be used. These limits vary according to the size of the GLIM program which is being run, and the directive $environment u $ may be used to find out these limits. For some large problems, the standard version of GLIM is not large enough, and a larger version is needed—this is achieved through the use of the SIZE parameter when loading GLIM.

The $delete directive saves space by deleting user-defined structures from the workspace. So

$delete mylist $

will delete the list structure mylist but will leave the structures hbefore, hafter and nseasons untouched. These can be deleted individually if required.

There are two further directives which are useful for managing a GLIM session. The $newjob directive clears the workspace of all defined structures, and will also reset all internal settings back to the GLIM default values. It is used for starting a new piece of work or job while still remaining within the same GLIM session. The GLIM session is completely ended by using $stop.

Output from the GLIM session will be stored in the file glim.log, which is in the GLIM working directory. In addition, all commands issued by the user are stored in the journal file glim.jou, which is found in the same directory. As these files are overwritten at the start of a new GLIM session, we strongly recommend that they are renamed or moved to another location immediately after finishing a GLIM session if they are to be saved.

1.7 Transformations and data modification

We now describe the directives which are used for general calculation, transformation and modification of variates. The $calculate directive provides facilities in GLIM for assigning, transforming and recoding GLIM variates. However, the simplest use of the $calculate directive (which we abbreviate to $calc) is as a calculator. The GLIM command

$calc 5 + 6 $

returns the value 11.00; the command

$calc 29*9/5 + 32 $

(used to convert 29°C to °F) returns 84.20. The symbols +, -, * and /, when used in arithmetic expressions, are known as arithmetic operators, and represent addition, subtraction, multiplication and division, respectively. Multiplication and division have a higher precedence than addition and subtraction, and parentheses may be used to force evaluation of some expressions before others:

$calc (84 - 32)*5/9 + 32 $

forces evaluation of 84 − 32 first of all.
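To see the effect of precedence directly, compare the following pair (a sketch; the values follow from the arithmetic, shown to the default accuracy):

$calc 84 - 32*5/9 $
$calc (84 - 32)*5/9 $

The first returns 66.22, since * and / bind more tightly than -; the second returns 28.89, the parentheses forcing the subtraction to be evaluated first.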

The results of calculations, instead of being displayed on the screen, may be assigned to a scalar or vector. The equals sign is used to assign the result of a calculation to a GLIM data structure. We first declare a new standard length for vectors by using the $slength directive (Section 1.3).

$slen 8 $calc %d = 5 $calc d = 5 $

The first $calc statement assigns the single number 5 to the scalar %d. The second $calc statement declares d to be a variate of length 8 (the standard length), and then assigns the number 5 to all elements of d. Transformations of existing variates can also be carried out. For example,

$assign x = 1,...,10 $calc y = 2**(x - 1) $


creates a variate y with length 10, with each element of y containing the transformation 2^(x−1) of the corresponding element of x. Assigning the result to a scalar will put the last element of the transformed x into the scalar; thus

$calc %e = 2**(x - 1) $print %e $

will display 512.0. The contents of defined vectors may be overwritten at any time by using the $calc directive. Vectors may be transformed ($calc y = (y+1)/2 $) and reassigned ($calc y = x**3 $). All previously defined vectors appearing in the same calculate expression must have the same length. Thus the expression $calc d = x/2 $ produces an error message, as the vector x has length 10, whereas d has length 8.

1.8 Functions and suffixing

A range of functions is provided in GLIM—these can be categorized into structure, mathematical, statistical, logical and other functions. A full list can be found through $manual functions $—help on individual functions is available by issuing the $manual directive followed by the function name, for example, $manual %log $.

1.8.1 Structure functions

Structure functions return the characteristics of an identifier, for example, its length (%len), number of levels (%lev) and type (%typ). So

$assign x = 1,4,...,19 $num lenx $calc lenx = %len(x) $print lenx $

will store the value 7 in the scalar lenx.

1.8.2 Mathematical functions

GLIM provides a small set of mathematical functions, including the natural log (%log), exponential (%exp), absolute value (%abs), truncation (%tr) and incomplete gamma, digamma and trigamma functions. Thus,

$assign x = -1,0,...,4 $calc lx = %log(x) $

will assign the natural log of all units of x to the equivalent units of lx. Where x = 0 or −1 and the natural log is undefined, GLIM continues with the calculation, and assigns the value 0 to all elements where the expression is undefined. A warning message is printed:

-- invalid function/operator argument(s)

This behaviour of the $calculate directive applies to all other functions and expressions that give an invalid result, such as dividing by zero, or taking the square root of a negative number.
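If the warning is a concern, the undefined units can be screened out before the function is applied. A minimal sketch, assuming your version of GLIM provides the conditional function %if(c,a,b):

$calc lx = %gt(x,0)*%log(%if(%gt(x,0),x,1)) $

The inner %if replaces non-positive units by 1 before %log is applied, so no invalid argument arises, and the multiplier %gt(x,0) then sets those units of lx to 0.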


A particularly useful function is the cumulative sum function %cu. The GLIM statement

$assign x = 1,2,...,6 $calc %y = %cu(x) $calc cux = %cu(x) $

will calculate the sum of the values of the vector x and store the result in the scalar %y. If the result is assigned to a vector, rather than a scalar, the vector will contain the partial sums of x.

$print cux $
 1.000  3.000  6.000  10.00  15.00  21.00

The function therefore provides one simple method for the calculation of sums and sums of squares of a vector. For example,

$calc %m = %cu(x)/%len(x)
    : %v = %cu((x - %m)**2)/(%len(x) - 1) $

will store the mean of the vector x in the scalar %m and its variance in the scalar %v.

The index (%ind) function gives a vector of unit or observation numbers for a variate. For example,

$calc index = %ind(hbefore) $
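As a usage sketch, the same recipe applied to the insult data of Section 1.3 reads:

$calc %m = %cu(hbefore)/%len(hbefore)
    : %v = %cu((hbefore - %m)**2)/(%len(hbefore) - 1) $
$print %m : %v $

For the 19 before-insult scores this gives a mean of about 53.58; the variance follows in the same way.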

1.8.3 Logical functions and operators

Logical functions are a particular type of function in GLIM which return only two values, zero (false) and one (true). We give a simple example.

$assign x = 13,21,11,36,15,19 $
$calc xgt20 = %gt(x,20) $
$print xgt20 $
 0.000  1.000  0.000  1.000  0.000  0.000

The function %gt is used to test whether one item is greater than the other. Equivalently, we could use the logical operator >, and the above expression can be written as $calc xgt20 = x>20 $. There are six logical functions available: equality (%eq or ==), inequality (%ne or /=), greater than or equal to (%ge or >=), greater than (%gt or >), less than or equal to (%le or <=) and less than (%lt or <).

1.11 Sorting and tabulation

Returning to the insult data, we define an indicator raised for an initial hostility score of 65 or more, and tabulate it against sex using the $tabulate directive (abbreviated $tab):

$calc raised = %ge(hbefore, 65) $
$tab (style = 1) for sex, raised $
 -- the table contains empty cell(s)
          +-----------------+
 RAISED   |  0.000   1.000  |
 SEX      |                 |
 +--------+-----------------+
 |   1    |  9.000   0.000  |
 |   2    |  7.000   3.000  |
 +--------+-----------------+

A table with four cells is produced, preceded by a warning message, as one of the counts is zero. Three of the males have a raised initial hostility score, and none of the females. Note that it is not necessary for raised to be declared as a factor, or even to be coded as a factor.

An output variate can be specified: the using output-phrase is matched to the for input-phrase.

$tab for sex, raised using cells $

The variate cells is constructed to be of length 4. However, we may be unsure in which order the vector is formed (in fact, the first two values of cells correspond to low and raised scores for females)—and usually we need vectors to cross-classify or label the output counts. We may achieve this by using the by output-phrase to contain a list of new vectors which will contain classification labels for the table. These vectors, which we name tsex and traised, will also be of length 4.

$tab for sex, raised using cells by tsex,traised $
$look cells tsex traised $
          cells    tsex  traised
    1     9.000   1.000    0.000
    2     0.000   1.000    1.000
    3     7.000   2.000    0.000
    4     3.000   2.000    1.000

We see that the vector tsex contains the values 1 and 2, whereas traised contains the values 0 and 1. As sex is a factor, tsex is automatically declared as a factor; traised, however, is declared as a variate. The types of the vectors specified in the for phrase are transferred to the output vectors specified in the by phrase.

We may use the $tprint directive to print a more informative version of the above table. We specify a style of 1 in the option list.

$tprint (style = 1) cells tsex,traised $


Combinations of input phrases may be used to produce tables of summary statistics. For example, to produce a table of means of the variate hafter rather than cell counts, we use

$tab the hafter mean for sex,raised $

Output phrases may additionally be used to store the vector of means, the associated vector of counts and the cross-classifying vectors. All the output vectors will again be of length 4. We may then use $tprint to print both the cell means of hafter and the associated cell counts in one table.

$tab the hafter mean for sex,raised into tmean by tsex,traised $
$tprint (style = 1) tmean,cells tsex,traised $
              +-----------------+
  TRAISED     |  0.000   1.000  |
  TSEX        |                 |
  +-----------+-----------------+
  |  1  TMEAN |  61.33    0.00  |
  |     CELLS |  9.000   0.000  |
  |-----------+-----------------|
  |  2  TMEAN |  57.29   83.33  |
  |     CELLS |  7.000   3.000  |
  +-----------+-----------------+

Thus, the $tabulate directive provides a flexible way of summarizing individual level data into aggregate data suitable for subsequent analysis. It is used extensively in Chapters 4 and 5 to aggregate individual level data into contingency tables and to collapse existing contingency tables.
